arXiv:1606.01614v4 [cs.CL] 17 Apr 2017



Adversarial Deep Averaging Networks for Cross-Lingual Sentiment Classification

Xilun Chen† [email protected]

Yu Sun† [email protected]

Ben Athiwaratkun‡ [email protected]

Claire Cardie† [email protected]

Kilian Weinberger† [email protected]

†Dept. of Computer Science, Cornell University, Ithaca NY, USA
‡Dept. of Statistical Science, Cornell University, Ithaca NY, USA

Abstract

In recent years deep neural networks have achieved great success in sentiment classification for English, thanks in part to the availability of copious annotated resources. Unfortunately, most other languages do not enjoy such an abundance of annotated data for sentiment analysis. To tackle this problem, we propose an Adversarial Deep Averaging Network (ADAN) to transfer sentiment knowledge learned from labeled English data to low-resource languages where only unlabeled data exists. ADAN is a “Y-shaped” network with two discriminative branches: a sentiment classifier and an adversarial language identification scorer. Both branches take input from a shared feature extractor that aims to learn hidden representations that capture the underlying sentiment of the text and are invariant across languages. Experiments on Chinese and Arabic sentiment classification demonstrate that ADAN significantly outperforms several baselines, including a strong pipeline approach that relies on state-of-the-art Machine Translation.

1 Introduction

There has been significant progress on English sentiment analysis in recent years using models based on neural networks (Socher et al., 2013; Irsoy and Cardie, 2014a; Le and Mikolov, 2014; Tai et al., 2015; Iyyer et al., 2015). Most of these, however, rely on a massive amount of labeled training data or fine-grained annotations such as the Stanford Sentiment Treebank (Socher et al., 2013), which provides sentiment annotations for each phrase in the parse tree of every sentence.

For many other languages, however, only a limited number of sentiment annotations exist. Therefore, previous research in sentiment analysis for low-resource languages focuses on inducing sentiment lexicons (Mohammad et al., 2016b) or training linear classifiers on small domain-specific datasets with hundreds to a few thousand instances (Tan and Zhang, 2008; Lee and Renganathan, 2011). Although some prior work tries to alleviate the scarcity of sentiment annotations by leveraging labeled English data (Wan, 2008, 2009; Lu et al., 2011; Mohammad et al., 2016a), these methods rely on external knowledge such as bilingual lexicons or machine translation (MT) that is expensive to obtain. Some of these papers (Zhou et al., 2016; Wan, 2009) also require translating the entire English training set, which is prohibitively expensive when the English training data is large.

To aid the creation of sentiment classification systems in such low-resource languages, we propose a framework that leverages the abundant resources for a source language (here, English, denoted as SOURCE) to produce sentiment analysis models for a target language (TARGET). Our framework is unsupervised in the sense that it requires only unlabeled text in the target language. In particular, we propose ADAN, an end-to-end adversarial neural network (Goodfellow et al., 2014; Ganin and Lempitsky, 2015). It uses labeled data to train a sentiment classifier for the source language, and simultaneously transfers the learned sentiment analysis knowledge to the target language. Our trained system then directly operates on TARGET texts to predict their sentiment.

We hypothesize that an ideal model for cross-lingual sentiment analysis should learn features that both perform well on sentiment classification for the SOURCE, and are invariant with respect to the shift in language. Therefore, ADAN has two discriminative components: i) a sentiment classifier P for SOURCE; and ii) an adversarial language identification scorer Q that predicts a scalar indicating whether the input text x is from the SOURCE (higher score) or the TARGET (lower score).


[Figure 1 diagram: example inputs “The movie was awesome.” (Eng) and “这里服务很好。” (Chn) are mapped through bilingual word embeddings and averaged into the joint feature extractor F(x), whose output feeds the sentiment classifier P(x) (softmax over Very Positive / Positive / Neutral / Negative / Very Negative) and the adversarial language identification scorer Q(x) (En → +∞, Ch → −∞).]
Figure 1: Adversarial Deep Averaging Network with Chinese as the target language. The sentiment classifier P and the language scorer Q both take input from the feature extractor F, and are optimized to excel in their own tasks. Q has output width 1, which is interpreted as a scalar score indicating how likely a sample is from SOURCE (more in Section 2). F aims to learn features that help P while hindering the adversarial Q, in order to learn features helpful for sentiment classification and invariant across languages. Bilingual word embeddings (BWEs) will be discussed in Section 2.1.

The structure of the model is shown in Figure 1. The two classifiers take input from the shared feature extractor F, which operates on the average of the bilingual word embeddings (BWEs) for an input text from either SOURCE or TARGET.

While P and Q each learn to excel in their own task, F drives its parameters to extract hidden representations that help the sentiment prediction of P and hamper the language identification of Q. Upon successful training, the joint features (outputs of F) are thus encouraged to be both discriminative for sentiment analysis and invariant across languages. Since ADAN learns language-invariant features by preventing Q from identifying the language of a sample, Q is hence “adversarial”. The intuition is that if Q cannot tell the language of a given input sequence using the adversarially trained features, then those features from F are effectively language-invariant.

The model is exposed to both SOURCE and TARGET texts during training; SOURCE and TARGET data are passed through the language scorer, while only the labeled SOURCE data pass through the sentiment classifier. The feature extractor and the sentiment classifier are then used for TARGET texts at test time. In this manner, we can train ADAN with labeled SOURCE data and unlabeled TARGET text.

The idea of incorporating an adversary in neural networks has achieved great success in computer vision for image generation (Goodfellow et al., 2014) and domain adaptation (Ganin and Lempitsky, 2015). However, to our best knowledge, ours is the first to develop an adversarial network for language adaptation, i.e. cross-lingual NLP tasks. In addition, inspired by Arjovsky et al. (2017), we modify the traditional adversarial training method proposed by Ganin and Lempitsky (2015), providing improved performance with smoother training (Sec. 2.2).

We evaluate ADAN using English as SOURCE with both Chinese and Arabic as TARGET, and find that ADAN substantially outperforms i) train-on-source cross-lingual approaches trained using labeled SOURCE data; ii) closely related domain adaptation methods; and iii) approaches that employ powerful MT systems. We further investigate the semi-supervised setting, where a small amount of annotated TARGET data exists, and show that ADAN still beats all the baseline systems given the same amount of TARGET supervision. Finally, we provide analysis and visualization of ADAN, shedding light on how it manages to achieve its strong cross-lingual performance. Last but not least, we study a key component of ADAN, the Bilingual Word Embeddings, and demonstrate that ADAN’s performance is robust with respect to the choice of BWEs. Even with randomly initialized embeddings, ADAN outperforms some of the BWE baselines (Sec. 3.3.3).

2 The ADAN Model

2.1 Network Architecture

As illustrated in Figure 1, ADAN is a feed-forward network with two branches. There are three main components in the network: a joint feature extractor F that maps an input sequence x to the shared feature space; a sentiment classifier P that predicts the sentiment label for x given the feature representation F(x); and a language scorer Q that also takes F(x) but predicts a scalar score indicating whether x is from SOURCE or TARGET.

An input document is modeled as a sequence of words x = w1, . . . , wn, where each word w is represented by its word embedding vw (Turian et al., 2010). Because the same feature extractor F operates on both SOURCE and TARGET sentences, it is favorable if the word representations for both languages align approximately in a shared space. Thus, we employ bilingual word embeddings (BWEs) (Zou et al., 2013; Gouws et al., 2015) to induce distributed word representations that encode semantic relatedness between words across languages, so that similar words are closer in the embedded space regardless of language.

In some prior work, a parallel corpus is required to train the BWEs, making ADAN implicitly “supervised” in the target language. The same can be said for previous work in cross-lingual sentiment classification that requires a sophisticated MT system to link the two languages (Wan, 2009; Zhou et al., 2016). The latter requires more direct bilingual supervision in the form of translated SOURCE training data, and is thus not generally feasible for tasks that require massive amounts of, or evolving, training data. In contrast, ADAN relies on a fixed set of domain-independent BWEs, and no change is necessary when the training data changes. Moreover, even with randomly initialized embeddings, ADAN can still outperform some baseline methods that use BWEs (see Sec. 3.3.3).

We adopt the Deep Averaging Network (DAN) of Iyyer et al. (2015) for the feature extractor F. Although other architectures could be employed, we chose DAN because it is a simple neural network model that yields surprisingly good performance for monolingual sentiment classification. For each document, DAN takes the arithmetic mean of the word vectors as input and passes it through several fully-connected layers, followed by a softmax for classification. In ADAN, F first calculates the average of the word vectors in the input sequence, then passes the average through a feed-forward network with ReLU nonlinearities. The activations of the last layer in F are considered the extracted features for the input and are then passed on to P and Q. The sentiment classifier P and the language scorer Q are standard feed-forward networks. P has a softmax layer on top for sentiment classification, and Q ends with a linear layer of output width 1 to assign a language identification score¹.
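The architecture just described is compact enough to sketch directly. Below is a minimal PyTorch-style sketch of the three components (the paper's implementation uses Torch7; class names and layer sizes here are illustrative assumptions, and the Batch Normalization and exact dimensions of Section 3.4 are omitted):

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """F: averages the (bilingual) word embeddings of a document and passes
    the average through fully-connected ReLU layers."""
    def __init__(self, emb_dim, hidden_dim, num_layers=3):
        super().__init__()
        layers, d = [], emb_dim
        for _ in range(num_layers):
            layers += [nn.Linear(d, hidden_dim), nn.ReLU()]
            d = hidden_dim
        self.mlp = nn.Sequential(*layers)

    def forward(self, emb):            # emb: (batch, seq_len, emb_dim)
        avg = emb.mean(dim=1)          # deep averaging: arithmetic mean of word vectors
        return self.mlp(avg)           # shared features fed to both P and Q

class SentimentClassifier(nn.Module):
    """P: feed-forward net whose output logits go through a softmax
    (here realized via a cross-entropy loss during training)."""
    def __init__(self, hidden_dim, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_classes))

    def forward(self, feats):
        return self.net(feats)

class LanguageScorer(nn.Module):
    """Q: feed-forward net ending in a single linear unit that outputs an
    unbounded scalar score for how likely the input came from SOURCE."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1))

    def forward(self, feats):
        return self.net(feats).squeeze(-1)
```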

2.2 Adversarial Training

Consider the distribution of the joint hidden features F(x) for both SOURCE and TARGET instances:

P^{src}_F \triangleq P(F(x) \mid x \in \text{SOURCE})

P^{tgt}_F \triangleq P(F(x) \mid x \in \text{TARGET})

As mentioned above, we train F to make these two distributions as close as possible in order to learn language-invariant features for better cross-lingual generalization. Departing from previous research in adversarial training (Ganin and Lempitsky, 2015), in this work we minimize the Wasserstein distance, following Arjovsky et al. (2017). As argued by Arjovsky et al. (2017), existing approaches to training adversarial networks are equivalent to minimizing the Jensen-Shannon distance between two distributions, in our case P^{src}_F and P^{tgt}_F. Because the Jensen-Shannon distance suffers from discontinuities, providing less useful gradients for training F, Arjovsky et al. (2017) propose instead to minimize the Wasserstein distance and demonstrate its improved stability for hyperparameter selection.

As a result, we too minimize the Wasserstein distance between P^{src}_F and P^{tgt}_F according to the Kantorovich-Rubinstein duality (Villani, 2008):

W(P^{src}_F, P^{tgt}_F) = \sup_{\|g\|_L \le 1} \mathbb{E}_{f(x) \sim P^{src}_F}[g(f(x))] - \mathbb{E}_{f(x') \sim P^{tgt}_F}[g(f(x'))]    (1)

where the supremum (maximum) is taken over the set of all 1-Lipschitz² functions g. In order to (approximately) calculate W(P^{src}_F, P^{tgt}_F), we use the language scorer Q as the function g in (1), whose objective is then to seek the supremum in (1) to estimate W(P^{src}_F, P^{tgt}_F). To make Q a Lipschitz function (up to a constant), the parameters of Q are always clipped to a fixed range. Let Q be parameterized by θ_q; then the objective J_q of Q becomes:

J_q(\theta_f) \equiv \max_{\theta_q} \mathbb{E}_{F(x) \sim P^{src}_F}[Q(F(x))] - \mathbb{E}_{F(x') \sim P^{tgt}_F}[Q(F(x'))]    (2)

Intuitively, Q tries to output higher scores for SOURCE instances and lower scores for TARGET.

¹ Q simply tries to maximize scores for SOURCE texts and minimize them for TARGET; the scores are not bounded.

² A function g is 1-Lipschitz iff |g(x) − g(y)| ≤ |x − y| for all x and y.

For the sentiment classifier P parameterized by θ_p, we use the traditional cross-entropy loss, denoted as L_p(\hat{y}, y), where \hat{y} and y are the predicted label distribution and the true label, respectively. L_p is the negative log-likelihood that P predicts the correct sentiment label. We therefore seek the minimum of the following loss function for P:

J_p(\theta_f) \equiv \min_{\theta_p} \mathbb{E}_{(x,y)}[L_p(P(F(x)), y)]    (3)

Finally, the joint feature extractor F parameterized by θ_f strives to minimize both the sentiment classifier loss J_p and W(P^{src}_F, P^{tgt}_F) ≈ J_q:

J_f \equiv \min_{\theta_f} J_p(\theta_f) + \lambda J_q(\theta_f)    (4)

where λ is a hyper-parameter that balances the two branches P and Q.

As proved by Arjovsky et al. (2017) and observed in our experiments, minimizing the Wasserstein distance is much more stable w.r.t. hyperparameter selection, saving the hassle of carefully varying λ during training (Ganin and Lempitsky, 2015). In addition, traditional adversarial training methods need to laboriously coordinate the alternating training of the two competing components (Goodfellow et al., 2014) by setting a hyperparameter k, which indicates the number of iterations one component is trained before training the other. Unfortunately, performance can degrade substantially if k is not properly set. However, in our case, delicate tuning of k is no longer necessary since W(P^{src}_F, P^{tgt}_F) is approximated by maximizing (2); thus, training Q to optimum using a large k can provide better performance (but is slower to train). In our experiments, F and P are trained together and Q is trained separately. We use λ = 0.1 and k = 5 (train 5 Q iterations per F and P iteration), and the performance is stable over a large set of hyperparameters (see Section 3.3.4).
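Putting objectives (2)-(4) together, one training step runs k updates of Q (with weight clipping) followed by one joint update of F and P. A minimal sketch of this loop, assuming modules and optimizers like those sketched in Section 2.1 (all names are ours, not the paper's code):

```python
import torch
import torch.nn as nn

def adan_train_step(F, P, Q, opt_fp, opt_q, src_emb, src_labels, tgt_emb,
                    lam=0.1, k=5, clip=0.01):
    """One ADAN update: k iterations of Q, then one joint iteration of F and P.
    opt_fp optimizes the parameters of F and P; opt_q optimizes those of Q."""
    ce = nn.CrossEntropyLoss()

    # Train Q toward the supremum in Eq. (2): higher scores for SOURCE, lower for TARGET.
    for _ in range(k):
        opt_q.zero_grad()
        loss_q = -(Q(F(src_emb).detach()).mean() - Q(F(tgt_emb).detach()).mean())
        loss_q.backward()
        opt_q.step()
        for p in Q.parameters():           # weight clipping keeps Q (roughly) Lipschitz
            p.data.clamp_(-clip, clip)

    # Train F and P: sentiment loss (Eq. 3) plus lambda times the estimated
    # Wasserstein distance; only labeled SOURCE data feed P (Eq. 4).
    opt_fp.zero_grad()
    feats_src, feats_tgt = F(src_emb), F(tgt_emb)
    loss_p = ce(P(feats_src), src_labels)
    wasserstein_est = Q(feats_src).mean() - Q(feats_tgt).mean()
    (loss_p + lam * wasserstein_est).backward()
    opt_fp.step()
```

With λ = 0.1 and k = 5 this corresponds to the schedule described above.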

3 Experiments and Discussions

To demonstrate the effectiveness of our model, we experiment on Chinese and Arabic sentiment classification, using English as SOURCE for both. For all data used in the experiments, tokenization is done using Stanford CoreNLP (Manning et al., 2014).

3.1 Data

Labeled English Data. We use a balanced dataset of 700k Yelp reviews from Zhang et al. (2015) with their sentiment ratings as labels (scale 1-5). We also adopt their train-validation split: 650k reviews for training and 50k for validation.

Labeled Chinese Data. Since ADAN does not require labeled Chinese data for training, this annotated data is solely used to validate the performance of our model. 10k balanced Chinese hotel reviews from Lin et al. (2015) are used as the validation set for model selection and parameter tuning. The results are reported on a separate test set of another 10k hotel reviews. For Chinese, the data are annotated with 5 labels (−−, −, 0, +, ++).

Unlabeled Chinese Data. For the unlabeled TARGET data used in training ADAN, we use another 150k unlabeled Chinese hotel reviews.

English-Chinese Bilingual Word Embeddings. For Chinese, we use the pre-trained bilingual word embeddings (BWE) by Zou et al. (2013). Their work provides 50-dimensional embeddings for 100k English words and another set of 100k Chinese words. For more experiments and discussion of BWEs, see Section 3.3.3.

Labeled Arabic Data. We use the BBN Arabic Sentiment Analysis dataset (Mohammad et al., 2016a) for Arabic sentiment classification. The dataset contains 1200 sentences from social media posts annotated with 3 sentiment labels (−, 0, +). The dataset also provides machine-translated text in English. Since this label set does not match the English dataset, we map all rating-4 and rating-5 English instances to + and the rating-1 and rating-2 instances to −, while the rating-3 sentences are converted to 0.

Unlabeled Arabic Data. For Arabic, no additional unlabeled data is used. We only use the text from the annotated data (without labels) during training.

English-Arabic Bilingual Word Embeddings. For Arabic, since no pre-trained BWE is available, we train a 300d BilBOWA BWE (Gouws et al., 2015) on the United Nations corpus (Ziemski et al., 2016).
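The label-set conversion described above for the Arabic experiments (Yelp ratings 4-5 → +, 1-2 → −, 3 → 0) amounts to a simple mapping; a sketch (the function name is ours):

```python
# Map 5-way Yelp ratings onto the 3-way BBN Arabic label set.
def map_yelp_rating(rating: int) -> str:
    if rating >= 4:      # ratings 4 and 5
        return "+"
    if rating <= 2:      # ratings 1 and 2
        return "-"
    return "0"           # rating 3
```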

3.2 Cross-Lingual Sentiment Classification

Our main results are shown in Table 1, which shows very similar trends for Chinese and Arabic.

Setting                 Approach                        Chinese    Arabic
Train-on-source-only    Logistic Regression             30.58%     45.83%
                        DAN                             29.11%     48.00%
Domain Adaptation       mSDA (Chen et al., 2012)        31.44%     48.33%
Machine Translation     Logistic Regression + MT        34.01%     51.67%
                        DAN + MT                        39.66%     52.50%
Ours                    ADAN (CN: 50d, AR: 300d)        42.95%†    55.33%†

† p < 0.001 under a McNemar test.

Table 1: ADAN performance (accuracy) for Chinese (5-cls) and Arabic (3-cls) sentiment classification without using labeled TARGET data. All systems use BWE to map SOURCE and TARGET words into the same space.

Note first that in all of our experimental settings, traditional features like bag of words cannot be used directly since SOURCE and TARGET have completely different vocabularies. Therefore, bilingual word embeddings (BWE) are used as the input representation for all systems to map words from both SOURCE and TARGET into the same feature space. In addition, some existing CLSC methods (Wan, 2009; Zhou et al., 2016) need to translate the entire English training set into each target language, which is prohibitive in our setting since our training set has 650k samples.

We start by considering two baselines that train only on the SOURCE language, English, and rely solely on the BWE to classify the TARGET. The first variation uses a standard supervised learning algorithm, Logistic Regression (LR), shown in Row 1 of Table 1. In addition, we evaluate a non-adversarial variation of ADAN, just the DAN portion of our model (Row 2), which is one of the state-of-the-art neural models for sentiment classification. We can see from Table 1 that, in comparison to ADAN (bottom row), BWE by itself does not suffice to transfer knowledge of English sentiment classification to TARGET, and the performance of DAN is poor. On Chinese, even LR performs slightly better, despite the fact that DAN outperforms LR by a large margin on English sentiment classification (not shown in the table). This might suggest that fitting tightly to the English data does not necessarily entail good performance on TARGET due to the distributional discrepancy.

We next compare ADAN with domain adaptation baselines, since domain adaptation can be viewed as a generalization of the cross-lingual task. Nonetheless, domain adaptation methods did not yield satisfactory results for our task. TCA (Pan et al., 2011) did not work since it requires space quadratic in the number of samples (650k). SDA (Glorot et al., 2011) and the subsequent mSDA (Chen et al., 2012) have proven very effective for cross-domain sentiment classification on Amazon reviews. However, as shown in Table 1 (Row 3), mSDA did not perform competitively. We speculate that this is because many domain adaptation models, including mSDA, were designed for bag-of-words features, which are ill-suited to our task where the two languages have completely different vocabularies. In summary, this suggests that even strong domain adaptation algorithms cannot be used out of the box for our task with satisfactory results.

Finally, we evaluate ADAN against Machine Translation baselines (Rows 4-5) that (1) translate the TARGET text into English and then (2) use the better of the train-on-source-only models for sentiment classification. Previous studies (Banea et al., 2008; Salameh et al., 2015) on sentiment analysis for Arabic and European languages report that this MT approach is very competitive and can sometimes match the state-of-the-art system trained on that language. For Chinese, where translated text was not provided, we use the commercial Google Translate engine³, which is highly engineered, trained on enormous resources, and arguably one of the best MT systems. As shown in Table 1, our ADAN model substantially outperforms the MT baseline on both languages, indicating that our adversarial model can successfully perform cross-lingual sentiment analysis without any annotated data in the target language.

3.3 Analysis and Discussions

Since the Arabic dataset is small and may produce noisy results, we chose Chinese as an example for our further analysis.

³ https://translate.google.com

3.3.1 Semi-supervised Learning

In practice, it is usually not very difficult to obtain at least a small amount of annotated data. ADAN can be readily adapted to exploit such extra labeled data in the target language, by letting those labeled instances pass through the sentiment classifier P as the English samples do during training. We simulate this semi-supervised scenario by adding labeled Chinese reviews for training. We start by adding 100 labeled reviews and keep doubling the number until 12800. As shown in Figure 2, when adding the same number of labeled reviews, ADAN better utilizes the extra supervision and outperforms the DAN baseline trained with the combined data, as well as the supervised DAN using only labeled Chinese reviews. The margin naturally decreases as more supervision is incorporated, but ADAN is still superior when adding 12800 labeled reviews. On the other hand, the DAN-with-translation baseline seems unable to effectively utilize the added supervision in Chinese; its performance only starts to show a slightly increasing trend when adding 6400 or more labeled reviews. One possible reason is that when a small number of English reviews translated from the labeled Chinese data are added to the training data, the training signal they produce may be lost among the vast number of English training samples, thus not effectively improving performance. Another interesting finding is that even a very small amount of supervision (e.g. 100 labels) appears to significantly help DAN. However, with the same number of labeled reviews, ADAN still outperforms the DAN baseline.

Figure 2: ADAN performance for Chinese in the semi-supervised setting when using various amounts of labeled Chinese data.
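Concretely, the semi-supervised extension only changes the sentiment term of the objective: labeled TARGET batches contribute an extra cross-entropy term through the same classifier P. A minimal sketch, reusing the hypothetical F and P modules from the earlier sketches (argument names are ours):

```python
import torch.nn.functional as nnf

def sentiment_loss(F, P, src_x, src_y, tgt_x=None, tgt_y=None):
    """Sentiment term of the ADAN objective. In the semi-supervised setting,
    labeled TARGET reviews (tgt_x, tgt_y) simply add a second cross-entropy
    term through the same classifier P."""
    loss = nnf.cross_entropy(P(F(src_x)), src_y)
    if tgt_x is not None:                  # labeled TARGET reviews, if any
        loss = loss + nnf.cross_entropy(P(F(tgt_x)), tgt_y)
    return loss
```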

3.3.2 Qualitative Analysis and Visualizations

To qualitatively demonstrate how ADAN bridges the distributional discrepancies between English and Chinese instances, t-SNE (Van der Maaten and Hinton, 2008) visualizations of the activations at various layers are shown in Figure 3. We randomly select 1000 reviews from the Chinese and English validation sets respectively, and plot the t-SNE of the hidden node activations at three locations in our model: the averaging layer, the end of the joint feature extractor, and the last hidden layer in the sentiment classifier before the softmax. The train-on-English model is the DAN baseline in Table 1. Note that there is actually only one “branch” in this baseline model, but in order to compare with ADAN, we conceptually treat the first three layers as the feature extractor.

Figure 3a shows that BWE alone does not suffice to bridge the gap between the distributions of the two languages. Furthermore, we can see in Figure 3b that the distributional discrepancies between Chinese and English are significantly reduced after passing through the joint feature extractor (F), and the features learned by ADAN bring the distributions of the two languages dramatically closer compared to the monolingually trained baseline. This is measured by the Averaged Hausdorff Distance (AHD) (Shapiro and Blaschko, 2004; Schutze et al., 2010), a measure of the distance between two sets of points. Figure 3 annotates each sub-figure with the AHD between the English and Chinese reviews.
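For reference, the Averaged Hausdorff Distance between two point sets is commonly computed as the mean of the two directed average nearest-neighbor distances; the sketch below implements that common variant (our reading of the metric, which may differ in detail from the cited technical reports):

```python
import numpy as np

def averaged_hausdorff_distance(A: np.ndarray, B: np.ndarray) -> float:
    """Averaged Hausdorff Distance between point sets A (m, d) and B (n, d):
    the mean of the two directed average nearest-neighbor distances."""
    diff = A[:, None, :] - B[None, :, :]      # pairwise differences, shape (m, n, d)
    dist = np.sqrt((diff ** 2).sum(-1))       # pairwise Euclidean distances, shape (m, n)
    d_ab = dist.min(axis=1).mean()            # average distance from each a in A to nearest b
    d_ba = dist.min(axis=0).mean()            # average distance from each b in B to nearest a
    return (d_ab + d_ba) / 2.0
```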

Finally, when looking at the last-hidden-layer activations in the sentiment classifier of the baseline model (Figure 3c), there are several notable clusters of the red dots (English data) that roughly correspond to the class labels. These English clusters are the areas where the classifier is most confident in making decisions. However, most Chinese samples are not close to any of those clusters because of the distributional divergence, which may thus cause degraded performance in Chinese sentiment classification. In ADAN, on the other hand, the Chinese samples are more in line with the English ones, which results in the accuracy boost over the baseline model. In Figure 3, a pair of similar English and Chinese 5-star reviews are highlighted to visualize how the distribution evolves at various points of the network. We can see in 3c that the Chinese review moves close to the “positive English cluster” in ADAN, while in the baseline it stays away from the dense English clusters, in regions where the sentiment classifier trained on English data is not confident in its predictions.

[Figure 3 panels: (a) Averaging Layer Outputs, (b) Joint Hidden Features, (c) Sentiment Branch Outputs; rows: 1. Train on English, 2. ADAN. Each panel is annotated with the Averaged Hausdorff Distance between the English and Chinese reviews, and a highlighted pair of similar 5-star English and Chinese reviews (e.g. “I have been here twice and both times have been great. They really have a nice service staff & very attentive! Food is pretty good as well! ...”) illustrates how the distributions evolve.]

Figure 3: t-SNE visualizations of activations at various layers for the train-on-source-only baseline model (top) and ADAN (bottom). Better viewed in color; zoom in for more details. The distributions of the two languages are brought much closer in ADAN as they are represented deeper in the network (left to right), as measured by the Averaged Hausdorff Distance (discussed above). The green circles are two 5-star example reviews (shown below the figure) that illustrate how the distribution evolves.

3.3.3 Impact of Bilingual Word Embeddings

In this section we discuss the effect of the bilingual word embeddings. We start by feeding the systems randomly initialized WEs, shown in Table 2. ADAN with random WEs outperforms the DAN and mSDA baselines using BWE and matches the performance of the LR+MT baseline (Table 1), suggesting that ADAN successfully extracts features usable for cross-lingual classification, similarly to BWE, without any bitext.

With the introduction of BWE, the performance of ADAN is further boosted; it therefore seems that the quality of the BWE plays an important role in cross-lingual classification. To investigate the impact of BWE, we also trained a 100d BilBOWA BWE (Gouws et al., 2015) on the UN parallel corpus for Chinese. All systems achieve slightly lower performance compared to the pre-trained BWE, yet ADAN still outperforms the other baseline methods (Table 2), demonstrating that ADAN’s effectiveness is relatively robust with respect to the choice of BWE. As for why all systems show inferior results with BilBOWA, we conjecture that the BilBOWA embeddings may be of slightly lower quality since, unlike Zou et al. (2013), BilBOWA does not use word alignments during training. By training only on a sentence-aligned corpus, BilBOWA requires fewer resources and is much faster to train, potentially at the expense of quality.

Model     Random     BilBOWA    Pre-trained
DAN       21.66%     28.75%     29.11%
DAN+MT    37.78%     38.17%     39.66%
ADAN      34.44%     40.51%     42.95%

Table 2: Model performance (accuracy) for various (B)WE choices for Chinese.

3.3.4 ADAN Hyperparameter Stability

In this section, we show that the training of ADAN is stable over a large set of hyperparameters and provides improved performance compared to the traditional adversarial training method of Ganin and Lempitsky (2015).

We implemented a variant of ADAN similar to the adversarial domain adaptation network of Ganin and Lempitsky (2015).


Figure 4: A grid search over k and λ for ADAN (right) and the Ganin and Lempitsky (2015) variant (left). Numbers indicate accuracy on the Chinese development set.

In particular, Q is now a binary classifier with a softmax layer on top which classifies whether an input text x is from SOURCE or TARGET, given its hidden features F(x). For training, Q is connected to F through a GradientReversalLayer (Ganin and Lempitsky, 2015), which preserves its input during a forward pass but multiplies the gradients by −λ during a backward pass. λ is a hyperparameter, similar to that in ADAN, that balances the effects P and Q have on F. This way, the entire network can be trained with standard backpropagation. In addition, as mentioned in Section 2.2, the training of F and Q might not be fully in sync, and efforts need to be made to coordinate the adversarial training. This is achieved by setting λ to a non-zero value only once out of every k batches. Here, k is again a hyperparameter, similar to that in ADAN, that coordinates the training of F and Q. When λ = 0, the gradients from Q are not back-propagated to F. This allows Q more iterations to adapt to F before F makes another adversarial update.
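The gradient reversal mechanism of this variant can be expressed as a custom autograd function that is the identity on the forward pass and multiplies gradients by −λ on the backward pass. A minimal PyTorch-style sketch of that mechanism (this illustrates the Ganin and Lempitsky (2015) variant, not ADAN's own training procedure):

```python
import torch

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies incoming gradients by -lambda
    in the backward pass, so Q's loss pushes F toward language-invariant features."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # no gradient w.r.t. lambda

def grad_reverse(x, lam=1.0):
    return GradientReversal.apply(x, lam)

# Usage sketch: language logits computed on gradient-reversed features.
# lang_logits = Q(grad_reverse(F(x), lam))
```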

To verify the superiority of ADAN, we conduct a grid search over the two hyperparameters: k (the number of Q iterations per F iteration) and λ (the balance factor between P and Q). We experiment with k ∈ {1, 2, 4, 8, 16} and λ ∈ {0.00625, 0.0125, 0.025, 0.05, 0.1, 0.2, 0.4, 0.8}. Figure 4 reports the accuracy on the Chinese development set for both ADAN variants. It can be clearly seen that ADAN achieves higher accuracy while being much more stable than the Ganin and Lempitsky (2015) variant, suggesting that ADAN overcomes the well-known problem that adversarial training is sensitive to hyperparameter tuning.
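The grid search itself is a plain sweep over the two hyperparameters; a sketch, where train_and_evaluate is a placeholder of ours for training either variant and reporting Chinese development-set accuracy:

```python
from itertools import product

def train_and_evaluate(k: int, lam: float) -> float:
    """Placeholder (ours): train ADAN or the Ganin-Lempitsky variant with the
    given (k, lambda) and return accuracy on the Chinese development set."""
    raise NotImplementedError

ks = [1, 2, 4, 8, 16]
lams = [0.00625, 0.0125, 0.025, 0.05, 0.1, 0.2, 0.4, 0.8]

results = {(k, lam): train_and_evaluate(k, lam) for k, lam in product(ks, lams)}
best = max(results, key=results.get)
print("best (k, lambda):", best, "dev accuracy:", results[best])
```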

3.4 Implementation Details

For all our experiments on both languages, the feature extractor F has three fully-connected hidden layers with ReLU non-linearities, while P and Q each have two. All hidden layers contain 900 hidden units. This choice is more or less ad hoc, and the performance could potentially be improved with more careful model selection. Batch Normalization (Ioffe and Szegedy, 2015) is used in each hidden layer in P and Q; F does not use BN. F and P are optimized together by Adam (Kingma and Ba, 2015) with a learning rate of 0.05 for the Chinese and 0.01 for the Arabic experiments. Q is trained with Adam with a learning rate of 0.00005. The weights of Q are clipped to [−0.01, 0.01]. ADAN is implemented in Torch7 (Collobert et al., 2011). We train ADAN for 30 epochs and use early stopping to select the best model on the validation set.
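For reference, the hyperparameters stated above gathered in one place (the key names are ours):

```python
# Hyperparameters as reported in Sections 2.2 and 3.4 (key names are ours).
ADAN_CONFIG = {
    "feature_extractor_layers": 3,        # fully-connected ReLU layers in F
    "classifier_layers": 2,               # in both P and Q
    "hidden_units": 900,
    "batch_norm": "P and Q only",         # F does not use Batch Normalization
    "optimizer": "Adam",
    "lr_f_p": {"chinese": 0.05, "arabic": 0.01},
    "lr_q": 5e-5,
    "q_weight_clip": (-0.01, 0.01),
    "lambda": 0.1,
    "k": 5,                               # Q iterations per F/P iteration
    "epochs": 30,                         # with early stopping on the validation set
}
```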

4 Related Work

Cross-lingual Sentiment Analysis is motivated by the lack of high-quality labeled data in many non-English languages (Mihalcea et al., 2007; Banea et al., 2008, 2010). For Chinese and Arabic in particular, there are several representative works (Wan, 2008, 2009; He et al., 2010; Lu et al., 2011; Mohammad et al., 2016a). Our work is comparable to these papers in objective but very different in method. The work by Wan uses machine translation to directly convert English training data to Chinese; this is one of our baselines. Lu et al. (2011) instead use labeled data from both languages to improve performance on both.

Domain Adaptation tries to learn effective classifiers for which the training and test samples are from different underlying distributions (Blitzer et al., 2007; Pan et al., 2011; Glorot et al., 2011; Chen et al., 2012; Liu et al., 2015). This can be thought of as a generalization of cross-lingual text classification. However, one main difference is that, when applied to text classification tasks such as sentiment analysis, most work assumes a common feature space such as bag of words, which is not available in the cross-lingual setting (see Section 3.2 for experiments on this). In addition, most work in domain adaptation evaluates on adapting product reviews across domains (e.g. books to electronics), where the divergence in distribution is less significant than that between two languages.

Adversarial Networks have enjoyed much success in computer vision (Goodfellow et al., 2014; Ganin and Lempitsky, 2015), but to our best knowledge have not yet achieved comparable success in NLP. We are the first to apply adversarial training to cross-lingual NLP tasks. A series of works in image generation has used architectures similar to ours, by pitting a neural image generator against a discriminator that learns to classify real versus generated images (Goodfellow et al., 2014; Denton et al., 2015). More relevant to this work, adversarial architectures have produced the state of the art in unsupervised domain adaptation for image object recognition: Ganin and Lempitsky (2015) train with many labeled source images and unlabeled target images, similar to our setup. In addition, some recent work (Arjovsky et al., 2017; Gulrajani et al., 2017) proposes improved methods for training Generative Adversarial Nets.

5 Conclusion and Future Work

In this work, we presented ADAN, an adversarial deep averaging network for cross-lingual sentiment classification which, for the first time, applies adversarial training to cross-lingual NLP. ADAN leverages the abundant resources for English to help sentiment analysis in other languages where little or no annotated data exists. We validate our hypothesis with empirical experiments on Chinese and Arabic sentiment classification, where we have labeled English data and only unlabeled data in the target language. Experiments show that ADAN outperforms several baselines, including domain adaptation models and a highly competitive MT baseline. We further show that even without any bilingual resources, ADAN trained with randomly initialized embeddings can still achieve meaningful cross-lingual performance. In addition, we show that in the presence of labeled data in the target language, ADAN can naturally incorporate this additional supervision and yields even more competitive results.

For future work, we plan to apply our adversarial training framework to other NLP adaptation tasks where explicit MLE training is not feasible due to the lack of direct supervision. For instance, our framework is not limited to text classification tasks, and can be extended to phrase-level opinion mining (Irsoy and Cardie, 2014b) by extracting phrase-level opinion expressions from sentences using deep recurrent neural networks. Our framework can be applied to these phrase-level models for languages where labeled data might not exist. In another direction, our adversarial framework for cross-lingual text categorization can be used in conjunction with not only DAN but also many other neural models such as LSTMs.

References

M. Arjovsky, S. Chintala, and L. Bottou. 2017. Wasserstein GAN. ArXiv e-prints. https://arxiv.org/abs/1701.07875.

Carmen Banea, Rada Mihalcea, and Janyce Wiebe. 2010. Multilingual subjectivity: Are more languages better? In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010). Coling 2010 Organizing Committee, pages 28–36. http://aclweb.org/anthology/C10-1004.

Carmen Banea, Rada Mihalcea, Janyce Wiebe, and Samer Hassan. 2008. Multilingual subjectivity analysis using machine translation. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 127–135. http://aclweb.org/anthology/D08-1014.

John Blitzer, Mark Dredze, and Fernando Pereira. 2007. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Association for Computational Linguistics, pages 440–447. http://aclweb.org/anthology/P07-1056.

Minmin Chen, Zhixiang Xu, Kilian Weinberger, and Fei Sha. 2012. Marginalized denoising autoencoders for domain adaptation. In John Langford and Joelle Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML-12), Omnipress, New York, NY, USA, ICML '12, pages 767–774. http://icml.cc/2012/papers/416.pdf.

Ronan Collobert, Koray Kavukcuoglu, and Clement Farabet. 2011. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop. http://cs.nyu.edu/~koray/files/2011_torch7_nipsw.pdf.

Emily L. Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. 2015. Deep generative image models using a laplacian pyramid of adversarial networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, Curran Associates, Inc., pages 1486–1494. http://papers.nips.cc/paper/5773-deep-generative-image-models-using-a-laplacian-pyramid-of-adversarial-networks.pdf.

Yaroslav Ganin and Victor Lempitsky. 2015. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning. JMLR Workshop and Conference Proceedings. http://jmlr.org/proceedings/papers/v37/ganin15.pdf.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Lise Getoor and Tobias Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11). ACM, New York, NY, USA, ICML '11, pages 513–520. http://www.icml-2011.org/papers/342_icmlpaper.pdf.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, Curran Associates, Inc., pages 2672–2680. http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.

Stephan Gouws, Yoshua Bengio, and Greg Corrado. 2015. BilBOWA: Fast bilingual distributed representations without word alignments. In Proceedings of the 32nd International Conference on Machine Learning. http://jmlr.org/proceedings/papers/v37/gouws15.pdf.

I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. 2017. Improved Training of Wasserstein GANs. ArXiv e-prints.

Yulan He, Harith Alani, and Deyu Zhou. 2010. Exploring English lexicon knowledge for Chinese sentiment analysis. In CIPS-SIGHAN Joint Conference on Chinese Language Processing. http://aclweb.org/anthology/W10-4116.

Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning. http://jmlr.org/proceedings/papers/v37/ioffe15.pdf.

Ozan Irsoy and Claire Cardie. 2014a. Deep recursive neural networks for compositionality in language. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, Curran Associates, Inc., pages 2096–2104. http://papers.nips.cc/paper/5551-deep-recursive-neural-networks-for-compositionality-in-language.pdf.

Ozan Irsoy and Claire Cardie. 2014b. Opinion mining with deep recurrent neural networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 720–728. http://aclweb.org/anthology/D14-1080.

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daume III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, pages 1681–1691. https://doi.org/10.3115/v1/P15-1162.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations. https://arxiv.org/abs/1412.6980.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning. http://jmlr.org/proceedings/papers/v32/le14.html.

Huey Yee Lee and Hemnaath Renganathan. 2011. Chinese sentiment analysis using maximum entropy. In Proceedings of the Workshop on Sentiment Analysis where AI meets Psychology (SAAIP 2011). Asian Federation of Natural Language Processing, pages 89–93. http://aclweb.org/anthology/W11-3713.

Yiou Lin, Hang Lei, Jia Wu, and Xiaoyu Li. 2015. An empirical study on sentiment classification of chinese review using word embedding. In Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation: Posters, pages 258–266. http://aclweb.org/anthology/Y15-2030.

Biao Liu, Minlie Huang, Jiashen Sun, and Xuan Zhu. 2015. Incorporating domain and sentiment supervision in representation learning for domain adaptation. In International Joint Conference on Artificial Intelligence. http://www.aaai.org/ocs/index.php/IJCAI/IJCAI15/paper/view/10722.

Bin Lu, Chenhao Tan, Claire Cardie, and Benjamin K. Tsou. 2011. Joint bilingual sentiment classification with unlabeled parallel corpora. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, pages 320–330. http://aclweb.org/anthology/P11-1033.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60. http://www.aclweb.org/anthology/P/P14/P14-5010.

Rada Mihalcea, Carmen Banea, and Janyce Wiebe. 2007. Learning multilingual subjective language via cross-lingual projections. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Association for Computational Linguistics, pages 976–983. http://aclweb.org/anthology/P07-1123.

Saif M. Mohammad, Mohammad Salameh, and Svetlana Kiritchenko. 2016a. How translation alters sentiment. Journal of Artificial Intelligence Research 55(1):95–130. http://dl.acm.org/citation.cfm?id=3013558.3013562.

Saif M. Mohammad, Mohammad Salameh, and Svetlana Kiritchenko. 2016b. Sentiment lexicons for arabic social media. In Proceedings of the 10th edition of the Language Resources and Evaluation Conference (LREC). http://www.lrec-conf.org/proceedings/lrec2016/pdf/234_Paper.pdf.

S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. 2011. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks 22(2):199–210. https://doi.org/10.1109/TNN.2010.2091281.

Mohammad Salameh, Saif Mohammad, and Svetlana Kiritchenko. 2015. Sentiment after translation: A case-study on arabic social media posts. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, pages 767–777. https://doi.org/10.3115/v1/N15-1078.

Oliver Schutze, Xavier Esquivel, Adriana Lara, and Carlos A. Coello Coello. 2010. Measuring the averaged hausdorff distance to the pareto front of a multi-objective optimization problem. Technical Report TR-OS-2010-02, CINVESTAV. http://delta.cs.cinvestav.mx/~schuetze/technical_reports/TR-OS-2010-02.pdf.

Michael D. Shapiro and Matthew B. Blaschko. 2004. On hausdorff distance measures. Technical Report UM-CS-2004-071. https://web.cs.umass.edu/publication/docs/2004/UM-CS-2004-071.pdf.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 1631–1642. http://aclweb.org/anthology/D13-1170.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, pages 1556–1566. https://doi.org/10.3115/v1/P15-1150.

Songbo Tan and Jin Zhang. 2008. An empirical study of sentiment analysis for chinese documents. Expert Systems with Applications 34(4):2622–2629. https://doi.org/10.1016/j.eswa.2007.05.028.

Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pages 384–394. http://aclweb.org/anthology/P10-1040.

Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research. http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf.

Cedric Villani. 2008. Optimal transport: old and new, volume 338. Springer Science & Business Media.

Xiaojun Wan. 2008. Using bilingual knowledge and ensemble techniques for unsupervised chinese sentiment analysis. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 553–561. http://aclweb.org/anthology/D08-1058.

Xiaojun Wan. 2009. Co-training for cross-lingual sentiment classification. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1. Association for Computational Linguistics, Stroudsburg, PA, USA, ACL '09, pages 235–243. http://dl.acm.org/citation.cfm?id=1687878.1687913.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, Curran Associates, Inc., pages 649–657. http://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf.

Xinjie Zhou, Xiaojun Wan, and Jianguo Xiao. 2016. Cross-lingual sentiment classification with bilingual document representation learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, pages 1403–1412. https://doi.org/10.18653/v1/P16-1133.

Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The united nations parallel corpus. In Language Resources and Evaluation (LREC16).

Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 1393–1398. http://aclweb.org/anthology/D13-1141.