


Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), pages 1021–1032, Melbourne, Australia, July 15–20, 2018. © 2018 Association for Computational Linguistics


Adversarial Contrastive Estimation

Avishek Joey Bose 1,2,∗†   Huan Ling 1,2,∗†   Yanshuai Cao 1,∗

1 Borealis AI   2 University of Toronto

{joey.bose,huan.ling}@mail.utoronto.ca   {yanshuai.cao}@borealisai.com

Abstract

Learning by contrasting positive and negative samples is a general strategy adopted by many methods. Noise contrastive estimation (NCE) for word embeddings and translating embeddings for knowledge graphs are examples in NLP employing this approach. In this work, we view contrastive learning as an abstraction of all such methods and augment the negative sampler into a mixture distribution containing an adversarially learned sampler. The resulting adaptive sampler finds harder negative examples, which forces the main model to learn a better representation of the data. We evaluate our proposal on learning word embeddings, order embeddings and knowledge graph embeddings and observe both faster convergence and improved results on multiple metrics.

1 Introduction

Many models learn by contrasting losses on observed positive examples with those on some fictitious negative examples, trying to decrease some score on positive ones while increasing it on negative ones. There are multiple reasons why such a contrastive learning approach is needed. Computational tractability is one. For instance, instead of using softmax to predict a word for learning word embeddings, noise contrastive estimation (NCE) (Dyer, 2014; Mnih and Teh, 2012) can be used in skip-gram or CBOW word embedding models (Gutmann and Hyvärinen, 2012; Mikolov et al., 2013; Mnih and Kavukcuoglu, 2013; Vaswani et al., 2013).

∗ Authors contributed equally. † Work done while the author was an intern at Borealis AI.

Another reason is modeling need, as certain assumptions are best expressed as some score or energy in margin-based or un-normalized probability models (Smith and Eisner, 2005). For example, modeling entity relations as translations or variants thereof in a vector space naturally leads to a distance-based score to be minimized for observed entity-relation-entity triplets (Bordes et al., 2013).

Given a scoring function, the gradient of the model’s parameters on observed positive examples can be readily computed, but the negative phase requires a design decision on how to sample data. In noise contrastive estimation for word embeddings, a negative example is formed by replacing a component of a positive pair with a randomly sampled word from the vocabulary, resulting in a fictitious word-context pair which would be unlikely to actually exist in the dataset. This negative sampling by corruption approach is also used in learning knowledge graph embeddings (Bordes et al., 2013; Lin et al., 2015; Ji et al., 2015; Wang et al., 2014; Trouillon et al., 2016; Yang et al., 2014; Dettmers et al., 2017), order embeddings (Vendrov et al., 2016), caption generation (Dai and Lin, 2017), etc.

Typically the corruption distribution is the same for all inputs, as in skip-gram or CBOW NCE, rather than being a conditional distribution that takes into account information about the input sample under consideration. Furthermore, the corruption process usually only encodes a human prior as to what constitutes a hard negative sample, rather than being learned from data. For these two reasons, the simple fixed corruption process often yields only easy negative examples. Easy negatives are sub-optimal for learning discriminative representations as they do not force the model to find critical characteristics of observed positive data, which has been independently discovered in applications outside NLP previously (Shrivastava et al., 2016). Even if hard negatives are occasionally reached, the infrequency means slow convergence. Designing a more sophisticated corruption process could be fruitful, but requires costly trial-and-error by a human expert.

In this work, we propose to augment the simple corruption noise process in various embedding models with an adversarially learned conditional distribution, forming a mixture negative sampler that adapts to the underlying data and the embedding model's training progress. The resulting method is referred to as adversarial contrastive estimation (ACE). The adaptive conditional model engages in a minimax game with the primary embedding model, much like in Generative Adversarial Networks (GANs) (Goodfellow et al., 2014a), where a discriminator net (D) tries to distinguish samples produced by a generator (G) from real data (Goodfellow et al., 2014b). In ACE, the main model learns to distinguish between a real positive example and a negative sample selected by the mixture of a fixed NCE sampler and an adversarial generator. The main model and the generator take alternating turns updating their parameters. In fact, our method can be viewed as a conditional GAN (Mirza and Osindero, 2014) on discrete inputs, with a mixture generator consisting of a learned and a fixed distribution, and with additional techniques introduced to achieve stable and convergent training of embedding models.

In our proposed ACE approach, the conditional sampler finds harder negatives than NCE, while being able to gracefully fall back to NCE whenever the generator cannot find hard negatives. We demonstrate the efficacy and generality of the proposed method on three different learning tasks: word embeddings (Mikolov et al., 2013), order embeddings (Vendrov et al., 2016) and knowledge graph embeddings (Ji et al., 2015).

2 Method

2.1 Background: contrastive learning

In the most general form, our method applies to supervised learning problems with a contrastive objective of the following form:

L(ω) = E_{p(x+,y+,y−)} l_ω(x+, y+, y−)    (1)

where l_ω(x+, y+, y−) captures both the model with parameters ω and the loss that scores a positive tuple (x+, y+) against a negative one (x+, y−). E_{p(x+,y+,y−)}(·) denotes expectation with respect to some joint distribution over positive and negative samples. Furthermore, by the law of total expectation, and the fact that given x+, the negative sampling is not dependent on the positive label, i.e. p(y+, y−|x+) = p(y+|x+) p(y−|x+), Eq. 1 can be re-written as

E_{p(x+)} [ E_{p(y+|x+) p(y−|x+)} l_ω(x+, y+, y−) ]    (2)

Separable loss

In the case where the loss decomposes into a sum of scores on positive and negative tuples, such as l_ω(x+, y+, y−) = s_ω(x+, y+) − s̃_ω(x+, y−), Expression 2 becomes

E_{p+(x)} [ E_{p+(y|x)} s_ω(x, y) − E_{p−(y|x)} s̃_ω(x, y) ]    (3)

where we moved the + and − to p for notational brevity. Learning by stochastic gradient descent aims to adjust ω to push down s_ω(x, y) on samples from p+ while pushing up s̃_ω(x, y) on samples from p−. Note that for generality, the scoring function for negative samples, denoted by s̃_ω, could be slightly different from s_ω. For instance, s̃ could contain a margin, as in the case of Order Embeddings in Sec. 4.2.

Non-separable loss

Eq. 1 is the general form that we would like to consider because for certain problems, the loss function cannot be separated into sums of terms containing only positives (x+, y+) and terms with negatives (x+, y−). An example of such a non-separable loss is the triplet ranking loss (Schroff et al., 2015): l_ω = max(0, η + s_ω(x+, y+) − s_ω(x+, y−)), which does not decompose due to the rectification.
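To make the distinction concrete, the following sketch contrasts the two loss forms (an illustrative PyTorch snippet, not the authors' code; the score tensors and the margin eta are placeholders):

```python
import torch

def separable_loss(s_pos, s_neg):
    # l(x, y+, y-) = s(x, y+) - s~(x, y-): the positive and negative terms decompose,
    # so their expectations can be estimated independently.
    return s_pos - s_neg

def triplet_ranking_loss(s_pos, s_neg, eta=1.0):
    # l(x, y+, y-) = max(0, eta + s(x, y+) - s(x, y-)): the rectification couples the
    # positive and negative scores, so the loss does not decompose (non-separable).
    return torch.clamp(eta + s_pos - s_neg, min=0.0)
```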

Noise contrastive estimation

The typical NCE approach in tasks such as word embeddings (Mikolov et al., 2013), order embeddings (Vendrov et al., 2016), and knowledge graph embeddings can be viewed as a special case of Eq. 2 by taking p(y−|x+) to be some unconditional p_nce(y).

This leads to efficient computation during training; however, p_nce(y) sacrifices the sampling efficiency of learning, as the negatives produced using a fixed distribution are not tailored toward x+ and, as a result, are not necessarily hard negative examples. Thus, the model is not forced to discover a discriminative representation of observed positive data. As training progresses and more and more negative examples are correctly learned, the probability of drawing a hard negative example diminishes further, causing slow convergence.

2.2 Adversarial mixture noise

To remedy the above-mentioned problem of a fixed unconditional negative sampler, we propose to augment it into a mixture, λ p_nce(y) + (1 − λ) g_θ(y|x), where g_θ is a conditional distribution with a learnable parameter θ and λ is a hyper-parameter. The objective in Expression 2 can then be written as (conditioned on x for notational brevity):

L(ω, θ; x) = λ E_{p(y+|x) p_nce(y−)} l_ω(x, y+, y−) + (1 − λ) E_{p(y+|x) g_θ(y−|x)} l_ω(x, y+, y−)    (4)
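As a concrete sketch of the mixture sampler (illustrative PyTorch; the precomputed NCE distribution p_nce_probs and the generator network producing logits over candidates are assumed names, not from the paper):

```python
import torch
from torch.distributions import Categorical

def sample_negative(x, p_nce_probs, generator, lam=0.5):
    """Draw y- from the mixture lam * p_nce(y) + (1 - lam) * g_theta(y | x)."""
    if torch.rand(1).item() < lam:
        # Fixed, unconditional NCE noise component.
        return Categorical(probs=p_nce_probs).sample()
    # Adversarial component: conditional categorical distribution g_theta(y | x).
    return Categorical(logits=generator(x)).sample()
```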

We learn (ω, θ) in a GAN-style minimax game:

min_ω max_θ V(ω, θ) = min_ω max_θ E_{p+(x)} L(ω, θ; x)    (5)

The embedding model behind l_ω(x, y+, y−) is similar to the discriminator in (conditional) GAN (or the critic in Wasserstein GAN (Arjovsky et al., 2017) or Energy-based GAN (Zhao et al., 2016)), while g_θ(y|x) acts as the generator. Henceforth, we will use the terms discriminator (D) and embedding model interchangeably, and refer to g_θ as the generator.

2.3 Learning the generator

There is one important distinction from the typical GAN: g_θ(y|x) defines a categorical distribution over possible y values, and samples are drawn accordingly, in contrast to the typical GAN over a continuous data space such as images, where samples are generated by an implicit generative model that warps noise vectors into data points. Due to the discrete sampling step, g_θ cannot learn by receiving gradient through the discriminator. One possible solution is to use the Gumbel-softmax reparametrization trick (Jang et al., 2016; Maddison et al., 2016), which gives a differentiable approximation. However, this differentiability comes at the cost of drawing N Gumbel samples per categorical sample, where N is the number of categories. For word embeddings, N is the vocabulary size, and for knowledge graph embeddings, N is the number of entities, both leading to infeasible computational requirements.

Instead, we use the REINFORCE (Williams, 1992) gradient estimator for ∇_θ L(θ, x):

(1 − λ) E[ −l_ω(x, y+, y−) ∇_θ log g_θ(y−|x) ]    (6)

where the expectation E is with respect to p(y+, y−|x) = p(y+|x) g_θ(y−|x), and the discriminator loss l_ω(x, y+, y−) acts as the reward.
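A minimal sketch of this generator update (illustrative PyTorch; generator, disc_loss_fn and the optimizer are assumed names, and the baseline of Sec. 2.6 is omitted here). Descending the surrogate below yields a gradient of the form in Eq. 6, with the discriminator loss treated as a constant reward via detach():

```python
import torch
from torch.distributions import Categorical

def generator_step(x, y_pos, generator, disc_loss_fn, gen_optimizer, lam=0.5):
    dist = Categorical(logits=generator(x))            # g_theta(y | x)
    y_neg = dist.sample()                              # discrete sample: no gradient flows through it
    reward = disc_loss_fn(x, y_pos, y_neg).detach()    # l_omega(x, y+, y-) acts as the reward
    # REINFORCE surrogate: its gradient w.r.t. theta is (1 - lam) E[-l_omega * grad log g_theta(y-|x)].
    surrogate = (1.0 - lam) * (-reward * dist.log_prob(y_neg)).mean()
    gen_optimizer.zero_grad()
    surrogate.backward()
    gen_optimizer.step()
```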

With a separable loss, the (conditional) value function of the minimax game is:

L(ω, θ; x) = E_{p+(y|x)} s_ω(x, y) − E_{p_nce(y)} s̃_ω(x, y) − E_{g_θ(y|x)} s̃_ω(x, y)    (7)

and only the last term depends on the generator parameter θ. Hence, with a separable loss, the reward is −s̃(x+, y−). This reduction does not happen with a non-separable loss, and we have to use l_ω(x, y+, y−).

2.4 Entropy and training stability

GAN training can suffer from instability and degeneracy where the generator probability mass collapses to a few modes or points. Much work has been done to stabilize GAN training in the continuous case (Arjovsky et al., 2017; Gulrajani et al., 2017; Cao et al., 2018). In ACE, if the generator g_θ probability mass collapses to a few candidates, then after the discriminator successfully learns about these negatives, g_θ cannot adapt to select new hard negatives, because the REINFORCE gradient estimator in Eq. 6 relies on g_θ being able to explore other candidates during sampling. Therefore, if the g_θ probability mass collapses, instead of leading to oscillation as in a typical GAN, the min-max game in ACE reaches an equilibrium where the discriminator wins and g_θ can no longer adapt; ACE then falls back to NCE, since the negative sampler has another mixture component from NCE.

This behavior of gracefully falling back to NCE is more desirable than the alternative of stalled training if p−(y|x) did not have a simple p_nce mixture component. However, we would still like to avoid such collapse, as the adversarial samples provide greater learning signals than NCE samples. To this end, we propose to use a regularizer to encourage the categorical distribution g_θ(y|x) to have high entropy. In order to make the regularizer interpretable and its hyperparameters easy to tune, we design the following form:

R_ent(x) = min(0, c − H(g_θ(y|x)))    (8)


where H(g_θ(y|x)) is the entropy of the categorical distribution g_θ(y|x), c = log(k) is the entropy of a uniform distribution over k choices, and k is a hyper-parameter. Intuitively, R_ent expresses the prior that the generator should spread its mass over more than k choices for each x.
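A small helper computing this quantity (illustrative PyTorch; how the term is weighted into the generator's objective is a training-loop choice not shown here, and k is the hyper-parameter described above):

```python
import math
import torch
from torch.distributions import Categorical

def entropy_regularizer(gen_logits, k):
    """Compute R_ent(x) = min(0, c - H(g_theta(y|x))) with c = log(k), per Eq. 8."""
    entropy = Categorical(logits=gen_logits).entropy()   # H(g_theta(y|x)), one value per x in the batch
    c = math.log(k)
    return torch.clamp(c - entropy, max=0.0)             # elementwise min(0, c - H)
```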

2.5 Handling false negatives

During negative sampling, p−(y|x) could actually produce a y that forms a positive pair that exists in the training set, i.e., a false negative. This possibility exists in NCE already, but since p_nce is not adaptive, the probability of sampling a false negative is low. Hence in NCE, the score on this false negative (true observation) pair is pushed up less in the negative term than in the positive term.

However, with the adaptive sampler g_θ(y|x), false negatives become a much more severe issue. g_θ(y|x) can learn to concentrate its mass on a few false negatives, significantly canceling the learning of those observations in the positive phase. The entropy regularization reduces this problem as it forces the generator to spread its mass, hence reducing the chance of a false negative.

To further alleviate this problem, whenever computationally feasible, we apply an additional two-step technique. First, we maintain a hash map of the training data in memory, and use it to efficiently detect whether a negative sample (x+, y−) is an actual observation. If so, its contribution to the loss is given a zero weight in the ω learning step. Second, to update θ in the generator learning step, the rewards for false negative samples are replaced by a large penalty, so that the REINFORCE gradient update steers g_θ away from those samples. The second step is needed to prevent null computation where g_θ learns to sample false negatives which are subsequently ignored by the discriminator update for ω.
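A sketch of this two-step technique (illustrative Python/PyTorch; known_pairs, the per-sample weights and the penalty constant are assumed names, not the authors' implementation):

```python
import torch

def handle_false_negatives(x_ids, y_neg_ids, rewards, known_pairs, penalty=-10.0):
    """known_pairs: a Python set of (x, y) pairs observed in the training data."""
    weights = torch.ones_like(rewards)
    for i, (x, y) in enumerate(zip(x_ids.tolist(), y_neg_ids.tolist())):
        if (x, y) in known_pairs:             # hash-map lookup of the training data
            weights[i] = 0.0                  # step 1: zero weight in the omega (discriminator) step
            rewards[i] = penalty              # step 2: large penalty steers g_theta away in the theta step
    return weights, rewards
```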

2.6 Variance Reduction

The basic REINFORCE gradient estimator suffers from high variance, so in practice one often needs to apply variance reduction techniques. The most basic form of variance reduction is to subtract a baseline from the reward. As long as the baseline is not a function of the actions (i.e., the samples y− being drawn), the REINFORCE gradient estimator remains unbiased. More advanced gradient estimators exist that also reduce variance (Grathwohl et al., 2017; Tucker et al., 2017; Liu et al., 2018), but for simplicity we use the self-critical baseline method (Rennie et al., 2016), where the baseline is b(x) = l_ω(x, y+, y⋆), or b(x) = −s̃_ω(x, y⋆) in the separable loss case, with y⋆ = argmax_i g_θ(y_i|x). In other words, the baseline is the reward of the most likely sample according to the generator.
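A sketch of this baseline (illustrative PyTorch; disc_loss_fn is the same assumed discriminator-loss callable as in the earlier snippets):

```python
import torch

def self_critical_baseline(x, y_pos, gen_logits, disc_loss_fn):
    """b(x) = reward of the generator's most likely sample, y* = argmax_i g_theta(y_i | x)."""
    y_star = gen_logits.argmax(dim=-1)                 # greedy (most likely) sample from g_theta
    return disc_loss_fn(x, y_pos, y_star).detach()     # constant w.r.t. both omega and theta

# Usage: advantage = reward - self_critical_baseline(x, y_pos, gen_logits, disc_loss_fn)
# and the advantage replaces the raw reward in the REINFORCE surrogate.
```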

2.7 Improving exploration in g_θ by leveraging NCE samples

In Sec. 2.4 we touched on the need for sufficient exploration in g_θ. It is possible to also leverage negative samples from NCE to help the generator learn. This is essentially off-policy exploration in reinforcement learning, since NCE samples are not drawn according to g_θ(y|x). The generator learning can use importance re-weighting to leverage those samples. The resulting REINFORCE gradient estimator is basically the same as Eq. 6, except that the rewards are reweighted by g_θ(y−|x)/p_nce(y−) and the expectation is with respect to p(y+|x) p_nce(y−). This additional off-policy learning term provides gradient information for generator learning only if g_θ(y−|x) is not zero, meaning that for it to be effective in helping exploration, the generator cannot have collapsed in the first place. Hence, in practice, this term is only used to provide further help on top of the entropy regularization, but it does not replace it.
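The off-policy term can be sketched as an importance-weighted REINFORCE surrogate over NCE-drawn negatives (illustrative PyTorch; names follow the earlier snippets and are assumptions, not the authors' code):

```python
import torch
from torch.distributions import Categorical

def off_policy_surrogate(x, y_pos, y_nce, p_nce_probs, generator, disc_loss_fn, lam=0.5):
    """REINFORCE term on NCE negatives, with rewards reweighted by g_theta(y-|x) / p_nce(y-)."""
    dist = Categorical(logits=generator(x))
    log_g = dist.log_prob(y_nce)                            # log g_theta(y_nce | x)
    iw = (log_g.exp() / p_nce_probs[y_nce]).detach()        # importance weight, no gradient through it
    reward = disc_loss_fn(x, y_pos, y_nce).detach()
    return (1.0 - lam) * (-(iw * reward) * log_g).mean()    # analogous to the surrogate for Eq. 6
```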

3 Related Work

Smith and Eisner (2005) proposed contrastive estimation as a way for unsupervised learning of log-linear models by taking implicit evidence from user-defined neighborhoods around observed datapoints. Gutmann and Hyvärinen (2010) introduced NCE as an alternative to the hierarchical softmax. In the works of Mnih and Teh (2012) and Mnih and Kavukcuoglu (2013), NCE is applied to log-bilinear models, and Vaswani et al. (2013) applied NCE to neural probabilistic language models (Bengio et al., 2003). Compared to these previous NCE methods that rely on simple fixed sampling heuristics, ACE uses an adaptive sampler that produces harder negatives.

In the domain of max-margin estimation for structured prediction (Taskar et al., 2005), loss-augmented MAP inference plays the role of finding hard negatives (the hardest). However, this inference is only tractable in a limited class of models such as structured SVM (Tsochantaridis et al., 2005). Compared to those models that use exact maximization to find the hardest negative configuration each time, the generator in ACE can be viewed as learning an approximate amortized inference network. Concurrently to this work, Tu and Gimpel (2018) propose a very similar framework, using a learned inference network for structured prediction energy networks (SPENs) (Belanger and McCallum, 2016).

Concurrent with our work, there has been other interest in applying GANs to NLP problems (Fedus et al., 2018; Wang et al., 2018; Cai and Wang, 2017). Knowledge graph models naturally lend themselves to a GAN setup, and have been the subject of study in Wang et al. (2018) and Cai and Wang (2017). These two concurrent works are most closely related to one of the three tasks on which we study ACE in this work. Besides a more general formulation that applies to problems beyond those considered in Wang et al. (2018) and Cai and Wang (2017), the techniques introduced in our work for handling false negatives and entropy regularization lead to improved experimental results, as shown in Sec. 5.4.

4 Application of ACE on three tasks

4.1 Word Embeddings

Word embeddings learn a vector representation of words from co-occurrences in a text corpus. NCE casts this learning problem as a binary classification where the model tries to distinguish positive word and context pairs from negative noise samples composed of word and false context pairs. The NCE objective in Skip-gram (Mikolov et al., 2013) for word embeddings is a separable loss of the form:

L = −∑_{w_t ∈ V} [ log p(y = 1|w_t, w_c+) + ∑_{c=1}^{K} log p(y = 0|w_t, w_c−) ]    (9)

Here, w_c+ is sampled from the set of true contexts and w_c− ∼ Q is sampled K times from a fixed noise distribution. Mikolov et al. (2013) introduced a further simplification of NCE, called “Negative Sampling” (Dyer, 2014). With respect to our ACE framework, the difference between NCE and Negative Sampling is inconsequential, so we continue the discussion using NCE. A drawback of this sampling scheme is that it favors more common words as context. Another issue is that the negative context words are sampled in the same way, rather than tailored toward the actual target word. To apply ACE to this problem we first define the value function for the minimax game, V(D, G), as follows:

V(D, G) = E_{p+(w_c)}[ log D(w_c, w_t) ] − E_{p_nce(w_c)}[ −log(1 − D(w_c, w_t)) ] − E_{g_θ(w_c|w_t)}[ −log(1 − D(w_c, w_t)) ]    (10)

with D = p(y = 1|w_t, w_c) and G = g_θ(w_c|w_t).
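As a sketch, the per-example value in Eq. 10 can be computed as follows (illustrative PyTorch; D is an assumed callable returning p(y = 1 | w_t, w_c), and the small eps guards the logarithms):

```python
import torch

def word_value_function(D, w_t, wc_pos, wc_nce, wc_adv, eps=1e-8):
    """V(D, G) for one target word, following the structure of Eq. 10."""
    v_pos = torch.log(D(wc_pos, w_t) + eps)               # E_{p+}[log D]
    v_nce = -torch.log(1.0 - D(wc_nce, w_t) + eps)        # the -log(1 - D) term on NCE negatives
    v_adv = -torch.log(1.0 - D(wc_adv, w_t) + eps)        # the -log(1 - D) term on adversarial negatives
    return v_pos - v_nce - v_adv
```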

Implementation details

For our experiments, we train all our models on a single pass over the May 2017 dump of the English Wikipedia with lowercased unigrams. The vocabulary size is restricted to the top 150k most frequent words when training from scratch, while for finetuning we use the same vocabulary as Pennington et al. (2014), which is the 400k most frequent words. We use 5 NCE samples and 1 adversarial sample for each positive sample, a window size of 10, and the same positive subsampling scheme proposed by Mikolov et al. (2013). Learning for both G and D uses the Adam (Kingma and Ba, 2014) optimizer with its default parameters. Our conditional discriminator is modeled using the Skip-gram architecture, which is a two-layer neural network with a linear mapping between the layers. The generator network consists of an embedding layer followed by two small hidden layers, followed by an output softmax layer. The first layer of the generator shares its weights with the second embedding layer in the discriminator network, which we find greatly speeds up convergence, as the generator does not have to relearn its own set of embeddings. The difference between the discriminator and generator is that a sigmoid nonlinearity is used after the second layer in the discriminator, while in the generator, a softmax layer is used to define a categorical distribution over negative word candidates. We find that controlling the generator entropy is critical for finetuning experiments, as otherwise the generator collapses to its favorite negative sample. The word embeddings are taken to be the first dense matrix in the discriminator.

4.2 Order Embeddings Hypernym Prediction

As introduced in Vendrov et al. (2016), ordered representations over a hierarchy can be learned by order embeddings. An example task for such ordered representations is hypernym prediction. A hypernym pair is a pair of concepts where the first concept is a specialization or an instance of the second.

For completeness, we briefly describe order embeddings, then analyze ACE on the hypernym prediction task. In order embeddings, each entity is represented by a vector in R^N. The score for a positive ordered pair of entities (x, y) is defined by s_ω(x, y) = ||max(0, y − x)||², and the score for a negative ordered pair (x+, y−) is defined by s̃_ω(x+, y−) = max{0, η − s(x+, y−)}, where η is the margin. Let f(u) be the embedding function which takes an entity as input and outputs an embedding vector. With P a set of positive pairs and N a set of negative pairs, the separable loss function for the order embedding task is defined by:

L = ∑_{(u,v) ∈ P} s_ω(f(u), f(v)) + ∑_{(u,v) ∈ N} s̃_ω(f(u), f(v))    (11)
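A sketch of the two scores entering Eq. 11 (illustrative PyTorch; embedding vectors are assumed to be given as tensors whose last dimension is the embedding dimension):

```python
import torch

def order_penalty(x, y):
    """s_omega(x, y) = ||max(0, y - x)||^2: zero exactly when y - x has no positive coordinate."""
    return torch.clamp(y - x, min=0.0).pow(2).sum(dim=-1)

def negative_order_score(x, y, eta=1.0):
    """s~_omega(x, y) = max(0, eta - s_omega(x, y)): margin score applied to negative pairs."""
    return torch.clamp(eta - order_penalty(x, y), min=0.0)
```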

Implementation details

Our generator for this task is just a linear fully-connected softmax layer, taking an embedding vector from the discriminator as input and outputting a categorical distribution over the entity set. For the discriminator, we inherit all model settings from Vendrov et al. (2016): we use a 50-dimensional hidden state, a batch size of 1000, a learning rate of 0.01 and the Adam optimizer. For the generator, we use a batch size of 1000, a learning rate of 0.01 and the Adam optimizer. We apply weight decay with rate 0.1 and the entropy loss regularization described in Sec. 2.4. We handle false negatives as described in Sec. 2.5. After cross-validation, we find that variance reduction and leveraging NCE samples do not greatly affect the order embedding task.

4.3 Knowledge Graph Embeddings

Knowledge graphs contain entity and relation data of the form (head entity, relation, tail entity), and the goal is to learn from observed positive entity relations and predict missing links (a.k.a. link prediction). There have been many works on knowledge graph embeddings, e.g. TransE (Bordes et al., 2013), TransR (Lin et al., 2015), TransH (Wang et al., 2014), TransD (Ji et al., 2015), ComplEx (Trouillon et al., 2016), DistMult (Yang et al., 2014) and ConvE (Dettmers et al., 2017). Many of them use a contrastive learning objective. Here we take TransD as an example, modify its noise contrastive learning to ACE, and demonstrate significant improvement in sample efficiency and link prediction results.

Implementation details

Let a positive entity-relation-entity triplet be denoted by ξ+ = (h+, r+, t+); a negative triplet could either have its head or tail be a negative sample, i.e. ξ− = (h−, r+, t+) or ξ− = (h+, r+, t−). In either case, the general formulation in Sec. 2.1 still applies. The non-separable loss function takes on the form:

l = max(0, η + s_ω(ξ+) − s_ω(ξ−))    (12)

The scoring rule is:

s = ‖h⊥ + r − t⊥‖    (13)

where r is the embedding vector for r, and h⊥ is the projection of the embedding of h onto the space of r, given by h⊥ = h + r_p h_p⊤ h, where r_p and h_p are projection parameters of the model. t⊥ is defined in a similar way through parameters t, t_p and r_p.
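A sketch of the TransD projection and score in Eq. 13 (illustrative PyTorch; all vectors are taken to be 1-D tensors of the same dimension, an assumption made for brevity):

```python
import torch

def transd_project(e, e_p, r_p):
    """e_perp = e + r_p * (e_p . e), the projection of entity embedding e onto the relation space."""
    return e + r_p * torch.dot(e_p, e)

def transd_score(h, h_p, r, r_p, t, t_p):
    """s = ||h_perp + r - t_perp||, the distance minimized for observed triplets."""
    h_perp = transd_project(h, h_p, r_p)
    t_perp = transd_project(t, t_p, r_p)
    return torch.norm(h_perp + r - t_perp)
```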

The form of the generator g_θ(t−|r+, h+) is chosen to be f_θ(h⊥, h⊥ + r), where f_θ is a feedforward neural net that concatenates its two input arguments, then propagates them through two hidden layers, followed by a final softmax output layer. As a function of (r+, h+), g_θ shares parameters with the discriminator, as the inputs to f_θ are the embedding vectors. During generator learning, only θ is updated and the TransD model embedding parameters are frozen.

5 Experiments

We evaluate ACE with experiments on word embedding, order embedding, and knowledge graph embedding tasks. In short, whenever the original learning objective is contrastive (all tasks except Glove fine-tuning), our results consistently show that ACE improves over NCE. In some cases, we include additional comparisons to the state-of-the-art results on the task to put the significance of such improvements in context: the generic ACE can often make a reasonable baseline competitive with SOTA methods that are optimized for the task.

For word embeddings, we evaluate models trained from scratch as well as fine-tuned Glove models (Pennington et al., 2014) on word similarity tasks that consist of computing the similarity between word pairs where the ground truth is an average of human scores. We choose the Rare Word dataset (Luong et al., 2013) and WordSim-353 (Finkelstein et al., 2001) by virtue of our hypothesis that ACE learns better representations for both rare and frequent words. We also qualitatively evaluate ACE word embeddings by inspecting the nearest neighbors of selected words.

Figure 1: Left: Order embedding accuracy plot. Right: Order embedding discriminator loss on NCE-sampled negative pairs and positive pairs.

Figure 2: Loss curves on NCE negative pairs and ACE negative pairs. Left: without entropy and weight decay. Right: with entropy and weight decay.

Figure 3: Left: Rare Word, Right: WS353 similarity scores during the first epoch of training.

Figure 4: Training-from-scratch losses on the discriminator.

For the hypernym prediction task, following Vendrov et al. (2016), hypernym pairs are created from the WordNet hierarchy’s transitive closure. We use the released random development split and test split from Vendrov et al. (2016), which both contain 4000 edges.

For knowledge graph embeddings, we use TransD (Ji et al., 2015) as our base model, perform an ablation study to analyze the behavior of ACE with various add-on features, and confirm that entropy regularization is crucial for good performance in ACE. We also obtain link prediction results that are competitive with or superior to the state-of-the-art on the WN18 dataset (Bordes et al., 2014).

5.1 Training Word Embeddings from scratch

In this experiment, we empirically observe that training word embeddings using ACE converges significantly faster than NCE after one epoch. As shown in Fig. 3, both ACE (a mixture of p_nce and g_θ) and g_θ alone (denoted by ADV) significantly outperform the NCE baseline, with absolute improvements of 73.1% and 58.5% respectively on the RW score. We note similar results on the WordSim-353 dataset, where ACE and ADV outperform NCE by 40.4% and 45.7%. We also evaluate our model qualitatively by inspecting the nearest neighbors of selected words in Table 1. We first present the five nearest neighbors to each word to show that both NCE and ACE models learn sensible embeddings. We then show that ACE embeddings have much better semantic relevance in a larger neighborhood (nearest neighbors 45-50).

5.2 Finetuning Word Embeddings

We take off-the-shelf pre-trained Glove embeddings which were trained using 6 billion tokens (Pennington et al., 2014) and fine-tune them using our algorithm. It is interesting to note that the original Glove objective does not fit into the contrastive learning framework, but nonetheless we find that they benefit from ACE. In fact, we observe that training such that 75% of the words appear as positive contexts is sufficient to beat the largest-dimensionality pre-trained Glove model on word similarity tasks. We evaluate our performance on the Rare Word and WordSim-353 data. As can be seen from our results in Table 2, ACE on RW is not always better, and for the 100d and 300d Glove embeddings it is marginally worse. However, on WordSim-353 ACE does considerably better across the board, to the point where 50d Glove embeddings outperform the 300d baseline Glove model.

5.3 Hypernym Prediction

As shown in Table 3, with ACE training, our method achieves a 1.5% improvement in accuracy over Vendrov et al. (2016) without tuning any of the discriminator’s hyperparameters.

                        | Queen     | King      | Computer        | Man          | Woman
Skip-Gram NCE Top 5     | princess  | prince    | computers       | woman        | girl
                        | king      | queen     | computing       | boy          | man
                        | empress   | kings     | software        | girl         | prostitute
                        | pxqueen   | emperor   | microcomputer   | stranger     | person
                        | monarch   | monarch   | mainframe       | person       | divorcee
Skip-Gram NCE Top 45-50 | sambiria  | eraric    | hypercard       | angiomata    | suitor
                        | phongsri  | mumbere   | neurotechnology | someone      | nymphomaniac
                        | safrit    | empress   | lgp             | bespectacled | barmaid
                        | mcelvoy   | saxonvm   | pcs             | hero         | redheaded
                        | tsarina   | pretender | keystroke       | clown        | jew
Skip-Gram ACE Top 5     | princess  | prince    | software        | woman        | girl
                        | prince    | vi        | computers       | girl         | herself
                        | elizabeth | kings     | applications    | tells        | man
                        | duke      | duke      | computing       | dead         | lover
                        | consort   | iii       | hardware        | boy          | tells
Skip-Gram ACE Top 45-50 | baron     | earl      | files           | kid          | aunt
                        | abbey     | holy      | information     | told         | maid
                        | throne    | cardinal  | device          | revenge      | wife
                        | marie     | aragon    | design          | magic        | lady
                        | victoria  | princes   | compatible      | angry        | bride

Table 1: Top 5 nearest neighbors of words, followed by neighbors 45-50, for different models.

Model                                                      | RW    | WS353
Skipgram Only NCE baseline                                 | 18.90 | 31.35
Skipgram + Only ADV                                        | 29.96 | 58.05
Skipgram + ACE                                             | 32.71 | 55.00
Glove-50 (recomputed based on (Pennington et al., 2014))   | 34.02 | 49.51
Glove-100 (recomputed based on (Pennington et al., 2014))  | 36.64 | 52.76
Glove-300 (recomputed based on (Pennington et al., 2014))  | 41.18 | 60.12
Glove-50 + ACE                                             | 35.60 | 60.46
Glove-100 + ACE                                            | 36.51 | 63.29
Glove-300 + ACE                                            | 40.57 | 66.50

Table 2: Spearman score (ρ × 100) on the RW and WS353 datasets. We trained a skipgram model from scratch under various settings for only 1 epoch on Wikipedia. For finetuned models we recomputed the scores based on the publicly available 6B-token Glove models, and we finetuned until roughly 75% of the vocabulary was seen.

We further report training curves in Fig. 1, where we plot the loss on randomly sampled pairs. We stress that in the ACE model, we train on random pairs and generator-proposed pairs jointly; as shown in Fig. 2, hard negatives help the order embedding model converge faster.

5.4 Ablation Study and Improving TransD

To analyze different aspects of ACE, we perform an ablation study on the knowledge graph embedding task. As described in Sec. 4.3, the base model (discriminator) we apply ACE to is TransD (Ji et al., 2015). Fig. 5 shows validation performance as training progresses. All variants of ACE converge to better results than base NCE. Among ACE variants, all methods that include entropy regularization significantly outperform those without it. Without the self-critical baseline variance reduction, learning can progress faster at the beginning, but the final performance suffers slightly. The best performance is obtained without the additional off-policy learning of the generator.

Method                      | Accuracy (%)
order-embeddings            | 90.6
order-embeddings + Our ACE  | 92.0

Table 3: Order embedding performance.

Table 4 shows the final test results on the WN18 link prediction task. It is interesting to note that ACE improves the MRR score more significantly than hit@10. As MRR is much more sensitive to the top rankings, i.e., how the correct configuration ranks among the competitive alternatives, this is consistent with the fact that ACE samples hard negatives and forces the base model to learn a more discriminative representation of the positive examples.


Figure 5: Ablation study: measuring validation Mean Reciprocal Rank (MRR) on the WN18 dataset as training progresses.

Method                                 | MRR   | hit@10
ACE (Ent + SC)                         | 0.792 | 0.945
ACE (Ent + SC + IW)                    | 0.768 | 0.949
NCE TransD (ours)                      | 0.527 | 0.947
NCE TransD (Ji et al., 2015)           | -     | 0.925
KBGAN (DISTMULT) (Cai and Wang, 2017)  | 0.772 | 0.948
KBGAN (COMPLEX) (Cai and Wang, 2017)   | 0.779 | 0.948
Wang et al. (2018)                     | -     | 0.93
COMPLEX (Trouillon et al., 2016)       | 0.941 | 0.947

Table 4: WN18 experiments: the first portion of the table contains results where the base model is TransD; the last separated line is the COMPLEX embedding model (Trouillon et al., 2016), which achieves the SOTA on this dataset. Among all TransD-based models (the best results in this group are underlined), ACE improves over basic NCE and another GAN-based approach, KBGAN. The gap on MRR is likely due to the difference between the TransD and COMPLEX models.

5.5 Hard Negative Analysis

To better understand the effect of the adversarial samples proposed by the generator, we plot the discriminator loss on both p_nce and g_θ samples. In this context, a harder sample means a higher loss assigned by the discriminator. Fig. 4 shows that the discriminator loss for the word embedding task on g_θ samples is always higher than on p_nce samples, confirming that the generator is indeed sampling harder negatives.

For the hypernym prediction task, Fig. 2 shows the discriminator loss on negative pairs sampled from NCE and ACE respectively. The higher the loss, the harder the negative pair is. As indicated in the left plot, the loss on the ACE negative terms collapses faster than on the NCE negatives. After adding entropy regularization and weight decay, the generator works as expected.

6 Limitations

When the generator softmax is large, the current implementation of ACE training is computationally expensive. Although ACE converges faster per iteration, it may converge more slowly in wall-clock time depending on the cost of the softmax. However, embeddings are typically used as pre-trained building blocks for subsequent tasks. Thus, their learning is usually the pre-computation step for the more complex downstream models, and spending more time is justified, especially with GPU acceleration. We believe that the computational cost could potentially be reduced via some existing techniques such as the “augment and reduce” variational inference of Ruiz et al. (2018), adaptive softmax (Grave et al., 2016), or the “sparsely-gated” softmax of Shazeer et al. (2017), but leave that to future work.

Another limitation is on the theoretical front. As noted in Goodfellow (2014), GAN learning does not implement maximum likelihood estimation (MLE), while NCE has MLE as an asymptotic limit. To the best of our knowledge, more distant connections between GAN and MLE training are not known, and tools for analyzing the equilibrium of a min-max game whose players are parametrized by deep neural nets are currently not available.

7 Conclusion

In this paper, we propose Adversarial Contrastive Estimation as a general technique for improving supervised learning problems that learn by contrasting observed and fictitious samples. Specifically, we use a generator network in a conditional GAN-like setting to propose hard negative examples for our discriminator model. We find that a mixture distribution of randomly sampled negative examples along with an adaptive negative sampler leads to improved performance on a variety of embedding tasks. We validate our hypothesis that hard negative examples are critical to optimal learning and can be proposed via our ACE framework. Finally, we find that controlling the entropy of the generator through a regularization term and properly handling false negatives are crucial for successful training.


References

Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein GAN. arXiv preprint arXiv:1701.07875.

David Belanger and Andrew McCallum. 2016. Structured prediction energy networks. In International Conference on Machine Learning, pages 983–992.

Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. 2014. A semantic matching energy function for learning with multi-relational data. Machine Learning, 94(2):233–259.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795.

Liwei Cai and William Yang Wang. 2017. KBGAN: Adversarial learning for knowledge graph embeddings. arXiv preprint arXiv:1711.04071.

Yanshuai Cao, Gavin Weiguang Ding, Kry Yik-Chau Lui, and Ruitong Huang. 2018. Improving GAN training via binarized representation entropy (BRE) regularization. In International Conference on Learning Representations.

Bo Dai and Dahua Lin. 2017. Contrastive learning for image captioning. In Advances in Neural Information Processing Systems, pages 898–907.

Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. 2017. Convolutional 2D knowledge graph embeddings. arXiv preprint arXiv:1707.01476.

Chris Dyer. 2014. Notes on noise contrastive estimation and negative sampling. arXiv preprint arXiv:1410.8251.

William Fedus, Ian Goodfellow, and Andrew M Dai. 2018. MaskGAN: Better text generation via filling in the ______. arXiv preprint arXiv:1801.07736.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In Proceedings of the 10th International Conference on World Wide Web, pages 406–414. ACM.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014a. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014b. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

Ian J Goodfellow. 2014. On distinguishability criteria for estimating generative models. arXiv preprint arXiv:1412.6515.

Will Grathwohl, Dami Choi, Yuhuai Wu, Geoff Roeder, and David Duvenaud. 2017. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. arXiv preprint arXiv:1711.00123.

Edouard Grave, Armand Joulin, Moustapha Cisse, David Grangier, and Hervé Jégou. 2016. Efficient softmax approximation for GPUs. arXiv preprint arXiv:1609.04309.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. 2017. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5769–5779.

Michael Gutmann and Aapo Hyvärinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304.

Michael U Gutmann and Aapo Hyvärinen. 2012. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13(Feb):307–361.

Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144.

Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Knowledge graph embedding via dynamic mapping matrix. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 687–696.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015. Learning entity and relation embeddings for knowledge graph completion. In AAAI, volume 15, pages 2181–2187.

Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, and Qiang Liu. 2018. Action-dependent control variates for policy optimization via Stein identity. In International Conference on Learning Representations.

Thang Luong, Richard Socher, and Christopher D Manning. 2013. Better word representations with recursive neural networks for morphology. In CoNLL, pages 104–113.


Chris J Maddison, Andriy Mnih, and Yee Whye Teh. 2016. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

M. Mirza and S. Osindero. 2014. Conditional generative adversarial nets. ArXiv e-prints.

Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems, pages 2265–2273.

Andriy Mnih and Yee Whye Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. 2016. Self-critical sequence training for image captioning. arXiv preprint arXiv:1612.00563.

Francisco JR Ruiz, Michalis K Titsias, Adji B Dieng, and David M Blei. 2018. Augment and reduce: Stochastic inference for large categorical distributions. arXiv preprint arXiv:1802.04220.

Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.

Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. 2016. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769.

Noah A Smith and Jason Eisner. 2005. Contrastive estimation: Training log-linear models on unlabeled data. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 354–362. Association for Computational Linguistics.

Ben Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. 2005. Learning structured prediction models: A large margin approach. In Proceedings of the 22nd International Conference on Machine Learning, pages 896–903. ACM.

Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In International Conference on Machine Learning, pages 2071–2080.

Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. 2005. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6(Sep):1453–1484.

Lifu Tu and Kevin Gimpel. 2018. Learning approximate inference networks for structured prediction. In International Conference on Learning Representations.

George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. 2017. REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2627–2636. Curran Associates, Inc.

Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with large-scale neural language models improves translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1387–1392.

Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. 2016. Order-embeddings of images and language. In International Conference on Learning Representations.

Peifeng Wang, Shuangyin Li, and Rong Pan. 2018. Incorporating GAN for negative sampling in knowledge representation learning. In The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18).

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, pages 1112–1119. AAAI Press.

Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2014. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575.


Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research.

Junbo Zhao, Michael Mathieu, and Yann LeCun. 2016. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126.