
Word Embeddings and Their Use In Sentence Classification Tasks

Amit Mandelbaum Adi Shalev

Hebrew University of Jerusalem
[email protected] [email protected]

October 27, 2016

Abstract

This paper has two parts. In the first part we discuss word embeddings. We discuss the need for them, some of the methods to create them, and some of their interesting properties. We also compare them to image embeddings and see how word embeddings and image embeddings can be combined to perform different tasks. In the second part we implement a convolutional neural network trained on top of pre-trained word vectors. The network is used for several sentence-level classification tasks, and achieves state-of-the-art (or comparable) results, demonstrating the great power of pre-trained word embeddings over random ones.

I. Introduction

There are several definitions for what word embeddings are, but in the most general notion, word embeddings are the numerical representation of words, usually in the shape of a vector in $\mathbb{R}^d$. Being more specific, word embeddings are unsupervisedly learned word representation vectors whose relative similarities correlate with semantic similarity. In computational linguistics they are often referred to as a distributional semantic model or distributed representations.

The theoretical foundations of word embeddings can be traced back to the early 1950's, and in particular to the works of Zellig Harris, John Firth, and Ludwig Wittgenstein. The earliest attempts at using feature representations to quantify (semantic) similarity used hand-crafted features. A good example is the work on semantic differentials [Osgood, 1964]. The early 1990's saw the rise of automatically generated contextual features, and the rise of Deep Learning methods for Natural Language Processing (NLP) in the early 2010's helped to increase their popularity, to the point that, these days, word embeddings are the most popular research area in NLP1.

This work will be divided into two parts. In the first part we will discuss the need for word embeddings, some of the methods to create them, and some interesting features of those embeddings. We also compare them to image embeddings (usually referred to as image features) and see how word embeddings and image embeddings can be combined to perform different tasks.

In the second part of this paper we will present our implementation of Convolutional Neural Networks for Sentence Classification [Kim, 2014]. This work, which became very popular, is a very good demonstration of the power of pre-trained word embeddings. Using a relatively simple model, the authors were able to achieve state-of-the-art (or comparable) results for several sentence-level classification tasks. In this part we will present the model, discuss the results and compare them to those of the original article. We will also extend and test the model on some datasets that were not used in the original article. Finally, we will propose some extensions of the model which might be a good proposition for future work.

1 In 2015 the dominating subject at the EMNLP ("Empirical Methods in NLP") conference was word embeddings, source: http://sebastianruder.com/word-embeddings-1/

II. Word Embeddings

i. Motivation

It is obvious that every mathematical system or algorithm needs some sort of numeric input to work with. However, while images and audio naturally come in the form of rich, high-dimensional vectors (i.e. pixel intensities for images and power spectral density coefficients for audio data), words are treated as discrete atomic symbols.

The naive way of converting words to vectors might assign each word a one-hot vector in $\mathbb{R}^{|V|}$, where $|V|$ is the vocabulary size. This vector is all zeros except for one unique index per word. Representing words in this way leads to substantial data sparsity and usually means that we may need more data in order to successfully train statistical models.
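As a minimal illustration of this representation (the toy vocabulary below is hypothetical):

```python
import numpy as np

# Hypothetical toy vocabulary; real vocabularies contain 10^5-10^6 words.
vocab = ["cat", "dog", "mat", "sat", "the"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the |V|-dimensional one-hot vector of a word."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("cat"))  # [1. 0. 0. 0. 0.] -- all information sits in a single index
```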

Figure 1: Density of different data sources.

The issues mentioned above raise the need for continuous, vector-space representations of words that contain data that can be leveraged by models. To be more specific, we want semantically similar words to be mapped to nearby points, thus making the representation carry useful information about the word's actual meaning.

ii. Word Embeddings Methods

Word embedding models can be divided into two main categories:

• Count-based methods
• Predictive methods

Models in both categories share, in at least some way, the assumption that words that appear in the same contexts share semantic meaning.

One of the most influential early works in count-based methods is the LSI/LSA (Latent Semantic Indexing/Analysis) method [Deerwester et al., 1990]. This method is based on Firth's hypothesis from 1957 [Firth, 1957] that the meaning of a word is defined "by the company it keeps". This hypothesis leads to a very simple, albeit very high-dimensional, word embedding. Formally, each word can be represented as a vector in $\mathbb{R}^N$, where N is the number of unique words in a given dictionary (in practice N=100,000). Then, taking a very large corpus (e.g. Wikipedia), let Count5(w1, w2) be the number of times w1 and w2 occur within a distance of 5 words of each other in the corpus. The word embedding for a word w is then a vector of dimension N, with one coordinate for each dictionary word; the coordinate corresponding to word w2 is Count5(w, w2).
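A rough sketch of this counting scheme (the toy corpus and the window handling are simplifying assumptions, not the exact setup of the LSA article):

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=5):
    """Count how many times each word pair occurs within `window` tokens of each other."""
    counts = defaultdict(int)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            counts[(w, tokens[j])] += 1
            counts[(tokens[j], w)] += 1
    return counts

tokens = "the cat sat on the mat while the dog sat on the rug".split()
counts = cooccurrence_counts(tokens, window=5)
# The (unnormalized) embedding of "cat" is the row of counts indexed by every vocabulary word.
print(counts[("cat", "sat")])
```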

The problem with the resulting embedding is that it uses extremely high-dimensional vectors. In the LSA article it was empirically discovered that these embeddings can be reduced to vectors in $\mathbb{R}^{300}$ by performing a rank-300 SVD on the N×N original embedding matrix.

This method was later refined with reweighting heuristics, such as taking the logarithm of the counts, or Pointwise Mutual Information (PMI) [Kenneth et al., 1990], which is a very popular method.
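A sketch of the PMI reweighting applied on top of such counts (smoothing and cutoff details vary between implementations; this is only an illustration, not the exact recipe of [Kenneth et al., 1990]):

```python
import math
from collections import Counter, defaultdict

def pmi(pair_counts, word_counts, total_pairs):
    """PMI(w1, w2) = log( P(w1, w2) / (P(w1) * P(w2)) ), computed for observed pairs only."""
    total_words = sum(word_counts.values())
    scores = {}
    for (w1, w2), c in pair_counts.items():
        p_joint = c / total_pairs
        p_w1 = word_counts[w1] / total_words
        p_w2 = word_counts[w2] / total_words
        scores[(w1, w2)] = math.log(p_joint / (p_w1 * p_w2))
    return scores

# Toy windowed co-occurrence counts, as in the previous sketch.
tokens = "the cat sat on the mat while the dog sat on the rug".split()
pairs = defaultdict(int)
for i, w in enumerate(tokens):
    for j in range(i + 1, min(i + 6, len(tokens))):
        pairs[(w, tokens[j])] += 1
scores = pmi(pairs, Counter(tokens), total_pairs=sum(pairs.values()))
print(scores[("cat", "sat")])
```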

The second family of methods, sometimes also referred to as neural probabilistic language models, had theoretical and some practical appearances as early as 1986 [Hinton, 1986], but the first to show the utility of pre-trained word embeddings were arguably Collobert and Weston in 2008 [Collobert and Weston, 2008]. Unlike count-based models, predictive models try to predict a word from its neighbors in terms of learned small, dense embedding vectors.

Two of the most popular methods which appeared recently are the GloVe (Global Vectors for Word Representation) method [Pennington et. al., 2014], which is an unsupervised learning method, although not predictive in the common sense, and Word2Vec, a family of energy-based predictive models, presented by [Mikolov et. al., 2013]. As Word2Vec is the embedding method used in our work, it shall be briefly discussed here.

iii. Word2Vec

Word2vec is a particularly computationally-efficient predictive model for learning word embeddings from raw text. It comes in two flavors, the Continuous Bag-of-Words (CBOW) model and the Skip-Gram model. Algorithmically, these models are similar, except that CBOW predicts target words (e.g. 'mat') from source context words ('the cat sits on the'), while the skip-gram does the inverse and predicts source context words from the target words.

Figure 2: The Skip-gram model architecture

In the skip-gram model (see figure 2) a neural network is trained over a large corpus, where the training objective is to learn word vector representations that are good at predicting the nearby words.

The method also uses a simplified version of NCE [Gutmann and Hyvärinen, 2012] called negative sampling, where the objective function is defined as follows:

$$\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right] \qquad (1)$$

where $v_w$ and $v'_w$ are the "input" and "output" vector representations of w, σ is the sigmoid function (which can also be seen as the network's parameter function), and $P_n$ is some noise distribution used to sample random words. In the article they recommend k to be between 5 and 20, while the context of predicted words should be 5 or 10. This objective is later plugged into the Skip-Gram objective (equation 2) to produce optimal word embeddings.

$$\frac{1}{T}\sum_{t=1}^{T}\;\sum_{-c \le j \le c,\, j \neq 0} \log p\left(w_{t+j} \mid w_t\right) \qquad (2)$$

This objective enables the model to differentiate data from noise by means of logistic regression, thus learning high-quality vector representations.

The CBOW does exactly the same, but with the direction inverted. In other words, the CBOW trains a binary logistic classifier which, given a window of context words, gives a higher probability to "correct" if the next word is correct and a higher probability to "incorrect" if the next word is a randomly sampled one. Notice that CBOW smoothes over a lot of the distributional information (by treating an entire context as one observation). For the most part, this turns out to be a useful thing for smaller datasets. However, skip-gram treats each context-target pair as a new observation, and this tends to do better when we have larger datasets.
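To make equation (1) concrete, the following minimal numpy sketch evaluates the negative-sampling loss for a single (center, context) pair; the vocabulary size, dimension, random initialization and uniform noise distribution are illustrative assumptions, not the settings of the trained Google News model:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 10000, 300, 5                     # vocabulary size, embedding dim, negative samples
W_in = rng.normal(scale=0.1, size=(V, d))   # "input" vectors v_w
W_out = rng.normal(scale=0.1, size=(V, d))  # "output" vectors v'_w

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center, context, noise_dist):
    """Negative of the objective in equation (1) for one training pair."""
    negatives = rng.choice(V, size=k, p=noise_dist)           # w_i ~ P_n(w)
    pos = np.log(sigmoid(W_out[context] @ W_in[center]))      # log sigma(v'_wO . v_wI)
    neg = np.log(sigmoid(-W_out[negatives] @ W_in[center]))   # log sigma(-v'_wi . v_wI)
    return -(pos + neg.sum())

noise = np.full(V, 1.0 / V)  # in practice a smoothed unigram distribution is used
print(neg_sampling_loss(center=42, context=7, noise_dist=noise))
```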

Finally, the vectors we used in our work had a dimension of 300. The network was trained on the Google News dataset, which contains 30 billion training words, with negative sampling as mentioned above. These embeddings can be found online2.

A lot of follow-up work was done on the Word2Vec method. One interesting work was done by [Goldberg and Levy, 2014], where experiments and theory were used to suggest that these newer methods are related to the older PMI-based models, but with new hyperparameters and/or term reweightings. In this project's appendix you can find a simplified version of Word2Vec that we implemented in TensorFlow, using the text8 dataset3 and the Skip-Gram model. See figure 3 for visualized results.

2 code.google.com/p/word2vec
3 http://mattmahoney.net/dc/textdata

Figure 3: Left: Word2Vec t-SNE [Maaten and Hinton, 2008] visualization of our implementation, using the text8 dataset and a window size of 5. Only 400 words are visualized. Right: Zoom-in on the rectangle in the left figure.

iv. Word Embeddings Properties

Similarity: The simplest property of embeddings obtained by all the methods described above is that similar words tend to have similar vectors. More formally, the similarity between two words (as rated by humans on a [-1,1] scale) correlates with the cosine similarity between those words' vectors.

Figure 4: What words have embeddings closest to a given word? From [Collobert et al., 2011]

The fact that word embeddings are related to their context words stands behind the similarity property, as, naturally, similar words tend to appear in similar contexts. This, however, creates the problem that antonyms (e.g. cold and hot) also appear in the same contexts while they, by definition, have opposite meanings. In [Mikolov et. al., 2013] the score of the (accept, reject) pair is 0.73, and the score of (long, short) is 0.71.
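The similarity scores quoted above are cosine similarities between the word vectors; a minimal sketch (the random vectors here are hypothetical stand-ins for real Word2Vec rows):

```python
import numpy as np

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||), a value in [-1, 1]."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 300-dimensional embeddings; in practice these rows come from the
# pre-trained Word2Vec matrix.
rng = np.random.default_rng(0)
v_long, v_short = rng.normal(size=300), rng.normal(size=300)
print(cosine_similarity(v_long, v_short))
```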

The problem of antonyms was tackled directly by [Schwartz et al., 2015]. In this article, the authors introduce a symmetric-pattern-based approach to word representation which is particularly suitable for capturing word similarity. Symmetric patterns are a special type of pattern that contains exactly two wildcards and that tends to be instantiated by wildcard pairs such that each member of the pair can take the X or the Y position. For example, the symmetry of the pattern "X or Y" is exemplified by the semantically plausible expressions "cats or dogs" and "dogs or cats". Specifically, it was found that two patterns are particularly indicative of antonymy: "from X to Y" and "either X or Y".

Using their model the authors were able to achieve a ρ score of 0.56 on the SimLex-999 dataset [Hill et al., 2016], improving on state-of-the-art word2vec skip-gram results by as much as 5.5-16.7%. Furthermore, the authors demonstrated the adaptability of their model to antonym judgment specifications.

Linear analogy relationships: A more interesting property of recent embeddings [Mikolov et. al., 2013] is that they can solve analogy relationships via linear algebra, despite the fact that those embeddings are produced via nonlinear methods. For example, $v_{queen}$ is the most similar answer to $v_{king} - v_{man} + v_{woman}$. It turns out, though, that much more sophisticated relationships are also encoded in this way, as we can see in figure 5 below.

Figure 5: Relationship pairs in a word embedding. From [Mikolov et. al., 2013]
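With the pre-trained Google News vectors, such analogy queries can be run directly; below is a sketch using the gensim library (the local file name and its availability are assumptions):

```python
from gensim.models import KeyedVectors

# Assumes the pre-trained Google News vectors have been downloaded locally.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# v_king - v_man + v_woman: gensim returns the words whose vectors are closest
# (by cosine similarity) to the combined query vector.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```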

An interesting theoretical work on non-linear embeddings (especially PMI) was done by [Arora et al., 2015]. In their article they suggest that the creation of a textual corpus is driven by the random walk of a discourse vector $c_t \in \mathbb{R}^d$, which is a unit vector whose direction in space represents what is being talked about. Each word has a (time-invariant) latent vector $v_w \in \mathbb{R}^d$ that captures its correlations with the discourse vector. Using a word production model they predict that words occurring at successive time steps will also tend to have vectors that are close together, thus explaining why similar words have similar vectors.

Using the above model the authors introduce the "RELATIONS = DIRECTIONS" notion for linear analogies. The authors claim that for each relation R, some direction $\mu_R$ can be found which satisfies a certain equation. This leads to the finding that, given enough examples of a relationship R, it is possible to compute $\mu_R$ using SVD, and then, given a pair of words with relation R and a word c, find the best analogy word d by finding the pair c and d such that $v_c - v_d$ has the highest possible projection onto $\mu_R$. In this way, they also explain that the low dimension of the vectors has a "purifying" effect that reduces the effect of the overfitting coming from the PMI approximation, thus achieving much better results than higher-dimensional vectors.

v. Word Embeddings Extensions

In this last subsection we will review two interesting works that extend the word embedding concept to phrases and sentences using different approaches.

In [Mitchell and Lapata, 2008] the authors address the problem that vector-based models are typically directed at representing words in isolation, while methods for constructing representations for phrases or sentences have received little attention in the literature. The authors suggest the use of two composition operations, multiplication and addition (and their combination). This way the authors are able to combine word embeddings into phrase or sentence embeddings while taking into account important properties like word order and the semantic relationship between words (i.e. semantic composition types).
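A minimal sketch of the two composition operators (the vectors are hypothetical stand-ins for pre-trained embeddings):

```python
import numpy as np

def compose_additive(word_vectors):
    """Phrase vector as the sum of its word vectors."""
    return np.sum(word_vectors, axis=0)

def compose_multiplicative(word_vectors):
    """Phrase vector as the element-wise product of its word vectors."""
    return np.prod(word_vectors, axis=0)

rng = np.random.default_rng(0)
phrase = rng.normal(size=(2, 300))  # e.g. embeddings of "sound" and "quality"
print(compose_additive(phrase).shape, compose_multiplicative(phrase).shape)
```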

In MIL (Multi-Instance Transfer Learning) [Kotzias et al., 2014] the authors propose a neural network model which learns embeddings at increasing levels of hierarchy, starting from word embeddings, going to sentences, and ending with entire document embeddings. The authors then use transfer learning by pulling the sentence or word embeddings that were trained as part of the document embeddings and using them for sentence or word review classification or similarity tasks (see figure 6 below).

III. Word Embeddings vs. Image Embeddings

i. Image Embeddings

Image embeddings, or image features, were widely used for most image processing and classification tasks until the early 2010's. The features ranged from simple histograms or edge maps to the more sophisticated and very popular SIFT [Lowe, 1999] and HOG [Dalal and Triggs, 2005]. However, recent years have seen the rise of Deep Learning for image classification, especially since 2012 when the AlexNet [Krizhevsky et al., 2012] article was published. As those Convolutional Neural Networks (CNNs) operated directly on the images, it was suggested that these networks learn the best image features for the specific task that they are trained for, thus obviating the need for specific hand-crafted features.

Figure 6: Deep multi-instance transfer learning approach for review data, taken from [Kotzias et al., 2014]

Figure 7: The CNN architecture of AlexNet

In recent years, though, extensive research was done on the nature and usage of the kernels and features learned by CNNs. An extensive study of CNN feature layers was done in [Zeiler and Fergus, 2014], where the authors empirically confirmed that each convolutional layer of the CNN learns a set of filters. Their experiments also confirm that filter complexity and expressive power rise from layer to layer (i.e. as the network goes deeper), starting from simple edge detectors and going up to detectors of complex objects like eyes, flowers, faces and more. The authors also suggest using the pre-trained penultimate layer as a feature map, or image embedding, input for simpler SVM classifiers.

Figure 7: The CNN architecture of AlexNet

Another popular work was done a bit earlier in [Yangqing et al., 2014], where pre-trained CNN features were also used as a base for visual recognition tasks. This work was followed by several others, one of which can be considered the philosophical father of the algorithm we implement later. In [Razavian et al., 2014] the authors used the penultimate layer of a network similar to AlexNet that was pre-trained on ImageNet [Russakovsky et al., 2015] as image embeddings. The authors were able to achieve state-of-the-art results on several recognition tasks, using simple classifiers like SVM. The result was surprising due to the fact that the CNN model was originally optimized for the task of object classification on the ILSVRC 2013 dataset. Nevertheless, it showed itself to be a strong competitor to the more sophisticated and highly tuned state-of-the-art methods.

These works and others suggested that, given a large enough database of images, a CNN can learn an image embedding which captures the "essence" of the picture and can be used later as an input to different tasks, similar to what is done with word embeddings.

ii. Similarities and Differences

As we saw earlier, word embeddings and image embeddings are similar in the sense that, while they are learned as part of a specific task, they can be successfully used later for a variety of other tasks. Also, in both cases, similar images or words will usually have similar embeddings. However, word embeddings and image embeddings differ in some aspects.

The first difference is that while word embeddings depend mostly on the words surrounding the given word, image embeddings usually rely on the specific image itself. This might explain the fact that linear analogies do not appear naturally in images. An interesting work was done in [Reed et al., 2015], where a neural network is trained to make visual analogies and learns to make them based on appearance, rotation, 3D pose, and various object attributes.

Another difference is that while word embeddings are usually low-dimensional, image embeddings might have the same or even higher dimension than the original image. Those embeddings are still useful, as they contain a lot of information that is extracted from the image and can be used easily.

Lastly, we notice that word embeddings are trained on a specific corpus, where the final embedding results come in the form of word vectors. This limits the embedding to being valid only for words that were found in the original corpus, while other words need to be initialized as random vectors (as is also done in our work). For images, on the other hand, the embeddings come as a pre-trained model, where features or embeddings can be pulled for any sort of image by feeding the model with the image, making image embedding models a bit more robust (although they might be subject to other constraints like size and image type).

iii. Joint Word-Image Embeddings

To conclude this part we will review some of the recent work done in the exciting area of joint word-image embeddings. The first immediate usage of joint word-image embeddings is image annotation, or image labeling. An early notable work was done by [Weston, et al., 2010], where a representation of images and a representation of annotations were both mapped to a joint feature space by learning a mapping which optimizes top-of-the-list ranking measures for images and annotations. This method, however, learns linear mappings from image features to the embedding space, and the available labels were only those provided in the image training set. It could thus not generalize to new classes.

The DeViSE (Deep Visual-Semantic Embedding) model was later presented by [Frome et al., 2013]. This work, which continued earlier work [Socher et al., 2013], combined image embeddings and word embeddings, trained separately, into a joint similarity metric (see figure 8). This enabled them to give performance comparable to a state-of-the-art softmax-based model on a flat object classification metric, while simultaneously making more semantically reasonable errors. Their model was also able to make correct predictions across thousands of previously unseen classes by leveraging semantic knowledge elicited only from un-annotated text.

Figure 8: (a) Left: a visual object categorization network with a softmax output layer; Right: a skip-gram language model; Center: the joint model, which is initialized with parameters pre-trained at the lower layers of the other two models. (b) t-SNE visualization of a subset of the ILSVRC 2012 1K label embeddings learned using skip-gram. Taken from [Frome et al., 2013]

Another line of work which combines image and word embeddings is the image captioning area. In this area the embeddings are usually not combined into a joint space but rather used together to create captions for images. In [Karpathy and Fei Fei, 2015] image features pulled from a pre-trained CNN are fed into a Recurrent Neural Network (RNN) which uses word embeddings in order to generate a caption for the image, based on the image features and the previous words (see figure 9). This sort of combination appears in most image captioning and video action recognition works.

Figure 9: Image captions generated with the Deep Visual-Semantic model. Taken from [Karpathy and Fei Fei, 2015]

Finally, a slightly more sophisticated method combining RNNs and Fisher Vectors can be found in [Lev et al., 2015], where the authors were able to achieve state-of-the-art results on both image captioning and video action recognition tasks, using transfer learning on the embeddings learned for the image captioning task.

IV. CNN for Sentence Classification Model

In this section and the following ones we present our implementation of the Convolutional Neural Networks for Sentence Classification model [Kim, 2014] and our results. This model has gained much popularity since it was first introduced in late 2014, mainly because it provides a very strong demonstration of the power of pre-trained word embeddings.

The model and results were examined in detail in [Zhang and Wallace, 2015], where the authors test many configurations of the model, including different sizes and numbers of filters, different activation units and different word embeddings.

A partial implementation of the model was done in the Theano framework by the authors4, and another simplified version of the model was done in TensorFlow5. In our work we used small parts of the mentioned code; however, most of the code had to be re-written and expanded in order to perform a true implementation of the article's model.

4 https://github.com/yoonkim/CNN_sentence
5 https://github.com/dennybritz/cnn-text-classification-tf

i. Model details

The model architecture, shown in figure 10, is a slight variant of the CNN architecture of [Collobert et al., 2011]. Formally, let $x_i \in \mathbb{R}^k$ be the k-dimensional word vector corresponding to the i-th word in the sentence. Let n be the length (in number of words) of the longest sentence in the dataset, and let $l_h$ be the width of the widest filter in the network. Then, the input to the network is a $k \times (n + l_h - 1)$ matrix, which is a concatenation of the word embedding vectors of each sentence, padded by $l_h - 1$ zero vectors at the beginning and some more zero vectors at the end so that there are $n + l_h - 1$ vectors eventually.
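A short sketch of how such an input matrix can be assembled (the dictionary lookup and the zero vector for unknown words are simplifying assumptions made only for this illustration):

```python
import numpy as np

def sentence_matrix(words, embeddings, k, n, lh):
    """Stack word vectors into a k x (n + lh - 1) matrix, zero-padded as described above."""
    cols = [np.zeros(k)] * (lh - 1)                          # lh - 1 leading zero vectors
    cols += [embeddings.get(w, np.zeros(k)) for w in words]  # word vectors (zeros if unknown)
    cols += [np.zeros(k)] * (n + lh - 1 - len(cols))         # trailing zero vectors
    return np.column_stack(cols)

# Hypothetical example: k = 300, longest sentence n = 50, widest filter lh = 5.
emb = {"good": np.ones(300), "movie": np.full(300, 0.5)}
X = sentence_matrix(["good", "movie"], emb, k=300, n=50, lh=5)
print(X.shape)  # (300, 54)
```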

The input of the network is convolved with filters of different widths (i.e. number of words in the window) and different sizes (i.e. number of features). For example, the feature $c_i$ generated from a window of words $x_{i:i+h-1}$ by a filter of width h is:

$$c_i = f\left(w \cdot x_{i:i+h-1} + b\right) \qquad (3)$$

where w are the filter weights, b is a bias term, and f is a non-linear function such as ReLU. This process is done for all filters and all word windows to create a number of feature maps for each filter. Next, those feature maps are max-pooled (so that we can deal with different sentence sizes) and finally connected to a soft-max classification layer.

Figure 10: Model architecture with two channels for an example sentence. Taken from [Kim, 2014]

For regularization we employ dropout [Hinton et al., 2014] on the penultimate layer. This entails randomly (with some probability) setting values in the weight vector to 0. In the original article they also employed a constraint on the l2 norms of this layer; however, [Zhang and Wallace, 2015] found that it had a negligible contribution to the results, and therefore it was not used here.

Training of the network is done by minimizing the cross-entropy loss between the predicted labels (soft-max layer) and the correct ones. The parameters to be estimated include the weight vector(s) of the filter(s), the bias term in the activation function, the weight vector of the soft-max function and (optionally) the word embeddings. Optimization is performed using SGD [Rumelhart et al., 1988] and back-propagation, with a small mini-batch size specified later.
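As a compact illustration of the whole pipeline, here is a sketch of this architecture in TensorFlow/Keras. This is our reading of the model described above, not the authors' original Theano code; the sequence length, vocabulary size and the build_cnn helper name are placeholders:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cnn(seq_len=54, vocab_size=20000, k=300, num_classes=2, embedding_matrix=None):
    """One-layer CNN over word embeddings: parallel conv filters, max-over-time pooling,
    dropout and a softmax classifier, in the spirit of [Kim, 2014]."""
    inputs = layers.Input(shape=(seq_len,), dtype="int32")
    embed = layers.Embedding(
        vocab_size, k,
        weights=[embedding_matrix] if embedding_matrix is not None else None,
        trainable=True)(inputs)               # set trainable=False for the static variant
    pooled = []
    for h in (3, 4, 5):                       # filter widths (words per window)
        c = layers.Conv1D(100, h, activation="relu")(embed)
        pooled.append(layers.GlobalMaxPooling1D()(c))   # max-over-time pooling
    x = layers.Dropout(0.5)(layers.Concatenate()(pooled))
    outputs = layers.Dense(num_classes, activation="softmax",
                           kernel_regularizer=tf.keras.regularizers.l2(0.15))(x)
    model = tf.keras.Model(inputs, outputs)
    # Compiled here with the ADAM optimizer and cross-entropy loss used in our experiments.
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model
```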

V. Datasets

We test our model on various benchmarks. Some of them were used in the original article, while others are extensions we make to the original work. A summary of the dataset statistics can be seen in table 1 below.

• MR: Movie reviews with one sentence per review. Classification involves detecting positive/negative reviews [Pang and Lee, 2005]6.

• SST-1: Stanford Sentiment Treebank, an extension of MR but with train/dev/test splits provided and fine-grained labels (very positive, positive, neutral, negative, very negative), re-labeled by [Socher et al., 2013]7.

• SST-2: Same as SST-1 but with neutral reviews removed and binary labels.

• Subj: Subjectivity dataset where the task is to classify a sentence as being subjective or objective [Pang and Lee, 2004].

• TREC: TREC question dataset, where the task involves classifying a question into 6 question types (whether the question is about a person, location, numeric information, etc.) [Li and Roth, 2002]8.

• Irony: [Wallace et al., 2014] This contains 16,006 sentences from reddit labeled as ironic (or not). The dataset is imbalanced (relatively few sentences are ironic), thus before training we under-sampled negative instances to make the class sizes equal. This dataset was not used in the original article but was tested in [Zhang and Wallace, 2015].

• Opi: Opinions dataset, which comprises sentences extracted from user reviews on a given topic, e.g. "sound quality of ipod nano". There are 51 such topics and each topic contains approximately 100 sentences. The task is to classify which opinion belongs to which topic [Ganesan et al., 2010]. This dataset was not used in the original article but was tested in [Zhang and Wallace, 2015].

• Tweet: Tweets from 10 different authors. Classification involves determining which tweet belongs to which author9. This dataset was not used in the original article.

• Polite: Sentences taken from Wikipedia editors' logs, which have 25 ranges of politeness [Danescu-Niculescu-Mizil et al., 2013]. We narrowed it down to 2 binary classes (polite/impolite). This dataset was not used in the original article.

6 https://www.cs.cornell.edu/people/pabo/movie-review-data/
7 http://nlp.stanford.edu/sentiment/ Data is actually provided at the phrase level, and hence we train the model on both phrases and sentences but only score on sentences at test time, as in [Socher et al., 2013]. Thus the training set is an order of magnitude larger than listed in table 1.
8 http://cogcomp.cs.illinois.edu/Data/QA/QC/
9 http://www.cs.huji.ac.il/~nogazas/pages/projects.html, thanks to Noga Zaslavsky

Data     c   l   N      |V|    |Vpre|  Test
MR       2   20  10662  18765  16488   CV
SST-1    5   18  11855  17836  16262   2210
SST-2    2   19  9613   16185  14838   1821
Subj     2   23  10000  21323  17913   CV
TREC     6   10  5952   9592   9125    500
Irony    2   75  1074   6138   5718    CV
Opi      51  38  7086   7310   6538    CV
Tweet    10  39  25552  33438  17023   5964
Polite   2   53  4353   10135  7951    CV

Table 1: Summary statistics for the datasets after tokenization. c: Number of target classes. l: Average sentence length. N: Dataset size. |V|: Vocabulary size. |Vpre|: Number of words present in the set of pre-trained word vectors. Test: Test set size (CV means there was no standard train/test split and thus 10-fold CV was used).

VI. Experimental Setup

i. Hyperparameters and Training

In our implementation of the model we experimented with a lot of different configurations. Eventually, since the differences in results were minor, we decided to use the same architecture and parameters mentioned in the original article for all experiments, with some changes mentioned below. Below is the list of parameters and specifications that were used for all experiments:

• Word embeddings: We used the pre-trained Word2Vec embeddings [Mikolov et. al., 2013] mentioned earlier. Each word embedding is in $\mathbb{R}^{300}$. Words that are not found in Word2Vec are randomly initialized with a uniform distribution in the range [-0.5, 0.5] (a sketch of this initialization is given after this list).

• Filters: We used filters with window sizes of [3,4,5], with 100 features each. For activation we used ReLU.

• Dropout rate: 0.5.

• Mini-Batch size: 50.

• Optimizer: The AdaDelta optimizer [Zeiler, 2012] was used in the original article. We decided to use the more recent ADAM optimizer [Kingma and Ba, 2014], as it seemed to converge much faster (i.e. needed fewer training epochs) and in some cases improved the results.

• Learning rate: 0.001. We lower it to 0.0005 after 8 epochs and to 0.00005 after 16 epochs. Notice that the original article did not mention the learning rate.

• Number of epochs: This was also not mentioned in the original article but can be found in the authors' code10. We used 25 epochs for the static version (see Model Variations below). For the non-static version we used either 4 (MR, SST-1, SST-2, Subj), 10 (Polite), 16 (Tweet, Opi), or 25 (TREC). For the random version we used 25, except for Tweet where we used 10, and MR and SST-1 where we used 4.

• l2-loss: We added an l2-loss with λ = 0.15 on the weights and biases of the final layer. Although this was not done in the original article, we found it to slightly improve the results.

As mentioned earlier, we decided not to use l2 constraints on the norms due to their negligible contribution.
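The following sketch shows the embedding-matrix initialization described in the word-embeddings item above; the use of gensim for loading the vectors and the build_embedding_matrix helper are tooling assumptions, not part of the original code:

```python
import numpy as np
from gensim.models import KeyedVectors

def build_embedding_matrix(vocab, path="GoogleNews-vectors-negative300.bin", k=300):
    """Copy pre-trained Word2Vec vectors where available; otherwise draw from U(-0.5, 0.5)."""
    word2vec = KeyedVectors.load_word2vec_format(path, binary=True)
    rng = np.random.default_rng(0)
    matrix = np.zeros((len(vocab), k), dtype=np.float32)
    for i, word in enumerate(vocab):
        if word in word2vec:
            matrix[i] = word2vec[word]
        else:
            matrix[i] = rng.uniform(-0.5, 0.5, size=k)  # unknown words: random init
    return matrix
```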

ii. Model Variations

We experiment with several variants of the model, as in the original article.

• CNN-rand: Our baseline model, where all words are randomly initialized and then modified during training.

• CNN-static: A model with pre-trained vectors from word2vec. All words, including the unknown ones that are randomly initialized, are kept static and only the other parameters of the model are learned.

• CNN-non-static: Same as above, but the pre-trained vectors are fine-tuned for each task.

The authors also used a multi-channel model where one channel is static and the other is not. However, experiments showed that on most datasets this did not improve the results. As implementing this would have required a lot more coding, we decided to drop it.

10 We note here that in the original article they used early stopping with a dev set. However, the early stopping parameters are not mentioned, and experiments would have demanded a lot of coding, which is beyond the scope of this project. We do assume that the 25 epochs used in the code might be close enough to the actual number used in the article.

VII. Results and Discussion

In this section we will compare the results we got in our implementation to the ones achieved in the original article. Full results can be found in the original article, and we note that most of them are state-of-the-art results, or comparable. For datasets that were not present in the original article we compare with other achieved results, whether ours or others'.

Model            MR (Orig/Ours)  SST-1 (Orig/Ours)  SST-2 (Orig/Ours)  Subj (Orig/Ours)  TREC (Orig/Ours)
CNN-rand         76.1 / 76.4     45.0 / 41.0        82.7 / 80.2        89.6 / 91.1       91.2 / 97.6
CNN-static       81.0 / 80.3     45.5 / 48.1        86.8 / 85.4        93.0 / 92.5       92.8 / 98.2
CNN-non-static   81.5 / 80.5     48.0 / 47.3        87.2 / 85.8        93.4 / 93.0       93.6 / 98.6

Table 2: Results on datasets that were tested in [Kim, 2014] (Orig above).

In table 2 above we can see a comparison between our results and the ones in the original article [Kim, 2014]. We can see that overall, our results are comparable (and sometimes better) to the ones in the original article. We also see that, as in the original article, our baseline model with all randomly initialized words (CNN-rand) does not perform well on its own (in most cases). These results suggest that the pre-trained vectors are good, 'universal' feature extractors and can be utilized across datasets. Fine-tuning the pre-trained vectors for each task gives still further improvements (CNN-non-static).

Model        Opi   Irony  Tweet  Polite
Random       64.8  60.2   89.1   66.2
Static       65.3  60.5   83.8   66.2
Non-Static   66.4  62.1   89.2   65.7
ConvSent     64.9  67.1   -      -
SVM+TF-IDF   -     -      92.5   -

Table 3: Results for datasets that were not used in the original article. ConvSent is [Zhang and Wallace, 2015].

The differences in some of our results can be related to the different optimizer we used, and to the fact that we did not use early stopping. We do note that our results (at least for the non-static version) were achieved with much less training than in the original article11. We also note that on the TREC dataset we were able to achieve a new state-of-the-art result, improving the current one (95%) by 3.6%. Both these benefits can be related to the use of the ADAM optimizer [Kingma and Ba, 2014].

11 That is, if we take the 25 epochs in the code mentioned earlier as an indication of the number of training epochs used in the original article.

In table 3 above we can see our results for the datasets that were not used in the original article. We also compare them to other results where applicable.

On the Opi and Irony datasets we note that the general trend of improved results with pre-trained vectors is maintained. On the Opi dataset we were also able to achieve a new state-of-the-art result, and we achieved comparable results on the Irony dataset. Notice that the other reported result on Irony is an AUC and not accuracy.

The other two results are interesting. On the Tweet dataset we notice that random vectors actually perform a lot better than pre-trained static ones. The reason is that on this dataset almost half of the vocabulary was not found in the Word2Vec embeddings. This makes sense, as tweets usually contain a lot of marks (for example :-) ) and hashtags which would naturally not be available in embeddings that were trained on news. This makes the static version a bad choice, as it keeps these embeddings random during training.

On this dataset we also applied a simple SVM classifier on the TF-IDF features of each tweet. This simple classifier produced much better results, as TF-IDF features are sensitive to unique words in a tweet (like hashtags) that usually indicate who the author is, thus making classification easier.
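This baseline can be reproduced in a few lines of scikit-learn; the toy data below is a stand-in, and default hyperparameters are an assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in data; in the experiment these are the raw tweets and their author labels.
train_tweets = ["loving the new album", "breaking: markets fall again", "gym done :-)"]
train_authors = ["a", "b", "a"]

# Rare tokens specific to an author get high TF-IDF weight, which is what helps here.
classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
classifier.fit(train_tweets, train_authors)
print(classifier.predict(["new album out soon"]))
```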

On the Polite dataset we notice that the results do not depend on the choice of model. The results themselves are also not very good. These results need further inspection, but they might suggest that this model is not fitted for this task, or that politeness is a complicated property for automatic classification.

VIII. Conclusions and Future Directions

In this work we reviewed word embeddings. We saw their origins, discussed the different methods for creating them, and saw some of their interesting properties. We think that word embeddings are a very exciting topic for both research and applications, and expect that a lot of research is going to be carried out towards better understanding of their properties and better creation methods.

In this work we also compared image features and word embeddings and saw how they can be combined to build learning algorithms that can finally gain a good understanding of pictures and scenes. This area is just at its beginning, and we expect a lot of work to be carried out towards creating a hybrid system which gains understanding of both vision and language, and that combines those understandings together to the benefit of both fields.

Finally, we saw that despite little tuning of hyperparameters, a simple CNN with one layer of convolution, trained on top of Word2Vec embeddings, performs remarkably well on sentence classification tasks. These results add to the well-established evidence that unsupervised pre-training of word vectors is an important ingredient in deep learning for NLP.


To conclude this work, we propose here two lines of future work that we think might be interesting to check. First, in the spirit of [Kotzias et al., 2014], we notice that in our network the penultimate layer is actually learning sentence embeddings. It might be interesting to train the network on some classification task with a relatively large dataset, and then use the learned sentence embeddings in the same fashion that word embeddings are used in our work. For example, we can train the network on the MR task and then take the learned sentence embeddings and use them as embedding input to some document classification task. We can then check whether this method achieves an improvement over models that try to classify documents using only pre-trained word embeddings.
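As a sketch of how such sentence embeddings could be pulled out of a trained network (reusing the hypothetical build_cnn helper from the earlier sketch):

```python
import tensorflow as tf

# `model` is assumed to be a trained instance of the build_cnn sketch shown earlier.
model = build_cnn(num_classes=2)
# The layer just before the softmax (here the dropout output) acts as a sentence embedding.
sentence_encoder = tf.keras.Model(inputs=model.input,
                                  outputs=model.layers[-2].output)
# sentence_vectors = sentence_encoder.predict(padded_token_ids)  # shape: (num_sentences, 300)
```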

The second line of research is in the spirit of [Zeiler and Fergus, 2014]. ConvNet visualization helped to gain a lot of insights about image structure and how features of increasing levels of complexity are combined to create images. It might be interesting to apply those same visualization methods to the filters used in our work, or similar works, and see if the ConvNet filters learn some interesting semantic properties or compositions that can give insights on the structure of language and how computers (or even humans) perceive it.

References

[Arora et al., 2015] Arora, Sanjeev, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. "Rand-walk: A latent variable model approach to word embeddings." arXiv preprint arXiv:1502.03520 (2015).

[Collobert and Weston, 2008] Collobert, Ronan, and Jason Weston. "A unified architecture for natural language processing: Deep neural networks with multitask learning." Proceedings of the 25th international conference on Machine learning. ACM. (2008).

[Collobert et al., 2011] Collobert, Ronan, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. "Natural language processing (almost) from scratch." Journal of Machine Learning Research 12, no. Aug (2011): 2493-2537.

[Dalal and Triggs, 2005] Dalal, Navneet, and Bill Triggs. "Histograms of oriented gradients for human detection." 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). Vol. 1. IEEE, 2005.

[Danescu-Niculescu-Mizil et al., 2013] Danescu-Niculescu-Mizil, Cristian, Moritz Sudhof, Dan Jurafsky, Jure Leskovec, and Christopher Potts. "A computational approach to politeness with application to social factors." arXiv preprint arXiv:1306.6078 (2013).

[Deerwester et al., 1990] Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. "Indexing by latent semantic analysis." Journal of the American society for information science 41, no. 6 (1990): 391.

[Firth, 1957] Firth, J.R. (1957). "A synopsis of linguistic theory 1930-1955". Studies in Linguistic Analysis (Oxford: Philological Society): 1-32. Reprinted in F.R. Palmer, ed. (1968). Selected Papers of J.R. Firth 1952-1959. London: Longman.

[Frome et al., 2013] Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "Devise: A deep visual-semantic embedding model." In Advances in neural information processing systems, pp. 2121-2129. 2013.

[Ganesan et al., 2010] Ganesan, Kavita, ChengXiang Zhai, and Jiawei Han. "Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions." Proceedings of the 23rd international conference on computational linguistics. Association for Computational Linguistics, 2010.


[Goldberg and Levy, 2014] Goldberg, Yoav, and Omer Levy. "word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method." arXiv preprint arXiv:1402.3722 (2014).

[Gutmann and Hyvärinen, 2012] Gutmann, Michael U., and Aapo Hyvärinen. "Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics." Journal of Machine Learning Research 13.Feb (2012): 307-361.

[Hill et al., 2016] Hill, Felix, Roi Reichart, and Anna Korhonen. "Simlex-999: Evaluating semantic models with (genuine) similarity estimation." Computational Linguistics (2016).

[Hinton, 1986] Hinton, Geoffrey E. "Distributed representations." Parallel Distributed Processing: Explorations in the Microstructure of Cognition (1986).

[Hinton et al., 2014] Srivastava, Nitish, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. "Dropout: a simple way to prevent neural networks from overfitting." Journal of Machine Learning Research 15, no. 1 (2014): 1929-1958.

[Karpathy and Fei Fei, 2015] Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

[Kenneth et al., 1990] Church, Kenneth Ward, and Patrick Hanks. "Word association norms, mutual information, and lexicography." Computational linguistics 16.1 (1990): 22-29.

[Kim, 2014] Kim, Yoon. "Convolutional neural networks for sentence classification." arXiv preprint arXiv:1408.5882 (2014).

[Kingma and Ba, 2014] Kingma, Diederik, and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).

[Kotzias et al., 2014] Kotzias, Dimitrios, Misha Denil, Phil Blunsom, and Nando de Freitas. "Deep multi-instance transfer learning." arXiv preprint arXiv:1411.3128 (2014).

[Krizhevsky et al., 2012] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.

[Lev et al., 2015] Lev, Guy, Gil Sadeh, Benjamin Klein, and Lior Wolf. "RNN Fisher Vectors for Action Recognition and Image Annotation." arXiv preprint arXiv:1512.03958 (2015).

[Li and Roth, 2002] Li, Xin, and Dan Roth. "Learning question classifiers." Proceedings of the 19th international conference on Computational linguistics - Volume 1. Association for Computational Linguistics, 2002.

[Lowe, 1999] Lowe, David G. "Object recognition from local scale-invariant features." Computer vision, 1999. The proceedings of the seventh IEEE international conference on. Vol. 2. IEEE, 1999.

[Maaten and Hinton, 2008] Maaten, Laurens van der, and Geoffrey Hinton. "Visualizing data using t-SNE." Journal of Machine Learning Research 9.Nov (2008): 2579-2605.

[Mikolov et. al., 2013] Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).

[Mitchell et al., 2008] Mitchell, Tom M., Svetlana V. Shinkareva, Andrew Carlson, Kai-Min Chang, Vicente L. Malave, Robert A. Mason, and Marcel Adam Just. "Predicting human brain activity associated with the meanings of nouns." Science 320, no. 5880 (2008): 1191-1195.

[Mitchell and Lapata, 2008] Mitchell, Jeff, and Mirella Lapata. "Vector-based Models of Semantic Composition." ACL. 2008.

[Osgood, 1964] Osgood, Charles E. "Semantic differential technique in the comparative study of cultures." American Anthropologist 66.3 (1964): 171-200.

[Pang and Lee, 2004] Pang, B., Lee, L. "A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts." In Proceedings of the 42nd annual meeting on Association for Computational Linguistics (p. 271). Association for Computational Linguistics. 2004.

[Pang and Lee, 2005] Pang, Bo, and Lillian Lee. "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales." Proceedings of the 43rd annual meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2005.

[Pennington et. al., 2014] Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "Glove: Global Vectors for Word Representation." EMNLP. Vol. 14. (2014).

[Razavian et al., 2014] Sharif Razavian, Ali, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. "CNN features off-the-shelf: an astounding baseline for recognition." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 806-813. 2014.

[Reed et al., 2015] Reed, Scott E., Yi Zhang, Yuting Zhang, and Honglak Lee. "Deep visual analogy-making." In Advances in Neural Information Processing Systems, pp. 1252-1260. 2015.

[Rumelhart et al., 1988] Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. "Learning representations by back-propagating errors." Cognitive modeling 5.3 (1988): 1.

[Russakovsky et al., 2015] Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang et al. "Imagenet large scale visual recognition challenge." International Journal of Computer Vision 115, no. 3 (2015): 211-252.

[Salton et al., 1975] Salton, Gerard, Anita Wong, and Chung-Shu Yang. "A vector space model for automatic indexing." Communications of the ACM 18.11 (1975): 613-620.

[Schwartz et al., 2015] Schwartz, Roy, Roi Reichart, and Ari Rappoport. "Symmetric pattern based word embeddings for improved word similarity prediction." Proc. of CoNLL. 2015.

[Socher et al., 2013] Socher, Richard, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. "Recursive deep models for semantic compositionality over a sentiment treebank." In Proceedings of the conference on empirical methods in natural language processing (EMNLP), vol. 1631, p. 1642. 2013.

[Socher et al., 2013] Socher, Richard, Milind Ganjoo, Christopher D. Manning, and Andrew Ng. "Zero-shot learning through cross-modal transfer." In Advances in neural information processing systems, pp. 935-943. 2013.

[Wallace et al., 2014] Wallace, Byron C., Do Kook Choe, Laura Kertz, and Eugene Charniak. "Humans require context to infer ironic intent (so computers probably do, too)." In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2014, pages 512-516.

[Weston, et al., 2010] Weston, Jason, Samy Bengio, and Nicolas Usunier. "Large scale image annotation: learning to rank with joint word-image embeddings." Machine learning 81.1 (2010): 21-35.

[Yangqing et al., 2014] Jia, Yangqing, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. "Caffe: Convolutional architecture for fast feature embedding." In Proceedings of the 22nd ACM international conference on Multimedia, pp. 675-678. ACM, 2014.

[Zhang and Wallace, 2015] Zhang, Ye, and Byron Wallace. "A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification." arXiv preprint arXiv:1510.03820 (2015).

[Zeiler, 2012] Zeiler, Matthew D. "ADADELTA: an adaptive learning rate method." arXiv preprint arXiv:1212.5701 (2012).

[Zeiler and Fergus, 2014] Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." European Conference on Computer Vision. Springer International Publishing, 2014.
