
Incorporating Syntactic and Semantic Information in Word Embeddings using Graph Convolutional Networks

Shikhar Vashishth1  Manik Bhandari2*  Prateek Yadav3*  Piyush Rai4  Chiranjib Bhattacharyya1  Partha Talukdar1
1Indian Institute of Science, 2Carnegie Mellon University, 3Microsoft Research, 4IIT Kanpur
{shikhar,chiru,ppt}@iisc.ac.in, [email protected]
[email protected], [email protected]

Abstract

Word embeddings have been widely adopted across several NLP applications. Most existing word embedding methods utilize sequential context of a word to learn its embedding. While there have been some attempts at utilizing syntactic context of a word, such methods result in an explosion of the vocabulary size. In this paper, we overcome this problem by proposing SynGCN, a flexible Graph Convolution based method for learning word embeddings. SynGCN utilizes the dependency context of a word without increasing the vocabulary size. Word embeddings learned by SynGCN outperform existing methods on various intrinsic and extrinsic tasks and provide an advantage when used with ELMo. We also propose SemGCN, an effective framework for incorporating diverse semantic knowledge for further enhancing learned word representations. We make the source code of both models available to encourage reproducible research.

1 Introduction

Representing words as real-valued vectors is an effective and widely adopted technique in NLP. Such representations capture properties of words based on their usage and allow them to generalize across tasks. Meaningful word embeddings have been shown to improve performance on several relevant tasks, such as named entity recognition (NER) (Bengio et al., 2013), parsing (Socher et al., 2013), and part-of-speech (POS) tagging (Ma and Hovy, 2016). Using word embeddings for initializing Deep Neural Networks has also been found to be quite useful (Collobert et al., 2011; Johnson et al., 2017; Strubell et al., 2018).

Most popular methods for learning word embeddings are based on the distributional hypothesis, which utilizes the co-occurrence statistics from sequential context of words for learning word representations (Mikolov et al., 2013a; Pennington et al., 2014). More recently, this approach has been extended to include syntactic contexts (Levy and Goldberg, 2014) derived from the dependency parse of text. Higher order dependencies have also been exploited by Komninos and Manandhar (2016); Li et al. (2018). Syntax-based embeddings encode functional similarity (in-place substitutable words) rather than topical similarity (topically related words), which provides an advantage on specific tasks like question classification (Komninos and Manandhar, 2016). However, current approaches incorporate syntactic context by concatenating words with their dependency relations. For instance, in Figure 1, scientists_subj, water_obj, and mars_nmod need to be included as part of the vocabulary for utilizing the dependency context of discover. This severely expands the vocabulary, thus limiting the scalability of models on large corpora. For instance, in Levy and Goldberg (2014) and Komninos and Manandhar (2016), the context vocabulary explodes to around 1.3 million for learning embeddings of 220k words.

∗ Contributed equally to the work.

Incorporating relevant signals from semantic knowledge sources such as WordNet (Miller, 1995), FrameNet (Baker et al., 1998), and Paraphrase Database (PPDB) (Pavlick et al., 2015) has been shown to improve the quality of word embeddings. Recent works utilize these by incorporating them in a neural language modeling objective function (Yu and Dredze, 2014; Alsuhaibani et al., 2018), or as a post-processing step (Faruqui et al., 2014; Mrkšić et al., 2016). Although existing approaches improve the quality of word embeddings, they require explicit modification for handling different types of semantic information.

Figure 1: Overview of SynGCN: SynGCN employs Graph Convolution Network for utilizing dependency context for learning word embeddings. For each word in vocabulary, the model learns its representation by aiming to predict each word based on its dependency context encoded using GCNs. Please refer to Section 5 for more details.

Recently proposed Graph Convolutional Networks (GCN) (Defferrard et al., 2016; Kipf and Welling, 2016) have been found to be useful for encoding structural information in graphs. Even though GCNs have been successfully employed for several NLP tasks such as machine translation (Bastings et al., 2017), semantic role labeling (Marcheggiani and Titov, 2017), document dating (Vashishth et al., 2018a) and text classification (Yao et al., 2018), they have so far not been used for learning word embeddings, especially leveraging cues such as syntactic and semantic information. GCNs provide flexibility to represent diverse syntactic and semantic relationships between words all within one framework, without requiring relation-specific special handling as in previous methods. Recognizing these benefits, we make the following contributions in this paper.

1. We propose SynGCN, a Graph Convolution based method for learning word embeddings. Unlike previous methods, SynGCN utilizes syntactic context for learning word representations without increasing vocabulary size.

2. We also present SemGCN, a framework for incorporating diverse semantic knowledge (e.g., synonymy, antonymy, hyponymy, etc.) in learned word embeddings, without requiring relation-specific special handling as in previous methods.

3. Through experiments on multiple intrinsic and extrinsic tasks, we demonstrate that our proposed methods obtain substantial improvement over state-of-the-art approaches, and also yield an advantage when used in conjunction with methods such as ELMo (Peters et al., 2018).

The source code of both the methods has been made available at http://github.com/malllabiisc/WordGCN.

2 Related Work

Word Embeddings: Recently, there has been much interest in learning meaningful word representations, such as the neural language modeling (Bengio et al., 2003) based continuous-bag-of-words (CBOW) and skip-gram (SG) models (Mikolov et al., 2013a). This was further extended by Pennington et al. (2014), who learn embeddings by factorizing a word co-occurrence matrix to leverage global statistical information. Other formulations for learning word embeddings include multi-task learning (Collobert et al., 2011) and ranking frameworks (Ji et al., 2015).

Syntax-based Embeddings: Dependency parse context based word embeddings were first introduced by Levy and Goldberg (2014). They allow encoding syntactic relationships between words and show improvements on tasks where functional similarity is more relevant than topical similarity. The inclusion of syntactic context is further enhanced through second-order (Komninos and Manandhar, 2016) and multi-order (Li et al., 2018) dependencies. However, in all these existing approaches, the word vocabulary is severely expanded for incorporating syntactic relationships.

Incorporating Semantic Knowledge Sources: Semantic relationships such as synonymy, antonymy, hypernymy, etc. from several semantic sources have been utilized for improving the quality of word representations. Existing methods either exploit them jointly (Xu et al., 2014; Kiela et al., 2015; Alsuhaibani et al., 2018) or as a post-processing step (Faruqui et al., 2014; Mrkšić et al., 2016). SemGCN falls under the latter category and is more effective at incorporating semantic constraints (Sections 9.2 and 9.3).

Graph Convolutional Networks: In this paper, we use the first-order formulation of GCNs via a layer-wise propagation rule as proposed by Kipf and Welling (2016). Recently, some variants of GCNs have also been proposed (Yadav et al., 2019; Vashishth et al., 2019). A detailed description of GCNs and their applications can be found in Bronstein et al. (2017). In NLP, GCNs have been utilized for semantic role labeling (Marcheggiani and Titov, 2017), machine translation (Bastings et al., 2017), and relation extraction (Vashishth et al., 2018b). Recently, Yao et al. (2018) use GCNs for text classification by jointly embedding words and documents. However, their learned embeddings are task specific, whereas in our work we aim to learn task agnostic word representations.

3 Background: Graph Convolutional Networks

In this section, we provide a brief overview of Graph Convolutional Networks (GCNs) (Defferrard et al., 2016; Kipf and Welling, 2016) and their extension to directed labeled graphs.

3.1 GCN on Directed Labeled Graphs

Let G = (V, E, X) be a directed graph where V is the set of nodes (|V| = n), E indicates the edge set, and X ∈ R^{n×d} denotes the d-dimensional input node features. An edge from node u to v with label l_{uv} is denoted by (u, v, l_{uv}). As the information need not always propagate only along the direction of the edge, following Marcheggiani and Titov (2017), we include inverse edges (v, u, l_{uv}^{-1}) in E. The embedding h_v^{k+1} ∈ R^d of a node v after k GCN layers is given as follows:

    h_v^{k+1} = f( \sum_{u \in N^+(v)} ( W_{l_{uv}}^k h_u^k + b_{l_{uv}}^k ) )

Here, W_{l_{uv}}^k ∈ R^{d×d} and b_{l_{uv}}^k ∈ R^d are label-specific model parameters, N^+(v) = N(v) ∪ {v} is the set of immediate neighbors of v (including v itself), and h_u^k ∈ R^d is the hidden representation of node u after k−1 layers.

Edge Label Gating Mechanism: In real-world graphs, some of the edges might be erroneous or irrelevant for the downstream task. This is predominant in automatically constructed graphs like the dependency parse of text. To address this issue, we employ edge-wise gating (Marcheggiani and Titov, 2017) in GCNs. For each node v, we calculate a relevance score g_{l_{uv}}^k ∈ R for all the edges in which v participates. The score is computed independently for each layer as shown below:

    g_{l_{uv}}^k = \sigma( \hat{W}_{l_{uv}}^k h_u^k + \hat{b}_{l_{uv}}^k )

Here, \hat{W}_{l_{uv}}^k ∈ R^{1×d} and \hat{b}_{l_{uv}}^k ∈ R are trainable parameters and σ(·) is the sigmoid function. The updated GCN propagation rule for the k-th layer can be written as shown below:

    h_v^{k+1} = f( \sum_{u \in N^+(v)} g_{l_{uv}}^k × ( W_{l_{uv}}^k h_u^k + b_{l_{uv}}^k ) )    (1)
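To make Equation 1 concrete, the following minimal NumPy sketch applies one gated GCN layer to a toy labeled graph. The variable names (node_feats, edges, W, b, W_gate, b_gate) and the toy labels are illustrative assumptions for this example, not names from the released implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_gcn_layer(node_feats, edges, W, b, W_gate, b_gate, f=np.tanh):
    """One application of Equation 1. edges is a list of (u, v, label) triples;
    W[l], b[l] are the label-specific message parameters and W_gate[l],
    b_gate[l] the edge-wise gating parameters for label l."""
    n, d = node_feats.shape
    out = np.zeros((n, d))
    for u, v, l in edges:
        msg = node_feats[u] @ W[l] + b[l]                        # W^k_{luv} h^k_u + b^k_{luv}
        gate = sigmoid(node_feats[u] @ W_gate[l] + b_gate[l])    # g^k_{luv}
        out[v] += gate * msg                                     # sum over u in N+(v)
    return f(out)

# Toy graph with 3 nodes, two dependency-style labels, and self-loops.
rng = np.random.default_rng(0)
d = 4
labels = ["subj", "obj", "self"]
W = {l: rng.normal(scale=0.1, size=(d, d)) for l in labels}
b = {l: np.zeros(d) for l in labels}
W_gate = {l: rng.normal(scale=0.1, size=(d, 1)) for l in labels}
b_gate = {l: np.zeros(1) for l in labels}

h0 = rng.normal(size=(3, d))
edges = [(0, 1, "subj"), (2, 1, "obj")] + [(i, i, "self") for i in range(3)]
h1 = gated_gcn_layer(h0, edges, W, b, W_gate, b_gate)
print(h1.shape)  # (3, 4)
```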

4 Methods Overview

The task of learning word representations in an unsupervised setting can be formulated as follows: given a text corpus, the aim is to learn a d-dimensional embedding for each word in the vocabulary. Most of the distributional hypothesis based approaches only utilize sequential context for each word in the corpus. However, this becomes suboptimal when the relevant context words lie beyond the window size. For instance, in Figure 1, a relevant context word discover for Mars is missed if the chosen window size is less than 3. On the contrary, a large window size might allow irrelevant words to influence word embeddings negatively.

Using dependency based context helps to alleviate this problem. However, all existing syntactic context based methods (Levy and Goldberg, 2014; Komninos and Manandhar, 2016; Li et al., 2018) severely expand vocabulary size (as discussed in Section 1), which limits their scalability to a large corpus. To eliminate this drawback, we propose SynGCN, which employs Graph Convolution Networks to better encode syntactic information in embeddings. We prefer GCNs over other graph encoding architectures such as Tree LSTM (Tai et al., 2015) as GCNs do not restrict graphs to be trees and have been found to be more effective at capturing global information (Zhang et al., 2018). Moreover, they give substantial speedup as they do not involve recursive operations which are difficult to parallelize. The overall architecture is shown in Figure 1; for more details refer to Section 5.

Enriching word embeddings with semantic knowledge helps to improve their quality for several NLP tasks. Existing approaches are either incapable of utilizing these diverse relations or require the relations to be explicitly modeled for exploiting them. In this paper, we propose SemGCN, which automatically learns to utilize multiple semantic constraints by modeling them as different edge types. It can be used as a post-processing method similar to Faruqui et al. (2014); Mrkšić et al. (2016). We describe it in more detail in Section 6.

5 SynGCN

In this section, we provide a detailed description of our proposed method, SynGCN. Following Mikolov et al. (2013b); Levy and Goldberg (2014); Komninos and Manandhar (2016), we separately define target and context embeddings for each word in the vocabulary as parameters in the model. For a given sentence s = (w_1, w_2, ..., w_n), we first extract its dependency parse graph G_s = (V_s, E_s) using the Stanford CoreNLP parser (Manning et al., 2014). Here, V_s = {w_1, w_2, ..., w_n} and E_s denotes the labeled directed dependency edges of the form (w_i, w_j, l_{ij}), where l_{ij} is the dependency relation of w_i to w_j.

Similar to Mikolov et al. (2013b)'s continuous-bag-of-words (CBOW) model, which defines the context of a word w_i as C_{w_i} = {w_{i+j} : −c ≤ j ≤ c, j ≠ 0} for a window of size c, we define the context as its neighbors in G_s, i.e., C_{w_i} = N(w_i). Now, unlike CBOW, which takes the sum of the context embeddings of words in C_{w_i} to predict w_i, we apply a directed Graph Convolution Network (as defined in Section 3) on G_s with the context embeddings of words in s as input features. Thus, for each word w_i in s, we obtain a representation h_i^{k+1} after k layers of GCN using Equation 1, which we reproduce below for ease of readability (with one exception as described below):

    h_i^{k+1} = f( \sum_{j \in N(i)} g_{l_{ij}}^k × ( W_{l_{ij}}^k h_j^k + b_{l_{ij}}^k ) )

Please note that unlike in Equation 1, we use N(i) instead of N^+(i) in SynGCN, i.e., we do not include self-loops in G_s. This helps to avoid overfitting to the initial embeddings, which is undesirable in the case of SynGCN as it uses random initialization. We note that a similar strategy has been followed by Mikolov et al. (2013b). Furthermore, to handle erroneous edges in the automatically constructed dependency parse graph, we perform edge-wise gating (Section 3.1) to give importance to relevant edges and suppress the noisy ones. The embeddings obtained are then used to calculate the loss as described in Section 7.
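As an illustration of how dependency neighborhoods replace linear windows, the following sketch builds the SynGCN context for each word of the sentence from Figure 1; the parse triples and helper names are invented for the example and are not taken from the paper's released code.

```python
# Toy dependency parse of "Scientists discover water on Mars";
# triples are (head_index, dependent_index, relation) and are illustrative only.
sentence = ["scientists", "discover", "water", "on", "mars"]
parse = [(1, 0, "subj"), (1, 2, "obj"), (1, 4, "nmod"), (4, 3, "case")]

# SynGCN context of w_i: its neighbours in the parse graph (with inverse
# edges labelled l^{-1}), instead of a fixed-size sequential window.
context = {i: [] for i in range(len(sentence))}
for head, dep, rel in parse:
    context[dep].append((head, rel))            # edge (w_head, w_dep, l)
    context[head].append((dep, rel + "_inv"))   # inverse edge

for i, word in enumerate(sentence):
    print(word, "->", [(sentence[j], r) for j, r in context[i]])
# 'mars' keeps 'discover' (nmod) as context even though the two words are
# 3 tokens apart, which a CBOW window of size < 3 would miss.
```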

SynGCN utilizes syntactic context to learn more meaningful word representations. We validate this in Section 9.1. Note that the word vocabulary remains unchanged during the entire learning process, which makes SynGCN more scalable compared to the existing approaches.

Note that SynGCN is a generalization of the CBOW model, as shown below.

Theorem 1. SynGCN is a generalization of the continuous-bag-of-words (CBOW) model.

Proof. The reduction can be obtained as follows. For a given sentence s, take the neighborhood of each word w_i in G_s to be its sequential context, i.e., N(w_i) = {w_{i+j} : −c ≤ j ≤ c, j ≠ 0} for all w_i ∈ s. Now, if the number of GCN layers is restricted to 1 and the activation function is taken as identity (f(x) = x), then Equation 1 reduces to

    h_i = \sum_{−c ≤ j ≤ c, j ≠ 0} ( g_{l_{ij}} × ( W_{l_{ij}} h_j + b_{l_{ij}} ) ).

Finally, W_{l_{ij}} and b_{l_{ij}} can be fixed to the identity matrix (I) and the zero vector (0), respectively, and the edge-wise gating g_{l_{ij}} can be set to 1. This gives

    h_i = \sum_{−c ≤ j ≤ c, j ≠ 0} ( I · h_j + 0 ) = \sum_{−c ≤ j ≤ c, j ≠ 0} h_j,

which is the hidden-layer equation of the CBOW model.
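The reduction is easy to check numerically; the snippet below is a minimal sanity check of Theorem 1 under the stated assumptions (identity weights, zero biases, gates fixed to 1, identity activation), with randomly generated context embeddings standing in for learned ones.

```python
import numpy as np

rng = np.random.default_rng(1)
c, d, n = 2, 5, 7                      # window size, embedding dim, sentence length
ctx_emb = rng.normal(size=(n, d))      # context embeddings of the sentence

def window(i):
    return [j for j in range(max(0, i - c), min(n, i + c + 1)) if j != i]

def cbow_hidden(i):
    # CBOW hidden layer: sum of context embeddings within a +/- c window
    return ctx_emb[window(i)].sum(axis=0)

def syngcn_hidden(i, W=None, bias=None, gate=1.0, f=lambda x: x):
    # One GCN layer over the sequential-neighbourhood graph (Equation 1, no self-loop)
    W = np.eye(d) if W is None else W
    bias = np.zeros(d) if bias is None else bias
    return f(sum(gate * (ctx_emb[j] @ W + bias) for j in window(i)))

assert all(np.allclose(cbow_hidden(i), syngcn_hidden(i)) for i in range(n))
print("1-layer SynGCN with identity weights reproduces the CBOW hidden layer")
```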

6 SemGCN

In this section, we propose another Graph Convolution based framework, SemGCN, for incorporating semantic knowledge in pre-trained word embeddings.

Figure 2: Overview of SemGCN, our proposed Graph Convolution based framework for incorporating diverse semantic information in learned embeddings. Double-headed edges denote two edges in both directions. Please refer to Section 6 for more details.

Most of the existing approaches like Faruqui et al. (2014); Mrkšić et al. (2016) are restricted to handling symmetric relations like synonymy and antonymy. On the other hand, although the recently proposed method of Alsuhaibani et al. (2018) is capable of handling asymmetric information, it still requires a manually defined relation strength function, which can be labor intensive and suboptimal.

SemGCN is capable of incorporating both symmetric as well as asymmetric information jointly. Unlike SynGCN, SemGCN operates on a corpus-level directed labeled graph with words as nodes and edges representing semantic relationships among them from different sources. For instance, in Figure 2, semantic relations such as hyponymy, hypernymy and synonymy are represented together in a single graph. Symmetric information is handled by including a directed edge in both directions. Given the corpus-level graph G, the training procedure is similar to that of SynGCN, i.e., predict the word w based on its neighbors in G. Inspired by Faruqui et al. (2014), we preserve the semantics encoded in pre-trained embeddings by initializing both target and context embeddings with the given word representations and keeping target embeddings fixed during training. SemGCN uses Equation 1 to update node embeddings. Please note that in this case N^+(v) is used as the neighborhood definition to preserve the initial learned representation of the words.
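For concreteness, the following sketch assembles the kind of corpus-level labeled graph SemGCN operates on from a handful of relation triples; the words, triples, and relation sets are illustrative placeholders loosely based on Figure 2, not an excerpt of the actual knowledge sources.

```python
# Relations treated as symmetric get an edge in both directions; asymmetric
# relations (e.g. hypernym/hyponym) stay directed.
SYMMETRIC = {"synonym"}
triples = [
    ("water", "H2O", "synonym"),
    ("water", "aqua", "synonym"),
    ("water", "liquid", "hypernym"),   # liquid is a hypernym of water
    ("water", "snow", "hyponym"),      # snow is a hyponym of water
]

edges = []
for u, v, rel in triples:
    edges.append((u, v, rel))
    if rel in SYMMETRIC:
        edges.append((v, u, rel))      # symmetric information: edge in both directions

for edge in edges:
    print(edge)
# Training then predicts each word from its neighbours in this graph via
# Equation 1, with the pre-trained target embeddings kept fixed.
```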

7 Training Details

Given the GCN representation h_t of a word w_t, the training objective of SynGCN and SemGCN is to predict the target word given its neighbors in the graph. Formally, for each method we maximize the following objective¹:

    E = \sum_{t=1}^{|V|} \log P(w_t | w_1^t, w_2^t, ..., w_{N_t}^t)

where w_t is the target word and w_1^t, w_2^t, ..., w_{N_t}^t are its neighbors in the graph. The probability P(w_t | w_1^t, w_2^t, ..., w_{N_t}^t) is calculated using the softmax function, defined as

    P(w_t | w_1^t, w_2^t, ..., w_{N_t}^t) = \frac{ \exp(v_{w_t}^T h_t) }{ \sum_{i=1}^{|V|} \exp(v_{w_i}^T h_t) }.

Hence, E reduces to

    E = \sum_{t=1}^{|V|} ( v_{w_t}^T h_t − \log \sum_{i=1}^{|V|} \exp(v_{w_i}^T h_t) ),    (2)

where h_t is the GCN representation of the target word w_t and v_{w_t} is its target embedding.

The second term in Equation 2 is computationally expensive as the summation needs to be taken over the entire vocabulary. This can be overcome using several approximations like noise-contrastive estimation (Gutmann and Hyvärinen, 2010) and hierarchical softmax (Morin and Bengio, 2005). In our methods, we use negative sampling as used by Mikolov et al. (2013b).
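As a rough illustration of the negative-sampling approximation to Equation 2, the sketch below computes the per-word loss with a few randomly drawn negatives; the uniform sampling and the variable names are simplifying assumptions for the example (word2vec-style implementations typically draw negatives from a smoothed unigram distribution).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(h_t, target_vecs, pos_idx, neg_idx):
    """h_t: GCN representation of the target word; target_vecs: matrix of target
    embeddings v_w; pos_idx: index of the true word; neg_idx: sampled negatives."""
    pos = np.log(sigmoid(target_vecs[pos_idx] @ h_t))
    neg = np.sum(np.log(sigmoid(-target_vecs[neg_idx] @ h_t)))
    return -(pos + neg)                     # negative log-likelihood to minimise

rng = np.random.default_rng(0)
V, d = 1000, 8
target_vecs = rng.normal(scale=0.1, size=(V, d))
h_t = rng.normal(size=d)
loss = neg_sampling_loss(h_t, target_vecs, pos_idx=42,
                         neg_idx=rng.integers(0, V, size=5))
print(round(float(loss), 3))
```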

8 Experimental Setup

8.1 Dataset and Training

In our experiments, we use the Wikipedia² corpus for training the models. After discarding too long and too short sentences, we get an average sentence length of nearly 20 words. The corpus consists of 57 million sentences with 1.1 billion tokens and 1 billion syntactic dependencies.

¹We also experimented with a joint SynGCN and SemGCN model, but our preliminary experiments gave suboptimal performance as compared to the sequential model. This can be attributed to the fact that syntactic information is orders of magnitude greater than the semantic information available. Hence, the semantic constraints are not effectively utilized. We leave the analysis of the joint model as future work.

²https://dumps.wikimedia.org/enwiki/20180301/


8.2 Baselines

For evaluating SynGCN (Section 5), we compare against the following baselines:
• Word2vec is the continuous-bag-of-words model originally proposed by Mikolov et al. (2013b).
• GloVe (Pennington et al., 2014), a log-bilinear regression model which leverages global co-occurrence statistics of the corpus.
• Deps (Levy and Goldberg, 2014) is a modification of the skip-gram model which uses dependency context in place of sequential context.
• EXT (Komninos and Manandhar, 2016) is an extension of Deps which utilizes second-order dependency context features.

The SemGCN model (Section 6) is evaluated against the following methods:
• Retro-fit (Faruqui et al., 2014) is a post-processing procedure which uses similarity constraints from semantic knowledge sources.
• Counter-fit (Mrkšić et al., 2016), a method for injecting both antonym and synonym constraints into word embeddings.
• JointReps (Alsuhaibani et al., 2018), a joint word representation learning method which simultaneously utilizes the corpus and KB.

8.3 Evaluation method

To evaluate the effectiveness of our proposed methods, we compare them against the baselines on the following intrinsic and extrinsic tasks³:

• Intrinsic Tasks:
Word Similarity is the task of evaluating the closeness between semantically similar words. Following Komninos and Manandhar (2016); Pennington et al. (2014), we evaluate on the SimLex-999 (Hill et al., 2015), WS353 (Finkelstein et al., 2001), and RW (Luong et al., 2013) datasets (a minimal sketch of this evaluation protocol is given after this task list).
Concept Categorization involves grouping nominal concepts into natural categories. For instance, tiger and elephant should belong to the mammal class. In our experiments, we evaluate on the AP (Almuhareb, 2006), Battig (Baroni and Lenci, 2010), BLESS (Baroni and Lenci, 2011), and ESSLI (Baroni et al., 2008) datasets.
Word Analogy is the task of predicting a word b2, given three words a1, a2, and b1, such that the relation b1 : b2 is the same as the relation a1 : a2. We compare methods on the MSR (Mikolov et al., 2013c) and SemEval-2012 (Jurgens et al., 2012) datasets.

• Extrinsic Tasks:
Named Entity Recognition (NER) is the task of locating and classifying entity mentions into categories like person, organization, etc. We use Lee et al. (2018)'s model on the CoNLL-2003 dataset (Tjong Kim Sang and De Meulder, 2003) for evaluation.
Question Answering on the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) involves identifying the answer to a question as a segment of text from a given passage. Following Peters et al. (2018), we evaluate using Clark and Gardner (2018)'s model.
Part-of-speech (POS) tagging aims at associating with each word a unique tag describing its syntactic role. For evaluating word embeddings, we use Lee et al. (2018)'s model on the Penn Treebank POS dataset (Marcus et al., 1994).
Co-reference Resolution (Coref) involves identifying all expressions that refer to the same entity in the text. To inspect the effect of embeddings, we use Lee et al. (2018)'s model on the CoNLL-2012 shared task dataset (Pradhan et al., 2012).

³Details of hyperparameters are in the supplementary.
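The word-similarity protocol, for instance, reduces to a Spearman correlation between model cosine similarities and human ratings; the sketch below spells this out on a tiny made-up benchmark (the word pairs and scores are not from SimLex-999, WS353, or RW).

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_similarity_eval(embeddings, pairs_with_gold):
    """embeddings: dict word -> vector; pairs_with_gold: (w1, w2, human_score).
    Returns the Spearman correlation between model and human similarities."""
    model, gold = [], []
    for w1, w2, score in pairs_with_gold:
        if w1 in embeddings and w2 in embeddings:    # skip out-of-vocabulary pairs
            model.append(cosine(embeddings[w1], embeddings[w2]))
            gold.append(score)
    return spearmanr(model, gold).correlation

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["tiger", "cat", "car", "train"]}
pairs = [("tiger", "cat", 7.4), ("car", "train", 6.3), ("tiger", "car", 1.0)]
print(word_similarity_eval(emb, pairs))
```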

9 Results

In this section, we attempt to answer the following questions.

Q1. Does SynGCN learn better word embeddings than existing approaches? (Section 9.1)
Q2. Does SemGCN effectively handle diverse semantic information as compared to other methods? (Section 9.2)
Q3. How does SemGCN perform compared to other methods when provided with the same semantic constraints? (Section 9.3)
Q4. Does dependency context based embedding encode complementary information compared to ELMo? (Section 9.4)

9.1 SynGCN Evaluation

The evaluation results on intrinsic tasks – word similarity, concept categorization, and analogy – are summarized in Table 1. We report Spearman correlation for the word similarity and analogy tasks and cluster purity for the concept categorization task. Overall, we find that SynGCN, our proposed method, outperforms all the existing word embedding approaches in 9 out of 10 settings.


Columns are grouped as Word Similarity (WS353S, WS353R, SimLex999, RW), Concept Categorization (AP, Battig, BLESS, ESSLI), and Word Analogy (SemEval2012, MSR).

Method     WS353S  WS353R  SimLex999  RW    AP    Battig  BLESS  ESSLI  SemEval2012  MSR
Word2vec   71.4    52.6    38.0       30.0  63.2  43.3    77.8   63.0   18.9         44.0
GloVe      69.2    53.4    36.7       29.6  58.0  41.3    80.0   59.3   18.7         45.8
Deps       65.7    36.2    39.6       33.0  61.8  41.7    65.9   55.6   22.9         40.3
EXT        69.6    44.9    43.2       18.6  52.6  35.0    65.2   66.7   21.8         18.8
SynGCN     73.2    45.7    45.5       33.7  69.3  45.2    85.2   70.4   23.4         52.8

Table 1: SynGCN Intrinsic Evaluation: Performance on word similarity (Spearman correlation), concept categorization (cluster purity), and word analogy (Spearman correlation). Overall, SynGCN outperforms other existing approaches in 9 out of 10 settings. Please refer to Section 9.1 for more details.

Method     POS        SQuAD      NER        Coref
Word2vec   95.0±0.1   78.5±0.3   89.0±0.2   65.1±0.3
GloVe      94.6±0.1   78.2±0.2   89.1±0.1   64.9±0.2
Deps       95.0±0.1   77.8±0.3   88.6±0.3   64.8±0.1
EXT        94.9±0.2   79.6±0.1   88.0±0.1   64.8±0.1
SynGCN     95.4±0.1   79.6±0.2   89.5±0.1   65.8±0.1

Table 2: SynGCN Extrinsic Evaluation: Comparison on part-of-speech tagging (POS), question answering (SQuAD), named entity recognition (NER), and co-reference resolution (Coref). SynGCN performs comparably to or outperforms all existing approaches on all tasks. Refer to Section 9.1 for details.

The inferior performance of SynGCN and other dependency context based methods on the WS353R dataset compared to sequential context based methods is consistent with the observation reported in Levy and Goldberg (2014); Komninos and Manandhar (2016). This is because the syntactic context based embeddings capture functional similarity rather than topical similarity (as discussed in Section 1). On average, we obtain around 1.5%, 5.7% and 7.5% absolute increase in performance on word similarity, concept categorization and analogy tasks compared to the best performing baseline. The results demonstrate that the learned embeddings from SynGCN more effectively capture semantic and syntactic properties of words.

We also evaluate the performance of different word embedding approaches on the downstream tasks as defined in Section 8.3. The experimental results are summarized in Table 2. Overall, we find that SynGCN either outperforms or performs comparably to other methods on all four tasks. Compared to the sequential context based methods, dependency based methods perform better at the question answering task as they effectively encode syntactic information. This is consistent with the observation of Peters et al. (2018).

Method              POS        SQuAD      NER        Coref
X = SynGCN          95.4±0.1   79.6±0.2   89.5±0.1   65.8±0.1
Retro-fit (X,1)     94.8±0.1   79.6±0.1   88.8±0.1   66.0±0.2
Counter-fit (X,2)   94.7±0.1   79.8±0.1   88.3±0.3   65.7±0.3
JointReps (X,4)     95.4±0.1   79.4±0.3   89.1±0.3   65.6±0.1
SemGCN (X,4)        95.5±0.1   80.4±0.1   89.5±0.1   66.1±0.1

Table 3: SemGCN Extrinsic Evaluation: Comparison of different methods for incorporating diverse semantic constraints in SynGCN embeddings on all extrinsic tasks. Refer to Section 9.2 for details.

9.2 Evaluation with Diverse Semantic Information

In this section, we compare SemGCN against the methods listed in Section 8.2 for incorporating diverse semantic information in pre-trained embeddings. We use hypernym, hyponym, and antonym relations from WordNet, and synonym relations from PPDB as semantic information. For each method, we provide the semantic information that it can utilize, e.g., Retro-fit can only make use of the synonym relation⁴. In our results, M(X, R) denotes the fine-tuned embeddings obtained using method M while taking X as initialization embeddings. R denotes the types of semantic information used, as defined below.
• R=1: Only synonym information.
• R=2: Synonym and antonym information.
• R=4: Synonym, antonym, hypernym and hyponym information.
For instance, Counter-fit (GloVe, 2) represents GloVe embeddings fine-tuned by Counter-fit using synonym and antonym information.

Similar to Section 9.1, the evaluation is performed on the three intrinsic tasks. Due to space limitations, we report results on one representative dataset per task.

⁴Experimental results controlling for semantic information are provided in Section 9.3.


Init Embeddings (=X)  Word2vec           GloVe              Deps               EXT                SynGCN
Datasets              WS353  AP    MSR   WS353  AP    MSR   WS353  AP    MSR   WS353  AP    MSR   WS353  AP    MSR
Performance of X      63.0   63.2  44.0  58.0   60.4  45.8  55.6   64.2  40.3  59.3   53.5  18.8  61.7   69.3  52.8
Retro-fit (X,1)       63.4   67.8  46.7  58.5   61.1  47.2  54.8   64.7  41.0  61.6   55.1  40.5  61.2   67.1  51.4
Counter-fit (X,2)     60.3   62.9  31.4  53.7   62.5  29.6  46.9   60.4  33.4  52.0   54.4  35.8  55.2   66.4  31.7
JointReps (X,4)       60.9   61.1  28.5  59.2   55.5  37.6  54.8   58.7  38.0  58.8   54.8  20.6  60.9   68.2  24.9
SemGCN (X,4)          64.8   67.8  36.8  63.3   63.2  44.1  62.3   69.3  41.1  62.9   67.1  52.1  65.3   69.3  54.4

Table 4: SemGCN Intrinsic Evaluation: Evaluation of different methods for incorporating diverse semantic constraints initialized using various pre-trained embeddings (X). M(X, R) denotes the fine-tuned embeddings using method M taking X as initialization embeddings. R denotes the type of semantic relations used as defined in Section 9.2. SemGCN outperforms other methods in 13 out of 15 settings. SemGCN with SynGCN gives the best performance across all tasks. Please refer to Section 9.2 for details.

Figure 3: Comparison of different methods when provided with the same semantic information (synonym) for fine-tuning SynGCN embeddings. Results denote the F1-score on the SQuAD dataset. SemGCN gives considerable improvement in performance. Please refer to Section 9.3 for details.

The results are summarized in Table 4. We find that in 13 out of 15 settings, SemGCN outperforms other methods. Overall, we observe that SemGCN, when initialized with SynGCN, gives the best performance on all the tasks.

For comparing performance on the extrinsic tasks, we first fine-tune SynGCN embeddings using different methods for incorporating semantic information. The embeddings obtained by this process are then evaluated on extrinsic tasks, as in Section 9.1. The results are shown in Table 3. We observe that while the other methods do not always consistently give improvement over the baseline SynGCN, SemGCN is able to improve upon SynGCN in all settings (better or comparable). Overall, we observe that SynGCN along with SemGCN is the most suitable method for incorporating both syntactic and semantic information.

9.3 Evaluation with Same Semantic Information

In this section, we compare SemGCN against other baselines when provided with the same semantic information: synonyms from PPDB. Similar to Section 9.2, we compare both on intrinsic and extrinsic tasks with different initializations. The evaluation results of SynGCN embeddings fine-tuned by different methods on SQuAD are shown in Figure 3. The remaining results are included in the supplementary (Tables S1 and S2). We observe that, compared to other methods, SemGCN is most effective at incorporating semantic constraints across all the initializations and outperforms the others on both intrinsic and extrinsic tasks.

Method                   POS        SQuAD      NER        Coref
ELMo (E)                 96.1±0.1   81.8±0.2   90.3±0.3   67.8±0.1
E + SemGCN (SynGCN, 4)   96.2±0.1   82.4±0.1   90.9±0.1   68.3±0.1

Table 5: Comparison of ELMo with SynGCN and SemGCN embeddings on multiple extrinsic tasks. For each task, models use a linear combination of the provided embeddings whose weights are learned. Results show that our proposed methods encode complementary information which is not captured by ELMo. Please refer to Section 9.4 for more details.

9.4 Comparison with ELMo

Recently, ELMo (Peters et al., 2018) has been proposed, which fine-tunes word embeddings based on sentential context. In this section, we evaluate SynGCN and SemGCN embeddings when provided along with pre-trained ELMo embeddings. The results are reported in Table 5. They show that dependency context based embeddings encode complementary information which is not captured by ELMo, as it relies only on sequential context. Hence, our proposed methods serve as an effective combination with ELMo.
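The setup in Table 5, where each downstream model learns scalar weights for mixing the provided embeddings, can be sketched roughly as below; the softmax-normalised weights, the common projection size, and the variable names are assumptions made for the illustration, not details taken from the ELMo or WordGCN code.

```python
import numpy as np

def combine(embedding_list, weights, gamma=1.0):
    """Weighted combination of per-token embeddings from different sources
    (e.g. ELMo and SynGCN+SemGCN). In the downstream task model the weights
    are learned; here they are fixed only for illustration."""
    w = np.exp(weights) / np.exp(weights).sum()        # softmax-normalised mixing weights
    return gamma * sum(wi * emb for wi, emb in zip(w, embedding_list))

seq_len, dim = 6, 512                                  # assume both sources projected to a common size
elmo = np.random.randn(seq_len, dim)
syngcn_semgcn = np.random.randn(seq_len, dim)
mixed = combine([elmo, syngcn_semgcn], weights=np.array([0.2, -0.1]))
print(mixed.shape)                                     # (6, 512)
```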


10 Conclusion

In this paper, we have proposed SynGCN, a graph convolution based approach which utilizes syntactic context for learning word representations. SynGCN overcomes the problem of vocabulary explosion and outperforms state-of-the-art word embedding approaches on several intrinsic and extrinsic tasks. We also propose SemGCN, a framework for jointly incorporating diverse semantic information in pre-trained word embeddings. The combination of SynGCN and SemGCN gives the best overall performance. We make the source code of both models available to encourage reproducible research.

Acknowledgments

We thank the anonymous reviewers for their constructive comments. This work is supported in part by the Ministry of Human Resource Development (Government of India) and Google PhD Fellowship.

References

Abdulrahman Almuhareb. 2006. Attributes in lexical acquisition.

Mohammed Alsuhaibani, Danushka Bollegala, Takanori Maehara, and Ken-ichi Kawarabayashi. 2018. Jointly learning word embeddings using a corpus and a knowledge base. PLOS ONE, 13(3):1–26.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, ACL '98, pages 86–90.

Marco Baroni, Stefan Evert, and Alessandro Lenci. 2008. ESSLLI 2008 workshop on distributional lexical semantics. Association for Logic, Language and Information.

Marco Baroni and Alessandro Lenci. 2010. Distributional memory: A general framework for corpus-based semantics. Comput. Linguist., 36(4):673–721.

Marco Baroni and Alessandro Lenci. 2011. How we BLESSed distributional semantic evaluation. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, GEMS '11, pages 1–10, Stroudsburg, PA, USA. Association for Computational Linguistics.

Joost Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Simaan. 2017. Graph convolutional encoders for syntax-aware neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1957–1967. Association for Computational Linguistics.

Y. Bengio, A. Courville, and P. Vincent. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155.

M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. 2017. Geometric deep learning: Going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42.

Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 845–855. Association for Computational Linguistics.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537.

Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. CoRR, abs/1606.09375.

Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard H. Hovy, and Noah A. Smith. 2014. Retrofitting word vectors to semantic lexicons. CoRR, abs/1411.4166.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In Proceedings of the 10th International Conference on World Wide Web, WWW '01, pages 406–414, New York, NY, USA. ACM.

Michael Gutmann and Aapo Hyvärinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 297–304, Chia Laguna Resort, Sardinia, Italy. PMLR.

Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with genuine similarity estimation. Comput. Linguist., 41(4):665–695.

Shihao Ji, Hyokun Yun, Pinar Yanardag, Shin Matsushima, and S. V. N. Vishwanathan. 2015. WordRank: Learning word embeddings via robust ranking. CoRR, abs/1506.02761.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

David A. Jurgens, Peter D. Turney, Saif M. Mohammad, and Keith J. Holyoak. 2012. SemEval-2012 task 2: Measuring degrees of relational similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, SemEval '12, pages 356–364, Stroudsburg, PA, USA. Association for Computational Linguistics.

Douwe Kiela, Felix Hill, and Stephen Clark. 2015. Specializing word embeddings for similarity or relatedness. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2044–2048. Association for Computational Linguistics.

Thomas N. Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907.

Alexandros Komninos and Suresh Manandhar. 2016. Dependency based embeddings for sentence classification tasks.

Kenton Lee, Luheng He, and Luke Zettlemoyer. 2018. Higher-order coreference resolution with coarse-to-fine inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 687–692. Association for Computational Linguistics.

Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 302–308. Association for Computational Linguistics.

Chen Li, Jianxin Li, Yangqiu Song, and Ziwei Lin. 2018. Training and evaluating improved dependency-based word embeddings. In AAAI.

Minh-Thang Luong, Richard Socher, and Christopher D. Manning. 2013. Better word representations with recursive neural networks for morphology. In CoNLL, Sofia, Bulgaria.

Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. CoRR, abs/1603.01354.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1506–1515. Association for Computational Linguistics.

Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. 1994. The Penn Treebank: Annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology, HLT '94, pages 114–119, Stroudsburg, PA, USA. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pages 3111–3119, USA. Curran Associates Inc.

Tomas Mikolov, Scott Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013). Association for Computational Linguistics.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013c. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751. Association for Computational Linguistics.

George A. Miller. 1995. WordNet: A lexical database for English. Commun. ACM, 38(11):39–41.

Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, AISTATS 2005, Bridgetown, Barbados, January 6-8, 2005.

Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Counter-fitting word vectors to linguistic constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 142–148. Association for Computational Linguistics.

Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2015. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of the Association for Computational Linguistics (ACL-2015), pages 425–430. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Joint Conference on EMNLP and CoNLL - Shared Task, CoNLL '12, pages 1–40, Stroudsburg, PA, USA. Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392. Association for Computational Linguistics.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing with compositional vector grammars. In ACL.

Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5027–5038. Association for Computational Linguistics.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1556–1566. Association for Computational Linguistics.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, CONLL '03, pages 142–147, Stroudsburg, PA, USA. Association for Computational Linguistics.

Shikhar Vashishth, Shib Sankar Dasgupta, Swayambhu Nath Ray, and Partha Talukdar. 2018a. Dating documents using graph convolution networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1605–1615. Association for Computational Linguistics.

Shikhar Vashishth, Rishabh Joshi, Sai Suman Prayaga, Chiranjib Bhattacharyya, and Partha Talukdar. 2018b. RESIDE: Improving distantly-supervised neural relation extraction using side information. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1257–1266. Association for Computational Linguistics.

Shikhar Vashishth, Prateek Yadav, Manik Bhandari, and Partha Talukdar. 2019. Confidence-based graph convolutional networks for semi-supervised learning. In Proceedings of Machine Learning Research, volume 89, pages 1792–1801. PMLR.

Chang Xu, Yalong Bai, Jiang Bian, Bin Gao, Gang Wang, Xiaoguang Liu, and Tie-Yan Liu. 2014. RC-NET: A general framework for incorporating knowledge into word representations.

Prateek Yadav, Madhav Nimishakavi, Naganand Yadati, Shikhar Vashishth, Arun Rajkumar, and Partha Talukdar. 2019. Lovász convolutional networks. In Proceedings of Machine Learning Research, volume 89, pages 1978–1987. PMLR.

Liang Yao, Chengsheng Mao, and Yuan Luo. 2018. Graph convolutional networks for text classification. ArXiv e-prints, arXiv:1809.05679.

Mo Yu and Mark Dredze. 2014. Improving lexical embeddings with semantic knowledge. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 545–550. Association for Computational Linguistics.

Yuhao Zhang, Peng Qi, and Christopher D. Manning. 2018. Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2205–2215. Association for Computational Linguistics.