
DEEP GAUSSIAN EMBEDDING OF GRAPHS: UNSUPERVISED INDUCTIVE LEARNING VIA RANKING

Aleksandar Bojchevski, Stephan Günnemann
Technical University of Munich, Germany
{a.bojchevski,guennemann}@in.tum.de

ABSTRACT

Methods that learn representations of nodes in a graph play a critical role in network analysis since they enable many downstream learning tasks. We propose Graph2Gauss – an approach that can efficiently learn versatile node embeddings on large scale (attributed) graphs that show strong performance on tasks such as link prediction and node classification. Unlike most approaches that represent nodes as point vectors in a low-dimensional continuous space, we embed each node as a Gaussian distribution, allowing us to capture uncertainty about the representation. Furthermore, we propose an unsupervised method that handles inductive learning scenarios and is applicable to different types of graphs: plain/attributed, directed/undirected. By leveraging both the network structure and the associated node attributes, we are able to generalize to unseen nodes without additional training. To learn the embeddings we adopt a personalized ranking formulation w.r.t. the node distances that exploits the natural ordering of the nodes imposed by the network structure. Experiments on real world networks demonstrate the high performance of our approach, outperforming state-of-the-art network embedding methods on several different tasks. Additionally, we demonstrate the benefits of modeling uncertainty – by analyzing it we can estimate neighborhood diversity and detect the intrinsic latent dimensionality of a graph.

1 INTRODUCTION

Graphs are a natural representation for a wide variety of real-life data, from social and rating networks (Facebook, Amazon), to gene interactions and citation networks (BioGRID, arXiv). Node embeddings are a powerful and increasingly popular approach to analyze such data (Cai et al., 2017). By operating in the embedding space, one can employ proven learning techniques and bypass the difficulty of incorporating the complex node interactions. Tasks such as link prediction, node classification, community detection, and visualization all greatly benefit from these latent node representations. Furthermore, for attributed graphs, by leveraging both sources of information (network structure and attributes) one is able to learn more useful representations compared to approaches that only consider the graph (Yang et al., 2015; Pan et al., 2016; Ganguly & Pudi, 2017).

All existing (attributed) graph embedding approaches represent each node by a single point in a low-dimensional continuous vector space. Representing the nodes simply as points, however, has a crucial limitation: we do not have information about the uncertainty of that representation. Yet uncertainty is inherent when describing a node in a complex graph by a single point only. Imagine a node for which the different sources of information are conflicting with each other, e.g. pointing to different communities or even revealing contradicting underlying patterns. Such discrepancy should be reflected in the uncertainty of its embedding. As a solution to this problem, we introduce a novel embedding approach that represents nodes as Gaussian distributions: each node becomes a full distribution rather than a single point. Thereby, we capture uncertainty about its representation.

To effectively capture the non-i.i.d. nature of the data arising from the complex interactions between the nodes, we further propose a novel unsupervised personalized ranking formulation to learn the embeddings. Intuitively, from the point of view of a single node, we want nodes in its immediate neighborhood to be closest in the embedding space, while nodes multiple hops away should become increasingly more distant. This ordering between the nodes imposed by the network structure w.r.t. the distances between their embeddings naturally leads to our ranking formulation. Taking into account this natural ranking from each node's point of view, we learn more powerful embeddings since we incorporate information about the network structure beyond first and second order proximity.

Furthermore, when node attributes (e.g. text) are available our method is able to leverage them to easily generate embeddings for previously unseen nodes without additional training. In other words, Graph2Gauss is inductive, which is a significant benefit over existing methods that are inherently transductive and do not naturally generalize to unseen nodes. This desirable inductive property comes from the fact that we are learning an encoder that maps the nodes' attributes to embeddings.

The main contributions of our approach are summarized as follows:

a) We embed nodes as Gaussian distributions allowing us to capture uncertainty.

b) Our unsupervised personalized ranking formulation exploits the natural ordering of the nodes, capturing the network structure at multiple scales.

c) We propose an inductive method that generalizes to unseen nodes and is applicable to different types of graphs: plain/attributed, directed/undirected.

2 RELATED WORK

The focus of this paper is on unsupervised learning of node embeddings, for which many different approaches have been proposed. For a comprehensive recent survey see Cai et al. (2017), Hamilton et al. (2017), or Goyal & Ferrara (2017). Approaches such as DeepWalk and node2vec (Perozzi et al., 2014; Grover & Leskovec, 2016) look at plain graphs and learn an embedding based on random walks by extending or adapting the Skip-Gram (Mikolov et al., 2013) architecture. LINE (Tang et al., 2015b) uses first- and second-order proximity and trains the embedding via negative sampling. SDNE (Wang et al., 2016) similarly has a component that preserves second-order proximity and exploits first-order proximity to refine the representations. GraRep (Cao et al., 2015) is a factorization-based method that considers local and global structural information.

Tri-Party Deep Network Representation (TRIDNR) (Pan et al., 2016) considers node attributes, network structure, and potentially node labels. CENE (Sun et al., 2016), similarly to Ganguly & Pudi (2017), treats the attributes as special kinds of nodes and learns embeddings on the augmented network. Text-Associated DeepWalk (TADW) (Yang et al., 2015) performs low-rank matrix factorization considering graph structure and text features. Heterogeneous networks are considered in (Tang et al., 2015a; Chang et al., 2015), while Huang et al., similarly to Pan et al. (2016), considers labels. GraphSAGE (Hamilton et al., 2017) is an inductive method that generates embeddings by sampling and aggregating attributes from a node's local neighborhood and requires the edges of the new nodes.

Graph convolutional networks are another family of approaches that adapt conventional CNNs to graph data (Kipf & Welling, 2016a; Defferrard et al., 2016; Henaff et al., 2015; Monti et al., 2016; Niepert et al., 2016; Pham et al., 2017). They utilize the graph Laplacian and the spectral definition of a convolution and boil down to some form of aggregation over neighbors such as averaging. They can be thought of as implicitly learning an embedding, e.g. by taking the output of the last layer before the supervised component. See Monti et al. (2016) for an overview. In contrast to this paper, most of these methods are (semi-)supervised. The graph variational autoencoder (GAE) (Kipf & Welling, 2016b) is a notable exception that learns node embeddings in an unsupervised manner.

Few approaches consider the idea of learning an embedding that is a distribution. Vilnis & McCallum (2014) are the first to learn Gaussian word embeddings to capture uncertainty. Closest to our work, He et al. (2015) represent knowledge graphs and Dos Santos et al. (2016) study heterogeneous graphs for node classification. Both approaches are not applicable in the context of unsupervised learning of (attributed) graphs that we are interested in. The method in He et al. (2015) learns an embedding for each component of the triplets (head, tail, relation) in the knowledge graph. Note that we cannot naively employ this method by considering a single relation "has an edge" and a single entity "node". Since their approach considers similarity between entities and relations, all nodes would be trivially similar to the single relation. Considering the semi-supervised approach proposed in Dos Santos et al. (2016), we cannot simply "turn off" the supervised component to adapt their method for unsupervised learning, since given the defined loss we would trivially map all nodes to the same Gaussian. Additionally, both of these approaches do not consider node attributes.


3 DEEP GAUSSIAN EMBEDDING

In this section we introduce our method Graph2Gauss (G2G) and detail how both the attributes and the network structure influence the learning of node representations. The embedding is carried out in two steps: (i) the node attributes are passed through a non-linear transformation via a deep neural network (encoder) and yield the parameters associated with the node's embedding distribution; (ii) we formulate an unsupervised loss function that incorporates the natural ranking of the nodes, as given by the network structure, w.r.t. a dissimilarity measure on the embedding distributions.

Problem definition. Let $G = (A, X)$ be a directed attributed graph, where $A \in \mathbb{R}^{N \times N}$ is an adjacency matrix representing the edges between $N$ nodes and $X \in \mathbb{R}^{N \times D}$ collects the attribute information for each node, where $x_i$ is a $D$-dimensional attribute vector of the $i$th node.¹ $V$ denotes the set of all nodes. We aim to find a lower-dimensional Gaussian distribution embedding $h_i = \mathcal{N}(\mu_i, \Sigma_i)$, $\mu_i \in \mathbb{R}^L$, $\Sigma_i \in \mathbb{R}^{L \times L}$ with $L \ll N, D$, such that nodes similar w.r.t. attributes and network structure are also similar in the embedding space given a dissimilarity measure $\Delta(h_i, h_j)$. In Fig. 5(a), for example, we show nodes that are embedded as two-dimensional Gaussians.

3.1 NETWORK STRUCTURE REPRESENTATION VIA PERSONALIZED RANKING

To capture the structural information of the network in the embedding space, we propose a personalized ranking approach. That is, locally per node $i$ we impose a ranking of all remaining nodes w.r.t. their distance to node $i$ in the embedding space. More precisely, in this paper we exploit the $k$-hop neighborhoods of each node. Given some anchor node $i$, we define $N_{ik} = \{ j \in V \mid i \neq j, \min(sp(i,j), K) = k \}$ to be the set of nodes that are exactly $k$ hops away from node $i$, where $V$ is the set of all nodes, $K$ is a hyper-parameter denoting the maximum distance we are willing to consider, and $sp(i, j)$ returns either the length of the shortest path starting at node $i$ and ending at node $j$, or $\infty$ if node $j$ is not reachable.

Intuitively, we want all nodes belonging to the 1-hop neighborhood of $i$ to be closer to $i$ w.r.t. their embedding, compared to all nodes in its 2-hop neighborhood, which in turn are closer than the nodes in its 3-hop neighborhood, and so on up to $K$. Thus, the ranking that we want to ensure from the perspective of node $i$ is

$$\Delta(h_i, h_{k_1}) < \Delta(h_i, h_{k_2}) < \cdots < \Delta(h_i, h_{k_K}) \quad \forall k_1 \in N_{i1}, \forall k_2 \in N_{i2}, \ldots, \forall k_K \in N_{iK}$$

or equivalently, we aim to satisfy the following pairwise constraints

$$\Delta(h_i, h_j) < \Delta(h_i, h_{j'}), \quad \forall i \in V, \; \forall j \in N_{ik}, \; \forall j' \in N_{ik'}, \; \forall k < k'$$

Going beyond mere first-order and second-order proximity, this enables us to capture the network structure at multiple scales, incorporating local and global structure.
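To make the construction of the sets $N_{ik}$ concrete, the following Python sketch computes them with a breadth-first search truncated at $K$ hops. The adjacency-list representation and the function name are our own choices for illustration; this is not the authors' code.

```python
from collections import deque

def khop_neighborhoods(adj, i, K):
    """Return {k: set of nodes with min(sp(i, j), K) = k} for k = 1..K.

    adj: dict mapping every node in V to an iterable of its successors.
    Nodes that are unreachable (or farther than K hops) from i fall into the
    K-th bucket, mirroring the definition min(sp(i, j), K) = k in the paper.
    """
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        if dist[u] == K:                 # no need to expand beyond K hops
            continue
        for v in adj.get(u, ()):
            if v not in dist:            # first visit gives the shortest-path distance
                dist[v] = dist[u] + 1
                queue.append(v)
    hoods = {k: set() for k in range(1, K + 1)}
    for j, d in dist.items():
        if j != i:
            hoods[min(d, K)].add(j)
    hoods[K] |= set(adj) - set(dist)     # unreachable nodes also land in the K-th set
    return hoods
```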

Dissimilarity measure. To solve the above ranking task we have to define a suitable dissimilarity measure between the latent representations of two nodes. Since our latent representations are distributions, similarly to Dos Santos et al. (2016) and He et al. (2015) we employ the asymmetric KL divergence. This has the additional benefit of handling directed graphs in a sound way. More specifically, given the latent Gaussian distribution representations of two nodes $h_i, h_j$ we define

$$\Delta(h_i, h_j) = D_{KL}(\mathcal{N}_j \,\|\, \mathcal{N}_i) = \frac{1}{2}\left[\operatorname{tr}(\Sigma_i^{-1}\Sigma_j) + (\mu_i - \mu_j)^T \Sigma_i^{-1} (\mu_i - \mu_j) - L - \log\frac{\det(\Sigma_j)}{\det(\Sigma_i)}\right]$$

Here we use the notation $\mu_i, \Sigma_i$ to denote the outputs of some functions $\mu_\theta(x_i)$ and $\Sigma_\theta(x_i)$ applied to the attributes $x_i$ of node $i$, and $\operatorname{tr}(\cdot)$ denotes the trace of a matrix. The asymmetric KL divergence also applies to the case of an undirected graph by simply processing both directions of the edge. We could alternatively use a symmetric dissimilarity measure such as the Jensen-Shannon divergence or the expected likelihood (probability product kernel).
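For diagonal covariance matrices (the case we focus on, see Sec. 3.2) the divergence reduces to elementwise operations. The following NumPy sketch is an illustrative implementation of $\Delta(h_i, h_j)$, not the reference code.

```python
import numpy as np

def kl_diag_gaussians(mu_i, sigma_i, mu_j, sigma_j):
    """D_KL(N_j || N_i) for diagonal Gaussians, matching Delta(h_i, h_j) above.

    mu_*: mean vectors of shape (L,); sigma_*: diagonal variances of shape (L,).
    """
    L = mu_i.shape[0]
    trace_term = np.sum(sigma_j / sigma_i)                    # tr(Sigma_i^{-1} Sigma_j)
    quad_term = np.sum((mu_i - mu_j) ** 2 / sigma_i)          # Mahalanobis term
    logdet_term = np.sum(np.log(sigma_j)) - np.sum(np.log(sigma_i))
    return 0.5 * (trace_term + quad_term - L - logdet_term)
```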

3.2 DEEP ENCODER

The functions $\mu_\theta(x_i)$ and $\Sigma_\theta(x_i)$ are deep feed-forward non-linear neural networks parametrized by $\theta$. It is important to note that these parameters are shared across instances and thus enjoy statistical strength benefits. Additionally, we design $\mu_\theta(x_i)$ and $\Sigma_\theta(x_i)$ such that they share parameters as well. More specifically, a deep encoder $f_\theta(x_i)$ processes the node's attributes and outputs an intermediate hidden representation, which is then in turn used to output $\mu_i$ and $\Sigma_i$ in the final layer of the architecture. We focus on diagonal covariance matrices.² The mapping from the nodes' attributes to their embedding via the deep encoder is precisely what enables the inductiveness of Graph2Gauss.

¹ Note: in the absence of node attributes we can simply use a one-hot encoding of the nodes (i.e. X = I, where I is the identity matrix) and/or any other derived features such as node degrees.
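To make the encoder concrete, the following NumPy sketch shows a single forward pass with one shared hidden layer and a diagonal covariance, following the default architecture described in the appendix. The parameter names and shapes are placeholders for illustration; this is not the reference implementation.

```python
import numpy as np

def elu(x):
    """Exponential linear unit: x for x > 0, exp(x) - 1 otherwise."""
    return np.where(x > 0, x, np.exp(x) - 1.0)

def encode(x, W, b, W_mu, b_mu, W_sigma, b_sigma):
    """Map an attribute vector x (shape (D,)) to (mu, sigma) of a node's Gaussian embedding.

    W: (D, s1), b: (s1,), W_mu/W_sigma: (s1, L), b_mu/b_sigma: (L,).
    All weights are shared across nodes; sigma holds the diagonal of Sigma_i.
    """
    h = np.maximum(x @ W + b, 0.0)             # shared hidden representation (ReLU)
    mu = h @ W_mu + b_mu                       # mean of the embedding
    sigma = elu(h @ W_sigma + b_sigma) + 1.0   # strictly positive diagonal variances
    return mu, sigma
```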

3.3 LEARNING VIA ENERGY-BASED LOSS

Since it is intractable to find a solution that satisfies all of the pairwise constraints defined in Sec. 3.1, we turn to an energy-based learning approach. The idea is to define an objective function that penalizes ranking errors given the energy of the pairs. More specifically, denoting the KL divergence between two nodes as the respective energy, $E_{ij} = D_{KL}(\mathcal{N}_j \| \mathcal{N}_i)$, we define the following loss to be optimized

$$\mathcal{L} = \sum_i \sum_{k<l} \sum_{j_k \in N_{ik}} \sum_{j_l \in N_{il}} \left( E_{ij_k}^2 + \exp(-E_{ij_l}) \right) = \sum_{(i, j_k, j_l) \in D_t} \left( E_{ij_k}^2 + \exp(-E_{ij_l}) \right) \qquad (1)$$

where $D_t = \{(i, j_k, j_l) \mid sp(i, j_k) < sp(i, j_l)\}$ is the set of all valid triplets. The $E_{ij_k}$ terms are positive examples whose energy should be lower compared to the energy of the negative examples $E_{ij_l}$. Here, we employed the so-called square-exponential loss (LeCun et al., 2006) which, unlike other typically used losses (e.g. the hinge loss), does not have a fixed margin and pushes the energy of the negative terms to infinity with exponentially decreasing force. In our setting, for a given anchor node $i$, the energy $E_{ij}$ should be lowest for nodes $j$ in its 1-hop neighborhood, followed by a higher energy for nodes in its 2-hop neighborhood, and so on.
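As a small illustration, the per-triplet term of Eq. 1 can be written directly in NumPy for a batch of precomputed energies. This is a sketch of the loss term only, not the training code.

```python
import numpy as np

def square_exponential_loss(E_pos, E_neg):
    """Square-exponential ranking loss (LeCun et al., 2006) over energy pairs.

    E_pos: energies E_{ij_k} of positive examples (closer hop distance), shape (n,).
    E_neg: energies E_{ij_l} of negative examples (farther hop distance), shape (n,).
    Pushes positive energies toward zero and negative energies toward infinity.
    """
    return np.sum(E_pos ** 2 + np.exp(-E_neg))
```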

Finally, we can optimize the parameters $\theta$ of the deep encoder such that the loss $\mathcal{L}$ is minimized and the pairwise rankings are satisfied. Note again that the parameters are shared across all instances, meaning that we share statistical strength and can learn them more easily in comparison to treating the distribution parameters (e.g. $\mu_i$, $\Sigma_i$) independently as free variables. The parameters are optimized using Adam (Kingma & Ba, 2014) with a fixed learning rate of 0.001.

Sampling strategy. For large graphs, the complete loss is intractable to compute, confirming the need for a stochastic variant. The naive approach would be to sample triplets from $D_t$ uniformly, i.e. replace $\sum_{(i, j_k, j_l) \in D_t}$ with $\mathbb{E}_{(i, j_k, j_l) \sim D_t}$ in Eq. 1. However, with the naive sampling we are less likely to sample triplets that involve low-degree nodes, since high-degree nodes occur in many more pairwise constraints. This in turn means that we update the embedding of low-degree nodes less often, which is not desirable. Therefore, we propose an alternative node-anchored sampling strategy. Intuitively, for every node $i$, we randomly sample one other node from each of its neighborhoods (1-hop, 2-hop, etc.) and then optimize over all the corresponding pairwise constraints $(E_{i1} < E_{i2}, \ldots, E_{i1} < E_{iK}, E_{i2} < E_{i3}, \ldots, E_{i2} < E_{iK}, \ldots, E_{i,K-1} < E_{iK})$.

Naively applying the node-anchored sampling strategy and optimizing Eq. 1, however, would lead to biased estimates of the gradient. Theorem 1 shows how to adapt the loss such that it is equal in expectation to the original loss under our new sampling strategy. As a consequence, we have unbiased estimates of the gradient using stochastic optimization of the reformulated loss.

Theorem 1. For all $i$, let $(j_1, \ldots, j_K)$ be independent uniform random samples from the sets $(N_{i1}, \ldots, N_{iK})$ and $|N_{i*}|$ the cardinality of each set. Then $\mathcal{L}$ is equal in expectation to

$$\mathcal{L}_s = \sum_i \mathbb{E}_{(j_1, \ldots, j_K) \sim (N_{i1}, \ldots, N_{iK})} \left[ \sum_{k<l} |N_{ik}| \cdot |N_{il}| \cdot \left( E_{ij_k}^2 + \exp(-E_{ij_l}) \right) \right] = \mathcal{L} \qquad (2)$$

We provide the proof in the appendix. For cases where the number of nodes $N$ is particularly large we can further subsample mini-batches by selecting anchor nodes $i$ at random. Furthermore, in our experimental study, we analyze the effect of the sampling strategy on convergence, as well as the quality of the stochastic variant w.r.t. the obtained solution and the reached local optima.
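The reweighted, node-anchored estimate of Eq. 2 for a single anchor node can be sketched as follows in Python. It reuses the hypothetical khop_neighborhoods helper from Sec. 3.1 and an energy(i, j) callable; both names are our own placeholders.

```python
import random
import numpy as np

def sampled_loss_for_anchor(i, hoods, energy):
    """Node-anchored, reweighted loss term for a single anchor node i (cf. Eq. 2).

    hoods: dict {k: set of nodes in N_ik}; energy: callable returning E_ij = KL(N_j || N_i).
    One node is sampled uniformly from each non-empty k-hop neighborhood; every pairwise
    constraint k < l is reweighted by |N_ik| * |N_il|, which keeps the estimate unbiased.
    """
    samples = {k: random.choice(tuple(nodes)) for k, nodes in hoods.items() if nodes}
    energies = {k: energy(i, j) for k, j in samples.items()}
    loss = 0.0
    for k in energies:
        for l in energies:
            if k < l:
                weight = len(hoods[k]) * len(hoods[l])
                loss += weight * (energies[k] ** 2 + np.exp(-energies[l]))
    return loss
```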

² To ensure that the covariance matrices are positive definite, in the final layer we output raw values $\hat{\sigma}_{id} \in \mathbb{R}$ and obtain $\sigma_{id} = \mathrm{elu}(\hat{\sigma}_{id}) + 1$.


3.4 DISCUSSION

Inductive learning. While during learning we need both the network structure (to evaluate the ranking loss) and the attributes, once the learning concludes, the embedding for a node can be obtained solely based on its attributes. This enables our method to easily handle the issue of obtaining representations for new nodes that were not part of the network during training. To do so we simply pass the attributes of the new node through our learned deep encoder. Most approaches cannot handle this issue at all, with notable exceptions being SDNE and GraphSAGE (Wang et al., 2016; Hamilton et al., 2017). However, both approaches require the edges of the new node to get the node's representation, and cannot handle nodes that have no existing connections. In contrast, our method can handle even such nodes, since after the model is learned we rely only on the attribute information.

Plain graph embedding. Even though attributed graphs are often found in the real world, sometimes it is desirable to analyze plain graphs. As already discussed, our method easily handles plain graphs, when the attributes are not available, by using a one-hot encoding of the nodes instead. As we later show in the experiments, we are able to learn useful representations in this scenario, even outperforming some attributed approaches. Naturally, in this case we lose the inductive ability to handle unseen nodes. We compare the one-hot encoding version, termed G2G_oh, with our full method G2G that utilizes the attributes, as well as all remaining competitors.

Encoder architecture. Depending on the type of the node attributes (e.g. images, text) we could in principle use CNNs/RNNs to process them. We could also easily incorporate any of the proposed graph convolutional layers, inheriting their benefits. However, we observe that in practice using a simple feed-forward architecture with rectifier units is sufficient, while being much faster and easier to train. Better yet, we observed that Graph2Gauss is not sensitive to the choice of hyperparameters such as the number and size of hidden layers. We provide more detailed information and sensible defaults in the appendix.

Complexity. The time complexity for computing the original loss is $O(N^3)$ where $N$ is the number of nodes. Using our node-anchored sampling strategy, the complexity of the stochastic version is $O(K^2 N)$ where $K$ is the maximum distance considered. Since a small value of $K \le 2$ consistently showed good performance, $K^2$ becomes negligible and thus the complexity is $O(N)$, meaning linear in the number of nodes. This, coupled with the small number of epochs $T$ needed for convergence ($T \le 2000$ for all shown experiments, see e.g. Fig. 3(b)) and an efficient GPU implementation, also made our method faster than most competitors in terms of wall-clock time.

4 EMBEDDING EVALUATION

We compare Graph2Gauss with and without considering attributes (G2G, G2G_oh) to several competitors, namely: TRIDNR and TADW (Pan et al., 2016; Yang et al., 2015) as representatives that consider attributed graphs, GAE (Kipf & Welling, 2016b) as the unsupervised graph convolutional representative, and node2vec (Grover & Leskovec, 2016) as a representative of the random-walk-based plain graph embeddings. Additionally, we include a strong Logistic Regression baseline that considers only the attributes. As with all other methods we train TRIDNR in an unsupervised manner; however, since it can only process raw text as attributes (rather than e.g. bag-of-words) it is not always applicable. Furthermore, since TADW and GAE only support undirected graphs we must symmetrize the graph before using them – giving them a substantial advantage, especially in the link prediction task. Moreover, in all experiments, if the competing techniques use an L-dimensional embedding, G2G's embedding is actually only half of this dimensionality so that the overall number of 'parameters' per node (mean vector + variance terms of the diagonal Σ_i) matches L.

Dataset description. We use several attributed graph datasets. Cora (McCallum et al., 2000) is a well-known citation network labeled based on the paper topic. While most approaches report on a small subset of this dataset, we additionally extract the entire network from the original data and name these two datasets CORA (N = 19793, E = 65311, D = 8710, K = 70) and CORA-ML (N = 2995, E = 8416, D = 2879, K = 7) respectively. CITESEER (N = 4230, E = 5358, D = 2701, K = 6) (Giles et al., 1998), DBLP (Pan et al., 2016) (N = 17716, E = 105734, D = 1639, K = 4) and PUBMED (N = 18230, E = 79612, D = 500, K = 3) (Sen et al., 2008) are other commonly used citation datasets. We provide all datasets, the source code of G2G, and further supplementary material at https://www.kdd.in.tum.de/g2g.


4.1 LINK PREDICTION

Setup. Link prediction is a commonly used task to demonstrate the meaningfulness of the embeddings. To evaluate the performance we hide a set of edges/non-edges from the original graph and train on the resulting graph. Similarly to Kipf & Welling (2016b) and Wang et al. (2016) we create a validation/test set that contains 5%/10% randomly selected edges respectively and an equal number of randomly selected non-edges. We used the validation set for hyper-parameter tuning and early stopping, and the test set only to report the performance. Following convention, we report the area under the ROC curve (AUC) and the average precision (AP) scores for each method. To rank the candidate edges we use the negative energy $-E_{ij}$ for Graph2Gauss, and the exact same approach as in the respective original methods (e.g. dot product of the embeddings).
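For illustration, this evaluation can be expressed with scikit-learn's standard metrics. The sketch below assumes a hypothetical energy(i, j) function returning $E_{ij}$; it is not the exact evaluation script used for the experiments.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate_link_prediction(candidates, labels, energy):
    """Rank candidate pairs by the negative energy -E_ij and report AUC / AP.

    candidates: list of (i, j) node pairs (held-out edges and non-edges).
    labels: binary array, 1 for true edges, 0 for non-edges.
    energy: callable energy(i, j) returning E_ij = KL(N_j || N_i).
    """
    scores = np.array([-energy(i, j) for i, j in candidates])
    return roc_auc_score(labels, scores), average_precision_score(labels, scores)
```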

Performance on real-world datasets. Table 1 shows the performance on the link prediction task for different datasets and embedding size L = 128. As we can see, our method significantly outperforms the competitors across all datasets, which is a strong sign that the learned embeddings are useful. Furthermore, even the constrained version of our method, G2G_oh, that does not consider attributes at all outperforms the competitors on some datasets. While GAE achieves comparable performance on some of the datasets, their approach does not scale to large graphs. In fact, for graphs beyond 15K nodes we had to revert to slow training on the CPU since the data did not fit into the GPU memory (12 GB). The simple Logistic Regression baseline showed surprisingly strong performance, even outperforming some of the more complicated methods.

Table 1: Link prediction performance for real-world datasets with L = 128. Each cell reports AUC / AP.

Method                             | Cora-ML       | Cora          | Citeseer      | DBLP          | Pubmed        | Cora-ML Easy
Logistic Regression                | 90.01 / 89.75 | 86.58 / 86.51 | 81.70 / 79.10 | 82.04 / 81.91 | 90.50 / 90.99 | 90.28 / 90.99
node2vec (Grover & Leskovec, 2016) | 76.80 / 75.26 | 79.95 / 78.98 | 83.04 / 83.74 | 95.42 / 95.33 | 95.42 / 95.33 | 93.47 / 93.53
TADW (Yang et al., 2015)           | 81.26 / 81.34 | 76.56 / 78.06 | 70.14 / 72.93 | 65.67 / 59.85 | 62.72 / 68.02 | 83.53 / 82.47
TRIDNR (Pan et al., 2016)          | 84.51 / 85.69 | 81.61 / 81.08 | 87.23 / 88.87 | 92.01 / 91.62 | NTA / NTA     | 85.59 / 86.16
GAE (Kipf & Welling, 2016b)        | 96.65 / 96.67 | 97.91 / 98.07 | 92.31 / 93.88 | 95.78 / 96.67 | 96.07 / 96.12 | 95.97 / 95.17
G2G_oh                             | 96.95 / 97.54 | 98.41 / 98.63 | 95.89 / 95.78 | 98.29 / 98.46 | 96.75 / 96.47 | 96.98 / 96.42
G2G                                | 98.01 / 98.03 | 98.81 / 98.78 | 96.09 / 96.16 | 98.65 / 98.78 | 97.42 / 97.85 | 98.03 / 98.12

We also include the performance on the so-called "Cora-ML Easy" dataset, obtained from the Cora-ML dataset by making it undirected and selecting the nodes in the largest connected component. We see that while node2vec struggles on the original real-world data, it significantly improves in this "easy" setting. On the contrary, Graph2Gauss handles both settings effortlessly. This demonstrates that Graph2Gauss can be readily applied in realistic scenarios on potentially messy real-world data.

Sensitivity analysis. In Figs. 1(a) and 1(b) we show the performance w.r.t. the dimensionality of the embedding, averaged over 10 trials. G2G is able to learn useful embeddings with strong performance even for relatively small embedding sizes. Even for the case L = 2, where we embed the points as one-dimensional Gaussian distributions (L = 1 + 1 for the mean and the sigma of the Gaussian), G2G still outperforms all of the competitors irrespective of their much higher embedding sizes.

Figure 1: Link prediction performance for different embedding sizes and percentages of training edges on Cora-ML. G2G outperforms the competitors even for small sizes and percentages of edges. (Panels: (a) embedding size L vs. AUC, (b) L vs. AP, (c) percent of training edges vs. AUC, (d) percent of training edges vs. AP. Methods: G2G, G2G_oh, TRIDNR, TADW, GAE, node2vec, Logistic Regression.)

Finally, we evaluate the performance w.r.t. the percentage of training edges, varying from 15% to 85%, averaged over 10 trials. We can see in Figs. 1(c) and 1(d) that Graph2Gauss strongly outperforms the competitors, especially for a small number of training edges. The dashed line indicates the percentage above which we can guarantee to have every node appear at least once in the training set.³ The performance below that line is then indicative of the performance in the inductive setting. Since the structure-only methods are unable to compute meaningful embeddings for unseen nodes, we cannot report their performance below the dashed line.

4.2 NODE CLASSIFICATION

Setup. Node classification is another task commonly used to evaluate the strength of the learned embeddings – after they have been trained in an unsupervised manner. We evaluate the node classification performance for three datasets (Cora-ML, Citeseer and DBLP) that have ground-truth classes. First, we train the embeddings on the entire training data in an unsupervised manner (excluding the class labels). Then, following Perozzi et al. (2014), we use a varying percentage of randomly selected nodes and their learned embeddings along with their labels as training data for a logistic regression, while evaluating the performance on the rest of the nodes. We also optimize the regularization strength for each method/dataset via cross-validation. We show results averaged over 10 trials.
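A minimal version of this protocol, assuming the node embeddings are stacked into a matrix Z and using scikit-learn, could look as follows. The cross-validated tuning of the regularization strength is omitted for brevity; this is an illustrative sketch, not the original evaluation script.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def classify_from_embeddings(Z, y, labeled_fraction=0.1, seed=0):
    """Evaluate embeddings Z (N x L) with a logistic regression trained on a labeled subset."""
    Z_tr, Z_te, y_tr, y_te = train_test_split(
        Z, y, train_size=labeled_fraction, stratify=y, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)
    return f1_score(y_te, clf.predict(Z_te), average="micro")
```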

Figure 2: Classification performance comparison – both G2G and G2G_oh perform strongly. (Panels: F1 score vs. percent of labeled nodes on (a) Citeseer, (b) Cora, and (c) DBLP. Methods: G2G, G2G_oh, TRIDNR, TADW, GAE, node2vec, Logistic Regression.)

Performance on real-world datasets. Fig. 2 compares the methods w.r.t. the classification performance for different percentages of labeled nodes. We can see that our method clearly outperforms the competitors. Again, the constrained version of our method that does not consider attributes is able to outperform some of the competing approaches. Additionally, we can conclude that in general our method shows stable performance regardless of the percentage of labeled nodes. This is a highly desirable property since it shows that, should we need to perform classification, it is sufficient to train only on a small percentage of labeled nodes.

4.3 SAMPLING STRATEGY

Figure 3(a) shows the validation set ROC score for the link prediction task w.r.t. the number of triplets $(i, j_k, j_l)$ seen. We can see that both sampling strategies are able to reach the same performance as the full loss with a significantly smaller (< 4.2%) number of pairs seen (note the log scale). It also shows that the naive random sampling converges slower than the node-anchored sampling strategy. Figure 3(b) gives us some insight as to why – our node-anchored sampling strategy achieves a significantly lower loss. Finally, Fig. 3(c) shows that our node-anchored sampling strategy has lower variance of the gradient updates, which is another contributor to faster convergence.

Figure 3: Our sampling strategy converges significantly faster than the full loss, while maintaining good performance. It also achieves a better loss and has lower variance compared to naive sampling. (Panels: (a) convergence, AUC/AP score vs. number of pairs seen, with the full-loss AUC reached after roughly 4.12% of the triplets; (b) loss comparison over epochs; (c) gradient variance over epochs. Curves: our node-anchored sampling, naive random sampling, full loss.)

³ This percentage is derived from the size of the minimum edge-cover set. For more details see the appendix.


4.4 EMBEDDING UNCERTAINTY

Learning an embedding that is a distribution rather than a point vector allows us to capture uncertainty about the representation. We perform several experiments to evaluate the benefit of modeling uncertainty. Figure 4(a) shows that the learned uncertainty is correlated with neighborhood diversity, where for a node $i$ we define diversity as the number of distinct classes among the nodes in its $p$-hop neighborhood ($\bigcup_{1 \le k \le p} N_{ik}$). Since the uncertainty for a node $i$ is an $L$-dimensional vector (diagonal covariance), we show the average across the dimensions. In line with our intuition, nodes with a less diverse neighborhood have significantly lower variance compared to more diverse nodes whose immediate neighbors belong to many different classes, thus making their embedding more uncertain. The figure shows the result on the Cora dataset for the $p = 3$ hop neighborhood. Similar results hold for the other datasets. This result is particularly impressive given the fact that we learn our embedding in a completely unsupervised manner, yet the uncertainty was able to capture the diversity w.r.t. the class labels of the neighbors of a node, which were never seen during training.
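For clarity, the diversity measure and the per-node uncertainty can be computed directly from the neighborhood sets and the (held-out) class labels. The sketch below reuses the hypothetical khop_neighborhoods helper from Sec. 3.1 and is our own illustration, not the authors' analysis code.

```python
import numpy as np

def neighborhood_diversity(hoods, labels, p=3):
    """Number of distinct classes among nodes within p hops of a given anchor node.

    hoods: dict {k: set of nodes in N_ik} for that anchor; labels: dict node -> class label.
    """
    nodes = set().union(*(hoods[k] for k in range(1, p + 1) if k in hoods))
    return len({labels[j] for j in nodes})

def average_uncertainty(sigma_i):
    """Average of the diagonal variances of a node's embedding (its 'uncertainty')."""
    return float(np.mean(sigma_i))
```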

Figure 4: The benefit of modeling the uncertainty of the nodes. (Panels: (a) neighborhood diversity, average sigma vs. diversity of the 3-hop neighborhood; (b) latent dimensionality, average variance per dimension vs. epoch, with 58/64 dimensions increasing, 6/64 non-increasing, and the stopping criterion marked; (c) dropping dimensions, AUC/AP score vs. number of dimensions dropped, including the AUC/AP of a re-trained model with L = 6.)

Figure 4(b) shows that using the learned uncertainty we are able to detect the intrinsic latent dimensionality of the graph. Each line represents the average variance (over all nodes) for a given dimension $l$ at each epoch. We can see that as training progresses past the stopping criterion (link prediction performance on the validation set) and we start to overfit, some dimensions exhibit a relatively stable average variance, while for others the variance increases with each epoch. By creating a simple rule that monitors the average change of the variance over time, we were able to automatically detect these relevant latent dimensions (colored in red). This result holds for multiple datasets and is shown here for Cora-ML. Interestingly, the number of detected latent dimensions (6) is close to the number of ground-truth communities (7).
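One possible realization of such a rule, assuming the per-dimension average variance has been logged once per epoch, is sketched below. The slope threshold is an arbitrary placeholder and the exact rule used in the paper may differ.

```python
import numpy as np

def detect_latent_dimensions(var_history, stop_epoch, slope_tol=1e-4):
    """Flag embedding dimensions whose average variance keeps increasing after early stopping.

    var_history: array of shape (num_epochs, L), the variance of each dimension averaged
                 over all nodes, recorded once per epoch.
    Returns a boolean mask of the 'relevant' dimensions (variance roughly stable).
    """
    post = var_history[stop_epoch:]               # training past the stopping criterion
    epochs = np.arange(post.shape[0])
    slopes = np.polyfit(epochs, post, deg=1)[0]   # least-squares slope per dimension
    return slopes < slope_tol                     # stable dimensions = intrinsic ones
```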

The next obvious question is then how the performance changes if we remove these highly uncertain dimensions whose variance keeps increasing with training. Figure 4(c) answers exactly that. By removing progressively more and more dimensions, starting with the most uncertain first, we see an imperceptibly small change in performance. Only once we start removing the true latent dimensions do we see a noticeable degradation in performance. The dashed lines show the performance if we re-train the model setting L = 6, equal to the detected number of latent dimensions.

As a last study of uncertainty, in a use case analysis, the nodes with high uncertainty reveal additional interesting patterns. For example, in the Cora dataset, one of the highly uncertain nodes was the paper "The use of word shape information for cursive script recognition" by R. J. Whitrow – surprisingly, all citations (edges) of that paper (as extracted from the dataset) were towards other papers by the same author.

4.5 INDUCTIVE LEARNING: GENERALIZATION TO UNSEEN NODES

As discussed in Sec. 3.4, G2G is able to learn embeddings even for nodes that were not part of the network structure during training time. Thus, it not only supports transductive but also inductive learning. To evaluate how our approach generalizes to unseen nodes we perform the following experiment: (i) first we completely hide 10%/25% of the nodes from the network at random; (ii) we proceed to learn the node embeddings for the rest of the nodes; (iii) after learning is complete we pass the (new) unseen test nodes through our deep encoder to obtain their embeddings; (iv) we evaluate by calculating the link prediction performance (AUC and AP scores) using all of their edges and an equal number of non-edges.


Table 2: Inductive link prediction performance. Each cell reports AUC / AP.

Method (% hidden) | Cora-ML       | Cora          | Citeseer      | DBLP          | Pubmed
Log. Reg. (10%)   | 75.95 / 78.62 | 78.53 / 78.70 | 73.09 / 72.54 | 67.55 / 69.55 | 86.83 / 87.34
G2G (10%)         | 90.93 / 89.37 | 94.18 / 93.40 | 88.58 / 88.31 | 85.06 / 83.75 | 92.22 / 90.45
G2G (25%)         | 87.83 / 86.31 | 92.96 / 92.31 | 87.30 / 86.61 | 83.09 / 81.49 | 90.20 / 88.28

As the results in Table 2 clearly show, since we are utilizing the rich attribute information, we are able to achieve strong performance for unseen nodes. This is true even when a quarter of the nodes are missing. This makes our method applicable in the context of large graphs where training on the entire network is not feasible. Note that SDNE (Wang et al., 2016) and GraphSAGE (Hamilton et al., 2017) cannot be applied in this scenario, since they also require the edges for the unseen nodes to produce an embedding. Graph2Gauss is the only inductive method that can obtain embeddings for a node based only on the node attributes.

4.6 NETWORK VISUALIZATION

One key application of node embedding approaches is creating meaningful visualizations of a network in 2D/3D that support tasks such as data exploration and understanding. Following Tang et al. (2015b) and Pan et al. (2016), we first learn a lower-dimensional L = 128 embedding for each node and then map those representations into 2D with t-SNE (Maaten & Hinton, 2008). Additionally, since our method is able to learn useful representations even in low dimensions, we embed the nodes as 2D Gaussians and visualize the resulting embedding. This has the added benefit of visualizing the nodes' uncertainty as well. Fig. 5 shows the visualization for the Cora-ML dataset. We see that Graph2Gauss learns an embedding in which the different classes are clearly separated.

Figure 5: 2D visualization of the embeddings on the Cora-ML dataset. Color indicates the class label, which is not used during training. Best viewed on screen. (Panels: (a) G2G with L = 2 + 2 = 4; (b) G2G with L = 128, projected with t-SNE.)

5 CONCLUSION

We proposed Graph2Gauss – the first unsupervised approach that represents nodes in attributed graphs as Gaussian distributions and is therefore able to capture uncertainty. Analyzing the uncertainty reveals the latent dimensionality of a graph and gives insight into the neighborhood diversity of a node. Since we exploit the attribute information of the nodes, we can effortlessly generalize to unseen nodes, enabling inductive reasoning. Graph2Gauss leverages the natural ordering of the nodes w.r.t. their neighborhoods via a personalized ranking formulation. The strength of the learned embeddings has been demonstrated on several tasks – specifically, achieving high link prediction performance even in the case of low-dimensional embeddings. As future work we aim to study personalized rankings beyond the ones imposed by the shortest path distance.


ACKNOWLEDGMENTS

This research was supported by the German Research Foundation, Emmy Noether grant GU 1409/2-1, and by the Technical University of Munich - Institute for Advanced Study, funded by the German Excellence Initiative and the European Union Seventh Framework Programme under grant agreement no 291763, co-funded by the European Union.

REFERENCES

Hongyun Cai, Vincent W Zheng, and Kevin Chen-Chuan Chang. A comprehensive survey of graph embedding: Problems, techniques and applications. arXiv preprint arXiv:1709.07604, 2017.

Shaosheng Cao, Wei Lu, and Qiongkai Xu. Grarep: Learning graph representations with global structural information. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 891–900. ACM, 2015.

Shiyu Chang, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C Aggarwal, and Thomas S Huang. Heterogeneous network embedding via deep architectures. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 119–128. ACM, 2015.

Michael Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3837–3845, 2016.

Ludovic Dos Santos, Benjamin Piwowarski, and Patrick Gallinari. Multilabel classification on heterogeneous graphs with gaussian embeddings. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 606–622. Springer, 2016.

Soumyajit Ganguly and Vikram Pudi. Paper2vec: Combining graph and text information for scientific paper representation. In European Conference on Information Retrieval, pp. 383–395. Springer, 2017.

C Lee Giles, Kurt D Bollacker, and Steve Lawrence. Citeseer: An automatic citation indexing system. In Proceedings of the Third ACM Conference on Digital Libraries, pp. 89–98. ACM, 1998.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.

Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and performance: A survey. arXiv preprint arXiv:1705.02801, 2017.

Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864. ACM, 2016.

W. L. Hamilton, R. Ying, and J. Leskovec. Representation learning on graphs: Methods and applications. ArXiv e-prints, September 2017.

William L Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. arXiv preprint arXiv:1706.02216, 2017.

Shizhu He, Kang Liu, Guoliang Ji, and Jun Zhao. Learning to represent knowledge graphs with gaussian embedding. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 623–632. ACM, 2015.

Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.

Xiao Huang, Jundong Li, and Xia Hu. Label informed attributed network embedding.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016a.

Thomas N Kipf and Max Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016b.

Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-based learning. Predicting Structured Data, 1:0, 2006.

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

Andrew Kachites McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 3(2):127–163, 2000.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. arXiv preprint arXiv:1611.08402, 2016.

Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In Proceedings of the 33rd International Conference on Machine Learning. ACM, 2016.

Shirui Pan, Jia Wu, Xingquan Zhu, Chengqi Zhang, and Yang Wang. Tri-party deep network representation. Network, 11(9):12, 2016.

Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710. ACM, 2014.

Trang Pham, Truyen Tran, Dinh Q Phung, and Svetha Venkatesh. Column networks for collective classification. In AAAI, pp. 2485–2491, 2017.

Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93, 2008.

Xiaofei Sun, Jiang Guo, Xiao Ding, and Ting Liu. A general framework for content-enhanced network representation learning. arXiv preprint arXiv:1610.02906, 2016.

Jian Tang, Meng Qu, and Qiaozhu Mei. PTE: Predictive text embedding through large-scale heterogeneous text networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1165–1174. ACM, 2015a.

Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077. ACM, 2015b.

Luke Vilnis and Andrew McCallum. Word representations via gaussian embedding. arXiv preprint arXiv:1412.6623, 2014.

Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1225–1234. ACM, 2016.

Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, and Edward Y Chang. Network representation learning with rich text information. In IJCAI, pp. 2111–2117, 2015.


APPENDIX

A PROOF OF THEOREM 1

To prove Theorem 1 we start with the loss $\mathcal{L}_s$ (Eq. 2), and show that by applying the expectation operator we obtain the original loss $\mathcal{L}$ (Eq. 1). From there it trivially follows that taking the gradient with respect to $\mathcal{L}_s$ for a set of samples gives us an unbiased estimate of the gradient of $\mathcal{L}$.

First we notice that both $\mathcal{L}$ and $\mathcal{L}_s$ sum over $i$, thus it is sufficient to show that the losses are equal in expectation for a single node $i$. Denoting with $\mathcal{L}_s^{(i)}$ the loss for a single node $i$ and with $E_{i,k,l} = E_{ij_k}^2 + \exp(-E_{ij_l})$ for notational convenience we have:

$$
\begin{aligned}
\mathcal{L}_s^{(i)} &= \mathbb{E}_{(j_1,\ldots,j_K) \sim (N_{i1},\ldots,N_{iK})} \Big[ \sum_{k<l} |N_{ik}| \cdot |N_{il}| \cdot E_{i,k,l} \Big] \\
&\overset{(1)}{=} \mathbb{E}_{(j_1,\ldots,j_K) \sim (N_{i1},\ldots,N_{iK})} \big[ |N_{i1}| \cdot |N_{i2}| \cdot E_{i,1,2} \big] + \cdots + \mathbb{E}_{(j_1,\ldots,j_K) \sim (N_{i1},\ldots,N_{iK})} \big[ |N_{iK-1}| \cdot |N_{iK}| \cdot E_{i,K-1,K} \big] \\
&\overset{(2)}{=} \mathbb{E}_{(j_1,j_2) \sim (N_{i1},N_{i2})} \big[ |N_{i1}| \cdot |N_{i2}| \cdot E_{i,1,2} \big] + \cdots + \mathbb{E}_{(j_{K-1},j_K) \sim (N_{iK-1},N_{iK})} \big[ |N_{iK-1}| \cdot |N_{iK}| \cdot E_{i,K-1,K} \big] \\
&\overset{(3)}{=} \sum_{j_1 \in N_{i1}} \sum_{j_2 \in N_{i2}} p(j_1) p(j_2) \, |N_{i1}| \cdot |N_{i2}| \cdot E_{i,1,2} + \cdots + \sum_{j_{K-1} \in N_{iK-1}} \sum_{j_K \in N_{iK}} p(j_{K-1}) p(j_K) \, |N_{iK-1}| \cdot |N_{iK}| \cdot E_{i,K-1,K} \\
&\overset{(4)}{=} \frac{1}{|N_{i1}|} \frac{1}{|N_{i2}|} |N_{i1}| |N_{i2}| \sum_{j_1 \in N_{i1}} \sum_{j_2 \in N_{i2}} E_{i,1,2} + \cdots + \frac{1}{|N_{iK-1}|} \frac{1}{|N_{iK}|} |N_{iK-1}| |N_{iK}| \sum_{j_{K-1} \in N_{iK-1}} \sum_{j_K \in N_{iK}} E_{i,K-1,K} \\
&= \sum_{j_1 \in N_{i1}} \sum_{j_2 \in N_{i2}} E_{i,1,2} + \cdots + \sum_{j_{K-1} \in N_{iK-1}} \sum_{j_K \in N_{iK}} E_{i,K-1,K} \\
&= \sum_{k<k'} \sum_{j \in N_{ik}} \sum_{j' \in N_{ik'}} \left( E_{ij}^2 + \exp(-E_{ij'}) \right) = \mathcal{L}^{(i)}
\end{aligned}
$$

In step (1) we have expanded the sum over $k < l$ into independent terms. In step (2) we have marginalized the expectation over the variables that do not appear in the expression; e.g. for the term $\mathbb{E}_{(j_1,\ldots,j_K) \sim (N_{i1},\ldots,N_{iK})} |N_{i1}| \cdot |N_{i2}| \cdot E_{i,1,2}$ we can marginalize over $j_p$ where $p \neq 1$ and $p \neq 2$, since the term does not depend on them. In step (3) we have expanded the expectation term. In step (4) we have substituted $p(j_p)$ with $\frac{1}{|N_{ip}|}$ since we are sampling uniformly at random.

Since $\mathcal{L}_s^{(i)}$ is equal to $\mathcal{L}^{(i)}$ in expectation, it follows that $\nabla \mathcal{L}_s$ based on a set of samples is an unbiased estimate of $\nabla \mathcal{L}$.

B IMPLEMENTATION DETAILS

Architecture and hyperparameters. We observed that Graph2Gauss is not sensitive to the choice of hyperparameters such as the number and size of hidden layers. Better yet, as shown in Sec. 4.4, Graph2Gauss is also not sensitive to the size of the embedding L. Thus, for a new graph, one can simply pick a relatively large embedding size and, if required, prune it later similarly to the analysis performed in Fig. 4(c).


As a sensible default we recommend an encoder with a single hidden layer of size $s_1 = 512$. More specifically, to obtain the embedding for a node $i$ we have

$$h_i = \mathrm{relu}(x_i W + b) \qquad \mu_i = h_i W_\mu + b_\mu \qquad \sigma_i = \mathrm{elu}(h_i W_\Sigma + b_\Sigma) + 1$$

where $x_i$ are the node attributes, and relu and elu are the rectified linear unit and exponential linear unit respectively. In practice, we found that the softplus works equally well as the elu for making sure that $\sigma_i$ is positive and in turn $\Sigma_i$ is positive definite. We used Xavier initialization (Glorot & Bengio, 2010) for the weight matrices $W \in \mathbb{R}^{D \times s_1}$, $b \in \mathbb{R}^{s_1}$, $W_\mu \in \mathbb{R}^{s_1 \times L/2}$, $b_\mu \in \mathbb{R}^{L/2}$, $W_\Sigma \in \mathbb{R}^{s_1 \times L/2}$, $b_\Sigma \in \mathbb{R}^{L/2}$. As discussed in Sec. 3.4, multiple hidden layers or other architectures such as CNNs/RNNs can also be used based on the specific problem.

Unlike other approaches using Gaussian embeddings (Vilnis & McCallum, 2014; He et al., 2015; Dos Santos et al., 2016), we do not explicitly regularize the norm of the means and we do not clip the covariance matrices. Given the self-regularizing nature of the KL divergence this is unnecessary, as was confirmed in our experiments. The parameters are optimized using Adam (Kingma & Ba, 2014) with a fixed learning rate of 0.001 and no learning rate annealing/decay.

Edge cover. Some of the methods, such as node2vec (Grover & Leskovec, 2016), are not able to produce an embedding for nodes that have not been seen during training. Therefore, it is important to make sure that during the train-validation-test split of the edge set, every node appears at least once in the train set. Random sampling of the edges does not guarantee this, especially when allocating a low percentage of edges to the train set during the split. To guarantee that every node appears at least once in the train set we have to find an edge cover. An edge cover of a graph is a set of edges such that every node of the graph is incident to at least one edge of the set. The minimum edge cover problem is the problem of finding an edge cover of minimum size. The dashed line in Figures 1(c) and 1(d) indicates exactly the size of the minimum edge cover. This condition had to be satisfied for the competing methods; however, since Graph2Gauss is inductive, it does not require that every node is in the train set.
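For reference, a minimum edge cover can be obtained with NetworkX. The sketch below reserves the cover for the train set and splits the remaining edges at random; it is our own illustration of the described procedure (not the original splitting code) and assumes comparable node identifiers and no isolated nodes.

```python
import random
import networkx as nx

def edge_cover_split(G, train_frac=0.85, seed=0):
    """Split the edges of G into train/test so every node occurs in at least one train edge.

    G: undirected networkx Graph without isolated nodes; node ids assumed sortable.
    """
    rng = random.Random(seed)
    # Normalize edges to sorted tuples so (u, v) and (v, u) are treated as the same edge.
    cover = {tuple(sorted(e)) for e in nx.min_edge_cover(G)}
    rest = [tuple(sorted(e)) for e in G.edges() if tuple(sorted(e)) not in cover]
    rng.shuffle(rest)
    n_extra = max(int(train_frac * G.number_of_edges()) - len(cover), 0)
    train_edges = list(cover) + rest[:n_extra]
    test_edges = rest[n_extra:]
    return train_edges, test_edges
```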
