
Defining and identifying the optimal embedding dimension of networks

Weiwei Gu,1 Aditya Tandon,2 Yong-Yeol Ahn,2 and Filippo Radicchi2

1 School of Systems Science, Beijing Normal University, Beijing, 100875, P. R. China
2 Center for Complex Networks and Systems Research, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, Indiana 47408, USA

Network embedding is a general-purpose machine learning technique that encodes network structure in vector spaces with tunable dimension. Choosing an appropriate embedding dimension — small enough to be efficient and large enough to be effective — is challenging but necessary to generate embeddings applicable to a multitude of tasks. Unlike most existing strategies that rely on performance maximization in downstream tasks, here we propose a principled method for the identification of an optimal dimension such that all structural information of a network is parsimoniously encoded. The method is validated on various embedding algorithms and a large corpus of real-world networks. Estimated values of the optimal dimension in real-world networks suggest that efficient encoding in low-dimensional spaces is usually possible.

INTRODUCTION

Neural embedding methods are machine learning techniques that learn geometric representations of entities. For instance, word embedding methods leverage the relationships between words captured in large corpora to map each word to a vector [1–3]; graph embedding methods map each node to a vector by capturing structural information in the graph [4–7]. Embedding approaches have not only been pushing the performance envelope in many tasks, such as word analogy and network link prediction, but also provide a novel way to capture semantic and structural relationships geometrically [3, 8–11].

In neural embedding of networks, the embedding space and dimension do not have a special meaning as in traditional embedding methods such as Laplacian Eigenmaps [12] or hyperbolic space embedding [13–16]. Instead, the dimension is considered a hyperparameter of the model and is either optimized through a model selection process or simply chosen based on common practice (e.g., 100, 200, or 300 dimensions).

In word embedding, because we expect that the semantic space of human language does not drastically vary across corpora or languages, using common default parameters is reasonable, and the behavior of embedding models as a function of the dimension parameter is rather well studied. For instance, it is common to use 100 to 300 dimensions, which are known to provide excellent performance in various tasks without much risk of over-parametrization [17, 18]. By contrast, the space of graphs is vast, and we expect no strong universal structure that would lead to similar optimal hyperparameters. Moreover, it is unclear how the structural properties of networks, such as community structure, affect the right dimension of the model. For instance, imagine a road network, which is naturally embedded in a two-dimensional space. Because of its geometric nature, the optimal embedding dimension should be close to 2. Now, imagine another network that consists of two densely connected clusters of nodes, with only a few connections between the clusters. If there is little structural difference between the nodes within the same cluster, then even one dimension would suffice to represent the nodes effectively. Although the embedding dimension is one of the most important hyperparameters of embedding models, particularly in the graph embedding literature, it is difficult to find principled approaches for choosing it. Most existing methods use performance in downstream tasks, such as node classification and link prediction, to perform model selection, rather than establishing direct connections between the embedding dimension and the structural properties of the network. However, the performance for different tasks may be optimized with different numbers of dimensions. Furthermore, it is natural to expect that the total information content of a network dataset is task independent.

Here, we propose a principled technique to estimate the optimal dimension of a network embedding and examine the relationships between structural characteristics and the optimal dimension. We follow a route similar to the recent study on word embedding dimension by Yin et al. [17, 18]. Yin et al. proposed a metric, the Pairwise Inner Product (PIP) loss function, that compares embeddings across different dimensions by measuring pairwise distances between entities within an embedding. Using this metric, they argued that the optimal embedding dimension can be identified as the one corresponding to the optimal balance between bias and variance. Here, we extend the approach to the embedding of networks by proposing an alternative metric to quantify the amount of information shared by embeddings of different dimensions. We show that the content of structural information encoded in the embedding saturates in a power-law fashion as the dimension of the embedding space increases, and we identify the optimal value of the embedding dimension as the point of saturation of our metric. We evaluate our method by employing multiple embedding techniques, a host of real-world networks, and downstream prediction tasks.

arXiv:2004.09928v1 [physics.soc-ph] 21 Apr 2020


METHODS

Networks

In this paper, we focus our attention on unweighted and undirected networks. The topology of a given network with N nodes is described by its adjacency matrix A, an N × N matrix where Aij = Aji = 1 if a connection between nodes i and j exists, and Aij = Aji = 0 otherwise.

Empirical networks

We consider a corpus of 83 network datasets. Sizes of these networks range from N = 34 to N = 51,083 nodes. We consider networks from different domains, including social, technological, information, biological, and transportation networks (see SI). We ignore directions and weights of the edges, thus making every network undirected and unweighted. For illustrative purposes, we explicitly consider two real-world networks: the American college football network [19] and the Cora citation network [20]. The American college football network is composed of N = 115 nodes and M = 613 edges. Each node is a college football team, and two teams are connected by an edge if they played against each other during the season of the year 2000. The Cora citation network is composed of N = 2,708 nodes and M = 5,429 edges. Each node is a scientific paper, and edges represent citations among papers.

Network models

In addition to the empirical networks, we consider three generative network models: the Erdős–Rényi (ER) model [21], the Barabási–Albert (BA) model [22], and the stochastic block (SB) model [23].

First, the ER model generates random networks with a Poisson degree distribution, given the total number of nodes N and the connection probability p; the average degree is ⟨k⟩ = Np.

Second, the BA model is a growing network model that generates graphs obeying power-law degree distributions P(k) ∼ k^{−γ}, with degree exponent γ = 3. To generate a single instance of the model, we specify the total number of nodes N of the network and the number of edges m that each node introduces to the network during the growth process (the total number of edges is M = Nm).

Finally, the SB model generates random networks with pre-imposed block structure, which solely determines the connection probabilities between nodes. Here, we implement the simplest version of the model, where N nodes are divided into C groups of identical size N/C. Pairs of nodes within the same group are connected with probability p_in; pairs of nodes belonging to different groups are connected with probability p_out. The total number of edges in the network is approximated by M = (N/C) p_in + (C − 1)(N/C) p_out.
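For concreteness, all three models can be generated with standard library calls. The following sketch uses networkx; this is our illustrative choice, not necessarily the library used by the authors, and the parameter values are examples.

```python
# Illustrative sketch (assumes networkx; parameter values are examples).
import networkx as nx

N = 1000
G_er = nx.erdos_renyi_graph(N, p=0.01)        # ER: Poisson degrees, <k> = N*p
G_ba = nx.barabasi_albert_graph(N, m=5)       # BA: power-law degrees with exponent 3
C, p_in, p_out = 16, 0.2, 0.02                # SB: C equally sized blocks
sizes = [N // C] * C
probs = [[p_in if a == b else p_out for b in range(C)] for a in range(C)]
G_sb = nx.stochastic_block_model(sizes, probs, seed=0)
```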

Network embedding algorithms

The embedding in a d-dimensional space of a network with N nodes is fully specified by the embedding matrix V ∈ R^{N×d}. The i-th row of the matrix V is the vector V_{i,·} containing the coordinate values of node i in the embedding. The entire procedure described in this paper can be applied to any embedding technique. Here, as a proof of concept and to ensure the robustness of the results, we consider two different embedding algorithms: node2vec [5] and LINE [24]. For any value of the dimension d, we center the embedding matrix V so that

\sum_{i=1}^{N} \sum_{s=1}^{d} V_{i,s} = 0 .
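As a minimal sketch, the centering condition can be enforced by subtracting the mean of each embedding coordinate, which makes the total sum of entries vanish (the paper only states the zero-sum constraint; per-coordinate centering is our assumption).

```python
import numpy as np

def center_embedding(V):
    # Subtract the per-dimension mean so that sum_{i,s} V[i, s] = 0.
    return V - V.mean(axis=0, keepdims=True)
```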

node2vec

node2vec [5] is a popular network embedding algorithm that builds on the word2vec algorithm [3] through the following analogy: nodes in the network are treated as "words," and a sequence of nodes explored during a biased random walk is treated as a "sentence." In our analysis, we fix the number of walks per node to 20, the walk length to 10, and the number of iterations to 10, and we set the parameters that bias the random walk towards a breadth-first or depth-first exploration both equal to 1. The latter condition makes the embedding algorithm equivalent to DeepWalk [4].
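A hedged sketch of this setup, written against the community node2vec package (pip install node2vec); the authors' own script may use a different implementation, and the mapping of "number of iterations" to gensim training epochs is our assumption.

```python
import numpy as np
from node2vec import Node2Vec  # community implementation, assumed here

def embed_node2vec(G, d):
    walker = Node2Vec(G, dimensions=d, walk_length=10, num_walks=20,
                      p=1, q=1, workers=4)                # p = q = 1: DeepWalk-like walks
    model = walker.fit(window=10, min_count=1, epochs=10)  # 10 training iterations (assumed)
    V = np.array([model.wv[str(u)] for u in G.nodes()])    # node ids are stored as strings
    return center_embedding(V)                             # center as defined above
```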


LINE

LINE [24] is another popular embedding algorithm that aims at identifying embeddings that preserve first- and second-order neighborhood information. We use the default values for the algorithm parameters: the order of the proximity is set to 2, the number of negative samples to 5, and the initial learning rate to 0.025.

Task-based evaluation of network embedding methods

We consider two quantitative evaluation tasks: link prediction and graph clustering. Both are standard tests for evaluating the performance of graph embedding methods [7, 25].

Link prediction

For the link prediction task, we use repeated random sub-sampling validation, applying the following procedure 10 times and reporting the mean value of the accuracy metric over these repetitions. We first hide 10% of randomly chosen edges of the original graph. The hidden edges are regarded as the "positive" sample set. We sample an equal number of disconnected vertex pairs as the "negative" sample set. The union of the "positive" and "negative" sample sets forms our test set. The training set consists of the remaining 90% of connected node pairs and an equal number of randomly chosen disconnected node pairs. We use the training set to learn the embedding and determine the embedding of the nodes. Further, we use the training set to learn the probability of connection between pairs of nodes given their embedding. Specifically, for every pair of nodes i and j in the training set, we evaluate the Hadamard product e_{ij} = V_{i,·} ⊙ V_{j,·}. Note that e_{ij} is a d-dimensional vector. We assume that the probability of connection between nodes i and j is given by

p_{ij}(e_{ij}; \theta) = \frac{1}{1 + \exp(e_{ij}^T \theta)} ,   (1)

where θ is a d-dimensional parameter vector, and e_{ij}^T θ is the dot product between the vectors e_{ij} and θ. The best estimates of the entries of the vector θ are obtained from the training set via logistic regression. We use Eq. (1) over the test set to estimate the accuracy of link prediction. Specifically, we consider the edge (i, j) present if p_{ij}(e_{ij}; θ) ≥ 0.5, and missing otherwise. We quantify the accuracy of link prediction on the test set using the F1 score.
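A compact sketch of this pipeline, under the assumption that scikit-learn's LogisticRegression and f1_score are used; pair sampling and the embedding step are left to the functions sketched above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def hadamard_features(V, pairs):
    # e_ij = V_i * V_j (element-wise): one d-dimensional feature vector per node pair
    return np.array([V[i] * V[j] for i, j in pairs])

def link_prediction_f1(V, train_pairs, y_train, test_pairs, y_test):
    clf = LogisticRegression(max_iter=1000).fit(hadamard_features(V, train_pairs), y_train)
    y_pred = clf.predict(hadamard_features(V, test_pairs))  # threshold at p_ij >= 0.5
    return f1_score(y_test, y_pred)
```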

Graph clustering

We take advantage of the SB model to generate graph instances composed of N = 256 nodes and C = 16 groups with p_in = 0.2 and p_out = 0.02. We then perform the embedding of the graph using information from its adjacency matrix only. To retrieve clusters of nodes emerging from the embedding, we use the k-means clustering algorithm [26]. In the application of the clustering algorithm, we provide additional information by specifying that we are looking for C = 16 clusters. The retrieved clusters of nodes are compared against the true clusters imposed in the construction of the SB model. Specifically, we rely on the normalized mutual information (NMI) to measure the overlap between the ground-truth clusters and the detected clusters [27]. Values of NMI close to one indicate an almost perfect match between the two sets; NMI equals zero if detected and ground-truth clusters have no relation. In our tests, we report results corresponding to the mean value of NMI over 10 realizations of the SB model.
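A sketch of the clustering test, assuming scikit-learn's KMeans and normalized_mutual_info_score:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def clustering_nmi(V, true_labels, C=16):
    # Cluster the node vectors into C groups and compare with the planted blocks.
    predicted = KMeans(n_clusters=C, n_init=10).fit_predict(V)
    return normalized_mutual_info_score(true_labels, predicted)
```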

Evaluation metrics of network embeddings

Here, we define two metrics for assessing the quality of network embeddings.

Normalized embedding loss function

We define a loss function similar to the Pairwise Inner Product (PIP) loss function used for word embeddings [17]. The metric can be used to compare any pair of embeddings, regardless of the way they are obtained, as long as they refer to the same network with the same set of nodes. The function takes two inputs, V^(a) and V^(b), respectively representing the matrices of the embeddings a and b that we want to compare. We preprocess each of these matrices by calculating the cosine similarity between all node pairs in the embedding,

C(V^{(a)}) = V^{(a)} \, [V^{(a)}]^T ,   (2)

from the embedding matrix V^(a). The same expression is used to compute C(V^(b)) for the embedding matrix V^(b). C(V^(a)) is an N × N matrix that captures the similarity between the embeddings of pairs of nodes; [C(V^(a))]_{i,j} corresponds to the cosine similarity between nodes i and j in the embedding V^(a).

The normalized embedding loss function L(V^(a), V^(b)) is defined as the average, over all possible node pairs, of the absolute differences between the cosine similarities of the two embeddings, i.e.,

L(V^{(a)}, V^{(b)}) = \frac{2}{N(N-1)} \sum_{i<j} \left| [C(V^{(a)})]_{i,j} - [C(V^{(b)})]_{i,j} \right| .   (3)

L(V^(a), V^(b)) = 0 if the two embeddings a and b are equivalent. We expect instead L(V^(a), V^(b)) ≃ 2 if the two embeddings represent two radically different representations of the network.
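A direct translation of Eqs. (2)–(3) into code; the row normalization before the matrix product is our assumption, needed for V V^T to hold cosine similarities.

```python
import numpy as np

def cosine_matrix(V):
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)  # unit-norm rows
    return Vn @ Vn.T                                   # Eq. (2): pairwise cosine similarities

def normalized_loss(Va, Vb):
    Ca, Cb = cosine_matrix(Va), cosine_matrix(Vb)
    iu = np.triu_indices(Ca.shape[0], k=1)             # node pairs i < j
    return np.abs(Ca[iu] - Cb[iu]).mean()              # Eq. (3)
```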

Embedding variance

Another metric that we use is the "embedding variance," which estimates the level of coherence among a set of multiple embeddings of the same network. The metric takes a set {V^(1), V^(2), . . . , V^(K)} of K embedding matrices of the same network as input and calculates the average variance of the node-pair similarities across embeddings. We obtain the cosine similarity matrix C(V^(k)) for each of the k = 1, . . . , K embeddings using Eq. (2). We compute the embedding variance as

S^2(V^{(1)}, V^{(2)}, \ldots, V^{(K)}) = \frac{1}{K} \sum_{k=1}^{K} \sum_{i<j} \left[ C(V^{(k)})_{i,j} - \langle C_{i,j} \rangle \right]^2 ,   (4)

where

\langle C_{i,j} \rangle = \frac{1}{K} \sum_{k=1}^{K} C(V^{(k)})_{i,j}

is the average value of the cosine similarity of nodes i and j over the entire set of embeddings. Equation (4) equals the variance of the cosine similarity over all pairs of nodes in all embeddings. S^2(V^{(1)}, V^{(2)}, \ldots, V^{(K)}) = 0 if the embeddings are such that the cosine similarity of all node pairs is always the same across the entire set of embeddings. High values of S^2(V^{(1)}, V^{(2)}, \ldots, V^{(K)}) indicate low coherence among the embeddings in the set.
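Eq. (4) in code, reusing cosine_matrix from the previous sketch:

```python
import numpy as np

def embedding_variance(embeddings):
    Cs = np.stack([cosine_matrix(V) for V in embeddings])       # shape (K, N, N)
    iu = np.triu_indices(Cs.shape[1], k=1)
    sims = Cs[:, iu[0], iu[1]]                                   # cosine similarities, shape (K, n_pairs)
    return ((sims - sims.mean(axis=0)) ** 2).sum(axis=1).mean()  # (1/K) sum_k sum_{i<j} [...]^2
```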

RESULTS

Embedding dimension and performance in detection tasks

To compare embeddings of the same network obtained for different values of the embedding dimension d, we use the normalized embedding loss. We first set a reference dimension dr, which we consider sufficiently large to capture the information stored in the original network. We then compute the embedding matrix V(d) for d < dr and use Eq. (3) to evaluate the normalized embedding loss function at dimension d as L(d) := L(V(d), V(dr)) (see footnote 1). Specifically, for graphs with fewer than N = 500 nodes we set the reference value dr = N, and for graphs with more than N = 500 nodes we set dr = 500. The latter choice is made for convenience, to avoid computational issues. Also, we stress that the choice of the reference value dr is not crucial as long as dr is large enough.

Figure 1. Geometric embedding of real-world networks. a Normalized embedding loss as a function of the embedding dimension for the American college football network. Blue circles refer to numerical results obtained with node2vec; red stars refer to embeddings obtained with the LINE algorithm; the blue curve represents the best fit of Eq. (6) to the data points obtained with node2vec embeddings. We find s = 0.484, α = 1.254, and L∞ = 0.027. The good quality of the fit is attested by the mean-squared error R2 = 2.1 × 10−3. c Accuracy of link prediction with the node2vec embedding as a function of the embedding dimension for the American college football network. Symbols represent average values of the accuracy metric over 10 random sub-sampling validation tests, while the shaded region identifies values within one standard deviation of the mean. b Same as in panel a, but for the Cora graph. The best fit of the data points with Eq. (6) is obtained for s = 0.494, α = 0.540, and L∞ = 0.000 (mean-squared error R2 = 4 × 10−4). d Same as in panel c, but for the Cora graph.

An example of this analysis is reported in Fig. 1a for the American college football network. Here, we use node2vec to obtain the embeddings. The reference value for this graph is dr = 115, where N = 115 is the number of nodes. As d increases, L(d) smoothly decreases towards an asymptotic value close to zero. This fact indicates that, as we increase the dimension d of the embedding, the resulting geometric representations become similar to the one obtained for the reference value dr. Note that L(d) is already very close to the asymptotic value for d ≈ 20, suggesting that the representation obtained by node2vec at d = 20 is already able to capture the structural features (pairwise distance relationships) of the network from the perspective of node2vec. Of course, this observation does not mean that 20 dimensions are sufficient to fully describe the observed network. It simply tells us that further increasing the dimension d of the embedding space is superfluous, in the sense that it does not augment what node2vec is able to learn about the network.

Footnote 1: We stress that some embedding algorithms contain stochastic components (e.g., due to random sampling procedures), so that the embedding V(d) obtained for a given dimension d may not be the same over different runs of the algorithm on the same network. We neglect this fact in the analysis performed in this section. We will consider the role of fluctuations in the determination of the optimal embedding dimension below.

This statement is further supported by our experiments on the performance in link prediction obtained for different values of the embedding dimension d, as illustrated in Fig. 1c. The accuracy in predicting missing links based on d-dimensional embeddings behaves almost exactly as L(d). Increasing d is useful up to a certain point; after that, there is no additional gain in prediction accuracy. The saturation of the link prediction accuracy arises earlier than that observed for L(d). This is to be expected, as there should be potentially more information in a geometric embedding than what is actually used in link prediction.

The analysis of the Cora network allows us to draw similar conclusions (Fig. 1b). Again, we see that L(d) quickly decreases to an asymptotic value as d increases. Further, it seems that d ≈ 40 dimensions are more than enough to provide a sufficiently accurate geometric representation of the network, containing almost all structural features that node2vec is able to extract from it. Also, the behavior of L(d) predicts quite well how the accuracy in link prediction grows as a function of the embedding dimension d (Figure 1).

If we repeat the same analysis using the LINE algorithm [24], we find analogous behavior, with L(d) smoothly decaying towards an asymptotic value as d increases, although the rate of decay is not identical for the two embedding algorithms (Figures 1a, 1b). However, a faster decay does not necessarily imply that the corresponding algorithm represents the network more efficiently than the other. The reference embeddings used in the definition of L(d) are in fact algorithm dependent, and the quantitative comparison between the L(d) functions of two different algorithms may not be informative.

Similar observations can be made when we perform the geometric embedding of synthetic networks. In Figure 2, we display L(d) for networks generated according to the SB model described earlier. We still observe a saturation of the function L(d). Graph clustering reaches maximal performance at slightly smaller d values than those necessary for the saturation of L(d).

Figure 2. Geometric embedding of synthetic networks. a Normalized loss as a function of the embedding dimension in SB networks. Embedding is performed using node2vec. The blue curve in panel a is the best-fit curve of Eq. (6) (s = 1.089, α = 0.968, L∞ = 0.261, and R2 = 5.3 × 10−3) to the data points. b Test of performance in recovering the ground-truth community structure in SB graphs as a function of the dimension of the embedding. We measure the recovery accuracy with normalized mutual information (NMI). Symbols refer to average values of NMI computed over 10 instances of the SB model; the shaded region identifies the range of values within one standard deviation of the mean.

We tested the robustness of our findings for different choices of the value of the reference dimension dr (see SI). We find that, as long as dr is large enough, the function L(d) is essentially unaffected by the specific choice made. Further, we compared our loss function with the PIP loss function introduced in Ref. [17]. The two functions behave in a qualitatively similar manner, displaying a clear decay towards an asymptotic value as the embedding dimension increases. The normalized loss function introduced here has the advantage of being smoother than the PIP loss function, thus allowing for a mathematical description that requires fewer parameters.

Estimating the optimal embedding dimension

The observed behavior of the loss function L(d) indicates that embedding algorithms may generate sufficiently accurate geometric descriptions with a relatively small value of d. Assuming that the plateau value L∞ of the loss function L(d) corresponds to the best geometric description that the embedding algorithm can achieve, we define the optimal dimension do(ε) at accuracy level ε as

d_o(\epsilon) = \arg\min_d \left( L(d) - L_\infty < \epsilon \right) ,   (5)

i.e., the minimal d value such that the difference between L(d) and the optimum L∞ is at most equal to ε. There could be several ways of finding the solution of Eq. (5). Here, given the smoothness of the empirically observed loss functions, we opted for a parametric solution. Specifically, we tested the accuracy of different mathematical functions in modeling the empirically observed decay of L(d) towards its asymptotic value L∞. We found that

L(d) = L_\infty + \frac{s}{d^{\alpha}} ,   (6)

where s is a fitting parameter, represents the data well for various combinations of embedding methods and embedded networks. Best fits of the function of Eq. (6), obtained by minimizing the mean-squared error, are shown in Figures 1a, 1b, and 2a. We performed a systematic analysis on a corpus of 83 real-world networks.
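One way to obtain the best estimates is a nonlinear least-squares fit; a sketch with scipy follows (the paper states only that the mean-squared error is minimized, so the specific routine is our choice).

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_model(d, L_inf, s, alpha):
    return L_inf + s / d ** alpha                     # Eq. (6)

def fit_loss_curve(dims, losses):
    # dims: embedding dimensions d; losses: measured values of L(d)
    popt, _ = curve_fit(loss_model, np.asarray(dims, float), np.asarray(losses, float),
                        p0=(0.0, 1.0, 1.0), maxfev=10000)
    L_inf, s, alpha = popt
    return L_inf, s, alpha
```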

We found that best estimates of the asymptotic value L∞ of the normalized embedding loss function are generally close to zero (see Figure 3). Best estimates of the factor s, regulating the rate of the power-law decay towards L∞, appear very similar to each other, irrespective of the network that is embedded; measured values of s indicate only a mild dependence on the underlying network (see Figure 3). Best estimates of the decay exponent α are generally greater than those expected for uncorrelated clouds of data points (see footnote 2), indicating that the embedding algorithm correctly retains network structural information even in high-dimensional spaces. In systematic analyses performed on random networks constructed using either the ER or the BA model, we find that the size of the network is not an important factor in determining the plateau value L∞ and the decay rate s of Eq. (6) (see Figure 4). The rate s is positively correlated with the density of the embedded network.

Footnote 2: For uncorrelated clouds of points in d dimensions, the central limit theorem allows us to predict that L(d) ∼ 1/d^{1/2}.

Figure 3. Optimal embedding dimension of real-world networks. a Distribution of s over the corpus of 83 real-world networks considered in this paper. b Same as in panel a, but for α. c Complementary cumulative distribution of do(ε), with ε = 0.05. d Complementary cumulative distribution of the best estimate of the optimal dimension do(ε) rescaled by the network size N.

Figure 4. Dependence of the optimal dimension on network size and density. a s as a function of the network size N in the BA model. We control the network size and change the network density of the BA model by choosing different numbers of edges to attach from a new node to the existing nodes. c α as a function of the network size for the same networks as in panel a. b and d Same as in panels a and c, respectively, but for the ER model. Here, we control the density of the graphs by tuning the link probability p between different nodes.

Assuming the validity of Eq. (6), the solution of Eq. (5) for the optimal dimension do(ε) can be written as

d_o(\epsilon) = \left( \frac{s}{\epsilon} \right)^{1/\alpha} .   (7)
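Eq. (7) in code, with an illustrative evaluation using the fitted values listed for the American college football network in Table S1 (s = 1.834, α = 1.503); the numerical result is our own arithmetic, not a value quoted by the authors.

```python
def optimal_dimension(s, alpha, eps=0.05):
    return (s / eps) ** (1.0 / alpha)      # Eq. (7)

# Example: optimal_dimension(1.834, 1.503) is roughly 11, close to the value
# do = 12 reported in Table S1 for the American college football network.
```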



Figure 5. Embedding variance in real networks. a Variance as a function of the embedding dimension d. Results are for node2vec embeddings of the American college football network. b Same as in panel a, but for the Cora citation network.

The best estimate of do(ε) is calculated from the best estimates of s and α as do(ε) = (s/ε)^{1/α}. Values of the optimal dimension measured in real-world networks for ε = 0.05 are reported in the SI. In Figure 3, we display the cumulative distribution of do(ε = 0.05) over the entire corpus. For all networks in our dataset, we find do(ε = 0.05) < 40. The size N of the network is an upper bound for the optimal dimension; however, for sufficiently large networks, estimated values of the optimal dimension do not display any clear dependence on N. Specifically, for roughly 50% of the networks in our corpus we find the ratio do(ε = 0.05)/N < 0.01, and for 80% of the real networks the ratio is do(ε = 0.05)/N < 0.05 (see Figure 3).

Optimal dimension and variance of network embeddings

Our definition of the optimal embedding dimension of Eq. (5) corresponds to the minimal embedding dimension necessary to learn the structure of a network with a sufficient level of accuracy. However, we are not in a position to tell whether the desired level of accuracy is reached because the embedding algorithm is actually encoding the network structure in an optimal way, or because the algorithm is overfitting the network. In this section, we perform a simple analysis in an attempt to provide some clarification regarding this aspect of the problem.

We rely on the embedding variance of Equation (4) to quantify the coherence of a given algorithm in embedding a network in d dimensions. Specifically, we apply node2vec K = 10 times to the same network for every value of the embedding dimension d to obtain a set of embeddings V^(1)_d, . . . , V^(K)_d. Here, the diversity of the embeddings is due to the stochastic nature of the algorithm, which relies on finite-size samples of random walks to embed the network. We then quantify the embedding coherence S^2(d) of the algorithm at dimension d using Equation (4) as S^2(d) := S^2(V^(1)_d, . . . , V^(K)_d).
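Putting the pieces together, a sketch of how S^2(d) can be estimated with the helper functions defined in earlier sketches (embed_node2vec and embedding_variance; K = 10 as in the text):

```python
def coherence_at_dimension(G, d, K=10):
    # Repeat the stochastic embedding K times and measure the spread of
    # pairwise cosine similarities across the runs (Eq. (4)).
    return embedding_variance([embed_node2vec(G, d) for _ in range(K)])
```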

Figure 5 shows how S^2(d) behaves as a function of d for the same networks we considered earlier. S^2(d) is a non-monotonic function of d, with a minimum at a certain dimension. We note that S^2(d) decreases quickly towards its minimum, but increases only slowly afterwards. This finding concurs with previous observations that using higher-than-necessary dimensions does not critically affect the usefulness of the embeddings, and it may explain why the performance of embedding algorithms in tasks such as link prediction (Fig. 1) and graph clustering (Fig. 2) does not deteriorate as d grows.

DISCUSSION

In this paper, we proposed a principled solution to the problem of defining and identifying the optimal embedding dimension of graphs. Our method is an extension to network data of the technique recently introduced by Yin et al. [17] for word embeddings. It defines the optimal embedding dimension as the smallest dimension able to provide a sufficiently accurate geometric representation of the network structure. We validated the method by applying it to cases where we could estimate the performance of embeddings in downstream tasks. We found that the optimal embedding dimension detected by our method roughly corresponds to the optimal dimension that can be deduced from the performance in standard tasks such as link prediction and graph clustering. We then applied the method systematically to a large collection of real-world networks, finding that the estimated optimal dimension is much smaller than the number of nodes in the network. This finding suggests that the number of dimensions needed to describe the structure of a network is typically low, thus justifying the use of low-dimensional embedding methods, such as community detection algorithms, for an effective description of real-world networks.


ACKNOWLEDGEMENTS

Funding: WW acknowledges financial support from the National Science Foundation of China (61673070) and the China Scholarship Council program (No. 201806040107). YYA acknowledges support from the Air Force Office of Scientific Research (FA9550-19-1-0391). FR acknowledges support from the National Science Foundation (CMMI-1552487) and the US Army Research Office (W911NF-16-1-0104).
Data and materials availability: The Python script used for this project is available online at https://github.com/Villafly/defining_identifying_optimal_dimension.


SUPPLEMENTARY MATERIAL

Figure S1. Embedding of real-world networks. The description of the various panels is identical to that of Figure 1 of the main text, with the difference that here we analyze different real-world networks. Specifically, panels a and c regard the scientific collaboration network ca-GrQc, whereas panels b and d are for the citation network Citeseer. The blue lines appearing in panels a and b are the best fits of Eq. (7) of the main paper to the data points. Numerical estimates of the fitting parameters can be found in Table S1.

Figure S2. Comparison between the normalized loss function and the PIP metric. a Normalized loss as a function of the embedding dimension for the American college football network. c PIP loss as a function of the embedding dimension for the American college football network. b and d Same as in panels a and c, respectively, but for the Cora citation network. Results are obtained using node2vec embeddings. Panels a and b show the same data points as Figure 1 of the main text.


Figure S3. Robustness of the behavior of the normalized loss function against specific choices of the value of the reference dimension. We present results obtained for different values of the reference dimension dr. The left panels show results for the American college football network; the right panels show results for the Cora citation network.


network name | N | M | do(ε = 0.05) | s | α | R2 | Refs. | Url
Zachary karate club | 34 | 78 | 8 | 1.469 ± 0.125 | 1.584 ± 0.117 | 1.6 · 10−4 | [28] | url
Windsurfers | 43 | 336 | 26 | 0.859 ± 0.212 | 0.270 ± 0.155 | 9.4 · 10−4 | [29] | url
Contiguous USA | 49 | 107 | 6 | 2.394 ± 0.299 | 2.204 ± 0.172 | 2.0 · 10−4 | [30] | url
Dolphins | 62 | 159 | 8 | 0.767 ± 0.095 | 1.340 ± 0.150 | 2.4 · 10−4 | [31] | url
Macaques | 62 | 1,167 | 30 | 1.698 ± 0.044 | 0.830 ± 0.032 | 2.1 · 10−4 | [32] | url
Train bombing | 64 | 243 | 10 | 0.677 ± 0.045 | 1.128 ± 0.079 | 9.7 · 10−5 | [33] | url
Highschool | 70 | 274 | 8 | 2.139 ± 0.154 | 1.988 ± 0.095 | 1.1 · 10−4 | [34] | url
Les Miserables | 77 | 254 | 10 | 2.572 ± 0.188 | 1.850 ± 0.094 | 2.5 · 10−4 | [35] | url
David Copperfield | 112 | 425 | 14 | 1.958 ± 0.085 | 1.400 ± 0.050 | 2.0 · 10−4 | [36] | url
Hypertext 2009 | 113 | 2,196 | 26 | 1.656 ± 0.030 | 1.004 ± 0.019 | 6.8 · 10−5 | [37] | url
the American college football | 115 | 613 | 12 | 1.834 ± 0.135 | 1.503 ± 0.087 | 3.7 · 10−4 | [19] | url
Florida ecosystem wet | 128 | 2,075 | 24 | 0.660 ± 0.061 | 0.667 ± 0.102 | 7.6 · 10−4 | [38] | url
Florida ecosystem dry | 128 | 2,106 | 32 | 0.617 ± 0.055 | 0.465 ± 0.130 | 1.5 · 10−3 | [38] | url
American Revolution | 136 | 157 | 4 | 3.039 ± 0.540 | 3.022 ± 0.251 | 7.3 · 10−5 | [39] | url
Manufacturing emails | 167 | 3,250 | 10 | 1.852 ± 0.091 | 1.613 ± 0.060 | 1.2 · 10−4 | [40] | url
Little Rock Lake | 183 | 2,434 | 6 | 3.300 ± 0.593 | 2.668 ± 0.250 | 2.5 · 10−4 | [41] | url
Jazz musicians | 198 | 2,741 | 10 | 1.171 ± 0.053 | 1.461 ± 0.053 | 6.3 · 10−5 | [42] | url
PDZBase | 212 | 242 | 26 | 0.986 ± 0.013 | 0.807 ± 0.014 | 2.3 · 10−5 | [43] | url
Residence hall | 217 | 1,839 | 16 | 1.613 ± 0.052 | 1.257 ± 0.036 | 1.1 · 10−4 | [44] | url
Physicians | 241 | 923 | 22 | 0.828 ± 0.024 | 0.794 ± 0.031 | 8.1 · 10−5 | [45] | url
Haggle | 274 | 2,124 | 14 | 1.964 ± 0.052 | 1.456 ± 0.031 | 6.2 · 10−5 | [46] | url
Infectious | 410 | 2,765 | 20 | 0.842 ± 0.027 | 0.854 ± 0.033 | 8.5 · 10−5 | [37] | url
Caenorhabditis elegans | 453 | 2,025 | 28 | 1.132 ± 0.021 | 0.800 ± 0.019 | 6.1 · 10−5 | [47] | url
Unicode languages | 614 | 1,248 | 30 | 1.094 ± 0.019 | 0.778 ± 0.018 | 5.0 · 10−5 | [48] | url
Crime | 829 | 1,475 | 34 | 1.103 ± 0.018 | 0.717 ± 0.017 | 5.5 · 10−5 | [49] | url
UC Irvine forum | 899 | 7,019 | 36 | 1.385 ± 0.012 | 0.749 ± 0.009 | 2.1 · 10−5 | [50] | url
DNC emails co-recipients | 906 | 10,429 | 22 | 0.963 ± 0.021 | 0.861 ± 0.022 | 4.9 · 10−5 | [51] | url
email-Eu-core | 1,005 | 16,705 | 30 | 1.147 ± 0.007 | 0.788 ± 0.006 | 6.7 · 10−6 | [52] | url
U. Rovira i Virgili | 1,133 | 5,450 | 34 | 1.093 ± 0.010 | 0.719 ± 0.010 | 1.7 · 10−5 | [53] | url
Euroroad | 1,174 | 1,417 | 40 | 0.980 ± 0.009 | 0.561 ± 0.011 | 2.3 · 10−5 | [54] | url
Blogs | 1,224 | 16,715 | 38 | 1.263 ± 0.006 | 0.690 ± 0.005 | 5.7 · 10−6 | [55] | url
Air traffic control | 1,226 | 2,408 | 34 | 0.978 ± 0.009 | 0.666 ± 0.010 | 1.6 · 10−5 | [56] | url
Venture Capital | 1,436 | 4,623 | 36 | 0.934 ± 0.008 | 0.608 ± 0.010 | 1.7 · 10−5 | [57] | url
Chicago | 1,467 | 1,298 | 40 | 0.947 ± 0.014 | 0.550 ± 0.018 | 6.4 · 10−5 | [58] | url
opsahl-usairport | 1,574 | 17,215 | 30 | 1.060 ± 0.007 | 0.746 ± 0.007 | 6.8 · 10−6 | [59] | url
Human protein (Stelzl) | 1,702 | 3,155 | 36 | 1.020 ± 0.008 | 0.650 ± 0.009 | 1.5 · 10−5 | [60] | url
Bible | 1,773 | 9,131 | 36 | 1.009 ± 0.007 | 0.651 ± 0.008 | 1.0 · 10−5 | [61] | url
Hamsterster friendships | 1,858 | 12,534 | 34 | 1.042 ± 0.007 | 0.676 ± 0.007 | 9.6 · 10−6 | [62] | url

Table S1. Optimal embedding dimension of real-world networks. List of all networks analyzed in our paper. From left to right, we report: name of the network, number of nodes N, number of edges M, value of the best estimate of the optimal dimension do(ε) at accuracy level ε = 0.05, values of the best estimates s and α obtained by fitting numerical results with the function of Eq. (7) in the main paper, mean-squared error R2 of the fit, reference to the paper where the network data were first considered, and url of the repository from which the data were downloaded. Embeddings of the networks have been performed using the node2vec algorithm.


network name | N | M | do(ε = 0.05) | s | α | R2 | Refs. | Url
Protein | 1,870 | 2,277 | 36 | 0.970 ± 0.006 | 0.611 ± 0.008 | 1.0 · 10−5 | [63] | url
DNC emails | 1,891 | 4,465 | 24 | 1.007 ± 0.015 | 0.851 ± 0.015 | 2.5 · 10−5 | [64] | url
UC Irvine messages | 1,899 | 13,838 | 38 | 1.293 ± 0.012 | 0.722 ± 0.010 | 2.3 · 10−5 | [65] | url
Human protein (Figeys) | 2,239 | 6,432 | 38 | 0.967 ± 0.009 | 0.596 ± 0.010 | 1.9 · 10−5 | [66] | url
Hamsterster full | 2,426 | 16,631 | 38 | 0.918 ± 0.005 | 0.584 ± 0.006 | 7.0 · 10−6 | [67] | url
Adolescent health | 2,539 | 10,455 | 42 | 0.933 ± 0.003 | 0.521 ± 0.004 | 2.7 · 10−6 | [68] | url
Cora | 2,708 | 5,278 | 36 | 0.955 ± 0.005 | 0.627 ± 0.006 | 7.0 · 10−6 | [52] | url
Facebook (NIPS) | 2,888 | 2,981 | 12 | 1.193 ± 0.056 | 1.348 ± 0.053 | 9.6 · 10−5 | [69] | url
OpenFlights | 2,939 | 15,677 | 30 | 1.048 ± 0.005 | 0.767 ± 0.005 | 4.0 · 10−6 | [70] | url
Human protein (Vidal) | 3,133 | 6,726 | 36 | 0.980 ± 0.005 | 0.616 ± 0.006 | 5.8 · 10−6 | [71] | url
OpenFlights | 3,425 | 19,256 | 28 | 1.092 ± 0.007 | 0.811 ± 0.007 | 6.6 · 10−6 | [72] | url
US power grid | 4,941 | 6,594 | 42 | 0.968 ± 0.008 | 0.540 ± 0.010 | 1.9 · 10−5 | [73] | url
ca-GrQc | 5,242 | 14,496 | 38 | 0.951 ± 0.003 | 0.582 ± 0.004 | 2.2 · 10−6 | [52] | url
JUNG and javax dependency | 6,120 | 50,290 | 38 | 1.074 ± 0.006 | 0.632 ± 0.006 | 8.5 · 10−6 | [74] | url
p2p-Gnutella08 | 6,301 | 20,777 | 40 | 1.070 ± 0.007 | 0.609 ± 0.008 | 1.3 · 10−5 | [52] | url
Reactome | 6,327 | 147,547 | 32 | 0.967 ± 0.011 | 0.694 ± 0.012 | 2.1 · 10−5 | [75] | url
JDK dependency | 6,434 | 53,658 | 38 | 1.011 ± 0.005 | 0.607 ± 0.006 | 6.3 · 10−6 | [76] | url
Route views | 6,474 | 13,895 | 36 | 1.040 ± 0.008 | 0.649 ± 0.008 | 1.4 · 10−5 | [77] | url
Advogato | 6,539 | 43,277 | 40 | 0.940 ± 0.004 | 0.536 ± 0.005 | 5.0 · 10−6 | [78] | url
wiki-Vote | 7,115 | 100,762 | 36 | 1.063 ± 0.003 | 0.667 ± 0.003 | 1.9 · 10−6 | [52] | url
Wikipedia elections | 7,118 | 100,751 | 36 | 1.064 ± 0.003 | 0.667 ± 0.003 | 2.1 · 10−6 | [79] | url
Chess | 7,301 | 55,898 | 36 | 0.987 ± 0.004 | 0.620 ± 0.005 | 4.0 · 10−6 | [80] | url
MovieLens user–movie | 7,601 | 55,384 | 36 | 1.338 ± 0.014 | 0.744 ± 0.011 | 3.3 · 10−5 | [81] | url
p2p-Gnutella09 | 8,114 | 26,013 | 40 | 1.074 ± 0.008 | 0.613 ± 0.008 | 1.6 · 10−5 | [52] | url
p2p-Gnutella06 | 8,717 | 31,525 | 38 | 1.111 ± 0.010 | 0.642 ± 0.010 | 2.4 · 10−5 | [52] | url
p2p-Gnutella05 | 8,846 | 31,839 | 38 | 1.116 ± 0.010 | 0.645 ± 0.010 | 2.1 · 10−5 | [52] | url
ca-HepTh | 9,877 | 25,998 | 40 | 0.952 ± 0.002 | 0.561 ± 0.003 | 1.6 · 10−6 | [52] | url
Sexual escorts | 10,106 | 39,016 | 36 | 1.169 ± 0.011 | 0.693 ± 0.011 | 2.5 · 10−5 | [82] | url
Pretty Good Privacy | 10,680 | 24,316 | 36 | 0.962 ± 0.008 | 0.619 ± 0.010 | 1.8 · 10−5 | [83] | url
p2p-Gnutella04 | 10,876 | 39,994 | 38 | 1.174 ± 0.013 | 0.680 ± 0.012 | 3.3 · 10−5 | [52] | url
DBLP | 12,591 | 49,620 | 34 | 0.998 ± 0.002 | 0.659 ± 0.002 | 9.9 · 10−7 | [84] | url
Google.com internal | 15,763 | 148,585 | 38 | 0.932 ± 0.004 | 0.587 ± 0.005 | 5.0 · 10−6 | [85] | url
MovieLens tag–movie | 16,528 | 71,067 | 38 | 1.169 ± 0.012 | 0.683 ± 0.011 | 2.8 · 10−5 | [86] | url
MovieLens user–tag | 16,528 | 43,739 | 40 | 0.993 ± 0.004 | 0.583 ± 0.004 | 3.6 · 10−6 | [87] | url
arXiv cond-mat | 22,015 | 58,586 | 40 | 1.000 ± 0.007 | 0.574 ± 0.008 | 1.3 · 10−5 | [88] | url
p2p-Gnutella25 | 22,687 | 54,705 | 38 | 1.048 ± 0.009 | 0.615 ± 0.009 | 1.9 · 10−5 | [52] | url
Cora citation | 23,166 | 89,157 | 38 | 0.974 ± 0.006 | 0.585 ± 0.007 | 9.1 · 10−6 | [89] | url
Twitter lists | 23,370 | 32,831 | 36 | 0.992 ± 0.010 | 0.619 ± 0.012 | 2.7 · 10−5 | [90] | url
Google+ | 23,628 | 39,194 | 32 | 0.996 ± 0.012 | 0.684 ± 0.014 | 3.0 · 10−5 | [91] | url
CAIDA | 26,475 | 53,381 | 36 | 0.973 ± 0.005 | 0.622 ± 0.006 | 6.9 · 10−6 | [77] | url
p2p-Gnutella24 | 26,518 | 65,369 | 38 | 1.075 ± 0.010 | 0.634 ± 0.011 | 2.4 · 10−5 | [52] | url
Digg | 30,398 | 86,312 | 38 | 1.030 ± 0.008 | 0.607 ± 0.009 | 1.7 · 10−5 | [92] | url
Internet topology | 34,761 | 107,720 | 36 | 0.990 ± 0.005 | 0.626 ± 0.006 | 5.9 · 10−6 | [93] | url
p2p-Gnutella30 | 36,682 | 88,328 | 38 | 1.055 ± 0.010 | 0.631 ± 0.010 | 2.1 · 10−5 | [52] | url
Slashdot threads | 51,083 | 116,573 | 38 | 1.037 ± 0.008 | 0.608 ± 0.009 | 1.7 · 10−5 | [94] | url

Table S2. Continuation of Table S1.


[1] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, Journal of Machine Learning Research 3, 1137 (2003).
[2] J. Pennington, R. Socher, and C. Manning, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014), pp. 1532–1543.
[3] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, Advances in Neural Information Processing Systems 26, 3111 (2013).
[4] B. Perozzi, R. Al-Rfou, and S. Skiena, in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '14 (ACM Press, New York, New York, USA, 2014), pp. 701–710, ISBN 978-1-4503-2956-9, URL http://dl.acm.org/citation.cfm?doid=2623330.2623732.
[5] A. Grover and J. Leskovec, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2016, 855 (2016).
[6] T. N. Kipf and M. Welling, in Proceedings of the 5th International Conference on Learning Representations (2017), ICLR '17, URL https://openreview.net/forum?id=SJU4ayYgl.
[7] W. L. Hamilton, R. Ying, and J. Leskovec, IEEE Data Engineering Bulletin (2017).
[8] J. An, H. Kwak, and Y.-Y. Ahn, in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2018), pp. 2450–2461.
[9] W. L. Hamilton, J. Leskovec, and D. Jurafsky, in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2016), pp. 1489–1501.
[10] N. Garg, L. Schiebinger, D. Jurafsky, and J. Zou, Proceedings of the National Academy of Sciences 115, E3635 (2018).
[11] A. C. Kozlowski, M. Taddy, and J. A. Evans, American Sociological Review 84, 905 (2019), URL https://doi.org/10.1177/0003122419877135.
[12] M. Belkin and P. Niyogi, in Advances in Neural Information Processing Systems (2002), pp. 585–591.
[13] M. A. Serrano, D. Krioukov, and M. Boguná, Physical Review Letters 100, 078701 (2008).
[14] D. Krioukov, F. Papadopoulos, M. Kitsak, A. Vahdat, and M. Boguná, Physical Review E 82, 036106 (2010).
[15] M. Boguná, F. Papadopoulos, and D. Krioukov, Nature Communications 1, 62 (2010).
[16] A. Faqeeh, S. Osat, and F. Radicchi, Physical Review Letters 121, 098301 (2018).
[17] Z. Yin, arXiv preprint arXiv:1803.00502 (2018).
[18] Z. Yin and Y. Shen, in Advances in Neural Information Processing Systems (2018), pp. 887–898.
[19] M. Girvan and M. E. Newman, Proceedings of the National Academy of Sciences 99, 7821 (2002).
[20] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, AI Magazine 29, 93 (2008), URL https://www.aaai.org/ojs/index.php/aimagazine/article/view/2157.
[21] P. Erdos and A. Rényi, Publ. Math. Debrecen 6, 290 (1959).
[22] A.-L. Barabási and R. Albert, Science 286, 509 (1999).
[23] P. W. Holland, K. B. Laskey, and S. Leinhardt, Social Networks 5, 109 (1983).
[24] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, in International Conference on World Wide Web (2015), pp. 1067–1077.
[25] P. Goyal and E. Ferrara, Knowledge-Based Systems 151, 78 (2018).
[26] J. MacQueen et al., in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Oakland, CA, USA, 1967), vol. 1, pp. 281–297.
[27] L. Danon, A. Diaz-Guilera, J. Duch, and A. Arenas, Journal of Statistical Mechanics: Theory and Experiment 2005, P09008 (2005).
[28] W. Zachary, J. of Anthropological Research 33, 452 (1977).
[29] L. C. Freeman, S. C. Freeman, and A. G. Michaelson, J. of Social and Biological Structures 11, 415 (1988).
[30] D. E. Knuth, The Art of Computer Programming, Volume 4, Fascicle 0: Introduction to Combinatorial and Boolean Functions (Addison-Wesley, 2008).
[31] D. Lusseau, K. Schneider, O. J. Boisseau, P. Haase, E. Slooten, and S. M. Dawson, Behavioral Ecology and Sociobiology 54, 396 (2003).
[32] Y. Takahata, The Monkeys of Arashiyama (State University of New York Press, Albany, 1991), pp. 123–139.
[33] B. Hayes, American Scientist 94, 400 (2006).
[34] J. S. Coleman, London Free Press Glencoe (1964).
[35] D. E. Knuth, The Stanford GraphBase: A Platform for Combinatorial Computing, vol. 37 (Addison-Wesley Reading, 1993).
[36] M. E. J. Newman, Physical Review E 74 (2006).
[37] L. Isella, J. Stehlé, A. Barrat, C. Cattuto, J.-F. Pinton, and W. V. den Broeck, J. of Theoretical Biology 271, 166 (2011).
[38] R. E. Ulanowicz, J. J. Heymans, and M. S. Egnotovich, Annual Report to the United States Geological Service Biological Resources Division Ref. No. [UMCES] CBL 00-0176, Chesapeake Biological Laboratory, University of Maryland (2000).
[39] American revolution network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/brunson_revolution.
[40] Manufacturing emails network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/radoslaw_email.
[41] N. D. Martinez, J. J. Magnuson, T. Kratz, and M. Sierszen, Ecological Monographs 61, 367 (1991).
[42] P. M. Gleiser and L. Danon, Advances in Complex Systems 6, 565 (2003).
[43] T. Beuming, L. Skrabanek, M. Y. Niv, P. Mukherjee, and H. Weinstein, Bioinformatics 21, 827 (2005).
[44] L. C. Freeman, C. M. Webster, and D. M. Kirke, Social Networks 20, 109 (1998).
[45] J. Coleman, E. Katz, and H. Menzel, Sociometry pp. 253–270 (1957).
[46] A. Chaintreau, P. Hui, J. Crowcroft, C. Diot, R. Gass, and J. Scott, IEEE Trans. on Mobile Computing 6, 606 (2007).
[47] J. Duch and A. Arenas, Phys. Rev. E 72, 027104 (2005).


[48] Unicode languages network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/unicodelang.
[49] Crime network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/moreno_crime.
[50] T. Opsahl and P. Panzarasa, Social Networks 34 (2011).
[51] DNC emails co-recipients network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/dnc-corecipient.
[52] J. Leskovec and A. Krevl, SNAP Datasets: Stanford large network dataset collection, http://snap.stanford.edu/data (2014).
[53] R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A. Arenas, Phys. Rev. E 68, 065103 (2003).
[54] L. Šubelj and M. Bajec, Eur. Phys. J. B 81, 353 (2011).
[55] Blogs network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/moreno_blogs.
[56] Air traffic control network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/maayan-faa.
[57] W. Gu, J.-d. Luo, and J. Liu, Social Networks 57, 70 (2019).
[58] D. E. Boyce, K. S. Chon, M. E. Ferris, Y. J. Lee, K.-T. Lin, and R. W. Eash, Chicago Area Transportation Study pp. xii + 169 (1985).
[59] US airports network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/opsahl-usairport.
[60] U. Stelzl, U. Worm, M. Lalowski, C. Haenig, F. H. Brembeck, H. Goehler, M. Stroedicke, M. Zenkner, A. Schoenherr, S. Koeppen, et al., Cell 122, 957 (2005).
[61] Bible network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/moreno_names.
[62] Hamsterster friendships network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/petster-friendships-hamster.
[63] M. P. Stumpf, C. Wiuf, and R. M. May, Proceedings of the National Academy of Sciences of the United States of America 102, 4221 (2005).
[64] DNC emails network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/dnc-temporalGraph.
[65] T. Opsahl and P. Panzarasa, Social Networks 31, 155 (2009).
[66] R. M. Ewing, P. Chu, F. Elisma, H. Li, P. Taylor, S. Climie, L. McBroom-Cerajewski, M. D. Robinson, L. O'Connor, M. Li, et al., Molecular Systems Biology 3 (2007).
[67] Hamsterster full network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/petster-hamster.
[68] J. Moody, Social Networks 23, 261 (2001).
[69] Facebook (NIPS) network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/ego-facebook.
[70] T. Opsahl, F. Agneessens, and J. Skvoretz, Social Networks 3, 245 (2010).
[71] J.-F. Rual, K. Venkatesan, T. Hao, T. Hirozane-Kishikawa, A. Dricot, N. Li, G. F. Berriz, F. D. Gibbons, M. Dreze, and N. Ayivi-Guedehoussou, Nature pp. 1173–1178 (2005).
[72] OpenFlights network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/openflights.
[73] D. J. Watts and S. H. Strogatz, Nature 393, 440 (1998).
[74] JUNG and javax dependency network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/subelj_jung-j.
[75] G. Joshi-Tope, M. Gillespie, I. Vastrik, P. D'Eustachio, E. Schmidt, B. de Bono, B. Jassal, G. Gopinath, G. Wu, L. Matthews, et al., Nucleic Acids Research 33, D428 (2005).
[76] JDK dependency network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/subelj_jdk.
[77] J. Leskovec, J. Kleinberg, and C. Faloutsos, ACM Trans. Knowledge Discovery from Data 1, 1 (2007).
[78] Advogato network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/advogato.
[79] Wikipedia elections network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/elec.
[80] Chess network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/chess.
[81] MovieLens user–movie network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/movielens-10m_ui.
[82] L. E. C. Rocha, F. Liljeros, and P. Holme, Proc. of the National Academy of Sciences 107, 5706 (2010).
[83] M. Boguñá, R. Pastor-Satorras, A. Díaz-Guilera, and A. Arenas, Phys. Rev. E 70, 056122 (2004).
[84] DBLP network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/dblp-cite.
[85] G. Palla, I. J. Farkas, P. Pollner, I. Derényi, and T. Vicsek, New J. Phys. 9, 186 (2007).
[86] MovieLens tag–movie network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/movielens-10m_ti.
[87] MovieLens user–tag network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/movielens-10m_ut.
[88] M. E. J. Newman, Proceedings of the National Academy of Sciences 98, 404 (2001).
[89] Cora citation network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/subelj_cora.
[90] Twitter lists network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/ego-twitter.
[91] Google+ network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/ego-gplus.
[92] Digg network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/munmun_digg_reply.
[93] B. Zhang, R. Liu, D. Massey, and L. Zhang, SIGCOMM Computer Communication Review 35, 53 (2005).
[94] Slashdot threads network dataset – KONECT (2017), URL http://konect.uni-koblenz.de/networks/slashdot-threads.