
CirE: Circular Embeddings of Knowledge Graphs

Zhijuan Du, Zehui Hao, Xiaofeng Meng(B), and Qiuyue Wang

School of Information, Renmin University of China, Beijing, China
{2237succeed,jane0331,xfmeng,qiuyuew}@ruc.edu.cn

Abstract. The embedding representation technology provides convenience for machine learning on knowledge graphs (KG): it encodes entities and relations into continuous vector spaces and then constructs 〈entity, relation, entity〉 triples. However, KG embedding models are sensitive to infrequent and uncertain objects, and there is a contradiction between learning ability and learning cost. To this end, we propose circular embeddings (CirE) to learn representations of an entire KG, which can accurately model various objects, save storage space, speed up calculation, and is easy to train and scalable to very large datasets. We make the following contributions: (1) We improve the accuracy of learning various objects by combining holographic projection and dynamic learning. (2) We reduce parameters and storage by adopting the circulant matrix as the projection matrix from the entity space to the relation space. (3) We reduce training time through an adaptive parameter update algorithm which dynamically changes the learning time for various objects. (4) We speed up computation and enhance scalability with the fast Fourier transform (FFT). Extensive experiments show that CirE outperforms state-of-the-art baselines in link prediction and entity classification, justifying the efficiency and scalability of CirE.

Keywords: Knowledge graph · Circular embedding · Circulant matrix · Holographic projection · Dynamic learning · FFT

1 Introduction

Entities are the basic units of human knowledge and are linked by relations; they are very important for learning relational knowledge representations. A KG is a collection of multi-relational knowledge. It is mathematically represented as a multi-graph linked by facts 〈head entity, relation, tail entity〉, abbreviated as 〈h, r, t〉, where h and t denote nodes linked by the edge r.

This research was partially supported by the National Key Research and Development Program of China (No. 2016YFB1000603, 2016YFB1000602); the grants from the Natural Science Foundation of China (No. 61532010, 61379050, 91646203, 61532016); the Specialized Research Fund for the Doctoral Program of Higher Education (No. 20130004130001); and the Fundamental Research Funds for the Central Universities, the Research Funds of Renmin University (No. 11XNL010).

© Springer International Publishing AG 2017. S. Candan et al. (Eds.): DASFAA 2017, Part I, LNCS 10177, pp. 148–162, 2017. DOI: 10.1007/978-3-319-55753-3_10


They are widely utilized in natural language processing applications, such as question answering, web search, knowledge inference, fusion and completion. However, applications of KG suffer from data sparsity and computational inefficiency as the size of the KG increases. Thus, embedding representation technology for KG was born and has become a hot trend, comprising three branches: translation-based models, compositional models, and SE/SME and neural network models.

Although these models have strong capabilities, they still face two challenges: the sensitivity issue and the contradiction issue.

Firstly, about 48.9% of 1-to-1 and N-to-1 triplets achieve over 85% accuracy (Hits@10¹), while 28.3% of 1-to-N triplets achieve less than 60% [1,2]. This mainly results from uncertain objects in KG, as shown in Fig. 1(a). For example, a US president can be Herbert Bush or Walker Bush. In addition, the occurrence frequency of objects (entities or relations) is not balanced in KG, as shown in Fig. 1(b). We define objects occurring fewer than 10 times as infrequent objects, and those occurring more than 50 times as frequent objects. Infrequent objects are highly informative and discriminative, but usually suffer from overfitting during training, while frequent ones are sometimes underfitted. Improper fitting loses information and slows down convergence. Thereby, embedding models are sensitive to uncertain, infrequent and frequent objects.

Fig. 1. Entities and relations distribution statistics: (a) an uncertain object (the Bush family example, mixing 1-to-N, N-to-1 and N-to-N relations); (b) objects distribution

Secondly, existing embedding models cannot balance learning cost, computational cost and fitting accuracy. As shown in Table 1, TransE [1] has the shortest training time because triple fitting involves only vector addition and subtraction. TransR [3] has better accuracy since entities and relations lie in different semantic spaces, but its matrix-vector multiplication needs longer training time. RESCAL [4,5] represents facts via the tensor product and captures rich interactions. HOLE [6] outperforms the others in fitting accuracy (Hits@10). But these require higher dimensions and larger storage.

In this paper, we propose an embedding model called CirE to overcome the above two problems. Our contributions are as follows:

¹ Hits@10: proportion of correct entities in the top-10 ranked entities.


Table 1. Example of the contradiction issue

FB15K               TransE   TransR      RESCAL           HOLE
Fitting operation   v        m           Tensor product   Dot product
Hits@10             47.1     68.7        44.1             70.3
Training time       5 min    3 h         8 h              2 h
Dimensions          50d      50d         150d             200d
Semantic space      Same     Different   Same             Same

v: vector addition or subtraction; m: matrix-vector multiplication

• accuracy: We use holographic projection and dynamic learning to improve the accuracy of triple fitting.

• cost: We use the circulant matrix to project h and t to r, which reduces model parameters and storage space.

• convergence: We use an adaptive parameter update algorithm that dynamically adjusts the learning time for various objects to prevent improper fitting, which accelerates convergence and reduces training time.

• scalability: For high-dimensional data, we employ the FFT to speed up computation and further reduce storage space, which enhances scalability.

The rest of the paper is organized as follows. Section 2 describes related work in learning embeddings for KG. Section 3 expounds our CirE model and analyzes its ability. Section 4 presents experimental results on the link prediction task, the entity classification task and the scalability analysis task. Section 5 concludes the paper.

2 Related Work

The mathematical notations are as follows: a triple is denoted by (h, r, t); its column vectors are denoted by bold lower case letters h, r, t, with e = {h, t}; matrices by bold upper case letters, such as M; a projection vector is denoted by the subscript p; the score function is denoted by fr(h, t). As mentioned in Sect. 1, related work is introduced through the following three branches.

2.1 Translation-Based Models

The first branch is the translation-based model; our CirE also belongs to this branch, which is inspired by the translation invariance of semantic and syntactic relations. This means h + r is supposed to be close to t for a triplet (h, r, t). In the score function fr(h, t) = ‖h + r − t‖ℓ1,ℓ2, the vectors r, h and t can be within the same semantic space or from separate semantic spaces. TransE [1] is an innovative work, which models r and e within the same semantic space. Here, e_i = e_j, ∀i ≠ j, whenever r links many h or t; hence it is unsuitable for N-to-1, 1-to-N and N-to-N relations. To break this limitation, TransH characterizes the entity embedding as


er = e − wr^T e wr, to enable an entity to have distinct representations and play different roles when involved in different relations [7]. Later, as a milestone, TransR [3] inspired a series of models putting r and e in different semantic spaces, such as TransD [8] and TranSparse [2], which set a projection matrix Mr for each r to project e from the entity space to the relation space, e.g. ep = Mr e. The score function is fr(h, t) = ‖hp + r − tp‖ℓ1,ℓ2. In TransR [3], Mr is a common matrix shared by h and t. TransD [8] holds that h and t should be distinguished and that Mr should be related to both entities and relations; thus Mr is replaced by Mre = rp ep^T + I_{n×m}. TranSparse [2] replaces Mr with a sparse matrix Mr^e(θr^e), whose sparse degree θr^e is determined by the number of entities linked to r.

Although translation-based models outperform others, their convergence and calculation are slow. TranSparse [2] only updates nonzero elements, which only reduces the complexity of updating; moreover, sparse matrix-vector multiplication has an inherent bottleneck of frequent memory accesses. CirE, in contrast, accelerates convergence via dynamic learning time for different objects, and the FFT speeds up computation and further reduces storage to enhance scalability.
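To make the translation principle above concrete, here is a minimal NumPy sketch of the two scoring styles (our own illustration, not the authors' code): transe_score keeps h, r, t in one space, while transr_score first projects entities into the relation space with a per-relation matrix Mr.

```python
import numpy as np

def transe_score(h, r, t, p=2):
    """TransE: f_r(h, t) = ||h + r - t|| under the l1 (p=1) or l2 norm."""
    return np.linalg.norm(h + r - t, ord=p)

def transr_score(h, r, t, M_r, p=2):
    """TransR-style: project entities with M_r, then translate by r."""
    return np.linalg.norm(M_r @ h + r - M_r @ t, ord=p)
```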

2.2 Compositional Models

The second branch is the compositional model, which fits a triplet based on the tensor product, such as LFM [9,10], RESCAL [4,5] and DistMult [11], with fr(h, t) = h^T Mr t. LFM only optimizes the nonzero elements. RESCAL optimizes the entire Mr, but brings a lot of parameters. Thus, DistMult [11] uses a diagonal Mr to reduce parameters, but this approach can only model symmetric relations.

HOLE [6] uses the dot product rather than the tensor product. It employs the circular correlation between h and t to represent entity pairs, denoted as [h ∗ t]_k = Σ_{i=0}^{d−1} h_i t_{(i+k) mod d}, with fr(h, t) = σ(r^T(h ∗ t)), where σ(x) = 1/(1 + e^{−x}). It has an advantage on non-commutative relations and equivalence relations (h similar to t), and speeds up calculation by the FFT.
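As an illustration of the formula above, the circular correlation and the HOLE score can be computed via the FFT identity h ∗ t = F⁻¹(conj(F(h)) ⊙ F(t)); the sketch below is our NumPy rendering, not the authors' code.

```python
import numpy as np

def circular_correlation(h, t):
    """[h * t]_k = sum_i h_i t_{(i+k) mod d}, in O(d log d) via the FFT."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(h)) * np.fft.fft(t)))

def hole_score(r, h, t):
    """HOLE: sigma(r^T (h * t)) with the logistic function sigma."""
    return 1.0 / (1.0 + np.exp(-np.dot(r, circular_correlation(h, t))))
```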

The basic idea of HOLE [6] is the circular correlation between two entities, while CirE focuses on the cross-correlation between an entity and a relation. Thus, HOLE is limited by relation types, whereas CirE avoids this limitation.

2.3 SE/SME and Neural Network Models

The third branch is the earliest. SE [12] transforms the entity space with the head-specific matrix Mrh and tail-specific matrix Mrt by fr(h, t) = ‖Mrh h − Mrt t‖ℓ1, but it cannot capture relations between entities [13]. SME [14] can handle correlations between entities and relations by fr(h, t) = (M₁h ⊙ M₂r + b₁)^T (M₃t ⊙ M₄r + b₂), where ⊙ ∈ {+, ⊗}. In addition, there are some neural network models, such as SLM [13] and NTN [13]. SE [12] cannot accurately depict the semantic relations between entities and relations. SLM [13] uses the nonlinear operation of a single-layer neural network to solve this problem and meanwhile reduces the parameters of SE, but it only provides a relatively weak link between entities and relations. The NTN [13] model defines a score function combining SLM [13] and LFM [9,10], fr(h, t) = ur^T g(h^T Mr t + Mr,1 h + Mr,2 t + br), where ur is the relation-specific linear layer, g is the tanh function, and Mr ∈ R^{d×d×k} is a 3-way tensor. Therefore, this kind of model needs many triplets for training and does not suit sparse KGs [1].

3 Methodology

CirE is a translation-based model. More particularly, entities and relations come from separate semantic spaces. They are converted by a projection matrix, which is usually a common matrix or a sparse matrix; our CirE instead uses a circulant matrix. The circulant matrix [15] is a special kind of Toeplitz matrix and is widely used in dimensionality reduction, binary embedding and so on. Each row vector rotates one element to the right relative to the preceding row vector. However, a projection matrix aims at a linear space transformation; therefore, the elements in the circulant matrix can be rotated to the right or to the left. Here, we extend the circulant matrix, as shown in Definition 1.

Definition 1 (circulant matrix). As shown in Fig. 2, each row in a circulant matrix rotates one element to the right or left relative to the preceding row vector. Thus, it can be divided into left and right circulant matrices, respectively denoted as AL = CircL(a) and AR = CircR(a), where a is the first row vector, called the circulant vector.

Fig. 2. The schematic of a circulant matrix: (a) left circular; (b) right circular

The elements of the left and right circulant matrices are denoted by Eqs. (1) and (2) respectively:

aL_{ij} = aL_{(i−1)((k−j) mod n)},  k = 1, …, n,  a_{ij} ∈ AL   (1)

aR_{ij} = aR_{(i−1)((k+j) mod n)},  k = 1, …, n,  a_{ij} ∈ AR   (2)

The relation between the left and the right circulant matrix is shown in Eq. (3):

aL_{1j} = aR_{1j},  i = 0;   aL_{ij} = aR_{i(−j mod n)},  i > 0   (3)
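For illustration, Definition 1 can be realized with simple row rotations; in the sketch below (hypothetical helper code, using np.roll for the rotation), each row of CircR(a) is the previous row rotated one element to the right, and CircL(a) rotates left.

```python
import numpy as np

def circ_right(a):
    """Right circulant matrix CircR(a): row i is a rotated i steps right."""
    return np.stack([np.roll(a, i) for i in range(len(a))])

def circ_left(a):
    """Left circulant matrix CircL(a): row i is a rotated i steps left."""
    return np.stack([np.roll(a, -i) for i in range(len(a))])

a = np.array([1.0, 2.0, 3.0, 4.0])
# circ_right(a) rows: [1 2 3 4], [4 1 2 3], [3 4 1 2], [2 3 4 1]
# circ_left(a)  rows: [1 2 3 4], [2 3 4 1], [3 4 1 2], [4 1 2 3]
```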


3.1 CirE

In CirE, for each triple (h, r, t), we set entities as e = {h, t}, their embedding vectors as e = {h, t} ∈ R^n, and the relation embedding as r ∈ R^m. For each r, we project e from R^n to R^m via a projection matrix A^e ∈ R^{n×m}, or A^e ∈ R^{n×n} when m > n. A^e is a circulant matrix. The circulant vector is denoted as a^e = (a^e_i, i = 1, …, n), with a^{1e} = (a^{1e}_i, i = 1, …, m − n + 1). The projected vector of e is er. Here, let A^e be a right circulant matrix; the left circulant case is similar. The k-th element of er can be obtained by Eqs. (4), (5) and (6), and the score function is defined as Eq. (7):

if n = m:   er_k = Σ_{i=0}^{n−1} e_i a^e_{(i+k) mod n},  k ∈ [0, n − 1]   (4)

if n > m:   er_{k1} = er_k,  k1 = k ∈ [0, m − 1]   (5)

else:   a^e = [0, a^{1e}], a^{1e} ∈ R^{m−n+1};   er_k = Σ_{i=0}^{n−1} e_i a^e_{(i+k) mod n},  k ∈ [0, n − 1]   (6)

fr(h, t) = ‖hr + r − tr‖²ℓ1,ℓ2,   ‖o‖ℓ1,ℓ2 ≤ 1, o = h, r, t, er   (7)
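A direct O(n²) rendering of Eqs. (4) and (7) for the square case n = m might look as follows; this is an illustrative sketch assuming one circulant vector per entity, as in the setup above.

```python
import numpy as np

def project(e, a):
    """Eq. (4) with n = m: e_r[k] = sum_i e[i] * a[(i + k) mod n],
    i.e. the cross-correlation of e with the circulant vector a."""
    n = len(e)
    return np.array([sum(e[i] * a[(i + k) % n] for i in range(n))
                     for k in range(n)])

def cire_score(h, r, t, a_h, a_t):
    """Eq. (7), l2 variant: f_r(h, t) = ||h_r + r - t_r||^2."""
    return np.sum((project(h, a_h) + r - project(t, a_t)) ** 2)
```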

From Eq. (4) we can see that er is a cross-correlation [15] between e and a; similarly, it is a circular convolution [16] when A^e is a left circulant matrix. We call this projection a holographic projection. In Eq. (7), ‖o‖ℓ1,ℓ2 ≤ 1 enforces constraints on the norms of the embeddings h, r, t and er.

Furthermore, the calculation process of the projected vector er in Eq. (4) is shown in Fig. 3(b), which is the inverse process of Fig. 3(a). In Fig. 3(a), the projected vector er can be interpreted as a circular convolution between e and a. In signal processing terms, e and er are the input and output signals, and a is an activation function [17]. According to the theory of structural calculations [18], the circulant matrix A encodes internal constraints between the components of e, and er gives the external conditions. From holographic memories [19], er preserves similar properties to e, and a acts as a cue: a and e are solved by correlating the cue with the memory trace in the learning phase. From the perspective of linear algebra, er = Ae is a circular linear equation, whose solution is stable [20].

Similarly, in Fig. 3(b), er is the cross-correlation between e and a, which equals a circular convolution when aR_i = aL_{(−i) mod n} [20]. Thus, we can employ the FFT [21] to speed up the computation; the runtime complexity is then quasilinear (loglinear) in d, as er can be computed via Eq. (8):

er = F⁻¹(F(e) ⊙ F(a^e)),   e = h, t   (8)

where F and F⁻¹ denote the FFT [20] and its inverse, and ⊙ denotes the Hadamard (entrywise) product. The computational complexity of the FFT is O(d log d).
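The sketch below realizes this speedup. Note that Eq. (8) as printed is the circular-convolution (left circulant) form; the right circulant case of Eq. (4) is a cross-correlation, which conjugates one spectrum, and that is the variant checked against the O(n²) definition here.

```python
import numpy as np

def project_fft(e, a):
    """Cross-correlation of Eq. (4) in O(n log n):
    e_r = F^{-1}(conj(F(e)) . F(a))."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(e)) * np.fft.fft(a)))

# Sanity check against the O(n^2) definition of Eq. (4).
rng = np.random.default_rng(0)
e, a = rng.normal(size=64), rng.normal(size=64)
naive = np.array([sum(e[i] * a[(i + k) % 64] for i in range(64))
                  for k in range(64)])
assert np.allclose(naive, project_fft(e, a))
```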


Fig. 3. The solution of the holographic projection: (a) circular convolution; (b) cross-correlation

3.2 Training Method and Algorithm Implementation

Finally, Eq. (7) is transformed into a minimization problem with the constraints ‖o‖ℓ1,ℓ2 ≤ 1, o = r, h, t. The objective function L and the negative triple set Δ′ are shown in Eqs. (9) and (10):

L = Σ_{(h,r,t)∈Δ} Σ_{(h′,r,t′)∈Δ′} max(0, fr(h, t) + γ − fr(h′, t′))   (9)

Δ′ = {(h′, r, t) | h′ ∈ E} ∪ {(h, r, t′) | t′ ∈ E}   (10)

where γ is the margin, Δ and Δ′ are the sets of correct and incorrect triples respectively, and Δ′ is the negative sampling set of Δ.
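For concreteness, one loss term of Eq. (9) and the negative sampling of Eq. (10) can be sketched as follows; the uniform 50/50 head-or-tail replacement corresponds to the "unif" strategy, and the entity-id representation is our assumption.

```python
import numpy as np

def corrupt(triple, entity_ids, rng):
    """Eq. (10): replace the head or the tail with a random entity."""
    h, r, t = triple
    if rng.random() < 0.5:
        return (rng.choice(entity_ids), r, t)
    return (h, r, rng.choice(entity_ids))

def margin_loss(score_pos, score_neg, gamma=1.0):
    """One term of Eq. (9): max(0, f_r(h, t) + gamma - f_r(h', t'))."""
    return max(0.0, score_pos + gamma - score_neg)
```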

Minimizing L can be solved using Stochastic Gradient Descent (SGD) [22], as in most models [1–7,9–14]. SGD uses a global learning rate η to update all parameters, regardless of their characteristics. But according to the analysis in Sect. 1, frequent objects need a longer time to learn and infrequent objects need a shorter time. We adopt Adadelta [23], which dynamically adapts over time so that small gradients have larger learning rates, while large gradients have smaller learning rates. Firstly, Adadelta restricts the window of past gradients that are accumulated to some fixed size w, and implements this accumulation as an exponentially decaying average of the squared gradients. Assume at time t this running average is E[g²]_t, as shown in Eq. (11):

E[g²]_t = ρE[g²]_{t−1} + (1 − ρ)g_t²   (11)

where ρ is a decay constant. Since the parameter update requires the square root of this quantity, it effectively becomes the RMS of previous squared gradients up to time t, as shown in Eq. (12):

RMS[g]_t = √(E[g²]_t + ε)   (12)

where ε is a constant. The resulting parameter update is shown in Eq. (13):

Δx_t = −(η / RMS[g]_t) g_t   (13)
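Eqs. (11)-(13) can be packaged as a small update rule. The sketch below follows the paper's formulas literally: it keeps the global rate η of Eq. (13), whereas the original ADADELTA [23] also accumulates squared updates (Algorithm 1, lines 23-24, restores that accumulator).

```python
import numpy as np

class DecayingRMSUpdater:
    """Update rule of Eqs. (11)-(13), as written in the paper."""

    def __init__(self, shape, eta=1.0, rho=0.95, eps=1e-6):
        self.eta, self.rho, self.eps = eta, rho, eps
        self.Eg2 = np.zeros(shape)  # running average E[g^2]

    def step(self, x, g):
        self.Eg2 = self.rho * self.Eg2 + (1 - self.rho) * g * g  # Eq. (11)
        rms = np.sqrt(self.Eg2 + self.eps)                       # Eq. (12)
        return x - (self.eta / rms) * g                          # Eq. (13)
```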

Therefore, the learning procedure of CirE is shown in Algorithm 1.


Algorithm 1. Learning CirE

Input: Training sets Δ and Δ′, entity and relation sets E and R, margin γ, embedding dimensions m, n.
Output: h, r, t

1:  initialize r ← uniform(−6/√k, 6/√k) for each r ∈ R  // or the result of TransE
2:  r ← r/‖r‖ for each r ∈ R
3:  e ← uniform(−6/√k, 6/√k) for each e ∈ E
4:  let each projection matrix be A^e = CircR(1, 0, …, 0)  // also CircL
5:  loop
6:    e ← e/‖e‖ for each e ∈ E
7:    Δbatch ← sample(Δ, b)  // sample a minibatch of size b
8:    Tbatch ← ∅  // initialize the set of pairs of triplets
9:    for (h, r, t) ∈ Δbatch do
10:     (h′, r, t′) ← sample(Δ′(h,r,t))  // sample a corrupted triplet
11:     Tbatch ← Tbatch ∪ {((h, r, t), (h′, r, t′))}
12:   end for
13:   update embeddings and projection matrices:
14:   for ℓ ∈ entities and relations in Tbatch do  // normalize vectors
15:     if ‖ℓ‖²ℓ1,ℓ2 > 1 then
16:       ℓ ← ℓ/‖ℓ‖ℓ1,ℓ2  // constraints: ‖e‖ℓ1,ℓ2 ≤ 1, ‖r‖ℓ1,ℓ2 ≤ 1, ‖er‖ℓ1,ℓ2 ≤ 1, e = h, t
17:     end if
18:   end for
19:   for t ∈ [1, T] do  // update parameters, loop over number of updates
20:     g_t ← Σ_{((h,r,t),(h′,r,t′))∈Tbatch} ∇[γ + fr(h, t) − fr(h′, t′)]₊  // compute gradient
21:     E[g²]_t ← ρE[g²]_{t−1} + (1 − ρ)g_t²  // accumulate gradient
22:     Δx_t ← −(η/RMS[g]_t) g_t  // compute update
23:     E[Δx²]_t ← ρE[Δx²]_{t−1} + (1 − ρ)Δx_t²  // accumulate update
24:     x_t ← x_{t−1} + Δx_t  // update parameters
25:   end for
26: end loop

3.3 Complexity Comparisons

The scalability of an algorithm lies not only in high accuracy, but also in low time and space complexity. We compare CirE with other models in Table 2.

In Table 2, complexity is measured by the number of parameters and by the time and memory of multiplication operations in an epoch, where an epoch is a single pass over all triples. ne, nr and ntr represent the numbers of entities, relations and triplets in a KG respectively; nk is the number of hidden nodes of a neural network and ns is the number of slices of a tensor; de and dr represent the dimensions of the entity and relation embedding spaces respectively, and d denotes the case de = dr; θ (0 ≤ θ ≤ 1) denotes the average sparse degree of all transfer matrices. We can see that the number of parameters in CirE is the same as in TransH. The time of multiplication operations is similar to TranSparse, and lies between TransD and TransR, which shows the high efficiency of our approach. In addition, the higher the dimension, the more obvious our advantage.

4 Experiments and Analysis

Our approach, CirE, is evaluated on the tasks of link prediction, triplet classification, and efficiency and scalability analysis, on two subsets of WordNet, WN11 [13] and WN18 [15], and two subsets of Freebase, FB15k [15] and FB13 [13]. The statistics of the 4 datasets are given in Table 3.


Table 2. Complexity comparison

Model            Time complexity                                    Memory complexity                  Opt.
SE [12]          O(2d²ntr)                                          O(ned + 2nrd²)                     SGD
SME [14]         O(4dnknsntr)                                       O(ned + nrd + (4dns + 1)nk)        SGD
SLM [14]         O((2dnk + nk)ntr)                                  O(ned + 2nrnk(1 + d))              SGD
NTN [13]         O(((d² + d)ns + (2d + 1)nk)ntr)                    O(ned + nrns(d² + 2d + 2))         LBFGS
LFM [10]         O((d² + d)ntr)                                     O(ned + nrd²)                      SGD
RESCAL [4]       O(pqde(1 + q) + q²(3de + q + pq))                  O(ned + nrd²)                      SGD
DistMult [11]    O(pde(1 + q) + q(3de + q + pq))                    O(ned + nrd²)                      AdaGrad
HOLE [6]         O(pqde(1 + log q) + q log q(3de + q + p log q))    O((ne + nr)d)                      SGD
TransE [1]       O(dntr)                                            O((ne + nr)d)                      SGD
TransH [7]       O(2dntr)                                           O(nede + 2nrdr)                    SGD
TransR [3]       O(2dedrntr)                                        O(nede + (de + 1)nrdr)             SGD
TransD [8]       O(2drntr)                                          O(2(nede + nrdr))                  AdaDelta
TranSparse [2]   O(λ(1 − θ)dedrntr), 0 ≤ θ ≤ 1                      O(nede + λ(1 − θ)(de + 1)nrdr)     SGD
CirE             O(de log de ntr)                                   O(nede + 2nrdr)                    AdaDelta

ns = 1: linear; λ = 2: separate, λ = 1: share

Table 3. Statistics of the datasets used in this paper

Dataset   WN11      WN18      FB13      FB15k
#Rel      11        18        13        1,345
#Ent      38,696    40,493    75,043    14,951
#Train    112,581   141,442   316,232   483,142
#Valid    2,609     5,000     5,908     50,000
#Test     10,544    5,000     23,733    59,071

For evaluation, we use the same metrics as TransE [1]: (1) the average rank of correct entities (MeanRank) and (2) the proportion of correct entities in the top-10 ranked entities (Hits@10). Firstly, we replace the head or tail entity of each test triplet with all entities in the KG and rank the candidates in ascending order of the score function fr(h, t). Secondly, we filter out the corrupted triples which already appear in the KG. We report MeanRank and Hits@10 under both settings: the original one is named Raw, the filtered one Filter. We compare with translation-based models.
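For concreteness, the per-triple ranking under the Raw and Filter settings can be sketched as below (our illustration; candidates are ranked by ascending score since fr is a distance):

```python
import numpy as np

def rank_metrics(scores, correct_idx, known_idx=()):
    """MeanRank / Hits@10 contribution of one corrupted test triple.

    scores: f_r(h, t) over all candidate entities (lower is better);
    known_idx: candidates forming triples already present in the KG,
    skipped under Filter (pass an empty set for Raw)."""
    known = set(known_idx)
    order = [i for i in np.argsort(scores)
             if i == correct_idx or i not in known]
    rank = order.index(correct_idx) + 1
    return rank, rank <= 10
```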


Table 4. Evaluation results on link prediction

Datasets             WN18                                             FB15K
Metric               MeanRank              Hits@10 (%)                MeanRank              Hits@10 (%)
                     Raw        Filter     Raw         Filter         Raw        Filter     Raw         Filter
TransE [1]           263        251        75.4        89.2           243        125        34.9        47.1
TransH(u/b) [7]      318/401    303/388    75.4/73.0   86.7/82.3      211/212    84/87      42.5/45.7   58.5/64.4
TransR(u/b) [3]      232/238    219/225    78.3/79.8   91.7/92.0      226/198    78/77      43.8/48.2   65.5/68.7
CTransR(u/b) [3]     243/231    230/218    78.9/79.4   92.3/92.3      233/199    82/75      44.0/48.4   66.3/70.2
TransD(u/b) [8]      242/224    229/212    79.2/79.6   92.5/92.2      211/194    67/91      49.4/53.4   74.2/77.3
TranSparse(t1) [2]   248/237    236/224    79.7/80.4   93.5/93.6      226/194    95/88      48.8/53.4   73.4/77.7
TranSparse(t2) [2]   242/233    229/221    79.8/80.5   93.7/93.9      231/191    101/86     48.9/53.5   73.5/78.3
TranSparse(t3) [2]   235/224    223/221    79.0/79.8   92.3/92.8      211/187    63/82      50.1/53.3   77.9/79.5
TranSparse(t4) [2]   233/223    221/211    79.6/80.1   93.4/93.2      216/190    66/82      50.3/53.7   78.4/79.9
TranSparse(ave)      239.5/229.3  227.3/219.3  79.5/80.2  93.2/93.4   221/190.5  81.3/84.5  49.5/53.5   75.8/78.9
CirE(u/b)            228/221    220/213    81.25/82.1  94.2/94.6      203/163    68/85      52.4/54.1   80.3/80.5

u/b: unif/bern; t1: share, S, unif/bern; t2: share, US, unif/bern; t3: separate, S, unif/bern; t4: separate, US, unif/bern; ave: (t1 + t2 + t3 + t4)/4

Since the datasets are the same, we directly report the results of several baselines from TranSparse [2].

To be fair, CirE is limited to d ≤ 100, so our projection matrix is square. We select the margin γ ∈ {0.1, 0.5, 1, 2, 5, 10}, the dimension n = m ∈ {20, 50, 100}, the mini-batch size B ∈ {100, 200, 480, 1440}, and the dissimilarity measure d ∈ {ℓ1, ℓ2}.

4.1 Link Prediction

Link prediction aims to predict the missing h or t for a given fact (h, r, t) [1,2]. We evaluate CirE on WN18 and FB15K; the results are shown in Table 4. In addition, to compare Hits@10 across different kinds of relations, we evaluate CirE on FB15K following TransE [1]; the detailed results are shown in Table 5.

As expected, Table 4 shows that the filtered setting provides lower MeanRank and higher Hits@10. Compared to TranSparse(t4), CirE improves Hits@10 by 0.6% on FB15K and outperforms all compared methods. On WN18, CirE improves Hits@10 by 1.4% and 2.4% compared to TranSparse and TransD respectively, and by 12.1% compared to TransR. Therefore, we conclude that the entity vectors projected from the entity space to the relation space are more reasonable in CirE than in TransH, TransR, TransD and TranSparse, and that the performance achieved by CirE is significant thanks to the combination of holographic projection and dynamic learning. On average, CirE improves MeanRank by 8.31 and 14.56 and Hits@10 by 1.46% and 2.42% over TranSparse on WN18 and FB15K respectively, while TranSparse improves −2.06, −3.56 and 0.71%, 0.84% over TransD.


Table 5. Experimental results on FB15k by mapping properties of relations (%)

Tasks                Prediction Head (Hits@10)                    Prediction Tail (Hits@10)
Types                1-to-1     1-to-N     N-to-1     N-to-N      1-to-1     1-to-N     N-to-1     N-to-N
TransE [1]           43.7       65.7       18.2       47.2        43.7       19.7       66.7       50.0
TransH(u/b) [7]      66.7/66.8  81.7/87.6  30.2/28.7  57.4/64.5   63.7/65.5  30.1/39.8  83.2/83.3  60.8/67.2
TransR(u/b) [3]      76.9/78.8  77.9/89.2  38.1/34.1  66.9/69.2   76.2/79.2  38.4/37.4  76.2/90.4  69.1/72.1
CTransR(u/b) [3]     78.6/81.5  77.8/89.0  36.4/34.7  68.0/71.2   77.4/80.8  37.8/38.6  78.0/90.1  70.3/73.8
TransD(u/b) [8]      80.7/86.1  85.8/95.5  47.1/39.8  75.6/78.5   80.0/85.4  54.5/50.6  80.7/94.4  77.9/81.2
TranSparse(t1) [2]   83.2/87.5  86.4/95.9  50.3/44.1  73.9/78.7   84.8/87.6  57.7/55.6  83.3/93.9  75.3/80.6
TranSparse(t2) [2]   83.4/87.1  86.7/95.8  49.8/44.2  73.4/79.1   84.8/87.2  57.3/55.5  78.2/94.1  76.4/81.7
TranSparse(t3) [2]   82.3/86.8  85.2/95.5  51.3/44.3  79.6/80.9   82.3/86.6  59.8/56.6  84.9/94.4  82.1/83.3
TranSparse(t4) [2]   83.2/87.1  85.2/95.8  51.8/44.4  80.3/81.2   82.6/87.5  60.0/57.0  85.5/94.5  82.5/83.7
TranSparse(ave)      83.0/87.1  85.9/95.8  50.8/44.3  76.8/80.0   83.6/87.2  58.7/56.2  83.0/94.2  79.1/82.3
CirE(u/b)            84.8/87.8  85.5/96.1  54.6/50.2  82.0/83.0   85.5/88.0  62.3/60.0  93.8/94.5  84.0/84.3

u/b: unif/bern; t1: share, S, unif/bern; t2: share, US, unif/bern; t3: separate, S, unif/bern; t4: separate, US, unif/bern; ave: (t1 + t2 + t3 + t4)/4

Table 5 shows that CirE outperforms the other methods on FB15k. For all types of relations, CirE achieves more than 50% on Hits@10. In particular, compared to TranSparse(t4), CirE improves Hits@10 by 4.4% and 1.2% for N-to-1 and N-to-N relations. On average, CirE improves by 1.28%, 4.30%, 5.21% and 3.78% over TranSparse on 1-to-1, 1-to-N, N-to-1 and N-to-N relations respectively, and improves even more over the other methods. This shows that multiple relation semantics can be accurately represented in CirE. Although the improvement of CirE is slightly smaller for simple relations, for complex relations it is significant. There are mainly two reasons: (1) each dimension of the vector obtained by the holographic projection measures some correlation between the entity vector and the relation vector, which is more suitable for projection; (2) the accuracy on infrequent objects is further improved by dynamic parameter learning, which learns frequent objects for a longer time and infrequent objects for a shorter time.

4.2 Triplet Classification

Triplet classification is a binary classification task, which aims to judge whether a given triplet (h, r, t) is correct or not. We use the 3 datasets WN11, FB13 and FB15K to evaluate CirE. The test sets of WN11 and FB13 are provided by TransE and contain both positive and negative triplets. The test set of FB15K only contains correct triples, so we construct negative triples for it as used by TranSparse. The comparison results are described in Table 6 and Fig. 4.

Table 6 shows that CirE outperforms almost all compared models. It obtains the best accuracy of 87.3% on WN11. It is close to the best accuracies of 89.1% (TransD) on FB13 and 88.5% (TranSparse) on FB15K, and significantly higher than those of the other models. Figure 4 shows that CirE improves on TransE and TranSparse for both simple and complex relations; moreover, the classification accuracies of all relations exceed 70%.


Table 6. Triple classification accuracies (%)

Data sets            WN11        FB13        FB15K
TransE(u/b) [1]      75.9/75.9   70.9/81.5   77.3/79.8
TransH(u/b) [7]      77.7/78.8   76.5/83.5   74.2/79.9
TransR(u/b) [3]      85.5/85.9   74.7/82.5   81.1/82.1
CTransR(bern) [3]    85.7        –           84.3
TransD(u/b) [8]      85.6/86.4   85.9/89.1   86.4/88.0
TranSparse(t1) [2]   86.2/86.3   85.5/87.8   85.7/87.9
TranSparse(t2) [2]   86.3/86.3   85.3/87.7   86.2/88.1
TranSparse(t3) [2]   86.2/86.4   86.7/88.2   87.1/88.3
TranSparse(t4) [2]   86.8/86.8   86.5/87.5   87.4/88.5
CirE(u/b)            87.2/87.3   87.4/88.6   88.1/88.4

u/b: unif/bern; t1: share, S, u/b; t2: share, US, u/b; t3: separate, S, u/b; t4: separate, US, u/b

Fig. 4. Classification accuracies of different relations on WN11.

That is because we jointly use holographic projection and adaptive gradient updates to improve the accuracy of learning infrequent objects without affecting frequent objects. Therefore, we believe that our model CirE can deal with multi-relational data very well.

4.3 Scalability Evaluation

Another advantage of CirE is that it is easy to train. We perform experiments on WN18 and FB15K to evaluate the scalability of CirE. On the one hand, scalability is tested by the number of iterations and the time per iteration; the results are described in Table 7.

From Table 7, we can see that CirE converges after 300 epochs on both WN18 and FB15K. This convergence is fast compared to the 1000 epochs of other methods. The reason is that the joint holographic projection and adaptive parameter update avoid underfitting of frequent objects and overfitting of infrequent objects.


Table 7. Training time of embedding models

Method       d     Epochs   Time/Epoch (WN18)   Time/Epoch (FB15K)
TransE       50    1000     5 s                 5.4 s
TransH       50    500      7 s                 12 s
TransR       50    500      1.4 min             3.08 min
TransD       50    1000     12 s                20 s
TranSparse   50    1000     0.87 min            —
TranSparse   100   1000     3.83 min            6.4 min
CirE         50    300      1.17 min            —
CirE         100   300      4.82 min            8.1 min
CirE         128   300      3.94 min            6.8 min
CirE         256   300      8.96 min            13 min
CirE         512   300      13.2 min            17.9 min

Moreover, the training times of TransE, TransH and TransD are about 5, 7 and 12 s per epoch on WN18 respectively. The computational complexity of TransR is higher, at about 1.4 min per epoch, and TranSparse costs 0.87 min per epoch. CirE costs 3.94 min per epoch at 128d, close to the 3.83 min of TranSparse at 100d on WN18, because we use the FFT to accelerate computing starting from 128d. The convergence time of CirE is far below that of TranSparse. It can be seen that our model is easy to train and scale.

On the other hand, we design query answering experiments with different dimensions to test runtime. The experiments demonstrate that CirE performs inference as efficiently as other embedding methods for d ∈ {50, 100, 128, 256, 512}. Figure 5 depicts the number of answers that CirE offers per second.

Fig. 5. Query answers per second for different dimensions.

In Fig. 5, we use the FFT to accelerate computation starting from 128d. Without the FFT, e.g., at 50d and 100d, the query-answers-per-second rate decreases exponentially with the exponential growth of the dimension. However, with the FFT at 128d, 256d and 512d, the rate decreases only logarithmically rather than exponentially, and the rate with the FFT is higher than without it, e.g., higher at 128d than at 100d. We believe that the higher the dimension, the more advantageous CirE becomes. Therefore, CirE is easy to scale.

5 Conclusion and Future Work

In this paper we propose CirE, a translation-based embedding model for KGs. It is based on a holographic projection with cross-correlation between entities and relations, and its adaptive parameter update avoids underfitting for frequent objects and overfitting for infrequent objects, accelerating convergence and reducing training time. An attractive property of CirE is that it balances learning ability and learning cost, and is easy to train. Experiments show that CirE provides state-of-the-art performance on a variety of benchmark datasets. However, our experiments are on binary relations and have not yet covered higher-arity relations. In addition, these embedding models treat a KG simply as symbolic triples, focusing on the explicit semantics of the knowledge but ignoring the implied semantics: explicit semantics are reflected by triple fitting, while implicit semantics are omitted from contextual information, which is supposed to be most critical in natural language understanding. To this end, we will further extend CirE to higher-arity relations and to semantics aided by background knowledge in the future.

References

1. Bordes, A., Usunier, N., Garcia-Duran, A., et al.: Translating embeddings for modeling multi-relational data. In: Advances in Neural Information Processing Systems, pp. 2787–2795. MIT Press, Massachusetts (2013)

2. Ji, G., Liu, K., He, S., et al.: Knowledge graph completion with adaptive sparse transfer matrix. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 985–991. AAAI Press, Menlo Park (2016)

3. Lin, Y., Liu, Z., Sun, M., et al.: Learning entity and relation embeddings for knowledge graph completion. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 2181–2187. AAAI Press, Menlo Park (2015)

4. Nickel, M., Tresp, V., Kriegel, H.P.: A three-way model for collective learning on multi-relational data. In: Proceedings of the 28th International Conference on Machine Learning, pp. 809–816. ACM Press, New York (2011)

5. Nickel, M., Tresp, V., Kriegel, H.: Factorizing YAGO: scalable machine learning for linked data. In: Proceedings of the 21st World Wide Web Conference, pp. 271–280. ACM, New York (2012)

6. Nickel, M., Rosasco, L., Poggio, T.: Holographic embeddings of knowledge graphs. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 1955–1961. AAAI Press, Menlo Park (2016)

7. Wang, Z., Zhang, J., Feng, J., et al.: Knowledge graph embedding by translating on hyperplanes. In: Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, pp. 1112–1119. AAAI Press, Menlo Park (2014)

8. Ji, G., He, S., Xu, L., et al.: Knowledge graph embedding via dynamic mapping matrix. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, pp. 687–696. MIT Press, Massachusetts (2015)

9. Sutskever, I., Tenenbaum, J.B., Salakhutdinov, R.: Modelling relational data using Bayesian clustered tensor factorization. In: Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems, pp. 1821–1828. MIT Press, Massachusetts (2009)

10. Jenatton, R., Roux, N.L., Bordes, A., et al.: A latent factor model for highly multi-relational data. In: Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems, pp. 3167–3175. MIT Press, Massachusetts (2012)

11. Yang, B., Yih, W., He, X., et al.: Embedding entities and relations for learning and inference in knowledge bases. In: Proceedings of ICLR (2015)

12. Bordes, A., Weston, J., Collobert, R., et al.: Learning structured embeddings of knowledge bases. In: Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, pp. 301–306. AAAI Press, Menlo Park (2011)

13. Socher, R., Chen, D., Manning, C.D., et al.: Reasoning with neural tensor networks for knowledge base completion. In: Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems, pp. 926–934. MIT Press, Massachusetts (2013)

14. Erbas, C., Tanik, M.M., Nair, V.S.S.: A circulant matrix based approach to storage schemes for parallel memory systems. In: Proceedings of the Fifth IEEE Symposium on Parallel and Distributed Processing, pp. 92–99. IEEE Press, Piscataway (1993)

15. Plate, T.A.: Holographic reduced representations. IEEE Trans. Neural Netw. 6(3), 623–641 (1995)

16. Zhang, J., Fu, N., Peng, X.: Compressive circulant matrix based analog to information conversion. IEEE Sig. Process. Lett. 21(4), 428–431 (2014)

17. Gentner, D.: Structure-mapping: a theoretical framework for analogy. Cogn. Sci. 7(2), 155–170 (1983)

18. Gabor, D.: Associative holographic memories. IBM J. Res. Dev. 13(2), 156–159 (1969)

19. Schönemann, P.H.: Some algebraic relations between involutions, convolutions, and correlations, with applications to holographic memories. Biol. Cybern. 56(5–6), 367–374 (1987)

20. Angel, E.S.: Fast Fourier transform and convolution algorithms. Proc. IEEE 70(5), 527 (1982)

21. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)

22. Zeiler, M.D.: ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012)

23. Bordes, A., Glorot, X., Weston, J., et al.: A semantic matching energy function for learning with multi-relational data. Mach. Learn. 94(2), 233–259 (2014)