



Going Deep: Graph Convolutional Ladder-Shape Networks

Ruiqi Hu,1 Shirui Pan,2 Guodong Long,1 Qinghua Lu,3 Liming Zhu,3 Jing Jiang1

1 Centre for Artificial Intelligence, University of Technology Sydney, Australia
2 Faculty of IT, Monash University, Australia

3 Data61, CSIRO
{ruiqi.hu, guodong.long, jing.jiang}@uts.edu.au, [email protected]

{qinghua.lu, liming.zhu}@data61.csiro.au

Abstract

Neighborhood aggregation algorithms such as spectral graph convolutional networks (GCNs) formulate graph convolutions as a symmetric Laplacian smoothing operation that aggregates the feature information of one node with that of its neighbors. While they have achieved great success in semi-supervised node classification on graphs, current approaches suffer from the over-smoothing problem when the depth of the neural networks increases, which always leads to a noticeable degradation of performance. To solve this problem, we present graph convolutional ladder-shape networks (GCLN), a novel graph neural network architecture that transmits messages from shallow layers to deeper layers to overcome the over-smoothing problem and dramatically extend the scale of the neural networks with improved performance. We have validated the effectiveness of the proposed GCLN at a node-wise level with a semi-supervised task (node classification) and an unsupervised task (node clustering), and at a graph-wise level with graph classification by applying a differentiable pooling operation. The proposed GCLN outperforms original GCNs, deep GCNs and other state-of-the-art GCN-based models designed from various perspectives on all three tasks over six real-world benchmark data sets.

Introduction

Graphs are essential for representing non-Euclidean data and are omnipresent in most areas. In the social media industry, for instance, users are linked with and interact with other users (e.g., friends, colleagues) through their personal profile information, and the entire social media network is modeled as an attributed graph for product/friend/community recommendations (node clustering). In medicine and pharmacology, molecules and chemical bonds can be constructed as graphs to potentially discover new drugs by identifying their bio-activities (node classification (Defferrard, Bresson, and Vandergheynst 2016; Gilmer et al. 2017)). In academic citation networks, papers are connected by their citations, and the titles, authors, venues and keywords form graph characteristics for automatic categorization (semi-supervised node classification (Velickovic et al. 2017; Kipf and Welling 2016b; Zhang et al. 2016) and image classification (Liu et al. 2019b; 2019a)).

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

There has recently been a rise in interest in leveraging embedding methods or deep learning algorithms for graphs. Many probabilistic models (Grover and Leskovec 2016; Tang et al. 2015) or matrix factorization-based works (Cao, Lu, and Xu 2015; Wang et al. 2017) aim to capture the patterns (walks) with neighborhood connections and the first- or second-order proximities from a graph, and to encode the graph into a low-dimensional, compact vector space. The well-learned embedding can be directly analyzed by conventional machine learning approaches. These methods exploit and preserve the topological characteristics but lose sight of the contextual features of the nodes in the graph.

Other methods simultaneously consider structural characteristics and node features to learn a robust embedding of the graph. Examples in this category include content-enhanced network embedding methods and neighborhood aggregation (or message passing) algorithms. Content-enhanced methods associate text features with the representation architectures; examples include TADW (Yang et al. 2015), which incorporates the feature information into the network representation under a matrix factorization framework, and TriDNR (Pan et al. 2016), which trains a neural network architecture to capture both structural proximity and attribute proximity from the attributed graph. Neighborhood aggregation algorithms (Dai et al. 2018; Li, Han, and Wu 2018; Klicpera, Bojchevski, and Günnemann 2018), represented by spectral graph convolutional networks (spectral GCNs) (Kipf and Welling 2016b), introduce the Laplacian smoothing operation to propagate the attributes of a node over its neighborhood. Spectral-based algorithms have commanded more attention among the aforementioned approaches due to their significant improvements on semi-supervised tasks.

Figure 1: Influence of the depth of graph convolutional networks on semi-supervised node classification performance. When GCN goes deep (layers > 3), its classification accuracy decreases as the number of layers increases.

Spectral-based algorithms borrow the idea of filters from the graph signal processing domain to conduct the graph convolution operation, which can also be interpreted as eliminating noise from the graph signal. The major difference between one type of spectral GCN and another lies in the design of the filter. Spectral CNNs (Bruna et al. 2013) formulate the filter with a collection of learnable parameters to implement the convolution operation based on the spectrum of the graph Laplacian. ChebNet (Defferrard, Bresson, and Vandergheynst 2016) considers the filter as Chebyshev polynomials of the diagonal matrix of eigenvalues, where the Chebyshev polynomial is defined with a recursive formulation. Spectral GCNs (Kipf and Welling 2016b) can be interpreted as the first-order approximation of ChebNet, which assumes that the order of the polynomials of the Laplacian is restricted to 1 while the maximum of the eigenvalues is restricted to 2. Most recently, g-U-Nets (Gao and Ji 2019) attempts to generate the position information of selected nodes by using the proposed gPool and gUnpool operations, so that classic computer vision methods can be applied to graphs and the depth of the architecture can be increased. However, this method relies heavily on graph preprocessing and can be difficult to train for good performance.

The major problem with all the aforementioned algorithms is that the quality of the neighborhood aggregation procedure inevitably declines in a deep neural network architecture that has many graph convolutional layers (Klicpera, Bojchevski, and Günnemann 2018). The line chart (Fig. 1) illustrates how the performance is degraded as the depth of the GCN model is increased. The main reason for this problem is that Laplacian smoothing in these aggregation (or propagation) schemes over-smooths the node features and renders the nodes from different clusters indistinguishable (Xu et al. 2018). Explicitly, adding multiple GCN layers is equivalent to repeatedly conducting a Laplacian smoothing operation, and the features of neighboring connected nodes in the graph will converge to the same values. In the case of spectral GCN, the features will converge in proportion to the square root of the node degree (Li, Han, and Wu 2018), which is the over-smoothing problem addressed in this paper.

In this paper, we propose a novel graph convolutional architecture, graph convolutional ladder-shape networks (GCLN), which is built upon one contracting path and one expanding path with contextual feature channels. The ladder-shape architecture allows the network to directly transmit and fuse the contextual information from the shallow layers to the deeper layers, which challenges the assumption that neighborhood aggregation-based architectures can only be shallow networks (e.g., that GCN performance will necessarily be degraded when there are more than three convolutional layers) because of the over-smoothing of features during the propagation procedure.

To further extend GCLN to exploit the hierarchical embedding information of graphs, we also fuse a differentiable graph pooling module (Ying et al. 2018) with our framework, so that GCLN can infer and aggregate the local topological structure as well as the coarse-grained topological structure over graphs. We have not only evaluated the effectiveness of GCLN with a semi-supervised task and an unsupervised task at the node-wise level, but have also conducted graph classification to validate its predictive performance at the graph-wise level.

Our main contributions are summarized as follows:

• We propose a novel graph convolutional network symmetrically constructed on one contracting and one expanding path with contextual feature channels, which dramatically extends the depth of spectral graph convolutional networks.

• The proposed graph convolutional ladder-shape network (GCLN) solves the problem of over-smoothed features with many convolutional layers in neighborhood aggregation-based neural networks.

• Extensive experiments on semi-supervised and unsupervised node-level tasks, as well as a graph-wise task, compared against a broad set of classical and state-of-the-art baselines demonstrate the effectiveness of GCLN.

Problem Definition

A graph is defined as G = {V, E}, where V = {v_i}, i = 1, ..., n, represents all the nodes in the graph and e_{i,j} = (v_i, v_j) ∈ E indicates the binary linkage relationship between two nodes: e_{i,j} = 1 if there is an edge between nodes v_i and v_j, otherwise e_{i,j} = 0. An adjacency matrix A is used to mathematically present the topological information of the graph G, where each cell of the matrix A maps the corresponding edge information encoded in E. A feature matrix X preserves the features x_i ∈ X pertinent to the node v_i.

Node-wise Embedding Definition

Given G, the objective of graph neural networks is to embed all the nodes V into a compact space Z ∈ R^{n×d} through a mapping f : (A, X) → Z, where d, normally a small number, is the dimension of the learned embedding. z_i ∈ Z indicates the encoded node v_i with associated topological information from A and feature information from X. Ideally, nodes with similar characteristics are expected to remain close in the embedding space. With a well-learned embedding Z of the graph G, classical machine learning methods can be directly applied for semi-supervised tasks like node classification and unsupervised tasks like node clustering.
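To make the notation concrete, the following is a minimal sketch (toy numbers only, not drawn from the paper's data sets) of the inputs A and X and the shape of the embedding Z that the mapping f : (A, X) → Z is expected to produce.

import numpy as np

# Toy attributed graph with n = 4 nodes: A encodes the binary edges e_{i,j},
# X holds one feature vector x_i per node (3 arbitrary features here).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# f : (A, X) -> Z maps the graph to an n x d embedding with d small (e.g. d = 2);
# Z is what downstream classifiers or clustering algorithms consume directly.
n, d = A.shape[0], 2
Z = np.zeros((n, d))  # placeholder for the embedding a trained model would produce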

Graph-wise Classification Definition

Given a set of the aforementioned graphs G = {(G_1, y_1), (G_2, y_2), ..., (G_k, y_k)}, where G_k ∈ G with corresponding labels y_k ∈ Y, the purpose of graph-wise tasks such as graph classification is to learn a mapping f : G → Y which assigns the given graphs to the labels. Each given graph is embedded into a low-dimensional vector based on which the class label of the whole graph can be inferred.

Preliminaries

The proposed GCLN is most related to two models: spectral graph convolutional networks and U-shape networks. We also employ the differentiable graph pooling module to validate the effectiveness of GCLN at the graph-wise level. We elaborate the details of these three models in this section.

Graph Convolutional Networks

Figure 2: The architecture of a typical spectral graph convolutional network. The performance of GCNs with more than two or three layers will dramatically drop as a result of the over-smoothing problem.

Standard graph convolutional networks (GCNs) (Kipf and Welling 2016b), as shown in Figure 2, comprise a two-layer semi-supervised framework which propagates the features of a node over its neighboring nodes. The objective of GCNs is to learn a spectral function f(X, A), where the adjacency matrix A represents the topological characteristics of the graph and the matrix X preserves the interdependence of the nodes and features. The propagation rule essentially represents each node in the graph by aggregating its neighborhood with self-loops, and the output of the networks is a normalized node-level representation (or graph embedding), which represents each node with a low-dimensional vector. Because the spectral function f(·) conducts the feature smoothing on the topological adjacency matrix A, the networks can distribute the gradient through the supervised error and learn the representation of both annotated and unlabeled nodes.

U-Net

The U-Net (Ronneberger, Fischer, and Brox 2015) is an elegant U-shape fully convolutional network architecture developed for bio-medical image segmentation. The architecture consists of two parts: the down-sampling path and the up-sampling path.

The down-sampling path is comprised of four blocks with two convolutional layers and one max pooling (stride = 2) layer for each block. At each step following the down-sampling path, the number of feature channels is doubled. The contracting path captures the contextual features from the input image for segmentation and passes the features to the up-sampling path with skip connections.

The up-sampling path is also comprised of four similar de-convolutional blocks and enables the model to simultaneously obtain accurate localization information and sufficient contextual information from the down-sampling path.

The U-Net adequately and simultaneously exploits the precise localization from the up-sampling path and the contextual information from the down-sampling path. The two sources of information are correspondingly concatenated through the feature channels during training for bio-medical image segmentation.

Differentiable Graph Pooling

The differentiable graph pooling module (Ying et al. 2018) is designed to enable graph neural networks to conduct predictive tasks over graphs rather than over the nodes of one graph.

Graph pooling is necessary for graph-wise tasks such as graph classification, and the challenge of the pooling operation in the graph setting, compared to the computer vision setting, is that unlike an image, a graph has no natural spatial location and unified dimension.

The differentiable graph pooling module solves the aforementioned challenges by learning a differentiable soft assignment at each layer of a GNN model, assigning nodes to a set of clusters according to the learned representations. Thus, the pooling operation of each GNN layer coarsens the given graph iteratively and generates a hierarchical representation of every input graph on completion of the entire training procedure.

Graph Convolutional Ladder-shape Networks

Figure 3: Illustration of the graph convolutional ladder-shape network (GCLN). The network consists of two symmetric paths: a contracting path (left) and an expanding path (right). The contextual feature channels between the two paths are used to pass and fuse contextual information from the contracting path into the location information from the expanding path in a simple but elegant operation.


The proposed GCLN is a symmetric architecture constructed of one contracting path and one expanding path with GCN layers. Three contextual feature channels allow the context features captured from the contracting path to fuse with the localization information learned through the expanding path. Each layer is built with the spectral graph convolution and there are eight GCN layers in total in the proposed framework.

Graph Convolutional Operation

The layers of GCLN are constructed on the first-order approximation of Chebyshev spectral CNNs. The graph convolution here can be defined with the normalized graph Laplacian matrix:

g_θ ⊗ x = θ_0 I x − θ_1 D^{-1/2} A D^{-1/2} x,    (1)

where g_θ is the spectral filter, x ∈ R^n indicates the signal of the graph and ⊗ represents the convolution operator. I is the identity matrix, D denotes the diagonal degree matrix of the topological adjacency matrix A, and θ_k are the Chebyshev coefficients.

Due to the condition of the first-order approximation (Kipf and Welling 2016b), the Chebyshev coefficients are further simplified with θ = θ_1 = −θ_0, and the graph convolution is updated as:

g_θ ⊗ x = θ (I + D^{-1/2} A D^{-1/2}) x.    (2)

The convolution matrix is normalized through:

I + D^{-1/2} A D^{-1/2} ↦ D̃^{-1/2} Ã D̃^{-1/2},    (3)

where D̃_ii = Σ_j Ã_ij and Ã = A + I.

Adopting the definition of graph convolution, given a graph signal with m feature channels (e.g., X ∈ R^{n×m}, m features associated with each node), the layer-wise propagation rule of the spectral graph convolutional networks is defined as:

H^{(l+1)} = φ(D̃^{-1/2} Ã D̃^{-1/2} H^{(l)} W^{(l)}),    (4)

where H^{(0)} = X and H^{(l)} is the activation from the l-th GCN layer. W^{(l)} is the trainable weight matrix and φ is the activation function, such as ReLU(·) or a linear activation.
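As a concrete illustration of Eqs. (3) and (4), the following is a minimal NumPy sketch (toy graph and randomly initialized weights, not the paper's implementation) of the symmetric normalization and a single GCN layer.

import numpy as np

def normalize_adjacency(A):
    """Eq. (3): add self-loops and symmetrically normalize, D~^{-1/2} A~ D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])            # A~ = A + I
    d = A_tilde.sum(axis=1)                     # D~_ii = sum_j A~_ij
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return d_inv_sqrt @ A_tilde @ d_inv_sqrt    # A^, the renormalized adjacency

def gcn_layer(A_hat, H, W):
    """Eq. (4): one propagation step, H^{l+1} = ReLU(A^ H^l W^l)."""
    return np.maximum(A_hat @ H @ W, 0.0)

# Toy example: 4 nodes on a path graph, 3 input features, 2 hidden units.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.randn(4, 3)
W0 = np.random.randn(3, 2)

A_hat = normalize_adjacency(A)
H1 = gcn_layer(A_hat, X, W0)    # first-layer activations, shape (4, 2)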

For semi-supervised multi-class classification, the softmax activation function is applied row-wise to the final embedding of the graph. The form of the classifier is defined as follows:

C = softmax(Â ReLU(Â X W^{(l-1)}) W^{(l)}),    (5)

where Â = D̃^{-1/2} Ã D̃^{-1/2} and W^{(l)} is the weight matrix between the last GCN layer and the output layer.

The node classification loss function is written with the cross-entropy error as follows:

L := −∑_{i∈V_l} ∑_{f=1}^{F} Y_{if} ln C_{if},    (6)

where V_l denotes the set of annotated nodes and F is the number of classes. Y is the label indicator matrix over the labeled nodes.
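A compact sketch of the classifier in Eq. (5) and the loss in Eq. (6) follows; it assumes the A_hat produced by the normalization sketch above and uses the conventional negative-sum form of the cross-entropy, masked to the labeled nodes.

import numpy as np

def softmax_rows(Z):
    """Row-wise softmax (over the class dimension for each node)."""
    Z = Z - Z.max(axis=1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def gcn_classifier(A_hat, X, W0, W1):
    """Eq. (5): C = softmax(A^ ReLU(A^ X W0) W1)."""
    H = np.maximum(A_hat @ X @ W0, 0.0)
    return softmax_rows(A_hat @ H @ W1)

def semi_supervised_loss(C, Y, labeled_idx):
    """Eq. (6): cross-entropy summed only over the annotated node set V_l."""
    eps = 1e-12                                   # guard against log(0)
    return -np.sum(Y[labeled_idx] * np.log(C[labeled_idx] + eps))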

GCLN Contracting Path

The contracting path is a four-layer graph convolutional embedding architecture in which each layer halves the size (the number of neurons) of the previous layer. Each GCN layer conducts the graph convolution with Eq. (4), followed by the ReLU activation and dropout.

The contracting path can be considered the encoder part if GCLN is interpreted as an end-to-end encoder-decoder architecture which encodes the topological characteristics and the features associated with each node into feature representations at multiple echelons. The contextual information of the graph is captured through the contracting path and preserved at each layer, and will be correspondingly transmitted to the expanding path via the contextual feature channels.

GCLN Expanding Path

The expanding path mirrors the architecture of the contracting path with the arrangement of the layers reversed. Each layer doubles the number of neurons in the previous layer. An element-wise summation operation is conducted to receive and fuse the contextual information skipped from the contracting path as follows:

H^{(l)}_{expanding} = summation(H^{(l)}, H^{(L−l)}),    (7)

where H^{(l)} is the latent representation from the l-th layer and L is the number of graph convolutional layers. For the experiments described in this paper, L = 8.

The contextual feature channels allow the network to go deeper and to fuse the contextual features from the contracting side with the up-convolutional features from the expansive side. This summation enables representations to be better localized and obtained following many convolutions. The indistinguishable features of nodes caused by repetitively conducting a symmetrically normalized Laplacian smoothing operation are directly characterized and enhanced, enabling them to merge with the contextual features from the contracting path. As a result, the over-smoothing problem of neighborhood aggregation-based algorithms is overcome.

A softmax activation takes the output from the last graph convolutional layer to conduct the semi-supervised node classification with Eq. (5) and Eq. (6).
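The sketch below illustrates one possible reading of the full GCLN forward pass: four contracting GCN layers with halving widths (64, 32, 16, 8, as in the experimental setup), three expanding layers with doubling widths, and the element-wise summation of Eq. (7) pairing activations of equal width (the three contextual feature channels). The exact layer widths, the skip pairing, and the helper names are assumptions for illustration, not the authors' code.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_gcln_weights(in_dim, num_classes, widths=(64, 32, 16, 8), seed=0):
    """Assumed widths: contracting 64->32->16->8, expanding 8->16->32->64, then classes."""
    rng = np.random.default_rng(seed)
    down_dims = [in_dim, *widths]                   # contracting layer sizes
    up_dims = list(widths[::-1])                    # expanding layer sizes
    return {
        "down": [0.1 * rng.standard_normal((a, b)) for a, b in zip(down_dims, down_dims[1:])],
        "up":   [0.1 * rng.standard_normal((a, b)) for a, b in zip(up_dims, up_dims[1:])],
        "out":  0.1 * rng.standard_normal((widths[0], num_classes)),
    }

def gcln_forward(A_hat, X, weights):
    """Contracting path, expanding path with Eq. (7) skips, and a softmax output (Eq. 5)."""
    h, contracting = X, []
    for W in weights["down"]:                       # contracting path: keep every activation
        h = relu(A_hat @ h @ W)
        contracting.append(h)
    for i, W in enumerate(weights["up"]):           # expanding path with contextual channels
        h = relu(A_hat @ h @ W)
        h = h + contracting[-(i + 2)]               # element-wise sum of equal-width layers
    logits = A_hat @ h @ weights["out"]             # final GCN layer feeding the softmax
    logits -= logits.max(axis=1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)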

Algorithm for GCLN

The pseudo-code of GCLN is described in Algorithm 1.

Step 2 walks through the contracting path and collects the contextual information of the graph layer by layer, while Steps 3 and 4 lead the network to go deep along the expansive path and fuse the corresponding contextual information via the feature channels. Steps 5 and 6 conduct the semi-supervised node classification and update the network.

In particular, the input for each graph convolution layer in the expanding path is the corresponding result of the summation, rather than the direct latent matrix H_exp.

Fig. 3 demonstrates the construction of the proposed GCLN.


Figure 4: Illustration of the differentiable graph pooling procedure. At each hierarchical layer, the embeddings of nodes are learned and obtained through a GCN layer, after which the nodes are clustered according to the learned embeddings into the coarsened graph for another GCN layer. The process is iterated for l layers and the final output is used for the graph classification task.

Algorithm 1 Graph Convolutional Ladder-Shape Networks for Node Classification

Require:
  G = {V, E}: a graph with nodes and edges;
  X: the feature matrix;
  T: the number of iterations;
  o: the number of neurons in the first convolution layer;
Ensure: o is divisible by 8.
1: for iterator = 1, 2, 3, ..., T do
2:   Generate the latent matrix H via Eq. (4) by passing the contracting path;
3:   Generate the direct latent matrix H_exp via Eq. (4) by passing the expanding path;
4:   Conduct the corresponding summation via Eq. (7) with H and H_exp;
5:   Get the prediction results via Eq. (5);
6:   Update the model with the loss computed via Eq. (6);
7: end for

Extension - GCLN with Differentiable Pooling

To extend GCLN to the graph-wise task, we add a differentiable pooling layer behind each GCN layer and take the output from the last layer for graph classification. The differentiable graph pooling formulates a general recipe for hierarchically pooling nodes across a set of graphs by generating a cluster assignment matrix S over the nodes, leveraging the output of each GCN layer. The cluster matrix S at layer l is computed as follows:

S^{(l)} = softmax(GCN(A^{(l)}, X^{(l)})),    (8)

where GCN(A, X) is the graph convolutional operation elaborated in the last subsection and the softmax function is conducted row-wise.

With the cluster assignment matrix S, the differentiable pooling layer coarsens the given graph, generating a new adjacency matrix A^{(l+1)} and a new feature matrix X^{(l+1)} by applying the following equations:

X^{(l+1)} = S^{(l)T} H^{(l)} ∈ R^{n_{l+1}×d},    (9)

A^{(l+1)} = S^{(l)T} A^{(l)} S^{(l)} ∈ R^{n_{l+1}×n_{l+1}}.    (10)

The generated A^{(l+1)} and X^{(l+1)} will be processed by the next GCN layer.

Fig. 4 illustrates the general process of the differentiable graph pooling.
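The following is a minimal sketch of one pooling step in Eqs. (8)-(10); for brevity the assignment branch uses a single un-normalized linear "GCN" (A @ Z @ W), whereas the actual module stacks full GCN layers.

import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def diffpool_step(A, Z, W_assign):
    """One differentiable pooling step.

    A: (n_l, n_l) adjacency at layer l; Z: (n_l, d) node embeddings from the
    embedding GCN; W_assign: (d, n_{l+1}) maps features to cluster scores.
    """
    S = softmax_rows(A @ Z @ W_assign)   # Eq. (8): soft cluster assignment, row-wise
    X_next = S.T @ Z                     # Eq. (9): X^{l+1} = S^{(l)T} H^l
    A_next = S.T @ A @ S                 # Eq. (10): A^{l+1} = S^{(l)T} A^l S^{(l)}
    return X_next, A_next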

Experiments

In this section, we compare GCLN with 15 classical and state-of-the-art methods to demonstrate its solid performance on graph node classification. The experiments with an 8-layer GCN and GAT validate the existence of the over-smoothing problem and demonstrate that GCLN is a solution to the problem. Contrast experiments against residual GCN and GAT illustrate that residual connections cannot effectively ameliorate the problem. We also reveal the effectiveness of GCLN on unsupervised learning tasks with clustering experiments compared to 16 peers. In addition, GCLN proves its predictive capability with the differentiable graph pooling module on graph-wise tasks such as graph classification.

Data Sets

We conducted the experiments using three real-world bibliographic data sets: Cora, Citeseer and Pubmed (Sen et al. 2008); the data set statistics are summarized in Table 1. The data sets are used for both node-wise classification and clustering.

Node Classification

We conducted node classification to validate the effectiveness of GCLN on node-wise semi-supervised tasks.

Baseline Algorithms GCLN is compared with classical methods and GCN-based algorithms such as Chebyshev (Defferrard, Bresson, and Vandergheynst 2016), GCN (Kipf and Welling 2016b), GraphInfoMax (Velickovic et al. 2018), LGCN (Gao, Wang, and Ji 2018), StoGCN (Chen, Zhu, and Song 2018), DualGCN (Zhuang and Ma 2018), GAT (Velickovic et al. 2017), and g-U-Nets (Gao and Ji 2019).

Node Classification Setup We used an eight GCN-layer GCLN (except for the input layer and the layer with softmax) to conduct all the experiments. The first layer consists of 64 neurons and each following layer in the contracting path halves the number of neurons in the previous layer. Because of the symmetric architecture of GCLN, the first layer in the expanding path starts with 8 neurons and each subsequent layer doubles the number of neurons, until 64 neurons are reached in the last layer. Dropout of 0.9 and the ReLU activation function were applied after every graph convolutional operation. The learning rate was retained at 0.01 for all the experiments.

Table 1: Datasets for Node Classification and Clustering

Dataset  | # Nodes | # Edges | # Features | # Classes | # Training Nodes | # Validation Nodes | # Test Nodes | Label rate
Citeseer | 3,327   | 4,732   | 3,703      | 6         | 120              | 500                | 1,000        | 0.036
Cora     | 2,708   | 5,429   | 1,433      | 7         | 140              | 500                | 1,000        | 0.052
Pubmed   | 19,717  | 443,388 | 500        | 3         | 60               | 500                | 1,000        | 0.003

To verify the existence of the over-smoothing issue in neighborhood aggregation algorithms, we also set up experiments with an 8-layer GCN and GAT (the same number of layers as GCLN) to compare them with GCLN. For an even fairer comparison, we constructed residual 8-layer GCN and GAT models, which add a residual connection behind every layer, and compared them with GCLN. It is noteworthy that the residual 8-layer GCN and GAT contain many more parameters than GCLN.

Each experiment was conducted ten times and the average scores are reported as detailed below.

Node Classification Results The experimental results are summarized in Table 2. The GCN-based algorithms, such as GCN, GAT, LGCN, StoGCN and DualGCN, take advantage of the neighborhood aggregation operation to propagate the features of a node over its neighboring nodes, and outperformed classical methods such as TADW and DeepWalk.

Compared to the two-layer GCN and two-layer GAT, the classification performance of the 8-layer variants was significantly degraded (by more than 60%), which verifies the over-smoothing issue in the graph convolution-based methods.

GCLN outperformed or matched all fifteen state-of-the-art/classical peers, the 8-layer GCN and the 8-layer GAT (with and without residual connections) across all three data sets. The large margin (26.3% to 173.1%) between the classification results of GCLN and those of the 8-layer GCN and GAT directly proves that GCLN successfully addresses the over-smoothing problem of neighborhood aggregation-based algorithms caused by the repetitious application of the Laplacian smoothing operation.

To validate whether applying a residual mechanism to spectral graph convolutional algorithms could alleviate the over-smoothing problem, we set up experiments adding residual connections to the eight-layer GCN and GAT to vie with the normal eight-layer GCN and GAT as well as GCLN. The last five rows of Table 2 illustrate that adding a residual connection after each GCN layer dramatically enhanced performance when the network deepened (8 layers). However, there is still a large margin between GCLN and the residual GCN. For GAT, the residual connections resulted in a performance reduction compared to the 8-layer GAT on all three data sets. The last five sets of contrast experiments validate that the naïve residual connection is not an effective solution to the over-smoothing problem.

Table 2: Comparison of classification accuracy

Node Clustering

We evaluated the node-wise unsupervised predictive performance of GCLN by conducting a clustering task on the same real-world data sets used for the node classification. Following (Kipf and Welling 2016a), we remove the softmax layer in the model and reconstruct the topological information at the last layer, taking the resulting embedding of the given graph to train k-means for the node clustering task.
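A minimal sketch of this clustering protocol, assuming Z holds the node embeddings taken from the last GCLN layer with the softmax removed (random placeholder values here, not trained outputs):

import numpy as np
from sklearn.cluster import KMeans

# Z: (n, d) node embeddings from the trained GCLN (placeholder values for illustration).
Z = np.random.randn(2708, 64)     # e.g. a Cora-sized embedding with 64 dimensions
k = 7                             # number of clusters (Cora has 7 classes)

cluster_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z)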

Baseline Algorithms GCLN is compared with two groups of peers: 1) algorithms leveraging a single source of information, such as DNGR (Cao, Lu, and Xu 2016), GraphEncoder (Tian et al. 2014), GAE* (Kipf and Welling 2016a) and VGAE*; and 2) algorithms leveraging both content and structure, such as GAE, VGAE and ARGA. Due to the page limitation, some of the baselines are not listed in the table.

Table 3: Clustering results


Table 4: Datasets for Graph Classification

Dataset  | # Nodes (max) | # Nodes (avg.) | # Graphs | # Edges (avg.) | # Node Labels | # Classes | Source
D&D      | 5,748         | 284.32         | 1,178    | 715.65         | 82            | 2         | Bio
ENZYMES  | 126           | 32.60          | 600      | 63.14          | 6             | 6         | Bio
PROTEINS | 620           | 39.06          | 1,113    | 72.81          | 4             | 2         | Bio

Clustering Results Table 3 lists the experimental results on Cora, Citeseer and Pubmed respectively. We conducted experiments with ten other baselines and with more metrics, including normalized mutual information (NMI), precision, and adjusted Rand index (ARI). We only present six representative peers due to the limited space.

GCLN outperforms all baselines on Cora and Citeseer under the five metrics, and achieves competitive performance compared with the latest algorithms on Pubmed. For example, on Cora, GCLN improves accuracy by margins ranging from 7.5% (compared with ARGA) to 69.1% (compared with RMSC); raises the F1 score by 6.8% (compared with ARGA) up to 79.6% (compared with K-means); and improves NMI by 21.8% (compared with TADW) up to 72.3% (compared with DNGR).

The results from methods such as DeepWalk and BigClam, which only consider a single source of information from the given graph, are inferior to the performance of those that simultaneously leverage topological structure and node characteristics. The graph convolutional models with adversarial regularization also show their superiority over their peers on the clustering task. GCLN, with 8 layers, is not affected by the over-smoothing issue and maintains the optimal performance over all sixteen baselines.

Graph Classification

We validated the graph-wise performance of GCLN with the differentiable pooling module by conducting graph classification on three biological graph data sets summarized in Table 4.

Baseline Algorithms We compared GCLN with 1) graph kernel-based methods such as GK (Shervashidze et al. 2009), DGK (Yanardag and Vishwanathan 2015) and WL-OA (Kriege, Giscard, and Wilson 2016); 2) deep learning models such as DCNN (Atwood and Towsley 2016), ECC (Simonovsky and Komodakis 2017), DGCNN (Zhang et al. 2018) and DIFFPOOL; and 3) CapsuleNet-based methods such as GCAPS-CNN (Verma and Zhang 2018) and GCAPS (Xinyi and Chen 2019).

Experimental Setup To objectively evaluate the effectiveness of GCLN on graph-wise level prediction, we applied ten-fold cross-validation to conduct the graph classification experiments. Specifically, eight folds were used to train the model, one fold was used as the validation set for hyper-parameter adjustment, and the remaining fold was used for testing. Each experiment was conducted ten times and the average accuracy is reported.
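A sketch of this ten-fold protocol using scikit-learn's KFold (graph indices and the train/tune/evaluate helpers are placeholders; the fold carving below is one way to approximate the 8/1/1 split described above):

import numpy as np
from sklearn.model_selection import KFold

graph_indices = np.arange(1113)                  # e.g. PROTEINS has 1,113 graphs
kf = KFold(n_splits=10, shuffle=True, random_state=0)

for rest_idx, test_idx in kf.split(graph_indices):
    # Carve one more fold-sized chunk out of the remainder for validation,
    # leaving roughly eight folds for training.
    val_idx, train_idx = rest_idx[:len(test_idx)], rest_idx[len(test_idx):]
    # train_model(train_idx); tune_hyperparams(val_idx); evaluate(test_idx)  # placeholders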

Graph Classification Results Table 5 compares the performance of GCLN at a graph-wise level with the other baselines. The results demonstrate that GCLN obtains the optimal average performance over all graph classification algorithms. The 8-layer GCLN with DIFFPOOL increases the accuracy compared to the 2-layer GraphSAGE (Hamilton, Ying, and Leskovec 2017) with DIFFPOOL. GCLN has also outperformed the CapsuleNet-based algorithms by around 4.3% accuracy, as well as the graph kernel-based methods by more than 10%. The accuracy of the graph classification shows that the architecture of GCLN not only tackles the over-smoothing problem at a node-wise level, but also achieves stable performance at a graph-wise level.

Table 5: Graph Classification Results

Conclusion

Neighborhood aggregation algorithms inevitably suffer from the over-smoothing problem when the depth of the network is increased, because repeatedly conducting the Laplacian smoothing operation leads to the features of neighboring connected nodes in the graph converging to the same values. In this paper, we have proposed graph convolutional ladder-shape networks (GCLN), which address the over-smoothing problem with a symmetric ladder-shape architecture. The network consists of a contracting path, an expanding path and contextual feature channels which characterize and enhance indistinguishable features by fusing the corresponding contextual information from the contracting side to the deeper layers. A comparison of the results of our experiments against classical and state-of-the-art peers, the 8-layer GCN, the 8-layer GAT and their residual versions proves the superiority of our method.

We have experimentally evaluated the performance of GCLN from the perspective of a node-wise semi-supervised task and a node-wise unsupervised task as well as a graph-wise task. All the experimental results validate the optimal effectiveness of GCLN from multiple perspectives.


References

Atwood, J., and Towsley, D. 2016. Diffusion-convolutional neural networks. In NIPS, 1993–2001.
Bruna, J.; Zaremba, W.; Szlam, A.; and LeCun, Y. 2013. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203.
Cao, S.; Lu, W.; and Xu, Q. 2015. GraRep: Learning graph representations with global structural information. In CIKM, 891–900. ACM.
Cao, S.; Lu, W.; and Xu, Q. 2016. Deep neural networks for learning graph representations. In AAAI, 1145–1152.
Chen, J.; Zhu, J.; and Song, L. 2018. Stochastic training of graph convolutional networks with variance reduction. In ICML, 941–949.
Dai, H.; Kozareva, Z.; Dai, B.; Smola, A.; and Song, L. 2018. Learning steady-states of iterative algorithms over graphs. In ICML, 1114–1122.
Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 3844–3852.
Gao, H., and Ji, S. 2019. Graph U-Nets. In Proceedings of The 36th International Conference on Machine Learning.
Gao, H.; Wang, Z.; and Ji, S. 2018. Large-scale learnable graph convolutional networks. In KDD, 1416–1424. ACM.
Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; and Dahl, G. E. 2017. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212.
Grover, A., and Leskovec, J. 2016. node2vec: Scalable feature learning for networks. In KDD, 855–864. ACM.
Hamilton, W. L.; Ying, R.; and Leskovec, J. 2017. Inductive representation learning on large graphs. In NIPS.
Kipf, T. N., and Welling, M. 2016a. Variational graph auto-encoders. NIPS.
Kipf, T. N., and Welling, M. 2016b. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
Klicpera, J.; Bojchevski, A.; and Günnemann, S. 2018. Combining neural networks with personalized PageRank for classification on graphs.
Kriege, N. M.; Giscard, P.-L.; and Wilson, R. 2016. On valid optimal assignment kernels and applications to graph classification. In NIPS, 1623–1631.
Li, Q.; Han, Z.; and Wu, X.-M. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. arXiv preprint arXiv:1801.07606.
Liu, L.; Zhou, T.; Long, G.; Jiang, J.; Yao, L.; and Zhang, C. 2019a. Prototype propagation networks (PPN) for weakly-supervised few-shot learning on category graph. In IJCAI.
Liu, L.; Zhou, T.; Long, G.; Jiang, J.; and Zhang, C. 2019b. Learning to propagate for graph meta-learning. In NeurIPS.
Pan, S.; Wu, J.; Zhu, X.; Zhang, C.; and Wang, Y. 2016. Tri-party deep network representation. Network 11(9):12.
Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 234–241. Springer.
Sen, P.; Namata, G.; Bilgic, M.; Getoor, L.; Galligher, B.; and Eliassi-Rad, T. 2008. Collective classification in network data. AI Magazine 29(3):93.
Shervashidze, N.; Vishwanathan, S.; Petri, T.; Mehlhorn, K.; and Borgwardt, K. 2009. Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, 488–495.
Simonovsky, M., and Komodakis, N. 2017. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In CVPR, 3693–3702.
Tang, J.; Qu, M.; Wang, M.; Zhang, M.; Yan, J.; and Mei, Q. 2015. LINE: Large-scale information network embedding. In WWW, 1067–1077. International World Wide Web Conferences Steering Committee.
Tian, F.; Gao, B.; Cui, Q.; et al. 2014. Learning deep representations for graph clustering. In AAAI, 1293–1299.
Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; and Bengio, Y. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 1(2).
Velickovic, P.; Fedus, W.; Hamilton, W. L.; Lio, P.; Bengio, Y.; and Hjelm, R. D. 2018. Deep graph infomax. arXiv preprint arXiv:1809.10341.
Verma, S., and Zhang, Z.-L. 2018. Graph capsule convolutional neural networks. arXiv preprint arXiv:1805.08090.
Wang, X.; Cui, P.; Wang, J.; Pei, J.; Zhu, W.; and Yang, S. 2017. Community preserving network embedding. In AAAI, 203–209.
Xinyi, Z., and Chen, L. 2019. Capsule graph neural network. In ICLR.
Xu, K.; Li, C.; Tian, Y.; Sonobe, T.; Kawarabayashi, K.-i.; and Jegelka, S. 2018. Representation learning on graphs with jumping knowledge networks. arXiv preprint arXiv:1806.03536.
Yanardag, P., and Vishwanathan, S. 2015. Deep graph kernels. In SIGKDD, 1365–1374. ACM.
Yang, C.; Liu, Z.; Zhao, D.; Sun, M.; and Chang, E. Y. 2015. Network representation learning with rich text information. In IJCAI, 2111–2117.
Ying, Z.; You, J.; Morris, C.; Ren, X.; Hamilton, W.; and Leskovec, J. 2018. Hierarchical graph representation learning with differentiable pooling. In NIPS, 4800–4810.
Zhang, Q.; Wu, J.; Yang, H.; Lu, W.; Long, G.; and Zhang, C. 2016. Global and local influence-based social recommendation. In CIKM, 1917–1920. ACM.
Zhang, M.; Cui, Z.; Neumann, M.; and Chen, Y. 2018. An end-to-end deep learning architecture for graph classification. In Thirty-Second AAAI Conference on Artificial Intelligence.
Zhuang, C., and Ma, Q. 2018. Dual graph convolutional networks for graph-based semi-supervised classification. In WWW, 499–508.