
Page 1

CSE 6240: Web Search and Text Mining. Spring 2020

Node Representation Learning

Prof. Srijan Kumar (http://cc.gatech.edu/~srijan)

Page 2

Administrivia
• Project midterm rubric released
  – Discussion at the end
• Proposal regrades done

Page 3

Today’s Lecture
• Introduction
• Node embedding setup
• Random walk approaches for node embedding
• Project midterm rubric

These slides are inspired by Prof. Jure Leskovec’s CS224W lecture.

Page 4

Machine Learning in Networks

[Figure: a network whose nodes are marked with “?”, feeding into a machine learning model]

• Networks are complex
• We need a uniform language to process various networks

Page 5

Example: Node Classification
• Classifying the function of proteins in the interactome

Image from: Ganapathiraju et al. 2016. Schizophrenia interactome with 504 novel protein–protein interactions. Nature.

Page 6

Example: Link Prediction

[Figure: a network with candidate edges marked “?” and “x”, feeding into a machine learning model]

• Which links exist in the network?

Page 7

Machine Learning Lifecycle

[Figure: Raw Data → Structured Data → Learning Algorithm → Model → Downstream task; the Raw Data → Structured Data step is feature engineering, which we want to replace by automatically learning the features]

• The typical machine learning lifecycle requires feature engineering every single time!
• Goal: avoid task-specific feature engineering

Page 8

Feature Learning in Graphs
• Goal: efficient, task-independent feature learning for machine learning with graphs!
• Learn a mapping f: u → ℝ^d from a node u to its d-dimensional feature representation (embedding)

Page 9

What is network embedding?
• We map each node in a network into a low-dimensional space
  – Distributed representations for nodes
  – Similarity of embeddings between nodes indicates their network similarity (e.g., link strength)
  – Encode network information and generate node representations

Page 10

Example Node Embedding
• 2D embeddings of the nodes of Zachary’s Karate Club network

[Figure: the Karate Club graph and its learned 2D node embeddings]

Image from: Perozzi et al. DeepWalk: Online Learning of Social Representations. KDD 2014.

Page 11

Why Is It Hard?
• The modern deep learning toolbox is designed for simple sequences or grids
  – CNNs for fixed-size images/grids
  – RNNs or word2vec for text/sequences

Page 12

Why Is It Hard?
• But networks are far more complex!
  – Complex topological structure (i.e., no spatial locality as in grids)
  – No fixed node ordering or reference point (i.e., the isomorphism problem)
  – Often dynamic and with multimodal features

Page 13

Today’s Lecture
• Introduction
• Node embedding setup
• Random walk approaches for node embedding
• Project midterm rubric

Page 14

Framework Setup
• Assume we have a graph G:
  – V is the vertex set
  – A is the adjacency matrix (assume binary)
  – No node features or extra information is used!

Page 15

Embedding Nodes

• Goal: Encode nodes so that similarity in the embedding space (e.g., dot product) approximates similarity in the original network

Page 16

Embedding Nodes

• Goal: similarity(u, v) ≈ z_v^T z_u
  – similarity(u, v): similarity in the original network (we need to define this!)
  – z_v^T z_u: similarity of the embeddings

Page 17

Learning Node Embeddings
1. Define an encoder (i.e., a mapping from nodes to embeddings)
2. Define a node similarity function (i.e., a measure of similarity in the original network)
3. Optimize the parameters of the encoder so that:
   similarity(u, v) ≈ z_v^T z_u
   (similarity in the original network ≈ similarity of the embeddings)

Page 18

Two Key Components
• Encoder: maps each node to a low-dimensional vector
  – enc(v) = z_v, where v is a node in the input graph and z_v is its d-dimensional embedding
• Similarity function: specifies how the relationships in vector space map to the relationships in the original network
  – similarity(u, v) ≈ z_v^T z_u
    (similarity of u and v in the original network ≈ dot product between node embeddings)
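As a concrete illustration (not from the slides), the simplest possible encoder is an embedding lookup with dot-product similarity; the class and parameter names below are my own:

```python
import numpy as np

class ShallowEncoder:
    """Embedding lookup: every node gets its own learnable d-dimensional vector."""
    def __init__(self, num_nodes, dim, seed=0):
        rng = np.random.default_rng(seed)
        # One row per node; the rows are the parameters z_v to be optimized.
        self.Z = rng.normal(scale=0.1, size=(num_nodes, dim))

    def encode(self, v):
        return self.Z[v]               # enc(v) = z_v

    def similarity(self, u, v):
        return self.Z[u] @ self.Z[v]   # dot product in embedding space

# Example: 34 nodes (e.g., the Karate Club graph), 16-dimensional embeddings
enc = ShallowEncoder(num_nodes=34, dim=16)
print(enc.similarity(0, 1))
```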

Page 19

How to Define Node Similarity?
• The key design choice of these methods is how they define node similarity
• E.g., should two nodes have similar embeddings if they…
  – are connected?
  – share neighbors?
  – have similar “structural roles”?
  – …?

Page 20

Today’s Lecture
• Introduction
• Node embedding setup
• Random walk approaches for node embedding
• Project midterm rubric

Page 21

Random Walk
• Given a graph and a starting point, we select a neighbor of it at random and move to this neighbor; then we select a neighbor of this point at random, move to it, and so on.
• The (random) sequence of points selected this way is a random walk on the graph.
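As a minimal sketch (my own, not from the slides), a uniform random walk over a graph stored as an adjacency list might look like this; the dict-of-lists representation and the function name are illustrative:

```python
import random

def random_walk(adj, start, length, seed=None):
    """Uniform random walk: repeatedly hop to a uniformly chosen neighbor."""
    rng = random.Random(seed)
    walk = [start]
    for _ in range(length):
        neighbors = adj[walk[-1]]
        if not neighbors:          # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk

# Example: a tiny graph as an adjacency list
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(random_walk(adj, start=0, length=5, seed=42))
```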

Page 22

Random-Walk Node Similarity

z_u^T z_v ≈ probability that u and v co-occur on a random walk over the network

Page 23

Random-Walk Embeddings
• Estimate the probability of visiting node v on a random walk starting from node u, using some random-walk strategy R
• Learn node embeddings such that nodes that are nearby in the network end up close together in the embedding space
  – Similarity here: dot product

Page 24

Unsupervised Feature Learning
• Given a node u, how do we define nearby nodes?
  – N_R(u) = neighborhood of u obtained by some random-walk strategy R
• Different neighborhood definitions give different algorithms
  – We will look at DeepWalk and node2vec

Page 25

Random Walk Optimization
1. Run short fixed-length random walks starting from each node of the graph using some strategy R
2. For each node u, collect N_R(u), the multiset of nodes visited on random walks starting from u
   – N_R(u) can have repeated elements since nodes can be visited multiple times on random walks
3. Optimize the embeddings

Page 26

Random Walk Optimization

L = Σ_{u ∈ V} Σ_{v ∈ N_R(u)} −log( exp(z_u^T z_v) / Σ_{n ∈ V} exp(z_u^T z_n) )

  – Outer sum: over all nodes u
  – Inner sum: over nodes v seen on random walks starting from u
  – Softmax fraction: predicted probability of u and v co-occurring on a random walk

• Intuition: give a high score (embedding dot product) to pairs of nodes that appear together on random walks, and low probability to all other nodes
• The softmax denominator is expensive to calculate over all node pairs
• Solution: use negative sampling
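A rough sketch (my own illustration, not the course code) of this objective approximated with negative sampling: the full softmax is replaced by sigmoid terms over one positive pair and k random negatives; all names and the choice of k are assumptions.

```python
import numpy as np

def neg_sampling_loss(Z, positive_pairs, num_nodes, k=5, rng=None):
    """Negative-sampling approximation of the random-walk softmax objective.

    Z: (num_nodes, d) array of node embeddings.
    positive_pairs: iterable of (u, v) pairs with v in N_R(u),
                    i.e., v was seen near u on a random walk.
    """
    rng = rng or np.random.default_rng(0)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    loss = 0.0
    for u, v in positive_pairs:
        # Positive pair: push the dot product z_u . z_v up.
        loss -= np.log(sigmoid(Z[u] @ Z[v]))
        # k randomly sampled negatives: push z_u . z_n down.
        for n in rng.integers(0, num_nodes, size=k):
            loss -= np.log(sigmoid(-(Z[u] @ Z[n])))
    return loss
```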

Page 27

DeepWalk [Perozzi et al., 2014]
• What strategy should we use to run these random walks?
• Simplest idea: just run fixed-length, unbiased random walks starting from each node (i.e., DeepWalk, Perozzi et al., 2014)
  – The issue is that this notion of similarity is too constrained
  – node2vec generalizes it
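For concreteness, a hedged sketch of the DeepWalk pipeline: treat fixed-length unbiased walks as “sentences” and feed them to word2vec. It assumes gensim 4.x and a graph where every node has at least one neighbor; the parameter values are illustrative rather than the paper’s exact settings.

```python
import random
from gensim.models import Word2Vec  # assumes gensim 4.x

def deepwalk_embeddings(adj, num_walks=10, walk_length=40, dim=128, window=5, seed=0):
    """DeepWalk sketch: unbiased fixed-length walks from every node, then skip-gram."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adj:
            walk = [start]
            for _ in range(walk_length - 1):
                walk.append(rng.choice(adj[walk[-1]]))
            # Each walk becomes a "sentence" of node tokens for word2vec.
            walks.append([str(n) for n in walk])
    model = Word2Vec(walks, vector_size=dim, window=window,
                     min_count=0, sg=1, workers=4, epochs=5)
    return {node: model.wv[str(node)] for node in adj}
```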

Page 28

DeepWalk Example
• 2D embeddings of the nodes of Zachary’s Karate Club network

[Figure: the Karate Club graph and its learned 2D node embeddings]

Image from: Perozzi et al. DeepWalk: Online Learning of Social Representations. KDD 2014.

Page 29

node2vec [Grover et al., 2016]
• Goal: embed nodes with similar network neighborhoods close together in the feature space
  – Frame this goal as a maximum-likelihood optimization problem, independent of the downstream prediction task
• Key idea: develop a biased 2nd-order random walk R to generate the network neighborhood N_R(u) of node u

Page 30

node2vec: Biased Walks
• Idea: use flexible, biased random walks that can trade off between local and global views of the network (Grover and Leskovec, 2016)

[Slide embeds the first page of the node2vec paper (Grover and Leskovec, “node2vec: Scalable Feature Learning for Networks”, KDD 2016), including Figure 1: BFS and DFS search strategies from node u (k = 3).]

Page 31

node2vec: Biased Walks
• Two strategies to define a neighborhood N_R(u) of a given node u: BFS and DFS
• Walk of length 3 (N_R(u) of size 3):
  – N_BFS(u) = {s1, s2, s3}: local, microscopic view
  – N_DFS(u) = {s4, s5, s6}: global, macroscopic view

[Slide again embeds the first page of the node2vec paper, with Figure 1: BFS and DFS search strategies from node u (k = 3).]
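To make the BFS/DFS contrast concrete, here is an illustrative sketch (my own, with invented function names) of the two ways to collect a size-k neighborhood of u:

```python
from collections import deque

def bfs_neighborhood(adj, u, k):
    """First k nodes reached in breadth-first order from u (local, microscopic view)."""
    visited, order, queue = {u}, [], deque([u])
    while queue and len(order) < k:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in visited:
                visited.add(nbr)
                order.append(nbr)
                queue.append(nbr)
    return order[:k]

def dfs_neighborhood(adj, u, k):
    """First k nodes reached in depth-first order from u (global, macroscopic view)."""
    visited, order, stack = set(), [], [u]
    while stack and len(order) < k:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        if node != u:
            order.append(node)
        stack.extend(reversed(adj[node]))   # explore neighbors left to right
    return order
```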

Page 32

Interpolating BFS and DFS
• Biased fixed-length random walk R that, given a node u, generates a neighborhood N_R(u)
• Two parameters:
  – Return parameter p: return back to the previous node
  – In-out parameter q: moving outwards (DFS) vs. inwards (BFS)
    • Intuitively, q is the “ratio” of BFS vs. DFS

Page 33

Biased Random Walks
• Biased 2nd-order random walks explore network neighborhoods:
  – The random walk just traversed edge (s1, w) and is now at w
  – The neighbors of w can only be: back to s1, at the same distance to s1 (e.g., s2), or farther from s1 (e.g., s3)
• Idea: remember where the walk came from

[Figure: node w with neighbors s1 (back to s1), s2 (same distance to s1), and s3 (farther from s1)]

Page 34

Biased Random Walks
• The walker came over edge (s1, w) and is now at w. Where to go next?
  – Back to s1 with unnormalized probability 1/p
  – To s2 (same distance from s1) with unnormalized probability 1
  – To s3 or s4 (farther from s1) with unnormalized probability 1/q
• p and q model the transition probabilities
  – p = return parameter, q = “walk away” parameter
• BFS-like walk: low value of p
• DFS-like walk: low value of q
• 1/p, 1, and 1/q are unnormalized probabilities
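A sketch (illustrative, not the reference implementation) of this transition rule: given the previous node and the current node, weight each candidate next node by 1/p, 1, or 1/q and sample. It assumes walks of length at least 2, every node has a neighbor, and the helper names are my own.

```python
import random

def node2vec_step(adj, prev, curr, p, q, rng=random):
    """One biased 2nd-order step: weight each neighbor of `curr` by its
    relation to `prev` (1/p to go back, 1 for same distance, 1/q to move
    farther away), then sample one of them."""
    prev_neighbors = set(adj[prev])
    candidates = adj[curr]
    weights = []
    for nxt in candidates:
        if nxt == prev:                 # return to the previous node
            weights.append(1.0 / p)
        elif nxt in prev_neighbors:     # same distance (1 hop) from prev
            weights.append(1.0)
        else:                           # farther away from prev
            weights.append(1.0 / q)
    return rng.choices(candidates, weights=weights, k=1)[0]

def node2vec_walk(adj, start, length, p, q, rng=random):
    """Biased walk: the first step is uniform, then repeated 2nd-order steps."""
    walk = [start, rng.choice(adj[start])]
    while len(walk) < length:
        walk.append(node2vec_step(adj, walk[-2], walk[-1], p, q, rng))
    return walk
```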

Page 35

Experiments: Micro vs. Macro
• Graph = interactions of characters in a novel (Les Misérables)
• Node color = identified communities

[Slide shows Figure 3 from the node2vec paper: complementary visualizations of the Les Misérables co-appearance network generated by node2vec, with label colors reflecting homophily (top) and structural equivalence (bottom). The slide also embeds the paper’s Table 2: Macro-F1 scores for multi-label classification on BlogCatalog, PPI (Homo sapiens), and Wikipedia with a balanced 50% train-test split.]

Algorithm | BlogCatalog | PPI | Wikipedia
Spectral Clustering | 0.0405 | 0.0681 | 0.0395
DeepWalk | 0.2110 | 0.1768 | 0.1274
LINE | 0.0784 | 0.1447 | 0.1164
node2vec | 0.2581 | 0.1791 | 0.1552
node2vec settings (p, q) | (0.25, 0.25) | (4, 1) | (4, 0.5)
Gain of node2vec [%] | 22.3 | 1.3 | 21.8

• p = 1, q = 2: microscopic view of the network neighbourhood
• p = 1, q = 0.5: macroscopic view of the network neighbourhood

Page 36

Experiments: Datasets
• BlogCatalog: social network of bloggers
  – 10,312 nodes, 333,983 edges, 39 labels
• Protein-protein interactions (PPI): subgraph of the PPI network for Homo sapiens
  – 3,890 nodes, 76,584 edges, 50 labels
• Wikipedia: co-occurrence network of words
  – 4,777 nodes, 184,812 edges, 40 labels (part-of-speech tags of words)

Page 37

Experiments: Results

Page 38

How to Use Embeddings
• How to use the embeddings z_i of nodes:
  – Clustering/community detection: cluster the points z_i
  – Node classification: predict the label f(z_i) of node i based on z_i
  – Link prediction: predict edge (i, j) based on f(z_i, z_j)
    • Here we can concatenate, average, take a product, or take a difference between the embeddings:
      – Concatenate: f(z_i, z_j) = g([z_i, z_j])
      – Hadamard: f(z_i, z_j) = g(z_i ∗ z_j) (per-coordinate product)
      – Sum/Avg: f(z_i, z_j) = g(z_i + z_j)
      – Distance: f(z_i, z_j) = g(||z_i − z_j||_2)
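A small illustrative sketch (function and operator names are my own) of these edge-feature operators; the resulting vector would typically be fed to a downstream classifier g(·), e.g., a logistic regression for link prediction.

```python
import numpy as np

def edge_features(z_i, z_j, op="hadamard"):
    """Combine two node embeddings into one edge feature vector."""
    if op == "concat":
        return np.concatenate([z_i, z_j])
    if op == "hadamard":
        return z_i * z_j                       # per-coordinate product
    if op == "avg":
        return (z_i + z_j) / 2.0
    if op == "l2":
        return np.array([np.linalg.norm(z_i - z_j)])
    raise ValueError(f"unknown operator: {op}")

# Example: combine two random 16-dimensional embeddings
z_i, z_j = np.random.rand(16), np.random.rand(16)
print(edge_features(z_i, z_j, op="concat").shape)   # (32,)
```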

Page 39

Today’s Lecture
• Introduction
• Node embedding setup
• Random walk approaches for node embedding
• Project midterm rubric

Page 40

Project Midterm Expectations
• Details are here

Page 41

Today’s Lecture
• Introduction
• Node embedding setup
• Random walk approaches for node embedding
• Project midterm rubric