DeepWalk: Online Learning of Social Representations¹
Presented by Carlos Oliver for COMP 766
January 20, 2020
¹ Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. "DeepWalk: Online Learning of Social Representations." Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014.
Motivation
▶ Can we predict the label of a node given the graph?
▶ Problem: the labels are not i.i.d., so traditional methods can't be used; structured learning models (MRFs, graph kernels, etc.) are needed.
Idea
▶ Separate the labels from the underlying structure.
▶ Existing approaches:
  ▶ Graph statistics (neighbourhood overlap, ...)
  ▶ Spectral clustering
▶ Limitations: these often require the full graph to compute, or domain-specific knowledge.
Relationship between social graphs and natural language
▶ Modeling the co-occurrence of vertices gives us information similar to measuring the co-occurrence of words.
▶ Seeing words co-occur tells us about the structure of the language (or, here, the graph).
Random walks ∼ sentences
▶ Since co-occurrence tells us about structural context, sample truncated random walks starting from each node (a minimal sampler is sketched below).
▶ Bonus: this is a natural way to split up the graph, i.e., we don't need the whole graph to get a node's embedding.
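A minimal sketch of such a sampler in Python, assuming the graph is an adjacency-list dict; the names (adj, random_walk, walk_length) are illustrative, not the paper's code:

    import random

    def random_walk(adj, start, walk_length, rng=random):
        # adj: dict mapping each node to a list of its neighbours.
        # Returns one truncated random walk rooted at `start`.
        walk = [start]
        for _ in range(walk_length - 1):
            neighbours = adj[walk[-1]]
            if not neighbours:  # dead end: stop the walk early
                break
            walk.append(rng.choice(neighbours))
        return walk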
Learning objective
▶ Given a random walk (v_1, ..., v_i), we update the representation Φ(v_i) ∈ ℝ^d to maximize the likelihood of the walk:

  \arg\min_{\Phi} \, -\log P(v_1, \ldots, v_{i-1} \mid \Phi(v_i))

▶ Probabilities are given by a multi-label classifier that maps Φ(v_i) → V^k, where k is the length of the walk (see the factorization sketched below).
▶ Nodes with similar Φ will have similar local graph 'structure'.
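To make this joint probability tractable, the paper relaxes it with an independence assumption, so the walk likelihood factorizes into per-vertex terms, roughly:

  P(v_1, \ldots, v_{i-1} \mid \Phi(v_i)) = \prod_{j=1}^{i-1} P(v_j \mid \Phi(v_i))

Each factor then becomes a single prediction problem over the vertex vocabulary.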
Speedups: SkipGram
▶ Skip-gram lets us split the walk into sliding windows of size w and update the nodes in each window using SGD (see the sketch below).
▶ \arg\min_{\Phi} \, -\log P(v_1, \ldots, v_{i-1} \mid \Phi(v_i)) becomes
▶ \arg\min_{\Phi} \, -\log P(v_{i-w}, \ldots, v_{i-1}, v_{i+1}, \ldots, v_{i+w} \mid \Phi(v_i))
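A sketch of the windowing in Python; skipgram_pairs is a hypothetical helper (the real SkipGram folds the SGD step into the same pass):

    def skipgram_pairs(walk, w):
        # Slide a window of radius w over the walk and yield
        # (center, context) pairs; each pair then gets one SGD
        # step on -log P(context | Phi(center)).
        for i, center in enumerate(walk):
            lo, hi = max(0, i - w), min(len(walk), i + w + 1)
            for j in range(lo, hi):
                if j != i:
                    yield center, walk[j]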
Speedups: Hierarchical Softmax
▶ Hierarchical softmax reduces the cost of the softmax normalization from O(|V|) to O(log |V|) by building a tree of binary classifiers (a path-probability sketch follows).
▶ If (b_1, ..., b_{⌈log |V|⌉}) is the sequence of tree nodes on the path from the root to the leaf for vertex v_k, then

  P(v_k \mid \Phi(v_j)) = \prod_{l=1}^{\lceil \log |V| \rceil} P(b_l \mid \Phi(v_j))
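A sketch of the path probability in Python, assuming each internal tree node holds a vector ψ and acts as a logistic binary classifier; the path encoding and names are assumptions, not the paper's code:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def path_probability(phi_v, path):
        # path: list of (psi, go_right) pairs along the root-to-leaf
        # path for the target vertex; each step is a binary decision.
        p = 1.0
        for psi, go_right in path:
            p_right = sigmoid(np.dot(phi_v, psi))
            p *= p_right if go_right else (1.0 - p_right)
        return p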
Training loop
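The training loop follows the paper's Algorithm 1: several passes over shuffled vertices, one truncated walk per vertex per pass, then Skip-gram updates on each walk. A rough Python sketch, reusing the random_walk and skipgram_pairs helpers above (sgd_step is a hypothetical callback):

    import random

    def deepwalk(adj, passes, walk_length, window, sgd_step):
        # One truncated walk per vertex per pass; shuffling the
        # vertex order speeds up SGD convergence.
        nodes = list(adj)
        for _ in range(passes):
            random.shuffle(nodes)
            for v in nodes:
                walk = random_walk(adj, v, walk_length)
                for center, context in skipgram_pairs(walk, window):
                    sgd_step(center, context)  # one update per pair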
Evaluation
▶ Task: multi-label node classification on large social graphs.
▶ Evaluation: micro- and macro-averaged F1 scores (illustrated below).
  ▶ The F1 score is the harmonic mean of precision and recall.
  ▶ Macro-F1 is the arithmetic mean of the per-class F1 scores.
  ▶ Micro-F1 pools the counts of correct labels over all samples and classes before computing F1.
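A quick illustration of the two averages with scikit-learn, on made-up multi-label indicator matrices:

    import numpy as np
    from sklearn.metrics import f1_score

    # Toy data: rows are nodes, columns are binary label indicators.
    y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
    y_pred = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 0]])

    print(f1_score(y_true, y_pred, average="macro"))  # mean of per-class F1
    print(f1_score(y_true, y_pred, average="micro"))  # F1 from pooled counts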
Parameter Sensitivity
Summary
▶ Modelling random walks as sentences in a language gives us:
  ▶ Continuous & low-dimensional representations → robustness
  ▶ Online training → walks naturally partition the graph
  ▶ A scalable implementation → SkipGram & hierarchical softmax
Limitations
▶ The representations themselves are left unexplored.
▶ The number of parameters grows ∝ the number of nodes in the graph.
▶ Similarity is only defined between nodes of a single graph.