link prediction: a perspective
TRANSCRIPT
LINK PREDICTION: A PERSPECTIVE
Nitesh ChawlaStudents: Ryan Lichtenwalter, Jake LussierDepartment of computer Science and Engineering,iCeNSA, University of Notre Dame
“Ideas that connect”
Emergence of network structure• Certain nodes are more likely to develop a connection
between them• Certain connections more or less likely to dissolve over
time• What are the local and global phenomena that might
impact this process? What is the predictability in this process?
At the level of a whole network, link formation or decay may indicate changes in community structure
From an ego-centric viewpoint, it may indicate volatility, changes in peer groups
The link prediction problem• Prediction
• Given a graph G = (V,E) at time t, predict the set of edges EL’ ∉ Ethat appear in t’ and EL’c∉ E that do not appear in t’ based on either topological relationships between pairs Vi and Vj, attributes of the vertices Vi and Vj, or both of these.
Number of applications in social networks
• Security: Probable relationships in malicious networks
• Social: efficiently and effectively find companions, colleagues
• Anomalous link discovery: discover surprising links, given sparse networks
A perspective• Network types
• Different size, attributes
• Predictors• Unsupervised and supervised
• Network and predictors characteristics• Highly imbalanced• Saturation effects• Geodesic distance and temporality
• Evaluation• Skewed representations
Common Unsupervised Predictors• Degree
• in• out
• Neighbors• Jaccard’s Coefficient• Adamic/Adar
• Path• PropFlow• Rooted PageRank• Katz• Hitting Time• Commute Time
s s t t
s
t1
t2
s
t1
t2
PropFlow• Represents the probability of
information transfer from i to jbased on random transmission along all paths using edge weights.
Typical Prediction Model
Actual Negative
Actual Positive
Predict Negative
TN FN
Predict Positive
FP TP
Positives
Challenge: High Imbalance
The class imbalance ratio for link prediction in a sparse network islower-bounded by the number of vertices in the network.
Challenge: Evaluation
• Extreme sparsity• If I predict no new links, I will be very accurate
• Use ROC curves• Threshold agnostic
Supervised learning: High Performance Link Predictor: HPLP
Feature vector (ego/node level, dyad level, and triad level)
• Sampling• Bagged Hellinger
Distance Decision Trees
Classifier
Available as part of comprehensive software suite:
LpMade
Summary• Links form according to a competition between local and
global influences, and the dominant factor depends upon the network
• Our proposed supervised learning method, HPLP, outperforms unsupervised methods by as much as 30% AUROC
• Resulting data sets present a whole new world of imbalanced data challenges
• Combine the node and edge attributes with topology
Acknowledgements• Research sponsored by:
• Army Research Laboratory (ARL) Cooperative Agreement Number W911NF-09-2-0053
• National Science Foundation (NSF) Grant BCS-0826958
For papers and software, please visit
http://www.nd.edu/~dialhttp://icensa.nd.edu
Email: [email protected]