TRANSCRIPT
Graph-based Proximity Measures
Practical Graph Mining with R
Nagiza F. Samatova, William Hendrix, John Jenkins, Kanchana Padmanabhan, Arpan Chakraborty
Department of Computer Science, North Carolina State University
Outline
• Defining Proximity Measures
• Neumann Kernels
• Shared Nearest Neighbor
Similarity and Dissimilarity
• Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0, 1]
– Examples: Cosine, Jaccard, Tanimoto
• Dissimilarity
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
• Proximity refers to a similarity or dissimilarity
Src: “Introduction to Data Mining” by Vipin Kumar et al
Distance Metric
• Distance d (p, q) between two points p and q is a dissimilarity measure if it satisfies:
1. Positive definiteness:
d (p, q) ≥ 0 for all p and q and
d (p, q) = 0 only if p = q.
2. Symmetry: d (p, q) = d (q, p) for all p and q.
3. Triangle Inequality:
d (p, r) ≤ d (p, q) + d (q, r) for all points p, q, and r.
• Examples:
– Euclidean distance
– Minkowski distance
– Mahalanobis distance
Src: “Introduction to Data Mining” by Vipin Kumar et al
Is this a distance metric?

Let $p = (p_1, p_2, ..., p_d) \in \mathbb{R}^d$ and $q = (q_1, q_2, ..., q_d) \in \mathbb{R}^d$.

$d(p, q) = \max_{1 \le j \le d}(p_j, q_j)$   (fails: positive definiteness)
$d(p, q) = \max_{1 \le j \le d}(p_j - q_j)$   (fails: symmetry)
$d(p, q) = \min_{1 \le j \le d}|p_j - q_j|$   (fails: triangle inequality)

Distance Metric

$d(p, q) = \sqrt{\sum_{j=1}^{d}(p_j - q_j)^2}$   (satisfies all three properties: Euclidean distance)
Distance: Euclidean, Minkowski, Mahalanobis

Let $p = (p_1, p_2, ..., p_d) \in \mathbb{R}^d$ and $q = (q_1, q_2, ..., q_d) \in \mathbb{R}^d$.

Euclidean:
$d(p, q) = \sqrt{\sum_{j=1}^{d}(p_j - q_j)^2}$

Minkowski:
$d_r(p, q) = \left(\sum_{j=1}^{d}|p_j - q_j|^r\right)^{1/r}$
– $r = 1$: city block distance, Manhattan distance, $L_1$-norm
– $r = 2$: Euclidean distance, $L_2$-norm

Mahalanobis:
$d(p, q) = \sqrt{(p - q)^T \Sigma^{-1} (p - q)}$
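The three distances above can be sketched as follows (Python for illustration; the helper names and the hand-inverted 2x2 covariance are choices of this sketch, and R users would reach for `dist()` and `mahalanobis()` instead):

```python
# Minkowski-family distances and the Mahalanobis distance for the
# formulas above; the 2x2 inverse is written out by hand so the
# sketch needs no linear-algebra library.
import math

def minkowski(p, q, r):
    """d_r(p, q) = (sum_j |p_j - q_j|^r)^(1/r)."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

def mahalanobis(p, q, cov):
    """sqrt((p - q)^T Sigma^{-1} (p - q)) for a 2x2 covariance matrix."""
    (a, b), (c, d2) = cov
    det = a * d2 - b * c
    inv = [[d2 / det, -b / det], [-c / det, a / det]]
    v = [p[0] - q[0], p[1] - q[1]]
    w = [inv[0][0] * v[0] + inv[0][1] * v[1],
         inv[1][0] * v[0] + inv[1][1] * v[1]]
    return math.sqrt(v[0] * w[0] + v[1] * w[1])

p, q = (0.0, 2.0), (2.0, 0.0)
print(minkowski(p, q, 1))   # city block (L1): |0-2| + |2-0| = 4.0
print(minkowski(p, q, 2))   # Euclidean (L2): sqrt(8) ~ 2.828
```

With the identity matrix as covariance, the Mahalanobis distance reduces to the Euclidean distance, which is a quick sanity check on the implementation.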
Euclidean Distance

$d(p, q) = \sqrt{\sum_{j=1}^{d}(p_j - q_j)^2}$

Standardization is necessary if scales differ. Ex: $p = (age, salary)$

Mean of attributes: $\bar{p} = \frac{1}{d}\sum_{k=1}^{d} p_k \in \mathbb{R}$

Standard deviation of attributes: $s_p = \sqrt{\frac{1}{d-1}\sum_{k=1}^{d}(p_k - \bar{p})^2} \in \mathbb{R}$

Standardized/Normalized Vector:
$p_{new} = \frac{p - \bar{p}}{s_p} = \left(\frac{p_1 - \bar{p}}{s_p}, \frac{p_2 - \bar{p}}{s_p}, ..., \frac{p_d - \bar{p}}{s_p}\right) \in \mathbb{R}^d$

Then $\bar{p}_{new} = 0$ and $s_{p_{new}} = 1$.
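Standardization as defined above can be sketched directly (Python for illustration; the (age, salary) pair is a made-up example of attributes on very different scales):

```python
# Subtract the attribute mean and divide by the sample standard
# deviation; afterwards the vector has mean 0 and std deviation 1.
import math

def standardize(p):
    d = len(p)
    mean = sum(p) / d
    s = math.sqrt(sum((x - mean) ** 2 for x in p) / (d - 1))
    return [(x - mean) / s for x in p]

p = [25.0, 50000.0]        # (age, salary): wildly different scales
p_new = standardize(p)
# p_new now has mean 0 and sample standard deviation 1
```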
Distance Matrix

$d(p, q) = \sqrt{\sum_{j=1}^{d}(p_j - q_j)^2}$

[Figure: points p1–p4 plotted in the plane]

Output Distance Matrix: D

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
• P = as.matrix(read.table(file = "points.dat"));
• D = dist(P[, 2:3], method = "euclidean");
• L1 = dist(P[, 2:3], method = "minkowski", p = 1);
• help(dist)
Input Data Table: P (file name: points.dat)

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
Src: “Introduction to Data Mining” by Vipin Kumar et al
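The distance matrix D can be recomputed from the input table, mirroring what `dist(P[, 2:3], method = "euclidean")` does in the R snippet (Python for illustration):

```python
# Build the 4x4 Euclidean distance matrix for points p1..p4.
import math

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

names = list(points)
D = [[round(euclid(points[a], points[b]), 3) for b in names] for a in names]
for name, row in zip(names, D):
    print(name, row)
```

The rounded values match the matrix on the slide, e.g. d(p1, p2) = sqrt(4 + 4) = 2.828 and d(p3, p4) = 2.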
Covariance of Two Vectors, cov(p, q)

Let $p = (p_1, p_2, ..., p_d) \in \mathbb{R}^d$ and $q = (q_1, q_2, ..., q_d) \in \mathbb{R}^d$, with mean of attributes $\bar{p} = \frac{1}{d}\sum_{k=1}^{d} p_k \in \mathbb{R}$.

One definition:
$\mathrm{cov}(p, q) = s_{pq} = \frac{1}{d-1}\sum_{k=1}^{d}(p_k - \bar{p})(q_k - \bar{q}) \in \mathbb{R}$

Or a better definition:
$\mathrm{cov}(p, q) = E[(p - E(p))(q - E(q))^T] \in \mathbb{R}$
where $E$ is the expected value of a random variable.
Covariance, or Dispersion Matrix, Σ

N points in d-dimensional space:
$P_1 = (p_{11}, p_{12}, ..., p_{1d}) \in \mathbb{R}^d$
.....
$P_N = (p_{N1}, p_{N2}, ..., p_{Nd}) \in \mathbb{R}^d$

The covariance, or dispersion matrix:

$\Sigma(P_1, P_2, ..., P_N) = \begin{pmatrix} \mathrm{cov}(P_1, P_1) & \mathrm{cov}(P_1, P_2) & ... & \mathrm{cov}(P_1, P_N) \\ \mathrm{cov}(P_2, P_1) & \mathrm{cov}(P_2, P_2) & ... & \mathrm{cov}(P_2, P_N) \\ ... & ... & ... & ... \\ \mathrm{cov}(P_N, P_1) & \mathrm{cov}(P_N, P_2) & ... & \mathrm{cov}(P_N, P_N) \end{pmatrix}$

The inverse, $\Sigma^{-1}$, is the concentration matrix or precision matrix.
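The construction of Σ can be sketched for a small made-up point set (Python for illustration; the three points are not from the slides):

```python
# Sample covariance of two vectors, then the N x N covariance
# matrix of a point set, following the 1/(d-1) definition above.
def cov(p, q):
    d = len(p)
    pm, qm = sum(p) / d, sum(q) / d
    return sum((a - pm) * (b - qm) for a, b in zip(p, q)) / (d - 1)

P = [[1.0, 2.0, 3.0],      # N = 3 made-up points in R^3
     [2.0, 4.0, 6.0],
     [1.0, 0.0, -1.0]]
Sigma = [[cov(a, b) for b in P] for a in P]
print(Sigma[0][1])   # cov(P1, P2) = 2.0
```

Note that Σ is symmetric by construction, since cov(p, q) = cov(q, p).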
Common Properties of a Similarity
• Similarities also have some well-known properties:
– s(p, q) = 1 (or maximum similarity) only if p = q.
– s(p, q) = s(q, p) for all p and q. (Symmetry)
where s(p, q) is the similarity between points (data objects), p and q.
Src: “Introduction to Data Mining” by Vipin Kumar et al
Similarity Between Binary Vectors
• Suppose p and q have only binary attributes
• Compute similarities using the following quantities
– M01 = the number of attributes where p was 0 and q was 1
– M10 = the number of attributes where p was 1 and q was 0
– M00 = the number of attributes where p was 0 and q was 0
– M11 = the number of attributes where p was 1 and q was 1
• Simple Matching and Jaccard Coefficients:
SMC = number of matches / number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of 11 matches / number of not-both-zero attributes values
= (M11) / (M01 + M10 + M11)
Src: “Introduction to Data Mining” by Vipin Kumar et al
SMC versus Jaccard: Example
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1
M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)
SMC = (M11 + M00)/(M01 + M10 + M11 + M00)
= (0+7) / (2+1+0+7) = 0.7
J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
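The worked example above can be checked with a small function (Python for illustration):

```python
# SMC and Jaccard from the four match counts M01, M10, M00, M11.
def smc_jaccard(p, q):
    m01 = sum(a == 0 and b == 1 for a, b in zip(p, q))
    m10 = sum(a == 1 and b == 0 for a, b in zip(p, q))
    m00 = sum(a == 0 and b == 0 for a, b in zip(p, q))
    m11 = sum(a == 1 and b == 1 for a, b in zip(p, q))
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    j = m11 / (m01 + m10 + m11) if (m01 + m10 + m11) else 0.0
    return smc, j

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_jaccard(p, q))   # (0.7, 0.0)
```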
Cosine Similarity

• If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||), where
• indicates the vector dot product and ||d|| is the length of vector d.

• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449

cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
Src: “Introduction to Data Mining” by Vipin Kumar et al
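The same computation as a small function (Python for illustration; the book's own examples are in R):

```python
# Cosine similarity: dot product divided by the vector lengths.
import math

def cosine(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    return dot / (math.sqrt(sum(a * a for a in d1)) *
                  math.sqrt(sum(b * b for b in d2)))

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine(d1, d2), 4))   # 0.315
```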
Extended Jaccard Coefficient (Tanimoto)
• Variation of Jaccard for continuous or count attributes
– Reduces to Jaccard for binary attributes
Src: “Introduction to Data Mining” by Vipin Kumar et al
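The slide states the reduction but not the formula; the standard extended Jaccard (Tanimoto) coefficient is EJ(p, q) = p • q / (||p||² + ||q||² − p • q), which is assumed here. A sketch with the binary-reduction property checked on made-up vectors:

```python
# Extended Jaccard (Tanimoto): generalizes Jaccard to continuous
# or count attributes, using the standard formula assumed above.
def tanimoto(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (sum(a * a for a in p) + sum(b * b for b in q) - dot)

# For binary vectors this reduces to Jaccard = M11 / (M01 + M10 + M11):
p, q = [1, 1, 0], [0, 1, 1]
print(tanimoto(p, q))   # equals Jaccard(p, q) = 1/3
```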
Correlation (Pearson Correlation)
• Correlation measures the linear relationship between objects
• To compute correlation, we standardize data objects, p and q, and then take their dot product
$p'_k = (p_k - \mathrm{mean}(p)) / \mathrm{std}(p)$
$q'_k = (q_k - \mathrm{mean}(q)) / \mathrm{std}(q)$
$\mathrm{correlation}(p, q) = p' \bullet q'$
Src: “Introduction to Data Mining” by Vipin Kumar et al
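The standardize-then-dot-product recipe can be sketched as follows (Python for illustration; with sample (d − 1) standard deviations, the dot product must also be scaled by 1/(d − 1) to land in [−1, 1], and the vectors are made-up examples):

```python
# Pearson correlation via standardization and dot product.
import math

def standardize(v):
    m = sum(v) / len(v)
    s = math.sqrt(sum((x - m) ** 2 for x in v) / (len(v) - 1))
    return [(x - m) / s for x in v]

def correlation(p, q):
    # dot product of standardized vectors, scaled by 1/(d-1)
    return sum(a * b for a, b in zip(standardize(p), standardize(q))) / (len(p) - 1)

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))   # ~1.0 (perfectly linear)
```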
Visually Evaluating Correlation
Scatter plots showing the similarity from –1 to 1.
Src: “Introduction to Data Mining” by Vipin Kumar et al
General Approach for Combining Similarities
• Sometimes attributes are of many different types, but an overall similarity is needed.
Src: “Introduction to Data Mining” by Vipin Kumar et al
Using Weights to Combine Similarities
• May not want to treat all attributes the same.
– Use weights wk which are between 0 and 1 and sum to 1.
Src: “Introduction to Data Mining” by Vipin Kumar et al
Graph-Based Proximity Measures

In order to apply graph-based data mining techniques, such as classification and clustering, it is necessary to define proximity measures between data represented in graph form.

Within-graph proximity measures:
• Hyperlink-Induced Topic Search (HITS)
• The Neumann Kernel
• Shared Nearest Neighbor (SNN)
Outline
• Defining Proximity Measures
• Neumann Kernels
• Shared Nearest Neighbor
Neumann Kernels: Agenda
• Neumann Kernel Introduction
• Co-citation and Bibliographic Coupling
• Document and Term Correlation
• Diffusion/Decay Factors
• Relationship to HITS
• Strengths and Weaknesses
Neumann Kernels (NK)
• Named after John von Neumann
• Generalization of HITS
• Input: undirected or directed graph
• Output: within-graph proximity measure
– Importance
– Relatedness
NK: Citation Graph
• Input: Graph
– Vertices n1…n8 (articles)
– Graph is directed
– Edges indicate a citation
• Citation matrix C can be formed:
– If an edge between two vertices exists, the matrix cell = 1; else 0

[Figure: citation graph on vertices n1–n8]
NK: Co-citation Graph
• Co-citation graph: a graph in which two nodes are connected if they appear simultaneously in the reference list of a third node in the citation graph.
• In the graph above, n1 and n2 are connected because both are referenced by the same node n5 in the citation graph.
• Co-citation matrix: $C^T C$

[Figure: co-citation graph on vertices n1–n8]
NK: Bibliographic Coupling Graph
• Bibliographic coupling graph: a graph in which two nodes are connected if they share one or more bibliographic references.
• In the graph above, n5 and n6 are connected because both reference the same node n2 in the citation graph.
• Bibliographic coupling matrix: $C C^T$

[Figure: bibliographic coupling graph on vertices n1–n8]
NK: Document and Term Correlation
Term-document matrix: a matrix in which the rows represent terms, the columns represent documents, and the entries represent a function of their relationship (e.g., the frequency of the given term in the document).
Example: D1: “I like this book”; D2: “We wrote this book”
Term-Document Matrix X
NK: Document and Term Correlation (2)
Document correlation matrix: a matrix in which the rows and columns represent documents, and the entries represent the semantic similarity between two documents.
Example: D1: “I like this book”; D2: “We wrote this book”
Document Correlation Matrix $K = X^T X$
NK: Document and Term Correlation (3)
Term correlation matrix: a matrix in which the rows and columns represent terms, and the entries represent the semantic similarity between two terms.
Example: D1: “I like this book”; D2: “We wrote this book”
Term Correlation Matrix $T = X X^T$
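Since the slide's example matrices were lost in extraction, here is a sketch of X, K, and T for the two documents (Python for illustration; the term ordering and lowercase tokenization are assumptions):

```python
# Build the term-document matrix X, then K = X^T X (documents)
# and T = X X^T (terms).
terms = ["i", "like", "this", "book", "we", "wrote"]
docs = ["i like this book", "we wrote this book"]

X = [[doc.split().count(t) for doc in docs] for t in terms]   # terms x docs

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

Xt = [list(r) for r in zip(*X)]
K = matmul(Xt, X)   # 2 x 2 document correlation matrix
T = matmul(X, Xt)   # 6 x 6 term correlation matrix
print(K)            # [[4, 2], [2, 4]]: 4 terms per doc, 2 shared
```

The off-diagonal entry K[1][2] = 2 counts the shared terms "this" and "book", which is exactly the co-occurrence notion the slides describe.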
Neumann Kernel Block Diagram
Input: Graph
Output: two matrices of dimension n × n, called $K_\gamma$ and $T_\gamma$
Diffusion/Decay Factor γ: a tunable parameter that controls the balance between relatedness and importance
NK: Diffusion Factor - Equation & Effect

The Neumann Kernel defines two matrices incorporating a diffusion factor γ:

$K_\gamma = K \sum_{n \ge 0} (\gamma K)^n = K (I - \gamma K)^{-1}$
$T_\gamma = T \sum_{n \ge 0} (\gamma T)^n = T (I - \gamma T)^{-1}$

This simplifies with our definitions $K = X^T X$ and $T = X X^T$.

When γ = 0: $K_\gamma = K$ and $T_\gamma = T$ (pure relatedness).
When γ grows toward its maximum: the rankings converge to HITS (pure importance).
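The kernel computation can be sketched via its series expansion (assuming the standard von Neumann series form K_gamma = K · Σ (γK)ⁿ; the 2x2 matrix K below is a made-up example, not data from the slides; Python for illustration):

```python
# Truncated-series approximation of K_gamma = K (I - gamma K)^{-1};
# the series converges for gamma below 1/||K||.
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def neumann(K, gamma, n_terms=50):
    n = len(K)
    acc = [[0.0] * n for _ in range(n)]
    power = [[float(i == j) for j in range(n)] for i in range(n)]  # (gamma K)^0 = I
    for _ in range(n_terms):
        for i in range(n):
            for j in range(n):
                acc[i][j] += power[i][j]
        power = matmul(power, [[gamma * x for x in row] for row in K])
    return matmul(K, acc)

K = [[4.0, 2.0], [2.0, 4.0]]
print(neumann(K, 0.0))   # gamma = 0 recovers K itself (pure relatedness)
```

Raising γ adds longer and longer paths through the graph to the sum, which is how importance information gradually dominates relatedness.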
NK: Diffusion Factor - Terminology
• Indegree: the indegree, δ⁻(v), of vertex v is the number of edges leading to vertex v. δ⁻(B) = 1
• Outdegree: the outdegree, δ⁺(v), of vertex v is the number of edges leading away from vertex v. δ⁺(A) = 3
• Maximal indegree: the maximal indegree, Δ⁻, of the graph is the maximum over the indegree counts of all vertices of the graph. Δ⁻(G) = 2
• Maximal outdegree: the maximal outdegree, Δ⁺, of the graph is the maximum over the outdegree counts of all vertices of the graph. Δ⁺(G) = 3

[Figure: example graph with vertices A, B, C, D]
NK: Diffusion Factor - Algorithm
NK: Choice of Diffusion Factor and Its Effects on the Neumann Algorithm
• The Neumann Kernel outputs relatedness between documents and between terms when γ = 0.
• When γ is larger, the Kernel output matches HITS.

HITS authority ranking for the graph below: n3 > n4 > n2 > n1 > n5 = n6 = n7 = n8
Calculating the Neumann Kernel for γ = 0.207, the maximum possible value of γ in this case, gives the same ranking: n3 > n4 > n2 > n1 > n5 = n6 = n7 = n8
For higher values of γ, the Neumann Kernel converges to HITS.

[Figure: citation graph on vertices n1–n8]
Comparing NK, HITS, and Co-citation/Bibliographic Coupling
Strengths and Weaknesses

Strengths:
• Generalization of HITS
• Merges relatedness and importance
• Useful in many graph applications

Weaknesses:
• Topic drift
• No penalty for loops in the adjacency matrix
Outline
• Defining Proximity Measures
• Neumann Kernels
• Shared Nearest Neighbor
Shared Nearest Neighbor (SNN)
• An indirect approach to similarity
• Uses a k-Nearest Neighbor graph to dynamically determine the similarity between nodes
• If two vertices have more than k neighbors in common, then they can be considered similar to one another even if a direct link does not exist
SNN - Agenda
• Understanding Proximity
• Proximity Graphs
• Shared Nearest Neighbor Graph
• SNN Algorithm
• Time Complexity
• R Code Example
• Outlier/Anomaly Detection
• Strengths
• Weaknesses
SNN – Understanding Proximity
• What makes a node a neighbor to another node is based on the definition of proximity.
• Definition: the closeness between a set of objects.
• Proximity can measure the extent to which two nodes belong to the same cluster.
• Proximity is a subtle notion whose definition can depend on the specific application.
SNN - Proximity Graphs
• A graph obtained by connecting two points in a set of points by an edge if, in some sense, the two points are close to each other
SNN – Proximity Graphs (continued)

[Figure: various types of proximity graphs – cyclic, linear, and radial]
SNN – Proximity Graphs (continued)
Other types of proximity graphs:
• Minimum spanning tree
• Relative neighbor graph
• Gabriel graph
• Nearest neighbor graph (Voronoi diagram)
SNN – Proximity Graphs (continued)
• Represents neighbor relationships between objects
• Can estimate the likelihood that a link will exist in the future, or is missing in the data for some reason
• Using a proximity graph increases the scale range over which good segmentations are possible
• Can be formulated with respect to many metrics
SNN – k-Nearest Neighbor (k-NN) Graph
• Forms the basis for the Shared Nearest Neighbor (SNN) within-graph proximity measure
• Has applications in cluster analysis and outlier detection
SNN – Shared Nearest Neighbor Graph
• An SNN graph is a special type of k-NN graph.
• If an edge exists between two vertices, then they both belong to each other’s k-neighborhood.

In the figure to the left, each of the two black vertices, i and j, has eight nearest neighbors, including each other. Four of those nearest neighbors are shared (shown in red). Thus, the two black vertices are similar when the SNN graph parameter k = 4.
SNN – The Algorithm

Input: G: an undirected graph
Input: k: a natural number (number of shared neighbors)

for i = 1 to N(G) do
  for j = i+1 to N(G) do
    counter = 0
    for m = 1 to N(G) do
      if vertex i and vertex j both have an edge with vertex m then
        counter++
      end if
    end for
    if counter ≥ k then
      Connect an edge between vertex i and vertex j in the SNN graph
    end if
  end for
end for
return SNN graph
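The algorithm can be written as a function on an adjacency matrix (Python for illustration; the 6-vertex graph is the one used in the R code example below, and with k = 2 it yields the same four SNN edges):

```python
# SNN graph construction: connect i and j when they share at
# least k neighbors in the input graph.
def snn_graph(adj, k):
    n = len(adj)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            # count the neighbors shared by vertex i and vertex j
            counter = sum(1 for m in range(n) if adj[i][m] and adj[j][m])
            if counter >= k:
                edges.append((i, j))
    return edges

adj = [[0, 1, 0, 0, 1, 0],     # A..F as in the R example
       [1, 0, 1, 1, 1, 0],
       [0, 1, 0, 1, 0, 0],
       [0, 1, 1, 0, 1, 1],
       [1, 1, 0, 1, 0, 0],
       [0, 0, 0, 1, 0, 0]]
print(snn_graph(adj, 2))   # [(0, 3), (1, 3), (1, 4), (2, 4)]
```

With vertices 0..5 labeled A..F, the output corresponds to the edges A–D, B–D, B–E, C–E.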
SNN – Time Complexity
• Let n be the number of vertices of graph G.
• The loops over i and m each iterate once for each vertex in graph G (n times).
• The loop over j iterates at most n − 1 times (O(n)).
• Cumulatively this results in a total running time of O(n³).
SNN – R Code Example
• library(igraph)
• library(ProximityMeasure)
• data = c(0, 1, 0, 0, 1, 0,
           1, 0, 1, 1, 1, 0,
           0, 1, 0, 1, 0, 0,
           0, 1, 1, 0, 1, 1,
           1, 1, 0, 1, 0, 0,
           0, 0, 0, 1, 0, 0)
• mat = matrix(data, 6, 6)
• G = graph.adjacency(mat, mode = c("directed"), weighted = NULL)
• V(G)$label <- c('A', 'B', 'C', 'D', 'E', 'F')
• tkplot(G)
• SNN(mat, 2)
[Figure: plotted graph with vertices A–F]
Output:
[0] A -- D
[1] B -- D
[2] B -- E
[3] C -- E
SNN – Outlier/Anomaly Detection
• Outlier/Anomaly: something that deviates from what is standard, normal, or expected
• Outlier/Anomaly Detection: detecting patterns in a given data set that do not conform to an established normal behavior

[Figure: scatter plot with an isolated outlier point]
SNN - Strengths
Ability to handle noise and outliers
Ability to handle clusters of different sizes and shapes
Very good at handling clusters of varying densities
SNN - Weaknesses
• Does not take into account the weight of the link between the nodes in a nearest neighbor graph
• A low similarity amongst nodes of the same cluster in a graph can cause it to find nearest neighbors that are not in the same cluster
Time Complexity Comparison

HITS: O(k · n^2.376)
Neumann Kernel: O(n^2.376)
Shared Nearest Neighbor: O(n³)

Conclusion (run time): Neumann Kernel ≤ HITS < SNN