TRANSCRIPT
Graph-based Proximity Measures
Practical Graph Mining with R
Nagiza F. Samatova, William Hendrix, John Jenkins, Kanchana Padmanabhan, Arpan Chakraborty
Department of Computer Science, North Carolina State University
Outline
• Defining Proximity Measures
• Neumann Kernels
• Shared Nearest Neighbor
Similarity and Dissimilarity
• Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0, 1]
– Examples: Cosine, Jaccard, Tanimoto
• Dissimilarity
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
• Proximity refers to a similarity or dissimilarity
Src: “Introduction to Data Mining” by Vipin Kumar et al
Distance Metric
• Distance d (p, q) between two points p and q is a dissimilarity measure if it satisfies:
1. Positive definiteness:
d (p, q) ≥ 0 for all p and q and
d (p, q) = 0 only if p = q.
2. Symmetry: d (p, q) = d (q, p) for all p and q.
3. Triangle Inequality:
d (p, r) ≤ d (p, q) + d (q, r) for all points p, q, and r.
• Examples:
– Euclidean distance
– Minkowski distance
– Mahalanobis distance
Src: “Introduction to Data Mining” by Vipin Kumar et al
Is this a distance metric?

Let $p = (p_1, p_2, ..., p_d) \in \mathbb{R}^d$ and $q = (q_1, q_2, ..., q_d) \in \mathbb{R}^d$.

$d(p, q) = \max_{1 \le j \le d}(p_j, q_j)$   (fails: positive definiteness)
$d(p, q) = \max_{1 \le j \le d}(p_j - q_j)$   (fails: symmetry)
$d(p, q) = \min_{1 \le j \le d}|p_j - q_j|$   (fails: triangle inequality)

Distance Metric

$d(p, q) = \sqrt{\sum_{j=1}^{d}(p_j - q_j)^2}$   (satisfies all three properties: Euclidean distance)
Distance: Euclidean, Minkowski, Mahalanobis

Let $p = (p_1, p_2, ..., p_d) \in \mathbb{R}^d$ and $q = (q_1, q_2, ..., q_d) \in \mathbb{R}^d$.

Euclidean:
$d(p, q) = \sqrt{\sum_{j=1}^{d}(p_j - q_j)^2}$

Minkowski:
$d_r(p, q) = \left(\sum_{j=1}^{d}|p_j - q_j|^r\right)^{1/r}$
– $r = 1$: city block distance, Manhattan distance, $L_1$-norm
– $r = 2$: Euclidean distance, $L_2$-norm

Mahalanobis:
$d(p, q) = \sqrt{(p - q)^T \Sigma^{-1} (p - q)}$
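The three distances above can be sketched as follows (Python for illustration; the helper names and the hand-inverted 2x2 covariance are choices of this sketch, and R users would reach for `dist()` and `mahalanobis()` instead):

```python
# Minkowski-family distances and the Mahalanobis distance for the
# formulas above; the 2x2 inverse is written out by hand so the
# sketch needs no linear-algebra library.
import math

def minkowski(p, q, r):
    """d_r(p, q) = (sum_j |p_j - q_j|^r)^(1/r)."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

def mahalanobis(p, q, cov):
    """sqrt((p - q)^T Sigma^{-1} (p - q)) for a 2x2 covariance matrix."""
    (a, b), (c, d2) = cov
    det = a * d2 - b * c
    inv = [[d2 / det, -b / det], [-c / det, a / det]]
    v = [p[0] - q[0], p[1] - q[1]]
    w = [inv[0][0] * v[0] + inv[0][1] * v[1],
         inv[1][0] * v[0] + inv[1][1] * v[1]]
    return math.sqrt(v[0] * w[0] + v[1] * w[1])

p, q = (0.0, 2.0), (2.0, 0.0)
print(minkowski(p, q, 1))   # city block (L1): |0-2| + |2-0| = 4.0
print(minkowski(p, q, 2))   # Euclidean (L2): sqrt(8) ~ 2.828
```

With the identity matrix as covariance, the Mahalanobis distance reduces to the Euclidean distance, which is a quick sanity check on the implementation.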
Euclidean Distance

$d(p, q) = \sqrt{\sum_{j=1}^{d}(p_j - q_j)^2}$

Standardization is necessary if scales differ. Ex: $p = (age, salary)$

Mean of attributes: $\bar{p} = \frac{1}{d}\sum_{k=1}^{d} p_k \in \mathbb{R}$

Standard deviation of attributes: $s_p = \sqrt{\frac{1}{d-1}\sum_{k=1}^{d}(p_k - \bar{p})^2} \in \mathbb{R}$

Standardized/Normalized Vector:
$p_{new} = \frac{p - \bar{p}}{s_p} = \left(\frac{p_1 - \bar{p}}{s_p}, \frac{p_2 - \bar{p}}{s_p}, ..., \frac{p_d - \bar{p}}{s_p}\right) \in \mathbb{R}^d$

Then $\bar{p}_{new} = 0$ and $s_{p_{new}} = 1$.
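Standardization as defined above can be sketched directly (Python for illustration; the (age, salary) pair is a made-up example of attributes on very different scales):

```python
# Subtract the attribute mean and divide by the sample standard
# deviation; afterwards the vector has mean 0 and std deviation 1.
import math

def standardize(p):
    d = len(p)
    mean = sum(p) / d
    s = math.sqrt(sum((x - mean) ** 2 for x in p) / (d - 1))
    return [(x - mean) / s for x in p]

p = [25.0, 50000.0]        # (age, salary): wildly different scales
p_new = standardize(p)
# p_new now has mean 0 and sample standard deviation 1
```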
Distance Matrix

$d(p, q) = \sqrt{\sum_{j=1}^{d}(p_j - q_j)^2}$

[Figure: points p1–p4 plotted in the plane]

Output Distance Matrix: D

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
• P = as.matrix(read.table(file = "points.dat"));
• D = dist(P[, 2:3], method = "euclidean");
• L1 = dist(P[, 2:3], method = "minkowski", p = 1);
• help(dist)
Input Data Table: P (file name: points.dat)

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
Src: “Introduction to Data Mining” by Vipin Kumar et al
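The distance matrix D can be recomputed from the input table, mirroring what `dist(P[, 2:3], method = "euclidean")` does in the R snippet (Python for illustration):

```python
# Build the 4x4 Euclidean distance matrix for points p1..p4.
import math

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

names = list(points)
D = [[round(euclid(points[a], points[b]), 3) for b in names] for a in names]
for name, row in zip(names, D):
    print(name, row)
```

The rounded values match the matrix on the slide, e.g. d(p1, p2) = sqrt(4 + 4) = 2.828 and d(p3, p4) = 2.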
Covariance of Two Vectors, cov(p, q)

Let $p = (p_1, p_2, ..., p_d) \in \mathbb{R}^d$ and $q = (q_1, q_2, ..., q_d) \in \mathbb{R}^d$, with mean of attributes $\bar{p} = \frac{1}{d}\sum_{k=1}^{d} p_k \in \mathbb{R}$.

One definition:
$\mathrm{cov}(p, q) = s_{pq} = \frac{1}{d-1}\sum_{k=1}^{d}(p_k - \bar{p})(q_k - \bar{q}) \in \mathbb{R}$

Or a better definition:
$\mathrm{cov}(p, q) = E[(p - E(p))(q - E(q))^T] \in \mathbb{R}$
where $E$ is the expected value of a random variable.
Covariance, or Dispersion Matrix, Σ

N points in d-dimensional space:
$P_1 = (p_{11}, p_{12}, ..., p_{1d}) \in \mathbb{R}^d$
.....
$P_N = (p_{N1}, p_{N2}, ..., p_{Nd}) \in \mathbb{R}^d$

The covariance, or dispersion matrix:

$\Sigma(P_1, P_2, ..., P_N) = \begin{pmatrix} \mathrm{cov}(P_1, P_1) & \mathrm{cov}(P_1, P_2) & ... & \mathrm{cov}(P_1, P_N) \\ \mathrm{cov}(P_2, P_1) & \mathrm{cov}(P_2, P_2) & ... & \mathrm{cov}(P_2, P_N) \\ ... & ... & ... & ... \\ \mathrm{cov}(P_N, P_1) & \mathrm{cov}(P_N, P_2) & ... & \mathrm{cov}(P_N, P_N) \end{pmatrix}$

The inverse, $\Sigma^{-1}$, is the concentration matrix or precision matrix.
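The construction of Σ can be sketched for a small made-up point set (Python for illustration; the three points are not from the slides):

```python
# Sample covariance of two vectors, then the N x N covariance
# matrix of a point set, following the 1/(d-1) definition above.
def cov(p, q):
    d = len(p)
    pm, qm = sum(p) / d, sum(q) / d
    return sum((a - pm) * (b - qm) for a, b in zip(p, q)) / (d - 1)

P = [[1.0, 2.0, 3.0],      # N = 3 made-up points in R^3
     [2.0, 4.0, 6.0],
     [1.0, 0.0, -1.0]]
Sigma = [[cov(a, b) for b in P] for a in P]
print(Sigma[0][1])   # cov(P1, P2) = 2.0
```

Note that Σ is symmetric by construction, since cov(p, q) = cov(q, p).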
Common Properties of a Similarity
• Similarities also have some well-known properties:
– s(p, q) = 1 (or maximum similarity) only if p = q.
– s(p, q) = s(q, p) for all p and q. (Symmetry)
where s(p, q) is the similarity between points (data objects), p and q.
Src: “Introduction to Data Mining” by Vipin Kumar et al
Similarity Between Binary Vectors
• Suppose p and q have only binary attributes
• Compute similarities using the following quantities
– M01 = the number of attributes where p was 0 and q was 1
– M10 = the number of attributes where p was 1 and q was 0
– M00 = the number of attributes where p was 0 and q was 0
– M11 = the number of attributes where p was 1 and q was 1
• Simple Matching and Jaccard Coefficients:
SMC = number of matches / number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of 11 matches / number of not-both-zero attributes values
= (M11) / (M01 + M10 + M11)
Src: “Introduction to Data Mining” by Vipin Kumar et al
SMC versus Jaccard: Example
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1
M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)
SMC = (M11 + M00)/(M01 + M10 + M11 + M00)
= (0+7) / (2+1+0+7) = 0.7
J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
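The worked example above can be checked with a small function (Python for illustration):

```python
# SMC and Jaccard from the four match counts M01, M10, M00, M11.
def smc_jaccard(p, q):
    m01 = sum(a == 0 and b == 1 for a, b in zip(p, q))
    m10 = sum(a == 1 and b == 0 for a, b in zip(p, q))
    m00 = sum(a == 0 and b == 0 for a, b in zip(p, q))
    m11 = sum(a == 1 and b == 1 for a, b in zip(p, q))
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    j = m11 / (m01 + m10 + m11) if (m01 + m10 + m11) else 0.0
    return smc, j

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_jaccard(p, q))   # (0.7, 0.0)
```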
Cosine Similarity

• If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||), where
• indicates the vector dot product and ||d|| is the length of vector d.

• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449

cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
Src: “Introduction to Data Mining” by Vipin Kumar et al
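The same computation as a small function (Python for illustration; the book's own examples are in R):

```python
# Cosine similarity: dot product divided by the vector lengths.
import math

def cosine(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    return dot / (math.sqrt(sum(a * a for a in d1)) *
                  math.sqrt(sum(b * b for b in d2)))

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine(d1, d2), 4))   # 0.315
```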
Extended Jaccard Coefficient (Tanimoto)
• Variation of Jaccard for continuous or count attributes
– Reduces to Jaccard for binary attributes
Src: “Introduction to Data Mining” by Vipin Kumar et al
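The slide states the reduction but not the formula; the standard extended Jaccard (Tanimoto) coefficient is EJ(p, q) = p • q / (||p||² + ||q||² − p • q), which is assumed here. A sketch with the binary-reduction property checked on made-up vectors:

```python
# Extended Jaccard (Tanimoto): generalizes Jaccard to continuous
# or count attributes, using the standard formula assumed above.
def tanimoto(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (sum(a * a for a in p) + sum(b * b for b in q) - dot)

# For binary vectors this reduces to Jaccard = M11 / (M01 + M10 + M11):
p, q = [1, 1, 0], [0, 1, 1]
print(tanimoto(p, q))   # equals Jaccard(p, q) = 1/3
```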
Correlation (Pearson Correlation)
• Correlation measures the linear relationship between objects
• To compute correlation, we standardize data objects, p and q, and then take their dot product
$p'_k = (p_k - \mathrm{mean}(p)) / \mathrm{std}(p)$
$q'_k = (q_k - \mathrm{mean}(q)) / \mathrm{std}(q)$
$\mathrm{correlation}(p, q) = p' \bullet q'$
Src: “Introduction to Data Mining” by Vipin Kumar et al
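The standardize-then-dot-product recipe can be sketched as follows (Python for illustration; with sample (d − 1) standard deviations, the dot product must also be scaled by 1/(d − 1) to land in [−1, 1], and the vectors are made-up examples):

```python
# Pearson correlation via standardization and dot product.
import math

def standardize(v):
    m = sum(v) / len(v)
    s = math.sqrt(sum((x - m) ** 2 for x in v) / (len(v) - 1))
    return [(x - m) / s for x in v]

def correlation(p, q):
    # dot product of standardized vectors, scaled by 1/(d-1)
    return sum(a * b for a, b in zip(standardize(p), standardize(q))) / (len(p) - 1)

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))   # ~1.0 (perfectly linear)
```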
Visually Evaluating Correlation
Scatter plots showing the similarity from –1 to 1.
Src: “Introduction to Data Mining” by Vipin Kumar et al
General Approach for Combining Similarities
• Sometimes attributes are of many different types, but an overall similarity is needed.
Src: “Introduction to Data Mining” by Vipin Kumar et al
Using Weights to Combine Similarities
• May not want to treat all attributes the same.
– Use weights wk which are between 0 and 1 and sum to 1.
Src: “Introduction to Data Mining” by Vipin Kumar et al
Graph-Based Proximity Measures

In order to apply graph-based data mining techniques, such as classification and clustering, it is necessary to define proximity measures between data represented in graph form.

Within-graph proximity measures:
• Hyperlink-Induced Topic Search (HITS)
• The Neumann Kernel
• Shared Nearest Neighbor (SNN)
Outline
• Defining Proximity Measures
• Neumann Kernels
• Shared Nearest Neighbor
Neumann Kernels: Agenda
• Neumann Kernel Introduction
• Co-citation and Bibliographic Coupling
• Document and Term Correlation
• Diffusion/Decay Factors
• Relationship to HITS
• Strengths and Weaknesses
Neumann Kernels (NK)
• Named after John von Neumann
• Generalization of HITS
• Input: undirected or directed graph
• Output: within-graph proximity measure
– Importance
– Relatedness
NK: Citation Graph
• Input: Graph
– Vertices n1…n8 (articles)
– Graph is directed
– Edges indicate a citation
• Citation matrix C can be formed:
– If an edge between two vertices exists, the matrix cell = 1; else 0

[Figure: citation graph on vertices n1–n8]
NK: Co-citation Graph
• Co-citation graph: a graph in which two nodes are connected if they appear simultaneously in the reference list of a third node in the citation graph.
• In the graph above, n1 and n2 are connected because both are referenced by the same node n5 in the citation graph.
• Co-citation matrix: $C^T C$

[Figure: co-citation graph on vertices n1–n8]
NK: Bibliographic Coupling Graph
• Bibliographic coupling graph: a graph in which two nodes are connected if they share one or more bibliographic references.
• In the graph above, n5 and n6 are connected because both reference the same node n2 in the citation graph.
• Bibliographic coupling matrix: $C C^T$

[Figure: bibliographic coupling graph on vertices n1–n8]
NK: Document and Term Correlation
Term-document matrix: a matrix in which the rows represent terms, the columns represent documents, and the entries represent a function of their relationship (e.g., the frequency of the given term in the document).
Example: D1: “I like this book”; D2: “We wrote this book”
Term-Document Matrix X
NK: Document and Term Correlation (2)
Document correlation matrix: a matrix in which the rows and columns represent documents, and the entries represent the semantic similarity between two documents.
Example: D1: “I like this book”; D2: “We wrote this book”
Document Correlation Matrix $K = X^T X$
NK: Document and Term Correlation (3)
Term correlation matrix: a matrix in which the rows and columns represent terms, and the entries represent the semantic similarity between two terms.
Example: D1: “I like this book”; D2: “We wrote this book”
Term Correlation Matrix $T = X X^T$
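Since the slide's example matrices were lost in extraction, here is a sketch of X, K, and T for the two documents (Python for illustration; the term ordering and lowercase tokenization are assumptions):

```python
# Build the term-document matrix X, then K = X^T X (documents)
# and T = X X^T (terms).
terms = ["i", "like", "this", "book", "we", "wrote"]
docs = ["i like this book", "we wrote this book"]

X = [[doc.split().count(t) for doc in docs] for t in terms]   # terms x docs

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

Xt = [list(r) for r in zip(*X)]
K = matmul(Xt, X)   # 2 x 2 document correlation matrix
T = matmul(X, Xt)   # 6 x 6 term correlation matrix
print(K)            # [[4, 2], [2, 4]]: 4 terms per doc, 2 shared
```

The off-diagonal entry K[1][2] = 2 counts the shared terms "this" and "book", which is exactly the co-occurrence notion the slides describe.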
Neumann Kernel Block Diagram
Input: Graph
Output: two matrices of dimension n × n, called $K_\gamma$ and $T_\gamma$
Diffusion/Decay Factor γ: a tunable parameter that controls the balance between relatedness and importance
NK: Diffusion Factor - Equation & Effect

The Neumann Kernel defines two matrices incorporating a diffusion factor γ:

$K_\gamma = K \sum_{n \ge 0} (\gamma K)^n = K (I - \gamma K)^{-1}$
$T_\gamma = T \sum_{n \ge 0} (\gamma T)^n = T (I - \gamma T)^{-1}$

This simplifies with our definitions $K = X^T X$ and $T = X X^T$.

When γ = 0: $K_\gamma = K$ and $T_\gamma = T$ (pure relatedness).
When γ grows toward its maximum: the rankings converge to HITS (pure importance).
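The kernel computation can be sketched via its series expansion (assuming the standard von Neumann series form K_gamma = K · Σ (γK)ⁿ; the 2x2 matrix K below is a made-up example, not data from the slides; Python for illustration):

```python
# Truncated-series approximation of K_gamma = K (I - gamma K)^{-1};
# the series converges for gamma below 1/||K||.
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def neumann(K, gamma, n_terms=50):
    n = len(K)
    acc = [[0.0] * n for _ in range(n)]
    power = [[float(i == j) for j in range(n)] for i in range(n)]  # (gamma K)^0 = I
    for _ in range(n_terms):
        for i in range(n):
            for j in range(n):
                acc[i][j] += power[i][j]
        power = matmul(power, [[gamma * x for x in row] for row in K])
    return matmul(K, acc)

K = [[4.0, 2.0], [2.0, 4.0]]
print(neumann(K, 0.0))   # gamma = 0 recovers K itself (pure relatedness)
```

Raising γ adds longer and longer paths through the graph to the sum, which is how importance information gradually dominates relatedness.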
NK: Diffusion Factor - Terminology
• Indegree: the indegree, δ⁻(v), of vertex v is the number of edges leading to vertex v. δ⁻(B) = 1
• Outdegree: the outdegree, δ⁺(v), of vertex v is the number of edges leading away from vertex v. δ⁺(A) = 3
• Maximal indegree: the maximal indegree, Δ⁻, of the graph is the maximum over the indegree counts of all vertices of the graph. Δ⁻(G) = 2
• Maximal outdegree: the maximal outdegree, Δ⁺, of the graph is the maximum over the outdegree counts of all vertices of the graph. Δ⁺(G) = 3

[Figure: example graph with vertices A, B, C, D]
NK: Diffusion Factor - Algorithm
NK: Choice of Diffusion Factor and Its Effects on the Neumann Algorithm
• The Neumann Kernel outputs relatedness between documents and between terms when γ = 0.
• When γ is larger, the Kernel output matches HITS.

HITS authority ranking for the graph below: n3 > n4 > n2 > n1 > n5 = n6 = n7 = n8
Calculating the Neumann Kernel for γ = 0.207, the maximum possible value of γ in this case, gives the same ranking: n3 > n4 > n2 > n1 > n5 = n6 = n7 = n8
For higher values of γ, the Neumann Kernel converges to HITS.

[Figure: citation graph on vertices n1–n8]
Comparing NK, HITS, and Co-citation/Bibliographic Coupling
Strengths and Weaknesses

Strengths:
• Generalization of HITS
• Merges relatedness and importance
• Useful in many graph applications

Weaknesses:
• Topic drift
• No penalty for loops in the adjacency matrix
Outline
• Defining Proximity Measures
• Neumann Kernels
• Shared Nearest Neighbor
Shared Nearest Neighbor (SNN)
• An indirect approach to similarity
• Uses a k-Nearest Neighbor graph to dynamically determine the similarity between nodes
• If two vertices have more than k neighbors in common, then they can be considered similar to one another even if a direct link does not exist
SNN - Agenda
• Understanding Proximity
• Proximity Graphs
• Shared Nearest Neighbor Graph
• SNN Algorithm
• Time Complexity
• R Code Example
• Outlier/Anomaly Detection
• Strengths
• Weaknesses
SNN – Understanding Proximity
• What makes a node a neighbor to another node is based on the definition of proximity.
• Definition: the closeness between a set of objects.
• Proximity can measure the extent to which two nodes belong to the same cluster.
• Proximity is a subtle notion whose definition can depend on the specific application.
SNN - Proximity Graphs
• A graph obtained by connecting two points in a set of points by an edge if, in some sense, the two points are close to each other
SNN – Proximity Graphs (continued)

[Figure: various types of proximity graphs – cyclic, linear, and radial]
SNN – Proximity Graphs (continued)
Other types of proximity graphs:
• Minimum spanning tree
• Relative neighbor graph
• Gabriel graph
• Nearest neighbor graph (Voronoi diagram)
SNN – Proximity Graphs (continued)
• Represents neighbor relationships between objects
• Can estimate the likelihood that a link will exist in the future, or is missing in the data for some reason
• Using a proximity graph increases the scale range over which good segmentations are possible
• Can be formulated with respect to many metrics
SNN – k-Nearest Neighbor (k-NN) Graph
• Forms the basis for the Shared Nearest Neighbor (SNN) within-graph proximity measure
• Has applications in cluster analysis and outlier detection
SNN – Shared Nearest Neighbor Graph
• An SNN graph is a special type of k-NN graph.
• If an edge exists between two vertices, then they both belong to each other’s k-neighborhood.

In the figure to the left, each of the two black vertices, i and j, has eight nearest neighbors, including each other. Four of those nearest neighbors are shared (shown in red). Thus, the two black vertices are similar when the SNN graph parameter k = 4.
SNN – The Algorithm

Input: G: an undirected graph
Input: k: a natural number (number of shared neighbors)

for i = 1 to N(G) do
  for j = i+1 to N(G) do
    counter = 0
    for m = 1 to N(G) do
      if vertex i and vertex j both have an edge with vertex m then
        counter++
      end if
    end for
    if counter ≥ k then
      Connect an edge between vertex i and vertex j in the SNN graph
    end if
  end for
end for
return SNN graph
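The algorithm can be written as a function on an adjacency matrix (Python for illustration; the 6-vertex graph is the one used in the R code example below, and with k = 2 it yields the same four SNN edges):

```python
# SNN graph construction: connect i and j when they share at
# least k neighbors in the input graph.
def snn_graph(adj, k):
    n = len(adj)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            # count the neighbors shared by vertex i and vertex j
            counter = sum(1 for m in range(n) if adj[i][m] and adj[j][m])
            if counter >= k:
                edges.append((i, j))
    return edges

adj = [[0, 1, 0, 0, 1, 0],     # A..F as in the R example
       [1, 0, 1, 1, 1, 0],
       [0, 1, 0, 1, 0, 0],
       [0, 1, 1, 0, 1, 1],
       [1, 1, 0, 1, 0, 0],
       [0, 0, 0, 1, 0, 0]]
print(snn_graph(adj, 2))   # [(0, 3), (1, 3), (1, 4), (2, 4)]
```

With vertices 0..5 labeled A..F, the output corresponds to the edges A–D, B–D, B–E, C–E.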
SNN – Time Complexity
• Let n be the number of vertices of graph G.
• The loops over i and m each iterate once for each vertex in graph G (n times).
• The loop over j iterates at most n − 1 times (O(n)).
• Cumulatively this results in a total running time of O(n³).
SNN – R Code Example
• library(igraph)
• library(ProximityMeasure)
• data = c(0, 1, 0, 0, 1, 0,
           1, 0, 1, 1, 1, 0,
           0, 1, 0, 1, 0, 0,
           0, 1, 1, 0, 1, 1,
           1, 1, 0, 1, 0, 0,
           0, 0, 0, 1, 0, 0)
• mat = matrix(data, 6, 6)
• G = graph.adjacency(mat, mode = c("directed"), weighted = NULL)
• V(G)$label <- c('A', 'B', 'C', 'D', 'E', 'F')
• tkplot(G)
• SNN(mat, 2)
[Figure: plotted graph with vertices A–F]
Output:
[0] A -- D
[1] B -- D
[2] B -- E
[3] C -- E
SNN – Outlier/Anomaly Detection
• Outlier/Anomaly: something that deviates from what is standard, normal, or expected
• Outlier/Anomaly Detection: detecting patterns in a given data set that do not conform to an established normal behavior

[Figure: scatter plot with an isolated outlier point]
SNN - Strengths
Ability to handle noise and outliers
Ability to handle clusters of different sizes and shapes
Very good at handling clusters of varying densities
SNN - Weaknesses
• Does not take into account the weight of the link between the nodes in a nearest neighbor graph
• A low similarity amongst nodes of the same cluster in a graph can cause it to find nearest neighbors that are not in the same cluster
Time Complexity Comparison

HITS: O(k · n^2.376)
Neumann Kernel: O(n^2.376)
Shared Nearest Neighbor: O(n³)

Conclusion (run time): Neumann Kernel ≤ HITS < SNN