Graph Path: a sequence of edges connecting a sequence of vertices, (usually) distinct from each other except for the endpoints. (The running example is the 4-vertex graph with edges {1,3}, {1,4}, {2,4}, {3,4}.)
EE (the 2-edge path pTree), one 16-bit stride per h (EEh = EEh1 EEh2 EEh3 EEh4):
EE1 = 0000 0000 0001 0110
EE2 = 0000 0000 0000 1010
EE3 = 0001 0000 0000 1100
EE4 = 0010 0000 1000 0000
The 4-bit strides: EE13=0001, EE14=0110, EE24=1010, EE31=0001, EE34=1100, EE41=0010, EE43=1000; all other EEhk are pure0.
For each h, k runs over ListEh (the vertices adjacent to h), and EEhk = Ek & M’h:
For h=1, ListE1={3,4}:
  k=3: EE13 = E3 & M’1 = 1001 & 0111 = 0001
  k=4: EE14 = E4 & M’1 = 1110 & 0111 = 0110
For h=2, ListE2={4}:
  k=4: EE24 = E4 & M’2 = 1110 & 1011 = 1010
For h=3, ListE3={1,4}:
  k=1: EE31 = E1 & M’3 = 0011 & 1101 = 0001
  k=4: EE34 = E4 & M’3 = 1110 & 1101 = 1100
For h=4, ListE4={1,2,3}:
  k=1: EE41 = E1 & M’4 = 0011 & 1110 = 0010
  k=2: EE42 = E2 & M’4 = 0001 & 1110 = 0000 (pure0)
  k=3: EE43 = E3 & M’4 = 1001 & 1110 = 1000
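The eight ANDs above can be reproduced directly with machine-word bitmasks. Below is a minimal Python sketch (4-bit masks, bit 1 = leftmost = vertex 1); the helper names `E`, `Mp`, and `e2` are ours, not from the slides.

```python
# Adjacency rows Ek and complement masks M'h for the 4-vertex example graph.
E = {1: 0b0011, 2: 0b0001, 3: 0b1001, 4: 0b1110}
Mp = {h: 0b1111 ^ (1 << (4 - h)) for h in range(1, 5)}   # M'h: all 1s except bit h

def e2(h, k):
    """EEhk = Ek & M'h: third vertices of 2-edge paths h-k-x with x != h."""
    return E[k] & Mp[h]

assert e2(1, 3) == 0b0001   # EE13
assert e2(1, 4) == 0b0110   # EE14
assert e2(2, 4) == 0b1010   # EE24
assert e2(4, 2) == 0b0000   # EE42 is pure0
```

The pure0 result for EE42 is how path creation prunes itself: a pure0 stride never spawns further extensions.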
3-Level, Stride=4 pTrees for paths of len=2 (2 edges and 3 vertices, unique except for endpoints):
Level=0: EE13=0001, EE14=0110, EE24=1010, EE31=0001, EE34=1100, EE41=0010, EE43=1000
Level=1: just E1, E2, E3, E4 with pure0 bits turned off: E1=0011, E2=0001, E3=1001, E4=1010 (bit 2 of E4 turned off, since EE42 is pure0)
Level=2: 1111
E3 (the 3-edge path pTree) as 64-bit level-0 strides, one per h (key = h,j,k, each in {1..4}; the flattened 256-entry E3key listing is omitted here):
h=1: 0000000000000000000000000000000000000000000011000000000010000000
h=2: 0000000000000000000000000000000000000000000000000010000010000000
h=3: 0000000000000110000000000000000000000000000000000010000000000000
h=4: 0000000000010000000000000000000000010000000000000000000000000000
For h=1, the 16-bit (j) strides of E3: E3-11 = 0000000000000000, E3-12 = 0000000000000000, E3-13 = 0000000000001100, E3-14 = 0000000010000000. The 4-bit (j,k) strides for h=1 are all pure0 except E3-134 = 1100 and E3-143 = 1000.
For k in ListE2hj: E3hjk = Ek & M’j (all other E3hjk are pure0).
h=1, j=4, k=3: E3143 = E3 & M’4 = 1001 & 1110 = 1000
h=1, j=4, k=2: E3142 = E2 & M’4 = 0001 & 1110 = 0000 (pure0)
h=1, j=3, ListE213={4}, k=4: E3134 = E4 & M’3 = 1110 & 1101 = 1100
h=2, j=4, ListE224={1,3}, k=1: E3241 = E1 & M’4 = 0011 & 1110 = 0010
h=2, j=4, k=3: E3243 = E3 & M’4 = 1001 & 1110 = 1000
h=3, j=1, k=4: E3314 = E4 & M’1 = 1110 & 0111 = 0110
h=3, j=4, k=1: E3341 = E1 & M’4 = 0011 & 1110 = 0010
h=3, j=4, k=2: E3342 = E2 & M’4 = 0001 & 1110 = 0000 (pure0)
h=4, j=1, k=3: E3413 = E3 & M’1 = 1001 & 0111 = 0001
h=4, j=3, k=1: E3431 = E1 & M’3 = 0011 & 1101 = 0001
Level=0 (we just computed these): E3-134=1100, E3-143=1000, E3-241=0010, E3-243=1000, E3-314=0110, E3-341=0010, E3-413=0001, E3-431=0001
Level=1: L13-13=0001, L13-14=0010, L13-24=1010, L13-31=0001, L13-34=1000, L13-41=0010, L13-43=1000
Level=2 (these are exactly the Level=1 strides of E2): L23-1=0011, L23-2=0001, L23-3=1001, L23-4=1010 (these are exactly the Level=0’s of E2 with pure0 bits turned off)
Level=3: 1111 (so E2 is the upper 3 levels of E3)
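The upper levels can be built mechanically: each higher-level bit is 1 unless the stride of 4 below it is pure0. A small sketch of that rule (our helper name `level_up`), applied to the E2 level-0 bits of the example:

```python
def level_up(bits):
    """One pTree level up: per stride of 4 bits, emit 0 iff the stride is pure-0."""
    return [0 if all(b == 0 for b in bits[i:i + 4]) else 1
            for i in range(0, len(bits), 4)]

# Level-0 of E2 for the example graph, h-major order (EE1..EE4 flattened):
ee = [int(c) for c in
      "0000000000010110" "0000000000001010" "0001000000001100" "0010000010000000"]
lvl1 = level_up(ee)     # one bit per EEhk stride: E1..E4 with pure0 bits off
lvl2 = level_up(lvl1)   # one bit per Eh
assert lvl2 == [1, 1, 1, 1]
```

`lvl1` comes out as 0011 0001 1001 1010, matching the Level=1 strides above, and `lvl2` is 1111, matching Level=2.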
Graph Path Analytics (using pTrees)
U = 0011_0001_0001_0000 (UniqueEdgeMask)
Two-Level, Stride=4 Edge pTrees: L0: E1=0011, E2=0001, E3=1001, E4=1110; L1E = 1111
Two-Level, Str=4 UniqueEdge pTrees: L0: U1=0011, U2=0001, U3=0001, U4=0000; L1U = 1110 (U4 is pure0)
Useful L0 Masks: M1=1000, M2=0100, M3=0010, M4=0001
(Diagram of the 4-vertex example graph, V1 x V2.)
For k in ListE3hij: E4hijk = Ek & M’j & M’i.
ListE3134={1,2}: h=1, i=3, j=4, k=2: E41342 = E2 & M’3 & M’4 = 0001 & 1101 & 1110 = 0000
ListE3143={1} (the only candidate is the start vertex h, which closes the path)
ListE3241={3}: h=2, i=4, j=1, k=3: E42413 = E3 & M’1 & M’4 = 1001 & 0111 & 1110 = 0000
ListE3243={1}: h=2, i=4, j=3, k=1: E42431 = E1 & M’3 & M’4 = 0011 & 1101 & 1110 = 0000
ListE3314={2,3}: h=3, i=1, j=4, k=2: E43142 = E2 & M’1 & M’4 = 0001 & 0111 & 1110 = 0000
ListE3341={3}, ListE3413={4}, ListE3431={4} (in each, the only candidate is the start vertex h, closing the path)
No 5-vertex (4-edge) paths. Creation stops. The Stride=|V|, Levels=Diam Path Mask is: E, E2, E3, ..., Ediam.
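The whole E2/E3/E4 construction above can be written as three dictionary comprehensions over bitmasks. This is our sketch of the slides’ construction (names `E`, `Mp`, `bits` are ours); the `k != h` guard encodes the stop at closed paths:

```python
n = 4
E = {1: 0b0011, 2: 0b0001, 3: 0b1001, 4: 0b1110}
Mp = {h: ((1 << n) - 1) ^ (1 << (n - h)) for h in range(1, n + 1)}

def bits(mask):
    """Vertices whose bit is set (bit 1 = leftmost = vertex 1)."""
    return [v for v in range(1, n + 1) if mask >> (n - v) & 1]

# E2hk = Ek & M'h for k adjacent to h; E3hjk = Ek & M'j for k listed in E2hj;
# E4hijk = Ek & M'j & M'i for k listed in E3hij (skipping k = h, a closed path).
E2 = {(h, k): E[k] & Mp[h] for h in range(1, n + 1) for k in bits(E[h])}
E3 = {(h, j, k): E[k] & Mp[j] for (h, j) in E2 for k in bits(E2[h, j])}
E4 = {(h, i, j, k): E[k] & Mp[j] & Mp[i]
      for (h, i, j) in E3 for k in bits(E3[h, i, j]) if k != h}

assert E3[(1, 3, 4)] == 0b1100          # matches E3-134 above
assert all(m == 0 for m in E4.values()) # no 5-vertex paths: creation stops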
The 2-vertex paths are the Edges.
Edges (V1, V2): E = 0011_0001_1001_1110 (EdgeMask pTree); Ekey = 1,1 1,2 1,3 1,4 _ 2,1 2,2 2,3 2,4 _ 3,1 3,2 3,3 3,4 _ 4,1 4,2 4,3 4,4
E (Edge Table): Ekey V1 V2 ELabel: (1,3): 1|3, 1; (1,4): 1|4, 2; (2,4): 2|4, 3; (3,4): 3|4, 1
EE = 0000000000010110 0000000000001010 0001000000001100 0010000010000000 (E2key = (v1,v2,v3) over {1,…,4}³)
We use pTrees to find and exhibit 3-vertex (2-edge) paths (EE or E2), 4-vertex (3-edge) paths (E3), etc.
For k in ListEh: E2hk = Ek & M’h (for all other k, E2hk = 0).
PE = 0011_0001_1001_1110; Ekey = 1,1 1,2 1,3 1,4 _ 2,1 2,2 2,3 2,4 _ 3,1 3,2 3,3 3,4 _ 4,1 4,2 4,3 4,4 (E = the adjacency matrix)
PEL,1 = 0001_0001_0001_1100; PEL,0 = 0010_0001_1000_0110; EL = 0012_0003_0001_2310
Internal degree of v∈C, kvint = # of edges from v to vertices in C. External degree of v∈C, kvext = # of edges from v to vertices in C’.
Internal degree of C: kCint = Σv∈C kvint. External degree of C: kCext = Σv∈C kvext. Total degree of C: kC = kCint + kCext.
For C = {1,3,4}: 2 = |PC&PE&Pv1| = kv1int; 2 = |PC&PE&Pv3| = kv3int; 2 = |PC&PE&Pv4| = kv4int; so kCint = 6. Also 0 = |P’C&PE&Pv1| = kv1ext; 0 = |P’C&PE&Pv3| = kv3ext; 1 = |P’C&PE&Pv4| = kv4ext; so kCext = 1 and kC = 7.
Intra-cluster density δint(C) = |edges(C,C)| / (nc(nc−1)/2) = |PE&PC&PLT| / (3·2/2) = 3/3 = 1
Inter-cluster density δext(C) = |edges(C,C’)| / (nc(n−nc)) = |PE&P’C&PLT| / (3·1) = 1/3
δintC − δextC = 1 − 1/3 = 2/3
Useful masks: PLT = 0111_0011_0001_0000; PC = 1011_0000_1011_1011; Pv1 = 1111_0000_0000_0000; Pv2 = 0000_1111_0000_0000; Pv3 = 0000_0000_1111_0000; Pv4 = 0000_0000_0000_1111
The tradeoff between large δint(C) and small δext(C) is the goal of many community mining algorithms. A simple approach is to maximize the difference. Density Difference algorithm for communities: δint(C) − δext(C) > Threshold? Degree Difference algorithm: kCint − kCext > Threshold?
It is easy to compute these measurements with pTrees, even for very big graphs. Graphs are ubiquitous for complex data in all of science.
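The density difference reduces to mask ANDs plus 1-counts. A minimal sketch using the PE and PLT masks from the slide (helper names `pair_mask` and `density_diff` are ours):

```python
# 16-bit vertex-pair masks from the slides, key (v1, v2) over {1..4}^2.
PE  = int("0011000110011110", 2)   # edge existence
PLT = int("0111001100010000", 2)   # strict upper triangle, v1 < v2
N = 4

def pair_mask(members, n=N):
    """PC: mask of vertex pairs whose two ends both lie in the set."""
    vm = sum(1 << (n - v) for v in members)
    m = 0
    for v1 in range(1, n + 1):
        m = (m << n) | (vm if v1 in members else 0)
    return m

def density_diff(members, n=N):
    """delta_int(C) - delta_ext(C), via mask ANDs and 1-counts."""
    nc = len(members)
    pc = pair_mask(members, n)
    d_int = bin(PE & pc & PLT).count("1") / (nc * (nc - 1) / 2)
    d_ext = bin(PE & ~pc & PLT).count("1") / (nc * (n - nc))
    return d_int - d_ext

assert abs(density_diff({1, 3, 4}) - 2 / 3) < 1e-9   # C: 1 - 1/3 = 2/3
```

For very big graphs the same formula applies stride by stride, with popcounts taken per machine word.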
Ignoring subgraphs of 2 vertices, the four 3-vertex subgraphs are: C={1,3,4}, D={1,2,3}, F={1,2,4}, H={2,3,4}.
Horizontal/vertical vertex data: Vkey VLabel: 1:2, 2:3, 3:2, 4:3; VL = 2323; PVL,1 = 1111; PVL,0 = 0101; PC = 1011
δint(D) = |PE&PD&PLT| / (3·2/2) = 1/3; δext(D) = |PE&P’D&PLT| / (3·1) = 3/3 = 1; δintD − δextD = 1/3 − 1 = −2/3; PD = 1110_1110_1110_0000
δint(F) = |PE&PF&PLT| / (3·2/2) = 2/3; δext(F) = |PE&P’F&PLT| / (3·1) = 2/3; δintF − δextF = 2/3 − 2/3 = 0; PF = 1101_1101_0000_1101
δint(H) = |PE&PH&PLT| / (3·2/2) = 2/3; δext(H) = |PE&P’H&PLT| / (3·1) = 2/3; δintH − δextH = 2/3 − 2/3 = 0; PH = 0000_0111_0111_0111
Maximizing the difference of cluster densities: C is the strongest community. One could use label values (weights) instead of the 0/1 existence values.
Vertex-Labelled, Edge-Labelled Graph (diagram)
An Induced SubGraph (ISG) C is a subgraph that inherits all of G’s edges on its own vertices. A k-ISG (k vertices), C, is a k-clique iff all of its (k−1)-Sub-ISGs are (k−1)-cliques.
Community Mining in Big Graphs. E.g., Gene-Gene Interactions: # edges ≈ 10^9; Friend-Friend Social Nets: # edges ≈ 10^18; Cust-Item Recommenders: # edges ≈ 10^15; Stock-price Stock Market Advisor: # edges ≈ 10^13; Person-Tweet HomeLand Security: # edges ≈ 7B × 10K = 10^14.
A community is a subgraph with more edges inside than linked to its outside.
V (vertex table): Vkey VL: 1:2, 2:3, 3:2, 4:3
E (Edge Table): Ekey V1 V2 ELabel: (1,3): 1|3, 1; (1,4): 1|4, 2; (2,4): 2|4, 3; (3,4): 3|4, 1
PEL,1 = 0001_0001_0001_1100; PEL,0 = 0010_0001_1000_0110; EL = 0012_0003_0001_2310
PE = 0011_0001_1001_1110; Ekey = 1,1 1,2 1,3 1,4 _ 2,1 2,2 2,3 2,4 _ 3,1 3,2 3,3 3,4 _ 4,1 4,2 4,3 4,4 (E = the adjacency matrix; as a V1×V2 Rolodex card: 1:2, 2:3, 3:2, 4:3)
PEC = PE&PC = 0011_0000_1001_1010
P1 = 1111_0000_0000_0000; P2 = 0000_1111_0000_0000; P3 = 0000_0000_1111_0000; P4 = 0000_0000_0000_1111
PVL,1 = 1111; PVL,0 = 0101; PC = 1011
A Clique Existence Algorithm is an algorithm that determines whether a given induced subgraph (given by a subset of vertices) is a clique or not.
Edge Count clique existence theorem (EC): C is a clique iff |PUC| = COMB(|VC|,2) = |VC|! / ((|VC|−2)!·2!).
Apply EC to the 4 induced 3-vertex subgraphs (3-clique iff |PUC| = 3!/(2!·1!) = 3):
PUC = PU&PC = 0011_0000_0001_0000, Ct=3
PUD = 0010_0000_0000_0000, Ct=1
PUF = 0001_0001_0000_0000, Ct=2
PUH = 0000_0001_0001_0000, Ct=2
Thus, C is the only 3-clique. We needed to form PC for each subgraph C. Is that expensive?
SubGraph clique existence theorem (SG): (VC,EC) is a k-clique iff every induced (k−1)-subgraph (VD,ED) is a (k−1)-clique.
Which is better? Which will extend more easily to quasi-cliques? Which can be extended to an algorithm that mines out all cliques from a graph?
A Clique Mining algorithm finds all cliques in a graph. For Clique-Mining we can use an ARM-Apriori-like downward closure property:
CSk = k-CliqueSet; CCSk+1 = Candidate (k+1)-CliqueSet. By the SG clique theorem, CCSk+1 = all unions of CSk pairs having k−1 common vertices. Let C ∈ CCSk+1 be a union of two k-cliques with k−1 common vertices, and let v and w be the kth vertices (different) of the two k-cliques; then C ∈ CSk+1 iff PE(v,w)=1. (We just need to check a single bit in PE.)
Form CCSk+1 by union-ing CSk pairs sharing k−1 vertices, then check a single PE bit to determine whether the union is in CSk+1. Below, k=2, so we check edge pairs sharing 1 vertex, then check the 1 new edge bit in PE.
CS2 = E = {13, 14, 24, 34}. Checking pairs that share a vertex:
PE(3,4) = PE(4·[3−1]+4 = 12) = 1, so 134 ∈ CS3 (the other pairs whose union is {1,3,4}: already have 134)
PE(1,2) = PE(4·[1−1]+2 = 2) = 0
PE(2,3) = PE(4·[2−1]+3 = 7) = 0
The only expensive part of this is forming CCSk, and that is expensive only for CCS3 (as in Apriori ARM).
Next? List out CS3 = {134}; form CCS4 = ∅. Done.
G = Vertex-Labelled, Edge-Labelled Graph (C = Induced SubGraph with VC={1,3,4}); VC={1,3,4}, VD={1,2,3}, VF={1,2,4}, VH={2,3,4}; PU = 0011_0001_0001_0000
Clique Analytics for Big Graphs. A clique is a community in which there is an edge between each vertex pair.
(7-vertex example; key = (v1,v2) over {1..7}², bit offsets 1..49)
E  = 0111010101100011010001110000000001010001010000110
EU = 0111010001100000010000000000000001000000010000000
Using the EdgeCount theorem on C={1,2,3,4}: CU = C&EU. C is a clique since ct(CU) = COMB(4,2) = 4!/(2!·2!) = 6.
C  = 1111000111100011110001111000000000000000000000000
CU = 0111000001100000010000000000000000000000000000000 (count 6)
Using the SubGraph clique theorem to find all k-cliques. This graph, G, is less trivial ;-)
k=2: CS2 = E = {12 13 14 16 23 24 34 56 67}. Turn PU into a positions list = {2 3 4 6 10 11 18 34 42}; find the endpoints of each of these edges by (Int((n−1)/7)+1, Mod(n−1,7)+1).
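That endpoint formula decodes a 1-bit position in the row-major 7×7 pair key. A quick sketch (helper name `endpoints` is ours):

```python
# Decode a 1-based bit position n in a row-major nv x nv pair key into the
# edge endpoints, as on the slide: (Int((n-1)/nv)+1, Mod(n-1, nv)+1).
def endpoints(n, nv=7):
    return (n - 1) // nv + 1, (n - 1) % nv + 1

positions = [2, 3, 4, 6, 10, 11, 18, 34, 42]
edges = [endpoints(n) for n in positions]
assert edges[0] == (1, 2) and edges[-1] == (6, 7)
```

So positions 34 and 42 decode to the edges (5,6) and (6,7).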
k=3: of the candidate triples, the 3-cliques are 123 124 134 234.
k=4: 1234 (since the three 3-subgraphs 123, 124, 234 are all 3-cliques): 123 and 134 give 1234; 123 and 234 give 1234; 124 and 134 give 1234; 124 and 234 give 1234; 134 and 234 give 1234.
Therefore 1234 is a 4-clique, and the only 4-clique.
So there are 5 cliques: 123 124 134 234 1234, i.e., four 3-cliques and one 4-clique.
Clique Mining using the SubGraph Algorithm
(The same 7-vertex graph, now with edge 57 added; key = (v1,v2) over {1..7}²)
PE = 0111010101100011010001110000000001110001010000110
More Clique Mining using the SubGraph thm (SG): in this example graph there are five 3-cliques and one 4-clique. Let’s see if SG can find them (and how efficiently).
k=2: CS2 = E = {12 13 14 16 23 24 34 56 57 67}.
Pairs that share 1: PE(2,3)=1, so 123 ∈ CS3; PE(2,4)=1, so 124 ∈ CS3; PE(2,6)=0
Pairs that share 2: already have 123 and 124
Pairs that share 3: PE(1,4)=1, so 134 ∈ CS3
Pairs that share 4: have 124 and 134; PE(2,3)=1, so 234 ∈ CS3
Pairs that share 5: PE(6,7)=1, so 567 ∈ CS3
Pairs that share 6: PE(1,5)=0; PE(1,7)=0; already have 567
Pairs that share 7: have 567
k=3: CS3 = {123 124 134 234 567}.
k=4, checking pairs of triples that share 2 vertices: PE(2,4)=1, so 1234 ∈ CS4; the remaining qualifying pairs (triples sharing 1,4; 2,3; 2,4; 3,4) all yield 1234 again (have 1234).
The slowest part of this algorithm is the generation of CCS, the Candidate Clique Set.
Clearly, evaluating a given candidate as to whether it is actually a clique involves just a one-bit lookup in the existing “Edge Existence” pTree mask, PE, which is effectively instantaneous.
The generation of CCS is entirely identical here to the generation of Candidate Large Itemsets in Apriori ARM, and thus there should be plenty of algorithms around for doing it quickly by this time.
The other algorithm, EdgeCount (EC), requires counting 1’s in the mask pTree of each subgraph (or candidate clique, if we want to take the time to generate the CCSs; but then clearly the fastest way to finish up is simply to look up the single bit position in E, i.e., use EC).
EdgeCount Algorithm (EC): if |PUC| = (k+1)!/((k−1)!·2!), then C ∈ CS.
I suppose, if one could come up with a fast way to create mask pTrees for each subgraph (and use Bryan’s pop-count procedure to compute the 1-count as the mask is being created), then this might be a competitive method.
The SG algorithm seems to be a real winner, since all we need is the Edge Mask pTree, E, and a fast way to find those pairs of subgraphs in CSk that share k−1 vertices (then check E to see whether the two different kth vertices form an edge in G). Again, this is a standard part of the Apriori ARM algorithm and has therefore been optimized and engineered ad infinitum!
(Adding 1 vertex, V8, and 4 edges: (1,8) (2,8) (3,8) (4,8); key = (v1,v2) over {1..8}²)
E = 0111010110110001110100011110000100000110100010100000110011110000
k=2: CS2 = E = {12 13 14 16 23 24 34 56 57 67 18 28 38 48} = edges.
k=3 (pairs of edges sharing a vertex, then one PE bit each): PE(2,3)=1 gives 123; PE(2,4)=1 gives 124; PE(1,4)=1 gives 134; PE(2,3)=1 gives 234; PE(6,7)=1 gives 567; PE(2,8)=1 gives 128; PE(3,8)=1 gives 138 and 238; PE(4,8)=1 gives 148, 248 and 348; PE(2,6)=0, PE(1,5)=0, PE(1,7)=0, PE(6,8)=0; duplicate unions are skipped (have …).
CS3 = {123 124 134 234 567 128 138 148 238 248 348}.
k=4 (pairs of triples sharing 2 vertices): PE(2,4)=1 gives 1234; PE(3,8)=1 gives 1238; PE(4,8)=1 gives 1248, 1348 and 2348; duplicate unions skipped.
CS4 = {1234 1238 1248 1348 2348}.
k=5: PE(4,8)=1 gives 12348; CS5 = {12348}; all further unions are duplicates (have 12348).
So there are 11 3-cliques, 5 4-cliques, and 1 5-clique.
Note there are many pTree and other data structures we can employ to aid in performing the CCS creation, as well as other “path”-based needs. These include the following (but there may be others):
1. A 2-level, stride=|V| pTree for E.
2. An E×E relationship matrix showing (using a 1-bit) which edge pairs form a 2-path; then an E×E×E matrix showing which edge triples form a 3-path, etc.
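As a cross-check of those totals (11, 5, 1), a brute-force enumeration over the 8-vertex graph agrees. This is our verification code, not the slides’ pTree method:

```python
from itertools import combinations

# 8-vertex example: the 7-vertex graph plus V8 with edges (1,8) (2,8) (3,8) (4,8).
edges = {(1, 2), (1, 3), (1, 4), (1, 6), (2, 3), (2, 4), (3, 4),
         (5, 6), (5, 7), (6, 7), (1, 8), (2, 8), (3, 8), (4, 8)}

def is_clique(c):
    """Every vertex pair of c must be an edge."""
    return all((min(v, w), max(v, w)) in edges for v, w in combinations(c, 2))

def cliques_of_size(k):
    return [c for c in combinations(range(1, 9), k) if is_clique(c)]

assert len(cliques_of_size(3)) == 11
assert len(cliques_of_size(4)) == 5
assert cliques_of_size(5) == [(1, 2, 3, 4, 8)]
```

The counts follow from {1,2,3,4,8} being a K5: C(5,3)=10 triangles plus 567, C(5,4)=5 4-cliques, and the one 5-clique.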
Mining for Communities with more relaxed definitions than cliques (taken from Fortunato’s survey)
There are many cohesiveness definitions other than a clique. Another criterion for subgraph cohesion relies on the adjacency of its vertices: a vertex must be adjacent to some minimum number of other vertices in the subgraph. In the literature on social network analysis there are two complementary ways of expressing this. A k-plex is a maximal subgraph in which each vertex is adjacent to all other vertices of the subgraph except at most k of them. A k-core is a maximal subgraph in which each vertex is adjacent to at least k other vertices of the subgraph. In any graph there is a whole hierarchy of cores of different order. A k-core is essentially the same as a p-quasi complete subgraph, i.e., a subgraph such that the degree of each vertex is larger than p(k−1), where p is a real number in [0,1] and k is the order of the subgraph.
As cohesive as a subgraph can be, it would hardly be a community if there were also strong cohesion between the subgraph and the rest of the graph. Therefore, it is important to compare the internal and external cohesion of a subgraph. In fact, this is what is usually done in the most recent definitions of community. The first recipe, however, is not recent and stems from social network analysis.
An LS-set is a subgraph such that the internal degree of each vertex is greater than its external degree. This condition is quite strict and can be relaxed into the so-called weak definition of community, for which it suffices that the internal degree of the subgraph exceeds its external degree.
A community is strong if the internal degree of any vertex exceeds the number of edges that the vertex shares with any other community. A community is weak if its total internal degree exceeds the number of edges shared by the community with the other communities.
Another definition focuses on the robustness of clusters to edge removal and uses the concept of edge connectivity. The edge connectivity of a pair of vertices is the minimal number of edges that need to be removed in order to disconnect them (no path between them).
A lambda-set is a subgraph such that any pair of vertices of the subgraph has a larger edge connectivity than any pair formed by one vertex of the subgraph and one outside it. However, vertices of a lambda-set need not be adjacent and may be quite distant from each other.
Communities can also be identified by a fitness measure, expressing to what extent a subgraph satisfies a given property related to its cohesion. The larger the fitness, the more definite the community. This is the same principle behind quality functions, which give an estimate of the goodness of a graph partition. The simplest fitness measure for a cluster is its intra-cluster density δint(C) (see slide 1). One could say subgraph C with k vertices is a cluster if δint(C) > threshold. Finding such subgraphs is NP-complete, as it coincides with the NP-complete Clique Problem when the threshold is 1. It is better to fix the size of the subgraph because, without this condition, any clique would be one of the best possible communities, including trivial two-cliques (simple edges). Variants of this problem focus on the number of internal edges of the subgraph.
Another measure is the relative density of a subgraph C, defined as the ratio between the internal and the total degree of C (see slide 1). Finding subgraphs of a given size with relative density larger than a threshold is NP-complete.
Fitness measures can also be associated with the connectivity of the subgraph to the other vertices of the graph. A good community is expected to have a small cut size, i.e., a small # of edges joining it to the rest of the graph.
Degree Calculations using pTrees (the 8-vertex example; key = (v1,v2) over {1..8}²)
E as an adjacency matrix (row v = bits for vertices 1..8):
1: 01110101
2: 10110001
3: 11010001
4: 11100001
5: 00000110
6: 10001010
7: 00001100
8: 11110000
U (unique-edge matrix; each edge kept once, lower-triangle form):
1: 00000000
2: 10000000
3: 11000000
4: 11100000
5: 00000000
6: 10001000
7: 00001100
8: 11110000
Row-major flattenings: Er = 0111010110110001110100011110000100000110100010100000110011110000; Ur = 0000000010000000110000001110000000000000100010000000110011110000
Column-major forms: Ec = Er (E is symmetric); Uc = 0111010100110001000100010000000100000110000000100000000000000000, with row strides 1: 01110101, 2: 00110001, 3: 00010001, 4: 00000001, 5: 00000110, 6: 00000010, 7: 00000000, 8: 00000000, and level-1 stride 11111100.
V1 = 1111111100000000…0 (bits 1..8 on): V1 & Er picks out Er’s first row, so we don’t need to precompute the 2-level pTrees, but precomputing saves 1 AND each time.
Deg(Vk,C) = |PC & PVk| = |PCrk|
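The Deg(Vk,C) formula is one AND and one popcount per vertex. A sketch on the 8-vertex rows above (helper name `deg_in` is ours):

```python
# Row pTrees of the 8-vertex adjacency matrix (bit 1 = leftmost = vertex 1).
rows = {1: 0b01110101, 2: 0b10110001, 3: 0b11010001, 4: 0b11100001,
        5: 0b00000110, 6: 0b10001010, 7: 0b00001100, 8: 0b11110000}

def deg_in(v, C, n=8):
    """Deg(v, C) = |P_C & row_v|: 1-count of v's row ANDed with C's vertex mask."""
    pc = sum(1 << (n - u) for u in C)
    return bin(rows[v] & pc).count("1")

assert deg_in(1, {1, 2, 3, 4, 5, 6, 7, 8}) == 5   # full degree of vertex 1
assert deg_in(8, {1, 2, 3, 4}) == 4               # V8's degree inside {1,2,3,4}
```

The same AND-plus-popcount shape is what makes these degree measurements cheap even for very big graphs, one machine word per stride.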
k-plexes are subgraphs s.t. each vertex is adjacent to all other vertices of the subgraph except at most k of them.
k-plex existence algorithm: C is a k-plex iff ∀v∈VC, |PUC| ≥ COMB(|VC|,2) − k
k-plex inheritance theorem: every induced subgraph of a k-plex is a k-plex.
Proof: Let C be an induced subgraph of G. A vertex of C cannot be missing more adjacent C-edges in C than it is missing as a vertex in G, because every edge missing in C is also missing in G (if an edge (v,w) is missing in the induced graph C, then since v and w are vertices in G, that edge (v,w) cannot be in EG, lest it would have been induced into C).
Edge Count k-plex existence theorem: C is a k-plex iff |PUC| ≥ (|VC|!/((|VC|−2)!·2!)) − k
Mining all maximal k-plexes: start with G by checking |PUG|. If G is a k-plex, so are all of its induced subgraphs (inheritance theorem); done. Else check |PUC| for each induced subgraph C s.t. |VC| = |VG|−1. For each such C that is not a k-plex, check |PUD| for each induced subgraph D of C s.t. |VD| = |VC|−1.
Continue this until all induced subgraphs that are maximal k-plexes have been identified.
A k-core is a subgraph in which each vertex is adjacent to at least k other vertices of the subgraph. There is a hierarchy of cores of different order.
k-core inheritance theorem: if there is a cover of G by induced k-cores, then G is a k-core.
Edge Count k-core existence theorem: C is a k-core iff |PUC| ≥ k
Mining k-cores: if C is a k-core and D is a supergraph s.t. VD − VC = {w1,…,wW}, then D is a k-core iff degD(wh) ≥ k for h=1..W.
Note degD(w) = |PDU & PW| = |PD0n| where w is the nth vertex.
So if one computes all |PD0k|, then one can build the hierarchy of k-cores in D by examining the set of vertices where this degree is k=max.
Any k-core would have to be a subset of that set. Then go to k=max−1, etc.
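One standard way to materialize the k-core hierarchy is iterative peeling: repeatedly drop vertices whose degree among the survivors is below k. This is the textbook peeling definition rather than the slides’ pTree construction, but it uses the same row masks and popcounts (helper name `k_core` is ours):

```python
def k_core(rows, k, n=8):
    """Maximal k-core by peeling: drop vertices with < k neighbours
    among the surviving vertices until nothing changes."""
    alive = set(range(1, n + 1))
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            mask = sum(1 << (n - u) for u in alive - {v})
            if bin(rows[v] & mask).count("1") < k:
                alive.remove(v)
                changed = True
    return alive

rows = {1: 0b01110101, 2: 0b10110001, 3: 0b11010001, 4: 0b11100001,
        5: 0b00000110, 6: 0b10001010, 7: 0b00001100, 8: 0b11110000}
assert k_core(rows, 3) == {1, 2, 3, 4, 8}   # the K5 minus vertex degrees < 3
assert k_core(rows, 2) == set(range(1, 9))  # the whole graph is a 2-core
```

Peeling at k = max, max−1, … down to 1 yields the whole hierarchy of cores, matching the note above that any k-core must sit inside the surviving vertex set.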
Springer, May 2015, Charu C. Aggarwal: a comprehensive textbook on data mining (see our secret site). The emergence of data science as a discipline requires the development of a book that goes beyond the focus of books on fundamental data mining problems. More emphasis needs to be placed on the advanced data types such as text, time series, discrete sequences, spatial data, graph data, and social networks. This comprehensive book explores the different aspects of data mining, from the fundamentals to the complex data types and their applications. Therefore, this book may be used for both introductory and advanced data mining courses.
The chapters fall into one of three categories: 1. Chapters on the four main data mining problems: clustering, classification, association pattern mining, and outlier analysis. 2. Domain chapters discuss the specific methods used for different domains of data such as text data, time-series data, sequence data, graph data, and spatial data. 3. Application chapters study applications: stream mining, Web mining, ranking, recommendations, social networks, and privacy preservation.
About the Author: Charu Aggarwal is a Distinguished Research Staff Member (DRSM) at the IBM T. J. Watson Research Center in Yorktown Heights, New York. He completed his B.S. from IIT Kanpur in 1993 and his Ph.D. from Massachusetts Institute of Technology in 1996. He has worked extensively in the field of data mining, with particular interests in data streams, privacy, uncertain data and social network analysis. He has published 14 (3 authored and 11 edited) books, over 250 papers in refereed venues, and has applied for or been granted over 80 patents. His h-index is 70. Because of the commercial value of the above-mentioned patents, he has received several invention achievement awards and has thrice been designated a Master Inventor at IBM. He is a recipient of an IBM Corporate Award (2003) for his work on bio-terrorist threat detection in data streams, a recipient of the IBM Outstanding Innovation Award (2008) for his scientific contributions to privacy technology, and a recipient of an IBM Research Division Award (2008) for his scientific contributions to data stream research. He has received two best paper awards and an EDBT Test-of-Time Award (2014). He has served as the general or program co-chair of the IEEE Big Data Conference (2014), the ICDM Conference (2015), the ACM CIKM Conference (2015), and the KDD Conference (2016). He also co-chaired the data mining track at the WWW Conference 2009. He served as an associate editor of the IEEE Transactions on Knowledge and Data Engineering from 2004 to 2008. He is an associate editor of the ACM Transactions on Knowledge Discovery and Data Mining Journal , an action editor of the Data Mining and Knowledge Discovery Journal , an associate editor of the IEEE Transactions on Big Data, and an associate editor of the Knowledge and Information Systems Journal. He is editor-in-chief of the ACM SIGKDD Explorations. 
He is a fellow of the SIAM (2015), ACM (2013) and the IEEE (2010) for "contributions to knowledge discovery and data mining techniques."
Mohammad Zaki’s Data Mining book (See our secret site)
Bipartite Communities, Matthew P. Yancey, April 15, 2015
A recent trend in data mining is finding communities in a graph. A community is a vertex set s.t. the # of edges inside it is greater than expected.
(Cliques in social networks, families of proteins in protein-protein interaction networks, constructing groups of similar products in recommendation systems…)
An up-to-the-moment survey on community detection: S. Fortunato, “Community Detection in Graphs,” arXiv 0906.0612v2.
In graph clustering we look for a quantitative definition of community. No definition is universally accepted. Intuitively, a community has more edges “inside” than linked to the outside. Communities are often algorithmically defined (the final product of an algorithm, without a precise a priori definition).
Let subgraph C have nc vertices and G have n vertices. The internal [external] degree of v∈C, kvint [kvext], is the # of edges connecting v to other vertices of C [to the rest of the graph].
If kvext=0, the vertex has neighbors only in C. If kvint=0, instead, the vertex is disjoint from C and would be better assigned to a different cluster.
The internal degree kintC of C is the sum of the internal vertex degrees; the external degree kextC of C is the sum of the external vertex degrees; the total degree kC is the sum of the degrees of the vertices of C.
Intra-cluster density δint(C) = # internal C-edges / # possible internal edges = #int_edges_C / (nc(nc−1)/2). Inter-cluster density δext(C) = # inter-cluster edges of C / (nc(n−nc)).
Finding the best tradeoff between large δint(C) and small δext(C) is, implicitly or explicitly, the goal of most clustering algorithms.
A hop is a relationship, R, hopping from entity E to entity F. Strong Rule Mining finds all frequent, confident rules. SRMs are categorized by the number of hops, k, by whether they are transitive or non-transitive, and by the focus entity. ARM is 1-hop, non-transitive (A,C⊆E), F-focused SRM (1nF):
ct(&e∈A Re & PC) / ct(&e∈A Re) ≥ mncf; ct(&e∈A Re) ≥ mnsp
Consequent upward closure: if A→C is non-confident, then so is A→D for all subsets D of C. So for a frequent antecedent A, use upward closure to mine for all of its confident consequents.
Antecedent downward closure: if A is frequent, all of its subsets are frequent (if A is infrequent, its supersets are infrequent). Since frequency involves only A, we can mine for all qualifying antecedents using downward closure.
For transitive (a+c)-hop Apriori strong rule mining with a focus entity which is a hops from the antecedent and c hops from the consequent: if a (resp. c) is odd/even, then one can use downward/upward closure on that step in the mining of strong (frequent and confident) rules.
In this case A is 1 hop from F (odd: use downward closure) and C is 0 hops from F (even: use upward closure). We will be checking more examples to see whether the odd→downward, even→upward theorem seems to hold.
1-hop, transitive (A⊆E, C⊆F), F-focused SRM (1tF).
1-hop, transitive, E-focused rule A→C SRM (1tE): ct(PA & &f∈C Rf) / ct(PA) ≥ mncf; |A| = ct(PA) ≥ mnsp
Antecedent upward closure: if A is infrequent, then so are all of its subsets.
Consequent downward closure: if A→C is non-confident, then so is A→D for all supersets D of C.
In this case A is 0 hops from E (even: use upward closure) and C is 1 hop from E (odd: use downward closure).
2-hop transitive F-focused: A→C strong if ct(&e∈A Re & &g∈C Sg) / ct(&e∈A Re) ≥ mncf and ct(&e∈A Re) ≥ mnsp
(Example R(E,F) and S(F,G) bit matrices, with A⊆E and C⊆G, are shown on the slide.)
Apriori for 2-hops: find all frequent antecedents A using downward closure. Find C1G, the set of g’s s.t. A→{g} is confident. Find C2G, the set of C1G pairs that are confident consequents for antecedent A. Find C3G, the set of triples (from C2G) s.t. all subpairs are in C2G (à la Apriori), etc.
1,1 odd, so down,down is correct.
2-hop transitive G-focused: ct((&f∈List(&e∈A Re) Sf) & PC) / ct(&f∈List(&e∈A Re) Sf) ≥ mncf; ct(&f∈List(&e∈A Re) Sf) ≥ mnsp
1. Antecedent upward closure: if A is infrequent, then so are all of its subsets.
2. Consequent upward closure: if A→C is non-confident, so is A→D for all subsets D.
2,0 even, so up,up is correct.
2-hop transitive E-focused: ct(PA & &f∈List(&g∈C Sg) Rf) / ct(PA) ≥ mncf; ct(PA) ≥ mnsp
Antecedent upward closure: if A is infrequent, so are all of its subsets.
Consequent upward closure: if A→C is non-confident, so is A→D for all subsets D.
0,2 even, so up,up is correct.
APPENDIX: A→C is confident if a high fraction of the f∈F which are related to every a∈A are also related to every c∈C. F is the Focus Entity, and the “high fraction” is the MinimumConfidence ratio.
R(E,F), E = {1,2,3,4}, F = {2,3,4,5}, rows by e:
1: 1011
2: 1011
3: 1101
4: 1111
SuppSetA (the set of F’s related to every element of A) = {2,3,5}; SuppSetC = {2,4,5}; ConfA→C = |SuppSetA∩SuppSetC| / |SuppSetA| = ct(&e∈A∪C Pe) / ct(&e∈A Pe) = 2/3
Question: Why isn’t ConfA→C = SuppC / SuppA?
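The confidence ratio is just two AND-chains and two popcounts. A sketch on the appendix’s R(E,F) (helper name `conf` is ours; A={3}, C={1} reproduce SuppSetA={2,3,5} and SuppSetC={2,4,5}):

```python
# R(E,F) from the appendix: rows are e in E={1..4}; columns f in F={2,3,4,5},
# leftmost bit = f=2.
R = {1: 0b1011, 2: 0b1011, 3: 0b1101, 4: 0b1111}
FULL = 0b1111

def conf(A, C):
    """Conf(A => C) = ct(AND of rows in A and C) / ct(AND of rows in A)."""
    supp_a = FULL
    for e in A:
        supp_a &= R[e]
    supp_ac = supp_a
    for c in C:
        supp_ac &= R[c]
    return bin(supp_ac).count("1") / bin(supp_a).count("1")

assert abs(conf({3}, {1}) - 2 / 3) < 1e-9   # the slide's 2/3
```

This also answers the closing question: the denominator is the support set of A alone, intersected with C’s support set in the numerator, not a ratio of two independent supports.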
2-level, stride=|V|=8 pTrees for E, to aid in performing the CCS creation steps in Alg B (key = (v1,v2) over {1..8}²):
E = 0111010110110001110100011110000100000110100010100000110011110000
CS2 = {12 13 14 16 18 23 24 28 34 38 48 56 57 67}
E-L1 = 1111 1111
E-L0 (row strides): 1: 0111 0101; 2: 1011 0001; 3: 1101 0001; 4: 1110 0001; 5: 0000 0110; 6: 1000 1010; 7: 0000 1100; 8: 1111 0000
U = 0111010100110001000100010000000100000110000000100000000000000000
U-L1 = 1111 1100
U-L0 (row strides): 1: 0111 0101; 2: 0011 0001; 3: 0001 0001; 4: 0000 0001; 5: 0000 0110; 6: 0000 0010; 7: 0000 0000; 8: 0000 0000
123 124 126 128 134 136 138 146 148 168 ∈ CCS3 (3-sets formed from pairs of 2-sets that share V1). Etc. The 2-level pTrees don’t seem to aid CCS creation.
E2key1,1,11,1,21,1,31,1,41,2,11,2,21,2,31,2,41,3,11,3,21,3,31,3,41,4,11,4,21,4,31,4,42,1,12,1,22,1,32,1,42,2,12,2,22,2,32,2,42,3,12,3,22,3,32,3,42,4,12,4,22,4,32,4,43,1,13,1,23,1,33,1,43,2,13,2,23,2,33,2,43,3,13,3,23,3,33,3,43,4,13,4,23,4,33,4,44,1,14,1,24,1,34,1,44,2,14,2,24,2,34,2,44,3,14,3,24,3,34,3,44,4,14,4,24,4,34,4,4
pTree path-based analytics? Pre-construct length=2 path pTrees (E2), length=3 (E3), etc.
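The pre-construction of the length-2 path pTree can be sketched directly from the edge rows of the running 4-vertex example (edges 13, 14, 24, 34). This is our illustrative sketch, not the source's implementation; bit (v1,v2,v3) of PE2 is 1 iff v1-v2-v3 is a length-2 path (two edges, so v1 ≠ v3).

```python
# Build PE2, the 64-bit length-2 path pTree keyed by (v1,v2,v3),
# from per-vertex edge bitmaps (bit for vertex 1 is the leftmost of 4).

E = {
    1: 0b0011,  # neighbors of 1: {3, 4}
    2: 0b0001,  # neighbors of 2: {4}
    3: 0b1001,  # neighbors of 3: {1, 4}
    4: 0b1110,  # neighbors of 4: {1, 2, 3}
}

def has_edge(a, b):
    return (E[a] >> (4 - b)) & 1  # bit for vertex b in row a

def build_PE2():
    bits = []
    for v1 in range(1, 5):
        for v2 in range(1, 5):
            for v3 in range(1, 5):
                bits.append(1 if has_edge(v1, v2) and has_edge(v2, v3)
                                 and v1 != v3 else 0)
    return "".join(map(str, bits))
```

Running `build_PE2()` reproduces the 64-bit PE2 string shown below for this graph.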
PE3 = 0000000000000000000000000000000000000000000001000000000000000000
      0000000000000000000000000000000000000000000000000000000000000000
      0000000000000000000000000000000000000000000000000000000000000000
      0000000000000000000000000000000000000000000000000000000000000000
E3key: 1,1,1,1 1,1,1,2 … 1,4,4,4  2,1,1,1 … 2,4,4,4  3,1,1,1 … 3,4,4,4  4,1,1,1 … 4,4,4,4 (all 256 quadruples over {1,2,3,4}; each 64-bit row of PE3 above corresponds to one leading digit)
PE2 = 0000000000010110000000000000001000010000000000000000000000000000
E (Edge Table):
  Ekey  V1  V2  ELabel
  1,3   1   3   1
  1,4   1   4   2
  2,4   2   4   3
  3,4   3   4   1
[Figure: the 4-vertex labeled graph, vertex labels 1:2, 2:3, 3:2, 4:3]
PE3 = 0000000000000000000000000000000000000000000011000000000000000000
      0000000000000000000000000000000000000000000000000010000010000000
      0000000000000100000000000000000000000000000000000000000000000000
      0000000000000000000000000000000000000000000000000000000000000000
E3key: 1,1,1,1 … 4,4,4,4 (256 quadruples; each 64-bit row corresponds to one leading digit)
PE2 = 0000000000010110000000000000101000010000000011000010000010000000
[Figure: 4-vertex graph, edges 13, 14, 24, 34]
PE   = 0011_0001_1001_1110
Ekey = 1,1 1,2 1,3 1,4 _ 2,1 2,2 2,3 2,4 _ 3,1 3,2 3,3 3,4 _ 4,1 4,2 4,3 4,4
PU   = 0011_0001_0001_0000
PU2  = 0000000000010110000000000000001000010000000000000000000000000000
A path in a graph is a finite or infinite sequence of edges which connect a sequence of vertices which, by most definitions, are all distinct from one another except possibly the endpoints.
[Figure: the E2 construction shown as stacked 4×4 bit-matrix slices over vertices V1–V4, built from the base adjacency matrix E = 0011_0001_1001_1110]
EE = 0000000000010110000000000000101000010000000011000010000010000000
[Figure: 4-vertex graph, edges 13, 14, 24, 34]
E = 0011_0001_1001_1110
Ekey = 1,1 1,2 1,3 1,4 _ 2,1 2,2 2,3 2,4 _ 3,1 3,2 3,3 3,4 _ 4,1 4,2 4,3 4,4
A path is a sequence of edges connecting a sequence of vertices which are (usually) all distinct from one another except the endpoints.
[Figure: repeat of the stacked 4×4 bit-matrix slices over (V1,V2,V3,V4); base adjacency 0011_0001_1001_1110]
EEkey: 1,1,1 … 4,4,4 (64 triples)
E1 = 0011   E2 = 0001   E3 = 1001   E4 = 1110
U1 = 0011   U2 = 0001   U3 = 0001   U4 = 0000
U  = 0011_0001_0001_0000
EE1 = 0000000000010110   EE2 = 0000000000001010
EE3 = 0001000000001100   EE4 = 0010000010000000
EE11=0000  EE12=0000  EE13=0001  EE14=0110
EE21=0000  EE22=0000  EE23=0000  EE24=1010
EE31=0001  EE32=0000  EE33=0000  EE34=1100
EE41=0010  EE42=0000  EE43=1000  EE44=0000
M1 = 1000   M2 = 0100   M3 = 0010   M4 = 0001
For k∈ListEh: EEhk = Ek & M'h.  For other k: EEhk = 0 (pure-zero).
For h=1, ListE1 = {3,4}:
  h=1, k=3:  EE13 = E3 & M'1 = 1001 & 0111 = 0001
  h=1, k=4:  EE14 = E4 & M'1 = 1110 & 0111 = 0110
For h=2, ListE2 = {4}:
  h=2, k=4:  EE24 = E4 & M'2 = 1110 & 1011 = 1010
For h=3, ListE3 = {1,4}:
  h=3, k=1:  EE31 = E1 & M'3 = 0011 & 1101 = 0001
  h=3, k=4:  EE34 = E4 & M'3 = 1110 & 1101 = 1100
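The masking rule EEhk = Ek & M'h can be sketched with int bitmasks. This is our illustrative sketch, assuming the 4-bit edge rows of the running example; the function names are ours.

```python
# EEhk = Ek & M'h: for head vertex h, mask vertex h out of each neighbor
# row Ek (k a neighbor of h), so a length-2 path h-k-j never returns to h.

E = {1: 0b0011, 2: 0b0001, 3: 0b1001, 4: 0b1110}   # adjacency rows
ALL = 0b1111

def M(h):
    """Mh: the one-bit mask selecting vertex h (vertex 1 = leftmost bit)."""
    return 1 << (4 - h)

def EE(h, k):
    """EEhk = Ek & M'h when k is in ListEh (h-k is an edge), else pure-zero."""
    if E[h] & M(k):
        return E[k] & (ALL ^ M(h))   # M'h = complement of Mh
    return 0
```

For example, EE(4,2) comes out pure-zero, matching the "EE42 = 0000, pure0" case on the slide.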
[Figure: repeated 4×4 bit-matrix slices of the path cube over (V1,V2,V3,V4); base matrix 0011_0001_1001_1110]
key: 1,1 1,2 … 8,8 (64 pairs)
E = 0111010110110001110100011110000100000110100010100000110011110000
[Figure: the 8-vertex graph]
2-level, stride=|V|=8, pTrees for path analytics?
E1key 1234 5678:   E: 1111 1111
E0key 1234 5678:
  1: 0111 0101
  2: 1011 0001
  3: 1101 0001
  4: 1110 0001
  5: 0000 0110
  6: 1000 1010
  7: 0000 1100
  8: 1111 0000
U = 0111010100110001000100010000000100000110000000100000000000000000
U1key 1234 5678:   U: 1111 1100
U0key 1234 5678:
  1: 0111 0101
  2: 0011 0001
  3: 0001 0001
  4: 0000 0001
  5: 0000 0110
  6: 0000 0010
E:     1 2 3 4 5 6 7 8
  1    0 1 1 1 0 1 0 1
  2    1 0 1 1 0 0 0 1
  3    1 1 0 1 0 0 0 1
  4    1 1 1 0 0 0 0 1
  5    0 0 0 0 0 1 1 0
  6    1 0 0 0 1 0 1 0
  7    0 0 0 0 1 1 0 0
  8    1 1 1 1 0 0 0 0
U:     1 2 3 4 5 6 7 8
  1    0 0 0 0 0 0 0 0
  2    1 0 0 0 0 0 0 0
  3    1 1 0 0 0 0 0 0
  4    1 1 1 0 0 0 0 0
  5    0 0 0 0 0 0 0 0
  6    1 0 0 0 1 0 0 0
  7    0 0 0 0 1 1 0 0
  8    1 1 1 1 0 0 0 0
E1key 1234 5678:   E: 1111 1111
E0key 1234 5678:
  1: 0111 0101
  2: 1011 0001
  3: 1101 0001
  4: 1110 0001
  5: 0000 0110
  6: 1000 1010
  7: 0000 1100
  8: 1111 0000
U1key 1234 5678:   U: 1111 0111
U0key 1234 5678:
  1: 0000 0000
  2: 1000 0000
  3: 1100 0000
  4: 1110 0000
  6: 1000 1000
  7: 0000 1100
  8: 1111 0000
Find all paths of length=3 that start at vertex h: the 1st and 2nd vertices come from EOh; the 3rd from P'h & E0k for k∈EOh.
h=1: EO1 = 0111 0101
  E02 = {123, 124, 128}       E03 = {132, 134, 138}       E04 = {142, 143, 148}
  E06 = {165, 167}            E08 = {182, 183, 184}
h=2: EO2 = 1011 0001
  E01 = {213, 214, 216, 218}  E03 = {231, 234, 238}       E04 = {248}
  E08 = {281, 283, 284}
h=3: EO3 = 1101 0001
  E01 = {312, 314, 316, 318}  E02 = {321, 324, 328}       E04 = {341, 342, 348}
  E08 = {381, 382, 384}
h=4: EO4 = 1110 0001
  E01 = {412, 413, 416, 418}  E02 = {421, 423, 428}       E03 = {431, 432, 438}
  E08 = {481, 482, 483}
h=5: EO5 = 0000 0110
  E06 = {561, 567}            E07 = {576}
h=6: EO6 = 1000 1010
  E01 = {612, 613, 614, 618}  E05 = {657}                 E07 = {675}
h=7: EO7 = 0000 1100
  E05 = {756}                 E06 = {765}
h=8: EO8 = 1111 0000
  E01 = {812, 813, 814, 816}  E02 = {821, 823, 824}       E03 = {831, 832, 834}
  E04 = {841, 842, 843}
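The "3rd vertex from P'h & E0k" step can be sketched as bitmask operations over the 8-bit rows E0 of this graph. This is our sketch (names `vmask`, `third_vertices`, `count_3paths` are ours), under the assumption that the third vertex is any bit of E0k with h (and k) masked out.

```python
# 3-vertex paths h-k-j of the 8-vertex example: j ranges over E0k & P'h.
# Bit for vertex 1 is the leftmost of the 8 bits.

E0 = {1: 0b01110101, 2: 0b10110001, 3: 0b11010001, 4: 0b11100001,
      5: 0b00000110, 6: 0b10001010, 7: 0b00001100, 8: 0b11110000}

def vmask(v):
    return 1 << (8 - v)

def third_vertices(h, k):
    """Vertices j with h-k-j a 3-vertex path: E0k with h (and k) masked out."""
    m = E0[k] & ~vmask(h) & ~vmask(k)
    return {v for v in range(1, 9) if m & vmask(v)}

def count_3paths(h):
    return sum(len(third_vertices(h, k))
               for k in range(1, 9) if E0[h] & vmask(k))
```

For h=1, k=4 this yields the third vertices {2, 3, 8}, i.e., the paths 142, 143, 148 above, and the count for h=1 is 14.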
The # of 3-paths starting at:   1   2   3   4  5  6  7   8  Tot
                               14  11  13  13  3  6  2  13   76
Find all 4-paths ending with each 3-path. Prefix E01 (0111 0101) onto each 3-path starting at 1: 123 124 128, 132 134 138, 142 143 148, 165 167, 182 183 184.
Concatenate with each; eliminate a result if a digit (vertex) duplicates.
2134 2138 2143 2148 2165 2183 2184 3124 3128 3142 3148 3165 3167 3182 3184 4123 4128 4132 4128 4165 4167 4182 4183 5123 5124 5128 5132 5143 5138 5154 5143 5148 5167 5182 5183 5184
6123 6124 6128 6132 6134 6138 6142 6143 6148 6182 6183 6184 7123 7124 7128 7132 7134 7138 7142 7143 7148 7165 7182 7183 7184 8123 8124 8132 8134 8142 8143 8165 8167
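The concatenate-and-eliminate step can be sketched on paths written as digit strings, as on the slide. This is our illustrative sketch; choosing valid front vertices (neighbors of the path's first vertex, e.g. from E01) is assumed to happen before calling it.

```python
# "Concatenate with each; eliminate if a digit duplicates": extend a
# 3-path digit string (e.g. "134") by a new front vertex h.

def extend(h, path3):
    """Prefix vertex h to a 3-path string; drop it if h already occurs."""
    return None if str(h) in path3 else str(h) + path3

def extend_all(front_vertices, paths3):
    return [p for h in front_vertices
              for p in (extend(h, q) for q in paths3) if p]
```

For example, extending "134" by 2 gives "2134", while extending it by 3 is eliminated because vertex 3 would repeat.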
[Slide residue: additional 3-path lists grouped by leading vertex, with digit rulers]
h=4 next
DTPe, k=1..7: TD RoloDex cards (Term 1..9 × Doc 1..3)
DTPe, k=1..9: PD cards (Pos 1..7 × Doc 1..3)
DTPe, k=1..3: PT cards (Pos 1..7 × Term 1..9)
We can form multi-hop relationships from RoloDex cards. A→C is confident if most of the f∈F related to every a∈A are also related to every c∈C. F is the Focus Entity and "most" means at least a MinimumConfidence ratio.
[Figure: DT(P=h) and DT(P=k) cards, Doc 1..3 × Term 1..9, antecedent A and consequent C marked]
A confident DThk rule means: a high fraction of the terms t∈T in Position=h of every doc in A are also in Position=k of every doc in C.
Is there a high payoff research area here?
[Figure: DP(T=h) and DP(T=k) cards, Doc 1..3 × Pos 1..7, antecedent A and consequent C marked]
A confident DPhk rule means: a high fraction of the positions p∈P which hold Term=h for every doc in A also hold Term=k in position p for every doc in C.
[Figure: TP(D=h) and TP(D=k) cards, Term 1..9 × Pos 1..7, antecedent A and consequent C marked]
Confident TPhk: a high fraction of the p∈P in Doc=h holding every t∈A also hold every t∈C in Doc=k. This only makes sense for A, C singleton Terms. Also it seems like P would have to be singleton?
[Figure: TD(P=h) and TD(P=k) cards, Term 1..9 × Doc 1..3, antecedent A and consequent C marked]
A confident TDhk rule means a high fraction of the documents d∈D having, in Position=h, every term t∈A also have, in Position=k, every term t∈C. Again, A and C must be singletons. High payoff? It suggests, in 1-hop ARM:
Confident TD rules: a high fraction of docs d∈D having every term t∈A also have every term t∈C. Again, A and C must be singletons. Is there a high payoff research area here?
[Figure: PD(T=h) and PD(T=k) cards, Pos 1..7 × Doc 1..3, antecedent A and consequent C marked]
A confident PDhk rule: a high fraction of the documents d∈D having Term=h in every position p∈A also have Term=k in every position p∈C.
[Figure: PT(D=h) and PT(D=k) cards, Pos 1..7 × Term 1..9, antecedent A and consequent C marked]
A confident PThk rule means: a high fraction of the terms t∈T in Doc=h which occur at every position p∈A also occur at every position p∈C in Doc=k.
Is this a high payoff research area?
Market Basket RoloDex with a different Cust-Item card for each day:
[Figure: Buys(Cust,Item) cards for Day=1, Day=2, …, Day=k; antecedent A and consequent B marked]
Confident Buy12 rule: customers who buy A on Day=1 buy B on Day=2 with high probability.
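The Buy12 confidence can be sketched with customer bitmasks, one per item per day. This is our illustrative sketch: the data layout (`buys[day][item]` = customer bitmask) and all names and toy values are assumptions, not from the source.

```python
# Among customers who bought every item of A on Day=1, what fraction
# bought every item of B on Day=2?

def itemset_mask(card, items, n_cust):
    """AND of item bitmasks: customers who bought ALL the items."""
    m = (1 << n_cust) - 1
    for i in items:
        m &= card[i]
    return m

def conf_buy12(day1, day2, A, B, n_cust):
    a = itemset_mask(day1, A, n_cust)        # bought all of A on day 1
    ab = a & itemset_mask(day2, B, n_cust)   # ...and all of B on day 2
    return bin(ab).count("1") / bin(a).count("1")

# toy data: 4 customers (one bit each), 3 items
day1 = {1: 0b1101, 2: 0b1001, 3: 0b0111}
day2 = {1: 0b1011, 2: 0b0011, 3: 0b1110}
```

With this toy data, A={1,2} selects customers 1001 on day 1, and both of them bought item 1 on day 2, so the confidence is 1.0.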
“Buys” pathways?
[Figure: Buys cards for Day=1 and Day=2, Items 1..3 × Customers 1..9, antecedent A marked]
Confident Buy123 pathway: most customers who buy A on Day=1 buy B on Day=2, and most of those customers buy all of D on Day=3.
[Figure: Buys cards for Day=1, Day=2 and Day=3, Items × Customers, sets A and D marked]
Confident Buy1234 pathway: some customers buy all of A on Day=1; most of those buy all of B on Day=2; most of those buy all of D on Day=3; and most of those buy all of E on Day=4.
[Figure: Buys cards for Day=3 and Day=4, with the set E marked]
Protein-Protein Interaction RoloDex (a different card for each interaction in some pathway):
[Figure: Gene × Gene card for Interaction=k]
[Figure gallery of RoloDex cards: cust-item card; author-doc card; term-doc card; doc-doc card; exp-gene card; gene-gene card (ppi); exp-PI card]
[Figure: customer-rates-movie card, cell values 0–5]
[Figure: customer-rates-movie-as-5 card, a bitmap of the rating=5 cells]
[Figure: Enrollments (Course × People) card; items and terms axes]
DataCube Model for 3 entities: items, people and terms.
[Figure: term-term card (share stem?)]
Relational Model:
Items:  i1 i2 i3 i4 i5    |0 001|0 |0 11| |1 001|0 |1 01| |2 010|1 |0 10|
People: p1 p2 p3 p4       |0 100|A|M| |1 001|T|M| |2 010|S|F| |3 011|B|F| |4 100|C|M|
Terms:  t1 t2 t3 t4 t5 t6 |1 010|1 101|2 11| |2 001|0 000|3 11| |3 011|1 001|3 11| |4 011|3 001|0 00|
Relationship: p1 i1 t1    |0 0| 1 |0 1| 1 |1 0| 1 |2 0| 2 |3 0| 2 |4 1| 2 |5 1|_2
RoloDex Model: 2 entities, many relationships.
One can form multi-hops with any of these cards. Are there any that provide an interesting setting for ARM data mining?
3-hop: R(E,F), S(F,G), T(G,H);  E=1..4, F=2..5, G=1..4, H=2..5;  A⊆E, C⊆H.
R(E,F) = 0001_0010_0001_0100
S(F,G) = 1001_0111_1000_1100
T(G,H) = 0001_1010_0001_0101
Collapse T: TC ≡ {g∈G | T(g,h) ∀h∈C}. That's just 2-hop with TC⊆G replacing C. (∀ can be replaced by ∃.) Collapse T and S: STC ≡ {f∈F | S(f,g) ∀g∈TC}. Then it's 1-hop with STC replacing C.
Focus on G:  mncnf ≤ ct( (&f∈(&e∈A Re) Sf) & (&h∈C Th) ) / ct( &f∈(&e∈A Re) Sf )
Focus on F:  mncnf ≤ ct( (&e∈A Re) & (&g∈(&h∈C Th) Sg) ) / ct( &e∈A Re )
Focus on F: ct( 1001 & (&g=1,3,4 Sg) ) / ct(1001) = ct( 1001 & 1001 & 1000 & 1100 ) / 2 = ct(1000)/2 = 1/2
Focus on F is different because the confidences can be different.
Focus on G: ct( (&f=2,5 Sf) & 1101 ) / ct(&f=2,5 Sf) = ct( 1101 & 0011 & 1101 ) / ct( 1101 & 0011 ) = 1/1 = 1
Focus on F: mnsp ≤ ct( &e∈A Re )
Focus on G: mnsp ≤ ct( &f∈(&e∈A Re) Sf )
Antecedent downward closure: A infrequent implies supersets infrequent. A is 1 hop from F (down). Consequent upward closure: A→C non-confident implies A→D non-confident, D⊆C. C is 2 hops (up).
Antecedent upward closure: A infrequent implies all subsets infrequent. A is 2 hops from G (up). Consequent downward closure: A→C non-confident implies A→D non-confident, D⊇C. C is 1 hop (down).
4-hop: R(E,F), S(F,G), T(G,H), U(H,I);  E=1..4, F=2..5, G=1..4, H=2..5, I=1..4;  A⊆E, C⊆I.
R(E,F) = 0001_0010_0001_0100
S(F,G) = 1001_0111_1000_1100
T(G,H) = 0001_1010_0001_0101
U(H,I) = 1001_0101_1000_1100
Focus on G? Replace C by UC and A by RA as above (not different from 2-hop?).
Focus on H (use RA for A, then 3-hop) or focus on F (use UC for C, then 3-hop).
Another focus on G (main): mncnf ≤ ct( (&f∈(&e∈A Re) Sf) & (&h∈(&i∈C Ui) Th) ) / ct( &f∈(&e∈A Re) Sf )
mncnf ≤ ( ct(S1(&e∈A Re, &i∈C Ui)) + ct(S2(&e∈A Re, &i∈C Ui)) + ... + ct(Sn(&e∈A Re, &i∈C Ui)) ) / ( (ct(&e∈A Re))·n·ct(&i∈C Ui) )
R(E,G)  = 0011_0011_0001_0100   (E=1..4 × G=2..5), A⊆E
S1(G,G) = ... = Sn(G,G) = 1001_0111_1000_1100
U(G,I)  = 1011_0111_1000_1100   (I=2..5), C⊆I
2. (consequent upward closure) If A→C is non-confident, then so is A→D for all subsets, D, of C (the "list" will be larger, so the AND over the list will produce fewer ones). So for a frequent antecedent, A, use upward closure to mine out all confident consequents, C.
1. (antecedent upward closure) If A is infrequent, then so are all of its subsets (the "list" will be larger, so the AND over the list will produce fewer ones). Frequency involves only A, so mine all qualifying antecedents using upward closure.
4-hop APRIORI, focus on G:  mncnf ≤ ct( (&f∈(&e∈A Re) Sf) & (&h∈(&i∈C Ui) Th) ) / ct( &f∈(&e∈A Re) Sf );  mnsup ≤ ct( &f∈(&e∈A Re) Sf )
5-hop: R(E,F), S(F,G), T(G,H), U(H,I), V(I,J);  A⊆E, C⊆J.
R(E,F) = 0001_0010_0001_0100
S(F,G) = 1001_0111_1000_1100
T(G,H) = 0001_1010_0001_0101
U(H,I) = 1001_0101_1000_1100
V(I,J) = 0001_1010_0001_0101
Given any 1-hop labeled relationship (e.g., cells have values from {1,2,…,n}), there is: 1. a natural n-hop transitive relationship, A implies D, by alternating entities for each specific label-value relationship; 2. cards for each entity consisting of the bitslices of cell values.
E.g., in Netflix, Rating(Cust,Movie) has label set {0,1,2,3,4,5}, so in 1. it generates a bona fide 6-hop transitive relationship. In 2., an alternative is to bitmap each label value (rather than bitslicing). Below, Rn-i can be bitslices or bitmaps.
R3(C,M) = 0001_0010_0001_0100
R2(M,C) = 1001_0111_1000_1100
R4(M,C) = 0001_1010_0001_0101
R5(C,M) = 1001_0101_1000_1100
R0(M,C) = 0001_1010_0001_0101
R1(C,M) = 1101_0001_1101_1100
(Customers and Movies alternate across the hops; A is the antecedent set, D the consequent.)
E.g., equity trading on a given day: QuantityBought(Cust,Stock) with labels {0,1,2,3,4,5} (where n means n thousand shares) generates a bona fide 6-hop transitive relationship.
Equity trading, moved-similarly: (moved similarly on a day) → StockStock(#DaysMovedSimilarlyOfLast10). Equity trading, moved-similarly2: define moved similarly to mean that stock2 moved similarly to what stock1 did the previous day; define the relationship StockStock(#DaysMovedSimilarlyOfLast10). Gene-Experiment: label values could be "expression level". Intervalize and go!
Has Strong Transitive Rule Mining (STRM) been done? Are there downward/upward closure theorems already for it? Is it useful? That is, are there good examples of use: stocks, gene-experiment, MBR, Netflix predictor, ...?
R0(E,F), …, Rn-2(E,F), Rn-1(E,F):  E=1..4 × F=2..5;  A⊆E, D the consequent. Each card in this toy example is 0001_0010_0001_0100.
[Figure residue: the stacked cards and a column of 0001 rows]
BoughtBy(I,C): Items 1..20 × Customers 2..5; first rows: 1001, 0111, 1000, 1100.
Buys(C,T): Customers × Types 1..4; rows: 0001, 0010, 0001, 0100.
A ⊆ Items (the antecedent), D ⊆ Types (the consequent).
Let Types be an entity which clusters Items (moves Items up the semantic hierarchy). E.g., in a store, Types might include: dairy, hardware, household, canned, snacks, baking, meats, produce, bakery, automotive, electronics, toddler, boys, girls, women, men, pharmacy, garden, toys, farm. Let A be an ItemSet wholly of one Type, TA, and let D be a TypesSet which does not include TA. Then:
A→D might mean: if ∃i∈A s.t. BB(i,c), then ∀t∈D, B(c,t).
A→D might mean: if ∀i∈A, BB(i,c), then ∀t∈D, B(c,t).
A→D might mean: if ∃i∈A s.t. BB(i,c), then ∃t∈D, B(c,t).
A→D might mean: if ∀i∈A, BB(i,c), then ∃t∈D, B(c,t).
A→D frequent might mean: ct(&i∈A BBi) ≥ mnsp, or ct(|i∈A BBi) ≥ mnsp, or ct(&t∈D Bt) ≥ mnsp, or ct(|t∈D Bt) ≥ mnsp, or ct(&i∈A BBi & &t∈D Bt) ≥ mnsp, etc.
A→D confident might mean: ct(&i∈A BBi & &t∈D Bt) / ct(&i∈A BBi) ≥ mncf, or ct(&i∈A BBi & |t∈D Bt) / ct(&i∈A BBi) ≥ mncf, or ct(|i∈A BBi & |t∈D Bt) / ct(|i∈A BBi) ≥ mncf, or ct(|i∈A BBi & &t∈D Bt) / ct(|i∈A BBi) ≥ mncf.
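The "&" (for-all) versus "|" (there-exists) masks behind these variants can be sketched as folds over item bitmasks. This is our illustrative sketch; the names and toy data are assumptions.

```python
# BB[i] = customer bitmask for item i. The "&" mask selects customers
# related to EVERY item of S; the "|" mask selects those related to SOME.

from functools import reduce

def all_mask(BB, S, n_cust):
    return reduce(lambda a, i: a & BB[i], S, (1 << n_cust) - 1)

def any_mask(BB, S):
    return reduce(lambda a, i: a | BB[i], S, 0)

BB = {1: 0b1100, 2: 0b0110}   # toy: 4 customers, 2 items
```

Each frequency/confidence variant above is just a count of ones in one of these masks (or in an AND of two of them), divided as appropriate.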
Text Mining using pTrees
DTPe in P-pTreeSet, index (T,D): [Figure: per-position bit columns for Doc1, Doc2, Doc3; e.g., the column for Term=buy]
DTPe Position Table:
  Pos  T1D1 T1D2 T1D3 ... T9D1 … T9D3
  1     1    0    1   ...  0   …  0
  ...
  7     0   …  0   . . .   1   …  1
DTPe Data Cube: [Figure: 0/1 cube over Doc(1..3) × Term(1..9) × Pos(1..7)]
TDcard(P=k), k=1..7:  the DTPe TD RoloDex cards (Term 1..9 × Doc 1..3)
PDcard(T=k), k=1..9:  the DTPe PD cards (Pos 1..7 × Doc 1..3)
PTcard(D=k), k=1,2,3: the DTPe PT cards (Pos 1..7 × Term 1..9)
DTPe Document Table:
  Doc  T1P1 … T1P7 . . . T9P1 … T9P7
  1     1   …  0   . . .  0   …  0
  2     0   …  0   . . .  1   …  0
  3     0   …  0   . . .  1   …  1
Classical Document Table:
  Doc  Auth  Date    . . .  Subj1 … Subjm
  1     1    1/2/13  . . .   0    …  0
  2     0    2/2/15  . . .   1    …  0
  3     0    3/3/14  . . .   1    …  1
DTPe DocTbl DpTreeSet, indexed by (T,P): [Figure: per-term position bit-rows (Positions 1..7) for terms AAPL, all, always, an, and, apple, April, are, buy, …]
Classical DocTbl DpTreeSet: [Figure: Auth, Date, Subj1 … Subjm columns]
DTPe Term Table:
  Term  P1D1 P1D2 P1D3 ... P7D1 … P7D3
  1      1    0    1   ...  0   …  0
  ...
  9      0   …  0   . . .   1   …  1
DTPe Term Usage Table:
  Term  P1D1  P1D2  P1D3  ...  P7D1 … P7D3
  1     noun  verb  adj   adv  …  noun
  ...
  9     adj   noun  noun  adj  noun
DTPe TpTreeSet, index (D,P): [Figure: per-document bit columns (Doc1, Doc2, Doc3) across positions 1, 2, …]
P1D1=noun, P1D1=adj: [Figure: one bit column per part-of-speech value]
tf is the +rollup of the DTPe datacube along the position dimension. One can use any measurement or data structure of measurements, e.g., DTtfidf, in which each cell holds a decimal tfidf value. That value can be bitsliced directly into whole-number bitslices plus fractional bitslices (one for each binary digit to the right of the binary point; no need to shift!) using MOD(INT(x/2^k), 2). E.g., tfidf = 3.5 is:
  k:   3  2  1  0  -1  -2
  bit: 0  0  1  1   1   0
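The bitslicing formula MOD(INT(x/2^k), 2) can be sketched in a few lines; this reproduces the tfidf = 3.5 example (the function name is ours).

```python
# bit_k(x) = MOD(INT(x / 2**k), 2): the bitslice of x at power-of-two
# position k, where negative k gives the fractional bitslices.

def bitslice(x, ks):
    """Return the bit of x at each position k (k may be negative)."""
    return [int(x / 2**k) % 2 for k in ks]
```

For x = 3.5 and k = 3, 2, 1, 0, -1, -2 this gives 0, 0, 1, 1, 1, 0, i.e., binary 0011.10.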
[Figure: DTtf (Doc×Term term-frequency) Data Cube, Docs 1..3 × Terms, with sample counts 1 and 2]
DT tfidf Doc Table:
  Doc  T1   T2  . . .  T9
  1    .75   0  . . .   1
  2     0    1         .25
  3     0    0          0
DT tfidf DpTreeSet: [Figure: bitslices T1k1, T1k0, T1k-1, T1k-2, …]
Rating of T=stock at doc-date close: 1=sell, 2=hold, 3=buy, 0=non-stock Term.
[Figure: DT SR (Doc×Term StockRating) Cube, e.g., one cell = 3]
DT SR bitslice DpTreeSet: bitslices T2k2 = 1, T2k1 = 1
DT SR bitmap DpTreeSet: bitmaps T2,R=buy = 1;  T2,R=hold = 0;  T2,R=sell = 0
key: 1,1 1,2 … 1,N _ 2,1 2,2 … 2,N _ . . . _ M,1 M,2 … M,N
Closure: An induced Subgraph (ISG), C, of a graph, G, inherits all of G’s edges between its own vertices.
A k-ISG (k vertices), C, is a k-clique iff all of its (k-1)-Sub-ISGs are (k-1)-cliques.
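The closure property can be sketched as a recursive check: a k-vertex ISG is a k-clique iff every (k-1)-sub-ISG is a (k-1)-clique. This is our illustrative sketch on a plain edge-set graph, not the pTree implementation.

```python
# k-clique test via (k-1)-sub-ISG closure, on the 4-vertex example graph.

from itertools import combinations

def is_clique(vertices, edges):
    vs = sorted(vertices)
    if len(vs) <= 1:
        return True
    if len(vs) == 2:
        return frozenset(vs) in edges
    # a k-ISG is a k-clique iff all (k-1)-sub-ISGs are (k-1)-cliques
    return all(is_clique(sub, edges) for sub in combinations(vs, len(vs) - 1))

EDGES = {frozenset(e) for e in [(1, 3), (1, 4), (2, 4), (3, 4)]}
```

In the example graph, {1,3,4} is a 3-clique (edges 13, 14, 34 all present) while {1,2,4} is not (edge 12 is missing).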
Big Graph Mining (bipartite graphs): Gene-gene interactions: N=M=25K, NM=625M. Social nets: N=M=2B, NM=4BB. Recommenders: N=B, M=M, NM=MB.
Assume the graph is bipartite, G=(I,C,E) (unipartite iff C=I), |I|=N, |C|=M, |E|≤MN. 2-level pTrees, stride=N. [Figure: Level-1/Level-0 layout of the E and U pTrees, keys 1,1 … M,N, with count columns #E1…#EM and #U1…#UM]
U=Unique. For bipartite and directed graphs, E=U.
e.g., UM masks the items of cust=M, the friends of person=M, the genes interacting with gene=M.
[Figure: vertices v1, v2, v3, w1, w2 drawn as a unipartite graph G = (V∪W, E⊆V×W), with its adjacency bit-rows]
Bipartite G = ((V,W), E):
      w1 w2
  v1   1  1
  v2   1  0
  v3   1  0
So, are communities in bipartite graphs studied as unipartite?
A tree is bipartite. Cycle graphs with an even # of vertices are bipartite. A planar graph whose faces all have even length is bipartite.
δintC - δextC = 1 - 1/3 = 2/3
[Figure: the 4-vertex labeled graph, vertex labels 1:2, 2:3, 3:2, 4:3]
PEL,1 = 0001_0001_0001_1100
PEL,0 = 0010_0001_1000_0110
EL    = 0012_0003_0001_2310
PE    = 0011_0001_1001_1110
Ekey  = 1,1 1,2 1,3 1,4 _ 2,1 2,2 2,3 2,4 _ 3,1 3,2 3,3 3,4 _ 4,1 4,2 4,3 4,4
L=1: PE = 1111.   L=0: PE,1 = 0011, PE,2 = 0001, PE,3 = 1001, PE,4 = 1110.
E = adjacency matrix. [Figure: the graph as a RoloDex card, vertex labels 1:2, 2:3, 3:2, 4:3]
A community has more edges inside than linked to the outside.
Let subgraph C have nc vertices of a graph G having n vertices.
Internal degree of v∈C: kvint = # of edges from v to vertices in C.
External degree of v∈C: kvext = # of edges from v to vertices in C'.
Internal degree of C: kCint = Σv∈C kvint.
External degree of C: kCext = Σv∈C kvext.
Total degree of C: kC = kCint + kCext.
For C = {1,3,4}: 2 = |PC&PE&Pv1| = kv1int;  2 = |PC&PE&Pv3| = kv3int;  2 = |PC&PE&Pv4| = kv4int;  kCint = 6.
kCext = 1;  kC = 7.
Intra-cluster density: δint(C) = |edges(C,C)| / (nc(nc-1)/2) = |PE&PC&PLT| / (3·2/2) = 3/3 = 1
PLT = 0111_0011_0001_0000;  PLT,1 = 0111, PLT,2 = 0011, PLT,3 = 0001, PLT,4 = 0000
Inter-cluster density: δext(C) = |edges(C,C')| / (nc(n-nc)) = |PE&P'C&PLT| / (3·1) = 1/3
Useful masks:
PC  = 1011_0000_1011_1011
Pv1 = 1111_0000_0000_0000
Pv2 = 0000_1111_0000_0000
Pv3 = 0000_0000_1111_0000
Pv4 = 0000_0000_0000_1111
0 = |P'C&PE&Pv1| = kv1ext;  0 = |P'C&PE&Pv3| = kv3ext;  1 = |P'C&PE&Pv4| = kv4ext
The tradeoff between a large δint(C) and a small δext(C) is the goal of community mining and clustering algorithms. The simple way is to maximize differences, δint(C) - δext(C) = D (or Dk = kCint - kCext), over all clusters (use the Sum of Differences for partitions).
It is easy to compute each SD and SDk with pTrees, even for BigData graphs.
Can one use downward (upward?) closure properties (precisely) to facilitate maximizing differences over all clusters, C?
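The density computations above can be sketched with bitwise ANDs over the 16-bit masks PE, PC and PLT of the 4-vertex example (C = {1,3,4}). This is our illustrative sketch; the helper names are ours.

```python
# Intra/inter-cluster density of C={1,3,4} via pTree-style bit counting.

def bits(s):
    return int(s.replace("_", ""), 2)

PE  = bits("0011_0001_1001_1110")   # symmetric adjacency, keys (i,j), i,j=1..4
PLT = bits("0111_0011_0001_0000")   # triangle mask: count each edge once
PC  = bits("1011_0000_1011_1011")   # cells with both endpoints in C
ALL = (1 << 16) - 1

def ct(x):
    return bin(x).count("1")

n, nc = 4, 3
d_int = ct(PE & PC & PLT) / (nc * (nc - 1) / 2)        # = 3/3 = 1
d_ext = ct(PE & (ALL ^ PC) & PLT) / (nc * (n - nc))    # = 1/3
diff  = d_int - d_ext                                  # = 2/3
```

This reproduces δint(C) = 1 and δext(C) = 1/3, so the difference D = 2/3 computed earlier.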
Graphs are the ubiquitous data structures for complex data in all of science. A table is a graph with no edges; a relationship is a bipartite graph…
Extend to multigraphs (edge sets = vertex triples, quadruples, etc.).
Ignoring subgraphs of 1 or 2 vertices, the other three 3-vertex subgraphs are D={1,2,3}, F={1,2,4}, H={2,3,4}.
Stride=4, two-level pTrees.
Horizontal vertex data:       Vertical vertex data:
  Vkey  VLabel                  VL    = 2323
  1     2                       PVL,1 = 1111
  2     3                       PVL,0 = 0101
  3     2                       PC    = 1011
  4     3
Fixed Pt Column.
Vertex-labelled, edge-labelled graph.
δint(D) = |PE&PD&PLT| / (3·2/2) = 1/3;  δext(D) = |PE&P'D&PLT| / (3·1) = 3/3 = 1;  δintD - δextD = 1/3 - 1 = -2/3;  PD = 1110_1110_1110_0000
δint(F) = |PE&PF&PLT| / (3·2/2) = 2/3;  δext(F) = |PE&P'F&PLT| / (3·1) = 2/3;  δintF - δextF = 2/3 - 2/3 = 0;  PF = 1101_1101_0000_1101
δint(H) = |PE&PH&PLT| / (3·2/2) = 2/3;  δext(H) = |PE&P'H&PLT| / (3·1) = 2/3;  δintH - δextH = 2/3 - 2/3 = 0;  PH = 0000_0111_0111_0111
Maximizing the difference of cluster densities: C is the strongest community (subgraph/cluster). One could use label values (weights) instead of the 0/1 existence values.
E (Edge Table):
  Ekey  V1  V2  ELabel
  1,3   1   3   1
  1,4   1   4   2
  2,4   2   4   3
  3,4   3   4   1