Graph Path: a sequence of edges connecting a sequence of vertices, (usually) distinct from each other except for the endpoints. (The running example is the 4-vertex graph with edges {1,3}, {1,4}, {2,4}, {3,4}.)
EE (the 2-edge path pTree), one 16-bit stride per h (EEh = EEh1 EEh2 EEh3 EEh4):
EE1 = 0000 0000 0001 0110
EE2 = 0000 0000 0000 1010
EE3 = 0001 0000 0000 1100
EE4 = 0010 0000 1000 0000
The 4-bit strides: EE13=0001, EE14=0110, EE24=1010, EE31=0001, EE34=1100, EE41=0010, EE43=1000; all other EEhk are pure0.
For each h, k runs over ListEh (the vertices adjacent to h), and EEhk = Ek & M’h:
For h=1, ListE1={3,4}:
  k=3: EE13 = E3 & M’1 = 1001 & 0111 = 0001
  k=4: EE14 = E4 & M’1 = 1110 & 0111 = 0110
For h=2, ListE2={4}:
  k=4: EE24 = E4 & M’2 = 1110 & 1011 = 1010
For h=3, ListE3={1,4}:
  k=1: EE31 = E1 & M’3 = 0011 & 1101 = 0001
  k=4: EE34 = E4 & M’3 = 1110 & 1101 = 1100
For h=4, ListE4={1,2,3}:
  k=1: EE41 = E1 & M’4 = 0011 & 1110 = 0010
  k=2: EE42 = E2 & M’4 = 0001 & 1110 = 0000 (pure0)
  k=3: EE43 = E3 & M’4 = 1001 & 1110 = 1000
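The eight ANDs above can be reproduced directly with machine-word bitmasks. Below is a minimal Python sketch (4-bit masks, bit 1 = leftmost = vertex 1); the helper names `E`, `Mp`, and `e2` are ours, not from the slides.

```python
# Adjacency rows Ek and complement masks M'h for the 4-vertex example graph.
E = {1: 0b0011, 2: 0b0001, 3: 0b1001, 4: 0b1110}
Mp = {h: 0b1111 ^ (1 << (4 - h)) for h in range(1, 5)}   # M'h: all 1s except bit h

def e2(h, k):
    """EEhk = Ek & M'h: third vertices of 2-edge paths h-k-x with x != h."""
    return E[k] & Mp[h]

assert e2(1, 3) == 0b0001   # EE13
assert e2(1, 4) == 0b0110   # EE14
assert e2(2, 4) == 0b1010   # EE24
assert e2(4, 2) == 0b0000   # EE42 is pure0
```

The pure0 result for EE42 is how path creation prunes itself: a pure0 stride never spawns further extensions.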
3-Level, Stride=4 pTrees for paths of len=2 (2 edges and 3 vertices, unique except for endpoints):
Level=0: EE13=0001, EE14=0110, EE24=1010, EE31=0001, EE34=1100, EE41=0010, EE43=1000
Level=1: just E1, E2, E3, E4 with pure0 bits turned off: E1=0011, E2=0001, E3=1001, E4=1010 (bit 2 of E4 turned off, since EE42 is pure0)
Level=2: 1111
E3 (the 3-edge path pTree) as 64-bit level-0 strides, one per h (key = h,j,k, each in {1..4}; the flattened 256-entry E3key listing is omitted here):
h=1: 0000000000000000000000000000000000000000000011000000000010000000
h=2: 0000000000000000000000000000000000000000000000000010000010000000
h=3: 0000000000000110000000000000000000000000000000000010000000000000
h=4: 0000000000010000000000000000000000010000000000000000000000000000
For h=1, the 16-bit (j) strides of E3: E3-11 = 0000000000000000, E3-12 = 0000000000000000, E3-13 = 0000000000001100, E3-14 = 0000000010000000. The 4-bit (j,k) strides for h=1 are all pure0 except E3-134 = 1100 and E3-143 = 1000.
For k in ListE2hj: E3hjk = Ek & M’j (all other E3hjk are pure0).
h=1, j=4, k=3: E3143 = E3 & M’4 = 1001 & 1110 = 1000
h=1, j=4, k=2: E3142 = E2 & M’4 = 0001 & 1110 = 0000 (pure0)
h=1, j=3, ListE213={4}, k=4: E3134 = E4 & M’3 = 1110 & 1101 = 1100
h=2, j=4, ListE224={1,3}, k=1: E3241 = E1 & M’4 = 0011 & 1110 = 0010
h=2, j=4, k=3: E3243 = E3 & M’4 = 1001 & 1110 = 1000
h=3, j=1, k=4: E3314 = E4 & M’1 = 1110 & 0111 = 0110
h=3, j=4, k=1: E3341 = E1 & M’4 = 0011 & 1110 = 0010
h=3, j=4, k=2: E3342 = E2 & M’4 = 0001 & 1110 = 0000 (pure0)
h=4, j=1, k=3: E3413 = E3 & M’1 = 1001 & 0111 = 0001
h=4, j=3, k=1: E3431 = E1 & M’3 = 0011 & 1101 = 0001
Level=0 (we just computed these): E3-134=1100, E3-143=1000, E3-241=0010, E3-243=1000, E3-314=0110, E3-341=0010, E3-413=0001, E3-431=0001
Level=1: L13-13=0001, L13-14=0010, L13-24=1010, L13-31=0001, L13-34=1000, L13-41=0010, L13-43=1000
Level=2 (these are exactly the Level=1 strides of E2): L23-1=0011, L23-2=0001, L23-3=1001, L23-4=1010 (these are exactly the Level=0’s of E2 with pure0 bits turned off)
Level=3: 1111 (so E2 is the upper 3 levels of E3)
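The upper levels can be built mechanically: each higher-level bit is 1 unless the stride of 4 below it is pure0. A small sketch of that rule (our helper name `level_up`), applied to the E2 level-0 bits of the example:

```python
def level_up(bits):
    """One pTree level up: per stride of 4 bits, emit 0 iff the stride is pure-0."""
    return [0 if all(b == 0 for b in bits[i:i + 4]) else 1
            for i in range(0, len(bits), 4)]

# Level-0 of E2 for the example graph, h-major order (EE1..EE4 flattened):
ee = [int(c) for c in
      "0000000000010110" "0000000000001010" "0001000000001100" "0010000010000000"]
lvl1 = level_up(ee)     # one bit per EEhk stride: E1..E4 with pure0 bits off
lvl2 = level_up(lvl1)   # one bit per Eh
assert lvl2 == [1, 1, 1, 1]
```

`lvl1` comes out as 0011 0001 1001 1010, matching the Level=1 strides above, and `lvl2` is 1111, matching Level=2.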
Graph Path Analytics (using pTrees)
U = 0011_0001_0001_0000 (UniqueEdgeMask)
Two-Level, Stride=4 Edge pTrees: L0: E1=0011, E2=0001, E3=1001, E4=1110; L1E = 1111
Two-Level, Str=4 UniqueEdge pTrees: L0: U1=0011, U2=0001, U3=0001, U4=0000; L1U = 1110 (U4 is pure0)
Useful L0 Masks: M1=1000, M2=0100, M3=0010, M4=0001
(Diagram of the 4-vertex example graph, V1 x V2.)
For k in ListE3hij: E4hijk = Ek & M’j & M’i.
ListE3134={1,2}: h=1, i=3, j=4, k=2: E41342 = E2 & M’3 & M’4 = 0001 & 1101 & 1110 = 0000
ListE3143={1} (the only candidate is the start vertex h, which closes the path)
ListE3241={3}: h=2, i=4, j=1, k=3: E42413 = E3 & M’1 & M’4 = 1001 & 0111 & 1110 = 0000
ListE3243={1}: h=2, i=4, j=3, k=1: E42431 = E1 & M’3 & M’4 = 0011 & 1101 & 1110 = 0000
ListE3314={2,3}: h=3, i=1, j=4, k=2: E43142 = E2 & M’1 & M’4 = 0001 & 0111 & 1110 = 0000
ListE3341={3}, ListE3413={4}, ListE3431={4} (in each, the only candidate is the start vertex h, closing the path)
No 5-vertex (4-edge) paths. Creation stops. The Stride=|V|, Levels=Diam Path Mask is: E, E2, E3, ..., Ediam.
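The whole E2/E3/E4 construction above can be written as three dictionary comprehensions over bitmasks. This is our sketch of the slides’ construction (names `E`, `Mp`, `bits` are ours); the `k != h` guard encodes the stop at closed paths:

```python
n = 4
E = {1: 0b0011, 2: 0b0001, 3: 0b1001, 4: 0b1110}
Mp = {h: ((1 << n) - 1) ^ (1 << (n - h)) for h in range(1, n + 1)}

def bits(mask):
    """Vertices whose bit is set (bit 1 = leftmost = vertex 1)."""
    return [v for v in range(1, n + 1) if mask >> (n - v) & 1]

# E2hk = Ek & M'h for k adjacent to h; E3hjk = Ek & M'j for k listed in E2hj;
# E4hijk = Ek & M'j & M'i for k listed in E3hij (skipping k = h, a closed path).
E2 = {(h, k): E[k] & Mp[h] for h in range(1, n + 1) for k in bits(E[h])}
E3 = {(h, j, k): E[k] & Mp[j] for (h, j) in E2 for k in bits(E2[h, j])}
E4 = {(h, i, j, k): E[k] & Mp[j] & Mp[i]
      for (h, i, j) in E3 for k in bits(E3[h, i, j]) if k != h}

assert E3[(1, 3, 4)] == 0b1100          # matches E3-134 above
assert all(m == 0 for m in E4.values()) # no 5-vertex paths: creation stops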
The 2-vertex paths are the Edges.
Edges (V1, V2): E = 0011_0001_1001_1110 (EdgeMask pTree); Ekey = 1,1 1,2 1,3 1,4 _ 2,1 2,2 2,3 2,4 _ 3,1 3,2 3,3 3,4 _ 4,1 4,2 4,3 4,4
E (Edge Table): Ekey V1 V2 ELabel: (1,3): 1|3, 1; (1,4): 1|4, 2; (2,4): 2|4, 3; (3,4): 3|4, 1
EE = 0000000000010110 0000000000001010 0001000000001100 0010000010000000 (E2key = (v1,v2,v3) over {1,…,4}³)
We use pTrees to find and exhibit 3-vertex (2-edge) paths (EE or E2), 4-vertex (3-edge) paths (E3), etc.
For k in ListEh: E2hk = Ek & M’h (for all other k, E2hk = 0).
PE = 0011_0001_1001_1110; Ekey = 1,1 1,2 1,3 1,4 _ 2,1 2,2 2,3 2,4 _ 3,1 3,2 3,3 3,4 _ 4,1 4,2 4,3 4,4 (E = the adjacency matrix)
PEL,1 = 0001_0001_0001_1100; PEL,0 = 0010_0001_1000_0110; EL = 0012_0003_0001_2310
Internal degree of v∈C, kvint = # of edges from v to vertices in C. External degree of v∈C, kvext = # of edges from v to vertices in C’.
Internal degree of C: kCint = Σv∈C kvint. External degree of C: kCext = Σv∈C kvext. Total degree of C: kC = kCint + kCext.
For C = {1,3,4}: 2 = |PC&PE&Pv1| = kv1int; 2 = |PC&PE&Pv3| = kv3int; 2 = |PC&PE&Pv4| = kv4int; so kCint = 6. Also 0 = |P’C&PE&Pv1| = kv1ext; 0 = |P’C&PE&Pv3| = kv3ext; 1 = |P’C&PE&Pv4| = kv4ext; so kCext = 1 and kC = 7.
Intra-cluster density δint(C) = |edges(C,C)| / (nc(nc−1)/2) = |PE&PC&PLT| / (3·2/2) = 3/3 = 1
Inter-cluster density δext(C) = |edges(C,C’)| / (nc(n−nc)) = |PE&P’C&PLT| / (3·1) = 1/3
δintC − δextC = 1 − 1/3 = 2/3
Useful masks: PLT = 0111_0011_0001_0000; PC = 1011_0000_1011_1011; Pv1 = 1111_0000_0000_0000; Pv2 = 0000_1111_0000_0000; Pv3 = 0000_0000_1111_0000; Pv4 = 0000_0000_0000_1111
The tradeoff between large δint(C) and small δext(C) is the goal of many community mining algorithms. A simple approach is to maximize the difference. Density Difference algorithm for communities: δint(C) − δext(C) > Threshold? Degree Difference algorithm: kCint − kCext > Threshold?
It is easy to compute these measurements with pTrees, even for very big graphs. Graphs are ubiquitous for complex data in all of science.
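The density difference reduces to mask ANDs plus 1-counts. A minimal sketch using the PE and PLT masks from the slide (helper names `pair_mask` and `density_diff` are ours):

```python
# 16-bit vertex-pair masks from the slides, key (v1, v2) over {1..4}^2.
PE  = int("0011000110011110", 2)   # edge existence
PLT = int("0111001100010000", 2)   # strict upper triangle, v1 < v2
N = 4

def pair_mask(members, n=N):
    """PC: mask of vertex pairs whose two ends both lie in the set."""
    vm = sum(1 << (n - v) for v in members)
    m = 0
    for v1 in range(1, n + 1):
        m = (m << n) | (vm if v1 in members else 0)
    return m

def density_diff(members, n=N):
    """delta_int(C) - delta_ext(C), via mask ANDs and 1-counts."""
    nc = len(members)
    pc = pair_mask(members, n)
    d_int = bin(PE & pc & PLT).count("1") / (nc * (nc - 1) / 2)
    d_ext = bin(PE & ~pc & PLT).count("1") / (nc * (n - nc))
    return d_int - d_ext

assert abs(density_diff({1, 3, 4}) - 2 / 3) < 1e-9   # C: 1 - 1/3 = 2/3
```

For very big graphs the same formula applies stride by stride, with popcounts taken per machine word.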
Ignoring subgraphs of 2 vertices, the four 3-vertex subgraphs are: C={1,3,4}, D={1,2,3}, F={1,2,4}, H={2,3,4}.
Horizontal/vertical vertex data: Vkey VLabel: 1:2, 2:3, 3:2, 4:3; VL = 2323; PVL,1 = 1111; PVL,0 = 0101; PC = 1011
δint(D) = |PE&PD&PLT| / (3·2/2) = 1/3; δext(D) = |PE&P’D&PLT| / (3·1) = 3/3 = 1; δintD − δextD = 1/3 − 1 = −2/3; PD = 1110_1110_1110_0000
δint(F) = |PE&PF&PLT| / (3·2/2) = 2/3; δext(F) = |PE&P’F&PLT| / (3·1) = 2/3; δintF − δextF = 2/3 − 2/3 = 0; PF = 1101_1101_0000_1101
δint(H) = |PE&PH&PLT| / (3·2/2) = 2/3; δext(H) = |PE&P’H&PLT| / (3·1) = 2/3; δintH − δextH = 2/3 − 2/3 = 0; PH = 0000_0111_0111_0111
Maximizing the difference of cluster densities: C is the strongest community. One could use label values (weights) instead of the 0/1 existence values.
Vertex-Labelled, Edge-Labelled Graph (diagram)
An Induced SubGraph (ISG) C is a subgraph that inherits all of G’s edges on its own vertices. A k-ISG (k vertices), C, is a k-clique iff all of its (k−1)-Sub-ISGs are (k−1)-cliques.
Community Mining in Big Graphs. E.g., Gene-Gene Interactions: # edges ≈ 10^9; Friend-Friend Social Nets: # edges ≈ 10^18; Cust-Item Recommenders: # edges ≈ 10^15; Stock-price Stock Market Advisor: # edges ≈ 10^13; Person-Tweet HomeLand Security: # edges ≈ 7B × 10K = 10^14.
A community is a subgraph with more edges inside than linked to its outside.
V (vertex table): Vkey VL: 1:2, 2:3, 3:2, 4:3
E (Edge Table): Ekey V1 V2 ELabel: (1,3): 1|3, 1; (1,4): 1|4, 2; (2,4): 2|4, 3; (3,4): 3|4, 1
PEL,1 = 0001_0001_0001_1100; PEL,0 = 0010_0001_1000_0110; EL = 0012_0003_0001_2310
PE = 0011_0001_1001_1110; Ekey = 1,1 1,2 1,3 1,4 _ 2,1 2,2 2,3 2,4 _ 3,1 3,2 3,3 3,4 _ 4,1 4,2 4,3 4,4 (E = the adjacency matrix; as a V1×V2 Rolodex card: 1:2, 2:3, 3:2, 4:3)
PEC = PE&PC = 0011_0000_1001_1010
P1 = 1111_0000_0000_0000; P2 = 0000_1111_0000_0000; P3 = 0000_0000_1111_0000; P4 = 0000_0000_0000_1111
PVL,1 = 1111; PVL,0 = 0101; PC = 1011
A Clique Existence Algorithm is an algorithm that determines whether a given induced subgraph (given by a subset of vertices) is a clique or not.
Edge Count clique existence theorem (EC): C is a clique iff |PUC| = COMB(|VC|,2) = |VC|! / ((|VC|−2)!·2!).
Apply EC to the 4 induced 3-vertex subgraphs (3-clique iff |PUC| = 3!/(2!·1!) = 3):
PUC = PU&PC = 0011_0000_0001_0000, Ct=3
PUD = 0010_0000_0000_0000, Ct=1
PUF = 0001_0001_0000_0000, Ct=2
PUH = 0000_0001_0001_0000, Ct=2
Thus, C is the only 3-clique. We needed to form PC for each subgraph C. Is that expensive?
SubGraph clique existence theorem (SG): (VC,EC) is a k-clique iff every induced (k−1)-subgraph (VD,ED) is a (k−1)-clique.
Which is better? Which will extend more easily to quasi-cliques? Which can be extended to an algorithm that mines out all cliques from a graph?
A Clique Mining algorithm finds all cliques in a graph. For Clique-Mining we can use an ARM-Apriori-like downward closure property:
CSk = k-CliqueSet; CCSk+1 = Candidate (k+1)-CliqueSet. By the SG clique theorem, CCSk+1 = all unions of CSk pairs having k−1 common vertices. Let C ∈ CCSk+1 be a union of two k-cliques with k−1 common vertices, and let v and w be the kth vertices (different) of the two k-cliques; then C ∈ CSk+1 iff PE(v,w)=1. (We just need to check a single bit in PE.)
Form CCSk+1 by union-ing CSk pairs sharing k−1 vertices, then check a single PE bit to determine whether the union is in CSk+1. Below, k=2, so we check edge pairs sharing 1 vertex, then check the 1 new edge bit in PE.
CS2 = E = {13, 14, 24, 34}. Checking pairs that share a vertex:
PE(3,4) = PE(4·[3−1]+4 = 12) = 1, so 134 ∈ CS3 (the other pairs whose union is {1,3,4}: already have 134)
PE(1,2) = PE(4·[1−1]+2 = 2) = 0
PE(2,3) = PE(4·[2−1]+3 = 7) = 0
The only expensive part of this is forming CCSk, and that is expensive only for CCS3 (as in Apriori ARM).
Next? List out CS3 = {134}; form CCS4 = ∅. Done.
G = Vertex-Labelled, Edge-Labelled Graph (C = Induced SubGraph with VC={1,3,4}); VC={1,3,4}, VD={1,2,3}, VF={1,2,4}, VH={2,3,4}; PU = 0011_0001_0001_0000
Clique Analytics for Big Graphs. A clique is a community in which there is an edge between each vertex pair.
(7-vertex example; key = (v1,v2) over {1..7}², bit offsets 1..49)
E  = 0111010101100011010001110000000001010001010000110
EU = 0111010001100000010000000000000001000000010000000
Using the EdgeCount theorem on C={1,2,3,4}: CU = C&EU. C is a clique since ct(CU) = COMB(4,2) = 4!/(2!·2!) = 6.
C  = 1111000111100011110001111000000000000000000000000
CU = 0111000001100000010000000000000000000000000000000 (count 6)
Using the SubGraph clique theorem to find all k-cliques. This graph, G, is less trivial ;-)
k=2: CS2 = E = {12 13 14 16 23 24 34 56 67}. Turn PU into a positions list = {2 3 4 6 10 11 18 34 42}; find the endpoints of each of these edges by (Int((n−1)/7)+1, Mod(n−1,7)+1).
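That endpoint formula decodes a 1-bit position in the row-major 7×7 pair key. A quick sketch (helper name `endpoints` is ours):

```python
# Decode a 1-based bit position n in a row-major nv x nv pair key into the
# edge endpoints, as on the slide: (Int((n-1)/nv)+1, Mod(n-1, nv)+1).
def endpoints(n, nv=7):
    return (n - 1) // nv + 1, (n - 1) % nv + 1

positions = [2, 3, 4, 6, 10, 11, 18, 34, 42]
edges = [endpoints(n) for n in positions]
assert edges[0] == (1, 2) and edges[-1] == (6, 7)
```

So positions 34 and 42 decode to the edges (5,6) and (6,7).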
k=3: of the candidate triples, the 3-cliques are 123 124 134 234.
k=4: 1234 (since the three 3-subgraphs 123, 124, 234 are all 3-cliques): 123 and 134 give 1234; 123 and 234 give 1234; 124 and 134 give 1234; 124 and 234 give 1234; 134 and 234 give 1234.
Therefore 1234 is a 4-clique, and the only 4-clique.
So there are 5 cliques: 123 124 134 234 1234, i.e., four 3-cliques and one 4-clique.
Clique Mining using the SubGraph Algorithm
(The same 7-vertex graph, now with edge 57 added; key = (v1,v2) over {1..7}²)
PE = 0111010101100011010001110000000001110001010000110
More Clique Mining using the SubGraph thm (SG): in this example graph there are five 3-cliques and one 4-clique. Let’s see if SG can find them (and how efficiently).
k=2: CS2 = E = {12 13 14 16 23 24 34 56 57 67}.
Pairs that share 1: PE(2,3)=1, so 123 ∈ CS3; PE(2,4)=1, so 124 ∈ CS3; PE(2,6)=0
Pairs that share 2: already have 123 and 124
Pairs that share 3: PE(1,4)=1, so 134 ∈ CS3
Pairs that share 4: have 124 and 134; PE(2,3)=1, so 234 ∈ CS3
Pairs that share 5: PE(6,7)=1, so 567 ∈ CS3
Pairs that share 6: PE(1,5)=0; PE(1,7)=0; already have 567
Pairs that share 7: have 567
k=3: CS3 = {123 124 134 234 567}.
k=4, checking pairs of triples that share 2 vertices: PE(2,4)=1, so 1234 ∈ CS4; the remaining qualifying pairs (triples sharing 1,4; 2,3; 2,4; 3,4) all yield 1234 again (have 1234).
The slowest part of this algorithm is the generation of CCS, the Candidate Clique Set.
Clearly, evaluating a given candidate as to whether it is actually a clique involves just a one-bit lookup in the existing “Edge Existence” pTree mask, PE, which is effectively instantaneous.
The generation of CCS is entirely identical here to the generation of Candidate Large Itemsets in Apriori ARM, and thus there should be plenty of algorithms around for doing it quickly by this time.
The other algorithm, EdgeCount (EC), requires counting 1’s in the mask pTree of each subgraph (or candidate clique, if we want to take the time to generate the CCSs; but then clearly the fastest way to finish up is simply to look up the single bit position in E, i.e., use EC).
EdgeCount Algorithm (EC): if |PUC| = (k+1)!/((k−1)!·2!), then C ∈ CS.
I suppose, if one could come up with a fast way to create mask pTrees for each subgraph (and use Bryan’s pop-count procedure to compute the 1-count as the mask is being created), then this might be a competitive method.
The SG algorithm seems to be a real winner, since all we need is the Edge Mask pTree, E, and a fast way to find those pairs of subgraphs in CSk that share k−1 vertices (then check E to see whether the two different kth vertices form an edge in G). Again, this is a standard part of the Apriori ARM algorithm and has therefore been optimized and engineered ad infinitum!
(Adding 1 vertex, V8, and 4 edges: (1,8) (2,8) (3,8) (4,8); key = (v1,v2) over {1..8}²)
E = 0111010110110001110100011110000100000110100010100000110011110000
k=2: CS2 = E = {12 13 14 16 23 24 34 56 57 67 18 28 38 48} = edges.
k=3 (pairs of edges sharing a vertex, then one PE bit each): PE(2,3)=1 gives 123; PE(2,4)=1 gives 124; PE(1,4)=1 gives 134; PE(2,3)=1 gives 234; PE(6,7)=1 gives 567; PE(2,8)=1 gives 128; PE(3,8)=1 gives 138 and 238; PE(4,8)=1 gives 148, 248 and 348; PE(2,6)=0, PE(1,5)=0, PE(1,7)=0, PE(6,8)=0; duplicate unions are skipped (have …).
CS3 = {123 124 134 234 567 128 138 148 238 248 348}.
k=4 (pairs of triples sharing 2 vertices): PE(2,4)=1 gives 1234; PE(3,8)=1 gives 1238; PE(4,8)=1 gives 1248, 1348 and 2348; duplicate unions skipped.
CS4 = {1234 1238 1248 1348 2348}.
k=5: PE(4,8)=1 gives 12348; CS5 = {12348}; all further unions are duplicates (have 12348).
So there are 11 3-cliques, 5 4-cliques, and 1 5-clique.
Note there are many pTree and other data structures we can employ to aid in performing the CCS creation, as well as other “path”-based needs. These include the following (but there may be others):
1. A 2-level, stride=|V| pTree for E.
2. An E×E relationship matrix showing (using a 1-bit) which edge pairs form a 2-path; then an E×E×E matrix showing which edge triples form a 3-path, etc.
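As a cross-check of those totals (11, 5, 1), a brute-force enumeration over the 8-vertex graph agrees. This is our verification code, not the slides’ pTree method:

```python
from itertools import combinations

# 8-vertex example: the 7-vertex graph plus V8 with edges (1,8) (2,8) (3,8) (4,8).
edges = {(1, 2), (1, 3), (1, 4), (1, 6), (2, 3), (2, 4), (3, 4),
         (5, 6), (5, 7), (6, 7), (1, 8), (2, 8), (3, 8), (4, 8)}

def is_clique(c):
    """Every vertex pair of c must be an edge."""
    return all((min(v, w), max(v, w)) in edges for v, w in combinations(c, 2))

def cliques_of_size(k):
    return [c for c in combinations(range(1, 9), k) if is_clique(c)]

assert len(cliques_of_size(3)) == 11
assert len(cliques_of_size(4)) == 5
assert cliques_of_size(5) == [(1, 2, 3, 4, 8)]
```

The counts follow from {1,2,3,4,8} being a K5: C(5,3)=10 triangles plus 567, C(5,4)=5 4-cliques, and the one 5-clique.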
Mining for Communities with more relaxed definitions than cliques (taken from Fortunato’s survey)
There are many cohesiveness definitions other than a clique. Another criterion for subgraph cohesion relies on the adjacency of its vertices: a vertex must be adjacent to some minimum number of other vertices in the subgraph. In the literature on social network analysis there are two complementary ways of expressing this. A k-plex is a maximal subgraph in which each vertex is adjacent to all other vertices of the subgraph except at most k of them. A k-core is a maximal subgraph in which each vertex is adjacent to at least k other vertices of the subgraph. In any graph there is a whole hierarchy of cores of different order. A k-core is essentially the same as a p-quasi complete subgraph, i.e., a subgraph such that the degree of each vertex is larger than p(k−1), where p is a real number in [0,1] and k is the order of the subgraph.
As cohesive as a subgraph can be, it would hardly be a community if there were also strong cohesion between the subgraph and the rest of the graph. Therefore, it is important to compare the internal and external cohesion of a subgraph. In fact, this is what is usually done in the most recent definitions of community. The first recipe, however, is not recent and stems from social network analysis.
An LS-set is a subgraph such that the internal degree of each vertex is greater than its external degree. This condition is quite strict and can be relaxed into the so-called weak definition of community, for which it suffices that the internal degree of the subgraph exceeds its external degree.
A community is strong if the internal degree of any vertex exceeds the number of edges that the vertex shares with any other community. A community is weak if its total internal degree exceeds the number of edges shared by the community with the other communities.
Another definition focuses on the robustness of clusters to edge removal and uses the concept of edge connectivity. The edge connectivity of a pair of vertices is the minimal number of edges that need to be removed in order to disconnect them (no path between them).
A lambda-set is a subgraph such that any pair of vertices of the subgraph has a larger edge connectivity than any pair formed by one vertex of the subgraph and one outside it. However, vertices of a lambda-set need not be adjacent and may be quite distant from each other.
Communities can also be identified by a fitness measure, expressing to what extent a subgraph satisfies a given property related to its cohesion. The larger the fitness, the more definite the community. This is the same principle behind quality functions, which give an estimate of the goodness of a graph partition. The simplest fitness measure for a cluster is its intra-cluster density δint(C) (see slide 1). One could say subgraph C with k vertices is a cluster if δint(C) > threshold. Finding such subgraphs is NP-complete, as it coincides with the NP-complete Clique Problem when the threshold is 1. It is better to fix the size of the subgraph because, without this condition, any clique would be one of the best possible communities, including trivial two-cliques (simple edges). Variants of this problem focus on the number of internal edges of the subgraph.
Another measure is the relative density of a subgraph C, defined as the ratio between the internal and the total degree of C (see slide 1). Finding subgraphs of a given size with relative density larger than a threshold is NP-complete.
Fitness measures can also be associated with the connectivity of the subgraph to the other vertices of the graph. A good community is expected to have a small cut size, i.e., a small # of edges joining it to the rest of the graph.
Degree Calculations using pTrees (the 8-vertex example; key = (v1,v2) over {1..8}²)
E as an adjacency matrix (row v = bits for vertices 1..8):
1: 01110101
2: 10110001
3: 11010001
4: 11100001
5: 00000110
6: 10001010
7: 00001100
8: 11110000
U (unique-edge matrix; each edge kept once, lower-triangle form):
1: 00000000
2: 10000000
3: 11000000
4: 11100000
5: 00000000
6: 10001000
7: 00001100
8: 11110000
Row-major flattenings: Er = 0111010110110001110100011110000100000110100010100000110011110000; Ur = 0000000010000000110000001110000000000000100010000000110011110000
Column-major forms: Ec = Er (E is symmetric); Uc = 0111010100110001000100010000000100000110000000100000000000000000, with row strides 1: 01110101, 2: 00110001, 3: 00010001, 4: 00000001, 5: 00000110, 6: 00000010, 7: 00000000, 8: 00000000, and level-1 stride 11111100.
V1 = 1111111100000000…0 (bits 1..8 on): V1 & Er picks out Er’s first row, so we don’t need to precompute the 2-level pTrees, but precomputing saves 1 AND each time.
Deg(Vk,C) = |PC & PVk| = |PCrk|
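The Deg(Vk,C) formula is one AND and one popcount per vertex. A sketch on the 8-vertex rows above (helper name `deg_in` is ours):

```python
# Row pTrees of the 8-vertex adjacency matrix (bit 1 = leftmost = vertex 1).
rows = {1: 0b01110101, 2: 0b10110001, 3: 0b11010001, 4: 0b11100001,
        5: 0b00000110, 6: 0b10001010, 7: 0b00001100, 8: 0b11110000}

def deg_in(v, C, n=8):
    """Deg(v, C) = |P_C & row_v|: 1-count of v's row ANDed with C's vertex mask."""
    pc = sum(1 << (n - u) for u in C)
    return bin(rows[v] & pc).count("1")

assert deg_in(1, {1, 2, 3, 4, 5, 6, 7, 8}) == 5   # full degree of vertex 1
assert deg_in(8, {1, 2, 3, 4}) == 4               # V8's degree inside {1,2,3,4}
```

The same AND-plus-popcount shape is what makes these degree measurements cheap even for very big graphs, one machine word per stride.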
k-plexes are subgraphs s.t. each vertex is adjacent to all other vertices of the subgraph except at most k of them.
k-plex existence algorithm: C is a k-plex iff ∀v∈VC, |PUC| ≥ COMB(|VC|,2) − k
k-plex inheritance theorem: every induced subgraph of a k-plex is a k-plex.
Proof: Let C be an induced subgraph of G. A vertex of C cannot be missing more adjacent C-edges in C than it is missing as a vertex in G, because every edge missing in C is also missing in G (if an edge (v,w) is missing in the induced graph C, then since v and w are vertices in G, that edge (v,w) cannot be in EG, lest it would have been induced into C).
Edge Count k-plex existence theorem: C is a k-plex iff |PUC| ≥ (|VC|!/((|VC|−2)!·2!)) − k
Mining all maximal k-plexes: start with G by checking |PUG|. If G is a k-plex, so are all of its induced subgraphs (inheritance theorem); done. Else check |PUC| for each induced subgraph C s.t. |VC| = |VG|−1. For each such C that is not a k-plex, check |PUD| for each induced subgraph D of C s.t. |VD| = |VC|−1.
Continue this until all induced subgraphs that are maximal k-plexes have been identified.
A k-core is a subgraph in which each vertex is adjacent to at least k other vertices of the subgraph. There is a hierarchy of cores of different order.
k-core inheritance theorem: if there is a cover of G by induced k-cores, then G is a k-core.
Edge Count k-core existence theorem: C is a k-core iff |PUC| ≥ k
Mining k-cores: if C is a k-core and D is a supergraph s.t. VD − VC = {w1,…,wW}, then D is a k-core iff degD(wh) ≥ k for h=1..W.
Note degD(w) = |PDU & PW| = |PD0n| where w is the nth vertex.
So if one computes all |PD0k|, then one can build the hierarchy of k-cores in D by examining the set of vertices where this degree is k=max.
Any k-core would have to be a subset of that set. Then go to k=max−1, etc.
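One standard way to materialize the k-core hierarchy is iterative peeling: repeatedly drop vertices whose degree among the survivors is below k. This is the textbook peeling definition rather than the slides’ pTree construction, but it uses the same row masks and popcounts (helper name `k_core` is ours):

```python
def k_core(rows, k, n=8):
    """Maximal k-core by peeling: drop vertices with < k neighbours
    among the surviving vertices until nothing changes."""
    alive = set(range(1, n + 1))
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            mask = sum(1 << (n - u) for u in alive - {v})
            if bin(rows[v] & mask).count("1") < k:
                alive.remove(v)
                changed = True
    return alive

rows = {1: 0b01110101, 2: 0b10110001, 3: 0b11010001, 4: 0b11100001,
        5: 0b00000110, 6: 0b10001010, 7: 0b00001100, 8: 0b11110000}
assert k_core(rows, 3) == {1, 2, 3, 4, 8}   # the K5 minus vertex degrees < 3
assert k_core(rows, 2) == set(range(1, 9))  # the whole graph is a 2-core
```

Peeling at k = max, max−1, … down to 1 yields the whole hierarchy of cores, matching the note above that any k-core must sit inside the surviving vertex set.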
Springer, May 2015, Charu C. Aggarwal: a comprehensive textbook on data mining (see our secret site). The emergence of data science as a discipline requires the development of a book that goes beyond the focus of books on fundamental data mining problems. More emphasis needs to be placed on the advanced data types such as text, time series, discrete sequences, spatial data, graph data, and social networks. This comprehensive book explores the different aspects of data mining, from the fundamentals to the complex data types and their applications. Therefore, this book may be used for both introductory and advanced data mining courses.
The chapters fall into one of three categories: 1. Chapters on the four main data mining problems: clustering, classification, association pattern mining, and outlier analysis. 2. Domain chapters discuss the specific methods used for different domains of data such as text data, time-series data, sequence data, graph data, and spatial data. 3. Application chapters study applications: stream mining, Web mining, ranking, recommendations, social networks, and privacy preservation.
About the Author: Charu Aggarwal is a Distinguished Research Staff Member (DRSM) at the IBM T. J. Watson Research Center in Yorktown Heights, New York. He completed his B.S. from IIT Kanpur in 1993 and his Ph.D. from Massachusetts Institute of Technology in 1996. He has worked extensively in the field of data mining, with particular interests in data streams, privacy, uncertain data and social network analysis. He has published 14 (3 authored and 11 edited) books, over 250 papers in refereed venues, and has applied for or been granted over 80 patents. His h-index is 70. Because of the commercial value of the above-mentioned patents, he has received several invention achievement awards and has thrice been designated a Master Inventor at IBM. He is a recipient of an IBM Corporate Award (2003) for his work on bio-terrorist threat detection in data streams, a recipient of the IBM Outstanding Innovation Award (2008) for his scientific contributions to privacy technology, and a recipient of an IBM Research Division Award (2008) for his scientific contributions to data stream research. He has received two best paper awards and an EDBT Test-of-Time Award (2014). He has served as the general or program co-chair of the IEEE Big Data Conference (2014), the ICDM Conference (2015), the ACM CIKM Conference (2015), and the KDD Conference (2016). He also co-chaired the data mining track at the WWW Conference 2009. He served as an associate editor of the IEEE Transactions on Knowledge and Data Engineering from 2004 to 2008. He is an associate editor of the ACM Transactions on Knowledge Discovery and Data Mining Journal , an action editor of the Data Mining and Knowledge Discovery Journal , an associate editor of the IEEE Transactions on Big Data, and an associate editor of the Knowledge and Information Systems Journal. He is editor-in-chief of the ACM SIGKDD Explorations. 
He is a fellow of the SIAM (2015), ACM (2013) and the IEEE (2010) for "contributions to knowledge discovery and data mining techniques."
Mohammad Zaki’s Data Mining book (See our secret site)
Bipartite Communities, Matthew P. Yancey, April 15, 2015
A recent trend in data mining is finding communities in a graph. A community is a vertex set s.t. the # of edges inside it is greater than expected.
(Cliques in social networks, families of proteins in protein-protein interaction networks, constructing groups of similar products in recommendation systems…)
An up-to-the-moment survey on community detection: S. Fortunato, “Community Detection in Graphs,” arXiv 0906.0612v2.
In graph clustering we look for a quantitative definition of community. No definition is universally accepted. Intuitively, a community has more edges “inside” than linked to the outside. Communities are often algorithmically defined (the final product of an algorithm, without a precise a priori definition).
Let subgraph C have nc vertices and G have n vertices. The internal [external] degree of v∈C, kvint [kvext], is the # of edges connecting v to other vertices of C [to the rest of the graph].
If kvext=0, the vertex has neighbors only in C. If kvint=0, instead, the vertex is disjoint from C and would be better assigned to a different cluster.
The internal degree kintC of C is the sum of the internal vertex degrees; the external degree kextC of C is the sum of the external vertex degrees; the total degree kC is the sum of the degrees of the vertices of C.
Intra-cluster density δint(C) = # internal C-edges / # possible internal edges = #int_edges_C / (nc(nc−1)/2). Inter-cluster density δext(C) = # inter-cluster edges of C / (nc(n−nc)).
Finding the best tradeoff between large δint(C) and small δext(C) is, implicitly or explicitly, the goal of most clustering algorithms.
A hop is a relationship, R, hopping from entity E to entity F. Strong Rule Mining finds all frequent, confident rules. SRMs are categorized by the number of hops, k, by whether they are transitive or non-transitive, and by the focus entity. ARM is 1-hop, non-transitive (A,C⊆E), F-focused SRM (1nF):
ct(&e∈A Re & PC) / ct(&e∈A Re) ≥ mncf; ct(&e∈A Re) ≥ mnsp
Consequent upward closure: if A→C is non-confident, then so is A→D for all subsets D of C. So for a frequent antecedent A, use upward closure to mine for all of its confident consequents.
Antecedent downward closure: if A is frequent, all of its subsets are frequent (if A is infrequent, its supersets are infrequent). Since frequency involves only A, we can mine for all qualifying antecedents using downward closure.
For transitive (a+c)-hop Apriori strong rule mining with a focus entity which is a hops from the antecedent and c hops from the consequent: if a (resp. c) is odd/even, then one can use downward/upward closure on that step in the mining of strong (frequent and confident) rules.
In this case A is 1 hop from F (odd: use downward closure) and C is 0 hops from F (even: use upward closure). We will be checking more examples to see whether the odd→downward, even→upward theorem seems to hold.
1-hop, transitive (A⊆E, C⊆F), F-focused SRM (1tF).
1-hop, transitive, E-focused rule A→C SRM (1tE): ct(PA & &f∈C Rf) / ct(PA) ≥ mncf; |A| = ct(PA) ≥ mnsp
Antecedent upward closure: if A is infrequent, then so are all of its subsets.
Consequent downward closure: if A→C is non-confident, then so is A→D for all supersets D of C.
In this case A is 0 hops from E (even: use upward closure) and C is 1 hop from E (odd: use downward closure).
2-hop transitive F-focused: A→C strong if ct(&e∈A Re & &g∈C Sg) / ct(&e∈A Re) ≥ mncf and ct(&e∈A Re) ≥ mnsp
(Example R(E,F) and S(F,G) bit matrices, with A⊆E and C⊆G, are shown on the slide.)
Apriori for 2-hops: find all frequent antecedents A using downward closure. Find C1G, the set of g’s s.t. A→{g} is confident. Find C2G, the set of C1G pairs that are confident consequents for antecedent A. Find C3G, the set of triples (from C2G) s.t. all subpairs are in C2G (à la Apriori), etc.
1,1 odd, so down,down is correct.
2-hop transitive G-focused: ct((&f∈List(&e∈A Re) Sf) & PC) / ct(&f∈List(&e∈A Re) Sf) ≥ mncf; ct(&f∈List(&e∈A Re) Sf) ≥ mnsp
1. Antecedent upward closure: if A is infrequent, then so are all of its subsets.
2. Consequent upward closure: if A→C is non-confident, so is A→D for all subsets D.
2,0 even, so up,up is correct.
2-hop transitive E-focused: ct(PA & &f∈List(&g∈C Sg) Rf) / ct(PA) ≥ mncf; ct(PA) ≥ mnsp
Antecedent upward closure: if A is infrequent, so are all of its subsets.
Consequent upward closure: if A→C is non-confident, so is A→D for all subsets D.
0,2 even, so up,up is correct.
APPENDIX: A→C is confident if a high fraction of the f∈F which are related to every a∈A are also related to every c∈C. F is the Focus Entity, and the “high fraction” is the MinimumConfidence ratio.
R(E,F), E = {1,2,3,4}, F = {2,3,4,5}, rows by e:
1: 1011
2: 1011
3: 1101
4: 1111
SuppSetA (the set of F’s related to every element of A) = {2,3,5}; SuppSetC = {2,4,5}; ConfA→C = |SuppSetA∩SuppSetC| / |SuppSetA| = ct(&e∈A∪C Pe) / ct(&e∈A Pe) = 2/3
Question: Why isn’t ConfA→C = SuppC / SuppA?
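The confidence ratio is just two AND-chains and two popcounts. A sketch on the appendix’s R(E,F) (helper name `conf` is ours; A={3}, C={1} reproduce SuppSetA={2,3,5} and SuppSetC={2,4,5}):

```python
# R(E,F) from the appendix: rows are e in E={1..4}; columns f in F={2,3,4,5},
# leftmost bit = f=2.
R = {1: 0b1011, 2: 0b1011, 3: 0b1101, 4: 0b1111}
FULL = 0b1111

def conf(A, C):
    """Conf(A => C) = ct(AND of rows in A and C) / ct(AND of rows in A)."""
    supp_a = FULL
    for e in A:
        supp_a &= R[e]
    supp_ac = supp_a
    for c in C:
        supp_ac &= R[c]
    return bin(supp_ac).count("1") / bin(supp_a).count("1")

assert abs(conf({3}, {1}) - 2 / 3) < 1e-9   # the slide's 2/3
```

This also answers the closing question: the denominator is the support set of A alone, intersected with C’s support set in the numerator, not a ratio of two independent supports.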
2-level, stride=|V|=8 pTrees for E, to aid in performing the CCS creation steps in Alg B (key = (v1,v2) over {1..8}²):
E = 0111010110110001110100011110000100000110100010100000110011110000
CS2 = {12 13 14 16 18 23 24 28 34 38 48 56 57 67}
E-L1 = 1111 1111
E-L0 (row strides): 1: 0111 0101; 2: 1011 0001; 3: 1101 0001; 4: 1110 0001; 5: 0000 0110; 6: 1000 1010; 7: 0000 1100; 8: 1111 0000
U = 0111010100110001000100010000000100000110000000100000000000000000
U-L1 = 1111 1100
U-L0 (row strides): 1: 0111 0101; 2: 0011 0001; 3: 0001 0001; 4: 0000 0001; 5: 0000 0110; 6: 0000 0010; 7: 0000 0000; 8: 0000 0000
123 124 126 128 134 136 138 146 148 168 ∈ CCS3 (3-sets formed from pairs of 2-sets that share V1). Etc. The 2-level pTrees don’t seem to aid CCS creation.
E2key1,1,11,1,21,1,31,1,41,2,11,2,21,2,31,2,41,3,11,3,21,3,31,3,41,4,11,4,21,4,31,4,42,1,12,1,22,1,32,1,42,2,12,2,22,2,32,2,42,3,12,3,22,3,32,3,42,4,12,4,22,4,32,4,43,1,13,1,23,1,33,1,43,2,13,2,23,2,33,2,43,3,13,3,23,3,33,3,43,4,13,4,23,4,33,4,44,1,14,1,24,1,34,1,44,2,14,2,24,2,34,2,44,3,14,3,24,3,34,3,44,4,14,4,24,4,34,4,4
pTree path-based analytics? Pre-construct length=2 path pTrees (E2), length=3 (E3), etc.
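The pre-construction of the length-2 path pTree can be sketched directly from the edge rows of the running 4-vertex example (edges 13, 14, 24, 34). This is our illustrative sketch, not the source's implementation; bit (v1,v2,v3) of PE2 is 1 iff v1-v2-v3 is a length-2 path (two edges, so v1 ≠ v3).

```python
# Build PE2, the 64-bit length-2 path pTree keyed by (v1,v2,v3),
# from per-vertex edge bitmaps (bit for vertex 1 is the leftmost of 4).

E = {
    1: 0b0011,  # neighbors of 1: {3, 4}
    2: 0b0001,  # neighbors of 2: {4}
    3: 0b1001,  # neighbors of 3: {1, 4}
    4: 0b1110,  # neighbors of 4: {1, 2, 3}
}

def has_edge(a, b):
    return (E[a] >> (4 - b)) & 1  # bit for vertex b in row a

def build_PE2():
    bits = []
    for v1 in range(1, 5):
        for v2 in range(1, 5):
            for v3 in range(1, 5):
                bits.append(1 if has_edge(v1, v2) and has_edge(v2, v3)
                                 and v1 != v3 else 0)
    return "".join(map(str, bits))
```

Running `build_PE2()` reproduces the 64-bit PE2 string shown below for this graph.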
PE3 = 0000000000000000000000000000000000000000000001000000000000000000
      0000000000000000000000000000000000000000000000000000000000000000
      0000000000000000000000000000000000000000000000000000000000000000
      0000000000000000000000000000000000000000000000000000000000000000
E3key: 1,1,1,1 1,1,1,2 … 1,4,4,4  2,1,1,1 … 2,4,4,4  3,1,1,1 … 3,4,4,4  4,1,1,1 … 4,4,4,4 (all 256 quadruples over {1,2,3,4}; each 64-bit row of PE3 above corresponds to one leading digit)
PE2 = 0000000000010110000000000000001000010000000000000000000000000000
E (Edge Table):
  Ekey  V1  V2  ELabel
  1,3   1   3   1
  1,4   1   4   2
  2,4   2   4   3
  3,4   3   4   1
[Figure: the 4-vertex labeled graph, vertex labels 1:2, 2:3, 3:2, 4:3]
PE3 = 0000000000000000000000000000000000000000000011000000000000000000
      0000000000000000000000000000000000000000000000000010000010000000
      0000000000000100000000000000000000000000000000000000000000000000
      0000000000000000000000000000000000000000000000000000000000000000
E3key: 1,1,1,1 … 4,4,4,4 (256 quadruples; each 64-bit row corresponds to one leading digit)
PE2 = 0000000000010110000000000000101000010000000011000010000010000000
[Figure: 4-vertex graph, edges 13, 14, 24, 34]
PE   = 0011_0001_1001_1110
Ekey = 1,1 1,2 1,3 1,4 _ 2,1 2,2 2,3 2,4 _ 3,1 3,2 3,3 3,4 _ 4,1 4,2 4,3 4,4
PU   = 0011_0001_0001_0000
PU2  = 0000000000010110000000000000001000010000000000000000000000000000
A path in a graph is a finite or infinite sequence of edges which connect a sequence of vertices which, by most definitions, are all distinct from one another except possibly the endpoints.
[Figure: the E2 construction shown as stacked 4×4 bit-matrix slices over vertices V1–V4, built from the base adjacency matrix E = 0011_0001_1001_1110]
EE = 0000000000010110000000000000101000010000000011000010000010000000
[Figure: 4-vertex graph, edges 13, 14, 24, 34]
E = 0011_0001_1001_1110
Ekey = 1,1 1,2 1,3 1,4 _ 2,1 2,2 2,3 2,4 _ 3,1 3,2 3,3 3,4 _ 4,1 4,2 4,3 4,4
A path is a sequence of edges connecting a sequence of vertices which are (usually) all distinct from one another except the endpoints.
[Figure: repeat of the stacked 4×4 bit-matrix slices over (V1,V2,V3,V4); base adjacency 0011_0001_1001_1110]
EEkey: 1,1,1 … 4,4,4 (64 triples)
E1 = 0011   E2 = 0001   E3 = 1001   E4 = 1110
U1 = 0011   U2 = 0001   U3 = 0001   U4 = 0000
U  = 0011_0001_0001_0000
EE1 = 0000000000010110   EE2 = 0000000000001010
EE3 = 0001000000001100   EE4 = 0010000010000000
EE11=0000  EE12=0000  EE13=0001  EE14=0110
EE21=0000  EE22=0000  EE23=0000  EE24=1010
EE31=0001  EE32=0000  EE33=0000  EE34=1100
EE41=0010  EE42=0000  EE43=1000  EE44=0000
M1 = 1000   M2 = 0100   M3 = 0010   M4 = 0001
For k∈ListEh: EEhk = Ek & M'h.  For other k: EEhk = 0 (pure-zero).
For h=1, ListE1 = {3,4}:
  h=1, k=3:  EE13 = E3 & M'1 = 1001 & 0111 = 0001
  h=1, k=4:  EE14 = E4 & M'1 = 1110 & 0111 = 0110
For h=2, ListE2 = {4}:
  h=2, k=4:  EE24 = E4 & M'2 = 1110 & 1011 = 1010
For h=3, ListE3 = {1,4}:
  h=3, k=1:  EE31 = E1 & M'3 = 0011 & 1101 = 0001
  h=3, k=4:  EE34 = E4 & M'3 = 1110 & 1101 = 1100
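The masking rule EEhk = Ek & M'h can be sketched with int bitmasks. This is our illustrative sketch, assuming the 4-bit edge rows of the running example; the function names are ours.

```python
# EEhk = Ek & M'h: for head vertex h, mask vertex h out of each neighbor
# row Ek (k a neighbor of h), so a length-2 path h-k-j never returns to h.

E = {1: 0b0011, 2: 0b0001, 3: 0b1001, 4: 0b1110}   # adjacency rows
ALL = 0b1111

def M(h):
    """Mh: the one-bit mask selecting vertex h (vertex 1 = leftmost bit)."""
    return 1 << (4 - h)

def EE(h, k):
    """EEhk = Ek & M'h when k is in ListEh (h-k is an edge), else pure-zero."""
    if E[h] & M(k):
        return E[k] & (ALL ^ M(h))   # M'h = complement of Mh
    return 0
```

For example, EE(4,2) comes out pure-zero, matching the "EE42 = 0000, pure0" case on the slide.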
[Figure: repeated 4×4 bit-matrix slices of the path cube over (V1,V2,V3,V4); base matrix 0011_0001_1001_1110]
key: 1,1 1,2 … 8,8 (64 pairs)
E = 0111010110110001110100011110000100000110100010100000110011110000
[Figure: the 8-vertex graph]
2-level, stride=|V|=8, pTrees for path analytics?
E1key 1234 5678:   E: 1111 1111
E0key 1234 5678:
  1: 0111 0101
  2: 1011 0001
  3: 1101 0001
  4: 1110 0001
  5: 0000 0110
  6: 1000 1010
  7: 0000 1100
  8: 1111 0000
U = 0111010100110001000100010000000100000110000000100000000000000000
U1key 1234 5678:   U: 1111 1100
U0key 1234 5678:
  1: 0111 0101
  2: 0011 0001
  3: 0001 0001
  4: 0000 0001
  5: 0000 0110
  6: 0000 0010
E:     1 2 3 4 5 6 7 8
  1    0 1 1 1 0 1 0 1
  2    1 0 1 1 0 0 0 1
  3    1 1 0 1 0 0 0 1
  4    1 1 1 0 0 0 0 1
  5    0 0 0 0 0 1 1 0
  6    1 0 0 0 1 0 1 0
  7    0 0 0 0 1 1 0 0
  8    1 1 1 1 0 0 0 0
U:     1 2 3 4 5 6 7 8
  1    0 0 0 0 0 0 0 0
  2    1 0 0 0 0 0 0 0
  3    1 1 0 0 0 0 0 0
  4    1 1 1 0 0 0 0 0
  5    0 0 0 0 0 0 0 0
  6    1 0 0 0 1 0 0 0
  7    0 0 0 0 1 1 0 0
  8    1 1 1 1 0 0 0 0
E1key 1234 5678:   E: 1111 1111
E0key 1234 5678:
  1: 0111 0101
  2: 1011 0001
  3: 1101 0001
  4: 1110 0001
  5: 0000 0110
  6: 1000 1010
  7: 0000 1100
  8: 1111 0000
U1key 1234 5678:   U: 1111 0111
U0key 1234 5678:
  1: 0000 0000
  2: 1000 0000
  3: 1100 0000
  4: 1110 0000
  6: 1000 1000
  7: 0000 1100
  8: 1111 0000
Find all paths of length=3 that start at vertex h: the 1st and 2nd vertices come from EOh; the 3rd from P'h & E0k for k∈EOh.
h=1: EO1 = 0111 0101
  E02 = {123, 124, 128}       E03 = {132, 134, 138}       E04 = {142, 143, 148}
  E06 = {165, 167}            E08 = {182, 183, 184}
h=2: EO2 = 1011 0001
  E01 = {213, 214, 216, 218}  E03 = {231, 234, 238}       E04 = {248}
  E08 = {281, 283, 284}
h=3: EO3 = 1101 0001
  E01 = {312, 314, 316, 318}  E02 = {321, 324, 328}       E04 = {341, 342, 348}
  E08 = {381, 382, 384}
h=4: EO4 = 1110 0001
  E01 = {412, 413, 416, 418}  E02 = {421, 423, 428}       E03 = {431, 432, 438}
  E08 = {481, 482, 483}
h=5: EO5 = 0000 0110
  E06 = {561, 567}            E07 = {576}
h=6: EO6 = 1000 1010
  E01 = {612, 613, 614, 618}  E05 = {657}                 E07 = {675}
h=7: EO7 = 0000 1100
  E05 = {756}                 E06 = {765}
h=8: EO8 = 1111 0000
  E01 = {812, 813, 814, 816}  E02 = {821, 823, 824}       E03 = {831, 832, 834}
  E04 = {841, 842, 843}
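The "3rd vertex from P'h & E0k" step can be sketched as bitmask operations over the 8-bit rows E0 of this graph. This is our sketch (names `vmask`, `third_vertices`, `count_3paths` are ours), under the assumption that the third vertex is any bit of E0k with h (and k) masked out.

```python
# 3-vertex paths h-k-j of the 8-vertex example: j ranges over E0k & P'h.
# Bit for vertex 1 is the leftmost of the 8 bits.

E0 = {1: 0b01110101, 2: 0b10110001, 3: 0b11010001, 4: 0b11100001,
      5: 0b00000110, 6: 0b10001010, 7: 0b00001100, 8: 0b11110000}

def vmask(v):
    return 1 << (8 - v)

def third_vertices(h, k):
    """Vertices j with h-k-j a 3-vertex path: E0k with h (and k) masked out."""
    m = E0[k] & ~vmask(h) & ~vmask(k)
    return {v for v in range(1, 9) if m & vmask(v)}

def count_3paths(h):
    return sum(len(third_vertices(h, k))
               for k in range(1, 9) if E0[h] & vmask(k))
```

For h=1, k=4 this yields the third vertices {2, 3, 8}, i.e., the paths 142, 143, 148 above, and the count for h=1 is 14.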
The # of 3-paths starting at:   1   2   3   4  5  6  7   8  Tot
                               14  11  13  13  3  6  2  13   76
Find all 4-paths ending with each 3-path. Prefix E01 (0111 0101) onto each 3-path starting at 1: 123 124 128, 132 134 138, 142 143 148, 165 167, 182 183 184.
Concatenate with each; eliminate a result if a digit (vertex) duplicates.
2134 2138 2143 2148 2165 2183 2184 3124 3128 3142 3148 3165 3167 3182 3184 4123 4128 4132 4128 4165 4167 4182 4183 5123 5124 5128 5132 5143 5138 5154 5143 5148 5167 5182 5183 5184
6123 6124 6128 6132 6134 6138 6142 6143 6148 6182 6183 6184 7123 7124 7128 7132 7134 7138 7142 7143 7148 7165 7182 7183 7184 8123 8124 8132 8134 8142 8143 8165 8167
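The concatenate-and-eliminate step can be sketched on paths written as digit strings, as on the slide. This is our illustrative sketch; choosing valid front vertices (neighbors of the path's first vertex, e.g. from E01) is assumed to happen before calling it.

```python
# "Concatenate with each; eliminate if a digit duplicates": extend a
# 3-path digit string (e.g. "134") by a new front vertex h.

def extend(h, path3):
    """Prefix vertex h to a 3-path string; drop it if h already occurs."""
    return None if str(h) in path3 else str(h) + path3

def extend_all(front_vertices, paths3):
    return [p for h in front_vertices
              for p in (extend(h, q) for q in paths3) if p]
```

For example, extending "134" by 2 gives "2134", while extending it by 3 is eliminated because vertex 3 would repeat.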
[Slide residue: additional 3-path lists grouped by leading vertex, with digit rulers]
h=4 next
DTPe, k=1..7: TD RoloDex cards (Term 1..9 × Doc 1..3)
DTPe, k=1..9: PD cards (Pos 1..7 × Doc 1..3)
DTPe, k=1..3: PT cards (Pos 1..7 × Term 1..9)
We can form multi-hop relationships from RoloDex cards. A→C is confident if most of the f∈F related to every a∈A are also related to every c∈C. F is the Focus Entity and "most" means at least a MinimumConfidence ratio.
[Figure: DT(P=h) and DT(P=k) cards, Doc 1..3 × Term 1..9, antecedent A and consequent C marked]
A confident DThk rule means: a high fraction of the terms t∈T in Position=h of every doc in A are also in Position=k of every doc in C.
Is there a high payoff research area here?
[Figure: DP(T=h) and DP(T=k) cards, Doc 1..3 × Pos 1..7, antecedent A and consequent C marked]
A confident DPhk rule means: a high fraction of the positions p∈P which hold Term=h for every doc in A also hold Term=k in position p for every doc in C.
[Figure: TP(D=h) and TP(D=k) cards, Term 1..9 × Pos 1..7, antecedent A and consequent C marked]
Confident TPhk: a high fraction of the p∈P in Doc=h holding every t∈A also hold every t∈C in Doc=k. This only makes sense for A, C singleton Terms. Also it seems like P would have to be singleton?
[Figure: TD(P=h) and TD(P=k) cards, Term 1..9 × Doc 1..3, antecedent A and consequent C marked]
A confident TDhk rule means a high fraction of the documents d∈D having, in Position=h, every term t∈A also have, in Position=k, every term t∈C. Again, A and C must be singletons. High payoff? It suggests, in 1-hop ARM:
Confident TD rules: a high fraction of docs d∈D having every term t∈A also have every term t∈C. Again, A and C must be singletons. Is there a high payoff research area here?
[Figure: PD(T=h) and PD(T=k) cards, Pos 1..7 × Doc 1..3, antecedent A and consequent C marked]
A confident PDhk rule: a high fraction of the documents d∈D having Term=h in every position p∈A also have Term=k in every position p∈C.
[Figure: PT(D=h) and PT(D=k) cards, Pos 1..7 × Term 1..9, antecedent A and consequent C marked]
A confident PThk rule means: a high fraction of the terms t∈T in Doc=h which occur at every position p∈A also occur at every position p∈C in Doc=k.
Is this a high payoff research area?
Market Basket RoloDex with a different Cust-Item card for each day:
[Figure: Buys(Cust,Item) cards for Day=1, Day=2, …, Day=k; antecedent A and consequent B marked]
Confident Buy12 rule: customers who buy A on Day=1 buy B on Day=2 with high probability.
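The Buy12 confidence can be sketched with customer bitmasks, one per item per day. This is our illustrative sketch: the data layout (`buys[day][item]` = customer bitmask) and all names and toy values are assumptions, not from the source.

```python
# Among customers who bought every item of A on Day=1, what fraction
# bought every item of B on Day=2?

def itemset_mask(card, items, n_cust):
    """AND of item bitmasks: customers who bought ALL the items."""
    m = (1 << n_cust) - 1
    for i in items:
        m &= card[i]
    return m

def conf_buy12(day1, day2, A, B, n_cust):
    a = itemset_mask(day1, A, n_cust)        # bought all of A on day 1
    ab = a & itemset_mask(day2, B, n_cust)   # ...and all of B on day 2
    return bin(ab).count("1") / bin(a).count("1")

# toy data: 4 customers (one bit each), 3 items
day1 = {1: 0b1101, 2: 0b1001, 3: 0b0111}
day2 = {1: 0b1011, 2: 0b0011, 3: 0b1110}
```

With this toy data, A={1,2} selects customers 1001 on day 1, and both of them bought item 1 on day 2, so the confidence is 1.0.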
“Buys” pathways?
[Figure: Buys cards for Day=1 and Day=2, Items 1..3 × Customers 1..9, antecedent A marked]
Confident Buy123 pathway: most customers who buy A on Day=1 buy B on Day=2, and most of those customers buy all of D on Day=3.
[Figure: Buys cards for Day=1, Day=2 and Day=3, Items × Customers, sets A and D marked]
Confident Buy1234 pathway: some customers buy all of A on Day=1; most of those buy all of B on Day=2; most of those buy all of D on Day=3; and most of those buy all of E on Day=4.
[Figure: Buys cards for Day=3 and Day=4, with the set E marked]
Protein-Protein Interaction RoloDex (a different card for each interaction in some pathway):
[Figure: Gene × Gene card for Interaction=k]
[Figure gallery of RoloDex cards: cust-item card; author-doc card; term-doc card; doc-doc card; exp-gene card; gene-gene card (ppi); exp-PI card]
[Figure: customer-rates-movie card, cell values 0–5]
[Figure: customer-rates-movie-as-5 card, a bitmap of the rating=5 cells]
[Figure: Enrollments (Course × People) card; items and terms axes]
DataCube Model for 3 entities: items, people and terms.
[Figure: term-term card (share stem?)]
Relational Model:
Items:  i1 i2 i3 i4 i5    |0 001|0 |0 11| |1 001|0 |1 01| |2 010|1 |0 10|
People: p1 p2 p3 p4       |0 100|A|M| |1 001|T|M| |2 010|S|F| |3 011|B|F| |4 100|C|M|
Terms:  t1 t2 t3 t4 t5 t6 |1 010|1 101|2 11| |2 001|0 000|3 11| |3 011|1 001|3 11| |4 011|3 001|0 00|
Relationship: p1 i1 t1    |0 0| 1 |0 1| 1 |1 0| 1 |2 0| 2 |3 0| 2 |4 1| 2 |5 1|_2
RoloDex Model: 2 entities, many relationships.
One can form multi-hops with any of these cards. Are there any that provide an interesting setting for ARM data mining?
3-hop: R(E,F), S(F,G), T(G,H);  E=1..4, F=2..5, G=1..4, H=2..5;  A⊆E, C⊆H.
R(E,F) = 0001_0010_0001_0100
S(F,G) = 1001_0111_1000_1100
T(G,H) = 0001_1010_0001_0101
Collapse T: TC ≡ {g∈G | T(g,h) ∀h∈C}. That's just 2-hop with TC⊆G replacing C. (∀ can be replaced by ∃.) Collapse T and S: STC ≡ {f∈F | S(f,g) ∀g∈TC}. Then it's 1-hop with STC replacing C.
Focus on G:  mncnf ≤ ct( (&f∈(&e∈A Re) Sf) & (&h∈C Th) ) / ct( &f∈(&e∈A Re) Sf )
Focus on F:  mncnf ≤ ct( (&e∈A Re) & (&g∈(&h∈C Th) Sg) ) / ct( &e∈A Re )
Focus on F: ct( 1001 & (&g=1,3,4 Sg) ) / ct(1001) = ct( 1001 & 1001 & 1000 & 1100 ) / 2 = ct(1000)/2 = 1/2
Focus on F is different because the confidences can be different.
Focus on G: ct( (&f=2,5 Sf) & 1101 ) / ct(&f=2,5 Sf) = ct( 1101 & 0011 & 1101 ) / ct( 1101 & 0011 ) = 1/1 = 1
Focus on F: mnsp ≤ ct( &e∈A Re )
Focus on G: mnsp ≤ ct( &f∈(&e∈A Re) Sf )
Antecedent downward closure: A infrequent implies supersets infrequent. A is 1 hop from F (down). Consequent upward closure: A→C non-confident implies A→D non-confident, D⊆C. C is 2 hops (up).
Antecedent upward closure: A infrequent implies all subsets infrequent. A is 2 hops from G (up). Consequent downward closure: A→C non-confident implies A→D non-confident, D⊇C. C is 1 hop (down).
4-hop: R(E,F), S(F,G), T(G,H), U(H,I);  E=1..4, F=2..5, G=1..4, H=2..5, I=1..4;  A⊆E, C⊆I.
R(E,F) = 0001_0010_0001_0100
S(F,G) = 1001_0111_1000_1100
T(G,H) = 0001_1010_0001_0101
U(H,I) = 1001_0101_1000_1100
Focus on G? Replace C by UC and A by RA as above (not different from 2-hop?).
Focus on H (use RA for A, then 3-hop) or focus on F (use UC for C, then 3-hop).
Another focus on G (main): mncnf ≤ ct( (&f∈(&e∈A Re) Sf) & (&h∈(&i∈C Ui) Th) ) / ct( &f∈(&e∈A Re) Sf )
mncnf ≤ ( ct(S1(&e∈A Re, &i∈C Ui)) + ct(S2(&e∈A Re, &i∈C Ui)) + ... + ct(Sn(&e∈A Re, &i∈C Ui)) ) / ( (ct(&e∈A Re))·n·ct(&i∈C Ui) )
R(E,G)  = 0011_0011_0001_0100   (E=1..4 × G=2..5), A⊆E
S1(G,G) = ... = Sn(G,G) = 1001_0111_1000_1100
U(G,I)  = 1011_0111_1000_1100   (I=2..5), C⊆I
2. (consequent upward closure) If A→C is non-confident, then so is A→D for all subsets, D, of C (the "list" will be larger, so the AND over the list will produce fewer ones). So for a frequent antecedent, A, use upward closure to mine out all confident consequents, C.
1. (antecedent upward closure) If A is infrequent, then so are all of its subsets (the "list" will be larger, so the AND over the list will produce fewer ones). Frequency involves only A, so mine all qualifying antecedents using upward closure.
4-hop APRIORI, focus on G:  mncnf ≤ ct( (&f∈(&e∈A Re) Sf) & (&h∈(&i∈C Ui) Th) ) / ct( &f∈(&e∈A Re) Sf );  mnsup ≤ ct( &f∈(&e∈A Re) Sf )
5-hop: R(E,F), S(F,G), T(G,H), U(H,I), V(I,J);  A⊆E, C⊆J.
R(E,F) = 0001_0010_0001_0100
S(F,G) = 1001_0111_1000_1100
T(G,H) = 0001_1010_0001_0101
U(H,I) = 1001_0101_1000_1100
V(I,J) = 0001_1010_0001_0101
Given any 1-hop labeled relationship (e.g., cells have values from {1,2,…,n}), there is: 1. a natural n-hop transitive relationship, A implies D, by alternating entities for each specific label-value relationship; 2. cards for each entity consisting of the bitslices of cell values.
E.g., in Netflix, Rating(Cust,Movie) has label set {0,1,2,3,4,5}, so in 1. it generates a bona fide 6-hop transitive relationship. In 2., an alternative is to bitmap each label value (rather than bitslicing). Below, Rn-i can be bitslices or bitmaps.
R3(C,M) = 0001_0010_0001_0100
R2(M,C) = 1001_0111_1000_1100
R4(M,C) = 0001_1010_0001_0101
R5(C,M) = 1001_0101_1000_1100
R0(M,C) = 0001_1010_0001_0101
R1(C,M) = 1101_0001_1101_1100
(Customers and Movies alternate across the hops; A is the antecedent set, D the consequent.)
E.g., equity trading on a given day: QuantityBought(Cust,Stock) with labels {0,1,2,3,4,5} (where n means n thousand shares) generates a bona fide 6-hop transitive relationship.
Equity trading, moved-similarly: (moved similarly on a day) → StockStock(#DaysMovedSimilarlyOfLast10). Equity trading, moved-similarly2: define moved similarly to mean that stock2 moved similarly to what stock1 did the previous day; define the relationship StockStock(#DaysMovedSimilarlyOfLast10). Gene-Experiment: label values could be "expression level". Intervalize and go!
Has Strong Transitive Rule Mining (STRM) been done? Are there downward/upward closure theorems already for it? Is it useful? That is, are there good examples of use: stocks, gene-experiment, MBR, Netflix predictor, ...?
R0(E,F), …, Rn-2(E,F), Rn-1(E,F):  E=1..4 × F=2..5;  A⊆E, D the consequent. Each card in this toy example is 0001_0010_0001_0100.
[Figure residue: the stacked cards and a column of 0001 rows]
BoughtBy(I,C): Items 1..20 × Customers 2..5; first rows: 1001, 0111, 1000, 1100.
Buys(C,T): Customers × Types 1..4; rows: 0001, 0010, 0001, 0100.
A ⊆ Items (the antecedent), D ⊆ Types (the consequent).
Let Types be an entity which clusters Items (moves Items up the semantic hierarchy). E.g., in a store, Types might include: dairy, hardware, household, canned, snacks, baking, meats, produce, bakery, automotive, electronics, toddler, boys, girls, women, men, pharmacy, garden, toys, farm. Let A be an ItemSet wholly of one Type, TA, and let D be a TypesSet which does not include TA. Then:
A→D might mean: if ∃i∈A s.t. BB(i,c), then ∀t∈D, B(c,t).
A→D might mean: if ∀i∈A, BB(i,c), then ∀t∈D, B(c,t).
A→D might mean: if ∃i∈A s.t. BB(i,c), then ∃t∈D, B(c,t).
A→D might mean: if ∀i∈A, BB(i,c), then ∃t∈D, B(c,t).
A→D frequent might mean: ct(&i∈A BBi) ≥ mnsp, or ct(|i∈A BBi) ≥ mnsp, or ct(&t∈D Bt) ≥ mnsp, or ct(|t∈D Bt) ≥ mnsp, or ct(&i∈A BBi & &t∈D Bt) ≥ mnsp, etc.
A→D confident might mean: ct(&i∈A BBi & &t∈D Bt) / ct(&i∈A BBi) ≥ mncf, or ct(&i∈A BBi & |t∈D Bt) / ct(&i∈A BBi) ≥ mncf, or ct(|i∈A BBi & |t∈D Bt) / ct(|i∈A BBi) ≥ mncf, or ct(|i∈A BBi & &t∈D Bt) / ct(|i∈A BBi) ≥ mncf.
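The "&" (for-all) versus "|" (there-exists) masks behind these variants can be sketched as folds over item bitmasks. This is our illustrative sketch; the names and toy data are assumptions.

```python
# BB[i] = customer bitmask for item i. The "&" mask selects customers
# related to EVERY item of S; the "|" mask selects those related to SOME.

from functools import reduce

def all_mask(BB, S, n_cust):
    return reduce(lambda a, i: a & BB[i], S, (1 << n_cust) - 1)

def any_mask(BB, S):
    return reduce(lambda a, i: a | BB[i], S, 0)

BB = {1: 0b1100, 2: 0b0110}   # toy: 4 customers, 2 items
```

Each frequency/confidence variant above is just a count of ones in one of these masks (or in an AND of two of them), divided as appropriate.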
Text Mining using pTrees
DTPe in P-pTreeSet, index (T,D): [Figure: per-position bit columns for Doc1, Doc2, Doc3; e.g., the column for Term=buy]
DTPe Position Table:
  Pos  T1D1 T1D2 T1D3 ... T9D1 … T9D3
  1     1    0    1   ...  0   …  0
  ...
  7     0   …  0   . . .   1   …  1
DTPe Data Cube: [Figure: 0/1 cube over Doc(1..3) × Term(1..9) × Pos(1..7)]
TDcard(P=k), k=1..7:  the DTPe TD RoloDex cards (Term 1..9 × Doc 1..3)
PDcard(T=k), k=1..9:  the DTPe PD cards (Pos 1..7 × Doc 1..3)
PTcard(D=k), k=1,2,3: the DTPe PT cards (Pos 1..7 × Term 1..9)
DTPe Document Table:
  Doc  T1P1 … T1P7 . . . T9P1 … T9P7
  1     1   …  0   . . .  0   …  0
  2     0   …  0   . . .  1   …  0
  3     0   …  0   . . .  1   …  1
Classical Document Table:
  Doc  Auth  Date    . . .  Subj1 … Subjm
  1     1    1/2/13  . . .   0    …  0
  2     0    2/2/15  . . .   1    …  0
  3     0    3/3/14  . . .   1    …  1
DTPe DocTbl DpTreeSet, indexed by (T,P): [Figure: per-term position bit-rows (Positions 1..7) for terms AAPL, all, always, an, and, apple, April, are, buy, …]
Classical DocTbl DpTreeSet: [Figure: Auth, Date, Subj1 … Subjm columns]
DTPe Term Table:
  Term  P1D1 P1D2 P1D3 ... P7D1 … P7D3
  1      1    0    1   ...  0   …  0
  ...
  9      0   …  0   . . .   1   …  1
DTPe Term Usage Table:
  Term  P1D1  P1D2  P1D3  ...  P7D1 … P7D3
  1     noun  verb  adj   adv  …  noun
  ...
  9     adj   noun  noun  adj  noun
DTPe TpTreeSet, index (D,P): [Figure: per-document bit columns (Doc1, Doc2, Doc3) across positions 1, 2, …]
P1D1=noun, P1D1=adj: [Figure: one bit column per part-of-speech value]
tf is the +rollup of the DTPe datacube along the position dimension. One can use any measurement or data structure of measurements, e.g., DTtfidf, in which each cell holds a decimal tfidf value. That value can be bitsliced directly into whole-number bitslices plus fractional bitslices (one for each binary digit to the right of the binary point; no need to shift!) using MOD(INT(x/2^k), 2). E.g., tfidf = 3.5 is:
  k:   3  2  1  0  -1  -2
  bit: 0  0  1  1   1   0
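The bitslicing formula MOD(INT(x/2^k), 2) can be sketched in a few lines; this reproduces the tfidf = 3.5 example (the function name is ours).

```python
# bit_k(x) = MOD(INT(x / 2**k), 2): the bitslice of x at power-of-two
# position k, where negative k gives the fractional bitslices.

def bitslice(x, ks):
    """Return the bit of x at each position k (k may be negative)."""
    return [int(x / 2**k) % 2 for k in ks]
```

For x = 3.5 and k = 3, 2, 1, 0, -1, -2 this gives 0, 0, 1, 1, 1, 0, i.e., binary 0011.10.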
[Figure: DTtf (Doc×Term term-frequency) Data Cube, Docs 1..3 × Terms, with sample counts 1 and 2]
DT tfidf Doc Table:
  Doc  T1   T2  . . .  T9
  1    .75   0  . . .   1
  2     0    1         .25
  3     0    0          0
DT tfidf DpTreeSet: [Figure: bitslices T1k1, T1k0, T1k-1, T1k-2, …]
Rating of T=stock at doc-date close: 1=sell, 2=hold, 3=buy, 0=non-stock Term.
[Figure: DT SR (Doc×Term StockRating) Cube, e.g., one cell = 3]
DT SR bitslice DpTreeSet: bitslices T2k2 = 1, T2k1 = 1
DT SR bitmap DpTreeSet: bitmaps T2,R=buy = 1;  T2,R=hold = 0;  T2,R=sell = 0
key: 1,1 1,2 … 1,N _ 2,1 2,2 … 2,N _ . . . _ M,1 M,2 … M,N
Closure: An induced Subgraph (ISG), C, of a graph, G, inherits all of G’s edges between its own vertices.
A k-ISG (k vertices), C, is a k-clique iff all of its (k-1)-Sub-ISGs are (k-1)-cliques.
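The closure property can be sketched as a recursive check: a k-vertex ISG is a k-clique iff every (k-1)-sub-ISG is a (k-1)-clique. This is our illustrative sketch on a plain edge-set graph, not the pTree implementation.

```python
# k-clique test via (k-1)-sub-ISG closure, on the 4-vertex example graph.

from itertools import combinations

def is_clique(vertices, edges):
    vs = sorted(vertices)
    if len(vs) <= 1:
        return True
    if len(vs) == 2:
        return frozenset(vs) in edges
    # a k-ISG is a k-clique iff all (k-1)-sub-ISGs are (k-1)-cliques
    return all(is_clique(sub, edges) for sub in combinations(vs, len(vs) - 1))

EDGES = {frozenset(e) for e in [(1, 3), (1, 4), (2, 4), (3, 4)]}
```

In the example graph, {1,3,4} is a 3-clique (edges 13, 14, 34 all present) while {1,2,4} is not (edge 12 is missing).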
Big Graph Mining (bipartite graphs): Gene-gene interactions: N=M=25K, NM=625M. Social nets: N=M=2B, NM=4BB. Recommenders: N=B, M=M, NM=MB.
Assume the graph is bipartite, G=(I,C,E) (unipartite iff C=I), |I|=N, |C|=M, |E|≤MN. 2-level pTrees, stride=N. [Figure: Level-1/Level-0 layout of the E and U pTrees, keys 1,1 … M,N, with count columns #E1…#EM and #U1…#UM]
U=Unique. For bipartite and directed graphs, E=U.
e.g., UM masks the items of cust=M, the friends of person=M, the genes interacting with gene=M.
[Figure: vertices v1, v2, v3, w1, w2 drawn as a unipartite graph G = (V∪W, E⊆V×W), with its adjacency bit-rows]
Bipartite G = ((V,W), E):
      w1 w2
  v1   1  1
  v2   1  0
  v3   1  0
So, are communities in bipartite graphs studied as unipartite?
A tree is bipartite. Cycle graphs with an even # of vertices are bipartite. A planar graph whose faces all have even length is bipartite.
δintC - δextC = 1 - 1/3 = 2/3
[Figure: the 4-vertex labeled graph, vertex labels 1:2, 2:3, 3:2, 4:3]
PEL,1 = 0001_0001_0001_1100
PEL,0 = 0010_0001_1000_0110
EL    = 0012_0003_0001_2310
PE    = 0011_0001_1001_1110
Ekey  = 1,1 1,2 1,3 1,4 _ 2,1 2,2 2,3 2,4 _ 3,1 3,2 3,3 3,4 _ 4,1 4,2 4,3 4,4
L=1: PE = 1111.   L=0: PE,1 = 0011, PE,2 = 0001, PE,3 = 1001, PE,4 = 1110.
E = adjacency matrix. [Figure: the graph as a RoloDex card, vertex labels 1:2, 2:3, 3:2, 4:3]
A community has more edges inside than linked to the outside.
Let subgraph C have nc vertices of a graph G having n vertices.
Internal degree of v∈C: kvint = # of edges from v to vertices in C.
External degree of v∈C: kvext = # of edges from v to vertices in C'.
Internal degree of C: kCint = Σv∈C kvint.
External degree of C: kCext = Σv∈C kvext.
Total degree of C: kC = kCint + kCext.
For C = {1,3,4}: 2 = |PC&PE&Pv1| = kv1int;  2 = |PC&PE&Pv3| = kv3int;  2 = |PC&PE&Pv4| = kv4int;  kCint = 6.
kCext = 1;  kC = 7.
Intra-cluster density: δint(C) = |edges(C,C)| / (nc(nc-1)/2) = |PE&PC&PLT| / (3·2/2) = 3/3 = 1
PLT = 0111_0011_0001_0000;  PLT,1 = 0111, PLT,2 = 0011, PLT,3 = 0001, PLT,4 = 0000
Inter-cluster density: δext(C) = |edges(C,C')| / (nc(n-nc)) = |PE&P'C&PLT| / (3·1) = 1/3
Useful masks:
PC  = 1011_0000_1011_1011
Pv1 = 1111_0000_0000_0000
Pv2 = 0000_1111_0000_0000
Pv3 = 0000_0000_1111_0000
Pv4 = 0000_0000_0000_1111
0 = |P'C&PE&Pv1| = kv1ext;  0 = |P'C&PE&Pv3| = kv3ext;  1 = |P'C&PE&Pv4| = kv4ext
The tradeoff between a large δint(C) and a small δext(C) is the goal of community mining and clustering algorithms. The simple way is to maximize differences, δint(C) - δext(C) = D (or Dk = kCint - kCext), over all clusters (use the Sum of Differences for partitions).
It is easy to compute each SD and SDk with pTrees, even for BigData graphs.
Can one use downward (upward?) closure properties (precisely) to facilitate maximizing differences over all clusters, C?
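The density computations above can be sketched with bitwise ANDs over the 16-bit masks PE, PC and PLT of the 4-vertex example (C = {1,3,4}). This is our illustrative sketch; the helper names are ours.

```python
# Intra/inter-cluster density of C={1,3,4} via pTree-style bit counting.

def bits(s):
    return int(s.replace("_", ""), 2)

PE  = bits("0011_0001_1001_1110")   # symmetric adjacency, keys (i,j), i,j=1..4
PLT = bits("0111_0011_0001_0000")   # triangle mask: count each edge once
PC  = bits("1011_0000_1011_1011")   # cells with both endpoints in C
ALL = (1 << 16) - 1

def ct(x):
    return bin(x).count("1")

n, nc = 4, 3
d_int = ct(PE & PC & PLT) / (nc * (nc - 1) / 2)        # = 3/3 = 1
d_ext = ct(PE & (ALL ^ PC) & PLT) / (nc * (n - nc))    # = 1/3
diff  = d_int - d_ext                                  # = 2/3
```

This reproduces δint(C) = 1 and δext(C) = 1/3, so the difference D = 2/3 computed earlier.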
Graphs are the ubiquitous data structures for complex data in all of science. A table is a graph with no edges; a relationship is a bipartite graph…
Extend to multigraphs (edge sets = vertex triples, quadruples, etc.).
Ignoring subgraphs of 1 or 2 vertices, the other three 3-vertex subgraphs are D={1,2,3}, F={1,2,4}, H={2,3,4}.
Stride=4, two-level pTrees.
Horizontal vertex data:       Vertical vertex data:
  Vkey  VLabel                  VL    = 2323
  1     2                       PVL,1 = 1111
  2     3                       PVL,0 = 0101
  3     2                       PC    = 1011
  4     3
Fixed Pt Column.
Vertex-labelled, edge-labelled graph.
δint(D) = |PE&PD&PLT| / (3·2/2) = 1/3;  δext(D) = |PE&P'D&PLT| / (3·1) = 3/3 = 1;  δintD - δextD = 1/3 - 1 = -2/3;  PD = 1110_1110_1110_0000
δint(F) = |PE&PF&PLT| / (3·2/2) = 2/3;  δext(F) = |PE&P'F&PLT| / (3·1) = 2/3;  δintF - δextF = 2/3 - 2/3 = 0;  PF = 1101_1101_0000_1101
δint(H) = |PE&PH&PLT| / (3·2/2) = 2/3;  δext(H) = |PE&P'H&PLT| / (3·1) = 2/3;  δintH - δextH = 2/3 - 2/3 = 0;  PH = 0000_0111_0111_0111
Maximizing the difference of cluster densities: C is the strongest community (subgraph/cluster). One could use label values (weights) instead of the 0/1 existence values.
E (Edge Table):
  Ekey  V1  V2  ELabel
  1,3   1   3   1
  1,4   1   4   2
  2,4   2   4   3
  3,4   3   4   1