Mohammad Hasan, Mohammed Zaki
RPI, Troy, NY
Consider the following problem from Medical Informatics
[Figure: tissue images (healthy, diseased, damaged) are converted to cell graphs; discriminatory subgraphs mined from the cell graphs feed a classifier]
04/20/23
Mining Task
Dataset: 30 graphs; average vertex count: 2154; average edge count: 36945
Support: 40%
Result: no result (using gSpan, Gaston) after a week of running on a 2 GHz dual-core PC with 4 GB of RAM running Linux
Limitations of Existing Subgraph Mining Algorithms
They work only for small graphs
  The most popular datasets in graph mining are chemical graphs
  Chemical graphs are mostly trees
  In the DTP dataset (the most popular dataset), the average vertex count is 43 and the average edge count is 45
They perform a complete enumeration
  For a large input graph, the output set is neither enumerable nor usable
They follow a fixed enumeration order
  A partial run does not efficiently generate the interesting subgraphs
Goal: avoid complete enumeration and instead sample a set of interesting subgraphs from the output set
Why sample a solution?
Observation 1:
  Mining is only an exploratory step; mined patterns are generally used in a subsequent KD task
  Not all frequent patterns are equally important for the task at hand
  A large output set leads to an information overload problem
Observation 2:
  Traditional mining algorithms explore the output space in a fixed enumeration order
  This is good for generating non-duplicate candidate patterns
  But consecutive patterns in that order are very similar
Complete enumeration is generally unnecessary
Sampling can change the enumeration order to sample interesting and non-redundant subgraphs with a higher chance
Output Space
Traditionally, the frequent subgraphs for a given support threshold
Can also be augmented with other constraints, to find good patterns for the desired KD task
[Figure: input space and the corresponding output space for FPM with support = 2]
Sampling from Output Space
Return a random pattern from the output set
The random pattern is obtained by sampling from a desired distribution
Define an interestingness function f : F → R+; f(p) returns the score of pattern p
The desired sampling distribution is proportional to the interestingness score
If the output space has only 3 patterns with scores 2, 3, 4, sampling should be performed from the distribution {2/9, 1/3, 4/9}
Efficiency consideration: enumerate as few auxiliary patterns as possible
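The three-pattern example can be checked with a short Python sketch. It normalizes the scores into the target distribution and draws samples directly. The pattern names `p1`, `p2`, `p3` are hypothetical, and this direct approach assumes the whole output set can be listed, which is exactly what the method in these slides tries to avoid for real graph data.

```python
import random
from collections import Counter
from fractions import Fraction

def target_distribution(scores):
    """Normalize non-negative integer scores into sampling probabilities."""
    total = sum(scores.values())
    return {p: Fraction(s, total) for p, s in scores.items()}

def sample_pattern(scores, rng):
    """Draw one pattern with probability proportional to its score."""
    patterns = list(scores)
    weights = [scores[p] for p in patterns]
    return rng.choices(patterns, weights=weights, k=1)[0]

# Hypothetical pattern names; the scores 2, 3, 4 come from the slide.
scores = {"p1": 2, "p2": 3, "p3": 4}
dist = target_distribution(scores)

rng = random.Random(0)
counts = Counter(sample_pattern(scores, rng) for _ in range(9000))
for p in scores:
    print(p, dist[p], counts[p] / 9000)
```

The empirical frequencies should land close to 2/9, 1/3 and 4/9.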
How to choose f?
Depends on application needs
For exploratory data analysis (EDA), every frequent pattern can have a uniform score
For Top-K pattern mining, support values can be used as scores, which gives support-biased sampling
For the subgraph summarization task, only maximal graph patterns have a uniform non-zero score
For graph classification, discriminatory subgraphs should have high scores
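As a rough illustration, the four choices of f listed above could be written as plain Python functions over per-pattern statistics. The `PatternStats` record and its field names are assumptions made for this sketch, and the delta score is used as one possible discriminatory score.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PatternStats:
    """Hypothetical per-pattern statistics; field names are illustrative."""
    support: int       # number of database graphs containing the pattern
    is_maximal: bool   # True if no frequent super-pattern exists
    pos_support: int   # support among positively labeled graphs
    neg_support: int   # support among negatively labeled graphs

def f_uniform(p):
    """EDA: every frequent pattern gets the same score."""
    return 1.0

def f_support(p):
    """Top-K mining: score equals support (support-biased sampling)."""
    return float(p.support)

def f_maximal(p):
    """Summarization: only maximal patterns get a uniform non-zero score."""
    return 1.0 if p.is_maximal else 0.0

def f_delta(p):
    """Classification: one possible discriminatory score (delta score)."""
    return float(abs(p.pos_support - p.neg_support))

p = PatternStats(support=5, is_maximal=False, pos_support=4, neg_support=1)
print(f_uniform(p), f_support(p), f_maximal(p), f_delta(p))
```

Because each function has the same signature, a sampler can take f as a parameter and stay unchanged across tasks.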
Challenges
The output space cannot be instantiated
Complete statistics about the output space are not known
The target distribution is not known entirely
[Figure: Output Space of Graph Mining, showing patterns g1 to g5 and a row of scores s1, s2, s3, ..., sn, one per graph]
We want $\pi(i) = \frac{s_i}{\sum_j s_j}$
MCMC Sampling
Solution Approach (MCMC Sampling)
  Perform a random walk in the output space
  Represent the output space as a transition graph to allow local transitions
  Edges of the transition graph are chosen based on structural similarity
  Make sure that the random walk is ergodic
POG as transition graph
  In the partial order graph (POG), every pattern is connected to its sub-patterns (with one less edge) and all its super-patterns (with one more edge)
Algorithm
Define the transition graph (for instance, the POG)
Define an interestingness function that selects the desired sampling distribution
Perform a random walk on the transition graph
  Compute the neighborhood locally
  Compute the transition probability
    Utilizing the interestingness score makes the method generic
Return the currently visited pattern after k iterations
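The steps above can be sketched end to end on a toy problem. This is a minimal sketch under stated assumptions, not the authors' implementation: itemsets stand in for graphs (so support counting is a subset test instead of subgraph isomorphism), the database `DB` and threshold `MIN_SUP` are made up, the proposal picks a neighbor uniformly, and a rejected proposal counts as one more visit to the current pattern.

```python
import random
from collections import Counter

# Toy transaction database (an assumption; itemsets stand in for graphs).
DB = [frozenset("abc"), frozenset("abd"), frozenset("acd"),
      frozenset("ab"), frozenset("bcd")]
ITEMS = sorted(set().union(*DB))
MIN_SUP = 2

def support(pattern):
    """Number of transactions that contain the pattern."""
    return sum(1 for t in DB if pattern <= t)

def neighbors(pattern):
    """Local POG neighborhood: patterns with one item fewer or one item more.
    Candidates outside the output space (infrequent) are discarded."""
    subs = [pattern - {x} for x in pattern]
    supers = [pattern | {x} for x in ITEMS if x not in pattern]
    return [q for q in subs + supers if support(q) >= MIN_SUP]

def mh_walk(score, steps, rng, start=frozenset()):
    """Metropolis-Hastings random walk with a uniform neighbor proposal.
    The stationary distribution is proportional to score(pattern)."""
    current = start
    visits = Counter()
    for _ in range(steps):
        nbrs = neighbors(current)
        cand = rng.choice(nbrs)
        # q(current, cand) = 1/deg(current), q(cand, current) = 1/deg(cand)
        ratio = (score(cand) * len(nbrs)) / (score(current) * len(neighbors(cand)))
        if rng.random() < min(1.0, ratio):
            current = cand
        visits[current] += 1
    return current, visits

# Uniform score: every frequent pattern should be visited about equally often.
final, visits = mh_walk(score=lambda p: 1.0, steps=20000, rng=random.Random(1))
for p, c in sorted(visits.items(), key=lambda kv: sorted(kv[0])):
    print("".join(sorted(p)) or "{}", c)
```

Only the neighborhood of the current pattern is ever computed, so the full output space is never materialized. Passing `score=support` instead of the constant function would bias the walk toward high-support patterns.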
Local Computation of Output Space
[Figure: the current pattern g0 with its super patterns and sub patterns g1 to g5; the transition probabilities p00, p01, ..., p05 out of g0 sum to 1]
Patterns that are not part of the output space are discarded during local neighborhood computation
Compute P to achieve Target Distribution
If π is the stationary distribution and P is the transition matrix, then in equilibrium we have $\pi = \pi P$
The main task is to choose P so that the desired stationary distribution is achieved
In fact, we compute only one row of P (local computation)
[Figure: graphs with scores s1, s2, s3, ..., sn]
We want $\pi(i) = \frac{s_i}{\sum_j s_j}$
Use Metropolis-Hastings (MH) Algorithm
1. Fix an arbitrary proposal distribution (q) beforehand
2. Find a neighbor j (to move to) by using the above distribution
3. Compute the acceptance probability and accept the move with this probability
4. If accepted, move to j; otherwise, go to step 2
[Figure: current state 0 with neighbors 1 to 5 and proposal probabilities q00, q01, ..., q05; neighbor 3 is selected]
Acceptance probability for the move 0 → 3: $\min\left(1, \frac{s_3\, q_{30}}{s_0\, q_{03}}\right)$
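A small sketch of step 3, assuming the target distribution is proportional to the scores s. The function name and the numbers in the example are hypothetical; exact fractions are used so that the detailed balance condition, which is what makes the scores the stationary distribution, can be checked without rounding.

```python
from fractions import Fraction

def acceptance_probability(s_i, s_j, q_ij, q_ji):
    """MH acceptance probability for a proposed move i -> j,
    where s_* are scores and q_ij is the proposal probability of i -> j."""
    return min(1, (s_j * q_ji) / (s_i * q_ij))

# Hypothetical numbers: state 0 has score 2, state 3 has score 4.
s0, s3 = Fraction(2), Fraction(4)
q03, q30 = Fraction(1, 6), Fraction(1, 3)

a03 = acceptance_probability(s0, s3, q03, q30)   # move 0 -> 3
a30 = acceptance_probability(s3, s0, q30, q03)   # move 3 -> 0

# Detailed balance: probability flow 0 -> 3 equals flow 3 -> 0.
flow_03 = s0 * q03 * a03
flow_30 = s3 * q30 * a30
print(a03, a30, flow_03 == flow_30)
```

Only the ratio of scores appears, so the normalizing constant of the target distribution never has to be known.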
Uniform Sampling of Frequent Patterns
Target distribution: 1/n, 1/n, ..., 1/n
How to achieve it? Use a uniform proposal distribution
Acceptance probability (for a move from u to v) is: $\min\left(1, \frac{d_u}{d_v}\right)$
$d_x$: degree of a vertex x
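To see why the degree ratio yields a uniform stationary distribution, the sketch below builds the full transition matrix for a small hypothetical transition graph (the adjacency lists in `GRAPH` are made up for illustration) and checks that the uniform distribution satisfies π = πP exactly.

```python
from fractions import Fraction

# Hypothetical undirected transition graph; not taken from the slides.
GRAPH = {"A": ["B", "C", "D"], "B": ["A", "C"], "C": ["A", "B"], "D": ["A"]}

def uniform_mh_row(u, graph):
    """One row of P: propose a neighbor uniformly (probability 1/d_u),
    then accept with probability min(1, d_u / d_v)."""
    d_u = len(graph[u])
    row = {}
    for v in graph[u]:
        d_v = len(graph[v])
        row[v] = Fraction(1, d_u) * min(Fraction(1), Fraction(d_u, d_v))
    row[u] = 1 - sum(row.values())  # rejected proposals stay at u
    return row

P = {u: uniform_mh_row(u, GRAPH) for u in GRAPH}

n = len(GRAPH)
pi = {u: Fraction(1, n) for u in GRAPH}
# One step of the chain applied to pi: (pi P)(v) = sum_u pi(u) P(u, v)
pi_next = {v: sum(pi[u] * P[u].get(v, 0) for u in GRAPH) for v in GRAPH}
print(pi_next == pi)
```

Each off-diagonal entry works out to 1/max(d_u, d_v), so P is symmetric and the uniform distribution is stationary. Low-degree vertices such as D get a large self-loop probability, which is where the degree spread hurts mixing.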
Uniform Sampling, Transition Probability Matrix
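[Figure: example transition graph over small labeled patterns (vertex labels A, B, D) with an annotated transition probability P; the value is not recoverable from the extracted text]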
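Discriminatory Subgraph Sampling
Database graphs are labeled
Subgraphs may be used as
  Features for supervised classification
  Graph kernels
[Figure: labeled database graphs (G1: +1, G2: +1, G3: -1) go through subgraph mining to yield subgraphs g1, g2, g3, ...; each database graph G1, G2, G3 is then described by the embedding counts, or binary indicators, of those subgraphs]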
Sampling in Proportion to Discriminatory Score (f)
Interestingness score (feature quality)
  Entropy
  Delta score = abs(positive support - negative support)
Direct mining is difficult
  Score values (entropy, delta score) are neither monotone nor anti-monotone
[Figure: a pattern P and its child pattern C; Score(P) may be less than, equal to, or greater than Score(C)]
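The lack of monotonicity is easy to reproduce on a toy example. In the sketch below, itemsets stand in for graphs and the labeled databases `POS` and `NEG` are made up; the delta score is computed along one chain of growing patterns.

```python
# Hypothetical labeled database; itemsets stand in for graphs.
POS = [frozenset("ab"), frozenset("abc"), frozenset("ac")]
NEG = [frozenset("b"), frozenset("bc"), frozenset("abc")]

def support(pattern, graphs):
    """Number of database entries that contain the pattern."""
    return sum(1 for g in graphs if pattern <= g)

def delta_score(pattern):
    """abs(positive support - negative support)."""
    return abs(support(pattern, POS) - support(pattern, NEG))

# A chain in the partial order: each pattern extends the previous one.
chain = [frozenset(), frozenset("a"), frozenset("ab"), frozenset("abc")]
scores = [delta_score(p) for p in chain]
print(scores)  # [0, 2, 1, 0]
```

The score rises and then falls along the chain, so neither a parent's score nor a child's score bounds the other, and support-style pruning cannot be applied to it directly.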
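Discriminatory Subgraph Sampling
Use the Metropolis-Hastings algorithm
  Choose a neighbor uniformly as the proposal distribution
  Compute the acceptance probability from the delta score
Acceptance probability for a move from i to j: $\min\left(1, \frac{f(j)}{f(i)} \cdot \frac{d_i}{d_j}\right)$, the ratio of the delta scores of j and i times the ratio of the degrees of i and j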
Datasets

| Name | # of Graphs | Average Vertex Count | Average Edge Count |
| --- | --- | --- | --- |
| DTP | 1084 | 43 | 45 |
| Chess | 3196 | 10.25 | - |
| Mutagenicity | 2401 (+), 1936 (-) | 17 | 18 |
| PPI | 3 | 2154 | 81607 |
| Cell-Graphs | 30 | 2184 | 36945 |
Result Evaluation Metrics
Sampling Quality
  Our sampling distribution vs. the target sampling distribution
  Median and standard deviation of visit count
  How the sampling converges (convergence rate)
  Variation distance: $\frac{1}{2} \sum_{y} \left| P^t(x, y) - \pi(y) \right|$
Scalability Test
  Experiments on large datasets
Quality of Sampled Patterns
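The variation distance is simple to compute once two distributions over the same patterns are available. The sketch below is illustrative only: it compares an empirical distribution built from hypothetical visit counts against a uniform target, whereas the slide's P^t(x, ·) is the exact t-step distribution of the walk started at x.

```python
def variation_distance(p, pi):
    """Total variation distance: half the L1 distance between two
    distributions given as {state: probability} dictionaries."""
    states = set(p) | set(pi)
    return 0.5 * sum(abs(p.get(y, 0.0) - pi.get(y, 0.0)) for y in states)

# Hypothetical visit counts for three patterns after 100 steps.
visit_counts = {"g1": 30, "g2": 50, "g3": 20}
total = sum(visit_counts.values())
empirical = {g: c / total for g, c in visit_counts.items()}
uniform = {g: 1 / len(visit_counts) for g in visit_counts}

print(round(variation_distance(empirical, uniform), 4))
```

The value ranges from 0 (identical distributions) to 1 (disjoint supports), so tracking it against the number of steps gives a convergence curve.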
Uniform Sampling Results
Experiment Setup
  Run the sampling algorithm for a sufficient number of iterations and observe the visit count distribution
  For a dataset with n frequent patterns, we perform 200*n iterations
Result on DTP Chemical Dataset

| Sampling | Max count | Min count | Median | Std |
| --- | --- | --- | --- | --- |
| Uniform Sampling | 338 | 32 | 209 | 59.02 |
| Ideal Sampling | - | - | 200 | 14.11 |
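A rough sanity check on the "Ideal Sampling" figures, assuming (this is an assumption, not something stated on the slide) that an ideal sampler draws each of the $200n$ samples independently and uniformly from the $n$ frequent patterns. The visit count $X$ of any single pattern is then binomial:

$$X \sim \mathrm{Binomial}\left(200n, \tfrac{1}{n}\right), \qquad \mathbb{E}[X] = 200, \qquad \mathrm{sd}(X) = \sqrt{200\left(1 - \tfrac{1}{n}\right)} \approx 14.1 \text{ for large } n$$

Under that reading, a standard deviation of about 14 is what independent uniform draws would show at this iteration count, and the gap between 14 and the observed 59.02 is what the random walk loses relative to that ideal.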
Sampling Quality
Depends on the choice of proposal distribution
If the vertices of the POG have similar degree values, sampling is good
The earlier dataset has patterns with widely varying degree values
For a clique dataset, sampling quality is almost perfect
Result on Chess (Itemset) Dataset (100*n iterations)

| Sampling | Max count | Min count | Median | Std |
| --- | --- | --- | --- | --- |
| Uniform Sampling | 156 | 6 | 100 | 13.64 |
| Ideal Sampling | - | - | 100 | 10 |
Discriminatory sampling results (Mutagenicity dataset)
[Figure: distribution of delta score among all frequent patterns]
[Figure: relation between sampling rate and delta score]
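Discriminatory sampling results (cont)

| Sample No | Delta Score | Rank | % of POG explored |
| --- | --- | --- | --- |
| 1 | 404 | 132 | 5.7 |
| 2 | 644 | 21 | 11.0 |
| 3 | 707 | 10 | 10.8 |
| 4 | 725 | 4 | 8.9 |
| 5 | 280 | 595 | 2.8 |
| 6 | 725 | 4 | 8.9 |
| 7 | 627 | 27 | 3.3 |
| 8 | 709 | 9 | 7.7 |
| 9 | 721 | 5 | 9.1 |
| 10 | 725 | 4 | 8.9 |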
Discriminatory sampling results (cell graphs)
Total graphs: 30, min-sup = 6
No graph mining algorithm could finish on this dataset within a week of running (on a 2 GHz machine with 4 GB of RAM)
[Bar chart: number of subgraphs with delta score > 9, traditional algorithm vs. OSS; y-axis from 0 to 30]
Summary

| Existing Algorithms | Output Space Sampling |
| --- | --- |
| Depth-first or breadth-first walk on the subgraph space | Random walk on the subgraph space |
| Rightmost extension | Arbitrary extension |
| Complete algorithm | Sampling algorithm |

Quality: sampling quality guarantee
Scalability: visits only a small part of the search space
Non-redundant: finds very dissimilar patterns by virtue of randomness
Genericity: in terms of pattern type and sampling objective
Future Works and Discussion
It is important to choose the proposal distribution wisely to get better sampling
For large graphs, support counting is still a bottleneck
  How to scrap the isomorphism checking entirely
  How to effectively parallelize the support counting
How to make the random walk converge faster
  The POG generally has a small spectral gap; as a result, convergence is slow
  This makes the algorithm costly (more steps to find good samples)
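Acceptance Probability Computation
Acceptance probability for a move from i to j: $\min\left(1, \frac{\pi(j)\, q_{ji}}{\pi(i)\, q_{ij}}\right) = \min\left(1, \frac{s_j\, q_{ji}}{s_i\, q_{ij}}\right)$
  π: desired distribution
  q: proposal distribution
  s: interestingness value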
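Support Biased Sampling
[Figure: graphs with supports s1, s2, s3, ..., sn]
We want $\pi(i) = \frac{s_i}{\sum_j s_j}$
What proposal distribution to choose?
$Q(u, v) = \begin{cases} \alpha \cdot \frac{1}{|N_{down}(u)|} & \text{if } v \in N_{down}(u) \\ (1 - \alpha) \cdot \frac{1}{|N_{up}(u)|} & \text{if } v \in N_{up}(u) \end{cases}$
α = 1 if N_up(u) = ∅; α = 0 if N_down(u) = ∅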
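A hedged sketch of such a two-sided proposal. It assumes that α is the probability mass given to the down-neighbors and 1 - α the mass given to the up-neighbors, which is the reading consistent with the boundary rule α = 1 if N_up(u) = ∅ and α = 0 if N_down(u) = ∅. The neighbor sets below are hypothetical; they are chosen only so that the proposal values come out as q(u, v) = 1/2 and q(v, u) = 1/9.

```python
from fractions import Fraction

def proposal(u, v, n_up, n_down, alpha):
    """Q(u, v): with probability alpha pick a down-neighbor of u uniformly,
    otherwise pick an up-neighbor uniformly (assumed convention)."""
    up, down = n_up[u], n_down[u]
    if not up:
        alpha = Fraction(1)   # no up-neighbors: all mass goes down
    if not down:
        alpha = Fraction(0)   # no down-neighbors: all mass goes up
    if v in down:
        return alpha * Fraction(1, len(down))
    if v in up:
        return (1 - alpha) * Fraction(1, len(up))
    return Fraction(0)

# Hypothetical neighborhoods: v has 3 down-neighbors, u has none.
n_down = {"u": set(), "v": {"u", "x", "y"}}
n_up = {"u": {"v", "z"}, "v": {"w"}}
alpha = Fraction(1, 3)

q_uv = proposal("u", "v", n_up, n_down, alpha)
q_vu = proposal("v", "u", n_up, n_down, alpha)

# MH acceptance for u -> v with supports s(u) = 2 and s(v) = 3.
s_u, s_v = 2, 3
accept_uv = min(Fraction(1), (s_v * q_vu) / (s_u * q_uv))
print(q_uv, q_vu, accept_uv)
```

With these inputs the acceptance probability for the move u → v is 1/3.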
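Example of Support Biased Sampling
[Figure: example POG fragment with vertex labels A, B, D; u and v are neighboring patterns]
s(u) = 2, s(v) = 3
α = 1/3, q(u, v) = 1/2, q(v, u) = 1/(3 × 3) = 1/9
$P = \min\left(1, \frac{3 \times 1/9}{2 \times 1/2}\right) = \frac{1}{3}$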
Sampling Convergence
Support Biased Sampling
Scatter plot of visit count and support shows positive correlation
Correlation: 0.76
Specific Sampling Examples and Utilization
Uniform sampling of frequent patterns
  To explore the frequent patterns
  To set a proper value of minimum support
  To make an approximate counting
Support-biased sampling
  To find Top-k patterns in terms of support value
Discriminatory subgraph sampling
  Finding subgraphs that are good features for classification