discovering patterns in multiple datasets raj bhatnagar university of cincinnati
Post on 21-Dec-2015
![Page 1: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/1.jpg)
Discovering Patterns in Multiple Datasets
Raj Bhatnagar
University of Cincinnati
![Page 2: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/2.jpg)
Nature of Distributed Datasets
Horizontal Partitioning
A B C D E
A B C D E
A B C D E
Vertical Partitioning
D E F G H
A H J K M
A B C D E
Data components may be Geographically Distributed
![Page 3: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/3.jpg)
Nature of Distributed Datasets
Multi-Domain Datasets
Genes (rows) × Diseases (columns)
Genes (rows) × Drugs (columns)
Drugs (rows) × Adverse Reactions (columns)
![Page 4: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/4.jpg)
Nature of Distributed Datasets
Multi-Domain Datasets
Documents (rows) × Keywords (columns)
Documents (rows) × Cited-Documents (columns)
Keywords (rows) × Topics (columns)
![Page 5: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/5.jpg)
Types of Patterns
• Decision Trees
• Association Rules
• Principal Component Analysis
• K-Nearest Neighbor Analysis
• Clusters
– Hierarchical
– K-Means
– Subspace
![Page 6: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/6.jpg)
Nature of Clusters
Patterns ≡ Unsupervised, Data Driven, Clusters
Single-Domain Clustering
Genes (rows) × Diseases (columns)
Clusters of similar genes, in the context of diseases
Clusters of similar diseases, in the context of genes
Clusters may be:
- Mutually Exclusive
- Overlapping
![Page 7: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/7.jpg)
Nature of Patterns
Simultaneous Two-Domain Clustering
Genes (rows) × Diseases (columns)
A cluster of similar genes, in a subspace of diseases
A cluster of similar diseases, in a subspace of genes
Options:
- Exhaustive in one domain
- Exhaustive in both domains
- Mutually exclusive clusters in one or both domains
- Overlapping clusters/subspaces in both domains
![Page 8: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/8.jpg)
Nature of Patterns
Simultaneous Three (Multi)-Domain Clustering
Genes (rows) × Diseases (columns)
Genes (rows) × Drugs (columns)
Match “genes” subsets in two clusters
Phase-III of this research
![Page 9: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/9.jpg)
Part-I
Patterns in Vertically Distributed Databases
![Page 10: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/10.jpg)
Learning Decision Trees
D = D1 X D2 X . . . X Dn
- D is implicitly specified
Goal: Build a decision tree for the implicit D, using the explicit Di’s
D1 (A B C), D2 (C D E), ..., Dn (A E G)
Limitations:
- Can’t move Di’s to a common site (size / communication cost / privacy)
- Can’t update local databases
- Can’t send actual data tuples
Geographically distributed databases
Vertically Partitioned Dataset
![Page 11: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/11.jpg)
Explicit and Implicit Databases
Implicit Database
A B C D E F
1 6 1 1 2 1
1 6 1 1 2 3
1 6 2 1 1 2
1 2 2 1 1 2
2 6 1 1 2 1
2 6 1 1 2 3
SharedSet
A C
1 1
1 2
2 1
2 2
Explicit Component Databases
Node 1
A B C
1 2 2
1 6 2
1 6 1
2 6 1
Node 2
D C F
1 2 2
1 1 1
1 1 3
Node 3
A E C
1 1 2
1 2 1
2 2 1
![Page 12: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/12.jpg)
Decomposition of Computations
- Since D is implicit, for a computation $R = F(D)$:
- Decompose F into G and g’s: $R = G[g_1(D_1), g_2(D_2), \ldots, g_n(D_n)]$
- Decomposition depends on
- F
- Di’s and the set of shared attributes
D1 (A B C), D2 (C D E), ..., Dn (A E G)
![Page 13: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/13.jpg)
Count All Tuples in Implicit D
$R = \#\mathrm{tuples}(D)$
$R = \sum_{j=1}^{m} \prod_{i=1}^{n} N(D_i)_{cond_j}$
– $cond_j$: the j-th tuple in Shareds
– n: number of databases (Di’s)
– $N(D_i)_{cond_j}$: count of tuples in $D_i$ satisfying $cond_j$
– Local computation: $g_i(D_i) = N(D_i)_{cond_j}$
– G is a sum-of-products
– If each Di knows the “shared” values, then only one message per site is needed for #tuples
Shareds
A C
1 1
1 2
2 1
2 2
L shared attributes, k values each: $k^L$ tuples in Shareds (so $m = k^L$)
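The sum-of-products count can be sketched in Python. This is a minimal illustration, not the author's implementation: it assumes three component tables with attributes (A, B, C), (D, C, F) and (A, E, C), shared attributes A and C that each take the values 1 and 2, and non-shared attributes that live on exactly one node. The names `NODES`, `SHARED_VALUES`, `local_count` and `count_implicit_tuples` are hypothetical.

```python
from itertools import product
from math import prod

# Assumed component databases: (attribute names, rows).
NODES = [
    (("A", "B", "C"), [(1, 2, 2), (1, 6, 2), (1, 6, 1), (2, 6, 1)]),
    (("D", "C", "F"), [(1, 2, 2), (1, 1, 1), (1, 1, 3)]),
    (("A", "E", "C"), [(1, 1, 2), (1, 2, 1), (2, 2, 1)]),
]
# Assumed domains of the shared attributes.
SHARED_VALUES = {"A": (1, 2), "C": (1, 2)}


def local_count(node, cond):
    """g_i: rows of one node that agree with cond on the shared attributes it stores."""
    attrs, rows = node
    checks = [(attrs.index(a), v) for a, v in cond.items() if a in attrs]
    return sum(all(row[i] == v for i, v in checks) for row in rows)


def count_implicit_tuples(nodes, shared_values):
    """G: sum over shared-value combinations of the product of the local counts."""
    names = list(shared_values)
    total = 0
    for values in product(*(shared_values[a] for a in names)):
        cond = dict(zip(names, values))
        total += prod(local_count(node, cond) for node in nodes)
    return total


print(count_implicit_tuples(NODES, SHARED_VALUES))  # 6
```

Each node only reports counts, one per shared condition, so no data tuple leaves its site.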
![Page 14: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/14.jpg)
Learning Decision Trees
ID3 Algorithm
Consists of various counts only:
$E = \sum_{b} \frac{N_b}{N} \left( -\sum_{c} \frac{N_{bc}}{N_b} \log_2 \frac{N_{bc}}{N_b} \right)$
Split on attribute a = ? with b branches: a1, a2, ..., ab
c classes in the dataset
Nbc and Nb can be computed using g and G as for #tuples
- one message/database needed for computing each Entropy value
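Because the entropy depends only on the counts Nbc and Nb, it can be evaluated once those counts have been gathered. The following is a small sketch of the standard weighted-entropy computation used by ID3; the function name `split_entropy` and the example counts are illustrative assumptions, and the counts are taken as already computed.

```python
from math import log2


def split_entropy(counts):
    """Weighted entropy of a split; counts[b][c] is N_bc for branch b and class c."""
    n_total = sum(sum(branch) for branch in counts)
    entropy = 0.0
    for branch in counts:
        n_b = sum(branch)
        if n_b == 0:
            continue  # an empty branch contributes nothing
        branch_entropy = -sum(
            (n_bc / n_b) * log2(n_bc / n_b) for n_bc in branch if n_bc > 0
        )
        entropy += (n_b / n_total) * branch_entropy
    return entropy


# Two branches, two classes: one evenly mixed branch and one pure branch.
print(split_entropy([[2, 2], [4, 0]]))  # 0.5
```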
![Page 15: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/15.jpg)
Compute Covariance Matrix for D
• Covariance matrix for D
– Needed for eigenvectors/principal components
– Needs second order moments: $\sum_{t \in D} x_t y_t$
– Helps compute terms of the type: $E[(x_i - \bar{x}_i)(x_j - \bar{x}_j)]$
– This matrix can be computed at one of the databases
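The link between second order moments and covariance terms is the identity E[(x − x̄)(y − ȳ)] = E[xy] − x̄ȳ. A short sketch, with hypothetical attribute values and a hypothetical helper name `covariance_from_moments`, checks it against the direct definition:

```python
def covariance_from_moments(n, sum_x, sum_y, sum_xy):
    """Population covariance from N, sum(x), sum(y) and the second order moment sum(x*y)."""
    return sum_xy / n - (sum_x / n) * (sum_y / n)


# Hypothetical values of two attributes over six tuples.
xs = [1, 1, 1, 1, 2, 2]
ys = [1, 1, 2, 2, 1, 1]
n = len(xs)
from_moments = covariance_from_moments(
    n, sum(xs), sum(ys), sum(x * y for x, y in zip(xs, ys))
)

# Direct definition, for comparison.
mean_x, mean_y = sum(xs) / n, sum(ys) / n
direct = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

print(abs(from_moments - direct) < 1e-12)  # True
```

Only four aggregate numbers per attribute pair are needed, which is why the matrix can be assembled at a single site.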
![Page 16: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/16.jpg)
G-and-g Decomposition for 2nd order moments
• Sum of products for two attributes: $\sum_{t \in D} x_t y_t$
• Six different ways in which x and y may be distributed
– Each requires a different decomposition
– Case 1: x same as y; and x belongs to the SharedSet.
$\sum_j x_j^2 \cdot \mathrm{Count}(x_j \text{ in } D)$
– Case 2: x same as y; and x does not belong to the SharedSet.
$\sum_k \mathrm{Avg}(x^2 \text{ for } cond_k) \cdot \mathrm{Count}(cond_k)$
– Case 3: x and y both belong to the SharedSet.
$\sum_k x_k \cdot y_k \cdot \mathrm{Count}(cond_k \text{ in Shared})$
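As an illustration of Case 3, when both attributes are shared the sum of products needs only the number of implicit tuples per shared condition. The counts below are hypothetical and `sum_of_products_shared` is an illustrative name, not part of the original slides.

```python
def sum_of_products_shared(cond_counts):
    """Case 3: x and y are both shared, so sum(x*y) = sum_k x_k * y_k * Count(cond_k)."""
    return sum(x_k * y_k * count for (x_k, y_k), count in cond_counts.items())


# Hypothetical number of implicit tuples for each shared condition (x_k, y_k).
cond_counts = {(1, 1): 2, (1, 2): 2, (2, 1): 2, (2, 2): 0}
print(sum_of_products_shared(cond_counts))  # 10

# Cross-check against an explicit list of the same (x, y) tuples.
tuples = [(1, 1), (1, 1), (1, 2), (1, 2), (2, 1), (2, 1)]
print(sum(x * y for x, y in tuples))  # 10
```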
![Page 17: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/17.jpg)
Sum of Products
– Case 4: x belongs to SharedSet and y does not.
$\sum_j x_j \cdot y \cdot \mathrm{Count}(x = x_j)$
– Case 5: x, y don’t belong to the SharedSet and reside on different nodes.
• For each tuple t in SharedSet, obtain $x(t)$, $y(t)$
• and then $\sum_t \mathrm{Sum}(x(t)) \cdot \mathrm{Sum}(y(t))$
– Case 6: x, y don’t belong to the SharedSet and reside on the same node.
$\sum_t \mathrm{Prod}(t) \cdot \mathrm{Count}(t)$, where Prod(t) is the average of the product of x and y for cond-t of SharedSet
![Page 18: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/18.jpg)
Nearest Neighbor Algorithm
Find nearest neighbor of r1 in D1
• with virtual extensions in D for all tuples in D1
• Need to compute all pairwise distances
• The same distance values can be used for clustering algorithms
![Page 19: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/19.jpg)
Problem: Closed-Loops in Databases
![Page 20: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/20.jpg)
Extracting Communication Graph
The learner is D1
Covariance, k-NN, etc. algorithms developed for this situation
![Page 21: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/21.jpg)
Part-II
Subspace Clusters and Lattice Organization
![Page 22: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/22.jpg)
Clustering in Multi-Domains
• Example 3-D dataset with 4 clusters.
• Each cluster is in 2-D
• Points from two subspace clusters can be very close, making traditional clustering algorithms inapplicable.
• Overlapping between clusters
![Page 23: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/23.jpg)
Subspace Clustering
• “Interestingness” of a subspace cluster:
– Domain dependent / user defined
– Similarity-based clusters
![Page 24: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/24.jpg)
Subspace Clusters
• Number of Subspaces
TID A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
Number of subspaces: $\sum_{k=1}^{30} \binom{30}{k} = 2^{30} - 1$
![Page 25: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/25.jpg)
Nature of Real Datasets
1 0 1 1 0 0 0 1 1 1 1 0 0 0 0
0 0 0 1 1 1 1 1 1 0 0 0 0 0 0
0 1 0 1 0 1 0 1 1 1 0 0 0 1 1
1 0 0 0 0 0 0 0 1 1 1 1 0 0 1
1 1 0 0 0 0 1 1 1 1 1 1 0 0 0
0 0 0 0 0 0 0 0 0 1 1 1 1 1 0
0 1 0 0 0 0 0 1 1 1 1 1 0 0 1
0 1 0 1 0 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 1 1 1 1 1 1 1 0 0
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
5 6 3 4 5 1 2 5 4 6 5 7 6 7 5
6 8 8 9 9 9 7 6 5 4 3 2 1 2 3
4 3 2 1 2 3 4 5 6 7 8 7 6 5 6
7 8 7 6 5 6 7 8 9 0 9 8 7 6 0
0 9 8 0 9 8 7 6 5 4 3 2 3 4 5
4 3 4 5 0 0 0 0 0 9 8 7 6 5 4
3 4 5 6 5 4 3 4 3 4 3 6 3 7 2
7 3 9 0 7 0 1 5 3 4 6 5 4 3 7
3 9 6 3 9 0 0 5 4 0 4 3 2 2 2
2 7 8 9 0 9 8 7 6 5 6 7 8 7 6
5 4 3 4 5 6 7 4 2 8 5 9 5 7 2
4 6 4 6 7 7 8 4 6 3 3 1 0 0 1
1 5 5 5 4 4 7 7 8 9 6 4 3 2 0
6 0 7 6 8 4 5 7 3 3 3 4 7 6 8
6 7 6 9 2 5 3 7 5 1 0 4 8 3 5
Examples: Genes--Diseases; person-MovieRating; Document-TermFrequency
![Page 26: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/26.jpg)
Lattice of Subspaces: Formal Concept Analysis
Each subspace is listed with its Row Ids (∅ marks a subspace supported by no row):
null
A:124 B:123 C:1234 D:245 E:345
AB:12 AC:124 AD:24 AE:4 BC:123 BD:2 BE:3 CD:24 CE:34 DE:45
ABC:12 ABD:2 ABE:∅ ACD:24 ACE:4 ADE:4 BCD:2 BCE:3 BDE:∅ CDE:4
ABCD:2 ABCE:∅ ABDE:∅ ACDE:4 BCDE:∅
ABCDE:∅
Parallel to the ideas of Formal Concept Analysis
1. Need algorithms to find interesting subspace clusters
2. Lattice provides much more insight into the dataset.
![Page 27: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/27.jpg)
Clusters in Subspaces
Clusters in overlapping subspaces
a b c d e
1 1 1 1 0 1
2 1 1 1 0 1
3 1 0 1 1 1
4 0 0 1 1 1
5 1 0 1 1 1
Density = number of rows
- An anti-monotonic property
![Page 28: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/28.jpg)
Value of (anti)monotonic Properties
If AB falls below the needed density, then so do all its descendants.
null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
Pruned supersets of AB: ABC ABD ABE ABCD ABCE ABDE ABCDE
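The pruning argument can be sketched as a levelwise search in which a subspace becomes a candidate only if every subspace one attribute smaller is already dense. This is a generic Apriori-style sketch, not the author's algorithm; the rows are chosen to reproduce the Row Ids on the lattice slides (A: 124, B: 123, C: 1234, D: 245, E: 345), and `dense_subspaces` and `min_rows` are hypothetical names.

```python
from itertools import combinations

# Rows consistent with the Row Ids on the lattice slides: row id -> attributes present.
ROWS = {
    1: {"A", "B", "C"},
    2: {"A", "B", "C", "D"},
    3: {"B", "C", "E"},
    4: {"A", "C", "D", "E"},
    5: {"D", "E"},
}


def dense_subspaces(rows, min_rows):
    """Levelwise search: a k-subspace is tested only if all its (k-1)-subsets are dense."""
    def support(attrs):
        return sum(attrs <= present for present in rows.values())

    items = sorted(set().union(*rows.values()))
    level = {frozenset([a]) for a in items if support(frozenset([a])) >= min_rows}
    dense = set()
    while level:
        dense |= level
        # Join pairs of dense subspaces that differ in exactly one attribute.
        candidates = {x | y for x in level for y in level if len(x | y) == len(x) + 1}
        level = {
            c for c in candidates
            if all(frozenset(s) in dense for s in combinations(c, len(c) - 1))
            and support(c) >= min_rows
        }
    return dense


found = dense_subspaces(ROWS, min_rows=3)
print(sorted("".join(sorted(s)) for s in found))
# ['A', 'AC', 'B', 'BC', 'C', 'D', 'E']
```

With `min_rows=3`, AB has only two rows, so ABC is discarded without its rows ever being counted.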
![Page 29: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/29.jpg)
Maximal and Closed Subspaces
Each subspace is listed with its Row Ids (∅ marks a subspace supported by no row):
null
A:124 B:123 C:1234 D:245 E:345
AB:12 AC:124 AD:24 AE:4 BC:123 BD:2 BE:3 CD:24 CE:34 DE:45
ABC:12 ABD:2 ABE:∅ ACD:24 ACE:4 ADE:4 BCD:2 BCE:3 BDE:∅ CDE:4
ABCD:2 ABCE:∅ ABDE:∅ ACDE:4 BCDE:∅
ABCDE:∅
Minimum support = 2
# Closed = 9
# Maximal = 4
Closed and maximal: ABC, ACD, CE, DE
Closed but not maximal: C, D, E, AC, BC
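The counts of closed and maximal subspaces can be reproduced by brute force over all attribute subsets. The sketch below is an illustration only; it assumes the five rows implied by the Row Ids on this slide, and `frequent_closed_maximal` is a hypothetical helper.

```python
from itertools import combinations

# Rows implied by the Row Ids on the slide: row id -> attributes present.
ROWS = {
    1: {"A", "B", "C"},
    2: {"A", "B", "C", "D"},
    3: {"B", "C", "E"},
    4: {"A", "C", "D", "E"},
    5: {"D", "E"},
}


def frequent_closed_maximal(rows, min_support):
    """Return (closed, maximal) frequent attribute sets found by exhaustive enumeration."""
    items = sorted(set().union(*rows.values()))
    support = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            attrs = frozenset(combo)
            count = sum(attrs <= present for present in rows.values())
            if count >= min_support:
                support[attrs] = count
    # Closed: no proper superset has the same support.
    closed = {
        s for s in support
        if not any(s < t and support[t] == support[s] for t in support)
    }
    # Maximal: no proper superset is frequent at all.
    maximal = {s for s in support if not any(s < t for t in support)}
    return closed, maximal


closed, maximal = frequent_closed_maximal(ROWS, min_support=2)
print(len(closed), len(maximal))  # 9 4
```

Every maximal subspace is also closed, which is why the four maximal sets are counted among the nine closed ones.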
![Page 30: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/30.jpg)
Siblings and Parents in Lattice
Merge lattice nodes to find clusters of other properties
a b c d e
1 1 1 1 0 1
2 1 1 1 0 1
3 1 0 1 1 1
4 0 0 1 1 1
5 1 0 1 1 1
C1 =<{1,2,3,4,5}, {a,c,d,e}>
C2 =<{3,4,5}, {a,c,d,e}>
Siblings in Lattice
![Page 31: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/31.jpg)
Goal: Subspace Clusters with Properties
Anti-monotonic properties:
– Minimum density (C=<O,A>) := |O| / total number of objects in the data
eg: density(C=<{o2,o3,o4},{a.2}>) = 3/5 = 0.6
– Succinctness: density is strictly smaller than that of all of its minimum generalizations
eg: C1=<{o2,o3,o4},{a.2 c.2}> is not succinct; C2=<{o3,o4},{b.2 c.2}> is succinct
– Numerical properties (row-wise): “max”, “min”
eg: C2=<{o1,o5},{b.2 c.4}> satisfies “max > 3”
Weak anti-monotonic properties:
– “average>=δ”, “average<=δ”, “variance>=δ”, “variance<=δ”
eg: C2=<{o1,o5},{b.2 c.4}> satisfies “average >= 3”, but both C3=<{o1,o3,o4,o5},{b.2}> and C4=<{o1,o5},{b.2 c.4 d.2}> violate “average >= 3”
a b c d
o1 5 2 4 2
o2 2 1 2 2
o3 2 2 2 2
o4 2 2 2 2
o5 3 2 4 2
![Page 32: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/32.jpg)
Levelwise Search
• Pruning with weak anti-monotonic properties
eg: if C2=<{o1,o5},{b.2 c.4}> satisfies “average >= 3”, then o1 and o5 must be contained in at least one of its minimum generalizations that satisfies this constraint: C5=<{o1,o5},{c.4}>
• If an object is not contained in any cluster of size k that satisfies a weak anti-monotonic property, it cannot be contained in any cluster of size k+1 that satisfies this property
a b c d
o1 5 2 4 2
o2 2 1 2 2
o3 2 2 2 2
o4 2 2 2 2
o5 3 2 4 2
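The behaviour of a weak anti-monotonic property on the table above can be checked with a few lines of Python. This is a sketch under the assumption that a cluster is described by a pattern of attribute.value pairs, so every row of the cluster equals the pattern; `objects_of`, `average` and `minimum_generalizations` are hypothetical helper names.

```python
# The example table: object -> attribute values.
DATA = {
    "o1": {"a": 5, "b": 2, "c": 4, "d": 2},
    "o2": {"a": 2, "b": 1, "c": 2, "d": 2},
    "o3": {"a": 2, "b": 2, "c": 2, "d": 2},
    "o4": {"a": 2, "b": 2, "c": 2, "d": 2},
    "o5": {"a": 3, "b": 2, "c": 4, "d": 2},
}


def objects_of(pattern):
    """Objects whose values match every attribute.value pair in the pattern."""
    return {o for o, row in DATA.items() if all(row[a] == v for a, v in pattern.items())}


def average(pattern):
    """Row-wise average of a cluster; all rows equal the pattern, so one row suffices."""
    return sum(pattern.values()) / len(pattern)


def minimum_generalizations(pattern):
    """Patterns obtained by dropping exactly one attribute.value pair."""
    if len(pattern) < 2:
        return []
    return [{a: v for a, v in pattern.items() if a != drop} for drop in pattern]


c2 = {"b": 2, "c": 4}
print(average(c2) >= 3)  # True
print([g for g in minimum_generalizations(c2) if average(g) >= 3])  # [{'c': 4}]
print(average({"b": 2, "c": 4, "d": 2}) >= 3)  # False
```

Only one of the two generalizations of C2 keeps the average at or above 3, which is exactly what "weak" anti-monotonicity permits.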
![Page 33: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/33.jpg)
Levelwise Search for Subspace Clusters
• Anti-monotonic & Weak Anti-monotonic
– Candidate generation based on anti-monotonic properties only
– Data reduction based on weak anti-monotonic properties, such as: “mean>=δ”, “mean<=δ”, “variance>=δ”, “variance<=δ”
![Page 34: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/34.jpg)
Performance Comparison
• Optimizing Techniques
– Sorting the attributes
– Reuse previous results
– Stack of unpromising branches
– Check closure property
![Page 35: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/35.jpg)
Distributed Subspace Clustering
• Discover closed subspace clusters from databases located at multiple sites
• Objectives:
– Minimize local computation cost
– Minimize communication cost
![Page 36: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/36.jpg)
Distributed Subspace Clustering
• Horizontally Partitioned Data
DS
a b c d
1 0 0 1 1
2 1 0 1 1
3 1 1 1 0
4 0 0 1 1
5 1 1 0 0
D1
a b c d
1 0 0 1 1
2 1 0 1 1
3 1 1 1 0
D2
a b c d
4 0 0 1 1
5 1 1 0 0
![Page 37: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/37.jpg)
Distributed Subspace Clustering
List of Closed Subspace Clusters
DS: <c,1234> <cd,124> <a,235> <ac,23> <acd,2> <ab,35> <abc,3>
D1: <c,123> <cd,12> <ac,23> <acd,2> <abc,3>
D2: <cd,4> <ab,5>
Lemma 1: All locally closed attribute sets are also globally closed
Lemma 2: Intersection of two locally closed attribute sets from two different sites is globally closed eg: a = ac ∩ ab
![Page 38: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/38.jpg)
Distributed Subspace Clustering
List of Closed Subspace Clusters
Compute the object set:
1. Closed at both partitions: compute the union of the two object sets
eg: cd
2. Closed in one of the partitions: the union of two object sets whose attribute sets’ intersection equals the target attribute set
eg: c = c ∩ cd
3. Not closed in any of the partitions: similar to case 2
eg: a = ac ∩ ab
![Page 39: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/39.jpg)
Distributed Subspace Clustering
List of Closed Subspace Clusters
Problem: in both case 2 and case 3, the same attribute set can arise from more than one intersection
eg: a = ac ∩ ab and a = acd ∩ ab
Solution: for each globally closed attribute set, keep track of the largest object set (or size of the object set)
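Lemma 1, Lemma 2 and the "largest object set" rule can be combined into a small merge step for two sites. The sketch below is an illustration under simplifying assumptions, not the author's distributed protocol: it ignores the empty attribute set, applies no density constraint, and computes the local closed sets by plain intersection of rows. The names `closed_sets` and `merge_sites` are hypothetical.

```python
def closed_sets(rows):
    """Map every non-empty closed attribute set of a binary table to its object set."""
    closed = set()
    for attrs in rows.values():
        attrs = frozenset(attrs)
        # Closed attribute sets are exactly the intersections of row attribute sets.
        closed |= {attrs} | {attrs & c for c in closed}
    closed.discard(frozenset())
    return {c: frozenset(o for o, a in rows.items() if c <= a) for c in closed}


def merge_sites(site1, site2):
    """Combine locally closed sets of two sites into globally closed sets."""
    merged = {}

    def offer(attrs, objects):
        # Keep the largest object set seen for each attribute set.
        if attrs and len(objects) > len(merged.get(attrs, ())):
            merged[attrs] = objects

    for local in (site1, site2):  # Lemma 1: locally closed sets stay closed
        for attrs, objects in local.items():
            offer(attrs, objects)
    for x, objs_x in site1.items():  # Lemma 2: cross-site intersections are closed
        for y, objs_y in site2.items():
            offer(x & y, objs_x | objs_y)
    return merged


# The two horizontal partitions from the slides: object -> attributes with value 1.
D1 = {1: set("cd"), 2: set("acd"), 3: set("abc")}
D2 = {4: set("cd"), 5: set("ab")}

merged = merge_sites(closed_sets(D1), closed_sets(D2))
print(merged == closed_sets({**D1, **D2}))  # True
```

For attribute set a, the pair ac ∩ ab offers objects {2,3,5} while acd ∩ ab offers only {2,5}; keeping the larger set gives the correct cluster <a,235>.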
![Page 40: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/40.jpg)
Distributed Subspace Clustering
Density Constraint: δ >= 0.6
DS: F: <c,1234> <cd,124> <a,235>; E: <ac,23> <acd,2> <ab,35> <abc,3>
D1: F1: <c,123> <cd,12> <ac,23>; E1: <acd,2> <abc,3>
D2: F2: (empty); E2: <cd,4> <ab,5>
Observation: the intersection of two elements that both come from Ei’s cannot have enough density
Efficient Computation: sort Fi and Ei in decreasing order of density
![Page 41: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/41.jpg)
Distributed Subspace Clustering
R R1 R2
<c,1234>
<cd,124>
<a,235>
<c,123>
<cd,12>
<ac,23>
<c,1234>
<cd,124>
<a,235>
![Page 42: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/42.jpg)
Distributed Subspace Clustering
• Generalize to k>2
– k sites need k-step communication and computation
– k sites have k types:
![Page 43: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/43.jpg)
Distributed Subspace Clustering
• K=3
![Page 44: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/44.jpg)
Distributed Subspace Clustering
![Page 45: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/45.jpg)
Distributed Subspace Clustering
![Page 46: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/46.jpg)
Part-III
Multi-Domain Clusters
![Page 47: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/47.jpg)
Introduction
Traditional clustering
Bi-Clustering
3-Clustering
![Page 48: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/48.jpg)
Why 3-clusters?
• Correspondence between bi-clusters of two different lattices
• Sharpen local clusters with outside knowledge
• Alternative? “Join datasets then search”
– Does not capture underlying interactions
– Inefficient
– Not always possible
![Page 49: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/49.jpg)
Formal Definitions
Bi-cluster in Di
3-Cluster across D1 and D2
Pattern in Di
![Page 50: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/50.jpg)
Defining 3-clusters
• D1 is the “learner”
• Maximal rectangle of 1's under suitable permutation in learner
• Best correspondence to a rectangle of 1's in D2
D1D1
![Page 51: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/51.jpg)
Cluster Quality Measure
• Intuition: Maximize number of 1's while also maximizing number of items and objects
• Trade-off between objects and items
– More items, fewer objects
– More objects, fewer items
![Page 52: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/52.jpg)
Quality Measure
–Consider bi-clusters in learner alone
I1
O C1
C2
• Which is preferable?
• User decides
![Page 53: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/53.jpg)
Measure of Cluster Quality
• Quality measure:
– Monotonic in both width and height
– Balances width and height according to a user-defined parameter
• Introduce β: the width (attributes) the user is willing to trade for a single unit of height (objects)
![Page 54: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/54.jpg)
Cluster Quality Measure
![Page 55: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/55.jpg)
Cluster Quality for 3-clusters
• Utilize same intuition
• Width of a 3-cluster is the sum of the individual widths
![Page 56: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/56.jpg)
Selecting β
• Larger values yield 3-clusters that are “wide” and “short” in both D1 and D2
– Cluster key websites popular with a large number of democrats and republicans
• Smaller values produce 3-clusters that are “narrow” and “long”
– Discover a long list of websites utilized by a few select democrats and republicans
![Page 57: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/57.jpg)
3-Clu: Our Algorithm
• Search for 3-clusters similar to search for closed itemsets
• How to formulate the search space?
– Assumption that objects outnumber attributes may not hold
– Several possible orderings of the search space
![Page 58: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/58.jpg)
3-Clu Algorithm
![Page 59: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/59.jpg)
3-Clu Algorithm
• Define search space with primacy to objects
• Only need to maintain one search tree
• Mimic closed itemset algorithm with simultaneous pruning of search space
• Prune with quality measure
![Page 60: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/60.jpg)
3-Clu Algorithm
![Page 61: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/61.jpg)
Algorithm
• Quality measure is neither monotone nor anti-monotone in the search space
• Pruning is still possible
Is C2 of higher quality ?
![Page 62: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/62.jpg)
Algorithm
![Page 63: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/63.jpg)
Experimental Results
Chess Connect GO-Pheno
![Page 64: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/64.jpg)
Experimental Results
• Test validity of 3-clusters
• Randomly partitioned Mushrooms dataset by attributes
![Page 65: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/65.jpg)
Discriminating Clusters
• Key question: What sets of attributes and /or objects most distinguish the incremental bi-clusters from each other?
• Incremental bi-clusters that only differ slightly may be a result of noise or human error
• Prioritize relationships among incremental bi-clusters
A B C D
1 0 1 0 1
2 1 1 0 0
3 0 1 1 1
4 0 0 1 1
<{}, {A,B,C,D}>
<{2}, {A,B}>
<{1,2,3,4}, {}>
<{3,4}, {C,D}>
<{1,2,3}, {B}>
<{1,3,4}, {D}>
<{1,3}, {B,D}>
![Page 66: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/66.jpg)
Motivation
tf1 tf2 tf3 tf4
g1 1 0 1 1
g2 1 1 1 0
g3 0 1 0 1
g4 1 0 0 0
g5 1 0 0 1
• Functional genomics
• Interactions between genes and transcription factors
• Comparing each bi-cluster in the lattice tells us the difference in activation of genes/TFs that transform cellular processes
• Prioritize relationships
![Page 67: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/67.jpg)
Related Work
• Emerging patterns [Dong, Li]
– Only consider ratio of support between frequent itemsets
– Supervised technique
• Contrast sets [Bay, Pazzani]
– Also supervised
– Special case of rule discovery
• Closed Itemset algorithms [Zaki, Uno, Bian]
– Efficient
– Do not explicitly enumerate lattice structure
![Page 68: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/68.jpg)
Problem Formulation
• Challenges
– Enumerating bi-clusters and forming the lattice is known to be an NP-complete problem
– Discover distinguishing sets during the mining process, as opposed to in a post-processing step
– How to quantify distinction
![Page 69: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/69.jpg)
Problem Formulation
• Dataset D=(O,A,R)
• Consider a set of objects X
• X’ is defined as all attributes common to all objects of X
• Y’ is defined dually for a set of attributes Y
• Bi-cluster: (X,Y) s.t. X’ = Y and Y’ = X
![Page 70: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/70.jpg)
Problem Formulation
A B C D
1 0 1 0 1
2 1 1 0 0
3 0 1 1 1
4 0 0 1 1
• Sample bi-cluster
• <{3,4},{C,D}>
• Bi-clusters are equivalent to
• Maximal rectangles of 1s under suitable permutation
• Maximal bi-cliques
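The bi-cluster condition X’ = Y and Y’ = X can be tested directly on the 4 × 4 example. The sketch below uses hypothetical helper names (`common_attributes`, `common_objects`, `is_bicluster`) and treats the prime operator of an empty object set as the full attribute set.

```python
# The example dataset: object -> set of attributes with value 1.
D = {1: {"B", "D"}, 2: {"A", "B"}, 3: {"B", "C", "D"}, 4: {"C", "D"}}
ALL_ATTRS = {"A", "B", "C", "D"}


def common_attributes(objects):
    """X': attributes shared by every object in X (all attributes if X is empty)."""
    result = set(ALL_ATTRS)
    for o in objects:
        result &= D[o]
    return result


def common_objects(attrs):
    """Y': objects that have every attribute in Y."""
    return {o for o, present in D.items() if set(attrs) <= present}


def is_bicluster(objects, attrs):
    """(X, Y) is a bi-cluster when X' = Y and Y' = X."""
    return common_attributes(objects) == set(attrs) and common_objects(attrs) == set(objects)


print(is_bicluster({3, 4}, {"C", "D"}))  # True
print(is_bicluster({1, 3}, {"B"}))       # False: {B}' is {1, 2, 3}
```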
![Page 71: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/71.jpg)
Problem Formulation
• Set of bi-clusters forms a complete lattice
• Model lattice as weighted directed graph
• Weights represent degree of distinction
• Each edge represents a distinguishing set
• Grow maximum cost spanning tree
![Page 72: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/72.jpg)
Problem Formulation
• Quantifying distinction:
– View bi-clusters as maximal rectangles of 1s under suitable permutation
– Consider both change in width and height when computing distinction
– Choose a shape metric s (e.g. area, ratio of height to width, etc.)
– Quantify distinction as degree of shape change along a path in the lattice
A B C D
1 0 1 0 1
2 1 1 0 0
3 0 1 1 1
4 0 0 1 1
![Page 73: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/73.jpg)
Problem Model
Compute partial derivatives as forward differences
Example edge in the lattice: <{2},{A,B}> → <{1,2,3},{B}>
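The slide does not give the exact formula, so the following Python sketch is only one plausible reading: it treats a bi-cluster as a rectangle with height = number of objects and width = number of attributes, uses area as the shape metric s, and takes forward difference quotients of s in each direction. The function names and the choice of area are assumptions made here for illustration.

```python
def shape(bicluster):
    """Return (height, width) = (number of objects, number of attributes)."""
    objects, attributes = bicluster
    return len(objects), len(attributes)


def area(height, width):
    """One possible shape metric s; the slides also mention height/width ratio."""
    return height * width


def forward_partials(lower, upper, metric=area):
    """Forward-difference estimates of ds/dh and ds/dw along one lattice edge.

    ASSUMPTION: each partial varies one dimension while holding the other
    at its value in the lower bi-cluster. A zero step yields 0.0.
    """
    h0, w0 = shape(lower)
    h1, w1 = shape(upper)
    dh, dw = h1 - h0, w1 - w0
    base = metric(h0, w0)
    ds_dh = (metric(h1, w0) - base) / dh if dh else 0.0
    ds_dw = (metric(h0, w1) - base) / dw if dw else 0.0
    return dh, dw, ds_dh, ds_dw


lower = ({2}, {"A", "B"})
upper = ({1, 2, 3}, {"B"})
print(forward_partials(lower, upper))  # (2, -1, 2.0, 1.0)
```

Moving up this edge adds two objects and drops one attribute. How the two partials are combined into a single edge weight is not stated in the slides and is left open here.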
![Page 74: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/74.jpg)
Our Algorithm (MIDS)
• Input: Dataset D
• Output: Maximum cost spanning tree of bi-cluster lattice of D
• Computational challenge: enumerating bi-cluster lattice and growing tree simultaneously
![Page 75: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/75.jpg)
Our Algorithm (MIDS)
• Most min/max cost spanning tree algorithms assume availability of graph
• Prim’s algorithm depends on the Cut set
• Intuitive idea: Grow bi-cluster lattice incrementally and maintain the Cut set
![Page 76: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/76.jpg)
Our Algorithm (MIDS)
• Prim’s grows sequence of trees
• Denote set of edges between bi-cluster c and all upper neighbors of c that do not appear in current tree by
• Df
• Dynamically compute Cut, while enumerating new bi-clusters
![Page 77: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/77.jpg)
Our Algorithm (MIDS)
1. Choose starting bi-cluster c (usually infimum)
2. Compute cut set by generating upper neighbors of c together with update equation
3. Compute weight of edges between c and upper neighbors
4. Greedily choose maximum cost edge and associated concept d
5. Set c=d, repeat steps 2-5 until all reachable bi-clusters visited
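The steps above can be sketched as a Prim-style loop that generates the lattice on the fly. This is an illustrative Python sketch under stated assumptions, not the MIDS implementation: the edge weight (absolute change in rectangle area) is assumed, upper neighbors are found by a simple minimality filter rather than Lindig's test, and all function names are hypothetical.

```python
import heapq

# Example relation: object -> attributes marked 1.
R = {1: {"B", "D"}, 2: {"A", "B"}, 3: {"B", "C", "D"}, 4: {"C", "D"}}
OBJECTS = frozenset(R)
ATTRIBUTES = frozenset("ABCD")


def intent(X):
    """Attributes common to all objects in X."""
    out = set(ATTRIBUTES)
    for o in X:
        out &= R[o]
    return frozenset(out)


def extent(Y):
    """Objects that have every attribute in Y."""
    return frozenset(o for o in OBJECTS if Y <= R[o])


def closure(X):
    return extent(intent(X))


def upper_neighbors(X):
    """Extents of upper neighbors: minimal sets among closure(X | {o}), o not in X."""
    candidates = {closure(X | {o}) for o in OBJECTS - X}
    return [c for c in candidates if not any(d < c for d in candidates)]


def weight(low, up):
    """ASSUMED weight: absolute change in area (objects x attributes)."""
    return abs(len(up) * len(intent(up)) - len(low) * len(intent(low)))


def grow_max_spanning_tree():
    start = closure(frozenset())   # infimum of the lattice
    in_tree = {start}
    tree = []
    cut = []                       # max-heap via negated weights
    counter = 0                    # tie-breaker, so frozensets are never compared

    def push_edges(c):
        nonlocal counter
        for d in upper_neighbors(c):
            if d not in in_tree:
                heapq.heappush(cut, (-weight(c, d), counter, c, d))
                counter += 1

    push_edges(start)
    while cut:
        neg_w, _, c, d = heapq.heappop(cut)
        if d in in_tree:
            continue               # stale edge: d was reached through another edge
        in_tree.add(d)
        tree.append((c, d, -neg_w))
        push_edges(d)              # extend the cut set with newly enumerated neighbors
    return tree


for low, up, w in grow_max_spanning_tree():
    print(sorted(low), "->", sorted(up), "weight", w)
```

The cut set is kept in a heap and grows only when a bi-cluster joins the tree, so no part of the lattice is built before it is needed.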
![Page 78: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/78.jpg)
Algorithms
Example dataset:
  A B C D
1 0 1 0 1
2 1 1 0 0
3 0 1 1 1
4 0 0 1 1
Bi-clusters shown in the lattice diagram:
<{},{A,B,C,D}>
<{2},{A,B}>
<{3,4},{C,D}>
<{1,3},{B,D}>
<{1,2,3},{B}>
<{1,3,4},{D}>
<{1,2,3,4},{}>
![Page 79: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/79.jpg)
Algorithms
• Major computational cost is computing upper neighbors of a bi-cluster
• Theorem: Let <X,Y> be a concept. Then (X ∪ {o})’’ is the set of objects of an upper neighbor of <X,Y> if and only if for all z ∈ (X ∪ {o})’’ – X the following holds: (X ∪ {z})’’ = (X ∪ {o})’’
• Lindig’s algorithm implements this theorem
• Algorithm performs a local computation of bi-clusters and upper neighbors
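The theorem's test can be written down almost literally. The Python sketch below applies it to the four-object example used earlier in the slides; it illustrates the stated condition only and is not Lindig's full algorithm or the improved version. The function names are assumptions.

```python
# Example relation: object -> attributes marked 1.
R = {1: {"B", "D"}, 2: {"A", "B"}, 3: {"B", "C", "D"}, 4: {"C", "D"}}
OBJECTS = frozenset(R)
ATTRIBUTES = frozenset("ABCD")


def closure(X):
    """X'': objects that have every attribute common to the objects in X."""
    common = set(ATTRIBUTES)
    for o in X:
        common &= R[o]
    return frozenset(o for o in OBJECTS if common <= R[o])


def upper_neighbor_extents(X):
    """Object sets of the upper neighbors of the concept whose extent is X.

    For each object o outside X, the candidate closure(X | {o}) is accepted
    only if every z in candidate - X generates the same closure.
    """
    X = frozenset(X)
    found = set()
    for o in OBJECTS - X:
        candidate = closure(X | {o})
        if all(closure(X | {z}) == candidate for z in candidate - X):
            found.add(candidate)
    return found


print(sorted(sorted(e) for e in upper_neighbor_extents({2})))  # [[1, 2, 3]]
```

For the concept <{2},{A,B}>, adding object 4 gives the closure {1,2,3,4}, but z = 1 generates {1,2,3} instead, so that candidate is rejected. Only {1,2,3} survives, and the computation touches nothing beyond the concept's immediate neighborhood.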
![Page 80: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/80.jpg)
Algorithms
• Improved Lindig’s algorithm, practical running time
• Adapted to enumerate only “large” bi-clusters
• Reduced number of set intersections performed
• Theoretical complexity remains the same
• Overall complexity of MIDS
– E: total number of edges
– N: number of bi-clusters
– O: number of rows
– A: number of columns
![Page 81: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/81.jpg)
Experimental Results
• Compared solution to Zaki’s CHARM-L algorithm
– CHARM-L enumerates bi-clusters, and organizes into lattice structure
– Not incremental: hard to adapt to Prim’s algorithm
– Added post processing step to grow MCST
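When the whole lattice is already available, as in the CHARM-L baseline, a maximum cost spanning tree can be grown afterwards with any standard method. The slides do not say which one was used; the Python sketch below uses Kruskal's algorithm with edges sorted by decreasing weight, on a small made-up graph whose node names and weights are purely illustrative.

```python
def max_cost_spanning_tree(nodes, edges):
    """Kruskal's algorithm, taking the heaviest edges first.

    edges: iterable of (weight, u, v), treated as undirected.
    """
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    tree = []
    for w, u, v in sorted(edges, key=lambda e: e[0], reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:                       # edge joins two components: keep it
            parent[ru] = rv
            tree.append((u, v, w))
    return tree


# Hypothetical four-node lattice with illustrative weights.
nodes = ["bottom", "c1", "c2", "top"]
edges = [(3, "bottom", "c1"), (1, "bottom", "c2"), (2, "c1", "top"), (4, "c2", "top")]
tree = max_cost_spanning_tree(nodes, edges)
print(sum(w for _, _, w in tree))  # 9
```

This needs the full edge list up front, which is the cost the incremental approach is designed to avoid.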
![Page 82: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/82.jpg)
Experimental Results
![Page 83: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/83.jpg)
Experimental Results
• Experimented with synthetic datasets to find most distinguished incremental bi-clusters
• Preliminary experiments conducted with clearly distinguishable incremental bi-clusters and random noise
• Next planted several large incremental bi-clusters that differed only slightly as a result of noise
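A synthetic binary dataset of this kind can be produced by planting blocks of 1s and then flipping cells at random. The Python sketch below is a generic generator written for illustration; the slides do not describe the actual generator, so the block layout, noise model and parameter names are all assumptions.

```python
import random


def planted_matrix(n_rows, n_cols, blocks, noise, seed=0):
    """Binary matrix with planted all-1 blocks and independent bit-flip noise.

    blocks: list of (row_indices, col_indices) pairs to fill with 1s.
    noise:  probability that any single cell is flipped afterwards.
    """
    rng = random.Random(seed)
    m = [[0] * n_cols for _ in range(n_rows)]
    for rows, cols in blocks:
        for r in rows:
            for c in cols:
                m[r][c] = 1
    for r in range(n_rows):
        for c in range(n_cols):
            if rng.random() < noise:
                m[r][c] ^= 1               # flip 0 <-> 1
    return m


# Two overlapping planted blocks: the second adds rows and drops columns.
blocks = [(range(0, 3), range(0, 6)), (range(0, 5), range(0, 4))]
for row in planted_matrix(8, 8, blocks, noise=0.05, seed=1):
    print("".join(str(v) for v in row))
```

Fixing the seed keeps runs repeatable, which helps when comparing how much noise a distinction measure tolerates.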
![Page 84: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/84.jpg)
Experimental Results
• Region 1 is region of interest, clearly distinct
• Noise added to region 1, while regions 2 and 3 contain minimal distinction
![Page 85: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/85.jpg)
Experimental Results
![Page 86: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/86.jpg)
Computer_Science@UC
A Vision
Raj Bhatnagar
![Page 87: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/87.jpg)
CS Department @ UC
Department Goals:
• Train CS graduates to meet technical manpower needs in Ohio, and the world at large.
• Contribute to the creation of scholarly knowledge through research
Needed Features:
– Strong Research and Graduate program
– Strong Undergraduate program
– Good visibility and ranking in research communities
– Good reputation in 250-mile region for UG program
![Page 88: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/88.jpg)
Research and Graduate Program
[Diagram: CS Dept. Faculty in Core CS Areas, surrounded by collaborating units and topics: BME, CHMC (Bio-Informatics); A&S (GIS, Sciences); Engineering, Business (Robotics, BI); Mathematics (Security, Data Mining)]
• CS is an Enabling Science
• Can catalyze research in Science and Engineering
• UC needs a stronger CS program
• Core CS areas to be covered in CS Dept., Foundational research
• Collaborations to be built with other UC departments for research
![Page 89: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/89.jpg)
Core Computing Faculty
Current Number: 12
Areas of Strength:
– Algorithms and CS Theory
– Networks and Communications
– Machine Learning, AI, and Data Mining
– HCI
Gaps in Strength/Coverage
– Programming Languages/Compilers
– Software Engineering
– Databases
– Computer Systems and Networks
Target Number for a Strong CSD: 18
![Page 90: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/90.jpg)
Potential Collaborations
Bio-Informatics
– Have been research contacts
– Strong presence of Bioinformatics activity at UC
– Need new CS faculty with matching interests
Mathematics
– Computer Security, quantum computing
– Data Mining, privacy preserving operations on data
GIS
– Very active and growing in A&S
– Almost no contact with CS faculty; great research potential
Robotics
– ME has good activity; a recent hire from Duke University
– Great student interest; grad and undergrad
Business
– Strength in databases and interest in collaborations
CAS
– Potential for a minor in IT for CS students and vice-versa
![Page 91: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/91.jpg)
Undergraduate Program
• High priority to increase enrollment/retention
– Translates into more resources now
– Alumni are potential donors
– Need to advertise our strengths on the web and in the local area
– Make available electives more interesting, drawing on new faculty and also on collaborating departments
– Increase social capital and sense of belonging within the CS student body
• Strengthen Capstone projects
– Sponsored projects by local industry (item for IAB)
– Advertise them on department website
– Awards for top projects (sponsored by industry)
• Student Experience while at UC
– Support ACM, EEE, and LARC groups for an enriching experience
– Increase UG Research experience opportunities
– Seek funds to provide more scholarships
![Page 92: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/92.jpg)
Graduate Program
Marshal Resources to support Ph.D. students
– Help faculty’s attempts to seek sponsored research
– Seek industry help for sponsored research
– Seek funds to support student travel to conferences
Enhance Students’ Quality of Experience at UC
– More interactions with faculty/students
– More graduate courses
– More support for organizing/travelling to events/conferences
![Page 93: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/93.jpg)
Resources
Need serious efforts for raising resources
– Faculty lines from UC
• Need to reach critical mass for CSD
• PBB is a hurdle
• College support/commitment is needed
– CS Endowment Account (Our money, that can’t be cut!)
• Approach alumni for donations
• Seek industry support/sponsorship for scholarships, seminar speakers
– Sponsored research awards from NSF
• New administration has significantly enhanced funds for NSF
• Support faculty in efforts to seek these funds
– Seek more UGA funds from CoE
• Difficult for now; but must continue efforts
![Page 94: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/94.jpg)
Phase-I Summary
Main Results:
• Algorithms for complex operations that work with implicit databases
– Decision Trees, Association Rules, Covariance matrix
– K-NN neighbors
– Hierarchical and Sequential Clustering
• Algorithms for distributed control of multi-agent systems
– Distributed Multi-Agent Reorganization
Open Research Issues:
– We preserve data privacy, but we need a formal model and analysis of privacy
– Mining of streaming data at multiple cooperating nodes
![Page 95: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/95.jpg)
Phase-I Research Participants:
– Ph.D. dissertations
• Ahmed Khedr, 2003
• Barrington Young, 2007
• Eric Matson, 2008
– M.S. theses
• Shriram Srinivasan, 1997
• Sanjeev Beemidi, 1998
• Harpreet Singh, 2000
• Rahul Dasgupta, 2000
• Susmit Kumar, 2002
• Rishi Jhaver, 2003
• Chris Calendar, 2004
• Michael Kinsey, 2005
• Kaustubh Shinde, 2006
![Page 96: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/96.jpg)
Phase-I Publications:
1. Ahmed Khedr and Raj Bhatnagar. Agents for Integrating Distributed Data for Complex Computations. Computing and Informatics Journal, Vol. 26, 2007, 149-170.
2. Eric T. Matson, Raj Bhatnagar. Knowledge Sharing Between Agents in a Transitioning Organization. Proceedings of the COIN 2007 published as book Coordination, Organizations, Institutions, and Norms in Agent Systems - III, Springer Verlag, 2007, pp. 187-202.
3. Eric Matson and Raj Bhatnagar. Properties of Capability Based Agent Organization Transition, Proceedings of the Intelligent Agent Technologies (IAT 2006) conference held in Hong Kong in December 2006.
4. Barrington Young and Raj Bhatnagar. Secure K-NN Algorithm for Distributed Databases, Proceedings of the Privacy Security and Trust Conference, 2006, pp. 485-490.
5. Ahmed Khedr and Raj Bhatnagar, Decomposable Algorithms for Minimum Spanning Tree, Presented at the International Workshop on Distributed Computing, December 2003, Springer Verlag notes on Computer Science, vol. 2918.
6. Raj Bhatnagar, Sriram Srinivasan. Pattern Discovery in Distributed Databases. Proceedings of the AAAI-97 Conference held at Providence, RI, in July 1997, pp. 503-508.
![Page 97: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/97.jpg)
Phase-II Summary
Main Results:
• Subspace Clustering Algorithms
– Efficient Lattice-based search
– Use of novel monotonic conditions to control search
• Distributed mining across multiple lattices
– Only for horizontally partitioned datasets
• Applications of Results
– Genomic, document collections
Open Research Issues:
– Clusters for Non-binary datasets
– Approximately closed clusters
![Page 98: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/98.jpg)
Phase-II Research Participants:
– Ph.D. dissertations
• Haiyun Bian, 2006
• Amit Sinha, 2008
– M.S. theses
• Gautam Kurra, 2002
• Anshuman Rajshiva, 2004
• Ramya Ashok, 2005
• Aparna Yardi, 2006
• Aravind Kumar, 2006
• Shriram Narayanswami, 2007
• Mrunal Deshmukh, 2008
– Collaborations:
• CHMC
![Page 99: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/99.jpg)
Phase-II Publications:
1. Shriram Narayanswamy, Raj Bhatnagar. A Lattice-Based Model for Recommender Systems. Proceedings of the International Conference on Tools with Artificial Intelligence (ICTAI 2008) pp. 349-356.
2. Haiyun Bian, Raj Bhatnagar. An Algorithm for Mining Weighted Dense Maximal 1-Complete Regions. Data Mining: Foundations and Practice, 31-48, Springer Verlag, 2008.
3. Haiyun Bian, Raj Bhatnagar, and Barrington Young. An Efficient Constraint-Based Closed Set Mining Algorithm. Proceedings of the International Conference on Machine Learning and Applications (ICMLA 2007), pp. 67-72.
4. Barrington Young, Raj Bhatnagar, Giridhar Tatavarty, and Haiyun Bian. Covariance matrix Computations with Federated databases. Proceedings of the International Conference on Machine Learning and Applications (ICMLA 2007), pp. 172-177.
5. Haiyun Bian, Raj Bhatnagar: Efficiently Mining Maximal 1-complete Regions from Dense Datasets. ICDM Workshop on Foundations of data Mining 2006, Proceedings of ICDM Workshops, pp 423-427
6. Haiyun Bian and Raj Bhatnagar. Towards More Supervised Subspace Clustering, Proceedings of the MAICS 2006 conference, held in Valparaiso, OH April 2006.
7. Arvind Muthukrishnan and Raj Bhatnagar. Concept-based Organization and Retrieval of Technical Documents. Proceedings of the MAICS2006, Valparaiso, OH April 2006.
8. Haiyun Bian and Raj Bhatnagar. An Algorithm for Lattice-Structured subspace clustering, Proceedings of the SIAM International Conference on Data Mining, April 2005.
![Page 100: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/100.jpg)
Phase-III Summary
Main Results:
• 3-Clustering Algorithm
– Efficient search algorithm
– Bioinformatics Application, Genomic datasets
• Most Discriminating subsets
– Efficient algorithm
Open Research Issues:
– Multi-domain datasets with closed loop relationships
– Diagonal band patterns
![Page 101: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/101.jpg)
Phase-III Research Participants:
– Ph.D. dissertations
• Faris Alqadah, 2010 (very likely)
– Collaborations: CHMC
![Page 102: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/102.jpg)
Phase-III Publications:
1. Faris Alqadah and Raj Bhatnagar. Discovering Substantial Distinctions among Incremental Bi-Clusters, To be presented at the SIAM International Conference on Data Mining (SDM 09) in April 2009.
2. Faris Alqadah and Raj Bhatnagar. An effective algorithm for mining 3-clusters in vertically partitioned data. Proceedings of the CIKM 2008, 1103-1112.
3. Faris Alqadah, Raj Bhatnagar. Detecting significant distinguishing sets among bi-clusters. Proceedings of the CIKM 2008.
![Page 103: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/103.jpg)
Conclusions
• Introduced quantitative measure of distinction among incremental bi-clusters
• Developed efficient algorithm for enumerating bi-clusters and growing maximum cost spanning tree simultaneously
![Page 104: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/104.jpg)
Info from Lattice of Clusters
tf1 tf2 tf3 tf4
g1 1 0 1 1
g2 1 1 1 0
g3 0 1 0 1
g4 1 0 0 0
g5 1 0 0 1
• Functional genomics
• Interactions between genes and transcription factors
• Comparing each bi-cluster in the lattice tells us the difference in activation of genes/TFs that transform cellular processes
• Prioritize relationships
![Page 105: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/105.jpg)
Problem Formulation
• Challenges
– Enumerating bi-clusters and forming the lattice is known to be an NP-Complete problem
– Discover distinguishing sets during the mining process as opposed to a post-processing step
– How to quantify distinction
![Page 106: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/106.jpg)
Problem Formulation
• Model lattice as weighted directed graph
• Weights represent degree of distinction
• Each edge represents a distinguishing set
• Grow maximum cost spanning tree
![Page 107: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/107.jpg)
Problem Formulation
• Quantifying distinction:
– View bi-clusters as maximal rectangles of 1s under suitable permutation
– Consider both change in width and height when computing distinction
– Choose a shape metric s (e.g., area, ratio of height to width)
– Quantify distinction as degree of shape change along a path in the lattice
  A B C D
1 0 1 0 1
2 1 1 0 0
3 0 1 1 1
4 0 0 1 1
![Page 108: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/108.jpg)
Problem Model
• Compute partial derivatives as forward differences
Example edge in the lattice: <{2},{A,B}> → <{1,2,3},{B}>
![Page 109: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/109.jpg)
Our Algorithm (MIDS)
• Adapt Prim’s algorithm
– Lattice is not readily available
– Dynamically compute cut set by enumerating upper neighbors of bi-clusters
1. Choose starting bi-cluster c
2. Compute cut set by generating upper neighbors of c
3. Compute weight of edges between c and upper neighbors
4. Greedily choose maximum cost edge and associated concept d
5. Set c=d, repeat steps 2-5 until all reachable bi-clusters visited
![Page 110: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/110.jpg)
Algorithm Details
• Step 2: Generating upper neighbors in lattice
– How?
• Lindig’s algorithm
– Cost?
• Improved Lindig’s algorithm, practical running time
• Theoretical complexity remains the same
• Overall complexity of MIDS
– E: total number of edges
– N: number of bi-clusters
– O: number of rows
– A: number of columns
![Page 111: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/111.jpg)
Experimental Results
• Experimented with synthetic datasets to find most distinguished incremental bi-clusters
• Preliminary experiments conducted with clearly distinguishable incremental bi-clusters and random noise
• Next planted several large incremental bi-clusters that differed only slightly as a result of noise
![Page 112: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/112.jpg)
Experimental Results
• Region 1 is region of interest, clearly distinct
• Noise added to region 1, while regions 2 and 3 contain minimal distinction
![Page 113: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d5e5503460f94a3d7e5/html5/thumbnails/113.jpg)
Experimental Results