Discovering Patterns in Multiple Datasets

Raj Bhatnagar

University of Cincinnati

Nature of Distributed Datasets

Horizontal Partitioning

A B C D E

A B C D E

A B C D E

Vertical Partitioning

D E F G H

A H J K M

A B C D E

Data components may be Geographically Distributed

Nature of Distributed Datasets: Multi-Domain Datasets

[Figure: three related matrices: Genes × Diseases, Genes × Drugs, and Drugs × Adverse Reactions]

Nature of Distributed Datasets: Multi-Domain Datasets

[Figure: three related matrices: Documents × Keywords, Documents × Cited-Documents, and Keywords × Topics]

Types of Patterns

• Decision Trees
• Association Rules
• Principal Component Analysis
• K-Nearest Neighbor Analysis
• Clusters
  – Hierarchical
  – K-Means
  – Subspace

Nature of Clusters

Patterns ≡ Unsupervised, Data Driven, Clusters

Single-Domain Clustering

[Figure: the Genes × Diseases matrix shown twice, once clustered by rows and once by columns]

• Clusters of similar genes, in the context of diseases
• Clusters of similar diseases, in the context of genes

Clusters may be:
– Mutually exclusive
– Overlapping

Nature of Patterns

Simultaneous Two-Domain Clustering

[Figure: Genes × Diseases matrix with a highlighted sub-block of genes (G) and diseases (D)]

• A cluster of similar genes, in a subspace of diseases
• A cluster of similar diseases, in a subspace of genes

Options:
– Exhaustive in one domain
– Exhaustive in both domains
– Mutually exclusive clusters in one or both domains
– Overlapping clusters/subspaces in both domains

Nature of Patterns

Simultaneous Three (Multi)-Domain Clustering

[Figure: Genes × Diseases and Genes × Drugs matrices, each with a highlighted cluster]

Match the “genes” subsets in the two clusters

Phase-III of this research

Part-I

Patterns in Vertically Distributed Databases

Learning Decision Trees

D = D1 X D2 X . . . X Dn

– D is implicitly specified

Goal: Build a decision tree for the implicit D, using the explicit Di’s

Geographically distributed databases: D1 (A, B, C), D2 (C, D, E), . . ., Dn (A, E, G)

Limitations:
– Can’t move the Di’s to a common site (size, communication cost, privacy)
– Can’t update local databases
– Can’t send actual data tuples

Vertically Partitioned Dataset

Explicit and Implicit Databases

Explicit Component Databases

Node 1      Node 2      Node 3
A B C       D C F       A E C
1 2 2       1 2 2       1 1 2
1 6 2       1 1 1       1 2 1
1 6 1       1 1 3       2 2 1
2 6 1       - - -       - - -

SharedSet

A C
1 1
1 2
2 1
2 2

Implicit Database

A B C D E F
1 6 1 1 2 1
1 6 1 1 2 3
1 6 2 1 1 2
1 2 2 1 1 2
2 6 1 1 2 1
2 6 1 1 2 3

Decomposition of Computations

– Since D is implicit, for a computation $R = F(D)$:
  – Decompose F into G and the g’s:

    $R = G[g_1(D_1), g_2(D_2), \ldots, g_n(D_n)]$

– Decomposition depends on
  – F
  – the Di’s and the set of shared attributes

D1 (A, B, C), D2 (C, D, E), . . ., Dn (A, E, G)

Count All Tuples in Implicit D

$R = \#\mathrm{tuples}(D) = \sum_{j=1}^{m} \prod_{i=1}^{n} N(D_i)_{cond_j}$

– cond_j: the j-th tuple in Shareds
– m: number of tuples in Shareds
– n: number of databases (Di’s)
– N(Di)_{cond_j}: count of tuples in Di satisfying cond_j
– Local computation: g_i(Di) = N(Di)_{cond_j}
– G is a sum-of-products
– If each Di knows the “shared” values, then
  • Only one message per site is needed for #tuples

Shareds

A C
1 1
1 2
2 1
2 2

L shared attributes, k values each: k^L tuples
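The sum-of-products count can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the three component databases are the Node 1, Node 2, and Node 3 tables from the example slide, and the names `local_count` and `count_implicit_tuples` are hypothetical. Each site only reports counts, never tuples.

```python
from itertools import product

# Explicit component databases from the example slide (encoding as lists of
# dicts is an assumption made for this sketch). A and C are shared attributes.
D1 = [{"A": 1, "B": 2, "C": 2}, {"A": 1, "B": 6, "C": 2},
      {"A": 1, "B": 6, "C": 1}, {"A": 2, "B": 6, "C": 1}]
D2 = [{"D": 1, "C": 2, "F": 2}, {"D": 1, "C": 1, "F": 1}, {"D": 1, "C": 1, "F": 3}]
D3 = [{"A": 1, "E": 1, "C": 2}, {"A": 1, "E": 2, "C": 1}, {"A": 2, "E": 2, "C": 1}]
SHARED = {"A": [1, 2], "C": [1, 2]}


def local_count(db, cond):
    """g_i: number of tuples at one site that agree with cond on the shared
    attributes this site actually stores."""
    return sum(all(row[a] == v for a, v in cond.items() if a in row) for row in db)


def count_implicit_tuples(dbs, shared):
    """G: sum over all shared-value combinations of the product of local counts."""
    names = sorted(shared)
    total = 0
    for values in product(*(shared[a] for a in names)):
        cond = dict(zip(names, values))
        term = 1
        for db in dbs:
            term *= local_count(db, cond)
        total += term
    return total


n_tuples = count_implicit_tuples([D1, D2, D3], SHARED)
```

For this data the result agrees with the number of rows in the implicit database shown on the example slide.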

Learning Decision Trees

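ID3 Algorithm

The entropy for a split on attribute a with b branches (a = a1, a2, . . ., ab) and c classes in the dataset consists of various counts only:

$E = \sum_{b} \frac{N_b}{N} \sum_{c} -\frac{N_{bc}}{N_b} \log_2\left(\frac{N_{bc}}{N_b}\right)$

N_bc and N_b can be computed using g and G as for #tuples: one message per database is needed for computing each entropy value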
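Because the entropy depends only on the counts N_bc, it can be evaluated once those counts have been assembled. A minimal sketch, assuming the counts arrive as a nested list `counts[b][c]` (the function name `split_entropy` is hypothetical):

```python
from math import log2


def split_entropy(counts):
    """Weighted entropy of a split; counts[b][c] is N_bc for branch b, class c."""
    n_total = sum(sum(row) for row in counts)
    entropy = 0.0
    for row in counts:
        n_b = sum(row)
        if n_b == 0:
            continue  # an empty branch contributes nothing
        branch = -sum((n / n_b) * log2(n / n_b) for n in row if n > 0)
        entropy += (n_b / n_total) * branch
    return entropy
```

A branch whose tuples all belong to one class contributes zero; a branch split evenly between two classes contributes its full weight N_b/N.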

Compute Covariance Matrix for D

• Covariance matrix for D
  – Needed for eigenvectors/principal components
  – Needs second-order moments: $E[(x_i - \bar{x}_i)(x_j - \bar{x}_j)]$
  – Helps compute terms of the type: $\sum_{t \in D} x_t y_t$
  – This matrix can be computed at one of the databases

G-and-g Decomposition for 2nd order moments

• Sum of products for two attributes: $\sum_{t \in D} x_t y_t$

• Six different ways in which x and y may be distributed
  – Each requires a different decomposition

– Case 1: x same as y; and x belongs to the SharedSet.
  $\sum_j x_j^2 \cdot \mathrm{Count}(x_j \text{ in } D)$

– Case 2: x same as y; and x does not belong to the SharedSet.
  $\sum_k \mathrm{Avg}(x^2 \text{ for } cond_k) \cdot \mathrm{Count}(cond_k)$

– Case 3: x and y both belong to the SharedSet.
  $\sum_k x_k \cdot y_k \cdot \mathrm{Count}(cond_k \text{ in Shared})$

Sum of Products

– Case 4: x belongs to SharedSet and y does not.
  $\sum_j x_j \cdot \bar{y}_j \cdot \mathrm{Count}(x = x_j)$, where $\bar{y}_j$ is the average of y over the tuples with $x = x_j$

– Case 5: x, y don’t belong to the SharedSet and reside on different nodes.
  • For each tuple t in SharedSet, obtain $x(t)$, $y(t)$
  • and then $\sum_t \mathrm{Sum}(x(t)) \cdot \mathrm{Sum}(y(t))$

– Case 6: x, y don’t belong to the SharedSet and reside on the same node.
  $\sum_t \mathrm{Prod}(t) \cdot \mathrm{Count}(t)$, where Prod(t) is the average of the product of x and y for cond-t of SharedSet

Nearest Neighbor Algorithm

Find nearest neighbor of r1 in D1

• with virtual extensions in D for all tuples in D1

• Need to compute all pairwise distances
• The same distance values can be used for clustering algorithms

Problem: Closed-Loops in Databases

Extracting Communication Graph

The learner is D1

Covariance, k-NN, etc. algorithms developed for this situation

Part-II

Subspace Clusters and Lattice Organization

Clustering in Multi-Domains

• Example 3-D dataset with 4 clusters.
• Each cluster is in 2-D
• Points from two subspace clusters can be very close, making traditional clustering algorithms inapplicable.
• Overlapping between clusters

Subspace Clustering

• “Interestingness” of a subspace cluster:
  – Domain dependent / user defined
  – Similarity-based clusters

Subspace Clusters

• Number of Subspaces

TID A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
1   1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2   1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3   1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4   1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5   1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6   0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
7   0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
8   0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
9   0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
10  0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
11  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
12  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
13  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
14  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
15  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

Number of subspaces: $\sum_{k=1}^{30} \binom{30}{k} = 2^{30} - 1$

Nature of Real Datasets

1 0 1 1 0 0 0 1 1 1 1 0 0 0 0

0 0 0 1 1 1 1 1 1 0 0 0 0 0 0

0 1 0 1 0 1 0 1 1 1 0 0 0 1 1

1 0 0 0 0 0 0 0 1 1 1 1 0 0 1

1 1 0 0 0 0 1 1 1 1 1 1 0 0 0

0 0 0 0 0 0 0 0 0 1 1 1 1 1 0

0 1 0 0 0 0 0 1 1 1 1 1 0 0 1

0 1 0 1 0 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 0 0 0 0 0 0 0 0 0

0 0 0 0 1 1 1 1 1 1 1 1 1 0 0

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1

1 1 1 1 0 0 0 0 0 0 0 0 0 0 0

5 6 3 4 5 1 2 5 4 6 5 7 6 7 5

6 8 8 9 9 9 7 6 5 4 3 2 1 2 3

4 3 2 1 2 3 4 5 6 7 8 7 6 5 6

7 8 7 6 5 6 7 8 9 0 9 8 7 6 0

0 9 8 0 9 8 7 6 5 4 3 2 3 4 5

4 3 4 5 0 0 0 0 0 9 8 7 6 5 4

3 4 5 6 5 4 3 4 3 4 3 6 3 7 2

7 3 9 0 7 0 1 5 3 4 6 5 4 3 7

3 9 6 3 9 0 0 5 4 0 4 3 2 2 2

2 7 8 9 0 9 8 7 6 5 6 7 8 7 6

5 4 3 4 5 6 7 4 2 8 5 9 5 7 2

4 6 4 6 7 7 8 4 6 3 3 1 0 0 1

1 5 5 5 4 4 7 7 8 9 6 4 3 2 0

6 0 7 6 8 4 5 7 3 3 3 4 7 6 8

6 7 6 9 2 5 3 7 5 1 0 4 8 3 5

Examples: Genes--Diseases; person-MovieRating; Document-TermFrequency

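Lattice of Subspaces: Formal Concept Analysis

Each lattice node is an attribute subset, listed with its row ids ({} means no row contains the subset):

Level 0: null
Level 1: A {1,2,4}  B {1,2,3}  C {1,2,3,4}  D {2,4,5}  E {3,4,5}
Level 2: AB {1,2}  AC {1,2,4}  AD {2,4}  AE {4}  BC {1,2,3}  BD {2}  BE {3}  CD {2,4}  CE {3,4}  DE {4,5}
Level 3: ABC {1,2}  ABD {2}  ABE {}  ACD {2,4}  ACE {4}  ADE {4}  BCD {2}  BCE {3}  BDE {}  CDE {4}
Level 4: ABCD {2}  ABCE {}  ABDE {}  ACDE {4}  BCDE {}
Level 5: ABCDE {}

Parallel to the ideas of Formal Concept Analysis

1. Need algorithms to find interesting subspace clusters
2. Lattice provides much more insight into the dataset.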

Clusters in Subspaces

Clusters in overlapping subspaces

   a b c d e
1  1 1 1 0 1
2  1 1 1 0 1
3  1 0 1 1 1
4  0 0 1 1 1
5  1 0 1 1 1

Density = number of rows: an anti-monotonic property

If AB has less than the needed density, then so do all its descendants

[Figure: lattice of attribute subsets, from null through A, B, C, D, E up to ABCDE]

Value of (anti)monotonic Properties

[Figure: lattice of attribute subsets from null to ABCDE, with the pruned supersets marked]

Maximal and Closed Subspaces

[Figure: lattice of attribute subsets from null to ABCDE annotated with row ids; nodes are marked either “closed and maximal” or “closed but not maximal”]

Row ids per attribute: A {1,2,4}, B {1,2,3}, C {1,2,3,4}, D {2,4,5}, E {3,4,5}

Minimum support = 2

# Closed = 9

# Maximal = 4
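The closed and maximal counts can be checked by brute force on the row ids of the five attributes. The sketch below is only an illustration: it enumerates every attribute subset instead of pruning with the anti-monotonic support property, and the names `support_rows`, `is_closed`, and `is_maximal` are hypothetical.

```python
from itertools import combinations

# Row ids that contain each attribute, as annotated on the lattice.
ROWS = {"A": {1, 2, 4}, "B": {1, 2, 3}, "C": {1, 2, 3, 4},
        "D": {2, 4, 5}, "E": {3, 4, 5}}
ALL_ROWS = {1, 2, 3, 4, 5}
MIN_SUPPORT = 2


def support_rows(attrs):
    """Rows that contain every attribute in attrs."""
    rows = set(ALL_ROWS)
    for a in attrs:
        rows &= ROWS[a]
    return rows


# Every attribute subset whose support meets the minimum.
frequent = {}
for size in range(1, len(ROWS) + 1):
    for attrs in combinations(sorted(ROWS), size):
        rows = support_rows(attrs)
        if len(rows) >= MIN_SUPPORT:
            frequent[frozenset(attrs)] = rows


def is_closed(attrs):
    """Closed: no proper superset is supported by exactly the same rows."""
    return not any(attrs < other and frequent[other] == frequent[attrs]
                   for other in frequent)


def is_maximal(attrs):
    """Maximal: no proper superset is frequent at all."""
    return not any(attrs < other for other in frequent)


closed = sorted("".join(sorted(s)) for s in frequent if is_closed(s))
maximal = sorted("".join(sorted(s)) for s in frequent if is_maximal(s))
```

Every maximal subspace is also closed, which is why the maximal list is a subset of the closed list.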

Siblings and Parents in Lattice

Merge lattice nodes to find clusters of other properties

   a b c d e
1  1 1 1 0 1
2  1 1 1 0 1
3  1 0 1 1 1
4  0 0 1 1 1
5  1 0 1 1 1

C1 =<{1,2,3,4,5}, {a,c,d,e}>

C2 =<{3,4,5}, {a,c,d,e}>

Siblings in Lattice

Goal: Subspace Clusters with Properties

Anti-monotonic properties:

– Minimum density: density(C = <O, A>) := |O| / total number of objects in the data
  e.g.: density(C = <{o2,o3,o4}, {a.2}>) = 3/5 = 0.6

– Succinctness: density is strictly smaller than that of all of its minimum generalizations
  e.g.: C1 = <{o2,o3,o4}, {a.2 c.2}> is not succinct; C2 = <{o3,o4}, {b.2 c.2}> is succinct

– Numerical properties (row-wise): “max”, “min”
  e.g.: C2 = <{o1,o5}, {b.2 c.4}> satisfies “max > 3”

Weak anti-monotonic properties:

– “average >= δ”, “average <= δ”, “variance >= δ”, “variance <= δ”
  e.g.: C2 = <{o1,o5}, {b.2 c.4}> satisfies “average >= 3”, but both C3 = <{o1,o2,o4,o5}, {b.2}> and C4 = <{o1,o5}, {b.2 c.4 d.2}> violate “average >= 3”

a b c d

o1 5 2 4 2

o2 2 1 2 2

o3 2 2 2 2

o4 2 2 2 2

o5 3 2 4 2
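The difference between the two kinds of property can be checked on the table above. The sketch below is an assumption-laden illustration: a pattern such as {b.2 c.4} is encoded as a dict of attribute-value pairs, and `objects_of`, `density`, and `average` are hypothetical helper names. The object sets are computed from the table as printed.

```python
# The o1..o5 table from the slide, one dict of attribute values per object.
TABLE = {
    "o1": {"a": 5, "b": 2, "c": 4, "d": 2},
    "o2": {"a": 2, "b": 1, "c": 2, "d": 2},
    "o3": {"a": 2, "b": 2, "c": 2, "d": 2},
    "o4": {"a": 2, "b": 2, "c": 2, "d": 2},
    "o5": {"a": 3, "b": 2, "c": 4, "d": 2},
}


def objects_of(pattern):
    """Objects whose values match every attribute-value pair in the pattern."""
    return sorted(o for o, row in TABLE.items()
                  if all(row[a] == v for a, v in pattern.items()))


def density(pattern):
    """Fraction of all objects covered by the pattern (anti-monotonic)."""
    return len(objects_of(pattern)) / len(TABLE)


def average(pattern):
    """Row-wise average of the pattern's values (only weakly anti-monotonic)."""
    return sum(pattern.values()) / len(pattern)


c2 = {"b": 2, "c": 4}
generalizations = [{"b": 2}, {"c": 4}]           # drop one attribute from C2
satisfying = [g for g in generalizations if average(g) >= 3]
```

Adding an attribute can never raise density, whereas "average >= 3" holds for {b.2 c.4} and for only one of its two generalizations, which is what makes it a weak anti-monotonic property.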

Levelwise Search

• Pruning of weak anti-monotonic properties
  e.g.: if C2 = <{o1,o5}, {b.2 c.4}> satisfies “average >= 3”, then o1, o5 must be contained in at least one of its minimum generalizations that satisfy this constraint: C5 = <{o1,o5}, {c.4}>

• If an object is not contained in any cluster of size k that satisfies a weak anti-monotonic property, it must not be contained in any cluster of size k+1 that satisfies this property

a b c d

o1 5 2 4 2

o2 2 1 2 2

o3 2 2 2 2

o4 2 2 2 2

o5 3 2 4 2

Levelwise Search for Subspace Clusters

• Anti-monotonic & Weak Anti-monotonic

– Candidate generation based on anti-monotonic properties only

– Data reduction based on weak anti-monotonic properties, such as: “mean>=δ”, “mean<=δ”, “variance>=δ”, “variance< =δ”

Performance Comparison

• Optimizing Techniques

– Sorting the attributes

– Reuse previous results

– Stack of unpromising branches

– Check closure property

Distributed Subspace Clustering

• Discover closed subspace clusters from databases located at multiple sites

• Objectives:
  – Minimize local computation cost
  – Minimize communication cost

• Horizontally Partitioned Data

DS (complete dataset)

   a b c d
1  0 0 1 1
2  1 0 1 1
3  1 1 1 0
4  0 0 1 1
5  1 1 0 0

D1

   a b c d
1  0 0 1 1
2  1 0 1 1
3  1 1 1 0

D2

   a b c d
4  0 0 1 1
5  1 1 0 0


List of Closed Subspace Clusters

DS          D1          D2
<c,1234>    <c,123>     <cd,4>
<cd,124>    <cd,12>     <ab,5>
<a,235>     <ac,23>
<ac,23>     <acd,2>
<acd,2>     <abc,3>
<ab,35>
<abc,3>

Lemma 1: All locally closed attribute sets are also globally closed

Lemma 2: The intersection of two locally closed attribute sets from two different sites is globally closed, e.g.: a = ac ∩ ab
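The two lemmas suggest a simple merge for two sites, sketched below under stated assumptions: each site enumerates its own closed attribute sets by brute force, only those lists are exchanged, and the object set of a global attribute set is taken from the largest local object set that covers it at each site. The function names are hypothetical and this is not the authors' optimized algorithm.

```python
from itertools import combinations

# Horizontal partitions from the slide: object id -> attributes set to 1.
D1 = {1: {"c", "d"}, 2: {"a", "c", "d"}, 3: {"a", "b", "c"}}
D2 = {4: {"c", "d"}, 5: {"a", "b"}}


def local_closed(db):
    """Map each closed attribute set of one site to its local object set."""
    rows = {o: frozenset(r) for o, r in db.items()}
    closed = {}
    for size in range(1, len(rows) + 1):
        for group in combinations(rows.values(), size):
            common = frozenset.intersection(*group)
            if common:
                closed[common] = {o for o, r in rows.items() if common <= r}
    return closed


def largest_objects(local, attrs):
    """Largest object set among the local closed sets that contain attrs."""
    best = set()
    for closed_attrs, objs in local.items():
        if attrs <= closed_attrs and len(objs) > len(best):
            best = objs
    return best


def merge(local1, local2):
    """Globally closed sets: local ones (Lemma 1) plus intersections (Lemma 2)."""
    candidates = set(local1) | set(local2)
    candidates |= {x & y for x in local1 for y in local2 if x & y}
    return {s: largest_objects(local1, s) | largest_objects(local2, s)
            for s in candidates}


merged = merge(local_closed(D1), local_closed(D2))
readable = {"".join(sorted(attrs)): objs for attrs, objs in merged.items()}
```

On this example the merged result reproduces the DS column of the list above, including <a,235>, which is closed at neither site.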



List of Closed Subspace Clusters

Compute the object set:

1. Closed at both partitions: compute the union of the two object sets, e.g.: cd

2. Closed in one of the partitions: the union of two object sets whose attribute sets’ intersection equals the target attribute set, e.g.: c = c ∩ cd

3. Not closed in any of the partitions: similar to case 2, e.g.: a = ac ∩ ab



List of Closed Subspace Clusters

Problem: in both case 2 and case 3, the same attribute set can be produced by more than one intersection:

a = ac ∩ ab and a = acd ∩ ab

Solution: for each globally closed attribute set, keep track of the largest object set (or the size of the object set)


Density Constraint: δ >= 0.6

Each list of closed subspace clusters is split into those that meet the density constraint (F) and those that do not (E):

DS   F:  <c,1234>  <cd,124>  <a,235>
     E:  <ac,23>  <acd,2>  <ab,35>  <abc,3>
D1   F1: <c,123>  <cd,12>  <ac,23>
     E1: <acd,2>  <abc,3>
D2   F2: (empty)
     E2: <cd,4>  <ab,5>

Observation: the intersection of two elements that both come from the Ei's cannot have enough density

Efficient Computation: sort Fi and Ei in decreasing order of density


R R1 R2

<c,1234>

<cd,124>

<a,235>

<c,123>

<cd,12>

<ac,23>

<c,1234>

<cd,124>

<a,235>


• Generalize to k > 2
  – k sites need k-step communication and computation
  – k sites have k types:

• K = 3

Part-III

Multi-Domain Clusters

Introduction

Traditional clustering

Bi-Clustering

3-Clustering

Why 3-clusters?

• Correspondence between bi-clusters of two different lattices

• Sharpen local clusters with outside knowledge

• Alternative? “Join datasets then search”
  – Does not capture underlying interactions
  – Inefficient
  – Not always possible

Formal Definitions

Bi-cluster in Di

3-Cluster across D1 and D2

Pattern in Di

Defining 3-clusters

• D1 is the “learner”

• Maximal rectangle of 1's under suitable permutation in learner

• Best correspondence to rectangle of 1's in D2

Cluster Quality Measure

• Intuition: Maximize number of 1's while also maximizing number of items and objects

• Trade-off between objects and items
  – More items...fewer objects
  – More objects...fewer items

Quality Measure

– Consider bi-clusters in learner alone

[Figure: two candidate bi-clusters C1 and C2 over objects O and items I1]

• Which is preferable?
• User decides

Measure of Cluster Quality

• Quality measure:
  – Monotonic in both width and height
  – Balances width and height according to a user-defined parameter

• Introduce β: the width (attributes) one is willing to trade for a single unit of height (objects)

Cluster Quality Measure

Cluster Quality for 3-clusters

• Utilize same intuition
• Width of a 3-cluster is the sum of the individual widths

Selecting β

• Larger values yield 3-clusters that are “wide” and “short” in both D1 and D2
  – Cluster key websites popular with a large number of democrats and republicans

• Smaller values produce 3-clusters that are “narrow” and “long”
  – Discover a long list of websites utilized by a few select democrats and republicans

3-Clu: Our Algorithm

• Search for 3-clusters similar to search for closed itemsets

• How to formulate the search space?
  – Assumption that objects outnumber attributes may not hold
  – Several possible orderings of the search space

3-Clu Algorithm

• Define search space with primacy to objects

• Only need to maintain one search tree
• Mimic closed itemset algorithm with simultaneous pruning of search space
• Prune with quality measure

3-Clu Algorithm

Algorithm

• Quality measure is neither monotone nor anti-monotone in the search space

• Pruning is still possible

Is C2 of higher quality ?

Algorithm

Experimental Results

[Figure panels: Chess, Connect, GO-Pheno]


• Test validity of 3-clusters

• Randomly partitioned Mushrooms dataset by attributes

Discriminating Clusters

• Key question: What sets of attributes and /or objects most distinguish the incremental bi-clusters from each other?

• Incremental bi-clusters that only differ slightly may be a result of noise or human error

• Prioritize relationships among incremental bi-clusters

A B C D

1 0 1 0 1

2 1 1 0 0

3 0 1 1 1

4 0 0 1 1

Bi-cluster lattice nodes shown (objects, attributes):

<{}, {A,B,C,D}>
<{2}, {A,B}>
<{1,2,3,4}, {}>
<{3,4}, {C,D}>
<{1,2,3}, {B}>
<{1,3,4}, {D}>
<{1,3}, {B,D}>

Motivation

tf1 tf2 tf3 tf4

g1 1 0 1 1

g2 1 1 1 0

g3 0 1 0 1

g4 1 0 0 0

g5 1 0 0 1

• Functional genomics
• Interactions between genes and transcription factors
• Comparing each bi-cluster in the lattice tells us the difference in activation of genes/TFs that transform cellular processes
• Prioritize relationships

Related Work

• Emerging patterns [Dong, Li]
  – Only consider ratio of support between frequent itemsets
  – Supervised technique

• Contrast sets [Bay, Pazzani]
  – Also supervised
  – Special case of rule discovery

• Closed Itemset algorithms [Zaki, Uno, Bian]
  – Efficient
  – Do not explicitly enumerate lattice structure

Problem Formulation

• Challenges
  – Enumerating bi-clusters and forming the lattice is known to be an NP-complete problem
  – Discover distinguishing sets during the mining process as opposed to a post-processing step
  – How to quantify distinction

Problem Formulation

• Dataset D = (O, A, R)

• Consider a set of objects X
• X’ is defined as all attributes common to all objects of X
• Y’ is dually defined for a set of attributes Y

• Bi-cluster: (X, Y) s.t. X’ = Y and Y’ = X
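The two derivation operators and the bi-cluster test translate directly into set operations. The sketch below uses the 4 × 4 example dataset from these slides; the names `common_attributes`, `common_objects`, and `is_bicluster` are hypothetical.

```python
# Example dataset: object id -> attributes whose value is 1.
DATA = {1: {"B", "D"}, 2: {"A", "B"}, 3: {"B", "C", "D"}, 4: {"C", "D"}}
ATTRIBUTES = {"A", "B", "C", "D"}


def common_attributes(objects):
    """X': attributes shared by every object in X (all attributes if X is empty)."""
    result = set(ATTRIBUTES)
    for o in objects:
        result &= DATA[o]
    return result


def common_objects(attributes):
    """Y': objects that have every attribute in Y."""
    return {o for o, row in DATA.items() if set(attributes) <= row}


def is_bicluster(objects, attributes):
    """(X, Y) is a bi-cluster when X' = Y and Y' = X."""
    return (common_attributes(objects) == set(attributes)
            and common_objects(attributes) == set(objects))
```

For instance, ({1}, {B, D}) fails the test because object 3 also has both B and D, so the rectangle is not maximal.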

Problem Formulation

   A B C D
1  0 1 0 1
2  1 1 0 0
3  0 1 1 1
4  0 0 1 1

• Sample bi-cluster: <{3,4},{C,D}>

• Bi-clusters are equivalent to
  • Maximal rectangles of 1s under suitable permutation
  • Maximal bi-cliques

Problem Formulation

• Set of bi-clusters forms a complete lattice

• Model lattice as weighted directed graph

• Weights represent degree of distinction

• Each edge represents a distinguishing set

• Grow maximum cost spanning tree

Problem Formulation

• Quantifying distinction:
  – View bi-clusters as maximal rectangles of 1s under suitable permutation
  – Consider both change in width and height when computing distinction
  – Choose a shape metric s (e.g. area, ratio of height to width, etc.)
  – Quantify distinction as degree of shape change along a path in the lattice

   A B C D
1  0 1 0 1
2  1 1 0 0
3  0 1 1 1
4  0 0 1 1

Problem Model

Compute partial derivatives as forward differences

<{2}, {A,B}>

<{1,2,3}, {B}>

Our Algorithm (MIDS)

• Input: Dataset D

• Output: Maximum cost spanning tree of bi-cluster lattice of D

• Computational challenge: enumerating bi-cluster lattice and growing tree simultaneously


• Most min/max cost spanning tree algorithms assume availability of graph

• Prim’s algorithm depends on the Cut set

• Intuitive idea: Grow bi-cluster lattice incrementally and maintain the Cut set


• Prim’s grows sequence of trees

• Denote set of edges between bi-cluster c and all upper neighbors of c that do not appear in current tree by

• Df

• Dynamically compute Cut, while enumerating new bi-clusters

Our Algorithm (MIDS)

1. Choose starting bi-cluster c (usually infimum)
2. Compute cut set by generating upper neighbors of c together with update equation
3. Compute weight of edges between c and upper neighbors
4. Greedily choose maximum cost edge and associated concept d
5. Set c = d, repeat steps 2-5 until all reachable bi-clusters are visited
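The steps above can be sketched as a Prim-style loop that generates the lattice only as it is needed. This is a rough illustration and not the MIDS implementation: the edge weight is assumed to be the absolute change in area (one of the shape metrics mentioned in these slides), upper neighbors are generated with the closure test from Lindig's theorem, and all function names are hypothetical.

```python
import heapq

# The 4 x 4 example dataset: object id -> attributes whose value is 1.
DATA = {1: {"B", "D"}, 2: {"A", "B"}, 3: {"B", "C", "D"}, 4: {"C", "D"}}
ATTRIBUTES = frozenset("ABCD")


def intent(extent):
    """Attributes common to all objects in extent (all attributes if empty)."""
    common = set(ATTRIBUTES)
    for o in extent:
        common &= DATA[o]
    return frozenset(common)


def closure(objects):
    """Objects that have every attribute shared by the given objects."""
    common = intent(objects)
    return frozenset(o for o, row in DATA.items() if common <= row)


def upper_neighbors(extent):
    """Extents of the upper neighbors of a bi-cluster, via the closure test."""
    found = set()
    for o in DATA.keys() - extent:
        candidate = closure(extent | {o})
        if all(closure(extent | {z}) == candidate for z in candidate - extent):
            found.add(candidate)
    return found


def area(extent):
    """Assumed shape metric: height (objects) times width (attributes)."""
    return len(extent) * len(intent(extent))


def mids_sketch(start=frozenset()):
    """Grow a maximum cost spanning tree while enumerating the lattice."""
    start = closure(start)
    in_tree = {start}
    tree_edges = []
    cut = []   # max-heap of cut edges, stored with negated weights
    tie = 0    # insertion counter so the heap never compares frozensets

    def add_cut_edges(c):
        nonlocal tie
        for d in upper_neighbors(c):
            if d not in in_tree:
                weight = abs(area(d) - area(c))
                heapq.heappush(cut, (-weight, tie, c, d))
                tie += 1

    add_cut_edges(start)
    while cut:
        neg_weight, _, c, d = heapq.heappop(cut)
        if d in in_tree:
            continue  # this edge no longer crosses the cut
        in_tree.add(d)
        tree_edges.append((c, d, -neg_weight))
        add_cut_edges(d)
    return tree_edges


tree = mids_sketch()
```

Each tree edge is a triple (lower extent, upper extent, weight); swapping `area` for another shape metric changes only the weights, not the traversal.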

Algorithms

[Figure: step-by-step growth of the spanning tree over the bi-cluster lattice of the 4 × 4 example dataset, with nodes <{}, {A,B,C,D}>, <{2}, {A,B}>, <{3,4}, {C,D}>, <{1,3}, {B,D}>, <{1,2,3}, {B}>, <{1,3,4}, {D}>, and <{1,2,3,4}, {}>]

Algorithms

• Major computational cost is computing upper neighbors of a bi-cluster

• Theorem:
  • Let <X,Y> be a concept. Then (X ∪ {o})’’ are the objects of an upper neighbor of <X,Y> if and only if for all z ∈ (X ∪ {o})’’ − X the following holds: (X ∪ {z})’’ = (X ∪ {o})’’

• Lindig’s algorithm implements this theorem
• Algorithm performs a local computation of bi-clusters and upper neighbors
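The theorem's test can be written directly with a closure operator. The following is a small sketch on the 4 × 4 example dataset, not the improved version of Lindig's algorithm discussed in these slides; `closure` and `upper_neighbors` are hypothetical names.

```python
# Example dataset: object id -> attributes whose value is 1.
DATA = {1: {"B", "D"}, 2: {"A", "B"}, 3: {"B", "C", "D"}, 4: {"C", "D"}}
ATTRIBUTES = frozenset("ABCD")


def closure(objects):
    """X'': all objects that share the attributes common to X."""
    common = set(ATTRIBUTES)
    for o in objects:
        common &= DATA[o]
    return frozenset(o for o, row in DATA.items() if common <= row)


def upper_neighbors(extent):
    """Object sets of the upper neighbors of the concept with this extent."""
    extent = frozenset(extent)
    neighbors = set()
    for o in DATA.keys() - extent:
        candidate = closure(extent | {o})
        # Keep the candidate only if every object it adds generates it again.
        if all(closure(extent | {z}) == candidate for z in candidate - extent):
            neighbors.add(candidate)
    return neighbors


above_13 = upper_neighbors({1, 3})
```

Only the rows of the dataset are consulted, which is what makes the computation local to one bi-cluster at a time.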

Algorithms

• Improved Lindig’s algorithm, practical running time
  • Adapted to enumerate only “large” bi-clusters
  • Reduced number of set intersections performed

• Theoretical complexity remains the same
• Overall complexity of MIDS
  – E: total number of edges
  – N: number of bi-clusters
  – O: number of rows
  – A: number of columns


• Compared solution to Zaki’s CHARM-L algorithm
  – CHARM-L enumerates bi-clusters and organizes them into a lattice structure
  – Not incremental: hard to adapt to Prim’s algorithm
  – Added a post-processing step to grow the MCST


• Experimented with synthetic datasets to find most distinguished incremental bi-clusters

• Preliminary experiments conducted with clearly distinguishable incremental bi-clusters and random noise

• Next planted several large incremental bi-clusters that differed only slightly as a result of noise


• Region 1 is region of interest, clearly distinct

• Noise added to region 1, while regions 2 and 3 contain minimal distinction

Computer_Science@UC

A Vision

Raj Bhatnagar

CS Department @ UC

Department Goals:
• Train CS graduates to meet technical manpower needs in Ohio, and the world at large.
• Contribute to the creation of scholarly knowledge through research

Needed Features:
– Strong Research and Graduate program
– Strong Undergraduate program
– Good visibility and ranking in research communities
– Good reputation in 250-mile region for UG program

Research and Graduate Program

[Figure: CS Dept. faculty in core CS areas at the center, linked to collaborating units and topics: BME, CHMC (Bio-Informatics); A&S, Sciences (GIS); Engineering, Business (Robotics, BI); Mathematics (Security, Data Mining)]

• CS is an Enabling Science
• Can catalyze research in Science and Engineering
• UC needs a stronger CS program

• Core CS areas to be covered in CS Dept., Foundational research

• Collaborations to be built with other UC departments for research

Core Computing Faculty

Current Number: 12

Areas of Strength:
– Algorithms and CS Theory
– Networks and Communications
– Machine Learning, AI, and Data Mining
– HCI

Gaps in Strength/Coverage:
– Programming Languages/Compilers
– Software Engineering
– Databases
– Computer Systems and Networks

Target Number for a Strong CSD: 18

Potential Collaborations

Bio-Informatics
– Have been research contacts
– Strong presence of Bioinformatics activity at UC
– Need new CS faculty with matching interests

Mathematics
– Computer Security, quantum computing
– Data Mining, privacy preserving operations on data

GIS
– Very active and growing in A&S
– Almost no contact with CS faculty; great research potential

Robotics
– ME has good activity; a recent hire from Duke University
– Great student interest; grad and undergrad

Business
– Strength in databases and interest in collaborations

CAS
– Potential for a minor in IT for CS students and vice-versa

Undergraduate Program

• High priority to increase enrollment/retention
  – Translates into more resources now
  – Alumni are potential donors
  – Need to advertise our strengths on the web and in the local area
  – Increase interestingness of available electives, from new faculty and also from collaborating departments
  – Increase social capital and sense of belonging within the CS student body

• Strengthen Capstone projects
  – Sponsored projects by local industry (item for IAB)
  – Advertise them on the department website
  – Awards for top projects (sponsored by industry)

• Student Experience while at UC
  – Support ACM, EEE, and LARC groups for an enriching experience
  – Increase UG Research experience opportunities
  – Seek funds to provide more scholarships

Graduate Program

Marshal Resources to support Ph.D. students
– Help faculty’s attempts to seek sponsored research
– Seek industry help for sponsored research
– Seek funds to support student travel to conferences

Enhance Students’ Quality of Experience at UC
– More interactions with faculty/students
– More graduate courses
– More support for organizing/travelling to events/conferences

Resources

Need serious efforts for raising resources

– Faculty lines from UC
  • Need to reach critical mass for CSD
  • PBB is a hurdle
  • College support/commitment is needed

– CS Endowment Account (Our money, that can’t be cut!)
  • Approach alumni for donations
  • Seek industry support/sponsorship for scholarships, seminar speakers

– Sponsored research awards from NSF
  • New administration has significantly enhanced funds for NSF
  • Support faculty in efforts to seek these funds

– Seek more UGA funds from CoE
  • Difficult for now; but must continue efforts

Phase-I Summary

Main Results:
• Algorithms for complex operations that work with implicit databases
  • Decision Trees, Association Rules, Covariance matrix
  • K-Nearest Neighbors
  • Hierarchical and Sequential Clustering
• Algorithms for distributed control of multi-agent systems
  • Distributed Multi-Agent Reorganization

Open Research Issues:
– We preserve data privacy, but we need a formal model and analysis of privacy
– Mining of streaming data at multiple cooperating nodes

Phase-I Research Participants:
– Ph.D. dissertations
  • Ahmed Khedr, 2003
  • Barrington Young, 2007
  • Eric Matson, 2008
– M.S. theses
  • Shriram Srinivasan, 1997
  • Sanjeev Beemidi, 1998
  • Harpreet Singh, 2000
  • Rahul Dasgupta, 2000
  • Susmit Kumar, 2002
  • Rishi Jhaver, 2003
  • Chris Calendar, 2004
  • Michael Kinsey, 2005
  • Kaustubh Shinde, 2006

Phase-I Publications:

1. Ahmed Khedr and Raj Bhatnagar. Agents for Integrating Distributed Data for Complex Computations. Computing and Informatics Journal, Vol. 26, 2007, pp. 149-170.

2. Eric T. Matson and Raj Bhatnagar. Knowledge Sharing Between Agents in a Transitioning Organization. Proceedings of COIN 2007, published as the book Coordination, Organizations, Institutions, and Norms in Agent Systems - III, Springer Verlag, 2007, pp. 187-202.

3. Eric Matson and Raj Bhatnagar. Properties of Capability Based Agent Organization Transition. Proceedings of the Intelligent Agent Technologies (IAT 2006) conference held in Hong Kong in December 2006.

4. Barrington Young and Raj Bhatnagar. Secure K-NN Algorithm for Distributed Databases. Proceedings of the Privacy Security and Trust Conference, 2006, pp. 485-490.

5. Ahmed Khedr and Raj Bhatnagar. Decomposable Algorithms for Minimum Spanning Tree. Presented at the International Workshop on Distributed Computing, December 2003, Springer Verlag notes on Computer Science, vol. 2918.

6. Raj Bhatnagar and Sriram Srinivasan. Pattern Discovery in Distributed Databases. Proceedings of the AAAI-97 Conference held at Providence, RI, in July 1997, pp. 503-508.

Phase-II Summary

Main Results:
• Subspace Clustering Algorithms
  • Efficient lattice-based search
  • Use of novel monotonic conditions to control search
• Distributed mining across multiple lattices
  • Only for horizontally partitioned datasets
• Applications of Results
  • Genomic, document collections

Open Research Issues:
– Clusters for non-binary datasets
– Approximately closed clusters

Phase-II Research Participants:
– Ph.D. dissertations
  • Haiyun Bian, 2006
  • Amit Sinha, 2008
– M.S. theses
  • Gautam Kurra, 2002
  • Anshuman Rajshiva, 2004
  • Ramya Ashok, 2005
  • Aparna Yardi, 2006
  • Aravind Kumar, 2006
  • Shriram Narayanswami, 2007
  • Mrunal Deshmukh, 2008
– Collaborations:
  • CHMC

Phase-II Publications:

1. Shriram Narayanswamy and Raj Bhatnagar. A Lattice-Based Model for Recommender Systems. Proceedings of the International Conference on Tools with Artificial Intelligence (ICTAI 2008), pp. 349-356.

2. Haiyun Bian and Raj Bhatnagar. An Algorithm for Mining Weighted Dense Maximal 1-Complete Regions. Data Mining: Foundations and Practice, pp. 31-48, Springer Verlag, 2008.

3. Haiyun Bian, Raj Bhatnagar, and Barrington Young. An Efficient Constraint-Based Closed Set Mining Algorithm. Proceedings of the International Conference on Machine Learning and Applications (ICMLA 2007), pp. 67-72.

4. Barrington Young, Raj Bhatnagar, Giridhar Tatavarty, and Haiyun Bian. Covariance Matrix Computations with Federated Databases. Proceedings of the International Conference on Machine Learning and Applications (ICMLA 2007), pp. 172-177.

5. Haiyun Bian and Raj Bhatnagar. Efficiently Mining Maximal 1-complete Regions from Dense Datasets. ICDM Workshop on Foundations of Data Mining 2006, Proceedings of ICDM Workshops, pp. 423-427.

6. Haiyun Bian and Raj Bhatnagar. Towards More Supervised Subspace Clustering. Proceedings of the MAICS 2006 conference, held in Valparaiso, OH, April 2006.

7. Arvind Muthukrishnan and Raj Bhatnagar. Concept-based Organization and Retrieval of Technical Documents. Proceedings of MAICS 2006, Valparaiso, OH, April 2006.

8. Haiyun Bian and Raj Bhatnagar. An Algorithm for Lattice-Structured Subspace Clustering. Proceedings of the SIAM International Conference on Data Mining, April 2005.

Phase-III Summary

Main Results:
• 3-Clustering Algorithm
  • Efficient search algorithm
  • Bioinformatics application, genomic datasets
• Most Discriminating subsets
  • Efficient algorithm

Open Research Issues:
– Multi-domain datasets with closed loop relationships
– Diagonal band patterns

Phase-III Research Participants:
– Ph.D. dissertations
  • Faris Alqadah, 2010 (very likely)
– Collaborations: CHMC

Phase-III Publications:

1. Faris Alqadah and Raj Bhatnagar. Discovering Substantial Distinctions among Incremental Bi-Clusters. To be presented at the SIAM International Conference on Data Mining (SDM 09) in April 2009.

2. Faris Alqadah and Raj Bhatnagar. An effective algorithm for mining 3-clusters in vertically partitioned data. Proceedings of CIKM 2008, pp. 1103-1112.

3. Faris Alqadah and Raj Bhatnagar. Detecting significant distinguishing sets among bi-clusters. Proceedings of CIKM 2008.

Conclusions

• Introduced quantitative measure of distinction among incremental bi-clusters

• Developed efficient algorithm for enumerating bi-clusters and growing maximum cost spanning tree simultaneously



• Adapt Prim’s algorithm
  – Lattice is not readily available
  – Dynamically compute cut set by enumerating upper neighbors of bi-clusters
  1. Choose starting bi-cluster c
  2. Compute cut set by generating upper neighbors of c
  3. Compute weight of edges between c and upper neighbors
  4. Greedily choose maximum cost edge and associated concept d
  5. Set c = d, repeat steps 2-5 until all reachable bi-clusters are visited

Algorithm Details

• Step 2: Generating upper neighbors in lattice
  – How?
    • Lindig’s algorithm
  – Cost?
    • Improved Lindig’s algorithm, practical running time
    • Theoretical complexity remains the same

• Overall complexity of MIDS
  – E: total number of edges
  – N: number of bi-clusters
  – O: number of rows
  – A: number of columns

