Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Post on 21-Dec-2015


TRANSCRIPT

Page 1: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Discovering Patterns in Multiple Datasets

Raj Bhatnagar

University of Cincinnati

Page 2: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Nature of Distributed Datasets

Horizontal Partitioning

A B C D E

A B C D E

A B C D E

Vertical Partitioning

D E F G H

A H J K M

A B C D E

Data components may be Geographically Distributed

Page 3: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Nature of Distributed Datasets: Multi-Domain Datasets

[Figure: three data matrices: Genes × Diseases, Genes × Drugs, Drugs × Adverse Reactions]

Page 4: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Nature of Distributed Datasets: Multi-Domain Datasets

[Figure: three data matrices: Documents × Keywords, Documents × Cited-Documents, Keywords × Topics]

Page 5: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Types of Patterns

• Decision Trees
• Association Rules
• Principal Component Analysis
• K-Nearest Neighbor Analysis
• Clusters
– Hierarchical
– K-Means
– Subspace

Page 6: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Nature of Clusters

Patterns ≡ Unsupervised, Data Driven, Clusters

Single-Domain Clustering

[Figure: Genes × Diseases matrix, shown twice]

Clusters of similar genes, in the context of diseases

Clusters of similar diseases, in the context of genes

Clusters may be:
- Mutually Exclusive
- Overlapping

Page 7: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Nature of Patterns

Simultaneous Two-Domain Clustering

[Figure: Genes × Diseases matrix; labels G and D]

A cluster of similar genes, in a subspace of diseases

A cluster of similar diseases, in a subspace of genes

Options:
- Exhaustive in one domain
- Exhaustive in both domains
- Mutually exclusive clusters in one or both domains
- Overlapping clusters/subspaces in both domains

Page 8: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Nature of Patterns

Simultaneous Three (Multi)-Domain Clustering

[Figure: Genes × Diseases and Genes × Drugs matrices, each shown twice]

Match “genes” subsets in two clusters

Phase-III of this research

Page 9: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Part-I

Patterns in Vertically Distributed Databases

Page 10: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Learning Decision Trees

D = D1 × D2 × . . . × Dn
- D is implicitly specified

Goal: Build a decision tree for the implicit D, using the explicit Di’s

Vertically Partitioned Dataset (geographically distributed databases):

D1 (A, B, C)    D2 (C, D, E)    Dn (A, E, G)

Limitations:
- Can’t move Di’s to a common site
  - Size / communication cost / privacy
- Can’t update local databases
- Can’t send actual data tuples

Page 11: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Explicit and Implicit Databases

Implicit Database

A B C D E F
1 6 1 1 2 1
1 6 1 1 2 3
1 6 2 1 1 2
1 2 2 1 1 2
2 6 1 1 2 1
2 6 1 1 2 3

Explicit Component Databases

Node 1    Node 2    Node 3
A B C     D C F     A E C
1 2 2     1 2 2     1 1 2
1 6 2     1 1 1     1 2 1
1 6 1     1 1 3     2 2 1
2 6 1     - - -     - - -

SharedSet

A C
1 1
1 2
2 1
2 2

Page 12: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Decomposition of Computations

- Since D is implicit, for a computation:

$R = F(D)$

- Decompose F into G and g’s:

$R = G[g_1(D_1), g_2(D_2), \ldots, g_n(D_n)]$

- Decomposition depends on
  - F
  - Di’s and the set of shared attributes

D1 (A, B, C)    D2 (C, D, E)    Dn (A, E, G)

Page 13: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Count All Tuples in Implicit D

$R = \#\mathrm{tuples}(D)$

$R = \sum_{j=1}^{m} \prod_{i=1}^{n} N(D_i)_{cond_j}$

– cond_j: the j-th tuple in Shareds (j = 1, ..., m)

– n: number of databases (Di’s)

– N(D_i)_{cond_j}: count of tuples in D_i satisfying cond_j

– Local computation: g_i(D_i) = N(D_i)_{cond_j}

– G is a sum-of-products

– If each Di knows the “shared” values, then
  • Only one message per site is needed for #tuples

Shareds

A C
1 1
1 2
2 1
2 2

L shared attributes with k values each: $k^L$ tuples
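The sum-of-products count can be sketched in a few lines. This is only an illustration, not the author's implementation: the three component tables are an assumed reading of the "Explicit Component Databases" slide, and the function names `local_counts` and `count_implicit_tuples` are hypothetical.

```python
from collections import Counter
from itertools import product

# Assumed component tables (one list of rows per site); A and C are the shared attributes.
node1 = [{"A": 1, "B": 2, "C": 2}, {"A": 1, "B": 6, "C": 2},
         {"A": 1, "B": 6, "C": 1}, {"A": 2, "B": 6, "C": 1}]
node2 = [{"D": 1, "C": 2, "F": 2}, {"D": 1, "C": 1, "F": 1}, {"D": 1, "C": 1, "F": 3}]
node3 = [{"A": 1, "E": 1, "C": 2}, {"A": 1, "E": 2, "C": 1}, {"A": 2, "E": 2, "C": 1}]
shared = ["A", "C"]

def local_counts(table, shared_attrs):
    """g_i: count local tuples per combination of the shared attributes this site holds."""
    held = [a for a in shared_attrs if a in table[0]]
    return held, Counter(tuple(row[a] for a in held) for row in table)

def count_implicit_tuples(tables, shared_attrs, domains):
    """G: sum over shared-value conditions of the product of the local counts."""
    summaries = [local_counts(t, shared_attrs) for t in tables]
    total = 0
    for cond in product(*(domains[a] for a in shared_attrs)):
        values = dict(zip(shared_attrs, cond))
        term = 1
        for held, counts in summaries:
            term *= counts[tuple(values[a] for a in held)]  # missing key counts as 0
        total += term
    return total

n_tuples = count_implicit_tuples([node1, node2, node3], shared, {"A": [1, 2], "C": [1, 2]})
print(n_tuples)
```

Each site only ships its small table of counts, never its tuples, which is the point of the decomposition.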

Page 14: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Learning Decision Trees

Consists of various counts only:

$E = \sum_{b} \frac{N_b}{N} \sum_{c} \left( -\frac{N_{bc}}{N_b} \log_2 \frac{N_{bc}}{N_b} \right)$

ID3 Algorithm: test a = ? with b branches (a1, a2, ..., ab); c classes in the dataset

N_bc and N_b can be computed using g and G as for #tuples
- one message per database is needed for computing each entropy value
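As a small illustration that the entropy needs nothing but counts, the sketch below evaluates the standard ID3 split entropy from a table of N_bc values. The function name `split_entropy` and the sample counts are made up for the example.

```python
import math

def split_entropy(counts):
    """counts[b][c] = N_bc, the number of tuples in branch b that belong to class c."""
    n = sum(sum(branch) for branch in counts)
    total = 0.0
    for branch in counts:
        n_b = sum(branch)
        for n_bc in branch:
            if n_bc:  # the term for an empty cell is taken as 0
                total -= (n_b / n) * (n_bc / n_b) * math.log2(n_bc / n_b)
    return total

pure = split_entropy([[4, 0], [0, 2]])   # every branch holds a single class
mixed = split_entropy([[2, 2], [1, 1]])  # every branch is an even two-class mix
print(pure, mixed)
```

In the distributed setting each N_bc would itself come from the sum-of-products count rather than from a local table.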

Page 15: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Compute Covariance Matrix for D

• Covariance matrix for D

– Needed for eigenvectors / principal components

– Needs second order moments: $E[(x_i - \bar{x}_i)(x_j - \bar{x}_j)]$

– Helps compute terms of the type: $\sum_{t \in D} x_t y_t$

– This matrix can be computed at one of the databases

Page 16: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

G-and-g Decomposition for 2nd order moments

• Sum of products for two attributes: $\sum_{t \in D} x_t y_t$

• Six different ways in which x and y may be distributed
– Each requires a different decomposition

– Case 1: x same as y; and x belongs to the SharedSet.
$\sum_{j} x_j^2 \cdot \mathrm{Count}(x_j \text{ in } D)$

– Case 2: x same as y; and x does not belong to the SharedSet.
$\sum_{k} \mathrm{Avg}(x^2 \text{ for } cond_k) \cdot \mathrm{Count}(cond_k)$

– Case 3: x and y both belong to the SharedSet.
$\sum_{k} x_k \cdot y_k \cdot \mathrm{Count}(cond_k \text{ in Shared})$

Page 17: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Sum of Products

– Case 4: x belongs to SharedSet and y does not.
$\sum_{j} x_j \cdot y \cdot \mathrm{Count}(x = x_j)$

– Case 5: x, y don’t belong to the SharedSet and reside on different nodes.
• For each tuple t in SharedSet, obtain $x(t),\ y(t)$
• and then $\sum_{t} \mathrm{Sum}(x(t)) \cdot \mathrm{Sum}(y(t))$

– Case 6: x, y don’t belong to the SharedSet and reside on the same node.
$\sum_{t} \mathrm{Prod}(t) \cdot \mathrm{Count}(t)$, where Prod(t) is the average of the product of x and y for cond-t of SharedSet
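Case 5 can be checked on a toy example. The sketch below assumes just two sites and a single shared attribute `s`; the tables and function names are hypothetical. It compares the decomposed sum with the value obtained by materialising the join, which the distributed setting does not allow in practice.

```python
from collections import defaultdict

# Hypothetical two-site data: x lives on site 1, y on site 2, both hold shared attribute s.
site1 = [{"s": 1, "x": 2.0}, {"s": 1, "x": 3.0}, {"s": 2, "x": 5.0}]
site2 = [{"s": 1, "y": 10.0}, {"s": 2, "y": 1.0}, {"s": 2, "y": 4.0}]

def sums_by_condition(table, attr):
    """Local summary: Sum(attr) for each value t of the shared attribute."""
    sums = defaultdict(float)
    for row in table:
        sums[row["s"]] += row[attr]
    return sums

def sum_of_products(site_x, site_y):
    """Decomposed form: sum over t of Sum(x(t)) * Sum(y(t))."""
    sx, sy = sums_by_condition(site_x, "x"), sums_by_condition(site_y, "y")
    return sum(sx[t] * sy[t] for t in sx.keys() & sy.keys())

def sum_of_products_by_join(site_x, site_y):
    """Reference value computed on the explicit join of the two tables."""
    return sum(r1["x"] * r2["y"] for r1 in site_x for r2 in site_y if r1["s"] == r2["s"])

decomposed = sum_of_products(site1, site2)
explicit = sum_of_products_by_join(site1, site2)
print(decomposed, explicit)
```

With more than two sites, each term would also be multiplied by the tuple counts of the remaining sites for the same condition.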

Page 18: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Nearest Neighbor Algorithm

Find nearest neighbor of r1 in D1

• with virtual extensions in D for all tuples in D1

• Need to Compute all pair wise distances• The same distance values can be used for clustering algorithms

Page 19: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Problem: Closed-Loops in Databases

Page 20: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Extracting Communication Graph

The learner is D1

Covariance, k-NN, etc. algorithms developed for this situation

Page 21: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Part-II

Subspace Clusters and Lattice Organization

Page 22: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Clustering in Multi-Domains

• Example 3-D dataset with 4 clusters.
• Each cluster is in 2-D
• Points from two subspace clusters can be very close, making traditional clustering algorithms inapplicable.
• Overlapping between clusters

Page 23: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Subspace Clustering

• “Interestingness” of a subspace cluster:
– Domain dependent / user defined
– Similarity-based clusters

Page 24: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Subspace Clusters

• Number of Subspaces: $\sum_{k=1}^{30} \binom{30}{k} = 2^{30} - 1$

TID | A1..A10             | B1..B10             | C1..C10
1   | 1 1 1 1 1 1 1 1 1 1 | 0 0 0 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0 0 0
2   | 1 1 1 1 1 1 1 1 1 1 | 0 0 0 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0 0 0
3   | 1 1 1 1 1 1 1 1 1 1 | 0 0 0 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0 0 0
4   | 1 1 1 1 1 1 1 1 1 1 | 0 0 0 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0 0 0
5   | 1 1 1 1 1 1 1 1 1 1 | 0 0 0 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0 0 0
6   | 0 0 0 0 0 0 0 0 0 0 | 1 1 1 1 1 1 1 1 1 1 | 0 0 0 0 0 0 0 0 0 0
7   | 0 0 0 0 0 0 0 0 0 0 | 1 1 1 1 1 1 1 1 1 1 | 0 0 0 0 0 0 0 0 0 0
8   | 0 0 0 0 0 0 0 0 0 0 | 1 1 1 1 1 1 1 1 1 1 | 0 0 0 0 0 0 0 0 0 0
9   | 0 0 0 0 0 0 0 0 0 0 | 1 1 1 1 1 1 1 1 1 1 | 0 0 0 0 0 0 0 0 0 0
10  | 0 0 0 0 0 0 0 0 0 0 | 1 1 1 1 1 1 1 1 1 1 | 0 0 0 0 0 0 0 0 0 0
11  | 0 0 0 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0 0 0 | 1 1 1 1 1 1 1 1 1 1
12  | 0 0 0 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0 0 0 | 1 1 1 1 1 1 1 1 1 1
13  | 0 0 0 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0 0 0 | 1 1 1 1 1 1 1 1 1 1
14  | 0 0 0 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0 0 0 | 1 1 1 1 1 1 1 1 1 1
15  | 0 0 0 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0 0 0 | 1 1 1 1 1 1 1 1 1 1

Page 25: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Nature of Real Datasets

1 0 1 1 0 0 0 1 1 1 1 0 0 0 0

0 0 0 1 1 1 1 1 1 0 0 0 0 0 0

0 1 0 1 0 1 0 1 1 1 0 0 0 1 1

1 0 0 0 0 0 0 0 1 1 1 1 0 0 1

1 1 0 0 0 0 1 1 1 1 1 1 0 0 0

0 0 0 0 0 0 0 0 0 1 1 1 1 1 0

0 1 0 0 0 0 0 1 1 1 1 1 0 0 1

0 1 0 1 0 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 0 0 0 0 0 0 0 0 0

0 0 0 0 1 1 1 1 1 1 1 1 1 0 0

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1

1 1 1 1 0 0 0 0 0 0 0 0 0 0 0

5 6 3 4 5 1 2 5 4 6 5 7 6 7 5

6 8 8 9 9 9 7 6 5 4 3 2 1 2 3

4 3 2 1 2 3 4 5 6 7 8 7 6 5 6

7 8 7 6 5 6 7 8 9 0 9 8 7 6 0

0 9 8 0 9 8 7 6 5 4 3 2 3 4 5

4 3 4 5 0 0 0 0 0 9 8 7 6 5 4

3 4 5 6 5 4 3 4 3 4 3 6 3 7 2

7 3 9 0 7 0 1 5 3 4 6 5 4 3 7

3 9 6 3 9 0 0 5 4 0 4 3 2 2 2

2 7 8 9 0 9 8 7 6 5 6 7 8 7 6

5 4 3 4 5 6 7 4 2 8 5 9 5 7 2

4 6 4 6 7 7 8 4 6 3 3 1 0 0 1

1 5 5 5 4 4 7 7 8 9 6 4 3 2 0

6 0 7 6 8 4 5 7 3 3 3 4 7 6 8

6 7 6 9 2 5 3 7 5 1 0 4 8 3 5

Examples: Genes--Diseases; person-MovieRating; Document-TermFrequency

Page 26: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Lattice of Subspaces: Formal Concept Analysis

Each node is an attribute set followed by its row ids.

null
A {1,2,4}   B {1,2,3}   C {1,2,3,4}   D {2,4,5}   E {3,4,5}
AB {1,2}   AC {1,2,4}   AD {2,4}   AE {4}   BC {1,2,3}   BD {2}   BE {3}   CD {2,4}   CE {3,4}   DE {4,5}
ABC {1,2}   ABD {2}   ABE {}   ACD {2,4}   ACE {4}   ADE {4}   BCD {2}   BCE {3}   BDE {}   CDE {4}
ABCD {2}   ABCE {}   ABDE {}   ACDE {4}   BCDE {}
ABCDE {}

Parallel to the ideas of Formal Concept Analysis

1. Need algorithms to find interesting subspace clusters
2. Lattice provides much more insight into the dataset.

Page 27: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Clusters in Subspaces

Clusters in overlapping subspaces

  a b c d e
1 1 1 1 0 1
2 1 1 1 0 1
3 1 0 1 1 1
4 0 0 1 1 1
5 1 0 1 1 1

[Figure: this table repeated five times; labels a, b, d]

Density = number of rows
- An antimonotonic property

Page 28: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Value of (anti)monotonic Properties

If AB falls below the needed density, then so do all of its descendants (its supersets), so they can be pruned without being examined.

null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE

Pruned supersets of AB: ABC ABD ABE ABCD ABCE ABDE ABCDE
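The pruning idea can be sketched as a small levelwise search. This is an Apriori-style illustration rather than the talk's algorithm; the five rows and the threshold of 3 rows are assumptions chosen so that AB is too sparse and ABC is therefore never counted.

```python
from itertools import combinations

# Assumed example data: each row is given as the set of attributes that are 1.
rows = {1: set("ABC"), 2: set("ABCD"), 3: set("BCE"), 4: set("ACDE"), 5: set("DE")}

def support(attrs):
    """Density as a row count: number of rows containing every attribute in attrs."""
    return sum(1 for items in rows.values() if attrs <= items)

def levelwise(min_rows):
    level = [frozenset(a) for a in "ABCDE" if support({a}) >= min_rows]
    dense = list(level)
    while level:
        current = set(level)
        size = len(level[0]) + 1
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Anti-monotone pruning: keep a candidate only if every subset one level down is dense.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, size - 1))}
        level = [c for c in candidates if support(c) >= min_rows]
        dense.extend(level)
    return dense

dense = levelwise(min_rows=3)
print(sorted("".join(sorted(s)) for s in dense))
```

Here AB has only two supporting rows, so ABC is discarded at candidate generation without touching the data.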

Page 29: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

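Maximal and Closed Subspaces

Each node is an attribute set followed by its row ids.

null
A {1,2,4}   B {1,2,3}   C {1,2,3,4}   D {2,4,5}   E {3,4,5}
AB {1,2}   AC {1,2,4}   AD {2,4}   AE {4}   BC {1,2,3}   BD {2}   BE {3}   CD {2,4}   CE {3,4}   DE {4,5}
ABC {1,2}   ABD {2}   ABE {}   ACD {2,4}   ACE {4}   ADE {4}   BCD {2}   BCE {3}   BDE {}   CDE {4}
ABCD {2}   ABCE {}   ABDE {}   ACDE {4}   BCDE {}
ABCDE {}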

Minimum support = 2

# Closed = 9

# Maximal = 4

Closed and maximal

Closed but not maximal
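The two counts on this slide can be reproduced by brute force. The sketch below assumes the five rows implied by the row ids in the lattice (row 1 = ABC, row 2 = ABCD, row 3 = BCE, row 4 = ACDE, row 5 = DE); it is only practical for tiny attribute sets.

```python
from itertools import combinations

# Rows reconstructed from the row ids shown in the lattice (an assumed reading).
rows = {1: set("ABC"), 2: set("ABCD"), 3: set("BCE"), 4: set("ACDE"), 5: set("DE")}
MIN_SUPPORT = 2

def row_ids(attrs):
    return frozenset(r for r, items in rows.items() if set(attrs) <= items)

# Every attribute set that meets the minimum support, with its row ids.
frequent = {frozenset(c): row_ids(c)
            for k in range(1, 6) for c in combinations("ABCDE", k)
            if len(row_ids(c)) >= MIN_SUPPORT}

# Closed: no proper superset has the same row ids (such a superset would also be frequent).
closed = [s for s, ids in frequent.items()
          if not any(s < t and ids == frequent[t] for t in frequent)]
# Maximal: no proper superset is frequent at all.
maximal = [s for s in frequent if not any(s < t for t in frequent)]

print(len(closed), len(maximal))
```

Every maximal set is also closed, which is why the slide distinguishes "closed and maximal" from "closed but not maximal".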

Page 30: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Siblings and Parents in Lattice

Merge lattice nodes to find clusters of other properties

  a b c d e
1 1 1 1 0 1
2 1 1 1 0 1
3 1 0 1 1 1
4 0 0 1 1 1
5 1 0 1 1 1

[Figure: this table repeated five times]

C1 =<{1,2,3,4,5}, {a,c,d,e}>

C2 =<{3,4,5}, {a,c,d,e}>

Siblings in Lattice

Page 31: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Goal: Subspace Clusters with Properties

In the examples, a.2 denotes attribute a having value 2.

Anti-monotonic properties:

– Minimum density: density(C=<O,A>) := |O| / total number of objects in the data
  e.g.: density(C=<{o2,o3,o4},{a.2}>) = 3/5 = 0.6

– Succinctness: density is strictly smaller than that of all of its minimum generalizations
  e.g.: C1=<{o2,o3,o4},{a.2 c.2}> is not succinct; C2=<{o3,o4},{b.2 c.2}> is succinct

– Numerical properties (row-wise): “max”, “min”
  e.g.: C2=<{o1,o5},{b.2 c.4}> satisfies “max > 3”

Weak anti-monotonic properties:

– “average >= δ”, “average <= δ”, “variance >= δ”, “variance <= δ”
  e.g.: C2=<{o1,o5},{b.2 c.4}> satisfies “average >= 3”, but both C3=<{o1,o3,o4,o5},{b.2}> and C4=<{o1,o5},{b.2 c.4 d.2}> violate “average >= 3”

a b c d

o1 5 2 4 2

o2 2 1 2 2

o3 2 2 2 2

o4 2 2 2 2

o5 3 2 4 2
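The examples on this slide can be checked against the small table with a short sketch. The representation of a cluster as a dictionary of attribute values and the helper names are assumptions made for illustration.

```python
# The 5 x 4 example table from the slide.
table = {
    "o1": {"a": 5, "b": 2, "c": 4, "d": 2},
    "o2": {"a": 2, "b": 1, "c": 2, "d": 2},
    "o3": {"a": 2, "b": 2, "c": 2, "d": 2},
    "o4": {"a": 2, "b": 2, "c": 2, "d": 2},
    "o5": {"a": 3, "b": 2, "c": 4, "d": 2},
}

def objects_of(pattern):
    """Objects whose values match the pattern, e.g. {"b": 2, "c": 4} for b.2 c.4."""
    return {o for o, row in table.items() if all(row[a] == v for a, v in pattern.items())}

def density(pattern):
    return len(objects_of(pattern)) / len(table)

def average(pattern):
    """Row-wise average; every row of the cluster carries exactly the pattern values."""
    return sum(pattern.values()) / len(pattern)

def is_succinct(pattern):
    """Density strictly smaller than that of every generalization with one attribute dropped."""
    d = density(pattern)
    gens = [{a: v for a, v in pattern.items() if a != drop} for drop in pattern]
    return all(d < density(g) for g in gens if g)

print(density({"a": 2}), is_succinct({"a": 2, "c": 2}), is_succinct({"b": 2, "c": 2}))
print(average({"b": 2, "c": 4}), average({"b": 2}), average({"b": 2, "c": 4, "d": 2}))
```

The last line shows why "average >= 3" is only weakly anti-monotonic: b.2 c.4 satisfies it while both a generalization (b.2) and a specialization (b.2 c.4 d.2) fail.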

Page 32: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Levelwise Search

• Pruning with weak anti-monotonic properties
  e.g.: if C2=<{o1,o5},{b.2 c.4}> satisfies “average >= 3”, then o1 and o5 must be contained in at least one of its minimum generalizations that satisfies this constraint: C5=<{o1,o5},{c.4}>

• If an object is not contained in any cluster of size k that satisfies a weak anti-monotonic property, it cannot be contained in any cluster of size k+1 that satisfies this property

a b c d

o1 5 2 4 2

o2 2 1 2 2

o3 2 2 2 2

o4 2 2 2 2

o5 3 2 4 2

Page 33: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Levelwise Search for Subspace Clusters

• Anti-monotonic & Weak Anti-monotonic

– Candidate generation based on anti-monotonic properties only

– Data reduction based on weak anti-monotonic properties, such as: “mean>=δ”, “mean<=δ”, “variance>=δ”, “variance< =δ”

Page 34: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Performance Comparison

• Optimizing Techniques
– Sorting the attributes
– Reuse previous results
– Stack of unpromising branches
– Check closure property

Page 35: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Distributed Subspace Clustering

• Discover closed subspace clusters from databases located at multiple sites

• Objectives:
– Minimize local computation cost
– Minimize communication cost

Page 36: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Distributed Subspace Clustering

• Horizontally Partitioned Data

DS (all rows)
  a b c d
1 0 0 1 1
2 1 0 1 1
3 1 1 1 0
4 0 0 1 1
5 1 1 0 0

D1 (rows 1-3)
  a b c d
1 0 0 1 1
2 1 0 1 1
3 1 1 1 0

D2 (rows 4-5)
  a b c d
4 0 0 1 1
5 1 1 0 0

Page 37: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Distributed Subspace Clustering

List of Closed Subspace Clusters

DS: <c,1234>, <cd,124>, <a,235>, <ac,23>, <acd,2>, <ab,35>, <abc,3>
D1: <c,123>, <cd,12>, <ac,23>, <acd,2>, <abc,3>
D2: <cd,4>, <ab,5>

Lemma 1: All locally closed attribute sets are also globally closed

Lemma 2: The intersection of two locally closed attribute sets from two different sites is globally closed, e.g.: a = ac ∩ ab

Page 38: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Distributed Subspace Clustering

List of Closed Subspace Clusters

DS: <c,1234>, <cd,124>, <a,235>, <ac,23>, <acd,2>, <ab,35>, <abc,3>
D1: <c,123>, <cd,12>, <ac,23>, <acd,2>, <abc,3>
D2: <cd,4>, <ab,5>

Compute the object set:

1. Closed at both partitions: compute the union of the two object sets, e.g.: cd

2. Closed in one of the partitions: the union of two object sets whose attribute sets’ intersection equals the target attribute set, e.g.: c = c ∩ cd

3. Not closed in any of the partitions: similar to case 2, e.g.: a = ac ∩ ab
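The two lemmas and the three cases can be sketched together. The code below is an illustration under assumptions: local closed sets are mined by brute force, and the helper names are hypothetical. Taking the largest local object set among locally closed supersets covers all three cases at once.

```python
from itertools import combinations

# Rows of the two horizontal partitions, as sets of attributes that are 1.
D1 = {1: frozenset("cd"), 2: frozenset("acd"), 3: frozenset("abc")}
D2 = {4: frozenset("cd"), 5: frozenset("ab")}

def closed_sets(rows):
    """Brute-force local miner: attribute set -> object set, for every non-empty closed set."""
    result = {}
    for k in range(1, len(rows) + 1):
        for group in combinations(rows, k):
            attrs = frozenset.intersection(*(rows[r] for r in group))
            if attrs:
                result[attrs] = frozenset(r for r in rows if attrs <= rows[r])
    return result

def local_objects(attrs, local):
    """Largest local object set among locally closed supersets of attrs (empty if none)."""
    supersets = [objs for closed, objs in local.items() if attrs <= closed]
    return max(supersets, key=len, default=frozenset())

def merge(local1, local2):
    # Lemma 1: local closed sets are globally closed. Lemma 2: so are cross-site intersections.
    candidates = set(local1) | set(local2)
    candidates |= {a & b for a in local1 for b in local2 if a & b}
    return {s: local_objects(s, local1) | local_objects(s, local2) for s in candidates}

merged = merge(closed_sets(D1), closed_sets(D2))
for attrs, objs in sorted(merged.items(), key=lambda kv: sorted(kv[0])):
    print("".join(sorted(attrs)), sorted(objs))
```

Only the lists of locally closed clusters cross the network here; the rows themselves stay at their sites.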

Page 39: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Distributed Subspace Clustering

List of Closed Subspace Clusters

DS: <c,1234>, <cd,124>, <a,235>, <ac,23>, <acd,2>, <ab,35>, <abc,3>
D1: <c,123>, <cd,12>, <ac,23>, <acd,2>, <abc,3>
D2: <cd,4>, <ab,5>

Problem (for both case 2 and case 3): the same attribute set can be obtained from different intersections, e.g.:

a = ac ∩ ab and a = acd ∩ ab

Solution:

for each globally closed attribute set, keep track of the largest object set (or the size of the object set)

Page 40: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Distributed Subspace Clustering

Density Constraint: δ >= 0.6

DS: F: <c,1234>, <cd,124>, <a,235>; E: <ac,23>, <acd,2>, <ab,35>, <abc,3>
D1: F1: <c,123>, <cd,12>, <ac,23>; E1: <acd,2>, <abc,3>
D2: F2: (empty); E2: <cd,4>, <ab,5>

Observation:

The intersection of two elements that both come from the Ei’s cannot have enough density

Efficient Computation:

Sort Fi and Ei in order of decreasing density

Page 41: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Distributed Subspace Clustering

R R1 R2

<c,1234>

<cd,124>

<a,235>

<c,123>

<cd,12>

<ac,23>

<c,1234>

<cd,124>

<a,235>

Page 42: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Distributed Subspace Clustering

• Generalize to k>2
– k sites need k steps of communication and computation
– k sites have k types:

Page 43: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Distributed Subspace Clustering

• K=3

Page 44: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Distributed Subspace Clustering

Page 45: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Distributed Subspace Clustering

Page 46: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Part-III

Multi-Domain Clusters

Page 47: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Introduction

Traditional clustering

Bi-Clustering

3-Clustering

Page 48: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Why 3-clusters?

• Correspondence between bi-clusters of two different lattices

• Sharpen local clusters with outside knowledge

• Alternative? “Join datasets then search”
– Does not capture underlying interactions
– Inefficient
– Not always possible

Page 49: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Formal Definitions

Bi-cluster in Di

3-Cluster across D1 and D2

Pattern in Di

Page 50: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Defining 3-clusters

• D1 is the “learner”

• Maximal rectangle of 1's under suitable permutation in learner

• Best correspondence to a rectangle of 1's in D2

[Figure: matrices labelled D1]

Page 51: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Cluster Quality Measure

• Intuition: Maximize number of 1's while also maximizing number of items and objects

• Trade-off between objects and items
– More items ... fewer objects
– More objects ... fewer items

Page 52: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Quality Measure

–Consider bi-clusters in learner alone

I1

O C1

C2

• Which is preferable?
• User decides

Page 53: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Measure of Cluster Quality

• Quality measure:
– Monotonic in both width and height
– Balances width and height according to a user-defined parameter

• Introduce β: the width (attributes) the user is willing to trade for a single unit of height (objects)

Page 54: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Cluster Quality Measure

Page 55: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Cluster Quality for 3-clusters

• Utilize same intuition
• Width of a 3-cluster is the sum of the individual widths

Page 56: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Selecting β

• Larger values yield 3-clusters that are “wide” and “short” in both D1 and D2
– Cluster key websites popular with a large number of democrats and republicans

• Smaller values produce 3-clusters that are “narrow” and “long”
– Discover a long list of websites utilized by a select few democrats and republicans

Page 57: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

3-Clu: Our Algorithm

• Search for 3-clusters similar to search for closed itemsets

• How to formulate the search space?
– Assumption that objects outnumber attributes may not hold
– Several possible orderings of the search space

Page 58: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

3-Clu Algorithm

Page 59: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

3-Clu Algorithm

• Define search space with primacy to objects

• Only need to maintain one search tree
• Mimic closed itemset algorithm with simultaneous pruning of search space
• Prune with quality measure

Page 60: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

3-Clu Algorithm

Page 61: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Algorithm

• Quality measure is neither monotone nor anti-monotone in the search space

• Pruning is still possible

Is C2 of higher quality ?

Page 62: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Algorithm

Page 63: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Experimental Results

Chess Connect GO-Pheno

Page 64: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Experimental Results

• Test validity of 3-clusters

• Randomly partitioned Mushrooms dataset by attributes

Page 65: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Discriminating Clusters

• Key question: What sets of attributes and /or objects most distinguish the incremental bi-clusters from each other?

• Incremental bi-clusters that only differ slightly may be a result of noise or human error

• Prioritize relationships among incremental bi-clusters

A B C D

1 0 1 0 1

2 1 1 0 0

3 0 1 1 1

4 0 0 1 1

Bi-cluster lattice nodes (objects, attributes):

<{}, {A,B,C,D}>
<{2}, {A,B}>
<{1,2,3,4}, {}>
<{3,4}, {C,D}>
<{1,2,3}, {B}>
<{1,3,4}, {D}>
<{1,3}, {B,D}>

Page 66: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Motivation

tf1 tf2 tf3 tf4

g1 1 0 1 1

g2 1 1 1 0

g3 0 1 0 1

g4 1 0 0 0

g5 1 0 0 1

• Functional genomics
• Interactions between genes and transcription factors
• Comparing each bi-cluster in the lattice tells us the difference in activation of genes/TFs that transform cellular processes
• Prioritize relationships

Page 67: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Related Work

• Emerging patterns [Dong, Li]
– Only consider ratio of support between frequent itemsets
– Supervised technique

• Contrast sets [Bay, Pazzani]
– Also supervised
– Special case of rule discovery

• Closed Itemset algorithms [Zaki, Uno, Bian]
– Efficient
– Do not explicitly enumerate lattice structure

Page 68: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Problem Formulation

• Challenges
– Enumerating bi-clusters and forming the lattice is known to be an NP-Complete problem
– Discover distinguishing sets during the mining process, as opposed to in a post-processing step
– How to quantify distinction

Page 69: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Problem Formulation

• Dataset D=(O,A,R)

• Consider a set of objects X
• X’ is defined as all attributes common to all objects of X
• Y’ is defined dually for a set of attributes Y

• Bi-cluster: (X,Y) s.t. X’ = Y and Y’ = X
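The two derivation operators are easy to state in code. The sketch below uses the 4 × 4 example table that appears elsewhere in the talk (object 1 has B and D, object 2 has A and B, and so on); the function names are hypothetical.

```python
# Binary relation R from the 4 x 4 example table: object -> attributes with a 1.
R = {1: {"B", "D"}, 2: {"A", "B"}, 3: {"B", "C", "D"}, 4: {"C", "D"}}
ATTRS = {"A", "B", "C", "D"}

def common_attributes(objects):
    """X' : attributes shared by every object in X (all attributes when X is empty)."""
    result = set(ATTRS)
    for o in objects:
        result &= R[o]
    return result

def common_objects(attributes):
    """Y' : objects that have every attribute in Y."""
    return {o for o, attrs in R.items() if set(attributes) <= attrs}

def is_bicluster(X, Y):
    """(X, Y) is a bi-cluster exactly when X' = Y and Y' = X."""
    return common_attributes(X) == set(Y) and common_objects(Y) == set(X)

print(is_bicluster({3, 4}, {"C", "D"}))  # the sample bi-cluster from the talk
print(is_bicluster({1, 3}, {"B"}))       # objects 1 and 3 also share D, so this one fails
```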

Page 70: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Problem Formulation

  A B C D
1 0 1 0 1
2 1 1 0 0
3 0 1 1 1
4 0 0 1 1

• Sample bi-cluster: <{3,4},{C,D}>

• Bi-clusters are equivalent to
  • Maximal rectangles of 1s under suitable permutation
  • Maximal bi-cliques

Page 71: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Problem Formulation

• Set of bi-clusters forms a complete lattice

• Model lattice as weighted directed graph

• Weights represent degree of distinction

• Each edge represents a distinguishing set

• Grow maximum cost spanning tree

Page 72: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Problem Formulation

• Quantifying distinction:
– View bi-clusters as maximal rectangles of 1s under suitable permutation
– Consider both change in width and height when computing distinction
– Choose a shape metric s (e.g. area, ratio of height to width, etc.)
– Quantify distinction as degree of shape change along a path in the lattice

  A B C D
1 0 1 0 1
2 1 1 0 0
3 0 1 1 1
4 0 0 1 1

Page 73: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Problem Model

Compute partial derivatives as forward differences

<{2},{A,B}>

<{1,2,3},{B}>

Page 74: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Our Algorithm (MIDS)

• Input: Dataset D

• Output: Maximum cost spanning tree of bi-cluster lattice of D

• Computational challenge: enumerating bi-cluster lattice and growing tree simultaneously

Page 75: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Our Algorithm (MIDS)

• Most min/max cost spanning tree algorithms assume availability of graph

• Prim’s algorithm depends on the Cut set

• Intuitive idea: Grow bi-cluster lattice incrementally and maintain the Cut set

Page 76: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Our Algorithm (MIDS)

• Prim’s grows sequence of trees

• Denote set of edges between bi-cluster c and all upper neighbors of c that do not appear in current tree by

• Df

• Dynamically compute Cut, while enumerating new bi-clusters

Page 77: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Our Algorithm (MIDS)

1. Choose starting bi-cluster c (usually infimum)
2. Compute cut set by generating upper neighbors of c together with update equation
3. Compute weight of edges between c and upper neighbors
4. Greedily choose maximum cost edge and associated concept d
5. Set c=d, repeat steps 2-5 until all reachable bi-clusters visited
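The steps above can be read as Prim's algorithm run on a graph that is discovered while the tree grows. The sketch below is one such reading, not the MIDS implementation: upper neighbors are found by brute force, and the edge weight (absolute change in rectangle area) is a stand-in for the shape metric, which the slides leave to the user.

```python
import heapq
from itertools import count

R = {1: {"B", "D"}, 2: {"A", "B"}, 3: {"B", "C", "D"}, 4: {"C", "D"}}
ATTRS = frozenset("ABCD")

def intent(extent):
    """Attributes common to all objects of the extent."""
    return frozenset(ATTRS.intersection(*(R[o] for o in extent)))

def closure(objects):
    attrs = intent(objects)
    return frozenset(o for o in R if attrs <= R[o])

def upper_neighbors(extent):
    """Minimal closures of extent plus one extra object (brute force)."""
    cands = {closure(extent | {o}) for o in R if o not in extent}
    return [c for c in cands if not any(d < c for d in cands)]

def weight(lower, upper):
    """Hypothetical distinction measure: absolute change in rectangle area."""
    area = lambda e: len(e) * len(intent(e))
    return abs(area(upper) - area(lower))

def grow_tree(start):
    tree, visited, tie = [], {start}, count()
    # The heap plays the role of the cut set; weights are negated to pop the maximum first.
    heap = [(-weight(start, n), next(tie), start, n) for n in upper_neighbors(start)]
    heapq.heapify(heap)
    while heap:
        _, _, src, dst = heapq.heappop(heap)
        if dst in visited:
            continue
        visited.add(dst)
        tree.append((src, dst))
        for n in upper_neighbors(dst):
            if n not in visited:
                heapq.heappush(heap, (-weight(dst, n), next(tie), dst, n))
    return tree, visited

tree, visited = grow_tree(closure(frozenset()))  # start from the infimum
print(len(visited), len(tree))
```

Bi-clusters are identified here by their object sets only; the attribute set is recomputed on demand with `intent`.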

Page 78: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Algorithms

  A B C D
1 0 1 0 1
2 1 1 0 0
3 0 1 1 1
4 0 0 1 1

[Figure: bi-cluster lattice grown step by step. Nodes shown: <{}, {A,B,C,D}>, <{2}, {A,B}>, <{3,4}, {C,D}>, <{1,3}, {B,D}>, <{1,2,3}, {B}>, <{1,3,4}, {D}>, <{1,2,3,4}, {}>]

Page 79: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Algorithms

• Major computational cost is computing upper neighbors of a bi-cluster

• Theorem: Let <X,Y> be a concept and let o be an object not in X. Then $(X \cup \{o\})''$ is the object set of an upper neighbor of <X,Y> if and only if for all $z \in (X \cup \{o\})'' - X$ the following holds: $(X \cup \{z\})'' = (X \cup \{o\})''$

• Lindig’s algorithm implements this theorem
• Algorithm performs a local computation of bi-clusters and upper neighbors
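The test in the theorem translates almost line by line into code. The following is a plain, unoptimised sketch on the 4 × 4 example table, with hypothetical function names; it is not the improved variant mentioned later in the talk.

```python
R = {1: {"B", "D"}, 2: {"A", "B"}, 3: {"B", "C", "D"}, 4: {"C", "D"}}
ATTRS = frozenset("ABCD")

def closure(objects):
    """X'' : all objects that share every attribute common to X."""
    attrs = ATTRS.intersection(*(R[o] for o in objects))
    return frozenset(o for o in R if attrs <= R[o])

def upper_neighbor_extents(X):
    """Object sets of the upper neighbors of the concept whose object set is X."""
    X = frozenset(X)
    found = set()
    for o in R:
        if o in X:
            continue
        candidate = closure(X | {o})
        # Theorem: candidate is an upper neighbor iff every added object z regenerates it.
        if all(closure(X | {z}) == candidate for z in candidate - X):
            found.add(candidate)
    return found

print(upper_neighbor_extents(set()))  # neighbors of the infimum
print(upper_neighbor_extents({3}))
```

The computation is local: it only needs the current object set and the relation, never the rest of the lattice.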

Page 80: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Algorithms

• Improved Lindig’s algorithm for practical running time
  • Adapted to enumerate only “large” bi-clusters
  • Reduced number of set intersections performed
• Theoretical complexity remains the same
• Overall complexity of MIDS
– E: total number of edges
– N: number of bi-clusters
– O: number of rows
– A: number of columns

Page 81: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Experimental Results

• Compared solution to Zaki’s CHARM-L algorithm
– CHARM-L enumerates bi-clusters, and organizes them into a lattice structure
– Not incremental: hard to adapt to Prim’s algorithm
– Added post-processing step to grow MCST

Page 82: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Experimental Results

Page 83: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Experimental Results

• Experimented with synthetic datasets to find most distinguished incremental bi-clusters

• Preliminary experiments conducted with clearly distinguishable incremental bi-clusters and random noise

• Next planted several large incremental bi-clusters that differed only slightly as a result of noise

Page 84: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Experimental Results

• Region 1 is region of interest, clearly distinct

• Noise added to region 1, while regions 2 and 3 contain minimal distinction

Page 85: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Experimental Results

Page 86: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Computer_Science@UC

A Vision

Raj Bhatnagar

Page 87: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

CS Department @ UC

Department Goals:
• Train CS graduates to meet technical manpower needs in Ohio, and the world at large.
• Contribute to the creation of scholarly knowledge through research

Needed Features:
– Strong Research and Graduate program
– Strong Undergraduate program
– Good visibility and ranking in research communities
– Good reputation in 250-mile region for UG program

Page 88: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Research and Graduate Program

[Figure: CS Dept. Faculty in Core CS Areas, with links to: BME, CHMC (Bio-Informatics); A&S (GIS, Sciences); Engineering, Business (Robotics, BI); Mathematics (Security, Data Mining)]

• CS is an Enabling Science
• Can catalyze research in Science and Engineering
• UC needs a stronger CS program
• Core CS areas to be covered in CS Dept., Foundational research
• Collaborations to be built with other UC departments for research

Page 89: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Core Computing Faculty

Current Number: 12

Areas of Strength:
– Algorithms and CS Theory
– Networks and Communications
– Machine Learning, AI, and Data Mining
– HCI

Gaps in Strength/Coverage:
– Programming Languages/Compilers
– Software Engineering
– Databases
– Computer Systems and Networks

Target Number for a Strong CSD: 18

Page 90: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Potential Collaborations

Bio-Informatics
– Have been research contacts
– Strong presence of Bioinformatics activity at UC
– Need new CS faculty with matching interests

Mathematics
– Computer Security, quantum computing
– Data Mining, privacy preserving operations on data

GIS
– Very active and growing in A&S
– Almost no contact with CS faculty; great research potential

Robotics
– ME has good activity; a recent hire from Duke University
– Great student interest; grad and undergrad

Business
– Strength in databases and interest in collaborations

CAS
– Potential for a minor in IT for CS students and vice-versa

Page 91: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Undergraduate Program

• High priority to increase enrollment/retention
– Translates into more resources now
– Alumni are potential donors
– Need to advertise our strengths on the web and in the local area
– Increase interestingness of available electives, from new faculty and also from collaborating departments
– Increase social capital and sense of belonging within the CS student body

• Strengthen Capstone projects
– Sponsored projects by local industry (item for IAB)
– Advertise them on department website
– Awards for top projects (sponsored by industry)

• Student Experience while at UC
– Support ACM, EEE, and LARC groups for an enriching experience
– Increase UG Research experience opportunities
– Seek funds to provide more scholarships

Page 92: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Graduate Program

Marshal Resources to support Ph.D. students
– Help faculty’s attempts to seek sponsored research
– Seek industry help for sponsored research
– Seek funds to support student travel to conferences

Enhance Students’ Quality of Experience at UC
– More interactions with faculty/students
– More graduate courses
– More support for organizing/travelling to events/conferences

Page 93: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Resources

Need serious efforts for raising resources

– Faculty lines from UC
  • Need to reach critical mass for CSD
  • PBB is a hurdle
  • College support/commitment is needed

– CS Endowment Account (Our money, that can’t be cut!)
  • Approach alumni for donations
  • Seek industry support/sponsorship for scholarships, seminar speakers

– Sponsored research awards from NSF
  • New administration has significantly enhanced funds for NSF
  • Support faculty in efforts to seek these funds

– Seek more UGA funds from CoE
  • Difficult for now; but must continue efforts

Page 94: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Phase-I Summary

Main Results:
• Algorithms for complex operations that work with implicit databases
  • Decision Trees, Association Rules, Covariance matrix
  • K-nearest neighbors (K-NN)
  • Hierarchical and Sequential Clustering
• Algorithms for distributed control of multi-agent systems
  • Distributed Multi-Agent Reorganization

Open Research Issues:
– We preserve data privacy, but we need a formal model and analysis of privacy
– Mining of streaming data at multiple cooperating nodes

Page 95: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Phase-I Research Participants:

– Ph.D. dissertations
  • Ahmed Khedr, 2003
  • Barrington Young, 2007
  • Eric Matson, 2008

– M.S. theses
  • Shriram Srinivasan, 1997
  • Sanjeev Beemidi, 1998
  • Harpreet Singh, 2000
  • Rahul Dasgupta, 2000
  • Susmit Kumar, 2002
  • Rishi Jhaver, 2003
  • Chris Calendar, 2004
  • Michael Kinsey, 2005
  • Kaustubh Shinde, 2006

Page 96: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Phase-I Publications:

1. Ahmed Khedr and Raj Bhatnagar. Agents for Integrating Distributed Data for Complex Computations. Computing and Informatics Journal, Vol. 26, 2007, 149-170.

2. Eric T. Matson, Raj Bhatnagar. Knowledge Sharing Between Agents in a Transitioning Organization. Proceedings of the COIN 2007 published as book Coordination, Organizations, Institutions, and Norms in Agent Systems - III, Springer Verlag, 2007, pp. 187-202.

3. Eric Matson and Raj Bhatnagar. Properties of Capability Based Agent Organization Transition, Proceedings of the Intelligent Agent Technologies (IAT 2006) conference held in Hong Kong in December 2006.

4. Barrington Young and Raj Bhatnagar. Secure K-NN Algorithm for Distributed Databases, Proceedings of the Privacy Security and Trust Conference, 2006, pp. 485-490.

5. Ahmed Khedr and Raj Bhatnagar, Decomposable Algorithms for Minimum Spanning Tree, Presented at the International Workshop on Distributed Computing, December 2003, Springer Verlag notes on Computer Science, vol. 2918.

6. Raj Bhatnagar, Sriram Srinivasan. Pattern Discovery in Distributed Databases. Proceedings of the AAAI-97 Conference held at Providence, RI, in July 1997, pp. 503-508.

Page 97: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Phase-II Summary

Main Results:
• Subspace Clustering Algorithms
  • Efficient Lattice-based search
  • Use of novel monotonic conditions to control search
• Distributed mining across multiple lattices
  • Only for horizontally partitioned datasets
• Applications of Results
  • Genomic datasets, document collections

Open Research Issues:
– Clusters for Non-binary datasets
– Approximately closed clusters

Page 98: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Phase-II Research Participants:

– Ph.D. dissertations
  • Haiyun Bian, 2006
  • Amit Sinha, 2008

– M.S. theses
  • Gautam Kurra, 2002
  • Anshuman Rajshiva, 2004
  • Ramya Ashok, 2005
  • Aparna Yardi, 2006
  • Aravind Kumar, 2006
  • Shriram Narayanswami, 2007
  • Mrunal Deshmukh, 2008

– Collaborations:
  • CHMC

Page 99: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Phase-II Publications:

1. Shriram Narayanswamy, Raj Bhatnagar. A Lattice-Based Model for Recommender Systems. Proceedings of the International Conference on Tools with Artificial Intelligence (ICTAI 2008), pp. 349-356.

2. Haiyun Bian, Raj Bhatnagar. An Algorithm for Mining Weighted Dense Maximal 1-Complete Regions. Data Mining: Foundations and Practice, 31-48, Springer Verlag, 2008.

3. Haiyun Bian, Raj Bhatnagar, and Barrington Young. An Efficient Constraint-Based Closed Set Mining Algorithm. Proceedings of the International Conference on Machine Learning and Applications (ICMLA 2007), pp. 67-72.

4. Barrington Young, Raj Bhatnagar, Giridhar Tatavarty, and Haiyun Bian. Covariance matrix Computations with Federated databases. Proceedings of the International Conference on Machine Learning and Applications (ICMLA 2007), pp. 172-177.

5. Haiyun Bian, Raj Bhatnagar. Efficiently Mining Maximal 1-complete Regions from Dense Datasets. ICDM Workshop on Foundations of Data Mining 2006, Proceedings of ICDM Workshops, pp. 423-427.

6. Haiyun Bian and Raj Bhatnagar. Towards More Supervised Subspace Clustering, Proceedings of the MAICS 2006 conference, held in Valparaiso, OH April 2006.

7. Arvind Muthukrishnan and Raj Bhatnagar. Concept-based Organization and Retrieval of Technical Documents. Proceedings of the MAICS 2006 conference, Valparaiso, OH April 2006.

8. Haiyun Bian and Raj Bhatnagar. An Algorithm for Lattice-Structured subspace clustering, Proceedings of the SIAM International Conference on Data Mining, April 2005.

Page 100: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Phase-III Summary

Main Results:
• 3-Clustering Algorithm
  • Efficient search algorithm
  • Bioinformatics Application, Genomic datasets
• Most Discriminating subsets
  • Efficient algorithm

Open Research Issues:
– Multi-domain datasets with closed loop relationships
– Diagonal band patterns

Page 101: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Phase-III Research Participants:

– Ph.D. dissertations
  • Faris Alqadah, 2010 (very likely)

– Collaborations: CHMC

Page 102: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Phase-III Publications:

1. Faris Alqadah and Raj Bhatnagar. Discovering Substantial Distinctions among Incremental Bi-Clusters. To be presented at the SIAM International Conference on Data Mining (SDM 09) in April 2009.

2. Faris Alqadah and Raj Bhatnagar. An effective algorithm for mining 3-clusters in vertically partitioned data. Proceedings of the CIKM 2008, 1103-1112.

3. Faris Alqadah, Raj Bhatnagar. Detecting significant distinguishing sets among bi-clusters. Proceedings of the CIKM 2008.

Page 103: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Conclusions

• Introduced quantitative measure of distinction among incremental bi-clusters

• Developed efficient algorithm for enumerating bi-clusters and growing maximum cost spanning tree simultaneously

Page 104: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Info from Lattice of Clusters

      tf1  tf2  tf3  tf4
g1     1    0    1    1
g2     1    1    1    0
g3     0    1    0    1
g4     1    0    0    0
g5     1    0    0    1

• Functional genomics
• Interactions between genes and transcription factors
• Comparing each bi-cluster in the lattice tells us the difference in activation of genes/TFs that transform cellular processes
• Prioritize relationships
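To make "bi-clusters in the lattice" concrete, the sketch below enumerates every closed bi-cluster (a maximal all-1s rectangle) of the gene/TF matrix on this slide. It is a brute-force illustration only, not the mining algorithm from the talk; the function names `common_tfs`, `common_genes`, and `enumerate_biclusters` are made up for this example.

```python
from itertools import combinations

# Gene x transcription-factor matrix from the slide (1 = interaction).
TFS = ["tf1", "tf2", "tf3", "tf4"]
MATRIX = {
    "g1": [1, 0, 1, 1],
    "g2": [1, 1, 1, 0],
    "g3": [0, 1, 0, 1],
    "g4": [1, 0, 0, 0],
    "g5": [1, 0, 0, 1],
}


def common_tfs(genes):
    """TFs shared by every gene in `genes` (all TFs if `genes` is empty)."""
    return frozenset(
        tf for j, tf in enumerate(TFS) if all(MATRIX[g][j] for g in genes)
    )


def common_genes(tfs):
    """Genes whose row has a 1 for every TF in `tfs`."""
    idx = [TFS.index(tf) for tf in tfs]
    return frozenset(g for g, row in MATRIX.items() if all(row[j] for j in idx))


def enumerate_biclusters():
    """Brute force: close every TF subset. Exponential, fine for 4 columns."""
    found = set()
    for k in range(len(TFS) + 1):
        for subset in combinations(TFS, k):
            genes = common_genes(subset)
            # Closing the gene set gives the maximal TF set for those genes.
            found.add((genes, common_tfs(genes)))
    return found


if __name__ == "__main__":
    for genes, tfs in sorted(enumerate_biclusters(), key=lambda c: len(c[0])):
        print(sorted(genes), sorted(tfs))
```

Ordering these bi-clusters by inclusion of their gene sets gives the lattice; each parent/child pair is one candidate comparison of the kind described above.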

Page 105: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Problem Formulation

• Challenges
– Enumerating bi-clusters and forming the lattice is known to be an NP-Complete problem
– Discover distinguishing sets during the mining process, as opposed to in a post-processing step
– How to quantify distinction

Page 106: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Problem Formulation

• Model lattice as weighted directed graph

• Weights represent degree of distinction

• Each edge represents a distinguishing set

• Grow maximum cost spanning tree

Page 107: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Problem Formulation

• Quantifying distinction:
– View bi-clusters as maximal rectangles of 1s under suitable permutation
– Consider both change in width and height when computing distinction
– Choose a shape metric s (e.g., area, ratio of height to width, etc.)
– Quantify distinction as degree of shape change along a path in the lattice

     A  B  C  D
1    0  1  0  1
2    1  1  0  0
3    0  1  1  1
4    0  0  1  1

Page 108: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Problem Model

• Compute partial derivatives as forward differences

{2}{A,B}

{1,2,3}{B}
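The exact distinction formula is not recoverable from the transcript, so the following is only one plausible reading: take area as the shape metric s and measure how s changes between the two bi-clusters shown, {2}{A,B} and {1,2,3}{B}, once per dimension and once overall. The function names and the choice of area are assumptions made for this sketch.

```python
def shape(bicluster):
    """Height (number of rows) and width (number of columns) of a bi-cluster."""
    rows, cols = bicluster
    return len(rows), len(cols)


def area(height, width):
    """One possible shape metric s: area of the rectangle of 1s."""
    return height * width


def forward_differences(lower, upper, s=area):
    """Change in s from `lower` to `upper`, split by dimension.

    d_height: change in s when only the height moves to the upper value.
    d_width:  change in s when only the width moves to the upper value.
    total:    s(upper) - s(lower).
    """
    h0, w0 = shape(lower)
    h1, w1 = shape(upper)
    base = s(h0, w0)
    return {
        "d_height": s(h1, w0) - base,
        "d_width": s(h0, w1) - base,
        "total": s(h1, w1) - base,
    }


lower = ({2}, {"A", "B"})      # 1 row x 2 columns
upper = ({1, 2, 3}, {"B"})     # 3 rows x 1 column
result = forward_differences(lower, upper)

if __name__ == "__main__":
    print(result)
```

Swapping `area` for another metric, such as the height-to-width ratio, only requires passing a different two-argument function as `s`.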

Page 109: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Our Algorithm (MIDS)

• Adapt Prim’s algorithm
– Lattice is not readily available
– Dynamically compute cut set by enumerating upper neighbors of bi-clusters
1. Choose starting bi-cluster c
2. Compute cut set by generating upper neighbors of c
3. Compute weight of edges between c and upper neighbors
4. Greedily choose maximum cost edge and associated concept d
5. Set c=d, repeat steps 2-5 until all reachable bi-clusters are visited
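The steps above can be sketched as a Prim-style growth over a lattice that is generated lazily. This is a rough illustration under stated assumptions, not the MIDS implementation: upper neighbors come from a naive closure check rather than Lindig’s algorithm, the edge weight is assumed to be the absolute change in area, and the data is the 4 x 4 example matrix (rows 1-4, columns A-D) as read from the earlier slide.

```python
import heapq

COLS = "ABCD"
ROWS = {1: "BD", 2: "AB", 3: "BCD", 4: "CD"}  # row -> columns holding a 1


def intent(rows):
    """Columns shared by all rows (every column if `rows` is empty)."""
    return frozenset(c for c in COLS if all(c in ROWS[r] for r in rows))


def extent(cols):
    """Rows that contain every column in `cols`."""
    return frozenset(r for r in ROWS if all(c in ROWS[r] for c in cols))


def upper_neighbors(concept):
    """Naive upper-neighbor generation (stand-in for Lindig's algorithm)."""
    rows, _ = concept
    candidates = set()
    for r in ROWS:
        if r not in rows:
            cols = intent(rows | {r})
            candidates.add((extent(cols), cols))
    # Only candidates with a minimal row set are direct neighbors.
    return [c for c in candidates if not any(o[0] < c[0] for o in candidates)]


def weight(lower, upper):
    """Assumed distinction measure: absolute change in rectangle area."""
    return abs(len(upper[0]) * len(upper[1]) - len(lower[0]) * len(lower[1]))


def max_cost_tree(start):
    """Grow a tree from `start`, always taking the heaviest frontier edge."""
    visited, tree, heap, counter = {start}, [], [], 0

    def push_edges(c):
        nonlocal counter
        for d in upper_neighbors(c):
            # Negated weight turns heapq's min-heap into a max-heap;
            # the counter breaks ties without comparing concepts.
            heapq.heappush(heap, (-weight(c, d), counter, c, d))
            counter += 1

    push_edges(start)
    while heap:
        neg_w, _, c, d = heapq.heappop(heap)
        if d in visited:
            continue
        visited.add(d)
        tree.append((c, d, -neg_w))
        push_edges(d)
    return tree


BOTTOM = (extent(COLS), intent(extent(COLS)))  # (no rows, all columns)

if __name__ == "__main__":
    for c, d, w in max_cost_tree(BOTTOM):
        print(sorted(c[0]), "->", sorted(d[0]), "weight", w)
```

Keeping a frontier heap, rather than only following the current bi-cluster, is a design choice made here so that every reachable bi-cluster is eventually visited even when a chosen path reaches the top of the lattice early.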

Page 110: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Algorithm Details

• Step 2: Generating upper neighbors in lattice
– How?
  • Lindig’s algorithm
– Cost?
  • Improved Lindig’s algorithm, practical running time
  • Theoretical complexity remains the same

• Overall complexity of MIDS
– E: total number of edges
– N: number of bi-clusters
– O: number of rows
– A: number of columns

Page 111: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Experimental Results

• Experimented with synthetic datasets to find most distinguished incremental bi-clusters

• Preliminary experiments conducted with clearly distinguishable incremental bi-clusters and random noise

• Next planted several large incremental bi-clusters that differed only slightly as a result of noise

Page 112: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Experimental Results

• Region 1 is region of interest, clearly distinct

• Noise added to region 1, while regions 2 and 3 contain minimal distinction

Page 113: Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

Experimental Results