Knowledge Enhanced Clustering


Page 1: Knowledge Enhanced Clustering

Knowledge Enhanced Clustering

Page 2: Knowledge Enhanced Clustering

Clustering “Find the Groups of Similar Things”

[Figure: scatter plot of points; axes Height and Weight]

Find the set partition (or set of hyperplanes) that minimizes some objective function

Page 3: Knowledge Enhanced Clustering

Clustering “Find the Groups of Similar Things”

Find the set partition (or set of hyperplanes) that minimizes some objective function:

$\operatorname{argmin}_C \sum_i D(C_{f(s_i)} - s_i)$

[Figure: scatter plot of points; axes Height and Weight]

Page 4: Knowledge Enhanced Clustering

K-Means Example (k=2): Initialize Centroids

[Figure: two randomly placed centroids (×) among the points; axes Height and Weight]

Page 5: Knowledge Enhanced Clustering

K-Means Example: Assign Points to Clusters

[Figure: each point assigned to its nearest centroid (×); axes Height and Weight]

Page 6: Knowledge Enhanced Clustering

K-Means Example: Re-estimate Centroids

[Figure: centroids (×) moved to the means of their assigned points; axes Height and Weight]

Page 7: Knowledge Enhanced Clustering

K-Means Example: Re-assign Points to Clusters

[Figure: points re-assigned to the updated centroids (×); axes Height and Weight]

Page 8: Knowledge Enhanced Clustering

K-Means Example: Re-estimate Centroids

[Figure: centroids (×) re-estimated again; axes Height and Weight]

Page 9: Knowledge Enhanced Clustering

K-Means Example: Re-assign Points to Clusters

[Figure: points re-assigned once more; axes Height and Weight]

Page 10: Knowledge Enhanced Clustering

K-Means Example: Converge

[Figure: assignments no longer change; axes Height and Weight]

Page 11: Knowledge Enhanced Clustering

K-Means Example: Convergence

[Figure: final centroids (×) and converged cluster assignments; axes Height and Weight]

A greedy algorithm: produces useful results, with linear time per iteration.
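A minimal sketch of the loop just illustrated (standard Lloyd's k-means: assign, re-estimate, repeat until assignments stop changing); the function name and parameters are illustrative, not from the slides.

```python
import numpy as np

def kmeans(points, k, rng=None, max_iter=100):
    """Plain k-means: assign points to the nearest centroid,
    re-estimate centroids, repeat until assignments stop changing."""
    rng = rng or np.random.default_rng(0)
    # Initialize: pick k distinct points as the starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # converged: no assignment changed
        labels = new_labels
        # Re-estimation step: each centroid becomes the mean of its points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels
```

For the slides' running example, something like `kmeans(np.column_stack([weights, heights]), k=2)` would reproduce the two-cluster run.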

Page 12: Knowledge Enhanced Clustering

Where Data Driven Clustering Fails: a) Pandemic Preparation

[Davidson and Ravi 2007a]

• In collaboration with Los Alamos / Virginia Tech Bioinformatics Institute
– VBI micro-simulator based on census data, road network, buildings, etc.
– Ideal to model pandemics due to bird flu, bio-terrorism.
– Problem: find spatial clusters of households that have a high propensity to be infected or not infected.
– Currently at city level (a million households), but soon the eastern seaboard, entire country.

Page 13: Knowledge Enhanced Clustering

Portland Pandemic Simulation

Page 14: Knowledge Enhanced Clustering

Portland Pandemic Simulation

Typical results are shown on the left.

Not particularly useful for containment policy design because:
a) Some regions are too large
b) Uneven distribution of key facilities such as hospitals/schools

Page 15: Knowledge Enhanced Clustering

Another Problem: b) Automatic Lane Finding from GPS Traces

[Wagstaff, Langley et al. ’01]

• Lane-level navigation (e.g., advance notification for taking exits)
• Lane-keeping suggestions (e.g., lane departure warning)

Page 16: Knowledge Enhanced Clustering

Mining GPS Traces

• Instances are the (x, y) locations on the road.

Page 17: Knowledge Enhanced Clustering

Mining GPS Traces

• Instances are the (x, y) locations on the road.

This is a very good local minimum of the algorithm’s objective function

Page 18: Knowledge Enhanced Clustering

Another Example: c) CMU Faces Database

[Davidson, Wagstaff, Basu, ECML 06]

Useful for biometric applications such as face recognition.

Page 19: Knowledge Enhanced Clustering

Typical But Not Useful Clusters For Our Purpose

Page 20: Knowledge Enhanced Clustering

Limitations of Data Driven Clustering at a High Level

• Objective functions were reasonably minimized.

• Hoping patterns are “novel and actionable” is a long shot.

• Problem: find a general-purpose and principled way to encode knowledge into the many data mining algorithms.
– A Bayesian approach?

Page 21: Knowledge Enhanced Clustering

Outline

• Knowledge enhanced mining with constraints
– Motivation
– How to add in domain expertise
– Complexity results
– Sufficient conditions and algorithms

• Other work potentially applicable to sky survey data
– Speeding up algorithms by scaling down data
– Mining poor quality data
– Mining with biased data sets

Page 22: Knowledge Enhanced Clustering

What Type of Knowledge Do We Want To Represent?

Explicit: points more than 3 metres apart along the y-axis must be in different clusters.

[Figure: two points labeled “> 3 metres” apart, with must-link and cannot-link edges illustrated]

Implicit: the people in two images have similar or dissimilar features.

Page 23: Knowledge Enhanced Clustering

Representing Knowledge With Constraints

• Clustering is finding a set partition

• Must-Link (ML) Constraints
– Explicit: points s_i and s_j (i ≠ j) must be assigned to the same cluster. An equivalence relation.
– Implicit: s_i and s_j are similar.

• Cannot-Link (CL) Constraints
– Explicit: points s_i and s_j (i ≠ j) cannot be assigned to the same cluster. Symmetric.
– Implicit: s_i and s_j are different.

• Any partition can be expressed as a collection of ML and CL constraints.
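Since a partition is just a cluster label per point, the two constraint types reduce to equality tests on labels. A minimal sketch (names illustrative, not from the slides):

```python
def satisfies(labels, ml, cl):
    """Check whether a clustering (list: point index -> cluster label)
    satisfies all must-link and cannot-link constraints."""
    # Must-link: constrained points must share a cluster label.
    ml_ok = all(labels[i] == labels[j] for i, j in ml)
    # Cannot-link: constrained points must have different labels.
    cl_ok = all(labels[i] != labels[j] for i, j in cl)
    return ml_ok and cl_ok

# Example: 4 points, 2 clusters.
labels = [0, 0, 1, 1]
print(satisfies(labels, ml=[(0, 1)], cl=[(1, 2)]))  # True
```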

Page 24: Knowledge Enhanced Clustering

Unconstrained Clustering Example (Number of Clusters=2)

[Figure: unclustered points; axes Height and Weight]

Page 25: Knowledge Enhanced Clustering

Unconstrained Clustering Example (Number of Clusters=2)

[Figure: two initial centroids (×) among the points; axes Height and Weight]

Page 26: Knowledge Enhanced Clustering

Unconstrained Clustering Example (Number of Clusters=2)

[Figure: converged unconstrained clustering with centroids (× ×); axes Height and Weight]

Page 27: Knowledge Enhanced Clustering

Constrained Clustering Example (Number of Clusters=2)

[Figure: the same data with one cannot-link and one must-link constraint added, yielding a different clustering (× ×); axes Height and Weight]

Page 28: Knowledge Enhanced Clustering

Cluster Level Constraints

• Useful decision regions have:

• Cluster diameters at most γ
– Conjunction of cannot-links between points whose distance is greater than γ

• Clusters must be at least δ apart
– Conjunction of must-links between all points whose distance is less than δ

Don't need all constraints: Davidson, Wagstaff, Basu 2006a discusses a useful subset.
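A sketch of the reduction the slide describes, generating the CL/ML conjunctions from the two cluster-level bounds (gamma and delta stand in for the symbols lost in extraction; `dist` is any metric supplied by the caller):

```python
from itertools import combinations

def diameter_to_cl(points, gamma, dist):
    """Diameter at most gamma: cannot-link every pair farther apart than gamma."""
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if dist(points[i], points[j]) > gamma]

def separation_to_ml(points, delta, dist):
    """Clusters at least delta apart: must-link every pair closer than delta."""
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if dist(points[i], points[j]) < delta]
```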

Page 29: Knowledge Enhanced Clustering

Constraint Language To Express Knowledge

Pandemic Results Example

FeaturesApart(Elevation=High, Elevation=Low)

NotMoreThanCTogether(2, School_1, School_2, …, School_n)
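The slides do not define the language's semantics formally; as one plausible reading, NotMoreThanCTogether(c, …) could be checked against a clustering like this (hypothetical helper):

```python
from collections import Counter

def not_more_than_c_together(labels, c, point_ids):
    """NotMoreThanCTogether(c, points...): no single cluster may
    contain more than c of the listed points."""
    counts = Counter(labels[p] for p in point_ids)
    return all(n <= c for n in counts.values())
```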

Page 30: Knowledge Enhanced Clustering

Can Also Use Constraints to Critique (Give Feedback)

• Feedback incrementally specifying constraints
– Positive feedback
– Negative feedback
– ML(x, y), Not(CL(x, y))

• Do not re-run the mining algorithm again
• Efficiently modify the existing clustering to satisfy the feedback

(Joint work with Martin Ester, S.S. Ravi, and Mohammed Zaki)

Page 31: Knowledge Enhanced Clustering

Outline

• Knowledge enhanced mining with constraints
– Motivation
– How to add in domain expertise
– Complexity results
– Sufficient conditions and algorithms

• Other work potentially applicable to sky survey data
– Speeding up algorithms by scaling down data
– Mining poor quality data
– Mining with biased data sets

Page 32: Knowledge Enhanced Clustering

Complexity Results: Can We Design Efficient Algorithms

• Unconstrained problem version:

$\operatorname{argmin}_C \sum_i D(C_{f(s_i)} - s_i)$

where f(s_i) is the cluster identity function.

Page 33: Knowledge Enhanced Clustering

Complexity Results: Can We Design Efficient Algorithms

• Constrained problem version:

$\operatorname{argmin}_C \sum_i D(C_{f(s_i)} - s_i)$
s.t. $\forall (i,j) \in ML: f(s_i) = f(s_j)$ and $\forall (i,j) \in CL: f(s_i) \neq f(s_j)$

– Feasibility sub-problem
– i.e., no solution for k=2: CL(x,y), CL(x,z), CL(y,z)
– Important: relates to generating a feasible clustering

[Figure: triangle of cannot-link edges among points x, y, and z]

Page 34: Knowledge Enhanced Clustering

Clustering Under Cannot Link Constraints is Graph Coloring

Instances a through z. Constraints: ML(g,h); CL(a,c), CL(d,e), CL(f,g), CL(c,g), CL(c,f)

[Figure: constraint graph with nodes a, c, d, e, f and the merged node g,h; cannot-link constraints become graph edges]

Graph k-coloring problem
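A small sketch of the reduction: take the transitive closure of the must-links (merging points into super-nodes), turn cannot-links into edges, and try to k-color the resulting graph by backtracking. Illustrative only, not the paper's algorithm:

```python
def feasible(n, ml, cl, k):
    """Can n points be split into k clusters satisfying ML/CL constraints?"""
    # Union-find: transitive closure of the must-links.
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for i, j in ml:
        parent[find(i)] = find(j)
    # Cannot-link edges between the merged components.
    edges = set()
    for i, j in cl:
        a, b = find(i), find(j)
        if a == b:
            return False  # ML and CL directly conflict
        edges.add((a, b))
    nodes = sorted({find(i) for i in range(n)})
    adj = {v: set() for v in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    # Backtracking k-coloring (exponential in the worst case, as the
    # NP-completeness result predicts).
    color = {}
    def solve(idx):
        if idx == len(nodes):
            return True
        v = nodes[idx]
        for c in range(k):
            if all(color.get(u) != c for u in adj[v]):
                color[v] = c
                if solve(idx + 1):
                    return True
                del color[v]
        return False
    return solve(0)
```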

Page 35: Knowledge Enhanced Clustering

Sample of Feasibility Problem Complexity Results: Not So Bad

[Bounded k (non-hierarchical clustering): Davidson, Ravi, Journal of DMKD, in press] [Unbounded k: hierarchical clustering]

Constraint Type        | Complexity  | Corresponding Problem
Conjunction of ML      | P           | Transitive closure
Conjunction of CL      | NP-Complete | Graph coloring
ML in DNF              | P           | Just satisfy first disjunct
CL in CNF/DNF          | NP-Complete | Graph coloring
ML in CNF              | NP-Complete | Minimum vertex cover
ML and CL choice sets  | P           |
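Continuing the `feasible` sketch above, the easy and hard rows of the table can be exercised directly:

```python
# Conjunction of ML only: always feasible (just a transitive closure, P).
print(feasible(4, ml=[(0, 1), (1, 2)], cl=[], k=2))          # True

# Conjunction of CL: a cannot-link triangle is 3-colorable, not 2-colorable.
print(feasible(3, ml=[], cl=[(0, 1), (0, 2), (1, 2)], k=2))  # False
print(feasible(3, ml=[], cl=[(0, 1), (0, 2), (1, 2)], k=3))  # True
```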

Page 36: Knowledge Enhanced Clustering

Other Implications of Results For Algorithm Design: Getting Worse

[Davidson and Ravi 2007b]

• Algorithm design idea: find the best clustering that satisfies the most constraints in C.

• Can't be done efficiently:
– Repair to satisfy C.
– Minimally prune C to satisfy.

Page 37: Knowledge Enhanced Clustering

Incrementally Adding In Constraints: Quite Bad

[Davidson, Ester and Ravi 2007c]

• User-centric mining

• Given a clustering that satisfies a set of constraints C

• Minimally modifying to satisfy C and just one more ML or CL constraint is intractable.

Page 38: Knowledge Enhanced Clustering

Outline

• Knowledge enhanced mining with constraints
– Motivation
– How to add in domain expertise
– Complexity results
– Sufficient conditions and algorithms

• Other work potentially applicable to sky survey data
– Speeding up algorithms by scaling down data
– Mining poor quality data
– Mining with biased data sets

Page 39: Knowledge Enhanced Clustering

Interesting Phenomena – CL Only [Davidson et al., DMKD Journal; AAAI06]

Phase transitions? [Wagstaff, Cardie 2002]

[Figure: results on the Cancer dataset; no feasibility issues]

Page 40: Knowledge Enhanced Clustering

Satisfying All Constraints (COP-K-Means) [Wagstaff, Thesis 2000]

The algorithm aims to minimize VQE (vector quantization error) while satisfying all constraints.

1. Calculate the transitive closure over ML points.
2. Replace each connected component with a single weighted point.
3. Randomly generate cluster centroids.
4. Begin: nearest feasible centroid assignment; calculate centroids.
5. Loop until the change in VQE is small.
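A compact sketch of steps 4–5: like k-means, but each point goes to the nearest centroid that violates no constraint given the assignments made so far, and the run fails if no feasible centroid exists. It assumes steps 1–2 (merging must-linked points) are already done; names are illustrative:

```python
import numpy as np

def cop_kmeans_assign(points, centroids, cl, max_iter=50):
    """COP-k-means style loop: assign each point to the nearest
    *feasible* centroid, then re-estimate centroids. Returns the
    labels, or None if some point has no feasible centroid."""
    centroids = centroids.astype(float)
    neighbors = {}  # cannot-link adjacency
    for i, j in cl:
        neighbors.setdefault(i, set()).add(j)
        neighbors.setdefault(j, set()).add(i)
    labels = None
    for _ in range(max_iter):
        new = [-1] * len(points)
        for i, p in enumerate(points):  # note: order-dependent
            order = np.argsort(np.linalg.norm(centroids - p, axis=1))
            # Nearest centroid not already holding a cannot-linked point.
            feas = [c for c in order
                    if all(new[j] != c for j in neighbors.get(i, ()))]
            if not feas:
                return None  # dead end: no feasible centroid
            new[i] = int(feas[0])
        if new == labels:
            return labels  # converged
        labels = new
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = np.mean(members, axis=0)
    return labels
```

The order-dependence flagged in the comment is exactly the failure mode the next slides walk through.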

Page 41: Knowledge Enhanced Clustering

COP-K-Means: Nearest Feasible Centroid Assignment

[Figure: assignment step with one must-link and one cannot-link shown; points go to the nearest centroid (×) that keeps the constraints satisfied; axes Height and Weight]

4. Nearest feasible centroid assignment

Page 42: Knowledge Enhanced Clustering

Why The Algorithm Fails

• Explanation: the order instances are processed in

The instances can be clustered for k=2, but consider the processing order: abc (1), hi (1), de (2), jk (?)

[Figure: the example point set with centroids marked ×]

Page 43: Knowledge Enhanced Clustering

Why The Algorithm Fails

• Explanation: Instance Ordering

[Figure: the same ordering example, one assignment step further]

Page 44: Knowledge Enhanced Clustering

Why The Algorithm Fails

• Explanation: Instance Ordering

[Figure: the same ordering example, another step further]

Page 45: Knowledge Enhanced Clustering

Why The Algorithm Fails

• Explanation: Instance Ordering

• Question: is there a sufficient condition under which an algorithm converges for any ordering of the points?

Can be clustered for k=2. Instance ordering: abc (1), hi (1), de (2), jk (?)

[Figure: the same ordering example]

Page 46: Knowledge Enhanced Clustering

Why The Algorithm Fails

• Explanation: Instance Ordering

• Question: is there a sufficient condition under which an algorithm converges for any ordering of the points?

• Yes, via Brooks's Theorem: if k ≥ Δ + 1. Restrict the constraint language so that the maximum number of CL constraints on any point (Δ) is less than k (the number of clusters).

Can be clustered for k=2. Instance ordering: abc (1), hi (1), de (2), jk (?)

[Figure: the same ordering example]
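A sketch of checking the sufficient condition just stated: if every point appears in fewer than k cannot-link constraints, a feasible cluster always remains for the next point, whatever the processing order (illustrative check, not the paper's formulation):

```python
from collections import Counter

def order_independent_safe(cl, k):
    """Sufficient condition: the max CL-degree of any point is < k,
    so some feasible cluster always remains, in any processing order."""
    degree = Counter()
    for i, j in cl:
        degree[i] += 1
        degree[j] += 1
    return max(degree.values(), default=0) < k
```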

Page 47: Knowledge Enhanced Clustering

We Can Also Reorder Points To Make Some Problem Instances “Easy”

[Davidson et al., AAAI 2006]

• [Irani 1984]: q-inductiveness of a graph
– Theorem: if G(V,E) is q-inductive, G can be clustered with q + 1 clusters.

• Any algorithm that processes the points in the reverse of the inductive order will always find a feasible solution.

Brooks's Thm.: k = 4. 1-inductive ordering: {fg, l, abc, hi, jk, de}

[Figure: the constraint graph laid out in the order fg, l, abc, hi, jk, de, with centroids marked ×]
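A sketch of computing such an ordering: repeatedly remove a minimum-degree node from the cannot-link graph. Each node then has at most q neighbors later in the returned order (q is the graph's degeneracy), so assigning points in the reverse of this order always succeeds with q + 1 clusters. Illustrative, not Irani's original formulation:

```python
def inductive_ordering(adj):
    """Compute a q-inductive ordering of a graph (dict: node -> set of
    neighbors) by repeatedly removing a minimum-degree node. Returns
    the order and q; assign points in the reverse of this order."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    order, q = [], 0
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))  # min remaining degree
        q = max(q, len(adj[v]))
        order.append(v)
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
    return order, q
```

For the slide's example, the nodes would be the ML-merged groups (fg, l, abc, hi, jk, de) with the cannot-links as edges.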

Page 48: Knowledge Enhanced Clustering

Assignment #2

CSI535 – Introduction to A.I., Assignment #2: Constraint Representation

Due: Question 1 due Sunday, May 6, NOON; all questions due Friday 05/16/07, NOON.
Worth: 20% of final grade.
Late Policy: you lose one full grade for each week (including partial weeks) you are late.
Read the instructions carefully; ask questions if you have any doubts.

Adding constraints to pattern recognition algorithms is a growing area. One popular pattern recognition problem is identifying good groups (clusters) of points. Clustering in this context is essentially enforcing a k-block set partition on the points. Consider clustering the points below into two groups/clusters. There are natural horizontal and vertical groupings.

[Figure: points with natural horizontal and vertical groupings; axes Height and Weight]

Page 49: Knowledge Enhanced Clustering

Assignment #2

A recent addition to the field is adding constraints to express background or domain knowledge. The two most popular constraints are ML (must-link) and CL (cannot-link). For example, the following constraints rule out the horizontal grouping.

[Figure: three cannot-link and three must-link constraints drawn on the points, ruling out the horizontal grouping; axes Height and Weight]

Page 50: Knowledge Enhanced Clustering

Assignment #2

Question 1 a) Completely describe a logic that can represent must-link and cannot-link constraints so as to enforce desirable structure on a clustering (set partition). Begin by choosing whether you will be using propositional or first-order logic. Then describe the syntax, the semantics, and what a model corresponds to in the “real world”. Show how this language can be used to describe the following types of knowledge.

Question 1 b)
– Diameter: the minimum/maximum diameter of any cluster is a given value.
– ClusterSeparation: the minimum cluster separation is a given value.
– AllInstancesApart: of these m instances (x1 … xn), each should be in a separate cluster.
– NearestNeighbor: each point in a cluster (containing at least 2 points) has a neighbor in that cluster within a given distance of it.
– AnyCTogether: of a list of points (x1 … xn), c must be together (in the same cluster), n > c.
– AtLeastC: of a list of points (x1 … xn), at least c must be apart (not in the same cluster), n > c.
– NotMoreThanCTogether: of a list of points (x1 … xn), no more than c may be in the same cluster, n > c.

Question 2) Suppose you are given a set partition Π = {π1 … πk} on the points; describe how you would use your logic to verify that all constraints are satisfied.

Question 3) Suppose you are given a set of points X; describe how you would use your logic to create a set partition of k blocks Π = {π1 … πk} where all constraints are satisfied.