Unsupervised Group Discovery in Relational Datasets: A nonparametric Bayesian Approach P.S. Koutsourelakis School of Civil and Environmental Engineering Cornell University Artificial Intelligence Seminar, 10/12/07 Joint work with T. Eliassi-Rad, LLNL


Unsupervised Group Discovery in Relational Datasets:

A nonparametric Bayesian Approach

P.S. Koutsourelakis

School of Civil and Environmental Engineering

Cornell University

Artificial Intelligence Seminar, 10/12/07

Joint work with T. Eliassi-Rad, LLNL


P.S. Koutsourelakis, [email protected]

Problem Setting

[Figure: four objects A, B, C, D, each with attributes (age, income, location), connected by links (friend, co-worker, phone call). Label: Traditional Clustering.]

Can we improve clustering by using relational data?

What if only relational data were available?

Can we make predictions about missing links or attributes?


Problem Setting

• A collection of objects belonging to various types/domains (e.g. people, papers, locations, devices, movies, etc.)

• Each object might have (observable) attributes

• Links/relations between:

– Two or more objects

– Objects can be of the same or different types

– Binary (absence/presence), integer or real-valued

• Each link might have (observable) attributes

Goal:

• Find groups of objects of each type, or

• Find common identities between objects of each type, or

• Organize objects into clusters that relate to each other in predictable ways


Problem Setting

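[Figure: graph over four objects A, B, C, D.]

Given an adjacency matrix where $R_{i,j} = 0$ or $1$ (observables), find cluster assignment $I_i$ (hidden/latent).

     A   B   C   D
A    •   0   0   0
B    0   •   0   0
C    1   0   •   0
D    0   1   1   •

Probabilistic - Bayesian Formulation

$$\underbrace{p(I \mid R)}_{\text{posterior}} \;\propto\; \underbrace{p(R \mid I)}_{\text{likelihood}} \; \underbrace{p(I)}_{\text{prior}}$$

where $R$ is the data.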


Problem Setting

Likelihood:

$$p(R \mid I) = \prod_{i,j} p(R_{i,j} \mid I_i, I_j)$$

The relational behavior of the objects is completely determined by their cluster assignments $I_i$.

For example:

$$R_{i,j} \mid I_i, I_j, \eta \sim \text{Bernoulli}\big(\eta(I_i, I_j)\big)$$

$$\eta(I_i, I_j) \sim \text{Beta}(\beta_1, \beta_2)$$

where $\eta$ is the matrix specifying link probability between any two groups.

Kemp, C., Tenenbaum, J. B., Griffiths, T. L., Yamada, T. & Ueda, N. Learning systems of concepts with an infinite relational model. AAAI 2006.
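As a rough illustration of the Bernoulli likelihood above, the following sketch evaluates $\log p(R \mid I, \eta)$ for the four-object adjacency matrix shown earlier. The assignment `I` and the matrix `eta` are hypothetical values chosen only for the example, and the function name is not from the talk.

```python
import numpy as np

def bernoulli_log_likelihood(R, I, eta):
    """log p(R | I, eta): sum over ordered pairs i != j of
    log Bernoulli(R[i, j]; eta[I[i], I[j]])."""
    R = np.asarray(R)
    I = np.asarray(I)
    P = eta[np.ix_(I, I)]                    # P[i, j] = eta[I[i], I[j]]
    off_diag = ~np.eye(len(I), dtype=bool)   # self-links are not modelled
    ll = R * np.log(P) + (1 - R) * np.log1p(-P)
    return float(ll[off_diag].sum())

# Adjacency matrix for objects A, B, C, D (diagonal entries are ignored).
R = np.array([[0, 0, 0, 0],
              [0, 0, 0, 0],
              [1, 0, 0, 0],
              [0, 1, 1, 0]])
I = np.array([0, 0, 1, 1])        # hypothetical assignment: {A, B} and {C, D}
eta = np.array([[0.1, 0.1],       # hypothetical link probabilities, eta[a, b]
                [0.5, 0.5]])
log_lik = bernoulli_log_likelihood(R, I, eta)
```

Comparing this quantity across candidate assignments `I` shows which groupings explain the observed links better for a fixed `eta`.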


Augmented Problem Setting

If objects have attributes (i.e., $x_i$, which are also observed), then we can augment the likelihood:

$$p(R, x \mid I) = \prod_{i,j} p(R_{i,j} \mid I_i, I_j) \prod_{i} p(x_i \mid I_i)$$

If links $R_{i,j}$ are real-valued (e.g. duration of phone call, number of bytes, etc.):

$$R_{i,j} \mid I_i, I_j \sim \text{Exponential}\big(\lambda(I_i, I_j)\big)$$

or

$$R_{i,j} \mid I_i, I_j \sim \text{Gamma}\big(a(I_i, I_j),\, b(I_i, I_j)\big)$$

The parameters are functions of group assignments.

For binary links, as before:

$$R_{i,j} \mid I_i, I_j, \eta \sim \text{Bernoulli}\big(\eta(I_i, I_j)\big)$$

$$\eta(I_i, I_j) \sim \text{Beta}(\beta_1, \beta_2)$$


Problem Setting

We need a prior on group assignments p(I).

• What is an appropriate prior p(K) on the number of clusters K?

• Groups are unlikely to be related as above.

• The distribution on Ii should be exchangeable.

• That is, the order in which nodes are assigned can be permuted without changing the probability of the resulting partition.

Likelihood Function

$$p(R \mid I) = \prod_{i,j} p(R_{i,j} \mid I_i, I_j)$$

$$R_{i,j} \mid I_i, I_j, \eta \sim \text{Bernoulli}\big(\eta(I_i, I_j)\big)$$

$$\eta(I_i, I_j) \sim \text{Beta}(\beta_1, \beta_2)$$


Nonparametric Bayesian Methods*

• Bayesian methods are most powerful when your prior adequately captures your beliefs.

• Inflexible models (e.g. with a fixed number of groups) might yield unreasonable inferences.

• Nonparametrics provide a way of getting very flexible models.

• Nonparametric models can automatically infer an adequate model size/complexity from the data, without needing to explicitly do Bayesian model comparison.

• Many can be derived by starting with a finite parametric model and taking the limit as the number of parameters goes to infinity.

* Nonparametric doesn’t mean there are no parameters, but that “the number of parameters grows with the data” (e.g. as in Parzen window density estimation).


Chinese Restaurant Process (CRP)

[Figure: customers arriving at a restaurant whose MENU has potentially infinite dishes.]


Chinese Restaurant Process (CRP)

$$p(I_i = j \mid I_{-i}) = \begin{cases} \dfrac{n_j}{n - 1 + \gamma} & \text{if } n_j > 0 \\[2ex] \dfrac{\gamma}{n - 1 + \gamma} & \text{if } j \text{ is a new group} \end{cases}$$

where $n_j$ is the number of people already eating dish $j$.

Properties:

• CRP is exchangeable (i.e. the order in which customers entered doesn’t matter)

• The number of groups grows as O(log n), where n is the number of nodes

• Inference with Gibbs sampling can be based on the conditionals above

• Larger γ favors more clusters
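The CRP conditionals can be sketched as a sequential sampler. This is a minimal illustration, not the talk's implementation: customer number i+1 finds i people already seated, so the normaliser is i + γ, which matches the n − 1 + γ form when the customer is treated as the last to arrive. The function name `sample_crp` is made up for this example.

```python
import numpy as np

def sample_crp(n, gamma, rng):
    """Seat n customers one at a time; return their group labels."""
    assignments = []
    counts = []                      # counts[j] = customers already in group j
    for _ in range(n):
        # An existing group j has weight n_j; a new group has weight gamma.
        weights = np.array(counts + [gamma], dtype=float)
        j = int(rng.choice(len(weights), p=weights / weights.sum()))
        if j == len(counts):         # the last slot means "open a new group"
            counts.append(0)
        counts[j] += 1
        assignments.append(j)
    return assignments

rng = np.random.default_rng(0)
labels = sample_crp(100, gamma=1.0, rng=rng)
n_groups = max(labels) + 1
```

Re-running with a larger `gamma` tends to produce more groups, in line with the last property listed above.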


Infinite Relational Model (IRM)

“Forward” Interpretation (single domain)

1) Sample group assignments $I_i$ from CRP(γ), resulting in K clusters

2) Sample iid η(a, b) for all a, b = 1, 2, ..., K from Beta(β1, β2)

3) Sample iid each $R_{i,j}$ from Bernoulli(η($I_i$, $I_j$))

From Kemp, C., Tenenbaum, J. B., Griffiths, T. L., Yamada, T. & Ueda, N. Learning systems of concepts with an infinite relational model. AAAI 2006.

$$I \mid \gamma \sim \text{CRP}(\gamma)$$

$$\eta(I_i, I_j) \mid \beta_1, \beta_2 \sim \text{Beta}(\beta_1, \beta_2)$$

$$R_{i,j} \mid I_i, I_j, \eta \sim \text{Bernoulli}\big(\eta(I_i, I_j)\big)$$
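The three steps of the forward interpretation can be written out as a short sampler. This is a minimal single-domain sketch; the function name `sample_irm` and the parameter values are illustrative assumptions, and self-links on the diagonal are generated like any other entry for simplicity.

```python
import numpy as np

def sample_irm(n, gamma, beta1, beta2, rng):
    # 1) Group assignments I_i from CRP(gamma).
    I, counts = [], []
    for _ in range(n):
        w = np.array(counts + [gamma], dtype=float)
        k = int(rng.choice(len(w), p=w / w.sum()))
        if k == len(counts):
            counts.append(0)
        counts[k] += 1
        I.append(k)
    I = np.array(I)
    K = len(counts)
    # 2) eta(a, b) iid from Beta(beta1, beta2) for all pairs of groups.
    eta = rng.beta(beta1, beta2, size=(K, K))
    # 3) R[i, j] ~ Bernoulli(eta[I[i], I[j]]).
    R = (rng.random((n, n)) < eta[np.ix_(I, I)]).astype(int)
    return I, eta, R

I, eta, R = sample_irm(20, gamma=1.0, beta1=1.0, beta2=1.0,
                       rng=np.random.default_rng(1))
```

Sorting the rows and columns of `R` by `I` makes the block structure induced by `eta` visible.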

Application: Object-Feature Dataset

• 2 domains (animals + features)

• Animals form two groups: birds + 4-legged mammals


Application: Object-Feature Dataset

Maximum-Likelihood Configuration

Animal Domain

Group 1: dove, hen, owl, falcon, eagle

Group 2: duck, goose

Group 3: fox, cat

Group 4: horse, zebra

Group 5: dog, wolf, tiger, lion, cow

Feature Domain

Group 1: small, 2-legs, feathers, fly

Group 2: medium, hunt

Group 3: big, hooves, mane, run

Group 4: 4-legs, hair

Group 5: swim


Predicting Missing Links

Can we make predictions about missing links?

% of Missing Links    AUC     Accuracy
10%                   0.96    0.95
25%                   0.96    0.91
50%                   0.91    0.87
65%                   0.82    0.80
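One way such predictions could be formed is sketched below, under assumptions: for a single fixed assignment, the Beta prior on η(a, b) is conjugate to the Bernoulli links, so the posterior mean of η(a, b) is (β1 + observed links) / (β1 + β2 + observed links + observed non-links). The function name and the toy data are hypothetical; a full treatment would average this quantity over posterior samples of the assignments, and nothing here is claimed to reproduce the numbers in the table.

```python
import numpy as np

def predict_links(R, observed, I, beta1=1.0, beta2=1.0):
    """Posterior-mean link probability for every pair, given one assignment I."""
    I = np.asarray(I)
    K = I.max() + 1
    Z = np.eye(K)[I]                         # one-hot assignments, shape (n, K)
    ones = Z.T @ (R * observed) @ Z          # observed links between groups a, b
    zeros = Z.T @ ((1 - R) * observed) @ Z   # observed non-links between groups
    eta_mean = (beta1 + ones) / (beta1 + beta2 + ones + zeros)
    return eta_mean[np.ix_(I, I)]

# Toy data: every member of group 1 links to every member of group 0.
I = np.array([0, 0, 1, 1])
R = np.array([[0, 0, 0, 0],
              [0, 0, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 0, 0]])
observed = ~np.eye(4, dtype=bool)            # the diagonal is never observed
observed[3, 0] = False                       # pretend the link 3 -> 0 is missing
P = predict_links(R, observed, I)
```

Here `P[3, 0]` is the predicted probability of the held-out link, based on the three observed links from group 1 to group 0.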


Infinite Relational Model (IRM)

Advantages:

• It is an unsupervised learner with only two tunable parameters β and γ.

• It can be applied to multiple node types and relations.

• It has all the advantages of a Bayesian formulation (missing data, confidence intervals) and of nonparametric methods (adaptation to data, outlier accommodation).

• It has been successfully used for co-clustering object features, learning ontologies and social networks.

Disadvantages:

• Significant computational effort

• It does not capture “multiple personalities.”


“Multiple Personalities”

• In real data, objects (e.g. people) do not belong exclusively to one group, i.e. their identity is a mixture of basic components.

• These components can be the same for each object type, but the mixing proportions might vary from one object to another.

• IRM assumes that each object participates in all the relations it is involved in with a single identity.

A proper model should account for a different mixture for each object over all the possible identity components (which are common for the whole domain).

• This way we learn not only all the groups of the population but also all the existing mixtures of them.

• This can be achieved by introducing a Bayesian hierarchy

groups ≡ identities


Mixed-Membership Model (MMM)

The personality of each object can be made up of several components.

Each object $i$ can assume as many identities $I_{i,m}$ as the number of links it participates in.

The likelihood of a link $R^m_{i,j}$ depends only on the identities $I_{i,m}$, $I_{j,m}$ of the participating objects for that link.

Q: Can we use an independent CRP for each object?

A: No, because the groups for each CRP will not be shared across objects.


Chinese Restaurant Franchise

N restaurants with a common menu

Object 1 = restaurant 1

Object 2 = restaurant 2

…

Object N = restaurant N

Phase 1: Table Assignment

Phase 2: Dish Assignment

Y.W. Teh, M.I. Jordan, M.J. Beal and D.M. Blei. Hierarchical Dirichlet Processes. JASA, 2006.


Chinese Restaurant Franchise

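Phase 1: table assignment $t_{i,m}$ for customer $m$ at restaurant $i$:

$$p(t_{i,m} = t \mid t_{i,1}, \dots, t_{i,m-1}) = \begin{cases} \dfrac{n_{i,t}}{m - 1 + \gamma_i} & \text{if } n_{i,t} > 0 \\[2ex] \dfrac{\gamma_i}{m - 1 + \gamma_i} & \text{if } t \text{ is a new table} \end{cases}$$

where $n_{i,t}$ is the number of customers already sitting at table $t$.

Phase 2: dish assignment $d_{i,t}$ for table $t$ in restaurant $i$:

$$p(d_{i,t} = d \mid d_{-(i,t)}) = \begin{cases} \dfrac{s_d}{M - 1 + \alpha} & \text{if } s_d > 0 \\[2ex] \dfrac{\alpha}{M - 1 + \alpha} & \text{if } d \text{ is a new dish} \end{cases}$$

where $M$ is the number of tables, $s_d$ is the number of tables already eating dish $d$, and $\alpha$ denotes the dish-level concentration parameter.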
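The two phases can be sketched as a sequential sampler in which every restaurant seats its customers at local tables and each new table orders a dish from the menu shared by all restaurants. This is only an illustration under simplifying assumptions: a single `gamma` is used for all restaurants (the slides allow a separate γ_i per object), `alpha` is the name chosen here for the dish-level concentration, and `sample_crf` is a hypothetical function name.

```python
import numpy as np

def sample_crf(customers_per_restaurant, gamma, alpha, rng):
    """Return, per restaurant, the dish index eaten by each customer."""
    dish_tables = []      # dish_tables[d] = tables (over all restaurants) serving dish d
    dishes_per_customer = []
    for m_i in customers_per_restaurant:
        table_counts, table_dish, eaten = [], [], []
        for _ in range(m_i):
            # Phase 1: choose a table inside this restaurant.
            w = np.array(table_counts + [gamma], dtype=float)
            t = int(rng.choice(len(w), p=w / w.sum()))
            if t == len(table_counts):
                # Phase 2: a new table picks its dish from the common menu.
                v = np.array(dish_tables + [alpha], dtype=float)
                d = int(rng.choice(len(v), p=v / v.sum()))
                if d == len(dish_tables):
                    dish_tables.append(0)
                dish_tables[d] += 1
                table_counts.append(0)
                table_dish.append(d)
            table_counts[t] += 1
            eaten.append(table_dish[t])
        dishes_per_customer.append(eaten)
    return dishes_per_customer, dish_tables

dishes, dish_tables = sample_crf([5, 8, 3], gamma=1.0, alpha=1.0,
                                 rng=np.random.default_rng(0))
```

Because dishes are drawn from one shared menu, the same dish index can appear in several restaurants, which is what lets identities be shared across objects.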


Mixed-Membership Model

$$p(R \mid I) = \prod_{i,j,m} p(R^m_{i,j} \mid I_{i,m}, I_{j,m})$$

where $I_{i,m}$ is the dish assignment of node $i$, with

$$I_{i,m} \sim \text{CRF}(\gamma)$$

Properties:

- Has a few more parameters, $\gamma_i$, but also has higher expressivity

- Inference with Gibbs sampling can be based on the conditionals above


Non-Identifiability

two objects: A, B

two links: $R^1_{1,2}$, $R^2_{2,1}$

two groups: 1, 2

A: 100% group 1

B: 50% group 1, 50% group 2

four latent variables: $I_{1,1}, I_{1,2}, I_{2,1}, I_{2,2}$

$\eta$ matrix (probability of a link between any pair of groups):

$$\eta = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$


Different configurations (with 2, 3 or 4 groups) have the same likelihood

Prior determines inference results


Application: Mixed-Membership

• 1 domain – 16 objects

• 4 distinct identities

• fully observed adjacency matrix


Application: Mixed-Membership

[Figures: side-by-side panels for IRM and MMM. Error w.r.t. the actual probability that any pair of objects belongs to the same group.]
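The pairwise quantity compared in these figures can be estimated from posterior samples of the group assignments. The sketch below assumes one label per object per sample, as in the IRM; the array `samples` holds made-up values, and `coassignment_matrix` is a hypothetical helper name. For the MMM, where an object carries one identity per link, a comparable estimate would have to average over those per-link identities.

```python
import numpy as np

def coassignment_matrix(samples):
    """Fraction of samples in which each pair of objects shares a group."""
    samples = np.asarray(samples)                       # shape (S, n)
    same = samples[:, :, None] == samples[:, None, :]   # shape (S, n, n)
    return same.mean(axis=0)

# Three hypothetical posterior samples over three objects.
samples = [[0, 0, 1],
           [0, 1, 1],
           [2, 2, 2]]
C = coassignment_matrix(samples)
```

The estimate depends only on which objects share a label within each sample, so it is unaffected by relabelling of groups between samples.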


Application: Mixed-Membership Model

• 2 domains (animals + features)

• Animals form two groups: birds + 4-legged mammals


Application: Mixed-Membership Model

[Figure: IRM vs. MMM, values from 0 to 1 for dove, hen, duck, goose, owl, falcon, eagle, fox, dog, wolf, cat, tiger, lion, horse, zebra.]

COW: Average posterior pairwise probabilities of belonging to the same group

Zachary’s Karate Club

34 people

A disagreement between the administrator (34) and the instructor (1) led to the split of the club in two (circles and squares).

Used a binary matrix that records the “like” relation.

From M. Girvan and M.E.J. Newman, Proc. Natl. Acad. Sci. USA, 2002.


Learning Hierarchies

Can we meaningfully infer a hierarchy of groups/identities?

[Figure: hierarchy of identities (Identity 1, Identity 2, Identity 3, Identity 4), ordered from most general to most specific.]


Learning Hierarchies

Nonparametric prior on trees

[Figure: tree with levels 0, ..., L-1, L; each box is a different group/identity. Branching at level L-1 is labelled CRP_{L-1}(a_{L-1}) and at level L, CRP_L(a_L).]


Learning Hierarchies

Hierarchical Mixed Membership Model (HMMM)

“Forward” interpretation (for a single domain)

1) Generate an (L+1)-level tree using a nested hierarchical CRP

2) Each object $i$ is associated with an (L+1)-level branch $z_i = (z_i^{(0)}, \dots, z_i^{(L)})$

3) Let $\theta_{i,l}$ be the probability that object $i$ belongs to level $l$ of its branch (drawn from a Dirichlet prior)

4) For each link $R_{i,j}$ between objects $i$ and $j$:

a) Sample $l_i \sim \text{Discrete}(\theta_i)$ and assign identity $I_i = z_i^{(l_i)}$

b) Sample $l_j \sim \text{Discrete}(\theta_j)$ and assign identity $I_j = z_j^{(l_j)}$

c) Sample $R^m_{i,j}$ from $\text{Bernoulli}\big(\eta(I_i, I_j)\big)$
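The forward steps above can be sketched in code. This is a simplified illustration under stated assumptions: one concentration `a` is shared by every level (the slides index a separate a_l per level), the Dirichlet prior is symmetric with unit parameters, η is drawn from Beta(1, 1), and the names `sample_ncrp_paths`, `sample_link` and `theta` are chosen here rather than taken from the talk.

```python
import numpy as np

def sample_ncrp_paths(n, L, a, rng):
    """Steps 1-2: sample n branches with levels 0..L from a nested CRP."""
    children = {}            # children[node] = list of [child_id, count]
    paths, next_id = [], 1   # node 0 is the root (level 0)
    for _ in range(n):
        node, path = 0, [0]
        for _level in range(1, L + 1):
            kids = children.setdefault(node, [])
            w = np.array([c for _, c in kids] + [a], dtype=float)
            k = int(rng.choice(len(w), p=w / w.sum()))
            if k == len(kids):               # open a new child node
                kids.append([next_id, 0])
                next_id += 1
            kids[k][1] += 1
            node = kids[k][0]
            path.append(node)
        paths.append(path)
    return paths, next_id                    # next_id = number of tree nodes

def sample_link(path_i, path_j, theta_i, theta_j, eta, rng):
    """Step 4: pick a level for each endpoint, then sample the link."""
    l_i = int(rng.choice(len(path_i), p=theta_i))
    l_j = int(rng.choice(len(path_j), p=theta_j))
    I_i, I_j = path_i[l_i], path_j[l_j]      # identities = nodes on the branches
    return int(rng.random() < eta[I_i, I_j])

rng = np.random.default_rng(0)
n, L = 10, 2
paths, n_nodes = sample_ncrp_paths(n, L, a=1.0, rng=rng)
theta = rng.dirichlet(np.ones(L + 1), size=n)        # step 3: level weights
eta = rng.beta(1.0, 1.0, size=(n_nodes, n_nodes))    # link probabilities
r = sample_link(paths[0], paths[1], theta[0], theta[1], eta, rng)
```

Every branch starts at the root, so any two objects share at least the most general identity.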


Application: Artificial Dataset

• 1 domain – 40 objects

• 4 distinct identities

• fully observed adjacency matrix


Application: Political Books

43 liberal, 49 conservative, 13 neutral

Links imply frequent co-purchasing by the same buyers (Amazon.com)

[Figure: groups inferred for the political books dataset, annotated with group sizes and per-group percentage breakdowns.]


Reality Mining MIT Data

1 node type (people)

97 people + all outsiders in one node

22 different positions (professor, staff, 1st-year grad, …)

Sloan: 29%

Faculty & staff: 5%

Students: 52%

Other: 14%

[Figure: groups inferred for the Reality Mining MIT data, annotated with group sizes and per-group percentage breakdowns.]


Conclusions and Outlook

• Relational data contain significant information about group structure

• Bayesian models allow the analyst to make inferences about communities of interest while quantifying the level of confidence, even when a significant proportion of the data is missing

• Nonparametric models provide a way of getting very flexible priors that allow the model to adapt to the data.

• IRM is a very lightweight framework with a very wide range of applicability, but cannot capture multiple identities.

• MMM and HMMM allow for increased flexibility and provide additional information about objects that simultaneously belong to several groups.

Challenges:

• Accelerated inference, especially when dealing with large datasets:

– Variational methods

– Sequential Monte Carlo

• Appropriate priors for time-dependent datasets are needed


Application: Senate Vote 2002

Senator               State   Party   Vote   Contribution ($)
Murkowski, Frank      AK      R       YES    19700
Stevens, Ted          AK      R       YES    13000
Sessions, Jeff        AL      R       YES    9500
Shelby, Richard       AL      R       YES    25000
Hutchinson, Tim       AR      R       YES    4900
Lincoln, Blanche      AR      D       YES    5500
McCain, John          AZ      R       NO     29350
Kyl, Jon              AZ      R       YES    14500
Boxer, Barbara        CA      D       NO     1500
Feinstein, Dianne     CA      D       NO     9750
Allard, Wayne         CO      R       YES    7500
Campbell, Ben         CO      R       YES    4000
Dodd, Christopher     CT      D       NO     500
Lieberman, Joseph     CT      D       NO     3000
Carper, Thomas        DE      D       YES    17640

50 Democrats, 49 Republicans, 1 Independent

Link Ri,j = 1 if:

- voted the same

- have both taken more or less than the average contribution

average: $13,800
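The link rule can be illustrated on a few rows of the table. The sketch assumes that both conditions must hold for a link (the slide lists them without saying whether they combine with "and" or "or"), and that "more or less than the average" means both senators fall on the same side of $13,800. The helper name `link` and the choice of rows are for illustration only.

```python
# (name, party, vote, contribution in $) for four senators from the table.
senators = [
    ("Murkowski, Frank", "R", "YES", 19700),
    ("McCain, John", "R", "NO", 29350),
    ("Boxer, Barbara", "D", "NO", 1500),
    ("Carper, Thomas", "D", "YES", 17640),
]
AVERAGE = 13800  # average contribution stated on the slide

def link(a, b):
    """Assumed rule: same vote AND same side of the average contribution."""
    same_vote = a[2] == b[2]
    same_side = (a[3] > AVERAGE) == (b[3] > AVERAGE)
    return int(same_vote and same_side)

# Binary relation matrix; the diagonal is set to 0.
R = [[link(a, b) if i != j else 0 for j, b in enumerate(senators)]
     for i, a in enumerate(senators)]
```

With this rule Murkowski and Carper are linked (both voted YES and both are above the average), while McCain and Boxer are not, since they voted alike but sit on opposite sides of the average.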

[Figure: groups inferred for the Senate Vote 2002 data, annotated with per-group percentage breakdowns.]