Statistical Models for Partial Membership Katherine Heller Gatsby Computational Neuroscience Unit, UCL Sinead Williamson and Zoubin Ghahramani University of Cambridge

Post on 18-Jan-2016

Page 1: Statistical Models for Partial Membership Katherine Heller Gatsby Computational Neuroscience Unit, UCL Sinead Williamson and Zoubin Ghahramani University

Statistical Models for Partial Membership

Katherine Heller
Gatsby Computational Neuroscience Unit, UCL

Sinead Williamson and Zoubin Ghahramani
University of Cambridge

Page 2:

Partial Membership

Example: Person with mixed ethnic background.

Someone who is 50% Asian and 50% European partly belongs to 2 different groups (ethnicities).

This partial membership may be relevant for predicting this person’s phenotype or food preferences.

Conceptually not the same as uncertain membership.

Being certain that someone is half Asian and half European is very different from being unsure of their ethnicity.

More evidence (like DNA tests) can help resolve uncertainty but will not change their ethnicity memberships.

Previous work on modeling partial membership comes from the fuzzy logic community.

Page 3:

Outline

Goal: Describe a fully probabilistic approach to data modeling with partial memberships.

Introduction
Bayesian Partial Membership Model (BPM)
BPM Learning
Experiments
    Synthetic
    Senate Roll Call data
Related Work
Conclusions
Nonparametric Extension?

Page 4:

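Finite Mixture Models

Consider modeling a data set, $X = \{x_n\}_{n=1}^{N}$, using a finite mixture of K components:

$p(x_n \mid \Theta) = \sum_{k=1}^{K} \rho_k \, p_k(x_n \mid \theta_k)$

where $\rho_k$ are the mixing proportions. This can be rewritten as:

$p(x_n \mid \Theta) = \sum_{\pi_n} p(\pi_n) \prod_{k=1}^{K} p_k(x_n \mid \theta_k)^{\pi_{nk}}$

where $\pi_{nk} \in \{0, 1\}$ and $\sum_{k} \pi_{nk} = 1$ denote memberships of data points to clusters.

Generative Process:

1) Choose a cluster

2) Generate a data point from that cluster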
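The two-step generative process can be sketched in a few lines of NumPy. This is only an illustration under assumptions not stated on the slide: the clusters are taken to be spherical Gaussians, and the names `sample_finite_mixture`, `rho`, `means`, and `std` are hypothetical.

```python
import numpy as np

def sample_finite_mixture(n_points, rho, means, std, rng):
    """Draw points from a finite mixture of spherical Gaussians (assumed cluster type)."""
    rho = np.asarray(rho, dtype=float)
    means = np.asarray(means, dtype=float)
    # 1) Choose a cluster for every data point according to the mixing proportions.
    z = rng.choice(len(rho), size=n_points, p=rho)
    # 2) Generate each data point from its chosen cluster.
    x = means[z] + std * rng.standard_normal((n_points, means.shape[1]))
    # One-hot membership vectors: exactly one entry per row is 1.
    pi = np.eye(len(rho), dtype=int)[z]
    return x, pi

rng = np.random.default_rng(0)
x, pi = sample_finite_mixture(
    500, [0.5, 0.3, 0.2], [[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]], 1.0, rng
)
```

Each row of `pi` is a binary indicator vector, which is the "all or nothing" membership that the partial membership model later relaxes.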

Page 5:

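Finite Mixture Models

Continuous Relaxation

$p(x_n \mid \Theta) = \int_{\pi_n} p(\pi_n) \, \frac{1}{c} \prod_{k=1}^{K} p_k(x_n \mid \theta_k)^{\pi_{nk}} \, d\pi_n$

where $\pi_{nk} \in [0, 1]$ and $\sum_{k} \pi_{nk} = 1$ denote partial memberships of data points to clusters, and $c$ is a normalizing constant.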

Page 6:

Why does this make sense?

If there is an “Asian” cluster and a “European” cluster, the partial membership model will better capture people with mixed ethnicity, whose features lie in between.

[Figure: two panels, "Partial Membership" and "Mixture Model", with regions labeled by the membership vectors (1,0), (0,1), and (.5,.5).]
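A small numerical sketch of the "in between" intuition, assuming one-dimensional Gaussian clusters with a shared variance (an assumption made here for simplicity, not taken from the slide). Under that assumption, the normalized product of the cluster densities raised to the membership weights is again a Gaussian whose mean is the membership-weighted average of the cluster means, while an equal-weight mixture keeps its mass near the two cluster centers. The function names are hypothetical.

```python
import numpy as np

def gauss_pdf(x, mean, var):
    """Density of a univariate Gaussian."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def partial_membership_gaussian(pi, means, var):
    """Mean and variance of the normalized product prod_k N(x; mean_k, var)^pi_k.

    Assumes a shared variance and weights pi that sum to 1, in which case the
    product is Gaussian with mean sum_k pi_k * mean_k and the same variance.
    """
    pi = np.asarray(pi, dtype=float)
    means = np.asarray(means, dtype=float)
    return float(pi @ means), var

means = np.array([-2.0, 2.0])   # two cluster centers
var = 1.0
pi = np.array([0.5, 0.5])       # equal partial membership in both clusters

pm_mean, pm_var = partial_membership_gaussian(pi, means, var)
density_pm_at_0 = float(gauss_pdf(0.0, pm_mean, pm_var))
density_mix_at_0 = float(pi @ gauss_pdf(0.0, means, var))
```

At the midpoint between the clusters, the partial membership density is high while the mixture density is low, which matches the point made on the slide.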

Page 7:

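Exponential Family Distributions

Let's consider the case where:

$p_k(x_n \mid \theta_k) = \exp\{ s(x_n)^\top \theta_k + h(x_n) + g(\theta_k) \}$

Natural Parameters: $\theta_k$

Sufficient Statistics: $s(x_n)$

Conjugate prior can be written as:

$p(\theta_k \mid \lambda, \nu) = \exp\{ \lambda^\top \theta_k + \nu \, g(\theta_k) + f(\lambda, \nu) \}$

It follows that:

$x_n \mid \pi_n, \Theta \sim \mathrm{Expon}\left( \sum_{k} \pi_{nk} \theta_k \right)$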
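As a hedged illustration of combining natural parameters, the sketch below uses Bernoulli clusters, whose natural parameter is the log-odds of the success probability. A data point with membership vector `pi` is then Bernoulli with log-odds equal to the membership-weighted average of the cluster log-odds. The function names and the example probabilities are made up for illustration.

```python
import numpy as np

def logit(p):
    """Natural parameter (log-odds) of a Bernoulli with mean p."""
    p = np.asarray(p, dtype=float)
    return np.log(p) - np.log1p(-p)

def sigmoid(eta):
    """Bernoulli mean for natural parameter eta."""
    return 1.0 / (1.0 + np.exp(-np.asarray(eta, dtype=float)))

def partial_membership_bernoulli(pi, cluster_probs):
    """Bernoulli means of a point with memberships pi (K,) over clusters (K, D)."""
    theta = logit(cluster_probs)   # natural parameters, one row per cluster
    return sigmoid(pi @ theta)     # convex combination in natural-parameter space

cluster_probs = np.array([[0.9, 0.2],
                          [0.1, 0.2]])
full = partial_membership_bernoulli(np.array([1.0, 0.0]), cluster_probs)
half = partial_membership_bernoulli(np.array([0.5, 0.5]), cluster_probs)
```

With full membership in the first cluster the point inherits that cluster's probabilities; with equal membership, a feature on which the clusters disagree symmetrically (0.9 versus 0.1) becomes a coin flip, and a feature on which they agree stays at 0.2.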

Page 8:

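Bayesian Partial Membership Model

Generative Process:

For each k: $\theta_k \sim \mathrm{Conj}(\lambda, \nu)$

$\rho \sim \mathrm{Dir}(\alpha)$

$a \sim \mathrm{Exp}(b)$

For each n: $\pi_n \sim \mathrm{Dir}(a\rho)$, then $x_n \sim \mathrm{Expon}\left( \sum_{k} \pi_{nk} \theta_k \right)$

Ethnicity Example:

$\theta_k$: Defines a distribution over features for each of the K ethnic groups

$\rho$: Defines the ethnic composition of the population

$a$: Controls how similar to the population an individual is expected to be

$\pi_n$: Ethnic composition of individual n

$x_n$: Feature values of individual n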
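The generative process can be sketched for binary data as follows. Several choices here are assumptions made only for illustration: the clusters are multivariate Bernoulli, cluster probabilities are drawn from a Beta distribution and then converted to log-odds (rather than sampling the natural parameters from their conjugate prior directly), and `Exp(b)` is read as an exponential with rate `b`. All function and argument names are hypothetical.

```python
import numpy as np

def sample_bpm_bernoulli(n_points, n_dims, alpha, b, rng, beta_a=1.0, beta_b=1.0):
    """Sample binary data from a partial membership model with Bernoulli clusters."""
    alpha = np.asarray(alpha, dtype=float)
    n_clusters = alpha.shape[0]
    # Population-level composition over clusters.
    rho = rng.dirichlet(alpha)
    # Concentration: how closely individuals follow the population composition.
    a = rng.exponential(scale=1.0 / b)          # assumed rate parameterization
    # Per-cluster feature distributions (Beta draw is an assumed stand-in prior).
    probs = rng.beta(beta_a, beta_b, size=(n_clusters, n_dims))
    theta = np.log(probs) - np.log1p(-probs)    # Bernoulli natural parameters
    # Per-individual partial memberships.
    pi = rng.dirichlet(a * rho, size=n_points)
    # Each point uses a convex combination of the cluster natural parameters.
    eta = pi @ theta
    x = (rng.random((n_points, n_dims)) < 1.0 / (1.0 + np.exp(-eta))).astype(int)
    return rho, a, pi, theta, x

rng = np.random.default_rng(0)
rho, a, pi, theta, x = sample_bpm_bernoulli(20, 8, alpha=[1.0, 1.0, 1.0], b=1.0, rng=rng)
```

Each row of `pi` is a point on the simplex rather than a one-hot vector, which is the only structural difference from sampling an ordinary mixture.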

Page 9:

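Bayesian Partial Membership Model

Generative Process:

For each k: $\theta_k \sim \mathrm{Conj}(\lambda, \nu)$

$\rho \sim \mathrm{Dir}(\alpha)$

$a \sim \mathrm{Exp}(b)$

For each n: $\pi_n \sim \mathrm{Dir}(a\rho)$, then $x_n \sim \mathrm{Expon}\left( \sum_{k} \pi_{nk} \theta_k \right)$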

Page 10:

BPM Sampled Data

Each of the four plots shows 3000 data points drawn from the BPM with the same 3 full-covariance Gaussian clusters.

Page 11:

BPM Theory

Lemma 1: In the limit as $a \to 0$, the exponential family BPM model is a mixture of K components with mixing proportions $\rho$.

Lemma 2: In the limit as $a \to \infty$, the exponential family BPM model has only one component with natural parameters $\sum_{k} \rho_k \theta_k$.
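The two limits can be checked numerically on the membership vectors alone, assuming they are drawn from a Dirichlet with parameter $a\rho$ as in the BPM generative process. This sketch does not prove the lemmas; it only shows that a small `a` yields nearly one-hot memberships (mixture-like behavior) and a large `a` yields memberships close to `rho` (a single effective component). The helper name and the specific values of `a` are hypothetical.

```python
import numpy as np

def membership_summary(a, rho, n_samples, rng):
    """Summarize Dirichlet(a * rho) draws.

    Returns the average largest membership entry and the average absolute
    deviation of the memberships from rho.
    """
    rho = np.asarray(rho, dtype=float)
    pi = rng.dirichlet(a * rho, size=n_samples)
    mean_largest = float(pi.max(axis=1).mean())
    mean_abs_dev = float(np.abs(pi - rho).mean())
    return mean_largest, mean_abs_dev

rng = np.random.default_rng(0)
rho = np.array([0.5, 0.3, 0.2])
small_a = membership_summary(0.1, rho, 2000, rng)      # memberships near the corners
large_a = membership_summary(1000.0, rho, 2000, rng)   # memberships near rho
```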

Page 12:

BPM Learning

Want to infer all unknowns given X: $\Theta, \Pi, \rho, a$

We treat as fixed hyperparameters: $\alpha, \lambda, \nu, b$

Goal: Infer $p(\Theta, \Pi, \rho, a \mid X)$ using MCMC

All parameters in the BPM are continuous, so we can use Hybrid Monte Carlo. Hybrid Monte Carlo is an efficient MCMC method that uses gradient information to find high probability regions.

Page 13:

Synthetic Data

Generated a synthetic binary data set of 50 data points, 32 dimensions, and 3 clusters.

Ran HMC sampler for 4000 iterations.

Computed: [formulas missing], where one quantity is built from the true generated matrix and the other from the sampled matrix.

Page 14:

Senate Roll Call Data (2001-2002)

(99 senators + 1 outcome) x 633 votes

K=2 multivariate Bernoulli clusters

Model adapted to handle missing data

Page 15:

Senate Roll Call Comparisons

Fuzzy K-means:

Blue: Senator Schumer

Black: “Outcome”

Red: Senator Ensign

Partial membership values are very sensitive to the exponent.

For no value of the exponent do the membership values make sense.

Page 16:

Senate Roll Call Comparisons

Dirichlet Process Mixtures:

DPM confidently infers 4 clusters

Uncertainty is not a good substitute for partial membership

Negative log predictive probability (in bits) across senators:

       Mean   Median   Min   Max   “Outcome”
BPM    187    168      93    422   224
DPM    196    178      112   412   245
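For reference, a negative log predictive probability in bits can be computed as below for binary votes. This is a generic sketch, not the evaluation code behind the reported numbers: it assumes each vote has a predicted probability of being 1, and that missing votes are encoded as NaN and skipped. The function name is hypothetical.

```python
import numpy as np

def neg_log_pred_bits(votes, pred_prob, eps=1e-12):
    """Negative log2 predictive probability of observed binary votes.

    votes: array of 0/1 values, with NaN marking missing votes (assumed encoding).
    pred_prob: predicted probability that each vote equals 1.
    """
    votes = np.asarray(votes, dtype=float)
    p = np.clip(np.asarray(pred_prob, dtype=float), eps, 1.0 - eps)
    observed = ~np.isnan(votes)
    v = np.where(observed, votes, 0.0)           # placeholder value for missing entries
    log_lik = v * np.log2(p) + (1.0 - v) * np.log2(1.0 - p)
    return float(-np.sum(log_lik[observed]))     # missing votes contribute nothing

bits_two_coin_flips = neg_log_pred_bits([1, 0, np.nan], [0.5, 0.5, 0.9])
bits_one_unlikely = neg_log_pred_bits([1], [0.25])
```

Lower values mean the model assigned higher probability to the held-out votes.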

Page 17:

Image Data

329 Tower and Sunset Images with 240 simple binary texture and color features and K=2 clusters.

Page 18:

Related Work

Latent Dirichlet Allocation (LDA) and Mixed Membership Models

Fuzzy Clustering

Exponential Family PCA

Page 19:

Future Work

Would be nice to have a nonparametric version.

Obvious thing to try: Hierarchical Dirichlet Processes. But this would require summing over all infinitely many elements of $\pi_n$, which isn’t computationally feasible. It is also semantically not very nice.

Indian Buffet Processes might work: sample an IBP matrix, with the interpretation that a 1 means having some non-zero amount of membership in that cluster, then draw the exact continuous amount separately.
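A rough sketch of the IBP idea, under loudly stated assumptions: the binary matrix is drawn with the standard sequential ("restaurant") construction of the Indian Buffet Process, and the continuous amounts are drawn here as a flat Dirichlet over each row's active clusters. The slide does not specify how the continuous amounts would be drawn, so that second step is purely a placeholder. All names are hypothetical.

```python
import numpy as np

def sample_ibp(n_rows, alpha, rng):
    """Sample a binary feature matrix from the Indian Buffet Process."""
    counts = []   # how many earlier rows use each existing column
    rows = []
    for i in range(1, n_rows + 1):
        # Reuse an existing column with probability (times used so far) / i.
        row = [int(rng.random() < m / i) for m in counts]
        # Open a Poisson(alpha / i) number of new columns.
        n_new = int(rng.poisson(alpha / i))
        counts = [m + z for m, z in zip(counts, row)] + [1] * n_new
        rows.append(row + [1] * n_new)
    z = np.zeros((n_rows, len(counts)), dtype=int)
    for i, row in enumerate(rows):
        z[i, :len(row)] = row
    return z

def memberships_from_ibp(z, rng):
    """Placeholder step: flat Dirichlet over the active clusters of each row."""
    raw = rng.gamma(1.0, size=z.shape) * z
    totals = raw.sum(axis=1, keepdims=True)
    # Rows with no active cluster are left as all zeros.
    return np.divide(raw, totals, out=np.zeros_like(raw), where=totals > 0)

rng = np.random.default_rng(0)
z = sample_ibp(10, alpha=3.0, rng=rng)
pi = memberships_from_ibp(z, rng)
```

The number of columns of `z` is not fixed in advance, which is what would make such a model nonparametric in the number of clusters.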

Page 20:

Conclusions

Developed a fully probabilistic approach to data modeling with partial membership.

Uses continuous latent variables and can be seen as a relaxation of clustering with standard mixture models.

Used Hybrid Monte Carlo for inference, which was extremely fast (finding sensible partial membership structure after very few samples).

Page 21:

Thank You

Page 22:

Partial Membership

Cornerstone of fuzzy set theory

Traditional set theory: Items belong to a set or they don’t {0,1}.

Fuzzy set theory: membership function $\mu_A(x) \in [0, 1]$, where $\mu_A(x)$ denotes the degree to which $x$ belongs to set $A$.

Fuzzy logic versus probabilistic models

There are misguided arguments that fuzzy logic is different from, or supersedes, probability theory.

While it might be easy to dismiss fuzzy logic, its framework for representing partial membership has inspired many researchers.

Google Scholar: Over 45,000 fuzzy clustering papers. The most cited papers are cited as frequently as the most cited “NIPS” area papers.

Page 23:

Related Work - Latent Dirichlet Allocation (LDA) and Mixed Membership Models

BPM generates data points at the document level of LDA (no word plate).

Whereas LDA (or Mixed Membership models) assume that words (or attributes) are drawn using $\pi_n$ as mixing proportions in a mixture model, and are factorized, the BPM uses $\pi_n$ to form a convex combination of natural parameters. Attributes are not drawn from a mixture model and need not be factorized.

BPM: potentially faster MCMC sampling, since the BPM has all continuous parameters whereas LDA must infer a discrete topic assignment for each word.

Page 24:

Mixed Membership Model Generation

Page 25:

Related Work: Fuzzy Clustering

Fuzzy k-means iteratively minimizes the following objective:

$J = \sum_{n=1}^{N} \sum_{k=1}^{K} \pi_{nk}^{\gamma} \, d^2(x_n, c_k)$

where $d$ is the distance between a data point and a cluster center, $\pi_{nk}$ is the degree of membership of a data point in a cluster, and $\gamma$ controls the amount of partial membership ($\gamma = 1$ is normal k-means).

None of these variables have probabilistic interpretations.
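For comparison, here is a minimal sketch of fuzzy k-means using the standard alternating updates (memberships from inverse distances, centers as membership-weighted means). It assumes squared Euclidean distance, an exponent strictly greater than 1, and user-supplied initial centers; the function and argument names are hypothetical and the slide does not prescribe this particular implementation.

```python
import numpy as np

def fuzzy_memberships(x, centers, gamma, eps=1e-12):
    """Membership of each point in each cluster, plus squared distances."""
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) + eps
    inv = d2 ** (-1.0 / (gamma - 1.0))          # requires gamma > 1
    return inv / inv.sum(axis=1, keepdims=True), d2

def fuzzy_kmeans(x, init_centers, gamma=2.0, n_iter=50):
    """Alternate membership and center updates for a fixed number of iterations."""
    x = np.asarray(x, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(n_iter):
        u, _ = fuzzy_memberships(x, centers, gamma)
        w = u ** gamma
        centers = (w.T @ x) / w.sum(axis=0)[:, None]   # weighted means
    u, d2 = fuzzy_memberships(x, centers, gamma)
    objective = float(((u ** gamma) * d2).sum())
    return u, centers, objective

x = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
u, centers, objective = fuzzy_kmeans(x, init_centers=[[0.0, 0.0], [10.0, 10.0]])
```

The memberships depend directly on the exponent through the power `-1 / (gamma - 1)`, which is one way to see why they can be so sensitive to it.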

Page 26:

Related Work: Exponential Family PCA

Originally formulated in terms of Bregman divergences, it can be seen as a non-Bayesian version of the BPM where the $\pi_{nk}$ are not constrained (to normalize to 1 or be positive). It is therefore not a convex combination of natural parameters with the same sort of partial membership interpretation.

If we wanted, we could relax these same constraints to get a Bayesian version of Exponential Family PCA, but we’d have to tweak the model, e.g. with a Gaussian prior on $\pi_n$.

Page 27:

BPM Learning

Hybrid Monte Carlo is an MCMC method that uses gradient information.

Hybrid Monte Carlo simulates the dynamics of a system with continuous state variables on an energy function:

$E(\Omega) = -\log p(X, \Omega)$, where $\Omega$ denotes the unknown variables.

The gradients of the energy function provide forces on the state variables which encourage the system to find high probability regions, while maintaining detailed balance.
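A minimal sketch of one Hybrid Monte Carlo update, using the usual leapfrog integrator and a Metropolis accept/reject step. The slides do not give the integrator or tuning details, so the step size, number of leapfrog steps, and function names here are assumptions, and the target is a toy standard Gaussian rather than the BPM posterior.

```python
import numpy as np

def hmc_step(x, energy, grad_energy, step_size, n_leapfrog, rng):
    """One HMC update; returns the new state and whether the proposal was accepted."""
    p = rng.standard_normal(x.shape)               # resample momentum
    h_old = energy(x) + 0.5 * p @ p
    x_new = x.copy()
    p_new = p - 0.5 * step_size * grad_energy(x_new)    # initial half step
    for i in range(n_leapfrog):
        x_new = x_new + step_size * p_new               # full position step
        if i < n_leapfrog - 1:
            p_new = p_new - step_size * grad_energy(x_new)
    p_new = p_new - 0.5 * step_size * grad_energy(x_new)  # final half step
    h_new = energy(x_new) + 0.5 * p_new @ p_new
    # Accept with probability min(1, exp(h_old - h_new)).
    if rng.random() < np.exp(min(0.0, h_old - h_new)):
        return x_new, True
    return x, False

def energy(x):
    """Negative log density of a standard Gaussian, up to a constant."""
    return 0.5 * x @ x

def grad_energy(x):
    return x

rng = np.random.default_rng(0)
x = np.array([3.0, -3.0])
samples = []
for _ in range(2000):
    x, _ = hmc_step(x, energy, grad_energy, step_size=0.2, n_leapfrog=10, rng=rng)
    samples.append(x)
samples = np.array(samples)
```

For the BPM, `energy` would be the negative log joint of the data and the unknowns, which requires all unknowns to be continuous so that `grad_energy` exists.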

Page 28:

Bregman Divergence

$D_F(p, q) = F(p) - F(q) - \langle \nabla F(q), \, p - q \rangle$

F is a strictly convex function, p and q are points.

Intuitively, this is the difference between the value of F at p and the value of the first-order Taylor expansion of F around q, evaluated at p.
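The definition translates directly into code. The sketch below evaluates the Bregman divergence for two standard choices of F: the squared Euclidean norm, which gives squared Euclidean distance, and negative entropy, which gives the KL divergence when both points are probability vectors. The function names are hypothetical.

```python
import numpy as np

def bregman_divergence(F, grad_F, p, q):
    """F(p) minus the first-order Taylor expansion of F around q, evaluated at p."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(F(p) - F(q) - grad_F(q) @ (p - q))

def sq_norm(x):
    return float(x @ x)

def sq_norm_grad(x):
    return 2.0 * x

def neg_entropy(x):
    return float(np.sum(x * np.log(x)))

def neg_entropy_grad(x):
    return np.log(x) + 1.0

p = np.array([0.2, 0.8])
q = np.array([0.5, 0.5])
d_euclid = bregman_divergence(sq_norm, sq_norm_grad, p, q)       # squared distance
d_kl = bregman_divergence(neg_entropy, neg_entropy_grad, p, q)   # KL(p || q)
```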

Page 29:

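LDA Review

1. For z = 1…K:
    Draw the multinomial parameters for words given topic z
2. For d = 1…D:
    a) Draw the multinomial parameters for topics in document d
    b) For n = 1…N_d:
        i. Draw a topic
        ii. Draw a word given that topic

The model involves hyperparameters, multinomial parameters for topics, multinomial parameters for words given topics, words, and topics.

K - # topics

N_d - # words in doc d

D - # documents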
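The LDA generative steps can be sketched as follows. The symbols on the slide did not survive, so the names used here (`alpha` and `eta` for symmetric Dirichlet hyperparameters, `theta` for per-document topic proportions, `beta` for per-topic word distributions) follow the common smoothed-LDA convention and are assumptions, as is the use of symmetric Dirichlet priors.

```python
import numpy as np

def sample_lda_corpus(n_topics, vocab_size, doc_lengths, alpha, eta, rng):
    """Sample documents (as arrays of word ids) from the LDA generative process."""
    # 1. For each topic, draw its distribution over words.
    beta = rng.dirichlet(np.full(vocab_size, eta), size=n_topics)
    docs, topic_assignments, thetas = [], [], []
    # 2. For each document:
    for n_words in doc_lengths:
        # a) draw its distribution over topics,
        theta = rng.dirichlet(np.full(n_topics, alpha))
        # b) then, for each word position, draw a topic and a word given that topic.
        z = rng.choice(n_topics, size=n_words, p=theta)
        w = np.array([rng.choice(vocab_size, p=beta[t]) for t in z], dtype=int)
        thetas.append(theta)
        topic_assignments.append(z)
        docs.append(w)
    return beta, thetas, topic_assignments, docs

rng = np.random.default_rng(0)
doc_lengths = [5, 8, 3]
beta, thetas, topic_assignments, docs = sample_lda_corpus(
    n_topics=2, vocab_size=6, doc_lengths=doc_lengths, alpha=0.5, eta=0.5, rng=rng
)
```

Note the discrete topic assignment drawn for every word; this is the per-word latent variable that the BPM avoids by combining natural parameters at the document level.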