
Statistical Models for Partial Membership

Katherine Heller, Gatsby Computational Neuroscience Unit, UCL

Sinead Williamson and Zoubin Ghahramani, University of Cambridge

Partial Membership Example: Person with mixed ethnic background.

Someone who is 50% Asian and 50% European partly belongs to 2 different groups (ethnicities).

This partial membership may be relevant for predicting this person’s phenotype or food preferences.

Conceptually not the same as uncertain membership.

Being certain that someone is half Asian and half European is very different from being unsure of their ethnicity.

More evidence (like DNA tests) can help resolve uncertainty but will not change their ethnicity memberships.

Prior work on modeling partial membership comes from the fuzzy logic community.

Outline

Goal: Describe a fully probabilistic approach to data modeling with partial memberships.

Introduction

Bayesian Partial Membership Model (BPM)

BPM Learning

Experiments: Synthetic data, Senate Roll Call data

Related Work

Conclusions

Nonparametric Extension?

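Finite Mixture Models

Consider modeling a data set, $X = \{x_n\}$, using a finite mixture of K components:

$p(x_n \mid \Theta) = \sum_{k=1}^{K} \rho_k \, p_k(x_n \mid \theta_k)$

where $p_k$ is the distribution of component k and the $\rho_k$ are the mixing proportions. Equivalently:

$p(x_n \mid \Theta) = \sum_{\pi_n} p(\pi_n) \prod_{k=1}^{K} p_k(x_n \mid \theta_k)^{\pi_{nk}}$

where:

$\pi_{nk} \in \{0, 1\}$ and $\sum_k \pi_{nk} = 1$

The $\pi_n$ denote memberships of data points to clusters!

Generative Process:

1) Choose a cluster

2) Generate a data point from that cluster

Finite Mixture Models

Continuous Relaxation

$p(x_n \mid \pi_n, \Theta) = \frac{1}{c} \prod_{k=1}^{K} p_k(x_n \mid \theta_k)^{\pi_{nk}}$

where $c$ is a normalizing constant and:

$\pi_{nk} \in [0, 1]$ and $\sum_k \pi_{nk} = 1$

The $\pi_n$ now denote partial memberships of data points to clusters!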

Why does this make sense?

If there is an “Asian” cluster and a “European” cluster, the partial membership model will better capture people with mixed ethnicity, whose features lie in between.

[Figure: Mixture Model vs. Partial Membership, with points labeled by membership vectors (1,0), (0,1), and (.5,.5).]

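Exponential Family Distributions

Let's consider the case where:

$p_k(x_n \mid \theta_k) = \exp\{ s(x_n)^\top \theta_k + h(x_n) + g(\theta_k) \}$

Natural Parameters: $\theta_k$

Sufficient Statistics: $s(x_n)$

Conjugate prior can be written as:

$p(\theta_k \mid \lambda, \nu) \propto \exp\{ \lambda^\top \theta_k + \nu \, g(\theta_k) \}$

It follows that:

$x_n \mid \pi_n, \Theta \sim \mathrm{Expon}\left( \sum_k \pi_{nk} \theta_k \right)$

That is, $x_n$ is drawn from the same exponential family, with natural parameters given by a convex combination of the cluster natural parameters weighted by the partial memberships $\pi_{nk}$.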
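To make the convex combination of natural parameters concrete, here is a minimal sketch for Gaussian clusters. It assumes the standard Gaussian natural parameters (precision times mean, and the precision matrix up to a constant factor). The function name `partial_membership_gaussian` and the two example clusters are illustrative and do not come from the talk.

```python
import numpy as np

def partial_membership_gaussian(pi, means, covs):
    """Gaussian obtained from a convex combination of natural parameters.

    pi    : (K,) partial memberships, non-negative and summing to 1
    means : (K, D) cluster means
    covs  : (K, D, D) cluster covariances
    Returns the mean and covariance of the combined Gaussian.
    """
    pi = np.asarray(pi, dtype=float)
    means = np.asarray(means, dtype=float)
    precisions = np.linalg.inv(np.asarray(covs, dtype=float))  # Lambda_k
    # First natural parameter of each cluster: Lambda_k @ mu_k
    eta = np.einsum("kij,kj->ki", precisions, means)
    # Convex combination of both natural parameters
    precision_mix = np.einsum("k,kij->ij", pi, precisions)
    eta_mix = pi @ eta
    cov_mix = np.linalg.inv(precision_mix)
    return cov_mix @ eta_mix, cov_mix

# Hypothetical clusters, chosen only for illustration
means = np.array([[0.0, 0.0], [4.0, 2.0]])
covs = np.array([np.eye(2), np.eye(2)])

mean_half, cov_half = partial_membership_gaussian([0.5, 0.5], means, covs)
mean_full, cov_full = partial_membership_gaussian([1.0, 0.0], means, covs)
```

With equal covariances, a (.5,.5) membership places the combined Gaussian midway between the two cluster means, while (1,0) recovers the first cluster unchanged.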

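Bayesian Partial Membership Model

Generative Process:

$\rho \sim \mathrm{Dir}(\alpha)$

$a \sim \mathrm{Exp}(b)$

For each k: $\theta_k \sim \mathrm{Conj}(\lambda, \nu)$

For each n: $\pi_n \sim \mathrm{Dir}(a\rho)$, then $x_n \sim \mathrm{Expon}\left( \sum_k \pi_{nk} \theta_k \right)$

Ethnicity Example:

$\theta_k$: Defines a distribution over features for each of the K ethnic groups

$\rho$: Defines the ethnic composition of the population

$a$: Controls how similar to the population an individual is expected to be

$\pi_n$: Ethnic composition of individual n

$x_n$: Feature values of individual n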
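The generative process can be sketched in a few lines for binary data. This is a minimal illustration under stated assumptions, not the authors' code: clusters are multivariate Bernoulli with log-odds as natural parameters, a Beta(1, 1) prior on the Bernoulli means stands in for the conjugate prior, `b` is treated as a rate, and the function name `sample_bpm_bernoulli` is hypothetical.

```python
import numpy as np

def sample_bpm_bernoulli(N, D, K, alpha=1.0, b=1.0, seed=0):
    """Draw N binary data points of dimension D from a K-cluster BPM sketch."""
    rng = np.random.default_rng(seed)
    # Population-level composition: rho ~ Dir(alpha)
    rho = rng.dirichlet(alpha * np.ones(K))
    # Concentration: a ~ Exp(b), with b read as a rate (assumption)
    a = rng.exponential(scale=1.0 / b)
    # Assumption: Beta(1, 1) on Bernoulli means stands in for Conj(lambda, nu)
    probs = rng.beta(1.0, 1.0, size=(K, D))
    theta = np.log(probs) - np.log1p(-probs)   # natural parameters (log-odds)
    # Partial memberships: pi_n ~ Dir(a * rho), each row sums to 1
    pi = rng.dirichlet(a * rho, size=N)
    # Convex combination of natural parameters, then Bernoulli draws
    logits = pi @ theta
    x = (rng.random((N, D)) < 1.0 / (1.0 + np.exp(-logits))).astype(int)
    return x, pi, theta, rho, a

x, pi, theta, rho, a = sample_bpm_bernoulli(N=50, D=32, K=3)
```

The sizes in the usage line are only an example; any positive N, D, and K work.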


BPM Sampled Data

Each of the four plots shows 3000 data points drawn from the BPM with the same 3 full-covariance Gaussian clusters.

BPM Theory

Lemma 1: In the limit as $a \to 0$, the exponential family BPM model is a mixture of K components with mixing proportions $\rho$.

Lemma 2: In the limit as $a \to \infty$, the exponential family BPM model has only one component with natural parameters $\sum_k \rho_k \theta_k$.
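The two limits can be illustrated numerically on the membership vectors alone. The sketch below assumes memberships are drawn as $\pi_n \sim \mathrm{Dir}(a\rho)$; the proportions `rho` and the helper `membership_summary` are hypothetical, and the code only checks the behavior of the memberships, not the full lemmas.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = np.array([0.5, 0.3, 0.2])   # hypothetical population-level proportions

def membership_summary(a, n=5000):
    """Summarize n membership vectors drawn from Dir(a * rho)."""
    pi = rng.dirichlet(a * rho, size=n)
    # Average largest membership: close to 1 means nearly hard assignments
    avg_max = pi.max(axis=1).mean()
    # Average absolute deviation from rho: close to 0 means everyone matches rho
    avg_dev = np.abs(pi - rho).mean()
    # How often each cluster holds the largest membership
    hard_freq = np.bincount(pi.argmax(axis=1), minlength=len(rho)) / n
    return avg_max, avg_dev, hard_freq

small_a = membership_summary(a=0.05)   # towards the mixture-model limit
large_a = membership_summary(a=1e4)    # towards the single-component limit
```

For small `a` the memberships sit near the corners of the simplex and the dominant cluster is chosen with frequency close to `rho`. For large `a` every membership vector is close to `rho` itself.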

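BPM Learning

Want to infer all unknowns given X: $\Pi$ (partial memberships), $\Theta$ (cluster parameters), $\rho$, and $a$

We treat as fixed hyperparameters: $\alpha$, $\lambda$, $\nu$, $b$

Goal: Infer $p(\Pi, \Theta, \rho, a \mid X)$ using MCMC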

All parameters in the BPM are continuous so we can use Hybrid Monte Carlo. Hybrid Monte Carlo is an efficient MCMC method that uses gradient information to find high probability regions.
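A generic Hybrid Monte Carlo update looks roughly like the sketch below. It is not the BPM sampler from the talk: the target is a toy standard 2-D Gaussian chosen as an assumption so the example stays self-contained, and `hmc_step`, the step size, and the number of leapfrog steps are illustrative choices.

```python
import numpy as np

def hmc_step(x, energy, grad_energy, rng, step_size=0.1, n_leapfrog=20):
    """One Hybrid Monte Carlo update for a continuous state x."""
    p = rng.standard_normal(x.shape)                      # resample momentum
    x_new = x.copy()
    p_new = p - 0.5 * step_size * grad_energy(x_new)      # initial half step
    for i in range(n_leapfrog):
        x_new = x_new + step_size * p_new                 # full position step
        if i < n_leapfrog - 1:
            p_new = p_new - step_size * grad_energy(x_new)
    p_new = p_new - 0.5 * step_size * grad_energy(x_new)  # final half step
    # Metropolis accept/reject on the total energy keeps detailed balance
    h_old = energy(x) + 0.5 * p @ p
    h_new = energy(x_new) + 0.5 * p_new @ p_new
    if np.log(rng.random()) < h_old - h_new:
        return x_new
    return x

# Toy target (assumption): standard 2-D Gaussian, E(x) = 0.5 * x.x
def energy(x):
    return 0.5 * x @ x

def grad_energy(x):
    return x

rng = np.random.default_rng(0)
x = np.zeros(2)
samples = []
for _ in range(2000):
    x = hmc_step(x, energy, grad_energy, rng)
    samples.append(x)
samples = np.array(samples)
```

For the BPM, the energy would be the negative log joint probability of the data and the continuous unknowns, with constrained quantities such as the memberships reparameterized so that the state is unconstrained; that part is left out here.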

Synthetic Data

Generated synthetic binary data set of 50 data points, 32 dimensions, and 3 clusters.

Ran HMC sampler for 4000 iterations.

Computed: $U_T = \Pi_T \Pi_T^\top$ and $U_S = \Pi_S \Pi_S^\top$, where $\Pi_T$ is the true generated matrix and $\Pi_S$ is sampled.

Senate Roll Call Data (2001-2002)

(99 senators + 1 outcome) x 633 votes

K=2 multivariate Bernoulli clusters

Model adapted to handle missing data

Senate Roll Call Comparisons

Fuzzy K-means:

Blue: Senator Schumer

Black: “Outcome”

Red: Senator Ensign

Partial membership values are very sensitive to the exponent

For no value of the exponent do the membership values make sense

Senate Roll Call Comparisons

Dirichlet Process Mixtures:

DPM confidently infers 4 clusters

Uncertainty is not a good substitute for partial membership

Negative log predictive probability (in bits) across senators

       Mean   Median   Min   Max   “Outcome”
BPM    187    168      93    422   224
DPM    196    178      112   412   245
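For reference, a negative log predictive probability in bits for one senator's binary votes can be computed as in the sketch below. How the predictive probabilities were obtained in the experiment is not stated on the slide, so `pred_probs` is simply assumed to be given; the function name and the NaN convention for missing votes are illustrative.

```python
import numpy as np

def neg_log_pred_prob_bits(votes, pred_probs):
    """Negative log2 predictive probability of observed binary votes.

    votes      : array of 0/1 votes, with NaN marking missing entries
    pred_probs : predicted probability of a 1 for each vote, strictly in (0, 1)
    """
    votes = np.asarray(votes, dtype=float)
    pred_probs = np.asarray(pred_probs, dtype=float)
    observed = ~np.isnan(votes)                 # skip missing votes
    v, p = votes[observed], pred_probs[observed]
    log_lik = v * np.log2(p) + (1.0 - v) * np.log2(1.0 - p)
    return -log_lik.sum()

# Three votes predicted with probability 0.5 each cost one bit apiece
bits_uniform = neg_log_pred_prob_bits([1, 0, 1], [0.5, 0.5, 0.5])
# A missing vote contributes nothing
bits_missing = neg_log_pred_prob_bits([1, np.nan], [0.25, 0.9])
```

Lower values mean the model assigned higher probability to the votes that were actually cast.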

Image Data

329 Tower and Sunset Images with 240 simple binary texture and color features and K=2 clusters.

Related Work

Latent Dirichlet Allocation (LDA) and Mixed Membership Models

Fuzzy Clustering

Exponential Family PCA

Future Work

Would be nice to have a nonparametric version.

Obvious thing to try: Hierarchical Dirichlet Processes. But this would require summing over all infinitely many elements of $\pi_n$, which isn’t computationally feasible. Also semantically not very nice.

Indian Buffet Processes might work. Sample an IBP matrix with interpretation that a 1 means having some non-zero amount of membership in that cluster, then draw continuous exact amount separately.
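The IBP idea can be sketched as follows. This is a speculative illustration, not a model from the talk: the binary matrix is drawn with the usual sequential IBP scheme, and the "continuous exact amount" is drawn from a flat Dirichlet over the active clusters, which is purely an assumption. Both function names are hypothetical.

```python
import numpy as np

def sample_ibp(n_points, alpha, rng):
    """Binary matrix Z drawn from the IBP using the sequential scheme."""
    counts = []   # number of earlier points with membership in each cluster
    rows = []
    for n in range(1, n_points + 1):
        # Existing cluster k is switched on with probability counts[k] / n
        row = [int(rng.random() < m / n) for m in counts]
        # Poisson(alpha / n) brand new clusters for this point
        n_new = int(rng.poisson(alpha / n))
        row = row + [1] * n_new
        counts = [m + z for m, z in zip(counts, row)] + [1] * n_new
        rows.append(row)
    Z = np.zeros((n_points, len(counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

def sample_memberships(Z, rng):
    """Assumption: flat Dirichlet amounts over each row's active clusters."""
    pi = np.zeros(Z.shape)
    for i, z in enumerate(Z):
        active = np.flatnonzero(z)
        if active.size == 1:
            pi[i, active] = 1.0
        elif active.size > 1:
            pi[i, active] = rng.dirichlet(np.ones(active.size))
    return pi

rng = np.random.default_rng(0)
Z = sample_ibp(n_points=20, alpha=3.0, rng=rng)
pi = sample_memberships(Z, rng)
```

A row of Z with no active clusters keeps an all-zero membership vector here; a real model would need to decide how to handle that case.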

Conclusions

Developed a fully probabilistic approach to data modeling with partial membership.

Uses continuous latent variables and can be seen as a relaxation of clustering with standard mixture models.

Used Hybrid Monte Carlo for inference, which was extremely fast (finding sensible partial membership structure after very few samples).

Thank You

Partial Membership

Cornerstone of fuzzy set theory

Traditional set theory: Items belong to a set or they don’t {0,1}.

Fuzzy set theory: membership function $\mu_A(x) \in [0, 1]$, where $\mu_A(x)$ denotes the degree to which $x$ belongs to set $A$

Fuzzy logic versus probabilistic models

Misguided arguments that fuzzy logic is different from or supersedes probability theory.

While it might be easy to dismiss fuzzy logic, its framework for representing partial membership has inspired many researchers.

Google Scholar: Over 45,000 fuzzy clustering papers. Most cited papers cited as frequently as most cited “NIPS” area papers.

Related Work: Latent Dirichlet Allocation (LDA) and Mixed Membership Models

BPM generates data points at the document level of LDA (no word plate).

Whereas LDA (or Mixed Membership models) assume words (or attributes) are drawn using $\pi_n$ as mixing proportions in a mixture model, and are factorized, the BPM uses $\pi_n$ to form a convex combination of natural parameters. Attributes are not drawn from a mixture model and need not be factorized.

BPM - potentially faster MCMC sampling since BPM has all continuous parameters and LDA must infer a discrete topic assignment for each word.
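The difference between the two constructions shows up even for a single binary attribute. In this small sketch the log-odds in `theta` and the membership vector `pi` are made-up numbers: the mixed membership style averages the cluster means, while the BPM style averages the natural parameters (log-odds) and then maps to a mean.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical log-odds: rows are 2 clusters, columns are 2 binary attributes
theta = np.array([[-4.0, 0.0],
                  [0.0, 4.0]])
pi = np.array([0.5, 0.5])   # equal partial membership in both clusters

# Mixed membership style: each attribute comes from a mixture, so its
# probability is a convex combination of the cluster means
p_mixture = pi @ sigmoid(theta)

# BPM style: convex combination of natural parameters, then map to a mean
p_bpm = sigmoid(pi @ theta)
```

The two vectors differ because the sigmoid is nonlinear. The BPM value corresponds to a single Bernoulli whose log-odds lie between those of the clusters.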

Mixed Membership Model Generation

Related Work: Fuzzy Clustering

Fuzzy k-means iteratively minimizes the following objective:

$J = \sum_n \sum_k \pi_{nk}^{\gamma} \, d^2(x_n, c_k)$

where $d(x_n, c_k)$ is the distance between a data point and a cluster center, $\pi_{nk}$ is the degree of membership of a data point in a cluster, and $\gamma$ controls the amount of partial membership ($\gamma = 1$ is normal k-means)
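A minimal sketch of the fuzzy k-means iteration is given below, using the standard alternating updates for memberships and centers with squared Euclidean distance. The function name, the initialization from random data points, and the toy data are assumptions made for illustration. The code requires an exponent strictly greater than 1.

```python
import numpy as np

def fuzzy_kmeans(X, K, gamma=2.0, n_iter=50, seed=0):
    """Alternating updates for fuzzy k-means (requires gamma > 1)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]
    objective = []
    for _ in range(n_iter):
        # Squared distances between every point and every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)              # avoid division by zero
        # Membership update: proportional to d^(-2 / (gamma - 1))
        inv = d2 ** (-1.0 / (gamma - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)
        # Objective at the new memberships and current centers
        w = u ** gamma
        objective.append((w * d2).sum())
        # Center update: weighted mean with weights u^gamma
        centers = (w.T @ X) / w.sum(axis=0)[:, None]
    return u, centers, objective

# Toy data (assumption): two well separated 2-D blobs
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(0.0, 1.0, size=(50, 2)),
               data_rng.normal(10.0, 1.0, size=(50, 2))])
u, centers, objective = fuzzy_kmeans(X, K=2)
```

Rerunning with different values of `gamma` shows how strongly the memberships depend on the exponent.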

None of these variables have probabilistic interpretations.

Related Work: Exponential Family PCA

Originally formulated in terms of Bregman divergences, it can be seen as a non-Bayesian version of the BPM where the $\pi_n$ are not constrained (to normalize to 1 or be positive). It is not a convex combination of natural parameters with the same sort of partial membership interpretation.

If we wanted, we could relax these same constraints to get a Bayesian version of Exponential Family PCA, but we’d have to tweak the model, e.g. with a Gaussian prior on $\pi_n$.

BPM Learning

Hybrid Monte Carlo is an MCMC method that uses gradient information.

Hybrid Monte Carlo simulates dynamics of a system with continuous state variable $\phi$ on an energy function:

$E(\phi) = -\log p(X, \phi)$

The gradients of the energy function, $\nabla_\phi E(\phi)$, provide forces on the state variables which encourage the system to find high probability regions, while maintaining detailed balance.

Bregman Divergence

$D_F(p, q) = F(p) - F(q) - \langle \nabla F(q), \, p - q \rangle$

F is a strictly convex function, p and q are points

Intuitively, this is the difference between the value of F at p and the value of the first order Taylor expansion of F around q, evaluated at p.
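The definition translates directly into code. The sketch below uses two standard choices of F as examples: the squared Euclidean norm, which gives the squared Euclidean distance, and the negative entropy, which gives the KL divergence when p and q are normalized. The helper names are illustrative.

```python
import numpy as np

def bregman_divergence(F, grad_F, p, q):
    """F(p) minus the first order Taylor expansion of F around q, evaluated at p."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return F(p) - F(q) - grad_F(q) @ (p - q)

# F(x) = ||x||^2 gives the squared Euclidean distance
def sq_norm(x):
    return x @ x

def sq_norm_grad(x):
    return 2.0 * x

# F(x) = sum_i x_i log x_i gives KL(p || q) for normalized p and q
def neg_entropy(x):
    return np.sum(x * np.log(x))

def neg_entropy_grad(x):
    return np.log(x) + 1.0

p = np.array([0.2, 0.8])
q = np.array([0.5, 0.5])
d_euclid = bregman_divergence(sq_norm, sq_norm_grad, p, q)
d_kl = bregman_divergence(neg_entropy, neg_entropy_grad, p, q)
```

Note that the divergence is generally not symmetric in p and q, as the KL example shows.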

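LDA Review

1. For z=1…K,

   Draw $\beta_z \sim \mathrm{Dir}(\eta)$

2. For d=1…D,

   a) Draw $\theta_d \sim \mathrm{Dir}(\alpha)$

   b) For n=1…$N_d$,

      i. Draw $z_{dn} \sim \mathrm{Mult}(\theta_d)$

      ii. Draw $w_{dn} \sim \mathrm{Mult}(\beta_{z_{dn}})$

$\alpha$, $\eta$: hyperparameters; $\theta$: multinomial parameters for topics

$\beta$: multinomial parameters for words given topics; $w$: words; $z$: topics

K - # topics

$N_d$ - # words in doc

D - # documents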
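The LDA generative steps reviewed above can be sketched as follows. The symbol names (`beta` for per-topic word distributions, `theta_d` for per-document topic proportions, `alpha` and `eta` for the Dirichlet hyperparameters), the vocabulary size `V`, and the function name are assumptions chosen for this illustration.

```python
import numpy as np

def sample_lda_corpus(D, K, V, doc_lengths, alpha=0.5, eta=0.5, seed=0):
    """Draw D documents over a vocabulary of size V from a K-topic LDA sketch."""
    rng = np.random.default_rng(seed)
    # 1. One word distribution per topic
    beta = rng.dirichlet(eta * np.ones(V), size=K)
    docs = []
    for d in range(D):
        # 2a. Topic proportions for document d
        theta_d = rng.dirichlet(alpha * np.ones(K))
        # 2b-i. A discrete topic assignment for every word
        z = rng.choice(K, size=doc_lengths[d], p=theta_d)
        # 2b-ii. Each word drawn from its assigned topic
        w = np.array([rng.choice(V, p=beta[k]) for k in z], dtype=int)
        docs.append(w)
    return docs, beta

docs, beta = sample_lda_corpus(D=5, K=3, V=20, doc_lengths=[10, 12, 8, 15, 9])
```

The per-word topic draw in step 2b-i is the discrete variable that an LDA sampler has to infer for every word, in contrast to the BPM, where all unknowns are continuous.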
