
Page 1:

Nonparametric Bayesian Classification

Marc A. Coram, University of Chicago

http://galton.uchicago.edu/~coram

Persi Diaconis Steve Lalley

Page 2:

Related Approaches

• Chipman, George, McCulloch: Bayesian CART (1998a, b)
  • Nested, CART-like
  • Coordinate-aligned splits
  • Good "search" ability
• Denison, Mallick, Smith
  • Bayesian CART
  • Bayesian splines and "MARS"

Page 3:

Outline

• Medical example

• Theoretical framework

• Bayesian proposal

• Implementation

• Simulation experiments

• Theoretical results

• Extensions to a general setting

Page 4:

Example: AIDS Data (1-dimensional)

• AIDS patients

• Covariate of interest: viral resistance level in blood sample

• Goal: estimate conditional probability of response

Page 5:

Idealized Setting

(X, Y) iid pairs

X (covariate): X ∈ [0,1]

Y (response): Y ∈ {0,1}

f_0 (true parameter): f_0(x) = P(Y = 1 | X = x)

What, then, is a straightforward way to proceed, thinking like a Bayesian?

Page 6:

Prior on f: 1-dimension

• Pick a non-negative integer M at random. Say, choose M = 0 with prob 1/2, M = 1 with prob 1/4, M = 2 with prob 1/8, ...
• Conditional on M = m, randomly choose a step function from [0,1] into [0,1] with m jumps
• (i.e., locate the m jumps and the (m+1) values independently and uniformly; a sampling sketch follows below)
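A minimal NumPy sketch of drawing from this prior, assuming exactly the geometric weights and uniform draws above (the function names are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prior_step_function(rng):
    """Draw one step function f: [0,1] -> [0,1] from the prior.

    M is geometric: P(M = m) = 2^{-(m+1)}, m = 0, 1, 2, ...
    Given M = m: m jump locations and m+1 values, all iid U(0,1).
    """
    m = rng.geometric(0.5) - 1          # numpy's geometric starts at 1
    u = np.sort(rng.uniform(size=m))    # jump locations
    v = rng.uniform(size=m + 1)         # value on each of the m+1 intervals

    def f(x):
        # searchsorted finds which of the m+1 intervals each x falls in
        return v[np.searchsorted(u, x)]

    return f

f = sample_prior_step_function(rng)
print(f(np.array([0.1, 0.5, 0.9])))
```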

Page 7:

Perspective

• Simple prior on stepwise functions
• Functions are parameterized by:

  m ∈ {0, 1, 2, ...}  (number of jumps; m+1 regions)
  u ∈ [0,1]^m  (jump locations)
  v ∈ [0,1]^{m+1}  (function values)

• Goal: get samples from the posterior; average to estimate the posterior mean curve
• Idea: use MCMC, but prefer analytical calculations whenever possible

Page 8:

Observations

• The joint distribution of U, V, and the data has density proportional to:

  2^{-(m+1)} \prod_{j=1}^{m+1} v_j^{n_j^1(u,x,y)} (1 - v_j)^{n_j^0(u,x,y)}

  where n_j^1(u,x,y) = # heads in the j-th interval and n_j^0(u,x,y) = # tails in the j-th interval.

• Conditional on u, the counts are sufficient for v.

Page 9:

Observations II

The marginal of the posterior on U has density proportional to:

  2^{-(|u|+1)} \prod_{j=1}^{|u|+1} \beta(n_j^1, n_j^0)

Conditional on U = u and the data, the V_j are independent Beta random variables:

  V_j \mid (u, \text{data}) \sim \text{Beta}(n_j^1 + 1, \, n_j^0 + 1)

and

  E(V_j \mid u, \text{data}) = \frac{n_j^1 + 1}{n_j^0 + n_j^1 + 2}

where:

  \beta(a, b) = \int_0^1 u^a (1-u)^b \, du = \frac{a! \, b!}{(a+b+1)!}
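These formulas are what make MCMC on u alone cheap: the v's integrate out in closed form. A hedged sketch of the resulting log density, using SciPy's betaln and noting that the slide's β(a, b) equals the standard Beta function B(a+1, b+1):

```python
import numpy as np
from scipy.special import betaln  # log of the standard Beta function B(a, b)

def log_marginal_posterior(u, x, y):
    """Log of the (unnormalized) marginal posterior density of u.

    Implements 2^{-(m+1)} * prod_j beta(n_j^1, n_j^0) on the log scale,
    where beta(a, b) = a! b! / (a+b+1)! = B(a+1, b+1).
    """
    u = np.sort(np.asarray(u))
    m = len(u)
    bins = np.searchsorted(u, x)      # interval index 0..m for each x_i
    logp = -(m + 1) * np.log(2.0)
    for j in range(m + 1):
        yj = y[bins == j]
        n1 = yj.sum()                 # heads in the j-th interval
        n0 = len(yj) - n1             # tails in the j-th interval
        logp += betaln(n1 + 1, n0 + 1)
    return logp
```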

Page 10:

Consequently…

• In principle:
  • We put a prior on piecewise-constant curves
  • The curves are specified by:
    • u, a vector in [0,1]^m
    • v, a vector in [0,1]^{m+1}
    • for some m
  • We sample curves from the posterior using MCMC
  • We take the posterior mean (pointwise) of the sampled curves
• In practice:
  • We need only sample from the posterior on u
  • We can then compute the conditional mean of all the curves with this u

Page 11:

Implementation

• Build a reversible base chain to sample U from the prior
  • E.g., start with an empty vector and add, delete, and move coordinates randomly
• Apply Metropolis-Hastings to construct a new chain which samples from the posterior on U
• Compute:

  \hat f_u(x) = E(f(x) \mid u, \text{data}) = \sum_{j=1}^{m+1} \frac{n_j^1 + 1}{n_j^0 + n_j^1 + 2} \, 1_{I_j}(x)

  \hat f(x) = E(\hat f_U(x) \mid \text{data}) \approx \text{average of } \hat f_u(x) \text{ over the sampled } u\text{'s}

(A sampler sketch follows below.)
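A rough sketch of such a sampler, reusing the log_marginal_posterior helper from the previous sketch. For brevity it treats the add/delete/move proposals as symmetric; a faithful implementation must include the full Hastings correction for the add/delete proposal densities.

```python
import numpy as np

def mh_sample_u(x, y, n_iter=10_000, rng=None):
    """Metropolis-Hastings over jump-location vectors u (a sketch).

    Base moves: add a uniform point, delete a random point, or move one.
    Each proposal is scored against the marginal posterior density of u.
    """
    rng = rng or np.random.default_rng()
    u = np.array([])                          # start from the empty vector (m = 0)
    logp = log_marginal_posterior(u, x, y)
    samples = []
    for _ in range(n_iter):
        move = rng.choice(["add", "delete", "move"])
        if move == "add":
            prop = np.append(u, rng.uniform())
        elif move == "delete" and len(u) > 0:
            prop = np.delete(u, rng.integers(len(u)))
        elif move == "move" and len(u) > 0:
            prop = u.copy()
            prop[rng.integers(len(u))] = rng.uniform()
        else:
            samples.append(u)                 # invalid move: count as a rejection
            continue
        logp_prop = log_marginal_posterior(prop, x, y)
        # NOTE: symmetric-proposal approximation; add/delete moves really
        # need a Hastings ratio term for the proposal densities.
        if np.log(rng.uniform()) < logp_prop - logp:
            u, logp = prop, logp_prop
        samples.append(u)
    return samples
```

Each sampled u then contributes its closed-form curve f̂_u from the formula above; averaging those curves pointwise gives f̂.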

Page 12:

Simulation Experiment (a)

True Posterior Mean

n = 1024

Page 13:

True Posterior Mean

n = 1024

Page 14:

True Posterior Mean

n = 1024

Page 15:

True Posterior Mean

n = 1024

Page 16:

Predictive Probability Surface

Page 17:

Posterior on #-jumps

Page 18:

Stable w.r.t. Prior

Page 19:

Decomposition

Page 20:

Classification and Regression Trees (CART)

• Consider splitting the data into the set with X < x and the set with X > x
• Choose x to maximize the fit
• Recurse on each subset
• "Prune" away splits according to a complexity criterion whose parameter is determined by cross-validation
• Splits that do not "explain" enough variability get pruned off (a sketch follows below)
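For comparison, a short illustration of the CART recipe with scikit-learn, where ccp_alpha plays a role analogous to the cp pruning parameter quoted in the later slides; the data-generating step here is made up for the example and is not the AIDS data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 1-d data in the idealized setting: X ~ U(0,1), Y ~ Bernoulli(f0(X))
rng = np.random.default_rng(0)
x = rng.uniform(size=(1024, 1))
y = rng.binomial(1, np.where(x[:, 0] < 0.5, 0.2, 0.8))

# Grow a tree, then let cross-validation pick the cost-complexity parameter
search = GridSearchCV(
    DecisionTreeClassifier(),
    {"ccp_alpha": np.linspace(0.0, 0.02, 21)},
    cv=5,
)
search.fit(x, y)
tree = search.best_estimator_
p_hat = tree.predict_proba(x)[:, 1]   # estimated P(Y = 1 | X = x)
```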

Page 21:

Simulation Experiment (b): True Posterior Mean, CART

Page 22:

Bagging

• To "bag" an estimator, you treat the estimator as a black box
• Repeatedly generate bootstrap resamples from the data set and run the estimator on these new "data sets"
• Average the resulting estimates (a sketch follows below)
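A generic sketch of that recipe, assuming a scikit-learn-style probability estimator such as the pruned tree above:

```python
import numpy as np
from sklearn.base import clone

def bag(estimator, x, y, n_boot=100, rng=None):
    """Bag a black-box estimator: fit it on bootstrap resamples of
    (x, y) and average the resulting probability estimates."""
    rng = rng or np.random.default_rng()
    n = len(y)
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(n, size=n)            # bootstrap resample
        est = clone(estimator).fit(x[idx], y[idx])
        preds.append(est.predict_proba(x)[:, 1])
    return np.mean(preds, axis=0)                # averaged estimate
```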

Page 23:

Simulation Experiment (c): True Posterior Mean, CART, Bagged CART (full trees)

Page 24:

Simulation Experiment (d): True Posterior Mean, CART, Bagged CART (cp = 0.005)

Page 25:

Simulation Experiment (e): True Posterior Mean, CART, Bagged CART (cp = 0.01)

Page 26:

Simulations 2-10

Page 27:

CART

Bagged CART: cp = 0.01

Page 28:

Bagged Bayes??

Page 29:

Smoothers?

Page 30:

Boosting? (Lasso Stumps)

Page 31:

Dyadic Bayes [Diaconis, Freedman]

Page 32:

Monotone Invariance?

Page 33:

Bayesian Consistency

• Consistent at f_0 if: the posterior probability of N_ε tends to 1 a.s. for any ε > 0, where

  N_\varepsilon = \{ f : \| f - f_0 \|_1 < \varepsilon \}

• Since all f are bounded in L_1, consistency implies a fortiori that:

  \hat f_n \to f_0 \text{ in } L_1, \text{ a.s., as } n \to \infty

Page 34:

Sample Size 8192

Page 35:

Related Work: Diaconis and Freedman (1995)

• Similar hierarchical prior, but:
  • Aggressive splitting
  • Fixed split points
• Strong results:
  • If the prior on K dies off at a specific geometric rate: consistency for all f_0
  • If it dies off just slower than this: the posterior will be inconsistent at f_0 ≡ 1/2
• Consistency results cannot be taken for granted

DF: K ~ π; given K = k, split [0,1] into 2^k equal pieces. (Figure: the k = 3 case.)

Page 36:

Consistency Theorem: Thesis

If (X_i, Y_i), i = 1..n, are drawn iid via

  X ~ U(0,1)
  Y | X = x ~ Bernoulli(f_0(x))

and if π is the specified prior on f, chosen so that the tails of the prior on the hierarchy level M decay like exp(−m log m), then π_n, the posterior, is a consistent estimate of f_0 for any measurable f_0.
Page 37:

Method of Proof

• Barron, Schervish, Wasserman (1999)
• Need to show:
  • Lemma 1: The prior puts positive mass on all Kullback-Leibler information neighborhoods of f_0
  • Choose sieves: F_n = { f : f has no more than n/log(n) splits }
  • Lemma 2: The ε-upper metric entropy of F_n is o(n)
  • Lemma 3: π(F_n^c) decays exponentially

Page 38:

New Result

• Coram and Lalley 2004/5 (hopefully)
  • Consistency holds for any prior with infinite support, if the true function is not identically 1/2
  • Consistency for the 1/2 case depends on the tail decay*
• Proof revolves around a large-deviation question:
  • How does the predictive probability behave as n → ∞ for a model with m = an splits? (0 < a < ∞)
• Proof uses the subadditive ergodic theorem to take advantage of self-similarity in the problem

Page 39:

A Guessing Game

Two mechanisms, each chosen with probability 1/2:

• Flip a fair coin repeatedly
• Pick p in [0,1] at random, then flip that p-coin repeatedly
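Both marginal likelihoods are closed-form: the fair coin gives P(data) = 2^{-n}, while the uniformly drawn p-coin gives P(data) = β(h, t) = h! t!/(n+1)! for h heads and t tails in n flips. A quick illustrative computation of the guesser's posterior (an assumption-laden sketch, not from the slides):

```python
import numpy as np
from scipy.special import betaln

def posterior_prob_fair(h, t):
    """P(fair coin | h heads, t tails), with prior 1/2 on each mechanism.

    P(data | fair)   = 2^{-(h+t)}
    P(data | p-coin) = beta(h, t) = h! t! / (h+t+1)!   (p ~ U(0,1))
    """
    log_fair = -(h + t) * np.log(2.0)
    log_pcoin = betaln(h + 1, t + 1)   # = log(h! t! / (h+t+1)!)
    return 1.0 / (1.0 + np.exp(log_pcoin - log_fair))

print(posterior_prob_fair(32, 32))   # near-balanced flips favor "fair"
print(posterior_prob_fair(50, 14))   # lopsided flips favor the p-coin
```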

Page 40:

n = 64

Page 41:

n = 128

Page 42:

n = 256

Page 43:

n = 512

Page 44:

n = 1024

Page 45:

n = 2048

Page 46:

n = 4096

Page 47:

n = 8192

Page 48:

A Voronoi Prior for [0,1]d

(Figure: five centers, labeled 1-5, and their Voronoi cells V1-V5 partitioning the square.)

Page 49:

A Modified Voronoi Prior for General Spaces

• Choose M, as before
• Draw V = (V_1, V_2, ..., V_k)
  • with each V_j drawn without replacement from an a priori fixed set A
• In practice, I take A = {X_1, ..., X_n}
  • This approximates drawing the V's from the marginal distribution of X (a sketch follows below)
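A sketch of sampling a partition from this modified prior given only the pairwise distance matrix D; the geometric choice of k mirrors the earlier prior, and the details here are assumptions for illustration:

```python
import numpy as np

def sample_voronoi_partition(D, rng):
    """Sample a Voronoi partition from the modified prior (a sketch).

    D is the n x n pairwise distance matrix of the data; centers are drawn
    without replacement from the data themselves (A = {X_1, ..., X_n}),
    and every point is assigned to its nearest center.
    """
    n = D.shape[0]
    k = min(rng.geometric(0.5), n)      # number of cells; P(k) = 2^{-k}
    centers = rng.choice(n, size=k, replace=False)
    labels = centers[np.argmin(D[:, centers], axis=1)]  # nearest center's index
    return centers, labels

# Usage: with X an (n, d) array,
#   D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
```

Note that the sampler touches the covariates only through D, which is what lets the construction apply in general metric spaces, as the next slide argues.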

Page 50:

Discussion

• CON:
  • Not quite Bayesian
    • A depends on the data
• PRO:
  • Only partitions the relevant subspace
  • Applies in general metric spaces
  • Only depends on D, the pairwise distance matrix
  • Intuitive content

Page 51:

Intuition

(Figure: samples from the prior with k parts, for k = 2, 3, 4, 5, 6.)

Page 52:

2-dimensional Simulated Data

Page 53:

Posterior Samples

Page 54:

Posterior Mean

Page 55:

Bagged CART

Page 56:

Weighted Voronoi

Page 57:

Acknowledgements

• Steve Lalley
• Persi Diaconis
• National Science Foundation, Lieberman Fellowship

• AIDS Data: Andrew Zolopa, Howard Rice

Page 59:

Future Directions

• Theoretical
  • Extend theoretical results to a more general setting
  • Tighten results to determine where inconsistency first appears
  • Determine the rate of convergence
• Practical
  • Refine MCMC mixing using better simulated tempering
  • Improve computational speed
  • Explore weighted Voronoi and "smoothed" Voronoi priors
  • Compare with SVMs and boosting
  • Use the posterior to produce confidence statements
Page 60:

Smoothing

Page 61:

Highlights

• Straightforward Bayesian motivation
• Implementation actually works
• Prior can be adjusted to utilize domain knowledge
• Provides a framework for inference
• Compares favorably with CART/bagged CART
• Theoretically tractable
• Targets high-dimensional problems

Page 62:

Background

• Enormous literature
  • Theoretical results starting from the consistency of nearest neighbors
• Methodologies
  • CART
  • Logistic regression
  • Wavelets
  • SVMs
  • Neural nets
• Bayesian literature
  • Bayesian CART
  • Image segmentation
• Bayesian theory
  • Diaconis and Freedman
  • Barron, Schervish, Wasserman
Page 63:

Posterior Calculation (2-dimensional example)

Page 64:

Spatial Adaptation

Stephane Nullins, PRISME

Page 65:

Nonparametric Prior: 1-dimension

1. Pick K = k from π = Geometric(1/2)
2. Partition [0,1] into k intervals
3. Assign each interval j a value S_j, iid U(0,1)

  f(x) = \sum_{j=1}^{k} S_j \, 1_{I_j}(x)

(Figure: [0,1] partitioned into labeled intervals 1-4 with step heights v_1, v_2, v_3.)
Page 66:

Consistency Results (1-dimensional)

Setup:
• X's iid U(0,1)
• Y | X = x ~ Bernoulli(f_0(x))
• π is the prior on k

Result:
• If the tails of π decay geometrically, then for any measurable f_0, π_n is consistent at f_0.

Key tools:
• Kullback-Leibler inequalities, Weierstrass approximation (prior is "dense")
• Sieves (prior is "almost" finite dimensional)
• Upper brackets (prior is "almost" finite)
• Large deviations (each likelihood ratio test is asymptotically powerful)