
Page 1:

Virtual Vector Machine for Bayesian Online Classification

Yuan (Alan) Qi, CS & Statistics, Purdue

June, 2009

Joint work with T.P. Minka and R. Xiang

Page 2:

Motivation

• Ubiquitous data streams: emails, stock prices, images from satellites, video surveillance
• How to process a data stream using a small memory buffer and still make accurate predictions?

Page 3:

Outline

• Introduction
• Virtual Vector Machine
• Experimental Results
• Summary

Page 4:

Introduction

• Online learning:
– Update the model and make predictions based on data points received sequentially
– Use a fixed-size memory buffer

Page 5:

Classical online learning

• Classification:
– Perceptron
• Linear regression:
– Kalman filtering

Page 6:

Bayesian treatment

• Monte Carlo methods (e.g., particle filters)
– Difficult for classification models due to high dimensionality
• Deterministic methods:
– Assumed density filtering: Gaussian process classification models (Csato 2002)

Page 7:

Virtual Vector Machine preview

• Two parts:
– Gaussian approximation factors
– Virtual points for non-Gaussian factors
• Virtual points summarize multiple real data points
• Flexible functional forms
• Stored in a data cache with a user-defined size

Page 8:

Outline

• Introduction
• Virtual Vector Machine
• Experimental Results
• Summary

Page 9:

Online Bayesian classification

• Model parameters
• Data from time 1 to T
• Likelihood function at time t
• Prior distribution
• Posterior at time T
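A standard-Bayes sketch of the quantities listed above, in assumed notation (w for the model parameters, (x_t, y_t) for the data point at time t, p_0 for the prior):

    D_T = \{(x_t, y_t)\}_{t=1}^{T}
    p(w \mid D_T) \propto p_0(w) \prod_{t=1}^{T} p(y_t \mid x_t, w) \propto p(w \mid D_{T-1}) \, p(y_T \mid x_T, w)

The second proportionality is the online view: the posterior after the first T-1 points acts as the prior when the T-th point arrives.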

Page 10:

Flipping noise model

• Labeling error rate
• Feature vector scaled by 1 or -1 depending on the label
• Posterior distribution: planes cutting a sphere in the 3-D case
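A hedged sketch of the flipping-noise likelihood, in assumed notation (ε for the labeling error rate, s_t = y_t x_t for the label-scaled feature vector, Θ for the step function):

    p(y_t \mid x_t, w) = \epsilon + (1 - 2\epsilon) \, \Theta(w^\top s_t)

Each factor is flat on either side of the hyperplane w^\top s_t = 0, which is why, under a spherical prior, the posterior looks like planes cutting a sphere in the 3-D case.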

Page 11:

Gaussian approximation by EP

• Each approximate factor approximates the corresponding likelihood term
• Both the prior and the approximate factors have the form of a Gaussian; therefore, the approximate posterior is a Gaussian
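In symbols (assumed notation, with \tilde{t}_t denoting the Gaussian factor approximating the likelihood at time t):

    q(w) \propto p_0(w) \prod_t \tilde{t}_t(w), \qquad \tilde{t}_t(w) \approx p(y_t \mid x_t, w)

Since p_0(w) and every \tilde{t}_t(w) are Gaussian, the approximate posterior q(w) is Gaussian.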

Page 12:

VVM enlarges approximation family

• Virtual points
• Exact form of the original likelihood function (could be more flexible)
• Residual term
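A hedged sketch of the augmented representation, in assumed notation (V for the virtual point set, t(v, w) for the exact likelihood form evaluated at virtual point v, g(w) for the Gaussian residual):

    q(w) \propto p_0(w) \, g(w) \prod_{v \in V} t(v, w)

The virtual-point factors keep the exact, non-Gaussian form of the likelihood, while the single Gaussian residual g(w) absorbs whatever the virtual points do not represent.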

Page 13:

Reduction to Gaussian

• From the augmented representation, we can reduce to a Gaussian by EP smoothing on the virtual points, using the remaining Gaussian factors as the prior
• The resulting approximation is Gaussian too
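Under the same assumed notation, the reduction runs EP over the exact factors t(v, w), v ∈ V, with the Gaussian p_0(w) g(w) acting as the prior:

    q_G(w) \propto p_0(w) \, g(w) \prod_{v \in V} \tilde{t}_v(w)

where each \tilde{t}_v(w) is the Gaussian EP approximation of t(v, w); as a product of Gaussians, q_G(w) is Gaussian and can be used for prediction.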

Page 14:

Cost function for finding virtual points

• Minimizing a cost function in the ADF spirit: the distribution being approximated contains one more nonlinear factor than the approximation can keep
• Maximizing a surrogate function
• Keep informative (non-Gaussian) information in the virtual points

Computationally intractable…


Page 16:

Two basic operations

• Searching over all possible locations for the virtual points: computationally expensive!
• For efficiency, consider only two operations to generate virtual points:
– Eviction: delete the least informative point
– Merging: merge two similar points into one

Page 17:

Eviction

• After adding the new point to the virtual point set:
– Select the point to evict (the least informative one) by maximizing the eviction criterion
– Remove it from the cache
– Update the residual accordingly
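A plausible reading of the residual update, under the assumed notation above (with \tilde{t}_k(w) the Gaussian EP approximation of the evicted factor): the evicted point's exact factor is replaced by its Gaussian approximation and folded into the residual,

    g_{new}(w) \propto g(w) \, \tilde{t}_k(w)

so the evicted point's information is kept in Gaussian form rather than discarded outright.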

Page 18:

Version space for 3-D case

Version space: brown area. EP approximation: red ellipse. Four data points: hyperplanes.

Version space with three points after deleting one point (with the largest margin)

Page 19:

Merging

• Remove the two similar points from the cache
• Insert the merged point into the cache
• Update the residual, where a Gaussian residual term captures the information lost from the original two factors
• Equivalent to replacing the two original factors by the merged factor times the Gaussian residual term
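In symbols (a hedged sketch; v_i, v_j are the two points being merged, v_m the merged point, r(w) the Gaussian residual term):

    t(v_i, w) \, t(v_j, w) \approx t(v_m, w) \, r(w), \qquad g_{new}(w) \propto g(w) \, r(w)

Two exact factors are thus replaced by one exact factor at the merged location plus a Gaussian correction absorbed into the residual.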

Page 20:

Version space for 3-D case

Version space: brown area. EP approximation: red ellipse. Four data points: hyperplanes.

Version space with three points after merging two similar points

Page 21:

Compute residue term

• Inverse ADF: match the moments of the left and right distributions
• Efficiently solved by a Gauss-Newton method as a one-dimensional problem
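A hedged statement of the matching condition, in the assumed notation (q^{\setminus ij} is the current Gaussian approximation with the two merged factors removed): choose v_m and r(w) so that the mean and covariance of

    t(v_i, w) \, t(v_j, w) \, q^{\setminus ij}(w)    and    t(v_m, w) \, r(w) \, q^{\setminus ij}(w)

agree; the slide notes that this can be solved efficiently by a Gauss-Newton method as a one-dimensional problem.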

Page 22:

Algorithm Summary

Page 23:

Classification with random features

• Random feature expansion (Rahimi & Recht, 2007)
• For RBF kernels, we use random Fourier features
• The random frequencies are sampled from a distribution determined by the kernel (a Gaussian for the RBF kernel)
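A minimal numpy sketch of the Rahimi-Recht random Fourier feature map for an RBF kernel; the dimension D, bandwidth gamma, and helper name random_fourier_features are illustrative choices, not values from the talk:

import numpy as np

def random_fourier_features(X, D=100, gamma=1.0, seed=0):
    # Approximate the RBF kernel k(x, x') = exp(-gamma * ||x - x'||^2)
    # with an explicit D-dimensional map z(x), so that z(x).z(x') ~ k(x, x').
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Frequencies drawn from the Fourier transform of the RBF kernel:
    # a zero-mean Gaussian with standard deviation sqrt(2 * gamma).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# The expanded features can then be fed to any linear online classifier,
# e.g. the Bayesian linear model with flipping noise described earlier.
X = np.random.randn(5, 3)
Z = random_fourier_features(X)   # shape (5, 100)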

Page 24:

Outline

• Introduction
• Virtual Vector Machine
• Experimental Results
• Summary

Page 25:

Estimation accuracy of posterior mean

Mean squared error of the estimated posterior mean obtained by EP, the virtual vector machine (VVM), ADF, and window-EP (W-EP). The exact posterior mean is obtained via a Monte Carlo method. The results are averaged over 20 runs.

Page 26:

Online classification (1)

Cumulative prediction error rates of VVM, the sparse online Gaussian process classifier (SOGP), the Passive-Aggressive (PA) algorithm, and the Topmoumoute online natural gradient (NG) algorithm on the Spambase dataset. The size of the virtual point set used by VVM is 30, while the online Gaussian process model has 143 basis points.

Page 27:

Online nonlinear classification (2)

Cumulative prediction error rates of VVM and competing methods on the Thyroid dataset. VVM, PA, and NG use the same random Fourier-Gaussian feature expansion (dimension 100). NG and VVM both use a buffer to cache 10 points, while the online Gaussian process model and the Passive-Aggressive algorithm have 12 and 91 basis points, respectively.

Page 28:

Online nonlinear classification (3)

Cumulative prediction error rates of VVM and the competing methods on the Ionosphere dataset. VVM, PA, and NG use the same random Fourier-Gaussian feature expansion (dimension 100). NG and VVM both use a buffer to cache 30 points, while the online Gaussian process model and the Passive-Aggressive algorithm have 279 and 189 basis points, respectively.

Page 29:

Summary

• Efficient Bayesian online classification
• A small constant space cost
• A smooth trade-off between prediction accuracy and computational cost
• Improved prediction accuracy over alternative methods
• More flexible functional forms for virtual points, and other applications