Classification for High Dimensional Problems Using Bayesian Neural Networks and Dirichlet Diffusion Trees
Radford M. Neal and Jianguo Zhang, the winners of the NIPS 2003 feature selection challenge
University of Toronto
The results
• Combination of Bayesian neural networks and classification based on Bayesian clustering with a Dirichlet diffusion tree model.
• A Dirichlet diffusion tree method is used for Arcene.
• Bayesian neural networks (as in BayesNN-large) are used for Gisette, Dexter, and Dorothea.
• For Madelon, the class probabilities from a Bayesian neural network and from a Dirichlet diffusion tree method are averaged, then thresholded to produce predictions.
Their General Approach
Use simple techniques to reduce the computational difficulty of the problem, then apply more sophisticated Bayesian methods.
– The simple techniques: PCA and feature selection by significance tests.
– Bayesian neural networks.
– Automatic Relevance Determination.
(I) First level feature reduction
Feature selection using significance tests (first level)
An initial feature subset was found by simple univariate significance tests (correlation coefficient, symmetrical uncertainty).
Assumption: relevant variables will be at least somewhat relevant on their own.
For all tests, a p-value was found by comparing to the distribution obtained when permuting the class labels.
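The permutation idea above can be sketched for the correlation-coefficient test: shuffle the labels many times and see how often a shuffled correlation beats the observed one. This is an illustrative sketch, not the authors' code; the function name and threshold are assumptions.

```python
import numpy as np

def permutation_p_value(x, y, n_perm=200, rng=None):
    """Estimate a p-value for the |correlation| between feature x and
    class labels y by permuting the labels (univariate significance test).
    Illustrative sketch: names and defaults are assumptions."""
    rng = np.random.default_rng(rng)
    observed = abs(np.corrcoef(x, y)[0, 1])
    count = 0
    for _ in range(n_perm):
        # Permuting labels breaks any real feature-label association,
        # giving a sample from the null distribution of the statistic.
        if abs(np.corrcoef(x, rng.permutation(y))[0, 1]) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)   # add-one to avoid p = 0
```

A feature subset would then be formed by keeping features whose p-value falls below some chosen cutoff.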
Dimensionality reduction with PCA (an alternative to feature selection)
There are probably better dimensionality reduction methods than PCA, but that's what we used. One reason is that it's feasible even when p is huge, provided n is not too large: the time required is of order min(pn², np²).
PCA was done using all the data (training, validation, and test).
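The min(pn², np²) cost comes from working with whichever of the two Gram matrices is smaller; the thin SVD of the centered data achieves the same thing and stays feasible when p is huge but n is modest. A minimal sketch (function name is an assumption):

```python
import numpy as np

def pca_scores(X, k):
    """Project the rows of X (n cases, p features) onto the top-k
    principal components via a thin SVD of the centered data.
    Cost is about min(p*n^2, n*p^2), so huge p is fine if n is modest."""
    Xc = X - X.mean(axis=0)                            # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # thin SVD, n << p ok
    return Xc @ Vt[:k].T                               # n-by-k PC scores
```

In the challenge entry the PCA was computed on all cases (training, validation, and test together), which is legitimate because it uses no label information.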
(II) Building the learning model & second level feature selection
Bayesian Neural Networks
Conventional neural network learning
Bayesian Neural Network Learning
Based on the statistical interpretation of conventional neural network learning.
Bayesian Neural Network Learning
Bayesian predictions are found by integration rather than maximization. For a test case x, y is predicted by

p(y | x, D) = ∫ p(y | x, θ) p(θ | D) dθ
A conventional neural network considers only the parameters with maximum posterior probability.
A Bayesian neural network considers all possible parameters in the parameter space.
This can be implemented by Gaussian approximation or by MCMC.
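With MCMC, the predictive integral is approximated by averaging the network's output over posterior samples of the parameters. A toy sketch, using a logistic model in place of a full network (function name and model are assumptions):

```python
import numpy as np

def predict_bayes(x, theta_samples):
    """Monte Carlo approximation of the Bayesian predictive probability:
    average p(y=1 | x, theta) over MCMC draws theta_1..theta_K.
    Here each theta is a weight vector for a toy logistic model; in the
    real method the thetas are full network weights drawn by MCMC."""
    probs = [1.0 / (1.0 + np.exp(-(x @ theta))) for theta in theta_samples]
    return float(np.mean(probs))   # p(y=1 | x, data), averaged over draws
```

Averaging over many plausible parameter settings, rather than committing to one MAP estimate, is what distinguishes this from conventional training.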
ARD Prior
Still remember weight decay?
How? (by optimizing the decay parameters)
– Associate the weights from each input with a decay parameter.
– There are theories for optimizing the decays.
Result: if an input feature x is irrelevant, its relevance hyperparameter β = 1/a will tend to be small, forcing the weights from that input to be near zero.
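The ARD prior can be sketched as a Gaussian prior on the first-layer weights whose variance is controlled per input: weights leaving input i are N(0, 1/a_i), so a large a_i (small relevance β = 1/a_i) shrinks everything that input feeds into. The function below is an illustrative sketch of that log-density, not the authors' implementation (in the real method the a_i are themselves sampled by MCMC):

```python
import numpy as np

def ard_log_prior(W, alpha):
    """Log density (up to an additive constant) of an ARD prior.
    W: (n_inputs, n_hidden) first-layer weight matrix.
    alpha[i]: precision for all weights leaving input i, i.e. each
    weight W[i, j] ~ N(0, 1/alpha[i]). Large alpha[i] => strong shrinkage."""
    return sum(0.5 * W.shape[1] * np.log(alpha[i])     # normalizer term
               - 0.5 * alpha[i] * np.sum(W[i] ** 2)    # quadratic penalty
               for i in range(W.shape[0]))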
Some Strong Points of This Algorithm
Bayesian learning integrates over the posterior distribution for the network parameters, rather than picking a single "optimal" set of parameters. This further helps to avoid overfitting.
ARD can be used to adjust the relevance of input features.
We can use priors to incorporate external knowledge.
Dirichlet Diffusion Trees
A Bayesian hierarchical clustering method.
The methods
BayesNN-small: features selected using significance tests.
BayesNN-large: principal components.
BayesNN-DFT-combo: the class probabilities from a Bayesian neural network and from a Dirichlet diffusion tree method are averaged, then thresholded to produce predictions.
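The combo rule is simple enough to sketch directly: average the two models' class-1 probabilities, then threshold. The function name and the 0.5 threshold are assumptions for illustration.

```python
import numpy as np

def combo_predict(p_nn, p_ddt, threshold=0.5):
    """BayesNN-DFT-combo style prediction: average the class probabilities
    from the Bayesian neural network and the Dirichlet diffusion tree
    model, then threshold to get hard 0/1 predictions. Sketch only."""
    p = (np.asarray(p_nn) + np.asarray(p_ddt)) / 2.0
    return (p > threshold).astype(int)
```

Averaging probabilities (rather than votes) lets a confident model outweigh an uncertain one before the final threshold is applied.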
About the datasets
The results
•http://www.nipsfsc.ecs.soton.ac.uk/
Thanks.
Any Questions?