Incremental Feature Construction for Deep
Learning Using Sparse Auto-Encoder
Mongkol Udommitrak and Boonserm Kijsirikul
Chulalongkorn University, Bangkok, Thailand
Email: [email protected], [email protected]
Abstract—A sparse auto-encoder is one of the effective algorithms for learning features from unlabeled data in deep neural-network learning. In conventional sparse auto-encoder training, for each layer of a deep neural network, all feature units are constructed simultaneously at the beginning, and after training, several similar/redundant features are obtained at the end of the learning process. In this paper, we propose a novel alternative method for learning the features of each layer of the network; our method incrementally constructs features by adding primitive/simple features first and then gradually learning finer/more complicated features. We believe that using our proposed method, a greater variety of features can be obtained, which will improve the performance of the network. We ran experiments on the MNIST data set. The experimental results show that a sparse auto-encoder using our incremental feature construction provides better accuracy than a sparse auto-encoder using the conventional feature construction. Moreover, the shapes of the obtained features contain both primitive strokes/lines and finer curves/more complicated shapes which comprise the digits, as expected.
Index Terms—feature learning, sparse auto-encoder, deep
learning
I. INTRODUCTION
Deep learning has received much attention from many researchers for a decade [1], [2]. Before that, some researchers had found that shallow learning algorithms, such as neural networks with 1, 2, or 3 layers, gave statistically better results than deeper learning algorithms with more than 3 layers [3]. In 2006, however, deep learning was brought back into the interest of researchers. Hinton and Salakhutdinov presented an effective approach for deep learning that uses a stacked auto-encoder based on the concept of the RBM in training [4]. This idea later motivated many researchers to come up with many interesting works, such as the sparse auto-encoder [5], denoising auto-encoder [6], [7], contractive auto-encoder [8], convolutional deep belief networks [9], discriminative recurrent sparse auto-encoder [10], maxout network [11], and saturating auto-encoder [12].
The neural network is the structure that is generally
used for deep learning. The first level of the structure
can be thought of as a feature representation learner
which employs an unsupervised learning process. In the higher levels, more features are constructed from the composition of features in the prior levels, and in the highest level, the learning process is usually fully supervised. As a result, feature learning is a substantial factor that affects the classification accuracy at the top level. This paper proposes a novel method for constructing effective features for the first level of neural networks using sparse auto-encoders.
Our method is based on the idea of incremental feature construction: adding primitive/simple features first and then gradually learning finer/more complicated features. We believe that using our proposed method, a greater variety of features can be obtained, which will improve the performance of the network.
The paper is organized as follows. Section II
describes the sparse auto-encoder. Section III explains
our incremental feature learning. The experimental
results are reported in Section IV. Section V then gives
our conclusion and our future work.
II. SPARSE AUTO-ENCODER
An auto-encoder is a method for constructing a deep
network. It is a multi-layer neural network that
automatically learns features from the data. For each
hidden layer of the network, it transforms the input data
vector x into a low-dimensional feature representation
using a non-linear activation function $a = \mathrm{sigm}(Wx + b)$, where $W$ is a weight matrix and $b$ is a bias vector, given that $\mathrm{sigm}(x) = \frac{1}{1 + e^{-x}}$. Then the auto-encoder reconstructs the output $y$ from the function $y = \mathrm{sigm}(W'a + c)$. We need to optimize $W$ and $b$ so that the difference between $x$ and $y$ (the reconstruction error) is minimized.
Therefore, the cost function is
$$J_{AE}(W, b) = \min_{W, b} \frac{1}{2m} \sum_{i=1}^{m} L\big(x^{(i)}, y^{(i)}\big)$$
Typically, either the squared error function $L(x, y) = (y - x)^2$ or the cross-entropy function $L(x, y) = -[\,x \log(y) + (1 - x)\log(1 - y)\,]$ can be chosen as the loss function.
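To make the definitions above concrete, the following is a minimal NumPy sketch of the encode/decode pass and the reconstruction cost; the function names and calling conventions are illustrative assumptions, not the authors' code.

```python
import numpy as np

def sigm(z):
    # Logistic sigmoid: sigm(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, W, b):
    # Feature representation a = sigm(W x + b)
    return sigm(W @ x + b)

def decode(a, W2, c):
    # Reconstruction y = sigm(W' a + c)
    return sigm(W2 @ a + c)

def squared_error(x, y):
    return np.sum((y - x) ** 2)

def cross_entropy(x, y, eps=1e-12):
    y = np.clip(y, eps, 1.0 - eps)
    return -np.sum(x * np.log(y) + (1.0 - x) * np.log(1.0 - y))

def reconstruction_cost(X, W, b, W2, c, loss=squared_error):
    # J_AE(W, b) = (1 / 2m) * sum_i L(x^(i), y^(i))
    m = X.shape[0]
    total = 0.0
    for x in X:                      # each row of X is one example x^(i)
        total += loss(x, decode(encode(x, W, b), W2, c))
    return total / (2.0 * m)
```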
A sparse auto-encoder is a kind of auto-encoder which adds the condition that most of its feature components are close to zero. To apply the sparsity constraint, a Kullback-Leibler divergence term is added to the cost function
$$J_{sparse}(W, b) = J_{AE}(W, b) + \beta\, \frac{1}{m} \sum_{i=1}^{m} KL\big(\rho \,\|\, \hat{a}^{(i)}\big)$$
where $\beta$ controls the weight of the sparsity penalty term, $\rho$ is a sparsity parameter, and $\hat{a}^{(i)} = \frac{1}{d_a} \sum_{j=1}^{d_a} a_j^{(i)}$ is the average activation of the feature vector for example $i$.
The sparse auto-encoder returns the optimized parameters $\{W, b\}$ such that $a = \mathrm{sigm}(Wx + b)$ is a feature representing the data $x$ and most of the $a_j$ are near zero. In other words, the important information from sparse auto-encoder training is the feature vector of units $a_j$, which has lower dimension than the input data but is still able to capture almost all of the input information and represent the input data with low reconstruction error. The trained features can be used as inputs for classification or as inputs for an auto-encoder in the next level.
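As an illustration of the sparsity term, the sketch below (a hedged example, not the authors' implementation) computes the KL penalty using the per-example average activation $\hat{a}^{(i)}$ from the equation above; the values of $\rho$ and $\beta$ are assumptions.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def kl_divergence(rho, rho_hat, eps=1e-12):
    # KL(rho || rho_hat) between Bernoulli distributions with means rho and rho_hat
    rho_hat = np.clip(rho_hat, eps, 1.0 - eps)
    return (rho * np.log(rho / rho_hat)
            + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))

def sparsity_penalty(X, W, b, rho=0.05, beta=3.0):
    # beta * (1/m) * sum_i KL(rho || a_hat^(i)), where a_hat^(i) is the
    # average activation of the feature vector for example i.
    m = X.shape[0]
    total = 0.0
    for x in X:
        a = sigm(W @ x + b)          # feature units a_j for this example
        total += kl_divergence(rho, np.mean(a))
    return beta * total / m

# The full sparse auto-encoder cost is then
# J_sparse(W, b) = reconstruction_cost(...) + sparsity_penalty(...).
```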
One characteristic of auto-encoder training is that all feature units are constructed simultaneously at the beginning. In other words, we have to set the number of feature units, and all feature units are updated during the learning process. A problem occurs when the number of features is large: training them simultaneously may produce some redundant features, which could affect the performance of the learned model. In this paper, we propose a novel method to improve feature construction by first constructing fundamental/primitive features and then adding additional/more complex features. This method is called "incremental feature construction". The fundamental/simple features, which are the most important features for roughly representing the input data, should be constructed first. Then the additional/finer features are added for fine detail representation. In the next section we discuss how to train using our proposed method of incremental feature construction.
III. INCREMENTAL FEATURE CONSTRUCTION
In this section, we describe the process of sparse
auto-encoder training using the incremental feature
construction. At the beginning, we are given examples $x \in [0,1]^{d_x}$ and define $\{a^{(1)}, a^{(2)}, \ldots, a^{(k)}\}$ as a set of feature groups, where $a^{(i)} \in [0,1]^{d_{a_i}}$, $a^{(1)}$ is the fundamental feature group, and $a^{(2)}$ to $a^{(k)}$ are the groups of additional features. In the first training step (see Fig. 1a for a schematic representation of the process), we encode $x$ by the first group of features $a^{(1)}$ using the function
$$a^{(1)} = \mathrm{sigm}\big(W_1^{(1)} x + b_1^{(1)}\big)$$
where $W_1^{(1)}$ is a $d_{a_1} \times d_x$ weight matrix and $b_1^{(1)}$ is a $d_{a_1}$-dimensional feature bias vector. Then the feature group $a^{(1)}$ is used to decode the reconstruction data $y \in [0,1]^{d_x}$ from the function
$$y = \mathrm{sigm}\big(W_2^{(1)} a^{(1)} + b_2^{(1)}\big)$$
where $W_2^{(1)}$ is a $d_x \times d_{a_1}$ weight matrix and $b_2^{(1)}$ is a $d_x$-dimensional bias vector. After this basic feedforward step, we use an optimization method to update the parameters $\{W_1^{(1)}, W_2^{(1)}, b_1^{(1)}, b_2^{(1)}\}$ to minimize the reconstruction error. It is important that, once the first fundamental feature set is completely learned, the term $W_2^{(1)} a^{(1)} + b_2^{(1)}$ is kept fixed for the training of the next feature group. In the next training step (see Fig. 1b for a schematic representation of the process), we add the second group of features $a^{(2)}$ in the hidden layer and start the feedforward step: the input data is mapped to
$$a^{(2)} = \mathrm{sigm}\big(W_1^{(2)} x + b_1^{(2)}\big)$$
and then $a^{(2)}$ is mapped back to
$$y = \mathrm{sigm}\big(W_2^{(2)} a^{(2)} + b_2^{(2)} + W_2^{(1)} a^{(1)} + b_2^{(1)}\big)$$
where $W_1^{(2)}$, $W_2^{(2)}$ are weight matrices and $b_1^{(2)}$, $b_2^{(2)}$ are bias vectors for $a^{(2)}$ and $y$, respectively. Finally, $\{W_1^{(2)}, W_2^{(2)}, b_1^{(2)}, b_2^{(2)}\}$ are updated and optimized to minimize the reconstruction error. The process is then repeated until $a^{(k)}$ is completely trained.
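The sketch below illustrates one incremental step as described above: the decoder contribution of the already-trained group $a^{(1)}$ is held fixed while the new group's parameters are optimized. SciPy's L-BFGS-B routine is used here as a stand-in for the paper's minFunc L-BFGS, the gradient is left to numerical approximation for brevity, and all shapes and hyper-parameters are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_next_group(X, W1_prev, b1_prev, W2_prev, b2_prev, d_new, seed=0):
    # X: (m, d_x) data; *_prev: parameters of the already-trained group a^(1).
    m, d_x = X.shape
    rng = np.random.default_rng(seed)

    # Fixed decoder contribution W2^(1) a^(1) + b2^(1): computed once, never updated.
    A_prev = sigm(X @ W1_prev.T + b1_prev)           # (m, d_prev)
    fixed = A_prev @ W2_prev.T + b2_prev             # (m, d_x)

    # Pack the new group's parameters {W1^(2), W2^(2), b1^(2), b2^(2)} into one vector.
    shapes = [(d_new, d_x), (d_x, d_new), (d_new,), (d_x,)]
    def unpack(theta):
        parts, i = [], 0
        for s in shapes:
            n = int(np.prod(s))
            parts.append(theta[i:i + n].reshape(s))
            i += n
        return parts

    def cost(theta):
        W1n, W2n, b1n, b2n = unpack(theta)
        A_new = sigm(X @ W1n.T + b1n)                # a^(2) = sigm(W1^(2) x + b1^(2))
        Y = sigm(A_new @ W2n.T + b2n + fixed)        # reconstruction with the frozen a^(1) term
        return np.sum((Y - X) ** 2) / (2.0 * m)

    theta0 = rng.normal(scale=0.01, size=sum(int(np.prod(s)) for s in shapes))
    # A real implementation would supply an analytic gradient (jac=...) for efficiency.
    res = minimize(cost, theta0, method="L-BFGS-B")
    return unpack(res.x)
```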
[Figure 1: schematic network diagrams for panels (a) and (b), showing input X, feature groups a^(1) and a^(2), reconstruction Y, and loss L(X, Y).]
Figure 1. (a) The fundamental feature set $a^{(1)}$ is trained. (b) The additional feature set $a^{(2)}$ is trained after feature set $a^{(1)}$ is completely trained, and feature set $a^{(1)}$ is fixed (the grey units) during the training process of $a^{(2)}$, and so on.
IV. EXPERIMENTAL RESULTS
We ran experiments on the MNIST^1 data set. Each example in this data set is a 28x28 grey-scale image of a handwritten digit. We ran 5-fold cross-validation and used the L-BFGS^2 algorithm from the minFunc^3 package and softmax regression for optimization and classification, respectively. Softmax regression is multi-class logistic regression, used as the supervised learner at the top level of the network. The incremental feature construction was compared to the conventional feature construction; both methods used the sparse auto-encoder with the same initial weights. We set 400 hidden feature units, all of which are trained simultaneously in conventional feature construction. Our proposed approach, on the other hand, was run with three settings: 1) the number of fundamental features is small and the number of additional features increases, i.e. 40, 80, 80, and 200 feature units, respectively; 2) the numbers of fundamental and additional features are the same size, i.e. 100, 100, 100, and 100 feature units, respectively; and 3) the number of fundamental features is higher than those of the additional features, i.e. 200, 80, 80, and 40 feature units, respectively.
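As a hedged sketch of how the classification stage could be reproduced, the code below feeds the trained feature activations to a softmax (multinomial logistic regression) classifier evaluated with 5-fold cross-validation; scikit-learn is used only as a stand-in for the paper's minFunc + softmax setup, and the schedule names are assumptions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Group-size schedules compared against 400 simultaneously trained units;
# each incremental schedule also totals 400 feature units.
SCHEDULES = {
    "small-fundamental-first": [40, 80, 80, 200],
    "equal-groups":            [100, 100, 100, 100],
    "large-fundamental-first": [200, 80, 80, 40],
}

def evaluate_features(A, labels):
    # A: (m, 400) matrix of feature activations from the trained encoder,
    # labels: (m,) digit labels. Softmax regression optimized with L-BFGS,
    # scored with 5-fold cross-validation.
    clf = LogisticRegression(solver="lbfgs", max_iter=1000)
    scores = cross_val_score(clf, A, labels, cv=5)
    return scores.mean(), scores.std()
```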
TABLE I. THE ACCURACY OF THE COMPARED APPROACHES

Method                                               Accuracy (%)
Conventional feature construction                    96.7514 ± 0.1099
Incremental feature construction (40-80-80-200)      96.6466 ± 0.1116
Incremental feature construction (100-100-100-100)   96.8157 ± 0.1088
Incremental feature construction (200-80-80-40)      96.8400 ± 0.1084
^1 http://yann.lecun.com/exdb/mnist/
^2 Gradient descent, L-BFGS, or conjugate gradient can be chosen as the optimization method. In this paper, we choose L-BFGS because it works well for low-dimensional problems [13].
^3 http://www.di.ens.fr/~mschmidt/Software/minFunc.html
Figure 2. (a) The weights obtained from conventional feature construction all look like parts of digits. (b) The weights from 200-80-80-40 incremental feature construction contain fundamental features as the majority, which look like parts of digits, and additional features as the minority, which contain finer details. (c) The weights from 40-80-80-200 incremental feature construction contain fundamental features as the minority and additional features as the majority, but the network produces lower classification accuracy.
The results are shown in Table I. The accuracy of the 200-80-80-40 incremental feature construction is higher than that of the conventional feature construction at the 90% confidence level using the standard paired t-test. However, the accuracy of the 40-80-80-200 incremental feature construction is lower, and this lower accuracy may be due to an insufficient number of fundamental features.
Additionally, we can see that the weights obtained from conventional feature construction all look like parts of digits, and several weights look very similar, as shown in Fig. 2a. This shows the redundancy of the learned features. In contrast, the weights from incremental feature construction can be divided into two groups: (1) fundamental features, which look like parts of digits as the above features from conventional feature construction, and (2) additional features, which contain more details for each class of the data, as shown in Fig. 2b and Fig. 2c.
V. CONCLUSION AND FUTURE WORK
This paper presents an incremental feature construction method using sparse auto-encoders. Instead of constructing all features simultaneously at the beginning as in the conventional method, our method incrementally constructs features by adding primitive/simple features first and then gradually learning finer/more complicated features. The results show that the proposed method, when the number of fundamental features is not too small, can improve the accuracy over the conventional method. Our method also has the advantage that it learns both primitive features and fine features for representing the data; as observed from the weights of the obtained features, the simple features look like parts of digits and the finer features contain more details for representing each class of data.
For our future work, we plan to apply the proposed method to other unsupervised learning algorithms for building deep networks. Furthermore, investigating other approaches for constructing good feature representations is also part of our future plan.
REFERENCES
[1] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, 2009.
[2] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," arXiv:1206.5538v2, 2012.
[3] P. Utgoff and D. Stracuzzi, "Many-layered learning," Neural Computation, vol. 14, pp. 2497-2539, 2002.
[4] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, 2006.
[5] C. Poultney, S. Chopra, and Y. LeCun, "Efficient learning of sparse representations with an energy-based model," in Advances in Neural Information Processing Systems, 2006.
[6] P. Vincent, H. Larochelle, Y. Bengio, and P. A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proc. 25th International Conference on Machine Learning, Helsinki, Finland, 2008.
[7] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," The Journal of Machine Learning Research, vol. 11, 2010.
[8] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, "Contractive auto-encoders: Explicit invariance during feature extraction," in Proc. 28th International Conference on Machine Learning, WA, USA, 2011.
[9] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations," in Proc. 26th International Conference on Machine Learning, Montreal, Canada, 2009.
[10] J. T. Rolfe and Y. LeCun, "Discriminative recurrent sparse auto-encoders," arXiv preprint arXiv:1301.3775, 2013.
[11] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, "Maxout networks," arXiv:1302.4389 [stat.ML], February 2013.
[12] R. Goroshin and Y. LeCun, "Saturating auto-encoder," arXiv preprint arXiv:1301.3577, 2013.
[13] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng, "On optimization methods for deep learning," in Proc. 28th International Conference on Machine Learning, WA, USA, 2011.
Mongkol Udommitrak was born in Bangkok, Thailand. He received the B.Sc. degree in mathematics from Chulalongkorn University, Bangkok, Thailand, in 2009. He is currently an M.Sc. student in the Department of Computer Engineering, Chulalongkorn University. His research interests include machine learning and optimization methods.

Boonserm Kijsirikul is a professor at the Department of Computer Engineering, Chulalongkorn University. His research interests include artificial intelligence, machine learning, and natural language processing.