Incremental Feature Construction for Deep
Learning Using Sparse Auto-Encoder
Mongkol Udommitrak and Boonserm Kijsirikul
Chulalongkorn University, Bangkok, Thailand
Email: [email protected], [email protected]
Abstract—A sparse auto-encoder is one of the effective algorithms for learning features from unlabeled data in deep neural-network learning. In conventional sparse auto-encoder training, for each layer of a deep neural network, all feature units are constructed simultaneously at the beginning, and after training, several similar/redundant features are obtained at the end of the learning process. In this paper, we propose a novel alternative method for learning the features of each layer of the network; our method incrementally constructs features by adding primitive/simple features first and then gradually learning finer/more complicated features. We believe that using our proposed method, a greater variety of features can be obtained, which will improve the performance of the network. We ran experiments on the MNIST data set. The experimental results show that a sparse auto-encoder using our incremental feature construction provides better accuracy than a sparse auto-encoder using the conventional feature construction. Moreover, the shapes of the obtained features contain both primitive strokes/lines and finer curves/more complicated shapes which comprise the digits, as expected.
Index Terms—feature learning, sparse auto-encoder, deep
learning
I. INTRODUCTION
Deep learning has received much attention from many researchers for a decade [1], [2]. Before that, some researchers had found that shallow learning algorithms, such as neural networks with 1, 2, or 3 layers, gave statistically better results than deeper learning algorithms with more than 3 layers [3]. In 2006, however, deep learning was brought back into the interest of researchers. Hinton and Salakhutdinov presented an effective approach for deep learning that uses a stacked auto-encoder based on the concept of the RBM in training [4]. This idea later motivated many researchers to come up with many interesting works, such as the sparse auto-encoder [5], denoising auto-encoder [6], [7], contractive auto-encoder [8], convolutional deep belief networks [9], discriminative recurrent sparse auto-encoder [10], maxout network [11], and saturating auto-encoder [12].
The neural network is the structure that is generally
used for deep learning. The first level of the structure
can be thought of as a feature representation learner
which employs an unsupervised learning process. In the higher levels, more features are constructed from the composition of features in the prior levels, and in the highest level, the learning process is usually fully supervised. As a result, feature learning is a substantial factor that affects the classification accuracy at the top level. This paper proposes a novel method for constructing effective features for the first level of neural networks using sparse auto-encoders.
Our method is based on the idea of incremental feature construction: adding primitive/simple features first and then gradually learning finer/more complicated features. We believe that using our proposed method, a greater variety of features can be obtained, which will improve the performance of the network.
The paper is organized as follows. Section II
describes the sparse auto-encoder. Section III explains
our incremental feature learning. The experimental
results are reported in Section IV. Section V then gives
our conclusion and our future work.
II. SPARSE AUTO-ENCODER
An auto-encoder is a method for constructing a deep
network. It is a multi-layer neural network that
automatically learns features from the data. For each
hidden layer of the network, it transforms the input data
vector x into a low-dimensional feature representation
using a non-linear activation function $a = \mathrm{sigm}(Wx + b)$, where $W$ is a weight matrix and $b$ is a bias vector, given that $\mathrm{sigm}(x) = \frac{1}{1 + e^{-x}}$. Then the auto-encoder reconstructs the output $y$ from the function $y = \mathrm{sigm}(W'a + c)$. We need to optimize $W$ and $b$ so that the difference between $x$ and $y$ (the reconstruction error) is minimized.
Therefore, the cost function is
$$J_{AE}(W, b) = \min_{W, b} \frac{1}{2m} \sum_{i=1}^{m} L\big(x^{(i)}, y^{(i)}\big)$$
Typically, either the squared error function $L(x, y) = (y - x)^2$ or the cross-entropy function $L(x, y) = -[\,x \log(y) + (1 - x)\log(1 - y)\,]$ can be chosen as the loss function.
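To make the definitions above concrete, the following is a minimal NumPy sketch of the encode/decode pass and the reconstruction cost; the function names and calling conventions are illustrative assumptions, not the authors' code.

```python
import numpy as np

def sigm(z):
    # Logistic sigmoid: sigm(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, W, b):
    # Feature representation a = sigm(W x + b)
    return sigm(W @ x + b)

def decode(a, W2, c):
    # Reconstruction y = sigm(W' a + c)
    return sigm(W2 @ a + c)

def squared_error(x, y):
    return np.sum((y - x) ** 2)

def cross_entropy(x, y, eps=1e-12):
    y = np.clip(y, eps, 1.0 - eps)
    return -np.sum(x * np.log(y) + (1.0 - x) * np.log(1.0 - y))

def reconstruction_cost(X, W, b, W2, c, loss=squared_error):
    # J_AE(W, b) = (1 / 2m) * sum_i L(x^(i), y^(i))
    m = X.shape[0]
    total = 0.0
    for x in X:                      # each row of X is one example x^(i)
        total += loss(x, decode(encode(x, W, b), W2, c))
    return total / (2.0 * m)
```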
A sparse auto-encoder is a kind of auto-encoder which adds the condition that most of its feature components are close to zero. To apply the sparsity constraint, a Kullback-Leibler divergence term is added to the cost function
$$J_{sparse}(W, b) = J_{AE}(W, b) + \beta\, \frac{1}{m} \sum_{i=1}^{m} KL\big(\rho \,\|\, \hat{a}^{(i)}\big)$$
where $\beta$ controls the weight of the sparsity penalty term, $\rho$ is a sparsity parameter, and $\hat{a}^{(i)} = \frac{1}{d_a} \sum_{j=1}^{d_a} a_j^{(i)}$ is the average activation of the feature vector for example $i$.
The sparse auto-encoder returns the optimized parameters $\{W, b\}$ such that $a = \mathrm{sigm}(Wx + b)$ is a feature representing the data $x$ and most of the $a_j$ are near zero. In other words, the important information from sparse auto-encoder training is the feature vector of units $a_j$, which has lower dimension than the input data but is still able to capture almost all of the input information and represent the input data with low reconstruction error. The trained features can be used as inputs for classification or as inputs for an auto-encoder in the next level.
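As an illustration of the sparsity term, the sketch below (a hedged example, not the authors' implementation) computes the KL penalty using the per-example average activation $\hat{a}^{(i)}$ from the equation above; the values of $\rho$ and $\beta$ are assumptions.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def kl_divergence(rho, rho_hat, eps=1e-12):
    # KL(rho || rho_hat) between Bernoulli distributions with means rho and rho_hat
    rho_hat = np.clip(rho_hat, eps, 1.0 - eps)
    return (rho * np.log(rho / rho_hat)
            + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))

def sparsity_penalty(X, W, b, rho=0.05, beta=3.0):
    # beta * (1/m) * sum_i KL(rho || a_hat^(i)), where a_hat^(i) is the
    # average activation of the feature vector for example i.
    m = X.shape[0]
    total = 0.0
    for x in X:
        a = sigm(W @ x + b)          # feature units a_j for this example
        total += kl_divergence(rho, np.mean(a))
    return beta * total / m

# The full sparse auto-encoder cost is then
# J_sparse(W, b) = reconstruction_cost(...) + sparsity_penalty(...).
```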
One characteristic of auto-encoder training is that all feature units are constructed simultaneously at the beginning. In other words, we have to set the number of feature units, and all feature units are updated during the learning process. A problem occurs when the number of features is large: training them simultaneously may produce some redundant features, which could affect the performance of the learned model. In this paper, we propose a novel method to improve feature construction by first constructing fundamental/primitive features and then adding additional/more complex features. This method is called "incremental feature construction". The fundamental/simple features, which are the most important features for roughly representing the input data, should be constructed first. Then the additional/finer features are added for fine detail representation. In the next section we discuss how to train using our proposed method of incremental feature construction.
III. INCREMENTAL FEATURE CONSTRUCTION
In this section, we describe the process of sparse
auto-encoder training using the incremental feature
construction. At the beginning, we are given examples $x \in [0,1]^{d_x}$ and define $\{a^{(1)}, a^{(2)}, \ldots, a^{(k)}\}$ as a set of feature groups, where $a^{(i)} \in [0,1]^{d_{a_i}}$, $a^{(1)}$ is the fundamental feature group, and $a^{(2)}$ to $a^{(k)}$ are the groups of additional features. In the first training step (see Fig. 1a for a schematic representation of the process), we encode $x$ by the first group of features $a^{(1)}$ using the function
$$a^{(1)} = \mathrm{sigm}\big(W_1^{(1)} x + b_1^{(1)}\big)$$
where $W_1^{(1)}$ is a $d_{a_1} \times d_x$ weight matrix and $b_1^{(1)}$ is a $d_{a_1}$-dimensional feature bias vector. Then the feature group $a^{(1)}$ is used to decode the reconstruction data $y \in [0,1]^{d_x}$ from the function
$$y = \mathrm{sigm}\big(W_2^{(1)} a^{(1)} + b_2^{(1)}\big)$$
where $W_2^{(1)}$ is a $d_x \times d_{a_1}$ weight matrix and $b_2^{(1)}$ is a $d_x$-dimensional bias vector. After this basic feedforward step, we use an optimization method to update the parameters $\{W_1^{(1)}, W_2^{(1)}, b_1^{(1)}, b_2^{(1)}\}$ to minimize the reconstruction error. It is important that, once the first fundamental feature set is completely learned, the term $W_2^{(1)} a^{(1)} + b_2^{(1)}$ is kept fixed for the training of the next feature group. In the next training step (see Fig. 1b for a schematic representation of the process), we add the second group of features $a^{(2)}$ in the hidden layer and start the feedforward step: the input data is mapped to
$$a^{(2)} = \mathrm{sigm}\big(W_1^{(2)} x + b_1^{(2)}\big)$$
and then $a^{(2)}$ is mapped back to
$$y = \mathrm{sigm}\big(W_2^{(2)} a^{(2)} + b_2^{(2)} + W_2^{(1)} a^{(1)} + b_2^{(1)}\big)$$
where $W_1^{(2)}$, $W_2^{(2)}$ are weight matrices and $b_1^{(2)}$, $b_2^{(2)}$ are bias vectors for $a^{(2)}$ and $y$, respectively. Finally, $\{W_1^{(2)}, W_2^{(2)}, b_1^{(2)}, b_2^{(2)}\}$ are updated and optimized to minimize the reconstruction error. The process is then repeated until $a^{(k)}$ is completely trained.
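The sketch below illustrates one incremental step as described above: the decoder contribution of the already-trained group $a^{(1)}$ is held fixed while the new group's parameters are optimized. SciPy's L-BFGS-B routine is used here as a stand-in for the paper's minFunc L-BFGS, the gradient is left to numerical approximation for brevity, and all shapes and hyper-parameters are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_next_group(X, W1_prev, b1_prev, W2_prev, b2_prev, d_new, seed=0):
    # X: (m, d_x) data; *_prev: parameters of the already-trained group a^(1).
    m, d_x = X.shape
    rng = np.random.default_rng(seed)

    # Fixed decoder contribution W2^(1) a^(1) + b2^(1): computed once, never updated.
    A_prev = sigm(X @ W1_prev.T + b1_prev)           # (m, d_prev)
    fixed = A_prev @ W2_prev.T + b2_prev             # (m, d_x)

    # Pack the new group's parameters {W1^(2), W2^(2), b1^(2), b2^(2)} into one vector.
    shapes = [(d_new, d_x), (d_x, d_new), (d_new,), (d_x,)]
    def unpack(theta):
        parts, i = [], 0
        for s in shapes:
            n = int(np.prod(s))
            parts.append(theta[i:i + n].reshape(s))
            i += n
        return parts

    def cost(theta):
        W1n, W2n, b1n, b2n = unpack(theta)
        A_new = sigm(X @ W1n.T + b1n)                # a^(2) = sigm(W1^(2) x + b1^(2))
        Y = sigm(A_new @ W2n.T + b2n + fixed)        # reconstruction with the frozen a^(1) term
        return np.sum((Y - X) ** 2) / (2.0 * m)

    theta0 = rng.normal(scale=0.01, size=sum(int(np.prod(s)) for s in shapes))
    # A real implementation would supply an analytic gradient (jac=...) for efficiency.
    res = minimize(cost, theta0, method="L-BFGS-B")
    return unpack(res.x)
```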
[Figure 1: schematic network diagrams for panels (a) and (b), showing input X, feature groups a^(1) and a^(2), reconstruction Y, and loss L(X, Y).]
Figure 1. (a) The fundamental feature set $a^{(1)}$ is trained. (b) The additional feature set $a^{(2)}$ is trained after feature set $a^{(1)}$ is completely trained, and feature set $a^{(1)}$ is fixed (the grey units) during the training process of $a^{(2)}$, and so on.
IV. EXPERIMENTAL RESULTS
We ran experiments on the MNIST^1 data set. Each example in this data set is a 28x28 grey-scale image of a handwritten digit. We ran 5-fold cross-validation and used the L-BFGS^2 algorithm from the minFunc^3 package and softmax regression for optimization and classification, respectively. Softmax regression is multi-class logistic regression, used as the supervised learner at the top level of the network. The incremental feature construction was compared to the conventional feature construction; both methods used the sparse auto-encoder with the same initial weights. We set 400 hidden feature units, all of which are trained simultaneously in conventional feature construction. Our proposed approach, on the other hand, was run with three settings: 1) the number of fundamental features is small and the number of additional features increases, i.e. 40, 80, 80, and 200 feature units, respectively; 2) the numbers of fundamental and additional features are the same size, i.e. 100, 100, 100, and 100 feature units, respectively; and 3) the number of fundamental features is higher than those of the additional features, i.e. 200, 80, 80, and 40 feature units, respectively.
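As a hedged sketch of how the classification stage could be reproduced, the code below feeds the trained feature activations to a softmax (multinomial logistic regression) classifier evaluated with 5-fold cross-validation; scikit-learn is used only as a stand-in for the paper's minFunc + softmax setup, and the schedule names are assumptions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Group-size schedules compared against 400 simultaneously trained units;
# each incremental schedule also totals 400 feature units.
SCHEDULES = {
    "small-fundamental-first": [40, 80, 80, 200],
    "equal-groups":            [100, 100, 100, 100],
    "large-fundamental-first": [200, 80, 80, 40],
}

def evaluate_features(A, labels):
    # A: (m, 400) matrix of feature activations from the trained encoder,
    # labels: (m,) digit labels. Softmax regression optimized with L-BFGS,
    # scored with 5-fold cross-validation.
    clf = LogisticRegression(solver="lbfgs", max_iter=1000)
    scores = cross_val_score(clf, A, labels, cv=5)
    return scores.mean(), scores.std()
```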
TABLE I. THE ACCURACY OF THE COMPARED APPROACHES

Method                                               Accuracy (%)
Conventional feature construction                    96.7514 ± 0.1099
Incremental feature construction (40-80-80-200)      96.6466 ± 0.1116
Incremental feature construction (100-100-100-100)   96.8157 ± 0.1088
Incremental feature construction (200-80-80-40)      96.8400 ± 0.1084
^1 http://yann.lecun.com/exdb/mnist/
^2 Gradient descent, L-BFGS, or conjugate gradient can be chosen as the optimization method. In this paper, we choose L-BFGS because it works well for low-dimensional problems [13].
^3 http://www.di.ens.fr/~mschmidt/Software/minFunc.html
Figure 2. (a) The weights obtained from conventional feature construction all look like parts of digits. (b) The weights from 200-80-80-40 incremental feature construction contain fundamental features as the majority, which look like parts of digits, and additional features as the minority, which contain finer details. (c) The weights from 40-80-80-200 incremental feature construction contain fundamental features as the minority and additional features as the majority, but the network produces lower classification accuracy.
The results are shown in Table I. The accuracy of the 200-80-80-40 incremental feature construction is higher than that of the conventional feature construction at the 90% confidence level using the standard paired t-test. However, the accuracy of the 40-80-80-200 incremental feature construction is lower, and this lower accuracy may be due to an insufficient number of fundamental features.
Additionally, we can see that the weights obtained from conventional feature construction all look like parts of digits, and several weights look very similar, as shown in Fig. 2a. This shows the redundancy of the learned features. In contrast, the weights from incremental feature construction can be divided into two groups: (1) fundamental features, which look like parts of digits as the above features from conventional feature construction, and (2) additional features, which contain more details for each class of the data, as shown in Fig. 2b and Fig. 2c.
V. CONCLUSION AND FUTURE WORK
This paper presents an incremental feature construction method using sparse auto-encoders. Instead of constructing all features simultaneously at the beginning as in the conventional method, our method incrementally constructs features by adding primitive/simple features first and then gradually learning finer/more complicated features. The results show that the proposed method, when the number of fundamental features is not too small, can improve the accuracy over the conventional method. Our method also has the advantage that it learns both primitive features and fine features for representing the data; as observed from the weights of the obtained features, the simple features look like parts of digits and the finer features contain more details for representing each class of data.
For our future work, we plan to apply the proposed method to other unsupervised learning algorithms for building deep networks. Furthermore, investigating other approaches for constructing good feature representations is also part of our future plan.
REFERENCES
[1] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, 2009.
[2] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," arXiv:1206.5538v2, 2012.
[3] P. Utgoff and D. Stracuzzi, "Many-layered learning," Neural Computation, vol. 14, pp. 2497-2539, 2002.
[4] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, 2006.
[5] C. Poultney, S. Chopra, and Y. LeCun, "Efficient learning of sparse representations with an energy-based model," in Advances in Neural Information Processing Systems, 2006.
[6] P. Vincent, H. Larochelle, Y. Bengio, and P. A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proc. 25th International Conference on Machine Learning, Helsinki, Finland, 2008.
[7] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," The Journal of Machine Learning Research, vol. 11, 2010.
[8] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, "Contractive auto-encoders: Explicit invariance during feature extraction," in Proc. 28th International Conference on Machine Learning, WA, USA, 2011.
[9] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations," in Proc. 26th International Conference on Machine Learning, Montreal, Canada, 2009.
[10] J. T. Rolfe and Y. LeCun, "Discriminative recurrent sparse auto-encoders," arXiv preprint arXiv:1301.3775, 2013.
[11] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, "Maxout networks," arXiv:1302.4389 [stat.ML], February 2013.
[12] R. Goroshin and Y. LeCun, "Saturating auto-encoder," arXiv preprint arXiv:1301.3577, 2013.
[13] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng, "On optimization methods for deep learning," in Proc. 28th International Conference on Machine Learning, WA, USA, 2011.
Mongkol Udommitrak was born in Bangkok, Thailand. He received the B.Sc. degree in mathematics from Chulalongkorn University, Bangkok, Thailand, in 2009. He is currently an M.Sc. student in the Department of Computer Engineering, Chulalongkorn University. His research interests include machine learning and optimization methods.

Boonserm Kijsirikul is a professor at the Department of Computer Engineering, Chulalongkorn University. His research interests include artificial intelligence, machine learning, and natural language processing.