
Computers and Electrical Engineering 62 (2017) 499–510


Richer feature for image classification with super and sub kernels based on deep convolutional neural network

Pengjie Tang a,b,c, Hanli Wang a,b,∗

a Department of Computer Science and Technology, Tongji University, Shanghai 201804, P. R. China
b Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai 200092, P. R. China
c College of Math and Physics, Jinggangshan University, Ji’an 343009, P. R. China

Article info

Article history:

Received 29 January 2016

Revised 13 January 2017

Accepted 13 January 2017

Available online 20 January 2017

Keywords:

Deep convolutional neural network

Super convolutional kernel

Sub convolutional kernel

Parallel crossing

Image classification

Abstract

Deep convolutional neural network (DCNN) has obtained great success for image classification. However, the principles of the human visual system (HVS) are not fully investigated and incorporated into the current popular DCNN models. In this work, a novel DCNN model named parallel crossing DCNN (PC–DCNN) is designed to simulate the HVS, and the concepts of the super convolutional kernel and the sub convolutional kernel are introduced. Moreover, a multi-scale PC–DCNN (MS-PC-DCNN) framework is designed, with which a batch of PC–DCNN models are deployed and the scores from each PC–DCNN model are fused by weighted average for the final prediction. The experimental results on four public datasets verify the superiority of the proposed model as compared to a number of state-of-the-art models.

© 2017 Elsevier Ltd. All rights reserved.

1. Introduction

Image classification plays an important role in computer vision. Traditionally, the histogram of oriented gradients (HOG) [1], the scale invariant feature transform (SIFT) [2], and other biological features [3] are first extracted from the image. Then, techniques such as principal component analysis (PCA) and linear discriminant analysis (LDA) are usually employed for dimension reduction. Afterwards, the bag of features (BoF) or Fisher vector (FV) approach is applied to encode the descriptors as feature vectors. Finally, the feature vectors are fed to a classifier such as the support vector machine (SVM) to predict the image class. Many effective works, such as spatial pyramid matching (SPM) [4] and sparse coding along with pooling and spatial pyramid matching (Sc+SPM) [5], have emerged and achieved strong performance for image classification. However, these handcrafted features generally carry less semantic and structural information, and the image classification performance can be further improved.

Nowadays, the deep convolutional neural network (DCNN) has attracted a lot of research attention because of its amazing performance [6–8]. It simulates the human visual system (HVS) and the multi-level architecture of the brain. A number of DCNN models (e.g., Alex-Net [6], VGG16 [8], GoogLeNet [7]) have been designed and have obtained astonishing results on a number of visual tasks (e.g., image classification, object detection, human action recognition). In DCNN, the depth is one of the key factors that enhance the discriminative ability of features. Generally speaking, the deeper the model is, the better the performance becomes.

Reviews processed and recommended for publication to the Editor-in-Chief by Guest Editor Dr. M. Senthil Kumar.
∗ Corresponding author.

E-mail address: [email protected] (H. Wang).

http://dx.doi.org/10.1016/j.compeleceng.2017.01.011

0045-7906/© 2017 Elsevier Ltd. All rights reserved.


Fig. 1. Overview of the proposed PC–DCNN and MS-PC-DCNN models: (a) PC-DCNN model; (b) MS-PC-DCNN model.

Many research efforts are devoted to optimizing the DCNN model architecture, such as [9–11]. The works mentioned above achieve great breakthroughs on the task of image classification. However, many of these works pay much more attention to the model depth and neglect the fact that, as the number of layers increases, the model complexity grows dramatically while the performance is not always improved [12]. In addition, most of the current models ignore another fact: the two human eyes have different visual fields, and the types of information they collect are also different.

As we know, visual information enters the brain through two visual pathways and is then combined into more comprehensive information via the optic chiasma. The combined information is finally more discriminative and abstract. Therefore, in this work, we simulate the two eyes with two types of convolutional kernels of different sizes at the bottom layer; the extracted information then forms two streams, each forwarded via the other's pathway. When the features arrive at the top of the proposed model, the two streams are fused, which is similar to the mechanism of the optic chiasma. Based on this process, we design a novel architecture called parallel crossing DCNN (PC–DCNN), as shown in Fig. 1(a).

In practice, the ability of human eyes is limited, and people often have to turn to optical equipment for more information about the objects they want to understand or recognize. For example, people use telescopes to observe the macro features of objects and microscopes to obtain information about microstructure. The more information about the objects can be used, the higher the precision that can be obtained. We simulate this process with multi-scale convolutional kernels and develop the multi-scale PC–DCNN (MS-PC-DCNN) model, as shown in Fig. 1(b). In MS-PC-DCNN, the super convolutional kernel simulates the use of telescopes, and the sub convolutional kernel simulates the use of microscopes. In this way, a batch of trained DCNN models and a few groups of scores are obtained, and we then fuse all the models' scores by computing their weighted average. The experimental results demonstrate that the proposed models greatly improve image classification performance. Meanwhile, the proposed framework is expandable: if we employ more and smaller sub convolutional kernels, more PC–DCNN modules can be generated, and the performance can be further improved.

The main contributions of this work are threefold. First, inspired by the principles of human vision, we propose a novel model with a reasonable architecture and low complexity, called PC–DCNN, for the task of image classification. Second, the concepts of the super convolutional kernel and the sub convolutional kernel are proposed in accordance with the process by which humans observe objects.

Third, the MS-PC-DCNN framework is designed, in which a batch of PC–DCNN models are generated according to the principles of the super convolutional kernel and the sub convolutional kernel.

The rest of this paper is structured as follows. Section 2 reviews related work on DCNN, including its history and several state-of-the-art DCNN architectures. In Section 3, the proposed PC–DCNN and MS-PC-DCNN models are detailed, and the principles of the super convolutional kernel and the sub convolutional kernel are introduced. The experimental results are presented in Section 4. Finally, Section 5 concludes this work.

2. Related works

LeCun et al. [13] present the convolutional neural network (CNN), which constructs convolutional kernels to simulate the human receptive field and employs convolutional operations to filter image patches. Hinton et al. propose the idea of deep learning in 2006 and design the deep belief network (DBN) model, in which a few restricted Boltzmann machines (RBMs) are stacked and a layer-wise learning mechanism is used for training [14].

An astonishing achievement of DCNN was made in the ImageNet competition in 2012, where Krizhevsky et al. designed the Alex-Net model by combining the idea of deep learning with CNN [6]; the classification accuracy reached 84.7% (Top 5) on the ImageNet 2012 dataset [15], outperforming the previous state-of-the-art model (SIFT+FV) by more than 10%. In [9], Zeiler et al. propose an approach that can visualize the features of every layer based on Alex-Net and further improve the classification performance by refining the Alex-Net model. The VGG16 and VGG19 DCNN models are designed based on the conclusion that depth is significant for feature representation [8]. As compared to the VGG models, the GoogLeNet model has lower model complexity and better performance because of its small convolutional kernels and ingenious design, in which a module named Inception is designed for clustering sparse features [7]. In [10], the network in network (NIN) model is designed, which uses repeated convolutional operations before pooling to generate richer image features and reduce the model complexity. The classification accuracies achieved by NIN on the benchmark datasets CIFAR-10 [16] and CIFAR-100 [16] reach 92% and 64.3%, respectively. Other improved models such as All-CNN [11] have also been developed for image classification and perform well. Besides model architecture, a number of works focus on developing new techniques for DCNN. Zeiler et al. propose the stochastic pooling method [17] to eliminate the drawbacks of max pooling and average pooling. Dropout [6], Maxout [18] and DropConnect [19] are designed to prevent over-fitting. In [20], the parametric rectified linear unit (PReLU) method is designed to improve the widely applied ReLU method for neuron activation.

The aforementioned works concentrate on the models themselves or on their optimization, aiming to find and design more reasonable architectures and techniques. As compared with these works, the proposed method further follows the philosophy of the HVS as mentioned in Section 1, so that the parallel crossing mechanism in the process of visual information transfer is explicitly explored, and the super convolutional kernel as well as the sub convolutional kernel are designed for the proposed PC–DCNN and MS-PC-DCNN models. To implement the proposed PC–DCNN and MS-PC-DCNN models, we employ the relatively low-complexity Alex-Net model to verify our ideas.

3. Proposed PC–DCNN model and MS-PC-DCNN model

3.1. Convolutional neural network

Generally, a convolutional neural network consists of convolutional layers, pooling layers and fully connected layers. Given an input feature map $x \in \mathbb{R}^{H \times W \times N}$, where $H$ and $W$ are the height and width of the feature map and $N$ is the number of feature maps in a layer, the convolutional operation can be given by

$$ y_{ij} = (K * x_k)_{ij} + b, \tag{1} $$

where $y_{ij}$ is the output at $(i, j)$, $K$ is the convolutional kernel, $x_k$ is the $k$-th patch in $x$, and $b$ is the bias. Then, max or average pooling is applied for sampling by $x_k = \max(x_k^1, x_k^2, \cdots, x_k^n)$ or $x_k = (x_k^1 + x_k^2 + \cdots + x_k^n)/n$, where $n$ indicates the number of feature map elements within the $k$-th patch. Next, the activation function is employed to enhance the sparseness of the features, with the form $f(y_{ij}) = \max(0, y_{ij}) + \alpha \cdot \min(0, y_{ij})$ if the PReLU method [20] is used, where the parameter $\alpha$ can be updated as

$$ \alpha = \alpha - \eta \frac{\partial \ell}{\partial \alpha}, \tag{2} $$

where $\eta$ is the learning rate and $\ell$ is the cost function, with the cross entropy usually employed as

$$ \ell = -\frac{1}{m} \sum_{i=1}^{m} \left[ t_i \log(o_i) + (1 - t_i) \log(1 - o_i) \right], \tag{3} $$

where $m$ is the number of samples in one iteration, $t_i$ is the ground truth of the $i$-th sample, and $o_i$ is the output of the system. When training a DCNN model, the objective is to minimize $\ell: f(x, w) \mapsto \mathbb{R}$, where $w$ is the set of weights that needs to be optimized.
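To make these operations concrete, the following minimal NumPy sketch (our own illustration, not code from the paper; all function and variable names are assumptions) computes a single-channel valid convolution as in Eq. (1), max/average pooling, the PReLU activation, and the cross-entropy cost of Eq. (3).

```python
import numpy as np

def conv2d_valid(x, K, b=0.0, stride=1):
    """Single-channel 'valid' convolution implementing Eq. (1): y_ij = (K * x_k)_ij + b."""
    kh, kw = K.shape
    H, W = x.shape
    out_h, out_w = (H - kh) // stride + 1, (W - kw) // stride + 1
    y = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            y[i, j] = np.sum(K * patch) + b
    return y

def pool2d(x, size=2, stride=2, mode="max"):
    """Max or average pooling over strided windows."""
    H, W = x.shape
    out_h, out_w = (H - size) // stride + 1, (W - size) // stride + 1
    y = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + size, j * stride:j * stride + size]
            y[i, j] = patch.max() if mode == "max" else patch.mean()
    return y

def prelu(y, alpha=0.25):
    """PReLU activation: f(y) = max(0, y) + alpha * min(0, y)."""
    return np.maximum(0.0, y) + alpha * np.minimum(0.0, y)

def cross_entropy(o, t):
    """Cross-entropy cost of Eq. (3) for outputs o and binary targets t."""
    o = np.clip(o, 1e-12, 1.0 - 1e-12)   # avoid log(0)
    return -np.mean(t * np.log(o) + (1.0 - t) * np.log(1.0 - o))

# Tiny usage example with random data.
x = np.random.rand(11, 11)
feat = prelu(pool2d(conv2d_valid(x, np.random.rand(3, 3), b=0.1)))
loss = cross_entropy(np.array([0.9, 0.2]), np.array([1.0, 0.0]))
```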


3.2. Super convolutional kernel and PC–DCNN model

Suppose that the convolutional kernel has equal length and width. We define $z_l$ as its scale ratio,

$$ z_l = \frac{K_l^c - K_l^s}{K_l^c}, \quad K_l^c \neq K_l^s, \tag{4} $$

where $l$ indexes the $l$-th convolutional layer, $K_l^c$ is the size of the original convolutional kernel and $K_l^s$ is the size of the scaled convolutional kernel. If $z_l < 0$, we have

$$ \frac{K_l^c - K_l^s}{K_l^c} < 0, \tag{5} $$

namely,

$$ K_l^s > K_l^c, \tag{6} $$

and if $K_l^s \gg K_l^c$ under this condition, the convolutional kernel of size $K_l^s$ will be very large. We define this type of kernel as the super convolutional kernel. To obtain super convolutional kernels, a naive approach is to increase the kernel size directly. However, the number of parameters and the time complexity would then increase greatly. Instead, we propose another way to generate super convolutional kernels.

As we know, the size of the feature map in the $l$-th convolutional layer can be calculated by

$$ \begin{cases} M_l^H = \lfloor (M_{l-1}^H - K_l^H)/s_l^H \rfloor + 1, \\ M_l^W = \lfloor (M_{l-1}^W - K_l^W)/s_l^W \rfloor + 1, \end{cases} \tag{7} $$

where $M_l$ and $M_{l-1}$ denote the sizes of the feature maps in the $l$-th and $(l-1)$-th convolutional layers, respectively, $K_l$ is the size of the convolutional kernel in the $l$-th convolutional layer, $s_l$ is the stride of the convolutional operation in the $l$-th convolutional layer, and $H$ and $W$ are the height and width, respectively. For simplicity, we set $H = W$, and Eq. (7) can be simplified as

$$ M_l = \lfloor (M_{l-1} - K_l)/s_l \rfloor + 1. \tag{8} $$

Given $K_l = K_l^c$, and denoting the size of the original feature map $M_{l-1}$ and the stride $s_l$ by $M_{l-1}^c$ and $s_l^c$, respectively, Eq. (8) can be written as

$$ M_l^c = \lfloor (M_{l-1}^c - K_l^c)/s_l^c \rfloor + 1. \tag{9} $$

Then, we increase $s_l^c$ and denote it as $s_l^s$, and the size of the output feature map $M_l^s$ can be calculated by

$$ M_l^s = \lfloor (M_{l-1}^c - K_l^c)/s_l^s \rfloor + 1. \tag{10} $$

Afterwards, since $M_l^s$ can be fixed, we have

$$ K_l^s = M_{l-1}^c - s_l^c (M_l^s - 1). \tag{11} $$

We can see that $M_l^s$ decreases according to Eq. (10). Because $M_{l-1}^c$ and $s_l^c$ are fixed, $K_l^s$ increases based on Eq. (11). Therefore, we use the method of increasing the stride to obtain the super convolutional kernel. In this way, the size of the convolutional kernels is not increased, and the complexity can be restricted. We can further rewrite Eq. (11) as

$$ K_l^s = M_{l-1}^c - s_l^c \big\lfloor (M_{l-1}^c - K_l^c)/s_l^s \big\rfloor. \tag{12} $$

Therefore, the new stride can be calculated by

$$ s_l^s = \big\lfloor (M_{l-1}^c - K_l^c) \, s_l^c / (M_{l-1}^c - K_l^s) \big\rfloor. \tag{13} $$

At the training stage, the weights $w$ and the bias $b$ are updated by the chain rule. At the top of the model, however, we cross the two streams twice, and this process differs from the traditional back propagation (BP) algorithm. Suppose that $g_L$ is the function of the last fully connected layer, which can be denoted as $g_L = f(w \cdot X_L + b)$, where $X_L$ is the neuron vector. At the second crossing layer, the weights and bias can be updated by

$$ w_{L-1} = w_{L-1} - \eta \frac{\partial \ell}{\partial w_{L-1}} = w_{L-1} - \eta \left( \frac{\partial \ell}{\partial g_L} \frac{\partial g_L}{\partial g^A_{L-1}} \frac{\partial g^A_{L-1}}{\partial w^A_{L-1}} + \frac{\partial \ell}{\partial g_L} \frac{\partial g_L}{\partial g^B_{L-1}} \frac{\partial g^B_{L-1}}{\partial w^B_{L-1}} \right), \tag{14} $$

$$ b_{L-1} = b_{L-1} - \eta \frac{\partial \ell}{\partial b_{L-1}} = b_{L-1} - \eta \left( \frac{\partial \ell}{\partial g_L} \frac{\partial g_L}{\partial g^A_{L-1}} \frac{\partial g^A_{L-1}}{\partial b^A_{L-1}} + \frac{\partial \ell}{\partial g_L} \frac{\partial g_L}{\partial g^B_{L-1}} \frac{\partial g^B_{L-1}}{\partial b^B_{L-1}} \right). \tag{15} $$

In a similar way, $w$ and $b$ at the first crossing layer can be updated with

$$ w_{L-2} = w_{L-2} - \eta \frac{\partial \ell}{\partial g_L} \left[ \frac{\partial g_L}{\partial g^A_{L-1}} \left( \frac{\partial g^A_{L-1}}{\partial g^A_{L-2}} \frac{\partial g^A_{L-2}}{\partial w^A_{L-2}} + \frac{\partial g^A_{L-1}}{\partial g^B_{L-2}} \frac{\partial g^B_{L-2}}{\partial w^B_{L-2}} \right) + \frac{\partial g_L}{\partial g^B_{L-1}} \left( \frac{\partial g^B_{L-1}}{\partial g^A_{L-2}} \frac{\partial g^A_{L-2}}{\partial w^A_{L-2}} + \frac{\partial g^B_{L-1}}{\partial g^B_{L-2}} \frac{\partial g^B_{L-2}}{\partial w^B_{L-2}} \right) \right], \tag{16} $$

$$ b_{L-2} = b_{L-2} - \eta \frac{\partial \ell}{\partial g_L} \left[ \frac{\partial g_L}{\partial g^A_{L-1}} \left( \frac{\partial g^A_{L-1}}{\partial g^A_{L-2}} \frac{\partial g^A_{L-2}}{\partial b^A_{L-2}} + \frac{\partial g^A_{L-1}}{\partial g^B_{L-2}} \frac{\partial g^B_{L-2}}{\partial b^B_{L-2}} \right) + \frac{\partial g_L}{\partial g^B_{L-1}} \left( \frac{\partial g^B_{L-1}}{\partial g^A_{L-2}} \frac{\partial g^A_{L-2}}{\partial b^A_{L-2}} + \frac{\partial g^B_{L-1}}{\partial g^B_{L-2}} \frac{\partial g^B_{L-2}}{\partial b^B_{L-2}} \right) \right], \tag{17} $$

where $g^A$ and $g^B$ stand for the transformation functions of Stream A and Stream B, respectively. Similarly, $w^A$ and $b^A$ are the weights and bias of Stream A, and $w^B$ and $b^B$ are those of Stream B.

In our model, the number of parameters does not increase, because the actual size of the convolutional kernels does not increase even though we use the super convolutional kernel. Meanwhile, the number of neurons and the time complexity are both reduced, because the larger convolutional stride produces smaller feature maps. Using a single model with the proposed super convolutional kernel alone, however, does not improve performance, because much detailed information is lost when a long stride is used. Inspired by the HVS, we combine the model with the super convolutional kernel and the model with the original convolutional kernel to form two data transformation streams, simulating the two visual pathways. At the top of the model, the two streams are mixed to simulate the mechanism of the optic chiasma.
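The sketch below (our own PyTorch approximation, not the authors' Caffe configuration) illustrates the parallel crossing idea: two first-layer kernels of the same size but different strides feed two streams whose fully connected outputs are concatenated twice before the classifier, so that autograd propagates the crossed gradients of Eqs. (14)–(17) automatically. Layer widths are simplified, and global average pooling is used only to keep the sketch independent of the input size.

```python
import torch
import torch.nn as nn

class PCStream(nn.Module):
    """One stream: the first-layer stride controls the (super) kernel behaviour."""
    def __init__(self, first_stride):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=first_stride), nn.PReLU(),
            nn.MaxPool2d(3, 2),
            nn.Conv2d(96, 256, kernel_size=5), nn.PReLU(),
            nn.AdaptiveAvgPool2d(1),            # keeps the sketch shape-agnostic
        )
        self.fc1 = nn.Linear(256, 512)          # per-stream FC before the first crossing
        self.fc2 = nn.Linear(2 * 512, 512)      # takes the concatenation of both streams

class PCDCNN(nn.Module):
    """Minimal parallel crossing model: Stream A (stride 4) and Stream B (stride 6)."""
    def __init__(self, num_classes=101):
        super().__init__()
        self.stream_a = PCStream(first_stride=4)
        self.stream_b = PCStream(first_stride=6)
        self.classifier = nn.Linear(2 * 512, num_classes)

    def forward(self, x):
        fa = torch.flatten(self.stream_a.features(x), 1)
        fb = torch.flatten(self.stream_b.features(x), 1)
        a1, b1 = self.stream_a.fc1(fa), self.stream_b.fc1(fb)
        crossed = torch.cat([a1, b1], dim=1)    # first crossing: each stream sees both
        a2, b2 = self.stream_a.fc2(crossed), self.stream_b.fc2(crossed)
        fused = torch.cat([a2, b2], dim=1)      # second crossing before the classifier
        return self.classifier(fused)

logits = PCDCNN()(torch.randn(2, 3, 224, 224))  # e.g. two 224x224 RGB crops
```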

3.3. Sub convolutional kernel and MS-PC-DCNN model

According to Eq. (4), given $z_l > 0$, we have

$$ \frac{K_l^c - K_l^s}{K_l^c} > 0, \tag{18} $$

that is,

$$ K_l^s < K_l^c. \tag{19} $$

If $K_l^s > 0$, $K_l^s$ is simply the size of a smaller traditional convolutional kernel; if such a scaled-down kernel is used directly, the number of parameters and the complexity are reduced, but the number of neurons increases, so more memory space is required. If $K_l^s < 0$, no such convolutional kernel actually exists, and we name this type of kernel the sub convolutional kernel.

In order to obtain sub convolutional kernels, the following two situations are investigated. The first is that $M_{l-1}$ is fixed while the size of the convolutional kernel is increased or decreased; the second is that the size of the convolutional kernel is fixed while $M_{l-1}$ increases or decreases. As illustrated in Fig. 2, the effects of the two situations are equivalent. In the first situation, because the size of the feature map is fixed, the regions covered by different convolutional kernels differ: a larger kernel covers a bigger region of the feature map, while a smaller kernel covers a smaller region. In the second case, if the input feature map becomes smaller, the region covered by a kernel of fixed size becomes relatively larger; similarly, the covered region becomes relatively smaller if the feature map becomes larger. For image classification, the two approaches have the same effect. Therefore, we obtain sub convolutional kernels by increasing the size of the input feature maps while keeping the size of the convolutional kernels fixed.
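As a quick numeric illustration of this equivalence (our own arithmetic, using Model(a)-style numbers rather than values from the paper), enlarging the input while keeping an 11×11 kernel fixed acts like shrinking the kernel on the original input:

```python
# Relative coverage of an 11x11 kernel on inputs of different sizes.
k, m_original, m_enlarged = 11, 224, 280

coverage_enlarged = k / m_enlarged                    # fraction of the 280x280 input covered
equivalent_kernel = coverage_enlarged * m_original    # kernel with the same coverage on 224x224
print(round(equivalent_kernel, 1))                    # 8.8: behaves like a smaller (sub) kernel
```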

Fig. 2. Illustration of the proposed super convolutional kernel and sub convolutional kernel via image decreasing and increasing.

If $M_{l-1}$ increases, $M_l$ also increases according to Eq. (8). Let $K_l = K_l^c$, and denote the increased $M_l$ and $M_{l-1}$ as $(M_l^s)'$ and $(M_{l-1}^s)'$, respectively. We can then compute $(K_l^s)'$ as

$$ (K_l^s)' = (M_{l-1}^s)' - s_l^c \big( (M_l^s)' - 1 \big). \tag{20} $$

If $(K_l^s)' < 0$, then $(K_l^s)'$ is the size of the sub convolutional kernel. However, according to Fig. 2, our goal is not the sub convolutional kernel itself but $(M_{l-1}^s)'$. From Eq. (20), we derive

$$ (M_{l-1}^s)' = (K_l^s)' + s_l^c \big( (M_l^s)' - 1 \big), \tag{21} $$

that is,

$$ (M_{l-1}^s)' = (K_l^s)' + s_l^c \big\lfloor (M_{l-1}^c - K_l^c)/s_l^s \big\rfloor. \tag{22} $$

According to Eq. (4), once the scale ratio $z_l$ and the original convolutional kernel $K_l^c$ are known, $(K_l^s)'$ can be obtained, and thus $(M_{l-1}^s)'$ can be computed with the aforementioned equations. If $l = 1$, $(M_{l-1}^s)' = (M_0^s)'$, which represents the expected size of the input image.
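The following small helper (again our own illustration, with arbitrary example values rather than the configurations of Table 2) simply transcribes Eqs. (8), (20) and (21): enlarging the input map while keeping the kernel and stride fixed yields the implied scaled kernel $(K_l^s)'$, and Eq. (21) runs the relation in reverse to recover the expected input size.

```python
def output_map_size(m_prev, k, s):
    """Eq. (8): M_l = floor((M_{l-1} - K_l) / s_l) + 1."""
    return (m_prev - k) // s + 1

def scaled_kernel_size(m_prev_enlarged, s_c, m_out_enlarged):
    """Eq. (20): (K_l^s)' = (M_{l-1}^s)' - s_l^c ((M_l^s)' - 1)."""
    return m_prev_enlarged - s_c * (m_out_enlarged - 1)

def expected_input_size(k_s_scaled, s_c, m_out_enlarged):
    """Eq. (21): (M_{l-1}^s)' = (K_l^s)' + s_l^c ((M_l^s)' - 1)."""
    return k_s_scaled + s_c * (m_out_enlarged - 1)

# Hypothetical example: kernel 3, stride 2, input map enlarged from 28 to 42.
k_c, s_c = 3, 2
m_out = output_map_size(42, k_c, s_c)               # 20
k_s = scaled_kernel_size(42, s_c, m_out)            # implied scaled kernel from Eq. (20)
assert expected_input_size(k_s, s_c, m_out) == 42   # Eq. (21) inverts Eq. (20)
```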

The ideas discussed above are applied to the proposed PC–DCNN model. By setting different scale ratios, several types of input images with different sizes can be obtained, leading to the generation of more than one PC–DCNN model. In addition, the experimental results show that a single model with the sub convolutional kernel brings no obvious improvement in performance, mainly because a single model misses the macro structural information of the entire object. Therefore, we regard the PC–DCNN model as the basic module, combine multiple basic modules with different sizes of sub convolutional kernels, and compute the weighted average score over these modules. Assume there are $c$ categories in a dataset and $R$ modules in the proposed MS-PC-DCNN model. The feature vector generated by the $i$-th module is denoted as $v_i$. For an image, we define the estimate that it belongs to the $j$-th class as $\hat{sc}_j = P(c_j | v_i) + \varepsilon^i_j(v_i)$, where $P(c_j | v_i)$ is the true probability from the $i$-th module and $\varepsilon^i_j(v_i)$ denotes the estimation error. Because $\varepsilon^i_j(v_i)$ may differ among the modules, we apply the weighted average combination rule

$$ \hat{SC}_j = \frac{1}{R} \sum_{i=1}^{R} \hat{sc}_j = \frac{1}{R} \sum_{i=1}^{R} \beta_i P(c_j | v_i), \tag{23} $$

where $R = 4$ in our model, $\beta_i$ is the weight for the $i$-th PC–DCNN module, and $\hat{SC}_j$ is the final score estimate for the $j$-th class. The experimental results demonstrate that the proposed framework can improve the performance greatly. It is worth noting that our multi-scale method differs from the traditional method [8], which enlarges the original images and still uses fixed crop patches (e.g., 224 × 224).
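A minimal sketch of the fusion rule of Eq. (23) (our own NumPy illustration; the per-module scores and weights below are made up) averages the weighted class-probability vectors of the R modules and takes the arg-max as the final prediction.

```python
import numpy as np

def fuse_scores(module_probs, weights):
    """Eq. (23): weighted average of per-module class probabilities."""
    module_probs = np.asarray(module_probs)   # shape (R, num_classes)
    weights = np.asarray(weights)[:, None]    # shape (R, 1), beta_i per module
    return (weights * module_probs).mean(axis=0)

# Hypothetical scores from R = 4 PC-DCNN modules for a 3-class problem.
probs = [[0.6, 0.3, 0.1],
         [0.5, 0.4, 0.1],
         [0.7, 0.2, 0.1],
         [0.4, 0.4, 0.2]]
fused = fuse_scores(probs, weights=[1.0, 1.0, 1.0, 1.0])
print(fused, fused.argmax())                  # fused class scores and predicted class
```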

4. Experimental results

The image resolution has a big effect on the performance, so we implement two different PC–DCNN models for the datasets with high-resolution images and low-resolution images. For the first situation, we use the Caltech101 [21] and Caltech256 [22] datasets to evaluate our approach. The Caltech101 dataset has more than 9000 images in 101 categories, and each category has 90 images on average. The Caltech256 dataset has a small number of images overlapping with Caltech101, but it contains many more images and categories: more than 30,000 images and 256 categories, with about 120 images per category on average. On each dataset, we repeat the experiments three times and take the mean value. For the second situation, we use the popular CIFAR-10 [16] and CIFAR-100 [16] datasets to evaluate the model. Both CIFAR-10 and CIFAR-100 contain 60,000 color images of size 32 × 32; the training set includes 50,000 images and the rest are used for test.

4.1. Model design

High-resolution images have more pixels than tiny images, and more neurons are required to represent them, leading to higher model complexity.


Table 1. The architectures of Model(a) and Model(b).

Layer type   | Model(a) Stream A | Model(a) Stream B | Model(b) Stream A | Model(b) Stream B
Conv.        | {96@11×11, 4}     | {96@11×11, 6}     | {96@3×3, 1}       | {96@3×3, 2}
PReLU/LRN    | --                | --                | --                | --
Max pooling  | {96@3×3, 2}       | {96@3×3, 2}       | {96@3×3, 2}       | {96@3×3, 2}
Conv.        | {256@5×5, 1}      | {256@5×5, 1}      | {256@3×3, 1}      | {256@3×3, 1}
PReLU/LRN    | --                | --                | --                | --
Max pooling  | {256@3×3, 2}      | {256@3×3, 2}      | {256@3×3, 2}      | {256@3×3, 2}
Conv.        | {384@3×3, 1}      | {384@3×3, 1}      | {384@3×3, 1}      | {384@3×3, 1}
PReLU        | --                | --                | --                | --
Conv.        | {384@3×3, 1}      | {384@3×3, 1}      | {384@3×3, 1}      | {384@3×3, 1}
PReLU        | --                | --                | --                | --
Conv.        | {256@3×3, 1}      | {256@3×3, 1}      | {256@3×3, 1}      | {256@3×3, 1}
PReLU        | --                | --                | --                | --
Max pooling  | {256@3×3, 2}      | {256@3×3, 2}      | {256@3×3, 2}      | {256@3×3, 2}
FC           | 2048              | 2048              | 2048              | 2048
Dropout      | ratio: 0.5        | ratio: 0.5        | ratio: 0.5        | ratio: 0.5
Concat       | Fusion of Streams A and B             | Fusion of Streams A and B
FC           | 2048              | 2048              | 2048              | 2048
Dropout      | ratio: 0.5        | ratio: 0.5        | ratio: 0.7        | ratio: 0.7
Concat       | Fusion of Streams A and B             | Fusion of Streams A and B
FC           | 1024                                  | 1024

Table 2. The size of super convolutional kernels and sub convolutional kernels with different input patches.

Model(a):
Crop size | Stream A (Sub) | Stream B (Super) | Stream B (Sub)
224×224   | --             | 84×84            | --
280×280   | 40×40          | 48×48            | --
336×336   | 100×100        | 8×8              | --
392×392   | 156×156        | --               | 28×28

Model(b):
Crop size | Stream A (Sub) | Stream B (Super) | Stream B (Sub)
28×28     | --             | 16×16            | --
42×42     | 12×12          | 9×9              | --
56×56     | 26×26          | 2×2              | --
70×70     | 40×40          | --               | 5×5

Table 3. Crop size and image size used by Model(a) and Model(b).

Model(a) crop size | Model(a) image size | Model(b) crop size | Model(b) image size
224×224            | 256×256             | 28×28              | 32×32
280×280            | 320×320             | 42×42              | 48×48
336×336            | 384×384             | 56×56              | 64×64
392×392            | 448×448             | 70×70              | 80×80

Table 4. Configuration of MS-PC-DCNN with Model(a).

MS-PC-DCNN | Fusion scale
1-Scale    | PC-DCNN: 256×256
2-Scale    | PC-DCNN: 256×256 + 320×320
3-Scale    | PC-DCNN: 256×256 + 320×320 + 384×384
4-Scale    | PC-DCNN: 256×256 + 320×320 + 384×384 + 448×448

We design Model(a) and Model(b) for the large-image datasets and the tiny-image datasets, respectively, based on the ideas of the super convolutional kernel and the sub convolutional kernel. The model settings are shown in Table 1, where in the format {n1@n2 × n2, s}, n1 is the number of output channels, n2 is the size of the convolutional kernel, and s is the stride. Supposing the size of the original convolutional kernel in the first layer is 11 × 11 in Model(a) and 3 × 3 in Model(b), a series of super convolutional kernels and sub convolutional kernels is generated; the detailed configuration is shown in Table 2. It is worth noting that, in a single PC–DCNN module, super convolutional kernels are always in Stream B and sub convolutional kernels are always in Stream A.

In the proposed MS-PC-DCNN model, we resize each image according to the patch size to be cropped (Table 3). The fusion scales are shown in Tables 4 and 5 for Model(a) and Model(b), respectively, where the format m1 × m1 + ··· + mr × mr denotes that there are r PC–DCNN modules with different scales in the MS-PC-DCNN model.


Table 5. Configuration of MS-PC-DCNN with Model(b).

MS-PC-DCNN | Fusion scale
1-Scale    | PC-DCNN: 32×32
2-Scale    | PC-DCNN: 32×32 + 48×48
3-Scale    | PC-DCNN: 32×32 + 48×48 + 64×64
4-Scale    | PC-DCNN: 32×32 + 48×48 + 64×64 + 80×80

Table 6. Configuration for training.

Parameter type | Model(a) | Model(b) | Policy
Learning rate  | 0.001    | 0.01     | poly
Decay power    | 0.6      | 0.6      | fixed
Weight decay   | 0.0005   | 0.0005   | fixed
Momentum       | 0.9      | 0.9      | fixed
Batch size     | 32       | 50       | fixed

Table 7. Performance comparison of different models on Caltech256 and Caltech101.

Deep model            | Caltech256 (N_train = 60) Acc% (Top1) | Acc% (Top5) | Caltech101 (N_train = 30) Acc% (Top1) | Acc% (Top5)
Alex-Net [6]          | 38.8 | 57.2 | 56.6 | 75.9
ZF-Net [9]            | 38.8 | --   | 46.5 | --
GoogLeNet [7]         | 43.8 | 63.9 | 53.3 | 73.2
VGG16 [8]             | 41.5 | 59.9 | 58.1 | 77.7
Ex-CNN [24]           | 53.6 | --   | 87.1 | --
PC-DCNN (256×256)     | 47.3 | 66.6 | 66.7 | 85.4
PC-DCNN (320×320)     | 48.6 | 68.2 | 67.2 | 87.0
PC-DCNN (384×384)     | 49.6 | 68.1 | 67.5 | 86.5
PC-DCNN (448×448)     | 50.1 | 68.5 | 69.7 | 88.1
MS-PC-DCNN (4-Scale)  | 54.4 | 71.5 | 72.6 | 88.8

4.2. Configuration

In all experiments, we use data augmentation [16]. First, we crop patches from the top left corner, top right corner, bottom left corner, bottom right corner and center of each image, and then flip them horizontally at the training stage, so the augmented dataset is 10 times the size of the original dataset.
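A minimal sketch of this five-crop plus horizontal-flip augmentation (our own NumPy illustration; the crop size and array layout are assumptions) produces the ten patches per image described above.

```python
import numpy as np

def ten_crop(image, crop):
    """Five corner/center crops of size `crop` plus their horizontal flips (10 patches)."""
    h, w, _ = image.shape
    tops_lefts = [(0, 0), (0, w - crop),                    # top-left, top-right
                  (h - crop, 0), (h - crop, w - crop),      # bottom-left, bottom-right
                  ((h - crop) // 2, (w - crop) // 2)]       # center
    patches = [image[t:t + crop, l:l + crop] for t, l in tops_lefts]
    patches += [p[:, ::-1] for p in patches]                # horizontal flips
    return np.stack(patches)

patches = ten_crop(np.random.rand(256, 256, 3), crop=224)   # e.g. Model(a) crops
print(patches.shape)                                        # (10, 224, 224, 3)
```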

During training with Model(a), 30 images are selected randomly from each category of the Caltech101 dataset, and the rest are used for test. To find the hyperparameters, we use the first 20 images in each class for training and the rest for validation, and find the best-performing point on the validation set. Then, we train on the whole training set until the iteration number reaches the best point found. On the Caltech256 dataset, we first randomly select 60 images from each category as training samples and use the other images for test; the first 50 images are used for training and the rest for validation when searching for hyperparameters.

Regarding Model(b), we follow the protocol used by Goodfellow et al. [18], which uses the first 40,000 images for training and the remaining 10,000 images for validation on CIFAR-10. The same protocol is applied to the CIFAR-100 dataset, without fine-tuning from the model trained on CIFAR-10. The other configuration parameters are shown in Table 6. In all experiments, we employ the popular Caffe toolkit [23] to deploy and implement the proposed models, and we use two NVIDIA TITAN X GPUs to accelerate the computation. On the Caltech256 and Caltech101 datasets, about 3 days are required to train the models; on CIFAR-10 and CIFAR-100, about 1.5 days are required. At the test stage, the average time per image is about 0.15 s on Caltech256 and Caltech101, and about 0.07 s on CIFAR-10 and CIFAR-100.

4.3. Performance analysis

For large images, a number of models are implemented for comparison, including Alex-Net [6], GoogLeNet [7] and VGG16 [8] with the same configuration, and the results of the ZF-Net model are cited directly from [9]. The comparative results are presented in Table 7, where it can be observed that the proposed PC-DCNN model achieves better performance than the other competing models on both Caltech256 and Caltech101. Furthermore, the proposed MS-PC-DCNN (4-Scale) model outperforms GoogLeNet by more than 10% (Top1) and 19% (Top1) on Caltech256 and Caltech101, respectively, and outperforms VGG16 by about 13% (Top1) and 14% (Top1) on Caltech256 and Caltech101, respectively.


Table 8. Performance comparison of different models on CIFAR-100 and CIFAR-10.

Deep model               | CIFAR-100 Err% (Top1) | CIFAR-10 Err% (Top1)
Stochastic pooling [17]  | 42.51 | 15.13
Conv. Maxout [18]        | 38.57 | 9.38
Deeply supervised [25]   | 34.57 | 7.97
DropConnect [19]         | --    | 9.32
NIN + Dropout [10]       | 35.68 | 8.81
All-CNN [11]             | 33.71 | 7.25
Ex-CNN [24]              | --    | 15.7
PC–DCNN (32×32)          | 31.91 | 7.73
PC–DCNN (48×48)          | 28.81 | 7.41
PC–DCNN (64×64)          | 28.87 | 7.49
PC–DCNN (80×80)          | 29.90 | 7.23
MS-PC-DCNN (4-Scale)     | 25.90 | 6.07

Table 9. Complexity of different models. The first block lists the results of Model(a) and the second block lists the results of Model(b).

Model(a) comparison (large images):
Deep model            | T_Complexity
Alex-Net [6]          | 0.88
ZF-Net [9]            | 0.94
GoogLeNet [7]         | 1.32
VGG16 [8]             | 20.1
Ex-CNN [24]           | 76.43
PC-DCNN               | 1.0
MS-PC-DCNN (4-Scale)  | 10.51

Model(b) comparison (tiny images):
Deep model              | T_Complexity
Stochastic pooling [17] | 0.14
Conv. Maxout [18]       | 28.23
Deeply supervised [25]  | 0.14
NIN [10]                | 1.13
All-CNN [11]            | 1.29
Ex-CNN [24]             | 10.02
PC-DCNN                 | 1.0
MS-PC-DCNN (4-Scale)    | 13.43

As compared with Ex-CNN, the proposed model also shows an obvious improvement on Caltech256; on Caltech101, however, there is still a large gap between our model and Ex-CNN.

We also compare the performance achieved by the proposed Model(b) with several state-of-the-art models on the tiny image datasets, as shown in Table 8. The proposed MS-PC-DCNN (4-Scale) model obtains state-of-the-art results on the CIFAR-100 dataset, reducing the test error to 25.90%. Even when we use just a single 1-Scale model, the performance outperforms NIN [10] and All-CNN [11]. On CIFAR-10, we also obtain the best result, with a test error of 6.07% for the 4-Scale model.

We further compare multi-scale models that include different numbers of single PC–DCNN modules, fusing PC–DCNN models with different scales according to the rules in Tables 4 and 5 for Model(a) and Model(b), respectively. The results are shown in Fig. 3, where it is obvious that the performance improves as the number of PC–DCNN modules increases. Especially on Caltech256 and Caltech101, the 4-Scale model achieves accuracies of 54.4% and 72.6%, respectively, outperforming the original PC–DCNN model by more than 7% and 6%, respectively. On the CIFAR-100 dataset, the 4-Scale model reduces the test error by 6.01% relative to the original PC–DCNN. Even on the CIFAR-10 dataset, the 4-Scale model reduces the test error by about 1.7% as compared to the original PC–DCNN model.

4.4. Model complexity

The model complexity of a DCNN model is related to the size of the convolutional kernels, the feature maps and the stride. The smaller the convolutional kernel is, the lower the model complexity is; if a smaller stride is used, the complexity increases greatly. The following formula is employed to evaluate the model complexity [12]:

$$ T\_Complexity = O\left( \sum_{l=1}^{d} n_{l-1} \cdot K_l^2 \cdot n_l \cdot \left( \Big\lfloor \frac{M_{l-1} - K_l + 2 \cdot pad_l}{s_l} \Big\rfloor + 1 \right)^2 \right), \tag{24} $$

where $d$ is the number of convolutional layers, $K_l$ and $n_l$ are the size of the kernel and the number of kernels (or outputs) in the $l$-th convolutional layer, and $M_l$ is the size of the output feature maps generated by the $l$-th convolutional layer; if $l = 1$, $n_0$ and $M_0$ are the number of channels and the size of the input images. The parameter $pad_l$ is the border added to the feature map to avoid the convolutional kernel exceeding the feature-map region, and its value is usually set to 0.
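The snippet below (our own transcription of Eq. (24), with a made-up two-layer configuration rather than any model from Table 9) computes the unnormalized complexity term and illustrates the claim that a smaller first-layer stride greatly increases the cost.

```python
def t_complexity(layers):
    """Eq. (24): sum over layers of n_{l-1} * K_l^2 * n_l * M_l^2.
    Each layer is a tuple (n_prev, k, n, m_prev, stride, pad)."""
    total = 0
    for n_prev, k, n, m_prev, stride, pad in layers:
        m_out = (m_prev - k + 2 * pad) // stride + 1
        total += n_prev * k * k * n * m_out ** 2
    return total

# Hypothetical two-layer network on a 224x224 RGB input, with stride 4 vs. stride 2.
layers_stride4 = [(3, 11, 96, 224, 4, 0), (96, 5, 256, 54, 1, 0)]
layers_stride2 = [(3, 11, 96, 224, 2, 0), (96, 5, 256, 107, 1, 0)]
print(t_complexity(layers_stride2) / t_complexity(layers_stride4))  # > 1: smaller stride, higher cost
```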

The comparison of the complexity of different models is shown in Table 9, where it can be observed that, for large images, the GoogLeNet model has 1.32 times the complexity of the single Model(a), and the 4-Scale Model(a) has about 10 times the complexity of the single PC–DCNN model. However, even for the 4-Scale model, the complexity is only about one seventh of that of Ex-CNN. For tiny images, our single model has lower complexity than NIN, All-CNN and Ex-CNN, but the 4-Scale Model(b) has about 13 times the complexity of the single PC–DCNN model.

Fig. 3. Performance comparison of the proposed MS-PC-DCNN model with different numbers of scales: (a) Caltech256; (b) Caltech101; (c) CIFAR-100; (d) CIFAR-10.

5. Conclusion

Deep learning technology, especially DCNN, is developing rapidly. Nowadays, novel methods and architectures constantly emerge, and the state-of-the-art results on public datasets in the field of computer vision are frequently surpassed. In the DCNN model, an end-to-end method is employed for feature learning, avoiding the design of complicated handcrafted features and application-specific models, and more abstract, semantically richer features are extracted through multiple nonlinear transformations. However, the current DCNN models become deeper and deeper, and their architectures become more and more sophisticated. From the viewpoint of the HVS, the PC–DCNN model and the MS-PC-DCNN model are proposed in this work based on the super convolutional kernel and the sub convolutional kernel. In a sense, the proposed approach does not increase the depth of the model; instead, it increases the width of the model by using more than one transformation stream for an image, leading to more abstract and robust features. The experimental results demonstrate that the proposed PC-DCNN model obtains better performance for image classification than a number of state-of-the-art models, and the performance is further improved with the MS-PC-DCNN model. However, the complexity of the proposed model is higher than that of most other deep models under the same conditions. In future work, more effort will be devoted to reducing the complexity and speeding up the convergence of the model.

Acknowledgements

This work is supported in part by the National Natural Science Foundation of China under Grants 61622115 and 61472281, and by the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning (No. GZ2015005).

References

[1] Felzenszwalb P, Girshick R, McAllester D, Ramanan D. Object detection with discriminatively trained part-based models. IEEE Trans Pattern Anal Mach Intell 2010;32(9):1627–45.
[2] Lowe DG. Distinctive image features from scale-invariant keypoints. Int J Comput Vis 2004;60(2):91–110.
[3] Waheed Z, Akram MU, Waheed A, Khan MA, Shaukat A, Ishaq M. Person identification using vascular and non-vascular retinal features. Comput Electr Eng 2016;53:359–71.
[4] Lazebnik S, Schmid C, Ponce J. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: IEEE conference on computer vision and pattern recognition; 2006. p. 2169–78.
[5] Zhang C, Liu J, Liang C, Xue Z, Pang J. Image classification by non-negative sparse coding, correlation constrained low-rank and sparse decomposition. Comput Vis Image Und 2014;123(7):14–22.
[6] Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Advances in neural information processing systems (NIPS); 2012. p. 1106–14.
[7] Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions. In: IEEE conference on computer vision and pattern recognition; 2015. p. 1–9.
[8] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: International conference on learning representations; 2015.
[9] Zeiler MD, Fergus R. Visualizing and understanding convolutional networks. In: European conference on computer vision. Springer; 2014. p. 818–33.
[10] Lin M, Chen Q, Yan S. Network in network. In: International conference on learning representations; 2015.
[11] Springenberg JT, Dosovitskiy A, Brox T. The all convolutional net. In: International conference on learning representations; 2015.
[12] He K, Sun J. Convolutional neural networks at constrained time cost. In: IEEE conference on computer vision and pattern recognition; 2015. p. 5353–60.
[13] LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE 1998;86(11):2278–324.
[14] Hinton GE, Salakhutdinov RR. Reducing the dimensionality of data with neural networks. Science 2006;313(5786):504–7.
[15] Russakovsky O, Deng J, Huang Z, Berg A, Li F-F. Detecting avocados to zucchinis: what have we done, and where are we going? In: IEEE international conference on computer vision; 2013. p. 2064–71.
[16] Krizhevsky A. Learning multiple layers of features from tiny images. Tech report, University of Toronto; 2009.
[17] Zeiler MD, Fergus R. Stochastic pooling for regularization of deep convolutional neural networks. In: International conference on learning representations; 2013.
[18] Goodfellow IJ, Warde-Farley D, Mirza M, Courville A, Bengio Y. Maxout networks. In: International conference on machine learning; 2013. p. 1319–27.
[19] Wan L, Zeiler MD, Zhang S, LeCun Y, Fergus R. Regularization of neural networks using DropConnect. In: International conference on machine learning; 2013. p. 1058–66.
[20] He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: IEEE international conference on computer vision; 2015. p. 1026–34.
[21] Li F-F, Fergus R, Perona P. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. Comput Vis Image Und 2010;106(1):59–70.
[22] Griffin G, Holub A, Perona P. Caltech-256 object category dataset. Tech report, California Institute of Technology; 2007.
[23] Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, et al. Caffe: convolutional architecture for fast feature embedding. In: ACM international conference on multimedia; 2014. p. 675–8.
[24] Dosovitskiy A, Fischer P, Springenberg J, Riedmiller M, Brox T. Discriminative unsupervised feature learning with exemplar convolutional neural networks. In: Advances in neural information processing systems (NIPS); 2014. p. 766–74.
[25] Lee CY, Xie S, Gallagher P, Zhang Z, Tu Z. Deeply-supervised nets. In: International conference on artificial intelligence and statistics; 2015. p. 562–70.


Pengjie Tang received the M.S. degree in Computer Software and Theory from Nanchang University, China, in 2009. He is currently a Ph.D. candidate at the Department of Computer Science and Technology, Tongji University, Shanghai, China. His current research interests include computer vision and deep learning.

Hanli Wang received the M.E. degree in Electrical Engineering from Zhejiang University, Hangzhou, China, in 2004, and the Ph.D. degree in Computer Science from City University of Hong Kong, Kowloon, Hong Kong, in 2007. He is a professor at the Department of Computer Science and Technology, Tongji University, Shanghai, China. His current research interests include digital video coding, computer vision, and machine learning.
