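Attribution-NonCommercial-NoDerivs 2.0 Korea
Subject to the conditions below, users are free to:
- copy, distribute, transmit, display, perform, and broadcast this work.
Under the following conditions:
Attribution. You must attribute the original author.
NonCommercial. You may not use this work for commercial purposes.
NoDerivs. You may not alter, transform, or build upon this work.
- For any reuse or distribution, you must make clear the license terms applied to this work.
- These conditions do not apply if you obtain separate permission from the copyright holder.
Users' rights under copyright law are not affected by the above.
This is a human-readable summary of the Legal Code.
Disclaimer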
Abstract
Yeongjun Han
Department of Statistics
The Graduate School
Seoul National University
In this paper, we evaluate deep convolutional neural networks (CNNs) with various structures and loss functions. Our data is VGGFace2, a widely used dataset. We first verify the performance of CNN models such as VGG and ResNet trained with cross-entropy, CosFace, and ArcFace losses. We then combine the models with a stacking ensemble. Unlike other face image classification work, we also use face detection to improve performance. The procedure has three steps. First, we run face detection so that the models can focus only on each person's identity. Second, we build convolutional networks on the detected face images. Finally, we gather the results of all convolutional neural networks and combine them. Focusing on model structures and tuning procedures, we verify that a model with ensembling performs better than one without.
Keywords: Convolutional Neural Network, VGGFace2, Loss functions, Stacking
Student Number: 2017-21465
Contents
1 Introduction
2 Model Structures
    2.1 Baseline models
    2.2 Structured models
    2.3 Loss functions
    2.4 Stacking
3 Data
    3.1 Data collecting
    3.2 Data preprocessing
4 Simulation and Evaluation
    4.1 Training methods
    4.2 Results
5 Discussion
References
Abstract in Korean
List of Figures
    3.1 Extracted face on VGGFace2 dataset on the same identity
    4.1 Training loss
    5.1 Non-normalized plot
    5.2 Normalized plot on training
List of Tables
    4.1 Tuning hyperparameters
    4.2 Tuning result
    4.3 Stacking result
Chapter 1
Introduction
Over the past several years, Convolutional Neural Networks (CNNs) have significantly boosted the state-of-the-art performance in many visual classification tasks such as object recognition. Face classification is important for many industries, such as the military, finance, and public security. In face classification, model structures have improved considerably, for example GoogLeNet (Szegedy et al. (2015)), VGG (Simonyan and Zisserman (2014)), ResNet (He et al. (2015)), and more recently EfficientNet. Discriminative loss functions have also evolved in recent years. Traditional models often have more trainable parameters than there are training samples. The conventional practice for model scaling is to arbitrarily increase the CNN depth or width, or to use a larger input image resolution for training and evaluation. While these methods do improve accuracy, they usually require tedious manual tuning and still often yield suboptimal results. EfficientNet (Tan and Le (2019)) proposed a principled way to scale up a CNN to obtain better accuracy and efficiency.
Traditional learning methods minimize $\|H(x) - y\|$, where $H(\cdot)$ is a function $H : x \to y$. In an image network, $x$ and $H(x)$ should carry similar meaning (and $H(\cdot)$ has to be trained in that way). From this point of view, it is similar to learn the difference $F(x) := H(x) - x$ and keep it small. $F(x)$ is called the residual, and this is where the name ResNet comes from. ResNet also uses a bottleneck technique to reduce the number of parameters.
In terms of loss functions, traditional models adopted the softmax loss for feature learning. Later, researchers began modifying the softmax loss into discriminative loss functions, which add a penalty term on the distance between classes to enlarge their margin. This was a hot topic in 2017, when angular and cosine margin-based losses became popular. The main advantage of changing the loss function lies in the number of networks required. FaceNet (Schroff et al. (2015)) used 25 networks, whereas ArcFace (Deng et al. (2018)) used only a single network and still achieved better performance (about 99.83% accuracy on the MS-Celeb-1M dataset). ArcFace adds an additive angular margin penalty $m$ between $x_i$ and $W_{y_i}$ to simultaneously enhance intra-class compactness and inter-class distance.
We use the VGGFace2 dataset, which has over 8000 identities and over 300 samples per subject. VGGFace2 may not reflect real-life settings, where there can be many identities but only a few samples for each. In that case, one can use models with augmentation techniques to obtain enough samples to overcome the $n \ll p$ problem.
We use a simple CNN as a baseline network, together with VGG16, ResNet18, and ResNet26, which have a small number of parameters compared to other, more complicated CNNs. As ensemble methods, we use majority voting, weighted mean, and a cross-validation-based stacking method.
Chapter 2
Model Structures
2.1 Baseline models
He et al. (2016) suggested that the positions of batch normalization and ReLU can affect a model's performance. They ran experiments varying the position of the activation and concluded that placing the batch normalization and ReLU layers before the weight layer, called full pre-activation, performed well. Zhou et al. (2015) suggested that Global Average Pooling (GAP) acts like a regularizer and hence can help avoid overfitting. We therefore used full pre-activation in the ResNet models, with an average pooling layer instead of a weight layer.
Let $B(\cdot)$ denote batch normalization, and let $\sigma(\cdot)$ denote the leaky ReLU,

$$\sigma(x, k) = \max(x, kx), \tag{2.1}$$

where $k$ is a scale parameter. Compared to the original ReLU, Equation 2.1 can avoid the vanishing gradient problem (Xu et al. (2015)). The baseline model, which stacks four blocks, is described as follows:

$$f_i(x) = \sigma(B(C(x))), \quad i = 1, \ldots, 4, \tag{2.2}$$

where $C(\cdot)$ is a convolution layer with stride 2.
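As a minimal sketch of Equation 2.1, the leaky ReLU and its derivative can be written in plain Python. The value k = 0.1 and the sample inputs are illustrative assumptions, not settings taken from the experiments; the point is only that the gradient stays at k rather than dropping to zero for negative inputs.

```python
def leaky_relu(x: float, k: float = 0.01) -> float:
    """Leaky ReLU of Equation 2.1: max(x, k * x), assuming 0 <= k < 1."""
    return max(x, k * x)


def leaky_relu_grad(x: float, k: float = 0.01) -> float:
    """Derivative of the leaky ReLU: 1 for positive inputs, k otherwise."""
    return 1.0 if x > 0 else k


# Illustrative inputs and scale parameter (assumed values).
values = [-2.0, -0.5, 0.0, 1.5]
outputs = [leaky_relu(v, k=0.1) for v in values]
grads = [leaky_relu_grad(v, k=0.1) for v in values]
```

With the plain ReLU (k = 0) the gradient would be zero for every negative input, which is the situation the leaky variant is meant to avoid.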
2.2 Structured models
We used the structured models VGG16, ResNet18, and ResNet26. VGG16 has 16 weight layers: 13 convolutional layers and 3 fully connected layers, with max-pooling layers of stride 2. The conventional practice for model scaling is to arbitrarily increase the CNN depth or width, which makes models harder to build. So, based on the size of our dataset and the computational cost, we used simple networks with a relatively small number of parameters. In the VGG16 network, learning minimizes $\|H(x) - y\|$, where $H(\cdot)$ is a function $H : x \to y$.
ResNet reinterprets this minimization: $x$ and $H(x)$ should carry similar meaning (and $H(\cdot)$ has to be trained in that way). From this point of view, it is similar to learn the residual $F(x) := H(x) - x$. He et al. (2015) expressed this with a skip connection:

$$H(x) = F(x) + x, \tag{2.3}$$

where $F(\cdot)$ is a non-linear function that contains convolution functions.
Note that $x_{l+1} = h(x_l) + f(F(x_l))$, where $x_l$ is the output of the $l$-th layer and $h$ is an identity function. We used full pre-activation for the function $f(\cdot) = \sigma(B(C(\cdot)))$. The proposed networks are as follows:

$$x_{l+1} = x_l + R(x_l), \qquad x_l = x_0 + \sum_{i=0}^{l-1} R(x_i), \tag{2.4}$$

where $R$ is a residual block.
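The two forms in Equation 2.4 can be checked numerically. The sketch below is only an illustration: the "residual blocks" are hypothetical scalar functions standing in for convolutional blocks, and each layer is given its own block, whereas Equation 2.4 writes a single symbol $R$.

```python
from typing import Callable, List


def run_residual_stack(x0: float,
                       blocks: List[Callable[[float], float]]) -> List[float]:
    """Apply x_{l+1} = x_l + R_l(x_l) and return [x_0, x_1, ..., x_L]."""
    xs = [x0]
    for block in blocks:
        xs.append(xs[-1] + block(xs[-1]))
    return xs


# Hypothetical toy residual blocks acting on scalars (assumed for illustration).
blocks = [lambda x: 0.5 * x, lambda x: x - 1.0, lambda x: 0.25 * x]
xs = run_residual_stack(2.0, blocks)

# Unrolled form: x_L = x_0 + sum_i R_i(x_i).
residual_sum = sum(block(x) for block, x in zip(blocks, xs[:-1]))
unrolled = xs[0] + residual_sum
```

Both routes give the same final value, which is why the input $x_0$ reaches every later layer through the identity path.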
2.3 Loss functions
The cross-entropy loss in Equation 2.5 is widely used for optimizing model parameters. The softmax loss is presented as follows:

$$L_{\text{softmax}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{W_{y_i}^T x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^T x_i + b_j}}, \tag{2.5}$$

where $x_i \in \mathbb{R}^d$ denotes the $i$-th sample, belonging to the $y_i$-th class, $W_j \in \mathbb{R}^d$ denotes the $j$-th row of the weight $W \in \mathbb{R}^{n \times d}$, $d$ is the dimension of the input data, $b_j$ is the $j$-th component of the bias term $b \in \mathbb{R}^n$, and $N$ and $n$ are the batch size and the number of classes, respectively. Equation 2.5 yields features that are separable for classification but not fully discriminative. Researchers therefore began modifying the softmax loss into discriminative loss functions, which add a penalty term on the distance between classes. Since the loss function directly determines the learning algorithm, modifying it can lead to better performance.
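Equation 2.5 can be sketched in a few lines of plain Python. This is an illustration under assumed toy inputs (three classes, two-dimensional features, all-zero weights), not the training code used in the experiments; subtracting the largest logit before exponentiating is a standard numerical-stability step and does not change the value.

```python
import math
from typing import List


def softmax_loss(W: List[List[float]], b: List[float],
                 xs: List[List[float]], ys: List[int]) -> float:
    """Mean softmax cross-entropy of Equation 2.5 over a batch."""
    total = 0.0
    for x, y in zip(xs, ys):
        # logits[j] = W_j^T x + b_j for each class j
        logits = [sum(w_k * x_k for w_k, x_k in zip(w_j, x)) + b_j
                  for w_j, b_j in zip(W, b)]
        top = max(logits)  # subtract the max for numerical stability
        log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
        total += log_norm - logits[y]
    return total / len(xs)


# Toy setting (assumed): n = 3 classes, d = 2 features, untrained zero weights.
W = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0, 0.0]
loss_uniform = softmax_loss(W, b, [[1.0, 2.0], [0.5, -1.0]], [0, 2])
```

With zero weights every class receives probability 1/3, so the loss equals log 3, the value of a model that has learned nothing.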
Liu et al. (2017) introduced the SphereFace angular margin, but its series of approximations makes training unstable. CosFace (Wang et al. (2018)) added a margin penalty to the target class and obtained better performance. SphereFace is presented as follows:

$$L_{\text{sphere}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s \cos \theta_{y_i}}}{e^{s \cos \theta_{y_i}} + \sum_{j=1, j \neq y_i}^{n} e^{s \cos \theta_j}}, \tag{2.6}$$

where $\theta_j$ is the angle between $W_j$ and $x_i$, and $s$ is a scale parameter, which is the radius of the hypersphere.
CosFace uses feature normalization to make the margin effective. More specifically, by applying $L_2$ normalization to both the feature and weight vectors, it removes scale variations. Then, based on a cosine margin term, it maximizes the decision margin in the angular space. As a result, it minimizes intra-class variance and maximizes inter-class variance. CosFace is presented as follows:

$$L_{\text{cosface}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s(\cos \theta_{y_i} - m)}}{e^{s(\cos \theta_{y_i} - m)} + \sum_{j=1, j \neq y_i}^{n} e^{s \cos \theta_j}}, \tag{2.7}$$

where $m$ is a margin parameter.
Deng et al. (2018) proposed ArcFace, which modifies the loss function to enhance intra-class compactness and inter-class discrepancy. ArcFace is presented as follows:

$$L_{\text{arcface}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s(\cos(m_1 \theta_{y_i} + m_2) - m_3)}}{e^{s(\cos(m_1 \theta_{y_i} + m_2) - m_3)} + \sum_{j=1, j \neq y_i}^{n} e^{s \cos \theta_j}}, \tag{2.8}$$

where $m_1$, $m_2$, and $m_3$ are margin parameters.
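To make the role of the margins in Equation 2.8 concrete, the following sketch computes the per-sample loss from the cosines $\cos \theta_j$ of one sample. The scale s = 30, the margin values, and the cosine inputs are illustrative assumptions rather than the settings tuned in this work. Setting $m_1 = 1$, $m_2 = 0$, $m_3 = 0$ leaves the target logit unchanged; a positive $m_2$ adds an angular margin, and a positive $m_3$ subtracts a margin from the cosine.

```python
import math
from typing import List


def margin_loss(cosines: List[float], target: int, s: float = 30.0,
                m1: float = 1.0, m2: float = 0.0, m3: float = 0.0) -> float:
    """Per-sample loss of Equation 2.8, given cos(theta_j) for every class j."""
    # Clamp before acos to guard against rounding slightly outside [-1, 1].
    theta = math.acos(max(-1.0, min(1.0, cosines[target])))
    target_logit = s * (math.cos(m1 * theta + m2) - m3)
    logits = [s * c for c in cosines]
    logits[target] = target_logit
    top = max(logits)  # subtract the max for numerical stability
    log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_norm - target_logit


# Assumed cosines between one feature vector and three class weight vectors.
cosines = [0.8, 0.1, -0.3]
plain = margin_loss(cosines, target=0)                 # no margin
angular = margin_loss(cosines, target=0, m2=0.5)       # additive angular margin
cosine = margin_loss(cosines, target=0, m3=0.35)       # additive cosine margin
```

For the same features, either margin lowers the target logit and therefore raises the loss, so the network has to pull samples closer to their class weight vector to reach the same loss value.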
2.4 Stacking
Stacking is a method that uses a high-level model to combine lower-level models to achieve greater predictive accuracy. We used different stacking methods, namely the unweighted mean and stacked generalization. Moreover, Young et al. (2018) suggested the super learner, which is a cross-validation-based stacking method. It should be noted that the super learner assigns weights to each class, but ours does not.
Unweighted averaging is the most common ensemble approach for neural networks. It takes the unweighted average of the output probabilities of all the models:

$$p_l = \arg\max_j \sum_{i=1}^{N} p_{i,j}, \quad \text{for } j = 1, \ldots, K, \tag{2.9}$$

where $K$ is the number of classes, $N$ is here the number of models, and $p_{i,j}$ is the softmax probability of the $j$-th class from the $i$-th model.
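Equation 2.9 amounts to summing the class probabilities over the models and taking the arg max. A minimal sketch, with made-up probability vectors for three models and three classes:

```python
from typing import List


def unweighted_mean_predict(model_probs: List[List[float]]) -> int:
    """Return arg max_j of the summed class probabilities (Equation 2.9)."""
    num_classes = len(model_probs[0])
    totals = [sum(p[j] for p in model_probs) for j in range(num_classes)]
    return max(range(num_classes), key=lambda j: totals[j])


# Assumed softmax outputs of three models for a single sample.
probs = [
    [0.6, 0.3, 0.1],  # this model alone would predict class 0
    [0.2, 0.5, 0.3],
    [0.3, 0.4, 0.3],
]
predicted = unweighted_mean_predict(probs)
```

Here the first model on its own prefers class 0, but the summed probabilities favor class 1, so the ensemble prediction differs from that single model.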
In stacked generalization, a logistic regression is generally used, whose covariates are the output probabilities or the predicted classes. Ting and Witten (1997) suggested that using the output class probabilities as covariates gives better performance in stacked generalization. Our networks therefore produce output class probabilities, which are used as input to the level-1 generalizer.
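One way to picture the level-1 input is as the concatenation, per sample, of every base model's class-probability vector. The sketch below only builds that covariate matrix; the function name and the toy probabilities are assumptions for illustration, and fitting the logistic regression on top of it is left out.

```python
from typing import List


def level1_features(per_model_probs: List[List[List[float]]]) -> List[List[float]]:
    """Concatenate each model's class probabilities, sample by sample.

    per_model_probs[m][i] is model m's probability vector for sample i.
    """
    num_samples = len(per_model_probs[0])
    return [
        [p for model in per_model_probs for p in model[i]]
        for i in range(num_samples)
    ]


# Assumed outputs of three base models on two samples with two classes.
model_a = [[0.7, 0.3], [0.2, 0.8]]
model_b = [[0.6, 0.4], [0.4, 0.6]]
model_c = [[0.9, 0.1], [0.1, 0.9]]
features = level1_features([model_a, model_b, model_c])
```

With three models and $K$ classes, each sample yields $3K$ covariates for the level-1 generalizer.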
Chapter 3
Data
3.1 Data collecting
The VGGFace2 dataset (http://www.robots.ox.ac.uk/~vgg/data/vgg_face2/) contains over 8000 identities covering a wide range of ethnicities, accents, professions, and ages. It has over 3.3 million faces, with over 300 samples per subject. Using face detection, we obtained from 150 to 400 images for each of 300 classes. For face detection, we used Haar cascades from the OpenCV module. When the whole dataset has to be processed, the model can depend on the performance of the face detector. However, we found that other face detection methods, such as LBP or improved LBP, make no difference in capturing a person's face. We also propose that the face-cropped version of the data lets our model perform better than the original image dataset. The roughly 40 GB of original images are also too costly in computation and time.
3.2 Data preprocessing
We used OpenCV to capture each identity's face. The choice between Haar cascades (Viola and Jones (2001)) and LBP does not affect face capture. Capturing the face is crucial because each picture can have different outdoor circumstances, hairstyles, and so on. Also, as a person grows older, his or her changing style becomes the main problem; we would then have to retrain the previous model on the aged dataset, and the training time could become another issue. The VGGFace2 dataset also covers a wide range of ages, so style is an issue for capturing identity. By using face-only images, we can neglect the style issues and focus only on each person's unique identity. For convenience, we also resized each picture to (128, 128, 3).
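The face capture step could look roughly like the sketch below, which uses the frontal-face Haar cascade shipped with the opencv-python package. The detector parameters (scaleFactor, minNeighbors), the choice to keep the largest detection, and the function names are assumptions for illustration; the exact settings used for this dataset are not stated here.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def largest_box(boxes: Sequence[Box]) -> Optional[Box]:
    """Pick the detection with the largest area, or None if there is none."""
    if len(boxes) == 0:
        return None
    return max(boxes, key=lambda b: b[2] * b[3])


def crop_face(image_path: str, size: int = 128):
    """Detect a face with a Haar cascade and return it resized to (size, size, 3)."""
    import cv2  # imported lazily so that largest_box works without OpenCV

    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    image = cv2.imread(image_path)
    if image is None:  # unreadable or missing file
        return None
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Assumed detector settings; they trade recall against false positives.
    boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    box = largest_box([tuple(int(v) for v in b) for b in boxes])
    if box is None:  # no face found in this picture
        return None
    x, y, w, h = box
    return cv2.resize(image[y:y + h, x:x + w], (size, size))
```

Images for which `crop_face` returns `None` would simply be dropped, which is one possible reason the per-class counts vary after detection.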
We then split the dataset into training, validation, and test sets. The validation and test sets each contain about 10% of the total, and the remaining 80% is used for training. The split is stratified by identity. Since we have many model structures and loss functions with many hyperparameters, performing K-fold cross-validation would be computationally very expensive. Even with only three model structures and three loss functions, each with its own hyperparameters, a vast amount of training time is consumed. Hence, in many deep learning settings, K-fold cross-validation is hard to perform.
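A stratified 80/10/10 split can be sketched with the standard library alone. The function name, the random seed, and the toy class sizes (150, 400, and 300 images) are assumptions for illustration; the split is done separately within each identity so that every identity appears in all three sets in the same proportions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(labels: Sequence[int], valid_frac: float = 0.1,
                     test_frac: float = 0.1, seed: int = 0
                     ) -> Tuple[List[int], List[int], List[int]]:
    """Return (train, valid, test) index lists, split separately per label."""
    by_label: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train: List[int] = []
    valid: List[int] = []
    test: List[int] = []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_valid = round(len(indices) * valid_frac)
        n_test = round(len(indices) * test_frac)
        valid += indices[:n_valid]
        test += indices[n_valid:n_valid + n_test]
        train += indices[n_valid + n_test:]
    return train, valid, test


# Toy labels (assumed): three identities with unequal numbers of images.
labels = [0] * 150 + [1] * 400 + [2] * 300
train, valid, test = stratified_split(labels)
```

Each identity contributes about 10% of its own images to the validation set and another 10% to the test set, regardless of how many images it has.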
Figure 3.1: Extracted face on VGGFace2 dataset on the same identity
Chapter 4
Simulation and Evaluation
4.1 Training methods
The training procedure is as follows. First, we fix the hyperparameters and estimate all model parameters (trainable parameters) by minimizing the loss function with an $L_2$ penalty term. Then, we choose the hyperparameters using the accuracy measure

$$E\big(I(y = \hat{f}(x, \Theta, \Lambda))\big), \tag{4.1}$$

where $x$ is the feature data and $y$ is the class label. Here $(x, y)$ is a pair from the validation set, and $\Theta$ and $\Lambda$ are the model parameters and hyperparameters, respectively. The model with the highest accuracy as defined in Equation 4.1 is selected, and its performance is then evaluated on the test data.
For efficient tuning, we used random search (Bergstra and Bengio (2012)). Grid search can be powerful in terms of model performance, but it can take a lot of time. Random search, as reported in many deep learning papers, saves time while giving roughly the same performance as grid search. The hyperparameters are the regularization parameter ($\lambda$), the learning rate ($\alpha$), and the loss function ($l$).
Let $V(x, \lambda, \alpha, l)$ be the validation accuracy. We ran this tuning procedure for each model structure and loss function. Algorithm 1 is the hyperparameter tuning procedure.
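Algorithm 1 Hyperparameter tuning
    Initialize $\lambda^{(0)}, \alpha^{(0)}, l^{(0)}$ and $t = 0$
    while $t < 5$ do
        $\lambda^{(t+1)} = \arg\max_{\lambda} V(x, \lambda, \alpha^{(t)}, l^{(t)})$
        $\alpha^{(t+1)} = \arg\max_{\alpha} V(x, \lambda^{(t+1)}, \alpha, l^{(t)})$
        $l^{(t+1)} = \arg\max_{l} V(x, \lambda^{(t+1)}, \alpha^{(t+1)}, l)$
        $t = t + 1$
    return $\lambda^{(5)}, \alpha^{(5)}, l^{(5)}$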
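Algorithm 1 updates one hyperparameter at a time while holding the others fixed. The sketch below follows that loop over finite candidate lists; `toy_accuracy` is a hypothetical stand-in for $V$ with arbitrary made-up values (a real run would train a network and return its validation accuracy), and in a random search the candidate lists would be sampled rather than fixed.

```python
from typing import Callable, Sequence, Tuple


def coordinate_tuning(validation_accuracy: Callable[[float, float, str], float],
                      lambdas: Sequence[float], alphas: Sequence[float],
                      losses: Sequence[str], rounds: int = 5
                      ) -> Tuple[float, float, str]:
    """Update lambda, alpha, and the loss in turn, each time keeping the
    candidate with the highest validation accuracy (Algorithm 1)."""
    lam, alpha, loss = lambdas[0], alphas[0], losses[0]
    for _ in range(rounds):
        lam = max(lambdas, key=lambda c: validation_accuracy(c, alpha, loss))
        alpha = max(alphas, key=lambda c: validation_accuracy(lam, c, loss))
        loss = max(losses, key=lambda c: validation_accuracy(lam, alpha, c))
    return lam, alpha, loss


def toy_accuracy(lam: float, alpha: float, loss: str) -> float:
    """Hypothetical validation accuracy; the numbers are arbitrary."""
    bonus = {"softmax": 0.0, "cosface": -0.05, "arcface": 0.02}[loss]
    return 0.9 - 100.0 * abs(lam - 0.0002) - 10.0 * abs(alpha - 0.001) + bonus


best = coordinate_tuning(toy_accuracy,
                         lambdas=[0.0, 0.00002, 0.0002],
                         alphas=[0.01, 0.001],
                         losses=["softmax", "cosface", "arcface"])
```

Each round costs one evaluation of $V$ per candidate value, which is far fewer than a full grid over all combinations, at the price of possibly stopping at a coordinate-wise optimum.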
4.2 Results
We note that, in our case, models using the CosFace loss or the VGG16 network did not converge for every hyperparameter setting, and even when they converged, their performance was not good. Our baseline models sometimes converged, but without good performance. ResNet26 has about 1.5 times more parameters than ResNet18, and it failed to capture identities. Some good results of the tuning procedure are given here.
Figure 4.1: Training loss
In Figure 4.1, ResNet18 and the softmax loss were selected during the tuning procedure. Model1, model2, and model3 are all ResNet18 with the softmax loss function. The three models differ in their hyperparameters, listed in Table 4.1. The corresponding results for each model are presented in Table 4.2.
Table 4.1: Tuning hyperparameters
models    α        1 − r    λ          epoch
model1    0.01     1        0          200
model2    0.001    1        0.00002    200
model3    0.001    1        0.0002     200

Table 4.2: Tuning result
models    train loss    train accuracy    valid accuracy    test accuracy
model1    0.16009       0.97033           0.75412           0.94371
model2    0.04374       0.98728           0.87010           0.97098
model3    0.04438       0.99661           0.90033           0.97672
Since the class sizes differ, the validation and test sets are each stratified by class to 10% of the total dataset.
The ArcFace loss outperformed the other loss functions, with model accuracy reaching 1. Deng et al. (2018) achieved 99.85% on the MS-Celeb-1M dataset using ResNet100 with a single network. Wang and Deng (2018) summarized the performance of face classification papers. Our preprocessed VGGFace2 dataset is smaller than the Celeb data, so it is not surprising that our model achieved 99.99% test accuracy with a single network, compared to the results of Deng et al. (2018). ArcFace can thus achieve close to 100% accuracy.
Capturing face images can also induce an overfitting problem: after cropping, nearly identical images may appear in the training, validation, and test sets. A perfectly overfitted model would then look like a perfect model from the perspective of performance.
Our stacking results are as follows.
Table 4.3: Stacking result
methods                           performance
model (train)                     0.98474
model (test)                      0.96380
un-weighted mean                  0.88300
stacked generalization (train)    0.98593
stacked generalization (test)     0.96448
We used 3 different models in stacking. The rows model (train) and model (test) give the average accuracy of the three models. Note that stacked generalization with 5-fold cross-validation gives a slightly better result.
Chapter 5
Discussion
We presented 4 models (baseline, VGG16, ResNet18, ResNet26) with 3 losses (softmax, CosFace, ArcFace). However, we found that VGG16 and ResNet26 did not converge well, and neither did CosFace. We also wanted to try other ensemble methods, such as bagging and boosting. To use these methods, we had to assess the quality of the trained features, that is, how well the trained features (the features of the last layer) cluster. Using t-SNE (van der Maaten and Hinton (2008)), we found that our trained data were not clustered, as shown in Figure 5.1. During the training step, if we applied image normalization, perfect clusters formed, as shown in Figure 5.2. After training ended, however, we could not find those clusters. We therefore conclude that we cannot assess the quality of the trained features, and so stacking methods were used.
Figure 5.1: Non-normalized plot
Figure 5.2: Normalized plot on training
References
Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13:281–305.
Deng, J., Guo, J., Xue, N., and Zafeiriou, S. (2018). Arcface: Additive angular margin loss for deep face recognition.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep residual learning for image recognition.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Identity mappings in deep residual networks.
Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., and Song, L. (2017). Sphereface: Deep hypersphere embedding for face recognition.
Schroff, F., Kalenichenko, D., and Philbin, J. (2015). Facenet: A unified embedding for face recognition and clustering. arXiv:1503.03832.
Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015). Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Tan, M. and Le, Q. V. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks.
Ting, K. M. and Witten, I. (1997). Stacked generalization: when does it work?
van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.
Viola, P. and Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, volume 1, pages I–I.
Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., and Liu, W. (2018). Cosface: Large margin cosine loss for deep face recognition.
Wang, M. and Deng, W. (2018). Deep face recognition: A survey.
Xu, B., Wang, N., Chen, T., and Li, M. (2015). Empirical evaluation of rectified activations in convolutional network.
Young, S., Abdou, T., and Bener, A. (2018). Deep super learner: A deep ensemble for classification problems.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. (2015). Learning deep features for discriminative localization.
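Abstract in Korean
This thesis verifies convolutional neural networks with various structures and loss functions. Using the VGGFace2 data, we verify the performance of each model with convolutional neural network models such as VGG and ResNet, together with the cross-entropy, cosine, and ArcFace losses. We also use stacking for ensembling. The algorithm first performs face detection so that it can focus only on each person's identity; unlike existing face image classification problems, face detection is used to improve performance. Convolutional neural networks are then built on the face images. Finally, the results of all networks are collected and ensembled to verify the model. This thesis focuses on model structures and on the tuning procedure through random search, and confirms that ensembling improves the performance of the existing models.
Keywords: convolutional neural network, VGGFace2, loss functions, stacking
Student Number: 2017-21465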