Disclaimer

Attribution-NonCommercial-NoDerivs 2.0 Korea. You are free to copy, distribute, transmit, display, perform, and broadcast this work, provided you follow the conditions below:

Attribution. You must credit the original author.
NonCommercial. You may not use this work for commercial purposes.
NoDerivatives. You may not alter, transform, or build upon this work.

For any reuse or distribution, you must make clear the license terms applied to this work. These conditions can be waived if you obtain permission from the copyright holder. Your rights under copyright law are not affected by the above. This is a human-readable summary of the Legal Code.




Abstract

Yeongjun Han
Department of Statistics
The Graduate School
Seoul National University

In this paper, deep Convolutional Neural Networks (CNNs) with various structures and loss functions are evaluated. Our data come from VGGFace2, a widely used face dataset. We first assess the performance of CNN models such as VGG and ResNet trained with cross-entropy, cosface, and arcface losses, and then combine them with a stacking ensemble. Unlike typical face image classification pipelines, we also apply face detection to improve performance. The procedure is as follows: first, detect faces so that the model can focus only on each person's identity; second, build convolution layers on the cropped face images; finally, gather the outputs of all convolutional neural networks and combine them. Focusing on model structures and tuning procedures, we verify that the model with the ensemble performs better than without it.

Keywords: Convolutional Neural Network, VGGFace2, Loss functions, Stacking

Student Number: 2017-21465


Contents

1 Introduction

2 Model Structures
  2.1 Baseline models
  2.2 Structured models
  2.3 Loss functions
  2.4 Stacking

3 Data
  3.1 Data collecting
  3.2 Data preprocessing

4 Simulation and Evaluation
  4.1 Training methods
  4.2 Results

5 Discussion

References

Abstract (Korean)


List of Figures

3.1 Extracted faces from the VGGFace2 dataset for the same identity
4.1 Training loss
5.1 Non-normalized plot
5.2 Normalized plot on training


List of Tables

4.1 Tuning hyperparameters
4.2 Tuning result
4.3 Stacking result


Chapter 1

Introduction

Over the past several years, Convolutional Neural Networks (CNNs) have significantly boosted state-of-the-art performance in many visual classification tasks such as object recognition. Face classification is important for many industries, including the military, finance, and public security. In the area of face classification, there have been many improvements in model structure, such as GoogLeNet (Szegedy et al. (2015)), VGG (Simonyan and Zisserman (2014)), ResNet (He et al. (2015)), and more recently EfficientNet. Discriminative loss functions have also evolved in recent years. Traditional models have far more trainable parameters than the number of data points, and the conventional practice for model scaling is to arbitrarily increase the CNN depth or width, or to use a larger input image resolution for training and evaluation. While these methods do improve accuracy, they usually require tedious manual tuning and still often yield suboptimal results. EfficientNet (Tan and Le (2019)) suggested a principled way to scale up a CNN to obtain better accuracy and efficiency.

Traditional learning methods minimize ||H(x) − y||, where H(·) is a function H : x → y. In an image network, x and H(x) should carry similar meaning (H(·) has to be trained that way). In other words, one can equivalently learn the residual mapping F(x) := H(x) − x instead of H(x) directly; this residual F is where the name ResNet comes from. ResNet also used a bottleneck technique to reduce the number of parameters.

In terms of loss functions, traditional models adopted the softmax loss for feature learning. Later, researchers began modifying the softmax loss into discriminative loss functions, which add a penalty term on the distance between classes to enlarge their margins. This was a hot topic in 2017, when angular/cosine margin-based losses became popular. The main advantage of these modified loss functions is the number of networks required: FaceNet (Schroff et al. (2015)) used 25 networks, whereas ArcFace (Deng et al. (2018)) used only a single network and obtained even better performance (about 99.83% accuracy on the MS-Celeb-1M dataset). In ArcFace, an additive angular margin penalty m between x_i and W_{y_i} simultaneously enhances intra-class compactness and inter-class distance.

We use the VGGFace2 dataset, which has over 8,000 identities with more than 300 samples per subject. VGGFace2 may not match real-life settings, where there can be many identities but only a few samples per identity. In that case, augmentation techniques can be used to obtain enough samples to overcome the n ≪ p problem.

As our baseline network we use a simple CNN, together with VGG16, ResNet18, and ResNet26, which have a small number of parameters compared to other, more complicated CNNs. As ensemble methods, we use majority voting, the weighted mean, and a cross-validation-based stacking method.

Chapter 2

Model Structures

2.1 Baseline models

He et al. (2016) showed that the positions of batch normalization and ReLU can affect a model's performance. They ran experiments varying the activation's position and concluded that placing batch normalization and the ReLU layer before the weight layer, called full pre-activation, performs well. Zhou et al. (2015) suggested that Global Average Pooling (GAP) acts like a regularizer and hence can help avoid overfitting. So we used full pre-activation in our ResNet models, with an average pooling layer instead of a final weight layer.

Let B(·) denote batch normalization and σ(·) denote the leaky ReLU. Then,

\sigma(x, k) = \max(x, kx), \qquad (2.1)

where k is a scale parameter. Compared to the original ReLU, Equation 2.1 avoids vanishing gradients (Xu et al. (2015)). The baseline model stacks four blocks, each described as follows:

f_i(x) = \sigma(B(C(x))), \quad i = 1, \ldots, 4, \qquad (2.2)

where C(·) is a convolution layer with stride 2.
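As a minimal sketch of Equation 2.2 (assuming PyTorch; the channel widths and the leaky-ReLU slope k = 0.1 are illustrative choices, not values from the thesis), one such baseline block could look like:

```python
import torch
import torch.nn as nn

class BaselineBlock(nn.Module):
    """One block of Equation 2.2, f_i(x) = sigma(B(C(x))): a stride-2
    convolution, then batch normalization, then the leaky ReLU of Eq. 2.1."""

    def __init__(self, in_ch: int, out_ch: int, k: float = 0.1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(negative_slope=k)  # sigma(x, k) = max(x, kx)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.conv(x)))

# The baseline model stacks four such blocks (widths are illustrative).
baseline = nn.Sequential(
    BaselineBlock(3, 32), BaselineBlock(32, 64),
    BaselineBlock(64, 128), BaselineBlock(128, 256),
)
```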


2.2 Structured models

We used the structured models VGG16, ResNet18, and ResNet26. VGG16 has 16 layers: 13 convolution layers and 3 fully connected layers, with stride-2 max-pooling between the convolution blocks. Since the conventional practice for model scaling is to arbitrarily increase the CNN depth or width, building models is not simple. So, based on the size of our dataset and the computational cost, we used simple networks with relatively few parameters compared to others.

In the VGG16 network, learning minimizes ||H(x) − y||, where H(·) is a function H : x → y. ResNet interprets this minimization as requiring x and H(x) to carry similar meaning (H(·) has to be trained that way); in other words, it is an equivalent point of view to minimize the residual F(x) := H(x) − x. He et al. (2015) call the identity path a skip connection:

H(x) = F(x) + x, \qquad (2.3)

where F(·) is a non-linear function containing convolution layers. Note that x_{l+1} = h(x_l) + f(F(x_l)), where x_l is the l-th layer output and h is an identity function. We used full pre-activation for the function f(·) = σ(B(C(·))). The proposed networks are as follows:

x_{l+1} = x_l + R(x_l), \qquad x_l = x_0 + \sum_{i=0}^{l-1} R(x_i), \qquad (2.4)

where R is a residual block.
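A minimal sketch of the residual update in Equation 2.4, again assuming PyTorch; the full pre-activation ordering (batch norm and leaky ReLU before each convolution) follows He et al. (2016), and the fixed channel width is an illustrative simplification:

```python
import torch
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    """Full pre-activation residual block: x_{l+1} = x_l + R(x_l),
    where R applies (batch norm -> leaky ReLU -> convolution) twice."""

    def __init__(self, ch: int, k: float = 0.1):
        super().__init__()
        self.residual = nn.Sequential(
            nn.BatchNorm2d(ch), nn.LeakyReLU(k),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(ch), nn.LeakyReLU(k),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.residual(x)  # identity skip connection h(x_l) = x_l
```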


2.3 Loss functions

The cross-entropy loss in Equation 2.5 is widely used for optimizing model parameters. The softmax loss is given by

L_{\text{softmax}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{W_{y_i}^{\top} x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^{\top} x_i + b_j}}, \qquad (2.5)

where x_i \in \mathbb{R}^d denotes the i-th sample in the y_i-th class, W_j \in \mathbb{R}^d denotes the j-th row of the weight matrix W \in \mathbb{R}^{n \times d} with d the input dimension, b_j is a bias term, and N and n are the batch size and the number of classes, respectively. Equation 2.5 is separable for classification but not fully discriminative, so researchers began modifying the softmax loss into discriminative loss functions, which add a penalty term on the distance between classes. Since the loss function directly determines the learning algorithm, modifying it can improve performance.

Liu et al. (2017) introduced the SphereFace multiplicative angular margin, but a series of approximations makes its training unstable. CosFace (Wang et al. (2018)) instead adds a margin penalty to the target class, which obtains better performance. SphereFace is given by

L_{\text{sphere}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s \cos(m\theta_{y_i})}}{e^{s \cos(m\theta_{y_i})} + \sum_{j=1, j \neq y_i}^{n} e^{s \cos \theta_j}}, \qquad (2.6)

where s is a scale parameter (the radius of the hypersphere) and m is the multiplicative angular margin.

CosFace features a normalization technique to clarify its effectiveness. More specifically, by L2-normalizing both the features and the weight vectors, it removes scale variations; then, based on a cosine margin term, it maximizes the decision margin in angular distance. As a result, it minimizes intra-class variance and maximizes inter-class variance. CosFace is given by

L_{\text{cosface}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s(\cos \theta_{y_i} - m)}}{e^{s(\cos \theta_{y_i} - m)} + \sum_{j=1, j \neq y_i}^{n} e^{s \cos \theta_j}}, \qquad (2.7)

where m is the margin parameter.

Deng et al. (2018) suggested ArcFace, modifying the loss function to enhance intra-class compactness and inter-class discrepancy. In its combined-margin form, ArcFace is given by

L_{\text{arcface}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s(\cos(m_1 \theta_{y_i} + m_2) - m_3)}}{e^{s(\cos(m_1 \theta_{y_i} + m_2) - m_3)} + \sum_{j=1, j \neq y_i}^{n} e^{s \cos \theta_j}}, \qquad (2.8)

where m_1, m_2, m_3 are margin parameters.
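As a minimal sketch of the combined margin in Equation 2.8 (assuming PyTorch; the defaults s = 64 and m_2 = 0.5 are illustrative, and features and class weights are L2-normalized so that each logit is a cosine):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MarginSoftmax(nn.Module):
    """Combined-margin softmax of Equation 2.8 computed on L2-normalized
    features and class weights, so each logit equals cos(theta_j)."""

    def __init__(self, feat_dim: int, n_classes: int, s: float = 64.0,
                 m1: float = 1.0, m2: float = 0.5, m3: float = 0.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, feat_dim))
        self.s, self.m1, self.m2, self.m3 = s, m1, m2, m3

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        cos = F.linear(F.normalize(x), F.normalize(self.weight))  # cos(theta_j)
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        margin = torch.cos(self.m1 * theta + self.m2) - self.m3  # target logit
        onehot = F.one_hot(y, cos.size(1)).bool()
        logits = self.s * torch.where(onehot, margin, cos)  # margin only at y_i
        return F.cross_entropy(logits, y)  # -(1/N) sum_i log softmax at y_i
```

Setting m_1 = 1 and m_3 = 0 gives the additive angular margin of Deng et al. (2018), while m_1 = 1, m_2 = 0, m_3 = m recovers the CosFace loss of Equation 2.7.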


2.4 Stacking

Stacking is a method that uses a high-level model to combine lower-level models to achieve greater predictive accuracy. We used stacking with different methods: the unweighted mean and stacked generalization. Moreover, Young et al. (2018) suggested the super learner, a cross-validation-based stacking method; it should be noted that the super learner weights each class, whereas ours does not.

Unweighted averaging is the most common ensemble approach for neural networks. It takes the unweighted average of the output probabilities of all models and predicts

\hat{y} = \arg\max_{j \in \{1, \ldots, K\}} \sum_{i=1}^{N} p_{i,j}, \qquad (2.9)

where K is the number of classes, N here denotes the number of models, and p_{i,j} is the softmax probability that model i assigns to class j.

In stacked generalization, one generally uses logistic regression with the output probabilities or predicted classes as covariates. Ting and Witten (1997) showed that using the output class probabilities as covariates gives better performance in stacked generalization. So our networks produce output class probabilities for use by the level-1 generalizer.
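A minimal sketch of both ensembles, assuming scikit-learn and NumPy, where probs is a list holding each network's (n_samples, K) softmax output on held-out data; the logistic regression level-1 generalizer follows the choice described above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def unweighted_mean_predict(probs):
    """Equation 2.9: average the models' softmax outputs and take the argmax."""
    return np.mean(probs, axis=0).argmax(axis=1)

def fit_stacked_generalizer(probs, y):
    """Level-1 generalizer: logistic regression whose covariates are the
    concatenated output class probabilities of the level-0 networks."""
    level1_x = np.hstack(probs)  # shape (n_samples, n_models * K)
    return LogisticRegression(max_iter=1000).fit(level1_x, y)
```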


Chapter 3

Data

3.1 Data collecting

The VGGFace2¹ dataset contains over 8,000 identities spanning a wide range of ethnicities, accents, professions, and ages, with over 3.3 million faces and more than 300 samples per subject. Using face detection, we obtained from 150 to 400 images for each of 300 classes. For face detection we used the Haar cascades in the OpenCV modules. In cases where all the data must be kept, the model can depend on the performance of the face detector; however, we found that other face detection methods, such as LBP and improved LBP, made no difference in capturing an identity's face. We also found that our model performs better on the face-cropped version than on the original image dataset, and about 40 GB of images is too expensive in computation and time.

¹ http://www.robots.ox.ac.uk/~vgg/data/vgg_face2/


3.2 Data preprocessing

We used OpenCV to capture each identity's face; the Haar cascade (Viola and Jones (2001)) and LBP did not differ in capturing faces. Capturing the identity's face is crucial because each picture can involve different outdoor circumstances, hair styles, and so on. Also, as a person grows up, his or her changing style becomes the main problem: we would later have to re-train our previous model on an aged dataset, and the training interval could become another issue. The VGGFace2 dataset also covers a wide range of ages, so style is an issue for capturing identity. But using a face-only image set, we can neglect the style issues and focus only on each person's unique identity. Also, for convenience, we resized each picture to (128, 128, 3).
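A minimal sketch of this face-cropping step, assuming OpenCV's bundled frontal-face Haar cascade; the detector parameters and the largest-box rule are illustrative choices:

```python
import cv2

# OpenCV's bundled frontal-face Haar cascade (Viola and Jones (2001)).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_face(path, size=128):
    """Detect the largest face in an image and resize it to (size, size, 3)."""
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None  # no face detected; skip this image
    x, y, w, h = max(faces, key=lambda box: box[2] * box[3])  # largest box
    return cv2.resize(img[y:y + h, x:x + w], (size, size))
```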

After that, we split the dataset into train/validation/test sets. The validation and test sets each contain about 10% of the total set, and the remaining 80% is used for training; naturally, we stratified by identity. Since we have many model structures and loss functions with many hyperparameters, performing K-fold cross-validation would incur tremendously heavy computational costs: with three model structures, three loss functions, and their hyperparameters, a vast amount of learning time would be consumed. Hence, as in many deep learning frameworks, K-fold cross-validation is hard to perform.
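A minimal sketch of the stratified 80/10/10 split, assuming scikit-learn, with image data in X and identity labels in y:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the data, stratified by identity, then split the
# held-out part evenly into validation and test sets (10% of the total each).
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=0)
```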


Figure 3.1: Extracted faces from the VGGFace2 dataset for the same identity


Chapter 4

Simulation and Evaluation

4.1 Training methods

In the training procedure, we proceed as follows. First, we fix the hyperparameters and estimate all model parameters (trainable parameters) by minimizing the loss function with an L2 penalty term. Then we choose the hyperparameters by the accuracy measure

\mathbb{E}\left[ \mathbb{1}\{ y = \hat{f}(x, \Theta, \Lambda) \} \right], \qquad (4.1)

where x is the feature data, y is the class label, (x, y) is a pair from the validation set, and \Theta and \Lambda are the model parameters and hyperparameters, respectively. The model with the highest accuracy under Equation 4.1 is selected, and its performance is evaluated on the test data.

For efficient tuning we used random search (Bergstra and Bengio (2012)). Grid search can be powerful in terms of model performance, but it can take a lot of time; random search, as shown in many deep learning papers, saves time while giving roughly the same performance. The hyperparameters are the regularization parameter (λ), the learning rate (α), and the loss function (l). Let V(x, λ, α, l) be the validation accuracy. We ran this tuning procedure for each model structure and loss function; Algorithm 1 describes the hyperparameter tuning procedure.

Algorithm 1 Hyperparameter tuning
    Initialize λ^(0), α^(0), l^(0) and t = 0
    while t < 5 do
        λ^(t+1) = arg max_λ V(x, λ, α^(t), l^(t))
        α^(t+1) = arg max_α V(x, λ^(t+1), α, l^(t))
        l^(t+1) = arg max_l V(x, λ^(t+1), α^(t+1), l)
        t = t + 1
    return λ^(5), α^(5), l^(5)
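A minimal sketch of Algorithm 1, assuming a validation_accuracy(lam, alpha, loss) callback that trains a model under the given hyperparameters and returns its validation accuracy; the candidate grids and the random-search approximation of each arg max are illustrative assumptions, not values from the thesis:

```python
import random

lambdas = [0.0, 2e-5, 2e-4, 2e-3]           # candidate regularization strengths
alphas = [0.1, 0.01, 0.001, 0.0001]         # candidate learning rates
losses = ["softmax", "cosface", "arcface"]  # candidate loss functions

def tune(validation_accuracy, n_draws=3, rounds=5):
    """Coordinate-wise maximization of V(x, lam, alpha, l) as in Algorithm 1,
    approximating each arg max by a few randomly drawn candidates."""
    lam, alpha, loss = lambdas[0], alphas[0], losses[0]  # initialize at t = 0
    for _ in range(rounds):  # while t < 5
        lam = max(random.sample(lambdas, n_draws),
                  key=lambda v: validation_accuracy(v, alpha, loss))
        alpha = max(random.sample(alphas, n_draws),
                    key=lambda v: validation_accuracy(lam, v, loss))
        loss = max(losses,
                   key=lambda v: validation_accuracy(lam, alpha, v))
    return lam, alpha, loss
```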


4.2 Results

We note that in our case, the cosface loss and the VGG16 network did not converge for any tuning parameters, and even when they converged, performance was not good. Our baseline models sometimes converged, also with poor performance. And although ResNet26 has about 1.5 times more parameters than ResNet18, it fails to capture identities. Below are some good results of the tuning procedure.

Figure 4.1: Training loss

In Figure 4.1, ResNet18 with the softmax loss was selected during the tuning procedure. Model1, model2, and model3 are all ResNet18 with the softmax loss; the difference between the three models is the hyperparameters listed in Table 4.1. The corresponding results for each model are presented in Table 4.2.

Table 4.1: Tuning hyperparameters

models    α        1 − r    λ          epoch
model1    0.01     1        0          200
model2    0.001    1        0.00002    200
model3    0.001    1        0.0002     200

Table 4.2: Tuning result

models    train loss    train accuracy    valid accuracy    test accuracy
model1    0.16009       0.97033           0.75412           0.94371
model2    0.04374       0.98728           0.87010           0.97098
model3    0.04438       0.99661           0.90033           0.97672

Since the class sizes differ, the validation and test sets are stratified by class, each containing 10% of the total data set.

Note that the arcface loss outperformed the other loss functions; its model performance reached 1. Deng et al. (2018) achieved 99.85% on the MS-Celeb-1M dataset with a single ResNet100 network, and Wang and Deng (2018) summarized the performance of face classification papers. Since our preprocessed VGGFace2 dataset is smaller than the Celeb data, it is not surprising that our model achieved 99.99% test accuracy with a single network compared to the results of Deng et al. (2018); arcface can thus achieve essentially 100% performance here.

Also, cropping face images can induce an overfitting problem: effectively the same data can end up in the train, validation, and test sets after face cropping. This means that a perfectly overfitted model looks like a perfect model from the perspective of measured performance.

Our stacking results are as follows.

Table 4.3: Stacking result

methods                            performance
model (train)                      0.98474
model (test)                       0.96380
un-weighted mean                   0.88300
stacked generalization (train)     0.98593
stacked generalization (test)      0.96448

We used three different models in stacking; model (train) and model (test) report the average accuracy of the three models. Note that stacked generalization with 5-fold cross-validation performs slightly better.


Chapter 5

Discussion

We evaluated 4 models (baseline, VGG16, ResNet18, ResNet26) with 3 losses (softmax, cosface, arcface), but found that VGG16 and ResNet26 did not converge well, and neither did cosface. We also wanted to try other ensemble methods such as bagging and boosting; to use them, we had to identify how well the trained features (the last layer's features) cluster. Using t-SNE (van der Maaten and Hinton (2008)), our trained features were not clustered, as in Figure 5.1. During the training step, with image normalization, near-perfect clusters formed, as in Figure 5.2, but after training ended we could not find those clusters. So we conclude that we cannot identify the trained features' quality this way, which is why the stacking methods were used.
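A minimal sketch of that feature inspection, assuming scikit-learn and matplotlib, with last-layer features in an (n_samples, d) array and integer identity labels:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_feature_clusters(features, labels, out_path="tsne.png"):
    """Embed last-layer features in 2-D with t-SNE and color the points
    by identity to check visually whether the learned features cluster."""
    embedded = TSNE(n_components=2, random_state=0).fit_transform(features)
    plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=4, cmap="tab20")
    plt.title("t-SNE of last-layer features")
    plt.savefig(out_path, dpi=150)
```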


Figure 5.1: Non-normalized plot


Figure 5.2: Normalized plot on training


References

Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13:281–305.

Deng, J., Guo, J., Xue, N., and Zafeiriou, S. (2018). ArcFace: Additive angular margin loss for deep face recognition.

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep residual learning for image recognition.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Identity mappings in deep residual networks.

Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., and Song, L. (2017). SphereFace: Deep hypersphere embedding for face recognition.

Schroff, F., Kalenichenko, D., and Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. arXiv:1503.03832.

Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015). Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Tan, M. and Le, Q. V. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks.

Ting, K. M. and Witten, I. (1997). Stacked generalization: when does it work?

van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

Viola, P. and Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), volume 1, pages I–I.

Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., and Liu, W. (2018). CosFace: Large margin cosine loss for deep face recognition.

Wang, M. and Deng, W. (2018). Deep face recognition: A survey.

Xu, B., Wang, N., Chen, T., and Li, M. (2015). Empirical evaluation of rectified activations in convolutional network.

Young, S., Abdou, T., and Bener, A. (2018). Deep super learner: A deep ensemble for classification problems.

Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. (2015). Learning deep features for discriminative localization.

Abstract (Korean)

This thesis evaluates deep convolutional neural networks with various structures and loss functions. Using the VGGFace2 data, we verify the performance of convolutional neural network models such as VGG and ResNet with the cross-entropy, cosface, and arcface losses, and use stacking for the ensemble. Unlike conventional face image classification, the algorithm first performs face detection so that it can focus only on each person's identity, which improves performance. Convolutional neural networks are then built on the face images. Finally, all network outputs are gathered and ensembled to validate the model. This thesis focuses on model structures and a tuning procedure based on random search, and confirms that the ensemble improves on the performance of the individual models.

Keywords: Convolutional Neural Network, VGGFace2, Loss functions, Stacking

Student Number: 2017-21465