
Page 1

Sixth Italian Workshop on Machine Learning and Data Mining (MLDM)

Kernel-based non-parametric activation functions for neural networks

Authors: S. Scardapane, S. Van Vaerenbergh and A. Uncini

Page 2

Table of contents

1 Introduction

2 Non-parametric activation functions

3 Proposed kernel activation functions

4 Experimental results

5 Conclusions and future work

Page 3

Basic NN architecture

The basic layer of a neural network alternates a linear projection with a pointwise nonlinearity:

$\mathbf{h}_l = g_l\left(\mathbf{W}_l \mathbf{h}_{l-1} + \mathbf{b}_l\right)$ , (1)

There is a huge literature on the linear component, e.g., initialization, compression, fast multiplication...

In most cases, the parameters $\{\mathbf{W}_l, \mathbf{b}_l\}$ are the only adaptable components of the network.
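As a minimal NumPy sketch of Eq. (1) (the function name and the shapes below are illustrative, not taken from the slides):

```python
import numpy as np

def dense_layer(h_prev, W, b, g=np.tanh):
    # One basic layer: linear projection followed by a pointwise nonlinearity, Eq. (1).
    return g(W @ h_prev + b)

# Toy usage with random parameters.
rng = np.random.default_rng(0)
h0 = rng.standard_normal(4)          # input vector h_0
W1 = rng.standard_normal((3, 4))     # weight matrix W_1
b1 = np.zeros(3)                     # bias vector b_1
h1 = dense_layer(h0, W1, b1)         # hidden representation h_1
```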

Page 4

What about the nonlinearity?

The choice of the nonlinearity is crucial:

• Having differentiable activations was the basic ingredient for backpropagation.

• In the last decade, ReLU functions $g(s) = \max(0, s)$ made it possible to train deep NNs with hundreds of layers.

• Many recent papers propose new activation functions, e.g., the Swish function [1] (see the code sketch below):

$g(s) = s \cdot \mathrm{sigmoid}(s)$ . (2)

Can we learn the activation functions?

[1] Ramachandran, P., Zoph, B. and Le, Q.V., 2017. Swish: a Self-Gated Activation Function. arXiv preprint arXiv:1710.05941.
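A minimal NumPy sketch of the Swish function in Eq. (2) (function names are illustrative):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def swish(s):
    # Swish, Eq. (2): g(s) = s * sigmoid(s).
    return s * sigmoid(s)
```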


Page 6

Parametric activation functions

Making a single activation function parametric is relatively simple, e.g., we can add a learnable scale and bandwidth to a tanh:

$g(s) = \dfrac{a\left(1 - \exp\{-bs\}\right)}{1 + \exp\{-bs\}}$ . (3)

Or learn the slope for the negative part of the ReLU (PReLU):

$g(s) = \begin{cases} s & \text{if } s \ge 0 \\ \alpha s & \text{otherwise} \end{cases}$ . (4)

These parametric AFs have a small number of trainable parameters, but their flexibility is severely limited.
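A brief NumPy sketch of the two parametric activations above, Eqs. (3) and (4); the parameter names a, b, and alpha follow the equations, and the default values are arbitrary:

```python
import numpy as np

def scaled_tanh(s, a=1.0, b=1.0):
    # Parametric tanh of Eq. (3), with learnable scale a and bandwidth b.
    return a * (1.0 - np.exp(-b * s)) / (1.0 + np.exp(-b * s))

def prelu(s, alpha=0.25):
    # PReLU of Eq. (4): identity for s >= 0, learnable slope alpha for s < 0.
    return np.where(s >= 0, s, alpha * s)
```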

Page 7

Table of contents

1 Introduction

2 Non-parametric activation functions

3 Proposed kernel activation functions

4 Experimental results

5 Conclusions and future work

Page 8

Adaptive piecewise linear units

An APL nonlinearity is the sum of S linear segments:

$g(s) = \max\{0, s\} + \sum_{i=1}^{S} a_i \max\{0, -s + b_i\}$ . (5)

This is non-parametric because S is a user-defined hyper-parameter controlling the flexibility of the unit.

The APL introduces S + 1 points of non-differentiability for each neuron, which may hamper the optimization algorithm. Also, in practice, having S > 3 seems to have little additional effect on the resulting shapes.

[1] Agostinelli, F., Hoffman, M., Sadowski, P. and Baldi, P., 2014. Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830.
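As a rough sketch (not the authors' implementation), Eq. (5) can be evaluated in NumPy as follows, with a and b holding the S learnable slopes and offsets:

```python
import numpy as np

def apl(s, a, b):
    # APL unit of Eq. (5): a ReLU plus a sum of S learnable hinge functions.
    out = np.maximum(0.0, s)
    for a_i, b_i in zip(a, b):
        out = out + a_i * np.maximum(0.0, -s + b_i)
    return out

# Example with S = 2 segments (values chosen arbitrarily):
# apl(0.5, a=[0.2, -0.1], b=[1.0, -1.0])
```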

Page 9

Spline activation functions

A SAF uses cubic interpolation over a set of adaptable control points:

[Figure: SAF output (y-axis) as a function of its activation value (x-axis).]

However, regularizing the control points is non-trivial, and SAFs cannot be easily accelerated on GPUs.

[1] Vecci, L., Piazza, F. and Uncini, A., 1998. Learning and approximation capabilities of adaptive spline activation function neural networks. Neural Networks, 11(2), pp. 259-270.
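To illustrate the idea only (this is an assumption, not the SAF implementation from [1], which relies on local spline patches), a cubic interpolation over a fixed grid of control points can be built with SciPy:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Fixed x-grid; the y-values of the control points would be the adaptable parameters.
x_grid = np.linspace(-3.0, 3.0, 15)
y_ctrl = np.tanh(x_grid)           # e.g., initialized to a tanh shape

saf = CubicSpline(x_grid, y_ctrl)  # cubic interpolation over the control points
print(saf(0.5))                    # evaluate the spline activation at s = 0.5
```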

Page 10

Maxout neurons

A Maxout replaces an entire neuron by taking the maximum over K separate linear projections:

$g(\mathbf{h}) = \max_{i=1,\dots,K} \left\{ \mathbf{w}_i^T \mathbf{h} + b_i \right\}$ . (6)

With two maxout neurons, a NN with one hidden layer remains a universal approximator, provided K is sufficiently large.

However, it is impossible to plot the resulting functions when the input has more than a few dimensions, and the number of parameters can increase drastically with respect to K.

[1] Goodfellow, I.J., Warde-Farley, D., Mirza, M., Courville, A. and Bengio, Y., 2013. Maxout networks. Proc. 30th Int. Conf. on Machine Learning.
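A minimal NumPy sketch of a Maxout neuron, Eq. (6); here W stacks the K weight vectors row-wise, and all names and shapes are illustrative:

```python
import numpy as np

def maxout(h, W, b):
    # Maxout neuron of Eq. (6): maximum over K separate linear projections.
    # W has shape (K, len(h)), b has shape (K,).
    return np.max(W @ h + b)
```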

Page 11

Visualization of a Maxout neuron

[Figure: Maxout activation (y-axis) as a function of a 1D input (x-axis).]

Page 12

Table of contents

1 Introduction

2 Non-parametric activation functions

3 Proposed kernel activation functions

4 Experimental results

5 Conclusions and future work

Page 13

Basic structure of the KAF

We model each activation function in terms of a kernel expansion over D terms as:

$g(s) = \sum_{i=1}^{D} \alpha_i \kappa(s, d_i)$ , (7)

where:

1 $\{\alpha_i\}_{i=1}^{D}$ are the mixing coefficients;

2 $\{d_i\}_{i=1}^{D}$ are the dictionary elements;

3 $\kappa(\cdot, \cdot) : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is a 1D kernel function.

To make everything tractable, we only adapt the mixing coefficients, and for the dictionary we sample D values over the x-axis, uniformly around zero.

[1] Scardapane, S., Van Vaerenbergh, S. and Uncini, A., 2017. Kafnets: kernel-based non-parametric activation functions for neural networks. arXiv preprint arXiv:1707.04035.
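A minimal NumPy sketch of the kernel expansion in Eq. (7); the kernel itself is passed in as a function (see the Gaussian choice on the next slide), and all names are illustrative:

```python
import numpy as np

def kaf(s, alpha, dictionary, kernel):
    # KAF of Eq. (7): fixed dictionary d_i, adaptable mixing coefficients alpha_i.
    # s: activations, shape (N,); alpha, dictionary: shape (D,).
    K = kernel(s[:, None], dictionary[None, :])   # kernel values, shape (N, D)
    return K @ alpha                              # g(s) for each activation
```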

Page 14

Kernel selection

For our experiments, we use the 1D Gaussian kernel defined as:

$\kappa(s, d_i) = \exp\left\{-\gamma (s - d_i)^2\right\}$ , (8)

where $\gamma \in \mathbb{R}$ is called the kernel bandwidth. Based on some preliminary experiments, we use the following rule-of-thumb for selecting the bandwidth:

$\gamma = \dfrac{1}{6\Delta^2}$ , (9)

where $\Delta$ is the distance between the grid points.
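A sketch of the Gaussian kernel of Eq. (8) together with the rule-of-thumb of Eq. (9); the dictionary range and size below are arbitrary illustrative choices:

```python
import numpy as np

D = 20                                    # dictionary size (hyper-parameter)
dictionary = np.linspace(-3.0, 3.0, D)    # D values sampled uniformly around zero
delta = dictionary[1] - dictionary[0]     # distance between grid points
gamma = 1.0 / (6.0 * delta ** 2)          # bandwidth rule-of-thumb, Eq. (9)

def gaussian_kernel(s, d):
    # 1D Gaussian kernel, Eq. (8).
    return np.exp(-gamma * (s - d) ** 2)
```

This kernel can be plugged directly into the kaf sketch from the previous slide, e.g., kaf(s, alpha, dictionary, gaussian_kernel).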

Page 15

Choosing the bandwidth

Figure 1: Examples of KAFs. In all cases we sample uniformly 20 points on the x-axis, while the mixing coefficients are sampled from a normal distribution. The three plots show three different choices for γ: (a) γ = 2.0, (b) γ = 0.5, (c) γ = 0.1.

Page 16

Initialization of the mixing coefficients

Besides initializing the mixing coefficients randomly, we can also approximate any initial function using kernel ridge regression (KRR):

$\boldsymbol{\alpha} = \left(\mathbf{K} + \varepsilon \mathbf{I}\right)^{-1} \mathbf{t}$ , (10)

where $\mathbf{K} \in \mathbb{R}^{D \times D}$ is the kernel matrix computed between the desired points $\mathbf{t}$ and the elements of the dictionary $\mathbf{d}$.
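A sketch of the KRR initialization in Eq. (10), assuming the Gaussian kernel and the dictionary defined earlier (function names are illustrative):

```python
import numpy as np

def krr_init(dictionary, target_fn, gamma, eps=1e-6):
    # Solve (K + eps*I) alpha = t, Eq. (10), so the KAF approximates target_fn at init.
    t = target_fn(dictionary)                                              # desired outputs
    K = np.exp(-gamma * (dictionary[:, None] - dictionary[None, :]) ** 2)  # D x D kernel matrix
    return np.linalg.solve(K + eps * np.eye(len(dictionary)), t)

# Example: initialize the mixing coefficients so that the KAF starts out as a tanh.
# alpha0 = krr_init(dictionary, np.tanh, gamma, eps=1e-6)
```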

Page 17

Examples of initialization

Figure 2: Two examples of initializing a KAF using KRR, with $\varepsilon = 10^{-6}$. (a) A hyperbolic tangent. (b) The ELU function. The red dots indicate the corresponding initialized values for the mixing coefficients.

Page 18

Multi-dimensional KAFs

We also consider a two-dimensional variant (2D-KAF), which acts on a pair of activation values:

$g(\mathbf{s}) = \sum_{i=1}^{D^2} \alpha_i \kappa(\mathbf{s}, \mathbf{d}_i)$ , (11)

where $\mathbf{d}_i$ is the $i$-th element of the dictionary, and we now have $D^2$ adaptable coefficients $\{\alpha_i\}_{i=1}^{D^2}$ sampled over the plane.

In this case, we consider the 2D Gaussian kernel:

$\kappa(\mathbf{s}, \mathbf{d}_i) = \exp\left\{-\gamma \left\|\mathbf{s} - \mathbf{d}_i\right\|_2^2\right\}$ . (12)

Page 19

Advantages of the framework

1 Universal approximation properties.

2 Very simple to vectorize and to accelerate on GPUs.

3 Smooth over the entire domain.

4 Mixing coefficients can be regularized easily, including the use of sparse penalties.

Page 20

Table of contents

1 Introduction

2 Non-parametric activation functions

3 Proposed kernel activation functions

4 Experimental results

5 Conclusions and future work

Page 21

Visualizing the functions

Figure 3: Examples of 6 trained KAFs (with random initialization) on the Sensorless dataset. On the y-axis we plot the output value of the KAF. The KAF after initialization is shown with a dashed red line, while the final KAF is shown with a solid green line. The distribution of the activation values is shown as a reference in light blue.

Page 22

Results on the SUSY benchmark

Activation function          Testing AUC      Trainable parameters
ReLU (five hidden layers)    0.8739 (0.001)   367201
ELU (five hidden layers)     0.8739 (0.001)   367201
SELU (five hidden layers)    0.8745 (0.002)   367201
PReLU (five hidden layers)   0.8748 (0.001)   368701
Maxout (one layer)           0.8744 (0.001)   17401
Maxout (two layers)          0.8744 (0.002)   288301
APL (one layer)              0.8744 (0.002)   7801
APL (two layers)             0.8757 (0.002)   99901
KAF (one layer)              0.8756 (0.001)   12001
KAF (two layers)             0.8758 (0.001)   108301

Table 1: Results on the SUSY benchmark.

Page 23

Table of contents

1 Introduction

2 Non-parametric activation functions

3 Proposed kernel activation functions

4 Experimental results

5 Conclusions and future work

Page 24

Conclusions and future work

1 We proposed a novel family of non-parametric activation functions, framed in terms of a kernel expansion of their input value.

2 KAFs combine several advantages of previous approaches, without introducing an excessive number of additional parameters.

3 Networks trained with these activations can obtain a higher accuracy while being significantly smaller.

4 Alternative choices for the kernel expansion are possible, e.g., dictionary selection strategies, alternative kernels (e.g., periodic kernels), and several others.

5 The framework provides a further link between neural networks and kernel methods, opening up a large number of variations with respect to our initial approach.

Page 25

Thanks for your attention. Questions?