
MATH 829: Introduction to Data Mining and Analysis

Neural networks II

Dominique Guillot

Department of Mathematical Sciences

University of Delaware

April 13, 2016

This lecture is based on the UFLDL tutorial (http://deeplearning.stanford.edu/)


Recall

We have:

a^{(2)}_1 = f(W^{(1)}_{11} x_1 + W^{(1)}_{12} x_2 + W^{(1)}_{13} x_3 + b^{(1)}_1)

a^{(2)}_2 = f(W^{(1)}_{21} x_1 + W^{(1)}_{22} x_2 + W^{(1)}_{23} x_3 + b^{(1)}_2)

a^{(2)}_3 = f(W^{(1)}_{31} x_1 + W^{(1)}_{32} x_2 + W^{(1)}_{33} x_3 + b^{(1)}_3)

h_{W,b}(x) = a^{(3)}_1 = f(W^{(2)}_{11} a^{(2)}_1 + W^{(2)}_{12} a^{(2)}_2 + W^{(2)}_{13} a^{(2)}_3 + b^{(2)}_1).


Recall (cont.)

Vector form:

z^{(2)} = W^{(1)} x + b^{(1)}
a^{(2)} = f(z^{(2)})
z^{(3)} = W^{(2)} a^{(2)} + b^{(2)}
h_{W,b}(x) = a^{(3)} = f(z^{(3)}).
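The following NumPy sketch (not part of the original slides) implements this forward pass for the small network recalled above; the sigmoid activation and the 3-3-1 layer sizes are assumptions made only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed example sizes: 3 inputs, 3 hidden units, 1 output (as in the slide).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.01, (3, 3)), np.zeros(3)   # W^(1), b^(1)
W2, b2 = rng.normal(0, 0.01, (1, 3)), np.zeros(1)   # W^(2), b^(2)

def forward(x, f=sigmoid):
    """Forward propagation following the vector form above."""
    z2 = W1 @ x + b1          # z^(2) = W^(1) x + b^(1)
    a2 = f(z2)                # a^(2) = f(z^(2))
    z3 = W2 @ a2 + b2         # z^(3) = W^(2) a^(2) + b^(2)
    return f(z3)              # h_{W,b}(x) = a^(3) = f(z^(3))

x = np.array([0.5, -1.0, 2.0])
print(forward(x))
```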


Training neural networks

Suppose we have:

- A neural network with s_l neurons in layer l (l = 1, ..., n_l).
- Observations (x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)}) \in \mathbb{R}^{s_1} \times \mathbb{R}^{s_{n_l}}.

We would like to choose W^{(l)} and b^{(l)} in some optimal way for all l.

Let

J(W, b; x, y) := \frac{1}{2} \| h_{W,b}(x) - y \|_2^2    (Squared error for one sample).

Define

J(W, b) := \frac{1}{m} \sum_{i=1}^{m} J(W, b; x^{(i)}, y^{(i)}) + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2

(average squared error with Ridge penalty).

Note:

- The Ridge penalty prevents overfitting.
- We do not penalize the bias terms b^{(l)}_i.
- The loss function J(W, b) is not convex.
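As a sketch of this objective (my own illustration, not from the slides; the sigmoid default and the list-of-matrices parameter layout are assumptions), J(W, b) could be computed as follows:

```python
import numpy as np

def cost(Ws, bs, X, Y, lam, f=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """Average squared error plus Ridge penalty on the weights (not the biases).

    Ws, bs: lists of weight matrices W^(l) and bias vectors b^(l).
    X, Y:   arrays of shape (m, s_1) and (m, s_{n_l}).
    """
    m = X.shape[0]
    total = 0.0
    for x, y in zip(X, Y):
        a = x
        for W, b in zip(Ws, bs):        # forward propagation
            a = f(W @ a + b)
        total += 0.5 * np.sum((a - y) ** 2)   # squared error for one sample
    penalty = 0.5 * lam * sum(np.sum(W ** 2) for W in Ws)
    return total / m + penalty
```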


Some remarks

- The loss function J(W, b) is often used both for regression and classification.
- In classification problems, we choose the labels y ∈ {0, 1} (if working with sigmoid) or y ∈ {−1, 1} (if working with tanh).
- For regression problems, we scale the output so that y ∈ [0, 1] (if working with sigmoid) or y ∈ [−1, 1] (if working with tanh).
- We will use gradient descent to minimize J(W, b). Note that since the function is non-convex, we may only find a local minimum.
- We need an initial choice for W^{(l)}_{ij} and b^{(l)}_i. If we initialize all the parameters to 0, then the parameters remain constant over the layers because of the symmetry of the problem.
- As a result, we initialize the parameters to small random values (say, drawn from N(0, ε^2) with ε = 0.01); see the sketch below.
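A minimal sketch of such a random initialization, assuming the same list-of-matrices layout used in the earlier sketches:

```python
import numpy as np

def init_params(layer_sizes, eps=0.01, seed=0):
    """Initialize each W^(l), b^(l) with small N(0, eps^2) values to break symmetry."""
    rng = np.random.default_rng(seed)
    Ws = [rng.normal(0.0, eps, (layer_sizes[l + 1], layer_sizes[l]))
          for l in range(len(layer_sizes) - 1)]
    bs = [rng.normal(0.0, eps, layer_sizes[l + 1])
          for l in range(len(layer_sizes) - 1)]
    return Ws, bs

Ws, bs = init_params([3, 3, 1])   # matches the small network recalled earlier
```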


Gradient descent and the backpropagation algorithm

We update the parameters using gradient descent as follows:

W^{(l)}_{ij} \leftarrow W^{(l)}_{ij} - \alpha \frac{\partial}{\partial W^{(l)}_{ij}} J(W, b)

b^{(l)}_i \leftarrow b^{(l)}_i - \alpha \frac{\partial}{\partial b^{(l)}_i} J(W, b).

Here \alpha > 0 is a parameter (the learning rate).

Observe that:

\frac{\partial}{\partial W^{(l)}_{ij}} J(W, b) = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial W^{(l)}_{ij}} J(W, b; x^{(i)}, y^{(i)}) + \lambda W^{(l)}_{ij}

\frac{\partial}{\partial b^{(l)}_i} J(W, b) = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial b^{(l)}_i} J(W, b; x^{(i)}, y^{(i)}).

Therefore, it suffices to compute the derivatives of J(W, b; x^{(i)}, y^{(i)}).
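A sketch (mine, not from the slides) of how the per-sample derivatives could be combined and used in the update; the helper grad_sample is an assumed callback that returns the per-sample derivatives, for instance the backpropagation sketch given after the next slide:

```python
import numpy as np

def batch_gradients(Ws, bs, X, Y, lam, grad_sample):
    """Combine per-sample derivatives into the gradient of J(W, b), as above."""
    m = X.shape[0]
    gW = [np.zeros_like(W) for W in Ws]
    gb = [np.zeros_like(b) for b in bs]
    for x, y in zip(X, Y):
        dW, db = grad_sample(Ws, bs, x, y)
        for l in range(len(Ws)):
            gW[l] += dW[l] / m
            gb[l] += db[l] / m
    # The Ridge term is added only for the weights, not the biases.
    gW = [g + lam * W for g, W in zip(gW, Ws)]
    return gW, gb

def gradient_step(Ws, bs, gW, gb, alpha=0.1):
    """One gradient-descent update with learning rate alpha (value is illustrative)."""
    for l in range(len(Ws)):
        Ws[l] -= alpha * gW[l]
        bs[l] -= alpha * gb[l]
    return Ws, bs
```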


Computing the derivatives using backpropagation

1. Compute the activations for all the layers.

2. For each output unit i in layer n_l (the output layer), compute

   \delta^{(n_l)}_i := \frac{\partial}{\partial z^{(n_l)}_i} \frac{1}{2} \| y - h_{W,b}(x) \|_2^2 = -(y_i - a^{(n_l)}_i) \cdot f'(z^{(n_l)}_i).

3. For l = n_l - 1, n_l - 2, ..., 2: for each node i in layer l, set

   \delta^{(l)}_i := \left( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} \delta^{(l+1)}_j \right) \cdot f'(z^{(l)}_i).

4. Compute the desired partial derivatives:

   \frac{\partial}{\partial W^{(l)}_{ij}} J(W, b; x, y) = a^{(l)}_j \delta^{(l+1)}_i

   \frac{\partial}{\partial b^{(l)}_i} J(W, b; x, y) = \delta^{(l+1)}_i.
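The following NumPy sketch (mine, not from the slides) implements these four steps for a fully connected network with sigmoid activation; it returns the per-sample derivatives expected by grad_sample in the previous sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_sample(Ws, bs, x, y, f=sigmoid,
                    fprime=lambda z: sigmoid(z) * (1 - sigmoid(z))):
    """Per-sample derivatives of J(W, b; x, y) via the four steps above."""
    # Step 1: forward pass, storing z^(l) and a^(l).
    activations, zs = [x], []
    a = x
    for W, b in zip(Ws, bs):
        z = W @ a + b
        zs.append(z)
        a = f(z)
        activations.append(a)
    # Step 2: delta for the output layer.
    delta = -(y - activations[-1]) * fprime(zs[-1])
    dW = [None] * len(Ws)
    db = [None] * len(bs)
    # Steps 3-4: propagate deltas backwards and form the partial derivatives.
    for l in range(len(Ws) - 1, -1, -1):
        dW[l] = np.outer(delta, activations[l])   # dJ/dW^(l)_{ij} = a^(l)_j delta^(l+1)_i
        db[l] = delta                             # dJ/db^(l)_i   = delta^(l+1)_i
        if l > 0:
            delta = (Ws[l].T @ delta) * fprime(zs[l - 1])
    return dW, db
```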


Autoencoders

An autoencoder learns the identity function:

- Input: unlabeled data.
- Output = input.
- Idea: limit the number of hidden units to discover structure in the data.
- Learn a compressed representation of the input.

[Figure: autoencoder network diagram. Source: UFLDL tutorial.]

Can also learn a sparse network by including supplementary constraints on the weights.
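As a rough illustration only (not from the slides), one could train such an autoencoder with scikit-learn's MLPRegressor, introduced at the end of these slides, by using the inputs as their own targets; the bottleneck size, random data, and hyperparameters below are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Fit output = input with a small bottleneck hidden layer (compressed representation).
rng = np.random.default_rng(0)
X = rng.random((500, 100))          # e.g. 500 flattened 10x10 "images"

auto = MLPRegressor(hidden_layer_sizes=(25,), activation="logistic",
                    max_iter=2000, random_state=0)
auto.fit(X, X)                      # the targets are the inputs themselves
X_rec = auto.predict(X)             # reconstructions from the 25-unit code
print(np.mean((X - X_rec) ** 2))    # reconstruction error
```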


Example (UFLDL)

Train an autoencoder on 10 × 10 images with one hidden layer.

Each hidden unit i computes:

a^{(2)}_i = f\left( \sum_{j=1}^{100} W^{(1)}_{ij} x_j + b^{(1)}_i \right).

Think of a^{(2)}_i as some non-linear feature of the input x.

Problem: Find the x that maximally activates a^{(2)}_i subject to \|x\|_2 \le 1.

Claim:

x_j = \frac{W^{(1)}_{ij}}{\sqrt{\sum_{j=1}^{100} (W^{(1)}_{ij})^2}}.

(Hint: use Cauchy-Schwarz.)

We can now display the image maximizing a^{(2)}_i for each i.
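A small sketch of this claim (my own illustration): it assumes a trained weight matrix W1 of shape 100 × 100; random values stand in for trained weights here just to show the shapes and the display step.

```python
import numpy as np
import matplotlib.pyplot as plt

def max_activating_inputs(W1):
    """Row-normalize W^(1): row i is the unit-norm input that maximally activates a^(2)_i."""
    norms = np.sqrt(np.sum(W1 ** 2, axis=1, keepdims=True))
    return W1 / norms

# Assumed shapes for illustration: 100 hidden units, 10x10 = 100 inputs.
W1 = np.random.default_rng(0).normal(0, 0.01, (100, 100))
X = max_activating_inputs(W1)

# Display each maximizing input as a 10x10 image (as on the next slide).
fig, axes = plt.subplots(10, 10, figsize=(8, 8))
for i, ax in enumerate(axes.ravel()):
    ax.imshow(X[i].reshape(10, 10), cmap="gray")
    ax.axis("off")
plt.show()
```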


Example (cont.)

100 hidden units on 10 × 10 pixel inputs:

[Figure: the 100 learned features, each displayed as a 10 × 10 image.]

The different hidden units have learned to detect edges at different positions and orientations in the image.


Sparse neural networks

- So far, we have discussed dense neural networks.
- Dense networks have many parameters to learn, which can make them inefficient or impossible to train.
- Sparse models have been proposed in the literature.
- Some of these models draw inspiration from how the early visual system is wired up in biology.


Using convolutions

- Idea: Certain signals are stationary, i.e., their statistical properties do not change in space or time.
- For example, images often have similar statistical properties in different regions of space.
- This suggests that features learned on one part of an image can also be applied to other parts of the image.
- We can "convolve" the learned features with the larger image.

Example: a 96 × 96 image.

- Learn features on small 8 × 8 patches sampled randomly from the image (e.g. using a sparse autoencoder).
- Run the trained model over all 8 × 8 patches of the image to get the feature activations (see the sketch below).

(Source: UFLDL tutorial.)
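A naive sketch of this convolution step (mine, not from the slides), assuming the learned features are stored as the rows of a matrix W1 with biases b1:

```python
import numpy as np

def feature_activations(image, W1, b1, f=lambda z: 1.0 / (1.0 + np.exp(-z)), patch=8):
    """Convolve learned features with a larger image.

    W1 has shape (k, patch*patch): one learned feature per row (e.g. from a
    sparse autoencoder trained on patch x patch patches); b1 has shape (k,).
    Returns an array of shape (k, H - patch + 1, W - patch + 1).
    """
    H, Wd = image.shape
    k = W1.shape[0]
    out = np.zeros((k, H - patch + 1, Wd - patch + 1))
    for r in range(H - patch + 1):
        for c in range(Wd - patch + 1):
            x = image[r:r + patch, c:c + patch].ravel()
            out[:, r, c] = f(W1 @ x + b1)   # activation of every feature on this patch
    return out

# Example with assumed sizes: a 96x96 image and 100 learned 8x8 features.
rng = np.random.default_rng(0)
image = rng.random((96, 96))
W1, b1 = rng.normal(0, 0.01, (100, 64)), np.zeros(100)
acts = feature_activations(image, W1, b1)
print(acts.shape)   # (100, 89, 89)
```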


Pooling features

- One can also pool the features obtained via convolution.
- For example, to describe a large image, one natural approach is to aggregate statistics of these features at various locations, e.g. compute the mean, max, etc. over different regions (see the sketch below).
- Pooling can lead to more robust features and to invariant features.
- For example, if the pooling regions are contiguous, then the pooling units will be "translation invariant", i.e., they won't change much if objects in the image undergo a (small) translation.
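A sketch of mean/max pooling over non-overlapping regions of the convolved activations from the previous sketch (my own illustration; the region size is an assumption):

```python
import numpy as np

def pool(acts, region=4, op=np.mean):
    """Aggregate convolved feature activations over non-overlapping regions.

    acts: array of shape (k, H, W), as returned by feature_activations above.
    op:   np.mean for mean pooling, np.max for max pooling, etc.
    """
    k, H, W = acts.shape
    Hp, Wp = H // region, W // region
    pooled = np.zeros((k, Hp, Wp))
    for r in range(Hp):
        for c in range(Wp):
            block = acts[:, r * region:(r + 1) * region, c * region:(c + 1) * region]
            pooled[:, r, c] = op(block, axis=(1, 2))
    return pooled

# e.g. mean-pool the (100, 89, 89) activations from the previous sketch:
# pooled = pool(acts, region=4, op=np.mean)   # shape (100, 22, 22)
```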


Neural networks with scikit-learn

Need to install the 0.18-dev version (http://scikit-learn.org/stable/developers/contributing.html#retrieving-the-latest-code).

- sklearn.neural_network.MLPClassifier
- sklearn.neural_network.MLPRegressor
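A minimal usage sketch (assuming scikit-learn 0.18 or later is available; the dataset and hyperparameters are chosen only for illustration):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Small worked example on the digits dataset.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# alpha is the L2 (Ridge) penalty on the weights, as in the objective J(W, b) above.
clf = MLPClassifier(hidden_layer_sizes=(50,), alpha=1e-4, max_iter=500, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```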
