CS 510: Lecture 8 - Deep Learning, Fairness, and Bias


Page 1:

CS 510: Lecture 8

Deep Learning, Fairness, and Bias

Page 2:

Next Week

• All Presentations, all the time

• Upload your presentation before class if using slides

• Sign up for a timeslot in the Google Doc, if you haven’t already done so

Page 3:

Artificial Neural Networks: History

• Belief that it was necessary to model underlying brain architecture for AI

• In contrast to encoded symbolic knowledge (best represented by expert systems)

• Hebb: learning is altering the strength of synaptic connections

Page 4:

Neural Networks

• Attempt to build a computation system based on the parallel architecture of brains

• Characteristics:

• Many simple processing elements

• Many connections

• Simple messages

• Adaptive interaction

Page 5:

Brains

10^11 neurons of > 20 types, 10^14 synapses, 1 ms–10 ms cycle time. Signals are noisy “spike trains” of electrical potential.

[Figure: a biological neuron, labelling the cell body (soma), nucleus, dendrites, axon, axonal arborization, synapses, and an axon arriving from another cell (Chapter 20, Section 5)]

Page 6:

Benefits of NN

• User friendly (well, reasonably)

• Non-linear

• Noise tolerant

• Many applications

• Credit fraud/assignment

• Robotic Control

Page 7:

“Neurons”

• Inputs (either from outside or other “neurons”)

• Weighted connections that correspond to synaptic efficiency

• Threshold values to weight the inputs

• Passed through activation function to determine output

Page 8:

Example Unit

• Binary input/output

• Rule

• Output 1 if w0*I0 + w1*I1 + wb > 0

• Output 0 if w0*I0 + w1*I1 + wb <= 0 (see the sketch below)
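As a concrete illustration, here is a minimal Python sketch of this rule; the weight values are made-up examples, not values from the slides.

def binary_unit(i0, i1, w0=0.5, w1=0.5, wb=-0.7):
    # Binary threshold unit: output 1 when the weighted sum of the inputs
    # plus the bias weight exceeds 0, otherwise output 0.
    s = w0 * i0 + w1 * i1 + wb
    return 1 if s > 0 else 0

# With these example weights the unit behaves like logical AND:
# binary_unit(1, 1) -> 1, while binary_unit(1, 0) and binary_unit(0, 1) -> 0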

Page 9:

Activation functions

[Figure: two activation functions g(in_i) plotted against the summed input in_i, each saturating at +1 (Chapter 20, Section 5)]

(a) is a step function or threshold function

(b) is a sigmoid function, 1/(1 + e^(−x))

Changing the bias weight W0,i moves the threshold location

Note similarity to logistic regression...
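For reference, a minimal Python sketch of the two activation functions above (the function names are my own):

import math

def step(x):
    # (a) hard threshold: output jumps from 0 to 1 at x = 0
    return 1.0 if x > 0 else 0.0

def sigmoid(x):
    # (b) smooth "squashing" function 1 / (1 + e^-x), the same form used in
    # logistic regression
    return 1.0 / (1.0 + math.exp(-x))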

Page 10:

[Figure: a single unit combining three inputs, 1.4, -2.5, and -0.06, through weights W1, W2, and W3 and an activation f(x)]

Page 11:

[Figure: the same unit, now with concrete numbers on both the weights and the inputs (2.7, -8.6, 0.002 and 1.4, -2.5, -0.06)]

x = (-0.06 × 2.7) + (-2.5 × -8.6) + (1.4 × 0.002) = 21.34

Page 12:

How to Adapt?

• Perceptron Learning Rule

• change the weight by an amount proportional to the difference between the desired output and the actual output.

• As an equation: ΔWi = η(D − Y)Ii, where D is the desired output, Y is the actual output, and η is the learning rate

• Stop when the weights converge (a code sketch follows below)
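A minimal numpy sketch of the perceptron learning rule; the learning rate, epoch count, and OR-gate data are illustrative assumptions.

import numpy as np

def train_perceptron(X, D, eta=0.1, epochs=100):
    # X: (n_examples, n_inputs) array of inputs, D: desired 0/1 outputs
    w = np.zeros(X.shape[1])  # connection weights
    wb = 0.0                  # bias weight (an always-on input)
    for _ in range(epochs):
        for I, d in zip(X, D):
            y = 1 if I @ w + wb > 0 else 0  # actual output of the unit
            w += eta * (d - y) * I          # delta W_i = eta * (D - Y) * I_i
            wb += eta * (d - y)
    return w, wb

# Example: logical OR is linearly separable, so the rule converges
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
D = np.array([0, 1, 1, 1])
w, wb = train_perceptron(X, D)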

Page 13:

Limits of Perceptrons

• Minsky and Papert 1969

• Fails on “linearly inseparable” instances

• XOR

• Linearly separable: the pattern space can be separated by a single hyperplane

Page 14:

Perceptrons vs Decision Trees

Page 15:

Multilayer Perceptrons (MLP)

Page 16:

Back Propagation

• Start with a set of known examples (supervised approach)

• Assign random initial weights

• Run examples through and calculate the mean-squared error

• Propagate the error by making small changes to the weights at each level

• Use chain rule to calculate the gradient efficiently

• Lather, rinse, repeat (these steps are sketched in code below)
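A minimal numpy sketch of these steps for a network with one hidden layer; the layer sizes, learning rate, and iteration count are illustrative assumptions, and the four training rows mirror the example dataset shown later in these slides.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Known examples (supervised): input vectors and their target class
X = np.array([[1.4, 2.7, 1.9], [3.8, 3.4, 3.2], [6.4, 2.8, 1.7], [4.1, 0.1, 0.2]])
t = np.array([[0.0], [0.0], [1.0], [0.0]])

# Assign random initial weights
W1 = rng.normal(scale=0.1, size=(3, 5))  # input -> hidden
W2 = rng.normal(scale=0.1, size=(5, 1))  # hidden -> output
eta = 0.5

for _ in range(2000):
    # Run the examples through and calculate the mean-squared error
    h = sigmoid(X @ W1)
    y = sigmoid(h @ W2)
    mse = np.mean((y - t) ** 2)  # can be monitored to watch convergence

    # Chain rule: propagate the error back through each layer
    d_out = (y - t) * y * (1 - y)
    d_hid = (d_out @ W2.T) * h * (1 - h)

    # Make small changes to the weights at each level
    W2 -= eta * (h.T @ d_out) / len(X)
    W1 -= eta * (X.T @ d_hid) / len(X)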

Page 17:

Gradient Descent Algorithm

Have some function J(θ0, θ1)

Want to minimize J(θ0, θ1) over θ0, θ1

Outline:

• Start with some θ0, θ1

• Keep changing θ0, θ1 to reduce J(θ0, θ1)

until we hopefully end up at a minimum

Page 18:

[Figure: plot of J(θ0, θ1)]

The gradient of J (∇J) at a point can be thought of as a vector indicating which way is “uphill”

If J is an error function, we want to move “downhill”, opposite to the gradient

Page 19:

Gradient descent algorithm

Have a function J. Want to produce a sequence of vectors s.t. J(θ1) > J(θ2) > ...

Start with θ0 and iterate: θi+1 = θi − αi ∇J(θi)

α (alpha) is the learning rate
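A minimal Python sketch of this update rule; the quadratic objective in the usage example is a made-up stand-in for J.

import numpy as np

def gradient_descent(grad_J, theta0, alpha=0.1, steps=100):
    # Repeatedly step opposite to the gradient:
    # theta_{i+1} = theta_i - alpha * grad_J(theta_i)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - alpha * grad_J(theta)
    return theta

# Example: J(theta) = ||theta||^2 has gradient 2*theta and its minimum at 0
theta_min = gradient_descent(lambda th: 2.0 * th, theta0=[3.0, -4.0])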

Page 20:

Stochastic Gradient Descent

Update the parameters every time you look at a training example, using the gradient of the error on just that one example
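The same idea as a sketch, taking one small step per training example rather than per full pass; the per-example gradient function and the toy data are assumptions.

import numpy as np

def sgd(grad_on_example, data, theta0, alpha=0.01, epochs=10):
    # Stochastic gradient descent: one small weight update per training example
    theta = np.asarray(theta0, dtype=float)
    for _ in range(epochs):
        for example in data:
            theta = theta - alpha * grad_on_example(theta, example)
    return theta

# Example: minimizing the average of (theta - x)^2 drives theta toward the mean
data = [1.0, 2.0, 3.0, 6.0]
theta = sgd(lambda th, x: 2.0 * (th - x), data, theta0=0.0)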

Page 21:

Some non-linear activation functions…

Page 22:

Most common activation function

Page 23:

A dataset

Fields           class
1.4  2.7  1.9    0
3.8  3.4  3.2    0
6.4  2.8  1.7    1
4.1  0.1  0.2    0
etc …

Page 24:

Training the neural network (same dataset as above)

Page 25:

Training data (as above)

Initialise with random weights

Page 26:

Training data (as above)

Present a training pattern: inputs 1.4, 2.7, 1.9

Page 27:

Training data (as above)

Feed it through to get output: inputs 1.4, 2.7, 1.9 → output 0.8

Page 28:

Training data (as above)

Compare with target output: for inputs 1.4, 2.7, 1.9 the output is 0.8, the target is 0, so the error is 0.8

Page 29:

Training data (as above)

Adjust weights based on error (output 0.8, target 0, error 0.8)

Page 30:

Training data (as above)

Present a training pattern: inputs 6.4, 2.8, 1.7

Page 31:

Training data (as above)

Feed it through to get output: inputs 6.4, 2.8, 1.7 → output 0.9

Page 32:

Training data (as above)

Compare with target output: for inputs 6.4, 2.8, 1.7 the output is 0.9, the target is 1, so the error is -0.1

Page 33:

Training data (as above)

Adjust weights based on error (output 0.9, target 1, error -0.1)

Page 34:

Training data (as above)

And so on ….

Repeat this thousands, maybe millions of times – each time taking a random training instance and making slight weight adjustments. Algorithms for weight adjustment are designed to make changes that will reduce the error.

Page 35:

The decision boundary perspective…

Initial random weights

Page 36:

The decision boundary perspective…

Present a training instance / adjust the weights

Page 37:

The decision boundary perspective…

Present a training instance / adjust the weights

Page 38:

The decision boundary perspective…

Present a training instance / adjust the weights

Page 39:

The decision boundary perspective…

Present a training instance / adjust the weights

Page 40:

The decision boundary perspective…

Eventually ….

Page 41:

The point I am trying to make

• weight-learning algorithms for NNs are dumb

• they work by making thousands and thousands of tiny adjustments, each making the network do better at the most recent pattern, but perhaps a little worse on many others

• but, by dumb luck, eventually this tends to be good enough to learn effective classifiers for many real applications

Page 42:

Some other points

Detail of a standard NN weight learning algorithm – later

If f(x) is non-linear, a network with 1 hidden layer can, in theory, learn perfectly any classification problem. A set of weights exists that can produce the targets from the inputs. The problem is finding them.

Page 43:

Some other ‘by the way’ points

If f(x) is linear, the NN can only draw straight decision boundaries (even if there are many layers of units)

Page 44:

Some other ‘by the way’ points

NNs use nonlinear f(x) so they can draw complex boundaries, but keep the data unchanged

Page 45:

Some other ‘by the way’ points

NNs use nonlinear f(x) so they can draw complex boundaries, but keep the data unchanged. SVMs only draw straight lines, but they transform the data first in a way that makes that OK.

Page 46:

Neural network vocabulary

• Neuron = logistic regression or similar function

• Input layer = input training/test vector

• Bias unit = intercept term/always on feature

• Activation = response

• Activation function is a logistic (or similar “sigmoid” nonlinearity)

• Backpropagation = running stochastic gradient descent across a multilayer network

• Weight decay = regularization or Bayesian prior

Page 47:

Deep Learning

• Most current machine learning works well because of human-designed representations and input features

• Machine learning becomes just optimizing weights to best make a final prediction

• Representation learning attempts to automatically learn good features or representations

• Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction

Page 48:

Deep Architecture

Page 49:

Deep Learning Overview

• Train networks with many layers (vs. shallow nets with just a couple of layers)

• Multiple layers work to build an improved feature space

• First layer learns 1st order features (e.g. edges…)

• 2nd layer learns higher order features (combinations of first layer features, combinations of edges, etc.)

• In current models layers often learn in an unsupervised mode and discover general features of the input space – serving multiple tasks related to the unsupervised instances (image recognition, etc.)

• Then final layer features are fed into supervised layer(s)

• And the entire network is often subsequently fine-tuned with supervised training, starting from the initial weights learned in the unsupervised phase

• Could also do fully supervised versions, etc. (early BP attempts)

Page 50:

Why Deep Learning?

Page 51:

Learning Representations

• Handcrafting features is time-consuming

• The features are often both over-specified and incomplete

• The work has to be done again for each task/domain/...

• We must move beyond handcrafted features and simple ML

• Humans develop representations for learning and reasoning

• Our computers should do the same

Page 52:

The Curse of Dimensionality

Page 53:

Unsupervised Feature and Weight Learning

• Today, most practical, good NLP & ML methods require labeled training data (i.e., supervised learning)

• But almost all data is unlabeled

• Most information must be acquired unsupervised

• Fortunately, a good model of observed data can really help you learn classification decisions

Page 54:

Learning Multiple Levels of Representation

Page 55:

Successive Layers Learn Deeper Representations

[Figure: feature hierarchy learned by successive layers: pixels → edges → object parts (combinations of edges) → object models]

Page 56:

Impressive Results Especially on Large Datasets

• Object Recognition - better than anything out there

• Speech Recognition (google voice search)

• Many other perceptual tasks in vision and NLP

Page 57:

Why now?

• Bigger Data - deep learning works best with lots of data

• Better Hardware - multicore CPUs and GPUs

• Better Algorithms - autoencoders, deep belief networks, etc.

• These let us train multiple inner layers well

Page 58:

Breakthrough: Unsupervised Pre-training

Page 59:

Difficulties with Supervised Networks

• Early layers of MLP do not get trained well

• Diffusion of Gradient – error attenuates as it propagates to earlier layers

• Leads to very slow training

• Exacerbated because the top couple of layers can usually learn any task “pretty well”, so the error signal reaching earlier layers drops quickly as the top layers “mostly” solve the task; the lower layers never get the opportunity to use their capacity to improve results and just act as a random feature map

• Need a way for early layers to do effective work

• Often not enough labeled data available while lots of unlabeled data

• Can we use unsupervised/semi-supervised approaches to take advantage of the unlabeled data?

• Deep networks tend to have more local minima problems than shallow networks during supervised training

Page 60:

Semi-supervised Learning

Page 61:

Semi-supervised Learning

Page 62:

Training Deep Networks

• Build a feature space

• Note that this is what we do with SVM kernels, or trained hidden layers in BP, etc., but now we will build the feature space using deep architectures

• Unsupervised training between layers can decompose the problem into distributed sub-problems (with higher levels of abstraction) to be further decomposed at subsequent layers

Page 63:

Greedy Layer-wise Training

• Train first layer using your data without the labels (unsupervised)

• Since there are no targets at this level, labels don't help. Could also use the more abundant unlabeled data which is not part of the training set (i.e. self-taught learning).

• Freeze the first layer parameters and start training the second layer using the output of the first layer as the unsupervised input to the second layer

• Repeat this for as many layers as desired

• This builds our set of robust features

• Use the outputs of the final layer as inputs to a supervised layer/model and train the last supervised layer(s) (leave early weights frozen)

• Unfreeze all weights and fine-tune the full network with supervised training, starting from the pre-trained weight settings (the whole recipe is sketched below)
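A minimal numpy sketch of this recipe using two plain sigmoid auto-encoder layers; the layer sizes, learning rate, random data, and the choice of simple auto-encoders (rather than any particular variant) are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, eta=0.1, epochs=500):
    # Train one layer, without labels, to reconstruct its own input;
    # return only the encoder weights (the decode layer is dropped).
    n_in = X.shape[1]
    W_enc = rng.normal(scale=0.1, size=(n_in, n_hidden))
    W_dec = rng.normal(scale=0.1, size=(n_hidden, n_in))
    for _ in range(epochs):
        h = sigmoid(X @ W_enc)               # code / learned features
        r = sigmoid(h @ W_dec)               # reconstruction of the input
        d_r = (r - X) * r * (1 - r)          # backprop through the decoder
        d_h = (d_r @ W_dec.T) * h * (1 - h)  # ... and through the encoder
        W_dec -= eta * (h.T @ d_r) / len(X)
        W_enc -= eta * (X.T @ d_h) / len(X)
    return W_enc

# Unlabeled data (made up); no targets are used during pre-training
X = rng.random((100, 20))

# Layer 1: train on the raw input, then freeze it
W1 = train_autoencoder(X, n_hidden=10)
H1 = sigmoid(X @ W1)

# Layer 2: train on the (frozen) output of layer 1
W2 = train_autoencoder(H1, n_hidden=5)
H2 = sigmoid(H1 @ W2)

# H2 is the learned feature space: feed it to a supervised layer, train that
# layer, then unfreeze W1 and W2 and fine-tune the whole network with backprop.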

Page 64:

Greedy Layer-wise Training

• Greedy layer-wise training avoids many of the problems of trying to train a deep net in a supervised fashion

• Each layer gets full learning focus in its turn since it is the only current "top" layer

• Can take advantage of the unlabeled data

• When you finally tune the entire network with supervised training, the network weights have already been adjusted so that you are in a good error basin and just need fine-tuning. This helps with problems of:

• Ineffective early layer learning

• Deep network local minima

• We will discuss the two most common approaches

• Stacked Auto-Encoders

• Deep Belief Networks

Page 65:

The new way to train multi-layer NNs…

Train this layer first

Page 66:

The new way to train multi-layer NNs…

Train this layer first

then this layer

Page 67:

The new way to train multi-layer NNs…

Train this layer first

then this layer

then this layer

Page 68:

The new way to train multi-layer NNs…

Train this layer first

then this layer

then this layer

then this layer

Page 69:

The new way to train multi-layer NNs…

Train this layer first

then this layer

then this layer

then this layer

finally this layer

Page 70:

The new way to train multi-layer NNs…

EACH of the (non-output) layers is trained to be an auto-encoder.

Basically, it is forced to learn good features that describe what comes from the previous layer.

Page 71:

an auto-encoder is trained, with an absolutely standard weight-adjustment algorithm, to reproduce the input

Page 72:

an auto-encoder is trained, with an absolutely standard weight-adjustment algorithm, to reproduce the input

By making this happen with (many) fewer units than the inputs, this forces the ‘hidden layer’ units to become good feature detectors

Page 73:

One Auto-encoder


Page 74:

Stacked Auto-encoders

• Stack sparse auto-encoders on top of each other, drop the decode layer each time


Page 75:

Stacked auto-encoders

• Do supervised training on the last layer

• Then do supervised training on the whole network to fine-tune the weights


Page 76:

Manifold Learning Hypothesis


Page 77:

Caveats

• Prevent the layers from just learning the identity (learn features instead):

– Undercomplete - middle layer smaller than the input

– Sparsity - penalize hidden unit activations; use regularization to keep most nodes at or near 0

– Denoising - stochastically corrupt the training instance, but train the encoder to decode the uncorrupted instance (sketched below)

– Contractive - force the encoder to have small derivatives (stay on the manifold)
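For the denoising variant in particular, a minimal sketch of the corruption step (the masking-noise level is an arbitrary choice): the encoder is fed a stochastically corrupted instance, while the reconstruction error is still measured against the clean one.

import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, drop_prob=0.3):
    # Masking noise: randomly zero out a fraction of the input values
    mask = rng.random(x.shape) >= drop_prob
    return x * mask

# Training pair for a denoising auto-encoder: the encoder sees corrupt(x),
# but the reconstruction error is measured against the original, clean x
x = np.array([1.4, 2.7, 1.9])
x_corrupted = corrupt(x)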

Page 78:

Fairness and Learning

• Going to show a video of Aylin

• Link on the course website to her talk (may be easier for online students if there is feedback)