Neural Networks and Kernel Methods
How are we doing on the pass sequence?
• We can now track both men, provided with
  – Hand-labeled coordinates of both men in 30 frames
  – Hand-extracted features (stripe detector, white blob detector)
  – Hand-labeled classes for the white-shirt tracker
• We have a framework for how to optimally make decisions and track the men
Generally, this will take a lot longer than 24 hours… We need to avoid doing this by hand!
Recall: Multi-input linear regression

$y(\mathbf{x},\mathbf{w}) = w_0 + w_1\phi_1(\mathbf{x}) + w_2\phi_2(\mathbf{x}) + \dots + w_M\phi_M(\mathbf{x})$
• x can be an entire scan-line or image!
• We could try to uniformly distribute basis functions in the input space:
• This is futile, because of the curse of dimensionality: the number of basis functions needed grows exponentially with the dimensionality of x

[Figure: x = an entire scan line, with basis functions tiled uniformly along it]
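To make the curse concrete, here is a tiny sketch (the resolution of 10 basis-function centers per dimension is an illustrative assumption, not from the slides) of how a uniform grid of basis functions grows with input dimension:

```python
# Curse of dimensionality: tiling the input space with basis functions.
# With k centers per dimension, a uniform grid in D dimensions needs k**D
# basis functions, which is hopeless for D = 320 (an entire scan line).
k = 10  # illustrative: 10 basis-function centers per input dimension
for D in [1, 2, 3, 10, 320]:
    n = k ** D
    print(f"D = {D:3d}: k**D has {len(str(n))} digits")
```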
Neural networks and kernel methods
Two main approaches to avoiding the curse of dimensionality:

– “Neural networks”
  • Parameterize the basis functions and learn their locations
  • Can be nested to create a hierarchy
  • Regularize the parameters or use Bayesian learning

– “Kernel methods”
  • The basis functions are associated with data points, limiting complexity
  • A subset of data points may be selected to further limit complexity
Two-layer neural networks
• Before, we used $y(\mathbf{x},\mathbf{w}) = w_0 + \sum_j w_j\,\phi_j(\mathbf{x})$

• Replace each $\phi_j(\mathbf{x})$ with a variable $z_j$, where

  $z_j = h\left(\sum_i w^{(1)}_{ji} x_i\right)$

  and $h()$ is a fixed activation function

• The outputs are obtained from

  $y_k = \sigma\left(\sum_j w^{(2)}_{kj} z_j\right)$

  where $\sigma()$ is another fixed function

• In all, we have (simplifying biases):

  $y_k(\mathbf{x},\mathbf{w}) = \sigma\left(\sum_j w^{(2)}_{kj}\, h\left(\sum_i w^{(1)}_{ji} x_i\right)\right)$
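A minimal NumPy sketch of this two-layer form, with tanh hidden units, a linear output for $\sigma$, and biases dropped; all sizes and names below are illustrative, not from the slides:

```python
import numpy as np

def forward(x, W1, W2):
    """Two-layer network: z = h(W1 x), y = sigma(W2 z),
    with h = tanh and sigma = identity (a linear output unit)."""
    a = W1 @ x          # first-layer activations a_j = sum_i w_ji x_i
    z = np.tanh(a)      # hidden-unit outputs z_j = h(a_j)
    y = W2 @ z          # linear outputs y_k = sum_j w_kj z_j
    return y

rng = np.random.default_rng(0)
D, M, K = 4, 3, 2                         # input, hidden, output sizes (illustrative)
W1 = rng.normal(scale=0.1, size=(M, D))   # first-layer weights
W2 = rng.normal(scale=0.1, size=(K, M))   # second-layer weights
print(forward(rng.normal(size=D), W1, W2))
```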
Typical activation functions
• Logistic sigmoid (its inverse is the logit):

  $h(a) = \sigma(a) = 1/(1+e^{-a})$

• Hyperbolic tangent:

  $h(a) = \tanh(a) = (e^a - e^{-a})/(e^a + e^{-a})$

• Cumulative Gaussian (error function):

  $h(a) = 2\int_{-\infty}^{a} \mathcal{N}(x\,|\,0,1)\,dx - 1$

  – This one has a lighter tail

[Figure: h(a) plotted against a for the three functions, normalized to have the same range and slope at a = 0; a second panel shows h on a log scale.]
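A small sketch of the three activations; the cumulative-Gaussian one uses the identity $2\Phi(a) - 1 = \mathrm{erf}(a/\sqrt{2})$, where $\Phi$ is the standard normal CDF:

```python
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def hyperbolic_tangent(a):
    return math.tanh(a)

def cumulative_gaussian(a):
    # 2 * Phi(a) - 1 = erf(a / sqrt(2)); rescaled to range (-1, 1) like tanh
    return math.erf(a / math.sqrt(2.0))

for a in (-2.0, 0.0, 2.0):
    print(a, logistic(a), hyperbolic_tangent(a), cumulative_gaussian(a))
```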
Examples of functions learned by a neural network (3 tanh hidden units, one linear output unit)
Multi-layer neural networks
• Only weights corresponding to the feed-forward topology are instantiated
• The sum is over those values of $j$ with instantiated weights $w_{kj}$
From now on, we’ll denote all activation functions by h
Learning neural networks
• As for regression, we consider a squared error cost function:

  $E(\mathbf{w}) = \tfrac{1}{2} \sum_n \sum_k \left( t_{nk} - y_k(\mathbf{x}_n,\mathbf{w}) \right)^2$

  which corresponds to a Gaussian density $p(\mathbf{t}|\mathbf{x})$

• We can substitute the two-layer form of $y_k(\mathbf{x}_n,\mathbf{w})$ from above and use a general-purpose optimizer to estimate $\mathbf{w}$, but it is illustrative and useful to study the derivatives of $E$…
Learning neural networks
$E(\mathbf{w}) = \tfrac{1}{2} \sum_n \sum_k \left( t_{nk} - y_k(\mathbf{x}_n,\mathbf{w}) \right)^2$

• Recall that for linear regression:

  $\partial E(\mathbf{w})/\partial w_m = -\sum_n (t_n - y_n)\, x_{nm}$

• We’ll use the chain rule of differentiation to derive a similar-looking expression, where
  – Local input signals are forward-propagated from the input
  – Local error signals are back-propagated from the output

[Figure: each derivative is the product of an error signal, an input signal, and the weight in between the error signal and the input signal.]
Local signals needed for learning
• For clarity, consider the error for one training case:

  $E_n = \tfrac{1}{2} \sum_k \left( t_{nk} - y_k(\mathbf{x}_n,\mathbf{w}) \right)^2$

• To compute $\partial E_n/\partial w_{ji}$, note that $w_{ji}$ appears in only one term of the overall expression, namely $a_j = \sum_i w_{ji} z_i$

• Using the chain rule of differentiation, we have

  $\partial E_n/\partial w_{ji} = \delta_j z_i$, where $\delta_j \equiv \partial E_n/\partial a_j$

  (if $w_{ji}$ is in the 1st layer, $z_i$ is actually the input $x_i$)

[Figure: the weight $w_{ji}$ sits between the local error signal $\delta_j$ and the local input signal $z_i$.]
Forward-propagating local input signals
• Forward propagation gives all the a’s and z’s
Back-propagating local error signals
• Back-propagation gives all the $\delta$’s

[Figure: error signals flowing backward through the network from the targets $t_1$ and $t_2$.]
Back-propagating error signals
• To compute $\partial E_n/\partial a_j$ ($= \delta_j$), note that $a_j$ appears in all those expressions $a_k = \sum_i w_{ki} h(a_i)$ that depend on $a_j$

• Using the chain rule, we have

  $\delta_j = \partial E_n/\partial a_j = \sum_k \left(\partial E_n/\partial a_k\right)\left(\partial a_k/\partial a_j\right)$

• The sum is over $k$ s.t. unit $j$ is connected to unit $k$, and for each such term, $\partial a_k/\partial a_j = w_{kj}\, h'(a_j)$

• Noting that $\partial E_n/\partial a_k = \delta_k$, we get the back-propagation rule:

  $\delta_j = h'(a_j) \sum_k w_{kj}\, \delta_k$

• For output units: $\delta_k = y_k - t_k$
Putting the propagations together
• For each training case $n$, apply forward propagation and back-propagation to compute

  $\partial E_n/\partial w_{ji} = \delta_j z_i$

  for each weight $w_{ji}$

• Sum these over training cases to compute $\partial E/\partial w_{ji} = \sum_n \partial E_n/\partial w_{ji}$

• Use these derivatives for steepest-descent learning or as input to a conjugate-gradient optimizer, etc.

• On-line learning: after each pattern presentation, use the above gradient to update the weights
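Putting both propagations together, a minimal NumPy sketch of one gradient evaluation and one steepest-descent step for the two-layer network above (tanh hidden units, linear outputs, squared error; the sizes, seed, and learning rate are illustrative):

```python
import numpy as np

def gradients(x, t, W1, W2):
    """One training case: forward-propagate the a's and z's,
    back-propagate the deltas, return dEn/dW1 and dEn/dW2."""
    # Forward propagation
    a = W1 @ x                                    # hidden activations a_j
    z = np.tanh(a)                                # hidden outputs z_j = h(a_j)
    y = W2 @ z                                    # linear outputs y_k
    # Back-propagation
    delta_out = y - t                             # output units: delta_k = y_k - t_k
    delta_hid = (1 - z**2) * (W2.T @ delta_out)   # delta_j = h'(a_j) sum_k w_kj delta_k
    # Derivative = local error signal times local input signal
    dW2 = np.outer(delta_out, z)                  # dEn/dw_kj = delta_k z_j
    dW1 = np.outer(delta_hid, x)                  # dEn/dw_ji = delta_j x_i
    return dW1, dW2

# One steepest-descent update on a toy case
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(3, 4))
W2 = rng.normal(scale=0.1, size=(2, 3))
x, t, eta = rng.normal(size=4), np.array([1.0, -1.0]), 0.1
dW1, dW2 = gradients(x, t, W1, W2)
W1 -= eta * dW1
W2 -= eta * dW2
```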
The number of hidden units determines the complexity of the learned function
(M = # hidden units)
The effect of local minima
• Because of random weight initialization, each training run will find a different solution
[Figure: validation error plotted against the number of hidden units M; multiple runs at each M land at different error values.]
Regularizing neural networks
Demonstration of over-fitting (M = # hidden units)
Regularizing neural networks
• Use cross-validation to select the network architecture (number of layers, number of units per layer)

• Add to $E$ a term $(\lambda/2)\sum_{ji} w_{ji}^2$ that penalizes large weights, so

  $\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + (\lambda/2)\sum_{ji} w_{ji}^2$

  Use cross-validation to select $\lambda$

• Use early stopping and cross-validation (next slide)

• Take a Bayesian approach: put a prior on the $w$’s and integrate over them to make predictions
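Continuing the sketch above, weight decay only adds $\lambda w$ to the gradient ($\lambda$ here is an illustrative value; in practice it would be chosen by cross-validation):

```python
lam = 1e-3                 # illustrative weight-decay strength lambda
dW1, dW2 = gradients(x, t, W1, W2)
dW1 += lam * W1            # gradient of (lam/2) * sum_ji w_ji**2
dW2 += lam * W2
W1 -= eta * dW1
W2 -= eta * dW2
```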
Early stopping
• The weights start at small values and grow
• Perhaps the number of learning iterations is a surrogate for model complexity?
• This works for some learning tasks
[Figure: training error decreases with the number of learning iterations, while validation error eventually starts to rise; stop where validation error is lowest.]
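A sketch of the early-stopping loop; `train_step` and `validation_error` below are hypothetical stand-ins for a real weight update and a real held-out-set evaluation:

```python
import random
random.seed(0)

def train_step(w):
    return w + 0.01   # stand-in: the weights start at small values and grow

def validation_error(w):
    return (w - 0.3) ** 2 + 0.01 * random.random()   # stand-in: minimized near w = 0.3

w, best_err, best_w, since_best = 0.0, float("inf"), None, 0
for iteration in range(10_000):
    w = train_step(w)
    err = validation_error(w)
    if err < best_err:                  # validation error still improving
        best_err, best_w, since_best = err, w, 0
    else:
        since_best += 1
        if since_best >= 50:            # no improvement for 50 checks: stop
            break
print(best_w, best_err)                 # keep the weights with lowest validation error
```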
Can we use a standard neural network to automatically learn the features needed for tracking?
• x is 320-dimensional, so the number of parameters would be at least 320
• We have only 15 data points (setting aside 15 for cross-validation), so over-fitting will be an issue

• We could try weight decay, Bayesian learning, etc., but a little thinking reveals that our approach is wrong…

• In fact, we want the weights connecting different positions in the scan line to use the same feature (e.g., stripes)

[Figure: x = an entire scan line, fully connected to the hidden units]
Convolutional neural networks

• Recall that a short portion of the scan line was sufficient for tracking the striped shirt

• We can use this idea to build a convolutional network, in which the same set of weights is used for all hidden units

• With constrained weights, the number of free parameters is now only ~ one dozen, so…

• We can use Bayesian/regularized learning to automatically learn the features
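A minimal sketch of the shared-weight idea in 1-D (the window size, scan-line length, and choice of h = tanh are illustrative): every hidden unit applies the same dozen-or-so weights to a different window of the scan line.

```python
import numpy as np

def conv1d_layer(x, w, b):
    """Hidden units z_j = h(sum_i w_i * x_{j+i} + b): one shared weight
    vector w slides along the scan line, one hidden unit per position."""
    F = len(w)
    a = np.array([w @ x[j:j + F] + b for j in range(len(x) - F + 1)])
    return np.tanh(a)

rng = np.random.default_rng(0)
x = rng.normal(size=320)             # an entire scan line
w = rng.normal(scale=0.1, size=11)   # ~a dozen shared weights (the free parameters)
z = conv1d_layer(x, w, b=0.0)
print(z.shape)                       # (310,) hidden units, all sharing w and b
```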
Convolutional neural networks in 2-D (from LeCun et al., 1989)
Neural networks and kernel methods
Two main approaches to avoiding the curse of dimensionality:

– “Neural networks”
  • Parameterize the basis functions and learn their locations
  • Can be nested to create a hierarchy
  • Regularize the parameters or use Bayesian learning

– “Kernel methods”
  • The basis functions are associated with data points, limiting complexity
  • A subset of data points may be selected to further limit complexity
Kernel methods
• Basis functions offer a way to enrich the feature space, making simple methods (such as linear regression and linear classifiers) much more powerful
• Example: input x; features x, x², x³, sin(x), …

• There are two problems with this approach:
– Computational efficiency: Generally, the appropriate features are not known, so there is a huge (possibly infinite) number of them to search over
– Regularization: Even if we could search over the huge number of features, how can we select appropriate features so as to prevent overfitting?
• The kernel framework enables efficient approaches to both problems
Kernel methods
[Figure: data plotted in the input space (x₁, x₂) and mapped into a feature space (φ₁, φ₂).]
Definition of a kernel
• Suppose $\phi(\mathbf{x})$ is a mapping from the $D$-dimensional input vector $\mathbf{x}$ to a high (possibly infinite) dimensional feature space

• Many simple methods rely on inner products of feature vectors, $\phi(\mathbf{x}_1)^T\phi(\mathbf{x}_2)$

• For certain feature spaces, the “kernel trick” can be used to compute $\phi(\mathbf{x}_1)^T\phi(\mathbf{x}_2)$ using the input vectors directly:

  $\phi(\mathbf{x}_1)^T\phi(\mathbf{x}_2) = k(\mathbf{x}_1, \mathbf{x}_2)$

• $k(\mathbf{x}_1, \mathbf{x}_2)$ is referred to as a kernel

• If a function satisfies “Mercer’s conditions” (see textbook), it can be used as a kernel
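A quick sketch of the trick for the polynomial kernel $k(\mathbf{x}_1,\mathbf{x}_2) = (\mathbf{x}_1^T\mathbf{x}_2)^2$, whose explicit feature map in $D = 2$ is $\phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$; the two computations agree:

```python
import numpy as np

def phi(x):
    """Explicit feature map for k(x1, x2) = (x1 . x2)**2 in D = 2."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x1, x2):
    """Kernel trick: the feature-space inner product, from the inputs directly."""
    return (x1 @ x2) ** 2

x1, x2 = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x1) @ phi(x2))   # 1.0, via the explicit feature vectors
print(k(x1, x2))           # 1.0, the same value with no feature vectors at all
```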
Examples of kernels
• $k(\mathbf{x}_1, \mathbf{x}_2) = \mathbf{x}_1^T \mathbf{x}_2$

• $k(\mathbf{x}_1, \mathbf{x}_2) = \mathbf{x}_1^T A\, \mathbf{x}_2$  ($A$ is symmetric positive definite)

• $k(\mathbf{x}_1, \mathbf{x}_2) = \exp\left(-\|\mathbf{x}_1-\mathbf{x}_2\|^2 / 2\sigma^2\right)$

• $k(\mathbf{x}_1, \mathbf{x}_2) = \exp\left(-\tfrac{1}{2} (\mathbf{x}_1-\mathbf{x}_2)^T A^{-1} (\mathbf{x}_1-\mathbf{x}_2)\right)$  ($A$ is symmetric positive definite)

• $k(\mathbf{x}_1, \mathbf{x}_2) = p(\mathbf{x}_1)\, p(\mathbf{x}_2)$
Gaussian processes

• Recall that for linear regression: $y(\mathbf{x},\mathbf{w}) = \mathbf{w}^T\boldsymbol\phi(\mathbf{x})$

• Using a design matrix $\Phi$, our prediction vector is $\mathbf{y} = \Phi\mathbf{w}$

• Let’s use a simple prior on $\mathbf{w}$: $p(\mathbf{w}) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{0}, \alpha^{-1}I)$

• Then $p(\mathbf{y}) = \mathcal{N}(\mathbf{y}\,|\,\mathbf{0}, K)$, where $K = \alpha^{-1}\Phi\Phi^T$

• $K$ is called the Gram matrix, where $K_{nm} = k(\mathbf{x}_n, \mathbf{x}_m) = \alpha^{-1}\boldsymbol\phi(\mathbf{x}_n)^T\boldsymbol\phi(\mathbf{x}_m)$

• Result: The correlation between two predictions equals the kernel evaluated for the corresponding inputs

[Figure: an example kernel and the correlations it induces between predictions.]
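A sketch of sampling from this prior using a Gaussian kernel directly (the kernel choice, length scale, and jitter term are illustrative assumptions): build the Gram matrix $K$ and draw $\mathbf{y} \sim \mathcal{N}(\mathbf{0}, K)$.

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma=0.5):
    return np.exp(-(x1 - x2) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)                    # input locations
K = gaussian_kernel(x[:, None], x[None, :])   # Gram matrix K_nm = k(x_n, x_m)
K += 1e-8 * np.eye(len(x))                    # small jitter for numerical stability
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
print(samples.shape)                          # three functions drawn from p(y)
```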
Gaussian processes: “Learning” and prediction
• As before, we assume noisy targets: $t_n = y_n + \epsilon_n$, with $p(t_n\,|\,y_n) = \mathcal{N}(t_n\,|\,y_n, \beta^{-1})$

• The target vector likelihood is $p(\mathbf{t}\,|\,\mathbf{y}) = \mathcal{N}(\mathbf{t}\,|\,\mathbf{y}, \beta^{-1}I)$

• Using $p(\mathbf{y}) = \mathcal{N}(\mathbf{y}\,|\,\mathbf{0}, K)$, we can obtain the marginal predictive distribution over targets:

  $p(\mathbf{t}) = \mathcal{N}(\mathbf{t}\,|\,\mathbf{0}, C)$, where $C = K + \beta^{-1}I$

• Predictions are based on $p(t_{N+1}\,|\,\mathbf{t})$, where the joint covariance is

  $C_{N+1} = \begin{pmatrix} C & \mathbf{k} \\ \mathbf{k}^T & c \end{pmatrix}$, with $k_n = k(\mathbf{x}_n, \mathbf{x}_{N+1})$ and $c = k(\mathbf{x}_{N+1}, \mathbf{x}_{N+1}) + \beta^{-1}$

• $p(t_{N+1}\,|\,\mathbf{t})$ is Gaussian with mean $\mathbf{k}^T C^{-1}\mathbf{t}$ and variance $c - \mathbf{k}^T C^{-1}\mathbf{k}$
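A minimal sketch of these prediction equations on toy 1-D data (the kernel, $\beta$, and the data points are illustrative):

```python
import numpy as np

def kern(x1, x2, sigma=0.5):
    return np.exp(-(x1 - x2) ** 2 / (2 * sigma ** 2))

beta = 25.0                                   # illustrative noise precision
xs = np.array([-0.5, 0.0, 0.4])               # training inputs
ts = np.array([0.2, 0.8, 0.5])                # training targets
x_new = 0.2                                   # query point x_{N+1}

C = kern(xs[:, None], xs[None, :]) + np.eye(len(xs)) / beta
k = kern(xs, x_new)                           # k_n = k(x_n, x_new)
c = kern(x_new, x_new) + 1.0 / beta

mean = k @ np.linalg.solve(C, ts)             # k^T C^{-1} t
var = c - k @ np.linalg.solve(C, k)           # c - k^T C^{-1} k
print(mean, var)
```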
Example: samples from the prior $p(\mathbf{y})$
Example: Learning and prediction
Sparse kernel methods and SVMs
• Idea: Identify a small number of training cases, called support vectors, which are used to make predictions
• See textbook for details
[Figure: a fitted function with one training case highlighted as a support vector.]
Questions?
How are we doing on the pass sequence?
• Pretty good! We can now automatically learn the features needed to track both people

• But, it sucks that we need to hand-label the coordinates of both men in 30 frames and hand-label the 2 classes for the white-shirt tracker

[Figure: the convolutional tracker, with the same set of weights used for all hidden units.]
Lecture 5 Appendix
Constructing kernels

• Provided with a kernel or a set of kernels, we can construct new kernels using any of the rules: