lecture3 - university of california, san...
TRANSCRIPT
![Page 1: lecture3 - University of California, San Diegonoiselab.ucsd.edu/ECE228_2018/slides/lecture3.pdfLecture3 • Homework • Gaussian, Bishop 2.3 • Non-parametric, Bishop 2.5 • Linear](https://reader034.vdocuments.net/reader034/viewer/2022050313/5f7552eac81b041df6151f75/html5/thumbnails/1.jpg)
Lecture 3• Homework
• Gaussian, Bishop 2.3• Non-parametric, Bishop 2.5• Linear regression 3.0-3.2
• Pod-cast lecture on-line
• Next lectures: – I posted a rough plan. – It is flexible though so please come with suggestions
![Page 2: lecture3 - University of California, San Diegonoiselab.ucsd.edu/ECE228_2018/slides/lecture3.pdfLecture3 • Homework • Gaussian, Bishop 2.3 • Non-parametric, Bishop 2.5 • Linear](https://reader034.vdocuments.net/reader034/viewer/2022050313/5f7552eac81b041df6151f75/html5/thumbnails/2.jpg)
Mark’s KL homework
![Page 3: lecture3 - University of California, San Diegonoiselab.ucsd.edu/ECE228_2018/slides/lecture3.pdfLecture3 • Homework • Gaussian, Bishop 2.3 • Non-parametric, Bishop 2.5 • Linear](https://reader034.vdocuments.net/reader034/viewer/2022050313/5f7552eac81b041df6151f75/html5/thumbnails/3.jpg)
Mark’s KL homework
![Page 4: lecture3 - University of California, San Diegonoiselab.ucsd.edu/ECE228_2018/slides/lecture3.pdfLecture3 • Homework • Gaussian, Bishop 2.3 • Non-parametric, Bishop 2.5 • Linear](https://reader034.vdocuments.net/reader034/viewer/2022050313/5f7552eac81b041df6151f75/html5/thumbnails/4.jpg)
Bayes for linear model! = #$ + & &~N(*, ,-) y~N(#$, ,-) prior: $~N(*, ,$)
/ $ ! ~/ ! $ / $ ~0 $1, ,/ mean $1 = ,1#2,-34!Covariance ,134 = #2,-34# + ,534
![Page 5: lecture3 - University of California, San Diegonoiselab.ucsd.edu/ECE228_2018/slides/lecture3.pdfLecture3 • Homework • Gaussian, Bishop 2.3 • Non-parametric, Bishop 2.5 • Linear](https://reader034.vdocuments.net/reader034/viewer/2022050313/5f7552eac81b041df6151f75/html5/thumbnails/5.jpg)
Bayes’ Theorem for Gaussian Variables• Given
• we have
• where
![Page 6: lecture3 - University of California, San Diegonoiselab.ucsd.edu/ECE228_2018/slides/lecture3.pdfLecture3 • Homework • Gaussian, Bishop 2.3 • Non-parametric, Bishop 2.5 • Linear](https://reader034.vdocuments.net/reader034/viewer/2022050313/5f7552eac81b041df6151f75/html5/thumbnails/6.jpg)
Contribution of the Nth data point, xN
Sequential Estimation
correction given xNcorrection weightold estimate
![Page 7: lecture3 - University of California, San Diegonoiselab.ucsd.edu/ECE228_2018/slides/lecture3.pdfLecture3 • Homework • Gaussian, Bishop 2.3 • Non-parametric, Bishop 2.5 • Linear](https://reader034.vdocuments.net/reader034/viewer/2022050313/5f7552eac81b041df6151f75/html5/thumbnails/7.jpg)
Bayesian Inference for the Gaussian Bishop2.3.6Assume s2 is known. Given i.i.d. datathe likelihood function for µ is given by
• This has a Gaussian shape as a function of µ (but it is not a distribution over µ).
![Page 8: lecture3 - University of California, San Diegonoiselab.ucsd.edu/ECE228_2018/slides/lecture3.pdfLecture3 • Homework • Gaussian, Bishop 2.3 • Non-parametric, Bishop 2.5 • Linear](https://reader034.vdocuments.net/reader034/viewer/2022050313/5f7552eac81b041df6151f75/html5/thumbnails/8.jpg)
Bayesian Inference for the Gaussian Bishop2.3.6• Combined with a Gaussian prior over µ,
• this gives the posterior
![Page 9: lecture3 - University of California, San Diegonoiselab.ucsd.edu/ECE228_2018/slides/lecture3.pdfLecture3 • Homework • Gaussian, Bishop 2.3 • Non-parametric, Bishop 2.5 • Linear](https://reader034.vdocuments.net/reader034/viewer/2022050313/5f7552eac81b041df6151f75/html5/thumbnails/9.jpg)
Bayesian Inference for the Gaussian (3)• Example: for N = 0, 1, 2 and 10.
Prior
![Page 10: lecture3 - University of California, San Diegonoiselab.ucsd.edu/ECE228_2018/slides/lecture3.pdfLecture3 • Homework • Gaussian, Bishop 2.3 • Non-parametric, Bishop 2.5 • Linear](https://reader034.vdocuments.net/reader034/viewer/2022050313/5f7552eac81b041df6151f75/html5/thumbnails/10.jpg)
Bayesian Inference for the Gaussian (4)Sequential Estimation
The posterior obtained after observing N-1 data points becomes the prior when we observe the Nth data point.
Conjugate prior: posterior and prior are in the same family. The prior is called a conjugate prior for the likelihood function.
![Page 11: lecture3 - University of California, San Diegonoiselab.ucsd.edu/ECE228_2018/slides/lecture3.pdfLecture3 • Homework • Gaussian, Bishop 2.3 • Non-parametric, Bishop 2.5 • Linear](https://reader034.vdocuments.net/reader034/viewer/2022050313/5f7552eac81b041df6151f75/html5/thumbnails/11.jpg)
Nonparametric Methods (1) Bishop 2.5• Parametric distribution models (… Gaussian) are restricted to specific
forms, which may not always be suitable; for example, consider modelling a multimodal distribution with a single, unimodal model.
• Nonparametric approaches make few assumptions about the overall shape of the distribution being modelled.
• 1000 parameters versus 10 parameters
• Nonparametric models (not histograms) requires storing and computing with the entire data set.
• Parametric models, once fitted, are much more efficient in terms of storage and computation.
![Page 12: lecture3 - University of California, San Diegonoiselab.ucsd.edu/ECE228_2018/slides/lecture3.pdfLecture3 • Homework • Gaussian, Bishop 2.3 • Non-parametric, Bishop 2.5 • Linear](https://reader034.vdocuments.net/reader034/viewer/2022050313/5f7552eac81b041df6151f75/html5/thumbnails/12.jpg)
Nonparametric Methods (2)
Histogram methods partition the data space into distinct bins with widths ¢i and count the number of observations, ni, in each bin.
• Often, the same width is used for all bins, Di = D.
• D acts as a smoothing parameter.• In a D-dimensional space, using M bins
in each dimension will require MD bins! => it only work for marginals.
![Page 13: lecture3 - University of California, San Diegonoiselab.ucsd.edu/ECE228_2018/slides/lecture3.pdfLecture3 • Homework • Gaussian, Bishop 2.3 • Non-parametric, Bishop 2.5 • Linear](https://reader034.vdocuments.net/reader034/viewer/2022050313/5f7552eac81b041df6151f75/html5/thumbnails/13.jpg)
Nonparametric Methods (3)
•Assume observations drawn from a density p(x) and consider a small region Rcontaining x such that
•The probability that K out of Nobservations lie inside R is Bin(KjN,P ) and if N is large
If the volume of R, V, is sufficiently small, p(x) is approximately constant over R and
Thus
V small, yet K>0, therefore N large?
![Page 14: lecture3 - University of California, San Diegonoiselab.ucsd.edu/ECE228_2018/slides/lecture3.pdfLecture3 • Homework • Gaussian, Bishop 2.3 • Non-parametric, Bishop 2.5 • Linear](https://reader034.vdocuments.net/reader034/viewer/2022050313/5f7552eac81b041df6151f75/html5/thumbnails/14.jpg)
Nonparametric Methods (4)
Kernel Density Estimation: fix V, estimate K from the data. Let R be a hypercube centred on x and define the kernel function (Parzen window)
• It follows that
• and hence
![Page 15: lecture3 - University of California, San Diegonoiselab.ucsd.edu/ECE228_2018/slides/lecture3.pdfLecture3 • Homework • Gaussian, Bishop 2.3 • Non-parametric, Bishop 2.5 • Linear](https://reader034.vdocuments.net/reader034/viewer/2022050313/5f7552eac81b041df6151f75/html5/thumbnails/15.jpg)
Nonparametric Methods (5)
To avoid discontinuities in p(x), use a smooth kernel, e.g. a Gaussian
Any kernel such that
will work.h acts as a smoother.
![Page 16: lecture3 - University of California, San Diegonoiselab.ucsd.edu/ECE228_2018/slides/lecture3.pdfLecture3 • Homework • Gaussian, Bishop 2.3 • Non-parametric, Bishop 2.5 • Linear](https://reader034.vdocuments.net/reader034/viewer/2022050313/5f7552eac81b041df6151f75/html5/thumbnails/16.jpg)
Nonparametric Methods (6)Nearest Neighbour Density Estimation: fix K, estimate Vfrom the data. Consider a hypersphere centred on x and let it grow to a volume, V ?, that includes K of the given N data points. Then
K acts as a smoother.
![Page 17: lecture3 - University of California, San Diegonoiselab.ucsd.edu/ECE228_2018/slides/lecture3.pdfLecture3 • Homework • Gaussian, Bishop 2.3 • Non-parametric, Bishop 2.5 • Linear](https://reader034.vdocuments.net/reader034/viewer/2022050313/5f7552eac81b041df6151f75/html5/thumbnails/17.jpg)
K-Nearest-Neighbours for Classification (1)• Given a data set with Nk data points from class Ck and
, we have
• and correspondingly
• Since , Bayes’ theorem gives
K = 1K = 3
![Page 18: lecture3 - University of California, San Diegonoiselab.ucsd.edu/ECE228_2018/slides/lecture3.pdfLecture3 • Homework • Gaussian, Bishop 2.3 • Non-parametric, Bishop 2.5 • Linear](https://reader034.vdocuments.net/reader034/viewer/2022050313/5f7552eac81b041df6151f75/html5/thumbnails/18.jpg)
K-Nearest-Neighbours for Classification (3)
• K acts as a smother• For , the error rate of the nearest-neighbour (K=1) classifier is never more than twice the optimal error (from the true conditional class distributions).
![Page 19: lecture3 - University of California, San Diegonoiselab.ucsd.edu/ECE228_2018/slides/lecture3.pdfLecture3 • Homework • Gaussian, Bishop 2.3 • Non-parametric, Bishop 2.5 • Linear](https://reader034.vdocuments.net/reader034/viewer/2022050313/5f7552eac81b041df6151f75/html5/thumbnails/19.jpg)
Linear regression: Linear Basis Function Models (1)Generally
• where fj(x) are known as basis functions.• Typically, f0(x) = 1, so that w0 acts as a bias.• Simplest case is linear basis functions: fd(x) = xd.
http://playground.tensorflow.org/
![Page 20: lecture3 - University of California, San Diegonoiselab.ucsd.edu/ECE228_2018/slides/lecture3.pdfLecture3 • Homework • Gaussian, Bishop 2.3 • Non-parametric, Bishop 2.5 • Linear](https://reader034.vdocuments.net/reader034/viewer/2022050313/5f7552eac81b041df6151f75/html5/thumbnails/20.jpg)
Some types of basis function in 1-D
Sigmoids Gaussians Polynomials
Sigmoid and Gaussian basis functions can also be used in multilayer neural networks, but neural networks learn the parameters of the basis functions. This is more powerful but also harder and messier.
![Page 21: lecture3 - University of California, San Diegonoiselab.ucsd.edu/ECE228_2018/slides/lecture3.pdfLecture3 • Homework • Gaussian, Bishop 2.3 • Non-parametric, Bishop 2.5 • Linear](https://reader034.vdocuments.net/reader034/viewer/2022050313/5f7552eac81b041df6151f75/html5/thumbnails/21.jpg)
Two types of linear model that are equivalent with respect to learning
• The first and second model has the same number of adaptive coefficients as the number of basis functions +1.
• Once we have replaced the data by basis functions outputs, fitting the second model is exactly the same the first model.– No need to clutter math with basis functions
)(...)()()(
...)(
22110
22110
xwxxwx,
xwwx,
F=+++=
=+++=T
T
wwwy
xwxwwy
ff
bias
![Page 22: lecture3 - University of California, San Diegonoiselab.ucsd.edu/ECE228_2018/slides/lecture3.pdfLecture3 • Homework • Gaussian, Bishop 2.3 • Non-parametric, Bishop 2.5 • Linear](https://reader034.vdocuments.net/reader034/viewer/2022050313/5f7552eac81b041df6151f75/html5/thumbnails/22.jpg)
Maximum Likelihood and Least Squares (1)
• Assume observations from a deterministic function with added Gaussian noise:
• or,
• Given observed inputs, , and targets , we obtain the likelihood function
where
![Page 23: lecture3 - University of California, San Diegonoiselab.ucsd.edu/ECE228_2018/slides/lecture3.pdfLecture3 • Homework • Gaussian, Bishop 2.3 • Non-parametric, Bishop 2.5 • Linear](https://reader034.vdocuments.net/reader034/viewer/2022050313/5f7552eac81b041df6151f75/html5/thumbnails/23.jpg)
Maximum Likelihood and Least Squares (2)Taking the logarithm, we get
Where the sum-of-squares error is
![Page 24: lecture3 - University of California, San Diegonoiselab.ucsd.edu/ECE228_2018/slides/lecture3.pdfLecture3 • Homework • Gaussian, Bishop 2.3 • Non-parametric, Bishop 2.5 • Linear](https://reader034.vdocuments.net/reader034/viewer/2022050313/5f7552eac81b041df6151f75/html5/thumbnails/24.jpg)
Maximum Likelihood and Least Squares (3)Computing the gradient and setting it to zero yields
Solving for w,
whereThe Moore-Penrose pseudo-inverse, .
![Page 25: lecture3 - University of California, San Diegonoiselab.ucsd.edu/ECE228_2018/slides/lecture3.pdfLecture3 • Homework • Gaussian, Bishop 2.3 • Non-parametric, Bishop 2.5 • Linear](https://reader034.vdocuments.net/reader034/viewer/2022050313/5f7552eac81b041df6151f75/html5/thumbnails/25.jpg)
Maximum Likelihood and Least Squares (4)Maximizing with respect to the bias, w0, alone,
We can also maximize with respect to b, giving
![Page 26: lecture3 - University of California, San Diegonoiselab.ucsd.edu/ECE228_2018/slides/lecture3.pdfLecture3 • Homework • Gaussian, Bishop 2.3 • Non-parametric, Bishop 2.5 • Linear](https://reader034.vdocuments.net/reader034/viewer/2022050313/5f7552eac81b041df6151f75/html5/thumbnails/26.jpg)
Geometry of Least SquaresConsider
S is spanned by
wML minimizes the distance between t and its orthogonal projection on S, i.e. y.
N-dimensionalM-dimensional