TRANSCRIPT
Seyed Abbas Hosseini Sharif University of Technology
Regression
Most slides are adapted from the PRML book
Outline
• Linear Basis Function Models
• Maximum Likelihood and Least Squares
• Regularized Least Squares
• Gradient Descent and Sequential Learning
• Multiple Outputs
• Bias-Variance Tradeoff
• Bayesian Linear Regression
• Predictive Distribution
Linear Basis Function Models
Linear Basis Function Models (1)
• Example: Polynomial Curve Fitting, $y(x, \mathbf{w}) = \sum_{j=0}^{M} w_j x^j$
Linear Basis Function Models (2)
• Generally, $y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$
• where the $\phi_j(\mathbf{x})$ are known as basis functions.
• Typically $\phi_0(\mathbf{x}) = 1$ for $j = 0$, so that $w_0$ acts as a bias.
• In the simplest case, we use linear basis functions: $\phi_j(\mathbf{x}) = x_j$.
Linear Basis Function Models (3)
• Polynomial basis functions: $\phi_j(x) = x^j$
• These are global; a small change in $x$ affects all basis functions.
Linear Basis Function Models (4)
• Gaussian basis functions: $\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2s^2}\right)$
• These are local; a small change in $x$ only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (width).
Linear Basis Function Models (5)
• Sigmoidal basis functions: $\phi_j(x) = \sigma\left(\frac{x - \mu_j}{s}\right)$
• where $\sigma(a) = \frac{1}{1 + \exp(-a)}$
• These are also local; a small change in $x$ only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (slope).
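As a hedged illustration (not from the slides), the minimal numpy sketch below builds design matrices for the three basis families; the grid of centers and the scale s are arbitrary choices.

```python
import numpy as np

def polynomial_basis(x, M):
    # phi_j(x) = x**j for j = 0..M-1 (phi_0 = 1 is the bias column)
    return np.stack([x**j for j in range(M)], axis=1)

def gaussian_basis(x, centers, s):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)); local around each center mu_j
    return np.exp(-(x[:, None] - centers[None, :])**2 / (2 * s**2))

def sigmoid_basis(x, centers, s):
    # phi_j(x) = sigma((x - mu_j) / s) with sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-(x[:, None] - centers[None, :]) / s))

x = np.linspace(-1, 1, 100)
centers = np.linspace(-1, 1, 9)   # assumed: 9 evenly spaced centers
Phi_poly = polynomial_basis(x, M=4)
Phi_gauss = gaussian_basis(x, centers, s=0.2)
Phi_sigm = sigmoid_basis(x, centers, s=0.1)
```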
Maximum Likelihood and Least Squares
Maximum Likelihood and Least Squares (1)
• Assume observations come from a deterministic function with added Gaussian noise: $t = y(\mathbf{x}, \mathbf{w}) + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \beta^{-1})$,
• which is the same as saying $p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$.
• Given observed inputs $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ and targets $\mathbf{t} = [t_1, \dots, t_N]^T$, we obtain the likelihood function
$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\!\left(t_n \mid \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}\right)$$
• where $y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$.
Maximum Likelihood and Least Squares (2)
• Taking the logarithm, we get $\ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(\mathbf{w})$
• where $E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\}^2$
• is the sum-of-squares error.
Maximum Likelihood and Least Squares (3)
• Computing the gradient and setting it to zero yields $\nabla_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \beta\sum_{n=1}^{N}\{t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\}\,\boldsymbol{\phi}(\mathbf{x}_n)^T = 0$
• Solving for $\mathbf{w}$, we get $\mathbf{w}_{\mathrm{ML}} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t}$
• where $\boldsymbol{\Phi}$ is the $N \times M$ design matrix with elements $\Phi_{nj} = \phi_j(\mathbf{x}_n)$.
• $\boldsymbol{\Phi}^{\dagger} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T$ is the Moore-Penrose pseudo-inverse of $\boldsymbol{\Phi}$.
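A minimal numpy sketch of the maximum-likelihood solution, assuming a noisy sinusoid and a Gaussian-basis design matrix (the data, centers, and noise level are illustrative choices); `np.linalg.pinv` computes the Moore-Penrose pseudo-inverse, and `lstsq` is the numerically preferred alternative.

```python
import numpy as np

# toy data: noisy sinusoid, as in PRML's running illustration (values assumed)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=25)

# design matrix with a bias column and Gaussian basis functions
centers = np.linspace(0, 1, 9)
Phi = np.column_stack([np.ones_like(x),
                       np.exp(-(x[:, None] - centers)**2 / (2 * 0.1**2))])

# w_ML = pseudo-inverse(Phi) @ t  (normal-equations solution)
w_ml = np.linalg.pinv(Phi) @ t
# equivalently, and more stably:
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
```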
Maximum Likelihood and Least Squares (4)
• Maximizing with respect to the bias, $w_0$, alone, we see that $w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j \bar{\phi}_j$, where $\bar{t} = \frac{1}{N}\sum_n t_n$ and $\bar{\phi}_j = \frac{1}{N}\sum_n \phi_j(\mathbf{x}_n)$.
• We can also maximize with respect to $\beta$, giving $\frac{1}{\beta_{\mathrm{ML}}} = \frac{1}{N}\sum_{n=1}^{N}\{t_n - \mathbf{w}_{\mathrm{ML}}^T\boldsymbol{\phi}(\mathbf{x}_n)\}^2$
Geometry of Least Squares
• Consider $\mathbf{y} = \boldsymbol{\Phi}\mathbf{w}_{\mathrm{ML}}$ in the $N$-dimensional space whose axes are the target values $t_1, \dots, t_N$.
• The $M$-dimensional subspace $S$ is spanned by the columns $\boldsymbol{\varphi}_1, \dots, \boldsymbol{\varphi}_M$ of $\boldsymbol{\Phi}$.
• $\mathbf{w}_{\mathrm{ML}}$ minimizes the distance between $\mathbf{t}$ and its orthogonal projection on $S$, i.e. $\mathbf{y}$.
Regularized Least Squares
Regularized Least Squares (1)
• Consider the error function $E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})$ (data term + regularization term), where $\lambda$ is called the regularization coefficient.
• With the sum-of-squares error function and a quadratic regularizer, we get $\frac{1}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\}^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w}$
• which is minimized by $\mathbf{w} = (\lambda\mathbf{I} + \boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t}$
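A small numpy sketch of this regularized (ridge) solution, reusing the `Phi` and `t` from the earlier least-squares example; the value of `lam` is an arbitrary illustration.

```python
import numpy as np

def ridge_solution(Phi, t, lam):
    # w = (lambda*I + Phi^T Phi)^{-1} Phi^T t
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# e.g. w_reg = ridge_solution(Phi, t, lam=1e-3)
```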
Regularized Least Squares (2)
• With a more general regularizer, we have $\frac{1}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\}^2 + \frac{\lambda}{2}\sum_{j=1}^{M}|w_j|^q$
• $q = 1$ gives the lasso; $q = 2$ gives the quadratic regularizer.
Regularized Least Squares (3)
• Lasso tends to generate sparser solutions than a quadratic regularizer.
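As a hedged illustration of that sparsity claim, the sketch below fits scikit-learn's Lasso and Ridge models on synthetic data (the feature count, noise level, and alpha values are arbitrary choices, not from the slides) and counts near-zero coefficients.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.5, 1.0]          # only 3 informative features
y = X @ true_w + rng.normal(0, 0.5, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# lasso typically zeroes out most uninformative coefficients; ridge only shrinks them
print("lasso near-zero coefficients:", np.sum(np.isclose(lasso.coef_, 0.0)))
print("ridge near-zero coefficients:", np.sum(np.isclose(ridge.coef_, 0.0)))
```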
Gradient Descent & Sequential Learning
Gradient Descent in 1D
Suppose we want to minimize the function $f(x) = x^4 - 15x^3 + 80x^2 - 180x + 144$.
• There are many approaches for doing this.
• We'll discuss one approach today called "gradient descent".
Gradient Descent Intuition
The intuition behind 1D gradient descent:
• To the left of a minimum, the derivative is negative (the curve is going down).
• To the right of a minimum, the derivative is positive (the curve is going up).
• The derivative tells you which direction to move and how far.
Let's work from here and try to invent gradient descent.
Gradient Descent Algorithm
The gradient descent algorithm repeatedly applies the update $x \leftarrow x - \alpha\, f'(x)$:
• $\alpha$ is known as the "learning rate".
• Too large and the algorithm fails to converge.
• Too small and it takes too long to converge.
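A minimal sketch of that 1D update on the quartic above; the starting point, learning rate, and iteration count are arbitrary choices for illustration.

```python
def f(x):
    return x**4 - 15*x**3 + 80*x**2 - 180*x + 144

def df(x):
    # derivative of f
    return 4*x**3 - 45*x**2 + 160*x - 180

x = 0.0          # assumed starting point
alpha = 0.001    # learning rate: too large diverges, too small is slow
for _ in range(10_000):
    x = x - alpha * df(x)

# depending on the starting point, this may land in a local rather than global minimum
print(x, f(x))
```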
GD Only Finds Local Minima
• If the loss function has multiple local minima, GD is not guaranteed to find the global minimum.
• Suppose we have this loss curve:
GD Only Finds Local Minima
• Here’s how GD runs:
● GD can converge at −15 even though the global minimum is −18.
Convexity
• For a convex function f, any local minimum is also a global minimum.
• If the loss function is convex, gradient descent will always find the globally optimal minimizer.
• Formally, f is convex if, for any line drawn between two points on the curve, all values of the curve lie on or below that line. More formally: $f(\lambda x_1 + (1-\lambda)x_2) \le \lambda f(x_1) + (1-\lambda) f(x_2)$ for all $x_1, x_2$ and all $\lambda \in [0, 1]$.
Multi Dimensional Gradient Descent
On a 2D surface, the best way to go down is described by a 2D vector: the negative gradient. The next value for θ is obtained by stepping along that vector, $\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha\,\nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}^{(t)})$.
[Figure slides: successive gradient steps on a 2D loss surface]
Batch Gradient Descent
• Gradient descent algorithm: nudge θ in the negative gradient direction until θ converges.
• Batch gradient descent update rule: $\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha\,\nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}^{(t)}, \mathbf{X}, \mathbf{y})$
• θ: model weights; L: loss function; α: learning rate, typically either constant or 1/(t+1); y: true values from the training data; $\nabla_{\boldsymbol{\theta}} L$: gradient of the loss with respect to θ.
Gradient Descent Algorithm
● Initialize model weights to all zero.
  ○ Also common: initialize using small random numbers.
● Update model weights using the update rule: $\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha\,\nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}^{(t)}, \mathbf{X}, \mathbf{y})$
● Repeat until the model weights don't change (convergence).
● At this point, we have θ̂, our minimizing model weights.
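A minimal sketch of batch gradient descent for linear least squares, assuming the mean-squared-error loss; the learning rate and iteration count are arbitrary.

```python
import numpy as np

def batch_gradient_descent(Phi, t, alpha=0.1, n_iters=1000):
    # Minimize L(theta) = (1/2N) * ||Phi @ theta - t||^2 with full-batch updates.
    N, M = Phi.shape
    theta = np.zeros(M)                       # initialize weights to all zero
    for _ in range(n_iters):
        grad = Phi.T @ (Phi @ theta - t) / N  # gradient of the loss w.r.t. theta
        theta = theta - alpha * grad          # update rule
    return theta

# e.g. theta_hat = batch_gradient_descent(Phi, t)  # Phi, t as in the earlier sketch
```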
Stochastic Gradient Descent
1. Draw a simple random sample of data indices.
   • Often called a batch or mini-batch.
   • The choice of batch size trades off gradient quality against speed.
2. Compute the gradient estimate on that sample and use it in place of the full gradient.
For 𝜏 from 0 to convergence, starting from an initial vector $\boldsymbol{\theta}^{(0)}$ (random, zeros, …), repeat these two steps and apply the update $\boldsymbol{\theta}^{(\tau+1)} = \boldsymbol{\theta}^{(\tau)} - \alpha\,\nabla_{\boldsymbol{\theta}} \hat{L}(\boldsymbol{\theta}^{(\tau)})$.
Decomposable Loss
• The loss can be written as a sum of the loss on each record, $L(\boldsymbol{\theta}) = \frac{1}{N}\sum_{n=1}^{N} \ell(\boldsymbol{\theta}; \mathbf{x}_n, t_n)$, which is what makes the mini-batch gradient an unbiased estimate of the full gradient.
Online Linear Regression
• Data items are considered one at a time (a.k.a. online learning); use stochastic (sequential) gradient descent: $\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \eta\,\{t_n - \mathbf{w}^{(\tau)T}\boldsymbol{\phi}(\mathbf{x}_n)\}\,\boldsymbol{\phi}(\mathbf{x}_n)$
• This is known as the least-mean-squares (LMS) algorithm. Issue: how to choose η?
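A minimal sketch of the LMS update for sequential learning; the fixed learning rate `eta` is an arbitrary choice (in practice it is often decayed over time).

```python
import numpy as np

def lms(phi_stream, t_stream, M, eta=0.05):
    # phi_stream, t_stream: sequences of basis vectors phi(x_n) and targets t_n
    w = np.zeros(M)
    for phi_n, t_n in zip(phi_stream, t_stream):
        error = t_n - w @ phi_n          # prediction error on the current item
        w = w + eta * error * phi_n      # LMS / sequential gradient step
    return w

# e.g. w_hat = lms(Phi, t, M=Phi.shape[1])  # feed the earlier design matrix row by row
```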
Gradient Descent vs. Stochastic Gradient Descent
[Figure slides: optimization trajectories of batch gradient descent and stochastic gradient descent]
Multiple Outputs
Multiple Outputs (1)
• Analogously to the single-output case we have $\mathbf{y}(\mathbf{x}, \mathbf{W}) = \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x})$, with $p(\mathbf{t} \mid \mathbf{x}, \mathbf{W}, \beta) = \mathcal{N}(\mathbf{t} \mid \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}), \beta^{-1}\mathbf{I})$.
• Given observed inputs $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ and targets $\mathbf{T} = [\mathbf{t}_1, \dots, \mathbf{t}_N]^T$, we obtain the log-likelihood function $\ln p(\mathbf{T} \mid \mathbf{X}, \mathbf{W}, \beta) = \frac{NK}{2}\ln\frac{\beta}{2\pi} - \frac{\beta}{2}\sum_{n=1}^{N}\left\|\mathbf{t}_n - \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n)\right\|^2$
Multiple Outputs (2)
• Maximizing with respect to $\mathbf{W}$, we obtain $\mathbf{W}_{\mathrm{ML}} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{T}$.
• If we consider a single target variable, $t_k$, we see that $\mathbf{w}_k = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t}_k = \boldsymbol{\Phi}^{\dagger}\mathbf{t}_k$,
• where $\mathbf{t}_k = [t_{1k}, \dots, t_{Nk}]^T$, which is identical to the single-output case.
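A small numpy sketch showing that the multi-output solution is the single-output pseudo-inverse solution applied column by column; the data shapes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 6))     # N = 50 samples, M = 6 basis functions
T = rng.normal(size=(50, 3))       # K = 3 target variables per sample

W_ml = np.linalg.pinv(Phi) @ T     # shape (M, K): all outputs solved at once

# column k of W_ml equals the single-output solution for target column k
w_0 = np.linalg.pinv(Phi) @ T[:, 0]
assert np.allclose(W_ml[:, 0], w_0)
```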
Bias-Variance Tradeoff
The Bias-Variance Decomposition (1)
• Recall the expected squared loss, $\mathbb{E}[L] = \int \{y(\mathbf{x}) - h(\mathbf{x})\}^2 p(\mathbf{x})\,\mathrm{d}\mathbf{x} + \iint \{h(\mathbf{x}) - t\}^2 p(\mathbf{x}, t)\,\mathrm{d}\mathbf{x}\,\mathrm{d}t$
• where $h(\mathbf{x}) = \mathbb{E}[t \mid \mathbf{x}] = \int t\, p(t \mid \mathbf{x})\,\mathrm{d}t$.
• The second term of $\mathbb{E}[L]$ corresponds to the noise inherent in the random variable $t$.
• What about the first term?
The Bias-Variance Decomposition (2)
• Suppose we were given multiple data sets, each of size N. Any particular data set, D, will give a particular function y(x; D). We then have
$$\{y(\mathbf{x}; D) - h(\mathbf{x})\}^2 = \{y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)]\}^2 + 2\{y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)]\}\{\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\} + \{\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\}^2$$
The Bias-Variance Decomposition (3)
• Taking the expectation over D yields (the cross term vanishes)
$$\mathbb{E}_D\big[\{y(\mathbf{x}; D) - h(\mathbf{x})\}^2\big] = \{\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\}^2 + \mathbb{E}_D\big[\{y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)]\}^2\big]$$
The Bias-Variance Decomposition (4)
• Thus we can write: expected loss = (bias)² + variance + noise,
• where
$$(\text{bias})^2 = \int \{\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\}^2 p(\mathbf{x})\,\mathrm{d}\mathbf{x}$$
$$\text{variance} = \int \mathbb{E}_D\big[\{y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)]\}^2\big] p(\mathbf{x})\,\mathrm{d}\mathbf{x}$$
$$\text{noise} = \iint \{h(\mathbf{x}) - t\}^2 p(\mathbf{x}, t)\,\mathrm{d}\mathbf{x}\,\mathrm{d}t$$
The Bias-Variance Decomposition (5)–(7)
• Example: 25 data sets from the sinusoidal, varying the degree of regularization, λ.
[Figure slides: fitted curves and their average for large, intermediate, and small λ]
The Bias-Variance Trade-off
• From these plots, we note that an over-regularized model (large λ) will have a high bias, while an under-regularized model (small λ) will have a high variance.
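A hedged simulation sketch of this decomposition: it refits a ridge-regularized model on many independently drawn sinusoidal data sets and estimates bias² and variance empirically (the data-set count, noise level, basis, and λ values are arbitrary choices, not from the slides).

```python
import numpy as np

rng = np.random.default_rng(0)
x_grid = np.linspace(0, 1, 100)
h = np.sin(2 * np.pi * x_grid)                 # true regression function h(x)
centers = np.linspace(0, 1, 9)

def features(x):
    # Gaussian basis plus a bias column
    return np.column_stack([np.ones_like(x),
                            np.exp(-(x[:, None] - centers)**2 / (2 * 0.1**2))])

def bias_variance(lam, n_datasets=25, N=25, noise=0.3):
    preds = []
    for _ in range(n_datasets):
        x = rng.uniform(0, 1, N)
        t = np.sin(2 * np.pi * x) + rng.normal(0, noise, N)
        Phi = features(x)
        w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
        preds.append(features(x_grid) @ w)
    preds = np.array(preds)                    # shape (n_datasets, len(x_grid))
    bias2 = np.mean((preds.mean(axis=0) - h) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

for lam in [1e-5, 1e-2, 10.0]:                 # small, intermediate, large regularization
    print(lam, bias_variance(lam))
```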
Bayesian Linear Regression
Bayesian Linear Regression (1)
• Define a conjugate prior over $\mathbf{w}$: $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$.
• Combining this with the likelihood function and using results for marginal and conditional Gaussian distributions gives the posterior $p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$
• where $\mathbf{m}_N = \mathbf{S}_N(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\boldsymbol{\Phi}^T\mathbf{t})$ and $\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}$.
Bayesian Linear Regression (2)
• A common choice for the prior is $p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$,
• for which $\mathbf{m}_N = \beta\,\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t}$ and $\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}$.
• Next we consider an example…
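Before the example, a minimal numpy sketch of this posterior update with the zero-mean isotropic prior; `alpha` and `beta` are assumed hyperparameter values, and `Phi`, `t` are as in the earlier least-squares sketch.

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    # S_N^{-1} = alpha*I + beta * Phi^T Phi,   m_N = beta * S_N Phi^T t
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ (Phi.T @ t)
    return m_N, S_N

# e.g. m_N, S_N = posterior(Phi, t, alpha=2.0, beta=25.0)
```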
Bayesian Linear Regression (3)
• 0 data points observed.
[Figure panels: prior | data space]
Bayesian Linear Regression (4)
• 1 data point observed.
[Figure panels: likelihood | posterior | data space]
Bayesian Linear Regression (5)
• 2 data points observed.
[Figure panels: likelihood | posterior | data space]
Bayesian Linear Regression (6)
• 20 data points observed.
[Figure panels: likelihood | posterior | data space]
Predictive Distribution
Predictive Distribution (1)
• Predict t for new values of x by integrating over w: $p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\,\mathrm{d}\mathbf{w} = \mathcal{N}(t \mid \mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}), \sigma_N^2(\mathbf{x}))$
• where $\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^T\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x})$.
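A minimal sketch of the predictive mean and variance, reusing the hypothetical `posterior` helper from the Bayesian sketch above.

```python
import numpy as np

def predictive(phi_x, m_N, S_N, beta):
    # mean = m_N^T phi(x);  variance = 1/beta + phi(x)^T S_N phi(x)
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x
    return mean, var

# e.g. for a new input x*, build its basis vector phi_x and call
# mean, var = predictive(phi_x, m_N, S_N, beta=25.0)
```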
Predictive Distribution (2)
• Example: sinusoidal data, 9 Gaussian basis functions, 1 data point.
Predictive Distribution (3)
• Example: sinusoidal data, 9 Gaussian basis functions, 2 data points.
Predictive Distribution (4)
• Example: sinusoidal data, 9 Gaussian basis functions, 4 data points.
Predictive Distribution (5)
• Example: sinusoidal data, 9 Gaussian basis functions, 25 data points.
Any Questions?!