TRANSCRIPT
Seyed Abbas Hosseini Sharif University of Technology
Regression
Most slides are adapted from the PRML book
Outline
• Linear Basis Function Models
• Maximum Likelihood and Least Squares
• Regularized Least Squares
• Gradient Descent and Sequential Learning
• Multiple Outputs
• Bias-Variance Tradeoff
• Bayesian Linear Regression
• Predictive Distribution
Linear Basis Function Models
Linear Basis Function Models (1)
• Example: Polynomial Curve Fitting, $y(x, \mathbf{w}) = \sum_{j=0}^{M} w_j x^j$
Linear Basis Function Models (2)
• Generally, $y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$
• where the $\phi_j(\mathbf{x})$ are known as basis functions.
• Typically $\phi_0(\mathbf{x}) = 1$ for $j = 0$, so that $w_0$ acts as a bias.
• In the simplest case, we use linear basis functions: $\phi_j(\mathbf{x}) = x_j$.
Linear Basis Function Models (3)
• Polynomial basis functions: $\phi_j(x) = x^j$
• These are global; a small change in $x$ affects all basis functions.
Linear Basis Function Models (4)
• Gaussian basis functions: $\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2s^2}\right)$
• These are local; a small change in $x$ only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (width).
Linear Basis Function Models (5)
• Sigmoidal basis functions: $\phi_j(x) = \sigma\left(\frac{x - \mu_j}{s}\right)$
• where $\sigma(a) = \frac{1}{1 + \exp(-a)}$
• These are also local; a small change in $x$ only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (slope).
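As a hedged illustration (not from the slides), the minimal numpy sketch below builds design matrices for the three basis families; the grid of centers and the scale s are arbitrary choices.

```python
import numpy as np

def polynomial_basis(x, M):
    # phi_j(x) = x**j for j = 0..M-1 (phi_0 = 1 is the bias column)
    return np.stack([x**j for j in range(M)], axis=1)

def gaussian_basis(x, centers, s):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)); local around each center mu_j
    return np.exp(-(x[:, None] - centers[None, :])**2 / (2 * s**2))

def sigmoid_basis(x, centers, s):
    # phi_j(x) = sigma((x - mu_j) / s) with sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-(x[:, None] - centers[None, :]) / s))

x = np.linspace(-1, 1, 100)
centers = np.linspace(-1, 1, 9)   # assumed: 9 evenly spaced centers
Phi_poly = polynomial_basis(x, M=4)
Phi_gauss = gaussian_basis(x, centers, s=0.2)
Phi_sigm = sigmoid_basis(x, centers, s=0.1)
```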
Maximum Likelihood and Least Squares
Maximum Likelihood and Least Squares (1)
• Assume observations come from a deterministic function with added Gaussian noise: $t = y(\mathbf{x}, \mathbf{w}) + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \beta^{-1})$,
• which is the same as saying $p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$.
• Given observed inputs $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ and targets $\mathbf{t} = [t_1, \dots, t_N]^T$, we obtain the likelihood function
$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\!\left(t_n \mid \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}\right)$$
• where $y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$.
Maximum Likelihood and Least Squares (2)
• Taking the logarithm, we get $\ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(\mathbf{w})$
• where $E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\}^2$
• is the sum-of-squares error.
Maximum Likelihood and Least Squares (3)
• Computing the gradient and setting it to zero yields $\nabla_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \beta\sum_{n=1}^{N}\{t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\}\,\boldsymbol{\phi}(\mathbf{x}_n)^T = 0$
• Solving for $\mathbf{w}$, we get $\mathbf{w}_{\mathrm{ML}} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t}$
• where $\boldsymbol{\Phi}$ is the $N \times M$ design matrix with elements $\Phi_{nj} = \phi_j(\mathbf{x}_n)$.
• $\boldsymbol{\Phi}^{\dagger} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T$ is the Moore-Penrose pseudo-inverse of $\boldsymbol{\Phi}$.
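A minimal numpy sketch of the maximum-likelihood solution, assuming a noisy sinusoid and a Gaussian-basis design matrix (the data, centers, and noise level are illustrative choices); `np.linalg.pinv` computes the Moore-Penrose pseudo-inverse, and `lstsq` is the numerically preferred alternative.

```python
import numpy as np

# toy data: noisy sinusoid, as in PRML's running illustration (values assumed)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=25)

# design matrix with a bias column and Gaussian basis functions
centers = np.linspace(0, 1, 9)
Phi = np.column_stack([np.ones_like(x),
                       np.exp(-(x[:, None] - centers)**2 / (2 * 0.1**2))])

# w_ML = pseudo-inverse(Phi) @ t  (normal-equations solution)
w_ml = np.linalg.pinv(Phi) @ t
# equivalently, and more stably:
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
```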
Maximum Likelihood and Least Squares (4)
• Maximizing with respect to the bias, $w_0$, alone, we see that $w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j \bar{\phi}_j$, where $\bar{t} = \frac{1}{N}\sum_n t_n$ and $\bar{\phi}_j = \frac{1}{N}\sum_n \phi_j(\mathbf{x}_n)$.
• We can also maximize with respect to $\beta$, giving $\frac{1}{\beta_{\mathrm{ML}}} = \frac{1}{N}\sum_{n=1}^{N}\{t_n - \mathbf{w}_{\mathrm{ML}}^T\boldsymbol{\phi}(\mathbf{x}_n)\}^2$
Geometry of Least Squares
• Consider $\mathbf{y} = \boldsymbol{\Phi}\mathbf{w}_{\mathrm{ML}}$ in the $N$-dimensional space whose axes are the target values $t_1, \dots, t_N$.
• The $M$-dimensional subspace $S$ is spanned by the columns $\boldsymbol{\varphi}_1, \dots, \boldsymbol{\varphi}_M$ of $\boldsymbol{\Phi}$.
• $\mathbf{w}_{\mathrm{ML}}$ minimizes the distance between $\mathbf{t}$ and its orthogonal projection on $S$, i.e. $\mathbf{y}$.
Regularized Least Squares
Regularized Least Squares (1)
• Consider the error function $E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})$ (data term + regularization term), where $\lambda$ is called the regularization coefficient.
• With the sum-of-squares error function and a quadratic regularizer, we get $\frac{1}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\}^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w}$
• which is minimized by $\mathbf{w} = (\lambda\mathbf{I} + \boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t}$
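A small numpy sketch of this regularized (ridge) solution, reusing the `Phi` and `t` from the earlier least-squares example; the value of `lam` is an arbitrary illustration.

```python
import numpy as np

def ridge_solution(Phi, t, lam):
    # w = (lambda*I + Phi^T Phi)^{-1} Phi^T t
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# e.g. w_reg = ridge_solution(Phi, t, lam=1e-3)
```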
Regularized Least Squares (2)
• With a more general regularizer, we have $\frac{1}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\}^2 + \frac{\lambda}{2}\sum_{j=1}^{M}|w_j|^q$
• $q = 1$ gives the lasso; $q = 2$ gives the quadratic regularizer.
Regularized Least Squares (3)
• Lasso tends to generate sparser solutions than a quadratic regularizer.
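As a hedged illustration of that sparsity claim, the sketch below fits scikit-learn's Lasso and Ridge models on synthetic data (the feature count, noise level, and alpha values are arbitrary choices, not from the slides) and counts near-zero coefficients.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.5, 1.0]          # only 3 informative features
y = X @ true_w + rng.normal(0, 0.5, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# lasso typically zeroes out most uninformative coefficients; ridge only shrinks them
print("lasso near-zero coefficients:", np.sum(np.isclose(lasso.coef_, 0.0)))
print("ridge near-zero coefficients:", np.sum(np.isclose(ridge.coef_, 0.0)))
```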
Gradient Descent & Sequential Learning
Gradient Descent in 1D
Suppose we want to minimize the function $f(x) = x^4 - 15x^3 + 80x^2 - 180x + 144$.
• There are many approaches for doing this.
• We'll discuss one approach today called "gradient descent".
Gradient Descent Intuition
The intuition behind 1D gradient descent:
• To the left of a minimum, the derivative is negative (the curve is going down).
• To the right of a minimum, the derivative is positive (the curve is going up).
• The derivative tells you which direction to move and how far.
Let's work from here and try to invent gradient descent.
Gradient Descent Algorithm
The gradient descent algorithm repeatedly applies the update $x \leftarrow x - \alpha\, f'(x)$:
• $\alpha$ is known as the "learning rate".
• Too large and the algorithm fails to converge.
• Too small and it takes too long to converge.
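A minimal sketch of that 1D update on the quartic above; the starting point, learning rate, and iteration count are arbitrary choices for illustration.

```python
def f(x):
    return x**4 - 15*x**3 + 80*x**2 - 180*x + 144

def df(x):
    # derivative of f
    return 4*x**3 - 45*x**2 + 160*x - 180

x = 0.0          # assumed starting point
alpha = 0.001    # learning rate: too large diverges, too small is slow
for _ in range(10_000):
    x = x - alpha * df(x)

# depending on the starting point, this may land in a local rather than global minimum
print(x, f(x))
```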
GD Only Finds Local Minima
• If the loss function has multiple local minima, GD is not guaranteed to find the global minimum.
• Suppose we have this loss curve:
GD Only Finds Local Minima
• Here’s how GD runs:
● GD can converge at −15 even though the global minimum is −18.
Convexity
• For a convex function f, any local minimum is also a global minimum.
• If the loss function is convex, gradient descent will always find the globally optimal minimizer.
• Formally, f is convex if, for any line drawn between two points on the curve, all values of the curve lie on or below that line. More formally: $f(\lambda x_1 + (1-\lambda)x_2) \le \lambda f(x_1) + (1-\lambda) f(x_2)$ for all $x_1, x_2$ and all $\lambda \in [0, 1]$.
Multi Dimensional Gradient Descent
On a 2D surface, the best way to go down is described by a 2D vector: the negative gradient. The next value for θ is obtained by stepping along that vector, $\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha\,\nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}^{(t)})$.
[Figure slides: successive gradient steps on a 2D loss surface]
Batch Gradient Descent
• Gradient descent algorithm: nudge θ in the negative gradient direction until θ converges.
• Batch gradient descent update rule: $\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha\,\nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}^{(t)}, \mathbf{X}, \mathbf{y})$
• θ: model weights; L: loss function; α: learning rate, typically either constant or 1/(t+1); y: true values from the training data; $\nabla_{\boldsymbol{\theta}} L$: gradient of the loss with respect to θ.
Gradient Descent Algorithm
● Initialize model weights to all zero.
  ○ Also common: initialize using small random numbers.
● Update model weights using the update rule: $\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha\,\nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}^{(t)}, \mathbf{X}, \mathbf{y})$
● Repeat until the model weights don't change (convergence).
● At this point, we have θ̂, our minimizing model weights.
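A minimal sketch of batch gradient descent for linear least squares, assuming the mean-squared-error loss; the learning rate and iteration count are arbitrary.

```python
import numpy as np

def batch_gradient_descent(Phi, t, alpha=0.1, n_iters=1000):
    # Minimize L(theta) = (1/2N) * ||Phi @ theta - t||^2 with full-batch updates.
    N, M = Phi.shape
    theta = np.zeros(M)                       # initialize weights to all zero
    for _ in range(n_iters):
        grad = Phi.T @ (Phi @ theta - t) / N  # gradient of the loss w.r.t. theta
        theta = theta - alpha * grad          # update rule
    return theta

# e.g. theta_hat = batch_gradient_descent(Phi, t)  # Phi, t as in the earlier sketch
```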
Stochastic Gradient Descent
1. Draw a simple random sample of data indices.
   • Often called a batch or mini-batch.
   • The choice of batch size trades off gradient quality against speed.
2. Compute the gradient estimate on that sample and use it in place of the full gradient.
For 𝜏 from 0 to convergence, starting from an initial vector $\boldsymbol{\theta}^{(0)}$ (random, zeros, …), repeat these two steps and apply the update $\boldsymbol{\theta}^{(\tau+1)} = \boldsymbol{\theta}^{(\tau)} - \alpha\,\nabla_{\boldsymbol{\theta}} \hat{L}(\boldsymbol{\theta}^{(\tau)})$.
Decomposable Loss
• The loss can be written as a sum of the loss on each record, $L(\boldsymbol{\theta}) = \frac{1}{N}\sum_{n=1}^{N} \ell(\boldsymbol{\theta}; \mathbf{x}_n, t_n)$, which is what makes the mini-batch gradient an unbiased estimate of the full gradient.
Online Linear Regression
• Data items are considered one at a time (a.k.a. online learning); use stochastic (sequential) gradient descent: $\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \eta\,\{t_n - \mathbf{w}^{(\tau)T}\boldsymbol{\phi}(\mathbf{x}_n)\}\,\boldsymbol{\phi}(\mathbf{x}_n)$
• This is known as the least-mean-squares (LMS) algorithm. Issue: how to choose η?
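A minimal sketch of the LMS update for sequential learning; the fixed learning rate `eta` is an arbitrary choice (in practice it is often decayed over time).

```python
import numpy as np

def lms(phi_stream, t_stream, M, eta=0.05):
    # phi_stream, t_stream: sequences of basis vectors phi(x_n) and targets t_n
    w = np.zeros(M)
    for phi_n, t_n in zip(phi_stream, t_stream):
        error = t_n - w @ phi_n          # prediction error on the current item
        w = w + eta * error * phi_n      # LMS / sequential gradient step
    return w

# e.g. w_hat = lms(Phi, t, M=Phi.shape[1])  # feed the earlier design matrix row by row
```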
Gradient Descent vs. Stochastic Gradient Descent
[Figure slides: optimization trajectories of batch gradient descent and stochastic gradient descent]
Multiple Outputs
Multiple Outputs (1)
• Analogously to the single-output case we have $\mathbf{y}(\mathbf{x}, \mathbf{W}) = \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x})$, with $p(\mathbf{t} \mid \mathbf{x}, \mathbf{W}, \beta) = \mathcal{N}(\mathbf{t} \mid \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}), \beta^{-1}\mathbf{I})$.
• Given observed inputs $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ and targets $\mathbf{T} = [\mathbf{t}_1, \dots, \mathbf{t}_N]^T$, we obtain the log-likelihood function $\ln p(\mathbf{T} \mid \mathbf{X}, \mathbf{W}, \beta) = \frac{NK}{2}\ln\frac{\beta}{2\pi} - \frac{\beta}{2}\sum_{n=1}^{N}\left\|\mathbf{t}_n - \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n)\right\|^2$
Multiple Outputs (2)
• Maximizing with respect to $\mathbf{W}$, we obtain $\mathbf{W}_{\mathrm{ML}} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{T}$.
• If we consider a single target variable, $t_k$, we see that $\mathbf{w}_k = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t}_k = \boldsymbol{\Phi}^{\dagger}\mathbf{t}_k$,
• where $\mathbf{t}_k = [t_{1k}, \dots, t_{Nk}]^T$, which is identical to the single-output case.
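A small numpy sketch showing that the multi-output solution is the single-output pseudo-inverse solution applied column by column; the data shapes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 6))     # N = 50 samples, M = 6 basis functions
T = rng.normal(size=(50, 3))       # K = 3 target variables per sample

W_ml = np.linalg.pinv(Phi) @ T     # shape (M, K): all outputs solved at once

# column k of W_ml equals the single-output solution for target column k
w_0 = np.linalg.pinv(Phi) @ T[:, 0]
assert np.allclose(W_ml[:, 0], w_0)
```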
Bias-Variance Tradeoff
The Bias-Variance Decomposition (1)
• Recall the expected squared loss, $\mathbb{E}[L] = \int \{y(\mathbf{x}) - h(\mathbf{x})\}^2 p(\mathbf{x})\,\mathrm{d}\mathbf{x} + \iint \{h(\mathbf{x}) - t\}^2 p(\mathbf{x}, t)\,\mathrm{d}\mathbf{x}\,\mathrm{d}t$
• where $h(\mathbf{x}) = \mathbb{E}[t \mid \mathbf{x}] = \int t\, p(t \mid \mathbf{x})\,\mathrm{d}t$.
• The second term of $\mathbb{E}[L]$ corresponds to the noise inherent in the random variable $t$.
• What about the first term?
The Bias-Variance Decomposition (2)
• Suppose we were given multiple data sets, each of size N. Any particular data set, D, will give a particular function y(x; D). We then have
$$\{y(\mathbf{x}; D) - h(\mathbf{x})\}^2 = \{y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)]\}^2 + 2\{y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)]\}\{\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\} + \{\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\}^2$$
The Bias-Variance Decomposition (3)
• Taking the expectation over D yields (the cross term vanishes)
$$\mathbb{E}_D\big[\{y(\mathbf{x}; D) - h(\mathbf{x})\}^2\big] = \{\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\}^2 + \mathbb{E}_D\big[\{y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)]\}^2\big]$$
The Bias-Variance Decomposition (4)
• Thus we can write: expected loss = (bias)² + variance + noise,
• where
$$(\text{bias})^2 = \int \{\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\}^2 p(\mathbf{x})\,\mathrm{d}\mathbf{x}$$
$$\text{variance} = \int \mathbb{E}_D\big[\{y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)]\}^2\big] p(\mathbf{x})\,\mathrm{d}\mathbf{x}$$
$$\text{noise} = \iint \{h(\mathbf{x}) - t\}^2 p(\mathbf{x}, t)\,\mathrm{d}\mathbf{x}\,\mathrm{d}t$$
The Bias-Variance Decomposition (5)–(7)
• Example: 25 data sets from the sinusoidal, varying the degree of regularization, λ.
[Figure slides: fitted curves and their average for large, intermediate, and small λ]
The Bias-Variance Trade-off
• From these plots, we note that an over-regularized model (large λ) will have a high bias, while an under-regularized model (small λ) will have a high variance.
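A hedged simulation sketch of this decomposition: it refits a ridge-regularized model on many independently drawn sinusoidal data sets and estimates bias² and variance empirically (the data-set count, noise level, basis, and λ values are arbitrary choices, not from the slides).

```python
import numpy as np

rng = np.random.default_rng(0)
x_grid = np.linspace(0, 1, 100)
h = np.sin(2 * np.pi * x_grid)                 # true regression function h(x)
centers = np.linspace(0, 1, 9)

def features(x):
    # Gaussian basis plus a bias column
    return np.column_stack([np.ones_like(x),
                            np.exp(-(x[:, None] - centers)**2 / (2 * 0.1**2))])

def bias_variance(lam, n_datasets=25, N=25, noise=0.3):
    preds = []
    for _ in range(n_datasets):
        x = rng.uniform(0, 1, N)
        t = np.sin(2 * np.pi * x) + rng.normal(0, noise, N)
        Phi = features(x)
        w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
        preds.append(features(x_grid) @ w)
    preds = np.array(preds)                    # shape (n_datasets, len(x_grid))
    bias2 = np.mean((preds.mean(axis=0) - h) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

for lam in [1e-5, 1e-2, 10.0]:                 # small, intermediate, large regularization
    print(lam, bias_variance(lam))
```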
Bayesian Linear Regression
Bayesian Linear Regression (1)
• Define a conjugate prior over $\mathbf{w}$: $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$.
• Combining this with the likelihood function and using results for marginal and conditional Gaussian distributions gives the posterior $p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$
• where $\mathbf{m}_N = \mathbf{S}_N(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\boldsymbol{\Phi}^T\mathbf{t})$ and $\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}$.
Bayesian Linear Regression (2)
• A common choice for the prior is $p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$,
• for which $\mathbf{m}_N = \beta\,\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t}$ and $\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}$.
• Next we consider an example…
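Before the example, a minimal numpy sketch of this posterior update with the zero-mean isotropic prior; `alpha` and `beta` are assumed hyperparameter values, and `Phi`, `t` are as in the earlier least-squares sketch.

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    # S_N^{-1} = alpha*I + beta * Phi^T Phi,   m_N = beta * S_N Phi^T t
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ (Phi.T @ t)
    return m_N, S_N

# e.g. m_N, S_N = posterior(Phi, t, alpha=2.0, beta=25.0)
```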
Bayesian Linear Regression (3)
• 0 data points observed.
[Figure panels: prior | data space]
Bayesian Linear Regression (4)
• 1 data point observed.
[Figure panels: likelihood | posterior | data space]
Bayesian Linear Regression (5)
• 2 data points observed.
[Figure panels: likelihood | posterior | data space]
Bayesian Linear Regression (6)
• 20 data points observed.
[Figure panels: likelihood | posterior | data space]
Predictive Distribution
Predictive Distribution (1)
• Predict t for new values of x by integrating over w: $p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\,\mathrm{d}\mathbf{w} = \mathcal{N}(t \mid \mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}), \sigma_N^2(\mathbf{x}))$
• where $\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^T\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x})$.
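A minimal sketch of the predictive mean and variance, reusing the hypothetical `posterior` helper from the Bayesian sketch above.

```python
import numpy as np

def predictive(phi_x, m_N, S_N, beta):
    # mean = m_N^T phi(x);  variance = 1/beta + phi(x)^T S_N phi(x)
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x
    return mean, var

# e.g. for a new input x*, build its basis vector phi_x and call
# mean, var = predictive(phi_x, m_N, S_N, beta=25.0)
```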
Predictive Distribution (2)
• Example: sinusoidal data, 9 Gaussian basis functions, 1 data point.
Predictive Distribution (3)
• Example: sinusoidal data, 9 Gaussian basis functions, 2 data points.
Predictive Distribution (4)
• Example: sinusoidal data, 9 Gaussian basis functions, 4 data points.
Predictive Distribution (5)
• Example: sinusoidal data, 9 Gaussian basis functions, 25 data points.
Any Questions?!