
Smoothing
R.W. Oldford

Contents

1 Fitting locally
  1.1 Piecewise fitting
  1.2 Multidimensional splines
  1.3 Local neighbourhoods
  1.4 Multidimensional x
2 Linear smoothers
  2.1 Complementary viewpoints
3 More than one explanatory variate
  3.1 Multiple regression smooths
  3.2 Additive models
  3.3 Advantages of additive models

    1 Fitting locally

Recall the Facebook data on like and Impressions where we fit a cubic to capture the average of the ys as a function of x:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + r.$$

Here y was log(like + 1) and x was log(Impressions).


[Figure: Facebook, log(like + 1) versus log(Impressions).]

There are a few things worth noting about this model:

• it is a polynomial
• it is defined for any x ∈ ℜ, that is, for any x within the range of the xi and for any x outside that range
• the corresponding generative model has

$$Y = \mu(x) + R$$

with E(R) = 0. That is, the mean or expected value of Y is a function of x
• that function is

$$\mu(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3.$$

We might reasonably ask which, if any, of these implicit assumptions generalize. It is not clear, for example, whether a cubic is appropriate here. Perhaps a higher degree polynomial would be better? Or a lower degree? Or perhaps some non-polynomial curve might make more sense? We might want to be guided by whatever prior information we might have about the functional form of the model. Barring the availability of such prior information, we might prefer to have a less prescriptive model for the functional form of µ(x).

In most cases, we would rather "let the data speak for themselves" in suggesting what shape the dependence of y on x might take. We would rather not have to specify that shape in advance.


A less prescriptive approach might be to take the generative model at face value, namely that we are modelling the mean of Y at any given point x. Having data values (x1, y1), . . . , (xN, yN) it might make sense to simply estimate µ(x) by the arithmetic average of the yi values for those points whose corresponding xi is either equal to the x of interest or is nearly so.

For example, a plot of log(Impressions) versus Post.Month might simply "connect the dots" of the monthly averages:

plot(fb$Post.Month, fb$x,
     main = "Facebook",
     xlab = "Post.Month",
     ylab = "log(Impressions)",
     pch = 19,
     col = adjustcolor("firebrick", 0.7))
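The code that computes the monthly summaries overlaid in the figure below is only sketched here; one way to obtain them, assuming fb$x holds log(Impressions) (the colours and the lm() overlay are assumptions, not the original code), might be:

# Sketch: monthly averages and medians of log(Impressions)
months   <- sort(unique(fb$Post.Month))
averages <- tapply(fb$x, fb$Post.Month, mean)
medians  <- tapply(fb$x, fb$Post.Month, median)

abline(lm(x ~ Post.Month, data = fb), col = "black", lwd = 2)            # LS line
lines(months, averages, type = "b", pch = 19, col = "steelblue", lwd = 2)
lines(months, medians,  type = "b", pch = 19, col = "darkgreen", lwd = 2)
legend("topleft", legend = c("LS line", "Averages", "Medians"),
       col = c("black", "steelblue", "darkgreen"), lwd = 2)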

[Figure: Facebook, log(Impressions) versus Post.Month, with the LS line, monthly averages, and medians.]

The averages might better reflect the month to month differences than does the least-squares fitted line, though the latter is simpler. The medians on the other hand produce a smoother curve that is not so influenced by outlying y values.

With the monthly posts, we have many observations available for each month, that is many y values for every unique x. More generally, we might choose to have

$$\hat\mu(x) = \frac{1}{k_x} \sum_{x_i \in \text{Nbhd}(x)} y_i$$

where Nbhd(x) denotes a neighbourhood of x and k_x is the number of points in that neighbourhood. It might, for example, be a neighbourhood containing a fixed number of neighbours or a distance based neighbourhood. Either way, µ(x) is being determined locally, in this case by the average of all of the yi in the local neighbourhood.

For example, consider the following artificially generated set of point pairs:


[Figure: Fake data, scatterplot of y versus x.]

    1.1 Piecewise fitting

We could, for example, cut the range of the x values up into several fixed neighbourhoods and use the average y in each neighbourhood as µ̂(x) for any value of x in the neighbourhood. This might be accomplished as follows, first constructing break points breaks_v and neighbourhoods local_v having constant width (and hence varying numbers of points), and then a second set local_p having an approximately constant proportion of the points in each.
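The construction itself is only sketched here; one way to build both kinds of neighbourhoods (an assumption consistent with the interval labels printed below, not the original code) uses cut() and quantile():

# varying numbers, constant width: 10 equal-width neighbourhoods
breaks_v <- seq(min(x), max(x), length.out = 11)
local_v  <- cut(x, breaks = breaks_v, include.lowest = TRUE)

# constant proportion: breaks at the deciles, so about 10% of the points in each
breaks_p <- quantile(x, probs = seq(0, 1, 0.1))
local_p  <- cut(x, breaks = breaks_p, include.lowest = TRUE)

# local estimate of mu(x): the average y within each neighbourhood
mu_v <- tapply(y, local_v, mean)
mu_p <- tapply(y, local_p, mean)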

[Figure: Constant width nbhd, y versus x.]

plot(x, y, col = "grey80", pch = 19, cex = 0.5,
     main = "Constant proportion nbhd")
plot_ave(local_p, nbhd_p, x, mu_p, col = "red", lwd = 5)


[Figure: Constant proportion nbhd, y versus x.]

# Note that the first neighbourhoods had constant widths,
# and hence varying numbers of points
local_v

## [1] "[-0.306,-0.154]"    "(-0.154,-0.00201]"  "(-0.00201,0.15]"
## [4] "(0.15,0.302]"       "(0.302,0.454]"      "(0.454,0.606]"
## [7] "(0.606,0.758]"      "(0.758,0.91]"       "(0.91,1.06]"
## [10] "(1.06,1.21]"

# The second neighbourhoods had approximately constant number
# of points, this being about 10% of the x values in each.
# Hence they had varying widths.
#
local_p

## [1] "[-0.306,0.0714]"  "(0.0714,0.19]"   "(0.19,0.291]"
## [4] "(0.291,0.38]"     "(0.38,0.45]"     "(0.45,0.539]"
## [7] "(0.539,0.636]"    "(0.636,0.773]"   "(0.773,0.869]"
## [10] "(0.869,1.21]"

    And now both in the one plot.


[Figure: Both at once, constant width and constant proportion neighbourhood averages, y versus x.]

Each plot has a separate mean value covering each entire neighbourhood. Note however:

- they visibly have different neighbourhood sizes depending on the density of the points (in x)
- they are in some agreement, particularly about the coarse shape of µ(x)
- they disagree in places; some regions have quite different values
- they suggest that µ(x) is
  - locally flat (zero slope)
  - discontinuous

We might for example look at larger, or smaller neighbourhoods. First, let's make the neighbourhoods larger.


[Figure: Larger neighbourhoods, Constant width nbhd and Constant proportion nbhd, y versus x.]

    And now smaller neighbourhoods:

[Figure: Smaller neighbourhoods, Constant width nbhd and Constant proportion nbhd, y versus x.]

(Note that the intervals were not drawn here to cover the entire neighbourhood but only the points in each neighbourhood.)

Not surprisingly, the smaller neighbourhoods capture finer structure and the larger neighbourhoods coarser structure.

But we still have these "flats" everywhere. This, together with the obvious discontinuities for µ(x), also suggests something about the derivative µ′(x), namely that µ′(x) = 0 almost everywhere except at the discontinuities, where it is not defined (or essentially infinite).

We could replace the flats by local lines. This allows µ(x) to change somewhat more smoothly and lets the derivative take non-zero values within the neighbourhoods. We need only change the way we calculate the estimates of µ(x).

# Now we need to adapt the local average function
# to fit local lines.
#


get_ave

mu[nbhd_i]
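A minimal sketch of such a local-line version of get_ave (an assumption about the omitted details, not the original function) might be:

# Sketch: fit a straight line within each neighbourhood and use its
# fitted values as the local estimate of mu(x) there.
get_ave <- function(local, x, y) {
  mu <- numeric(length(y))
  for (nbhd in levels(local)) {
    nbhd_i <- which(local == nbhd)            # points in this neighbourhood
    fit    <- lm(y ~ x, subset = nbhd_i)      # local least-squares line
    mu[nbhd_i] <- predict(fit, newdata = data.frame(x = x[nbhd_i]))
  }
  mu
}

mu_p <- get_ave(local_p, x, y)   # e.g. on the constant proportion neighbourhoods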

[Figure: Constant proportion nbhd, local line fits, y versus x.]

Clearly, this follows the data more closely and more smoothly in each neighbourhood. Unfortunately, at the neighbourhood boundaries discontinuities can occur. How do we rid ourselves of the discontinuities?

    1.1.1 Splines

First, we note that within each neighbourhood we fit a linear model, say a polynomial. For example, in the jth neighbourhood a p degree polynomial would look like

$$\mu^{(j)}(x) = \beta^{(j)}_0 + \beta^{(j)}_1 x + \cdots + \beta^{(j)}_p x^p \qquad \forall\, x \in \text{Nbhd}_j.$$

Since no x is in more than one neighbourhood, we could fit all parameters at once in a single model involving all of the data. If there were J neighbourhoods, this means a linear model having a total of J(p + 1) parameters. The entire model could be written as

$$\mu(x) = \sum_{j=1}^{J} I_{\text{Nbhd}_j}(x)\, \mu^{(j)}(x)$$

where $I_A(x)$ is the indicator function evaluating to 1 whenever x ∈ A and to zero otherwise.


To rid ourselves of discontinuities, we need only restrict the parameters so that the discontinuities disappear. For example, suppose the discontinuities occur at x = k_j for j = 1, . . . , K with K = J - 1. (The boundary points k_j are sometimes called knots.) Then we would restrict the parameters by forcing the curves to meet at every boundary. That is

$$\beta^{(j)}_0 + \beta^{(j)}_1 k_j + \cdots + \beta^{(j)}_p k_j^p = \beta^{(j+1)}_0 + \beta^{(j+1)}_1 k_j + \cdots + \beta^{(j+1)}_p k_j^p$$

for j = 1, . . . , K. This introduces K = J - 1 restrictions so that there are really only J(p + 1) - (J - 1) = pJ + 1 parameters.

The curve µ(x) would be continuous but could now have "kinks" at the joins. We could make these go away by forcing the first derivatives to match at the boundaries as well. This would introduce another set of K = J - 1 restrictions, giving only pJ + 1 - (J - 1) = (p - 1)J + 2 free parameters. We could then match the second derivative, giving a still smoother function that is constrained by a further K restrictions. If we match the function and also match d of its derivatives, the total number of parameters which remain is

$$J(p + 1) - (d + 1)K = pJ + J - (d + 1)J + (d + 1) = (p - d)J + d + 1 = (p - d)K + p + 1.$$

If we choose d = p - 1 this becomes simply K + p + 1. A piecewise polynomial of degree p that matches on the function and on p - 1 of its derivatives is called a p-degree spline or a pth order spline. (The name comes from the flexible tool called the spline that was used in drafting to draw smooth curves by hand; see https://en.wikipedia.org/wiki/Flat_spline.)

A popular choice is the cubic spline, having p = 3. These provide enough flexibility for most purposes and, at least according to statistical folklore, are smooth enough that the human visual system cannot detect the locations of the knots!

The linear model that would result, say for a cubic spline having p = 3, would be written as a linear combination like

$$\mu(x) = \beta_0 + \beta_1 b_1(x) + \cdots + \beta_{K+3}\, b_{K+3}(x)$$

for some particular choices of b_j(x). These functions can be thought of as a set of basis functions in the same way that the N-dimensional vectors

$$\mathbf{b}_j = (b_j(x_1), \ldots, b_j(x_N))^T$$

that result from evaluating the functions at the points x_1, . . . , x_N form a set of basis vectors (together with the 1 vector multiplying β_0) for the vector space spanned by the column vectors of an N × (K + 4) regression matrix X = [1, b_1, . . . , b_{K+3}].

As a basis the set of functions generate a space (having K + 4 dimensions) and, conversely, for that space there are any number of bases which would generate it. For example, let

$$h(x, k_j) = (x - k_j)^3_+ = \begin{cases} (x - k_j)^3 & \text{when } x > k_j \\ 0 & \text{otherwise} \end{cases}$$

where k_j is the location of the jth knot. Using these functions, it can be shown that the cubic spline can be written as

$$\mu(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 h(x, k_1) + \cdots + \beta_{K+3} h(x, k_K).$$

In this way, our basis has been formed by starting with the basis functions for a simple cubic and then adding a truncated power basis function for each knot. This holds more generally for a p-degree spline where we can write

$$\mu(x) = \beta_0 + \beta_1 x + \cdots + \beta_p x^p + \beta_{p+1} h_1(x, k_1) + \cdots + \beta_{K+p} h_K(x, k_K)$$

in terms of the basis functions being those of a p degree polynomial plus the truncated power basis functions:

$$h_j(x, k_j) = (x - k_j)^p_+ \qquad \text{for } j = 1, \ldots, K.$$

Note again that this choice is just one possible choice of basis functions. This particular choice is fairly easy to understand conceptually but it is unfortunately a poor choice for computation. The problem is that powers of large numbers can lead to numerical instability and round-off error. Instead, an equivalent set of basis functions which is better computationally is the so-called B-spline basis, which also allows for efficient computation even when K, the number of knots, is large.

The p-degree splines with fixed knots are a very flexible set of functions for µ(x). Because we are free to choose the knots, we can choose how many and place them where we please. More knots means more flexibility, and so we could even choose to add more in x regions where we think that µ(x) varies more. These fixed knot splines are also sometimes called regression splines and, as in any polynomial regression problem where we might add more terms to better fit the data, with these regression splines we could add more knots for the same reason and in a more targeted fashion.

This can be used for any x; all we need are the coefficient estimates and the knot values to make a prediction at any x.

1.1.1.1 Fitting splines in R

There is a splines package in R that contains the function bs() which will calculate an X matrix (excluding the intercept term unless it is specifically requested) corresponding to a B-spline basis for a p-degree spline at fixed knots. Unless the value of the argument degree is provided, the default spline will be cubic.
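For example, a sketch of the sort of call being made (assuming, consistent with the knot values and coefficient names printed further below, a cubic with interior knots knots_p at the deciles of x; the quantile() construction is an assumption):

library(splines)

p <- 3                                               # degree: cubic
knots_p <- quantile(x, probs = seq(0.1, 0.9, 0.1))   # interior knots at the deciles
Xmat <- bs(x, degree = p, knots = knots_p)           # B-spline basis matrix (no intercept)
dim(Xmat)                                            # N rows, K + p = 12 basis columns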

parOptions

[Figure: B-spline basis vectors 5 through 12, each plotted against x (vertical axis: Basis).]

par(parOptions)

Clearly, the basis functions are not polynomials. The estimated smooth µ̂(x) will be a linear combination of these functions. With this X matrix in hand, we can now fit the cubic spline to this data:

fit
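A sketch of this fit, consistent with the coefficient names printed below (the predict() line is an assumption):

fit <- lm(y ~ bs(x, degree = p, knots = knots_p))   # least-squares fit of the cubic spline
ypred <- predict(fit)                                # fitted values at the observed x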

[Figure: Cubic spline, knots at lines, y versus x.]

The linear combination of the basis functions (excluding an intercept) is given by the fitted coefficient estimates, namely:

coef(fit)

##                            (Intercept)   -1.4686086
##    bs(x, degree = p, knots = knots_p)1    0.6263294
##    bs(x, degree = p, knots = knots_p)2    0.3132570
##    bs(x, degree = p, knots = knots_p)3    0.4536932
##    bs(x, degree = p, knots = knots_p)4    0.6135543
##    bs(x, degree = p, knots = knots_p)5    0.5744901
##    bs(x, degree = p, knots = knots_p)6    0.6375489
##    bs(x, degree = p, knots = knots_p)7    2.6419658
##    bs(x, degree = p, knots = knots_p)8    1.7455348
##    bs(x, degree = p, knots = knots_p)9    2.0888355
##   bs(x, degree = p, knots = knots_p)10    2.3340356
##   bs(x, degree = p, knots = knots_p)11    2.8483187
##   bs(x, degree = p, knots = knots_p)12    2.6743683

    Note also that we need not have fit this via least-squares:


library(robust)

fit

fit1
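A sketch of what such a robust alternative might look like, assuming the lmRob() function from the robust package is applied to the same B-spline basis (the exact call and the plotting overlay are assumptions, not the original code):

library(robust)

# robust fit on the same basis, for comparison with the least-squares fit
fit1 <- lmRob(y ~ bs(x, degree = p, knots = knots_p))

# compare the fitted curves
plot(x, y, col = "grey80", pch = 19, cex = 0.5, main = "LS versus robust cubic spline")
Xorder <- order(x)
lines(x[Xorder], fitted(fit)[Xorder],  col = "steelblue", lwd = 2)
lines(x[Xorder], fitted(fit1)[Xorder], col = "firebrick", lwd = 2, lty = 2)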

1.1.2 Natural splines

As we have seen in several examples already, polynomial functions can be very wild at the end of the range of the x data and beyond. With p-degree splines, the polynomials fit to the outside edge of the range of the x data are likely to be even wilder since they are based on many fewer points.

To help address this problem, the spline is sometimes constrained to be only a linear function beyond the largest and smallest knots. For a p-degree spline (with odd p), we define a natural spline with knots at k1 < k2 < · · · < kK to be a function µ(x) such that

• µ(x) is a polynomial of degree p in each interior neighbourhood [k1, k2], . . . , [kK−1, kK],
• µ(x) is a polynomial of degree (p − 1)/2 on (−∞, k1] and on [kK, ∞), and
• µ(x) is continuous and has continuous derivatives of order 1, . . . , (p − 1) at its knots k1, . . . , kK.

This forces the polynomials at either end to fluctuate less by severely reducing their degree. Perhaps the most common natural spline used is the natural cubic spline, which forces the fit on either end to be a straight line.

Recall that the model degrees of freedom associated with a p-degree spline was K + p + 1, where K was the number of knots. In the case of a natural p degree spline the model degrees of freedom are

$$\underbrace{(p + 1)(K - 1)}_{\substack{\text{interior}\\ \text{neighbourhoods}}} \;+\; \underbrace{2\left(1 + \frac{p - 1}{2}\right)}_{\substack{\text{exterior}\\ \text{neighbourhoods}}} \;-\; \underbrace{Kp}_{\substack{\text{continuity}\\ \text{constraints}}} \;=\; K.$$

This means that the model degrees of freedom for natural splines depends only on the number of knots! Which is, you have to admit, pretty amazing.

1.1.2.1 Fitting natural splines

In the splines package a basis matrix is provided by the function ns(...) and only for natural cubic splines. Degree 3 polynomials are by far the most common and, as it turns out, are all we really need in most circumstances (see "smoothing splines" below).

In R, in addition to any interior knots provided by the user (or determined automatically from user supplied arguments), two additional boundary knots, k0 and kK+1, may be supplied (k0 < ki and kK+1 > kK). These determine the points beyond which the lower degree polynomials are fit. By default, ns chooses the boundary knots at the minimum and maximum x values in the data.

Previously we fit a cubic spline to this data with knots at

knots_p

##        10%        20%        30%        40%        50%        60%
## 0.07138955 0.18966397 0.29106870 0.38030525 0.45015999 0.53857863
##        70%        80%        90%
## 0.63629916 0.77296504 0.86884995

And considered what the basis functions looked like. Analogously, we can plot the natural spline basis functions for this data as well.

Xmat.ns

       ylim = blim, xlim = extendrange(x),
       xlab = "x", ylab = "Basis",
       main = paste("ns basis vector", j),
       col = "firebrick")
}
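A sketch of the whole construction (the basis matrix call and the common vertical range blim are assumptions consistent with the fragment above, not the original listing):

Xmat.ns <- ns(x, knots = knots_p)     # natural cubic spline basis matrix (10 columns here)
blim <- extendrange(Xmat.ns)          # common vertical range for the panels

parOptions <- par(mfrow = c(2, 2))
for (j in 1:ncol(Xmat.ns)) {
  plot(x, Xmat.ns[, j], pch = 19, cex = 0.5,
       ylim = blim, xlim = extendrange(x),
       xlab = "x", ylab = "Basis",
       main = paste("ns basis vector", j),
       col = "firebrick")
}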

[Figure: ns basis vectors 1 through 10, each plotted against x (vertical axis: Basis).]

par(parOptions)

We can also compare the two fits that would result for our fake data.

fit.bs
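A sketch of the two fits and the comparison plot (an assumption consistent with the names fit.bs and fit.ns and the legend in the figure below, not the original code):

fit.bs <- lm(y ~ bs(x, degree = 3, knots = knots_p))   # cubic regression spline
fit.ns <- lm(y ~ ns(x, knots = knots_p))               # natural cubic spline

plot(x, y, col = "grey80", pch = 19, cex = 0.5,
     main = "Comparing cubic spline, knots at lines")
Xorder <- order(x)
lines(x[Xorder], fitted(fit.bs)[Xorder], col = "firebrick",  lwd = 2)
lines(x[Xorder], fitted(fit.ns)[Xorder], col = "steelblue", lwd = 2, lty = 2)
abline(v = knots_p, col = "grey50", lty = 3)
legend("topleft", legend = c("bs - cubic spline", "ns - natural spline"),
       col = c("firebrick", "steelblue"), lwd = 2, lty = c(1, 2))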

[Figure: Comparing cubic spline, knots at lines, with the bs (cubic spline) and ns (natural spline) fits over y versus x.]

The straight line fits at the ends of the natural cubic spline can be seen in the plot. Note the essential agreement between the two nearly everywhere else.

Now, fit.bs used fit.bs$rank = 13 model degrees of freedom in building its fit whereas the natural spline fit.ns used only fit.ns$rank = 11 degrees of freedom. Note that ns added two boundary knots to the 9 we provided it, hence the 11 degrees of freedom for this model. The natural spline has two fewer degrees of freedom. We might "spend" these two degrees of freedom on the placement of two more interior knots in the natural spline.

Suppose we use these extra model degrees of freedom by having two more interior knots, say between the 60% and 70% quantiles. This should allow us to fit the abrupt change in the middle a little better.

knots_p2

ypred.ns2
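A sketch of the refit (the two extra knot locations are an assumption; only their rough placement between the 60% and 70% quantiles is stated above):

# two extra interior knots between the 60% and 70% quantiles of x
knots_p2 <- sort(c(knots_p, quantile(x, probs = c(0.63, 0.67))))

fit.ns2   <- lm(y ~ ns(x, knots = knots_p2))
ypred.ns2 <- predict(fit.ns2)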

1.1.3 Smoothing splines

Suppose that µ(x) is at least twice differentiable. Then the first derivative µ′(x) measures the slope of the function at any point x. A function of any slope can be smooth. However, if that slope changes frequently and/or abruptly, then the function would not be smooth but rather it would be rough.

The second derivative µ′′(x) measures how quickly the slope changes at any given point x. If this is large (positive or negative) then the slope is changing quickly. Large values of (µ′′(x))² indicate that there is an abrupt change in slope at the point x. One possible measure of roughness then might be

$$\int (\mu''(t))^2\, dt$$

as the average of (µ′′(x))² over all x. The smaller is the average, the smoother is µ(x). One way to proceed would be to find µ̂ that minimizes

$$\text{RSS}(\mu, \lambda) = \sum_{i=1}^{N} (y_i - \mu(x_i))^2 + \lambda \int (\mu''(t))^2\, dt$$

with λ ≥ 0. This is a penalized residual sum of squares: the first term is the residual sum of squares, the second term a penalty function which is larger the rougher is µ(x).

Alternatively, had we been using a Gaussian generative model for the residuals, then we might cast the penalty function in probabilistic terms. For example, we have

$$Y_i \sim N(\mu(x_i), \sigma^2)$$

as the generative model for the response Y conditional on the explanatory variate xi. Seeing that xi appears only through the function µ(xi), this is in some sense the conditional distribution of Y given both xi and µ(·). We might still want to condition on xi but would like to constrain the functional form of µ(·) in some way. A useful fiction that helps us think about this might be to imagine that the µ(·) that has generated the observed responses was itself randomly drawn from some collection of possible functions µ(·). Because we think that we are more likely to have been served up a smooth µ(·) than a rough one, we might imagine that the probability of any particular function µ(·) is proportional to

$$e^{-\lambda \int (\mu''(t))^2\, dt}.$$

Clearly functions with larger (average over the whole line) changes in the slope have lower probability; smoother µ(·) have higher probability. If this were indeed a probability, then we might apply Bayes's theorem and find the conditional distribution of µ(xi) given Yi = yi and xi. We could then choose µ̂(x) to maximize this probability. This turns out to be equivalent to minimizing the penalized residual sum of squares.

Given the Gaussian generative model, the penalized residual sum of squares may be thought of as a so-called "Bayesian" method in that it turns out to be the objective function which results from maximizing the posterior probability of µ(·) (in this language the marginal probability distribution of µ(·) is called the prior distribution because it is available prior to any data). Alternatively, the objective function is simply a penalized log-likelihood function whose penalty might have been constructed (or not) by imagining a prior distribution for µ(·).

We might also recognize the penalized residual sum of squares in this case as the objective function that would result from using λ as a Lagrange multiplier when minimizing the residual sum of squares subject to the constraint that $\int (\mu''(t))^2\, dt = 0$. The constraint would enforce complete smoothness (i.e. a straight line).

Note, for example, that however one might think of the penalized residual sum of squares, it is clear that the value taken by λ (which is still our choice) determines the smoothness of our estimated µ(x). If, for example, λ = ∞ then no change in the slope is allowed and we have the smoothest possible function, a line. Indeed, we would have the least-squares fitted line had by minimizing the first term alone. At the other extreme, if we


had λ = 0, then only the residual sum of squares would need to be minimized. With no further restrictions on µ, we would have µ̂(xi) = yi, which amounts to connecting the dots in the scatterplot from left to right (assuming no ties among the xi, otherwise we average y at those points having the same value of x).

It turns out that for any fixed λ, the solution to this penalized residual sum of squares problem is a natural cubic spline with knots at every unique value of the xi! That is, the solution requires µ(x) to be of the form

$$\mu(x) = \sum_{j=1}^{N} N_j(x)\, \beta_j$$

where the N_j(x) are a set of N basis functions for this family of natural cubic splines.

Knowing that the solution must be a natural cubic spline of the above form, we can rewrite the penalized residual sum of squares as

$$\text{RSS}(\mu, \lambda) = (\mathbf{y} - \mathbf{N}\boldsymbol{\beta})^T (\mathbf{y} - \mathbf{N}\boldsymbol{\beta}) + \lambda\, \boldsymbol{\beta}^T \boldsymbol{\Omega}_N \boldsymbol{\beta}$$

where $\mathbf{N} = [N_{ij}]$ is an N × N matrix whose (i, j) element is $N_{ij} = N_j(x_i)$, the jth natural cubic spline basis function evaluated at xi, and $\boldsymbol{\Omega}_N = [\omega_{ij}]$ is an N × N matrix whose (i, j) element is $\omega_{ij} = \int N_i''(t)\, N_j''(t)\, dt$.

The solution is now easily found to be

$$\hat{\boldsymbol{\beta}} = \left(\mathbf{N}^T\mathbf{N} + \lambda \boldsymbol{\Omega}_N\right)^{-1} \mathbf{N}^T \mathbf{y}$$

and the fitted smoothing spline is

$$\hat\mu(x) = \sum_{j=1}^{N} N_j(x)\, \hat\beta_j.$$

Looking at this solution, it would appear to be overparameterized: there are as many βj as there are observations yi. This solution however has been constrained, more or less so according to the value of λ. No longer are the number and location of knots chosen to make the function smoother or rougher (for the smoothing spline knots must be at the unique xi); instead we now choose a value for the smoothing parameter λ. The larger is λ, the smoother is the resulting estimated function µ̂(x).

1.1.3.1 Choosing λ via degrees of freedom

But how should we choose λ? One way would be to somehow connect the value of λ to a measure of the complexity (or roughness) of the fitted model.

A traditional measure of the complexity of a linear model is the number of linear parameters in that model. This in turn was equivalent to the rank of the X matrix or of the "hat-matrix"

$$H = X\left(X^T X\right)^{-1} X^T.$$

Recall that the role of the hat matrix is to determine the N dimensional fitted mean vector µ̂ = Hy, which it does by orthogonally projecting the vector y onto the space spanned by the columns of X, that is colsp(X). Note that colsp(H) = colsp(X), that is, it is the same space. The difference is that the columns of X form a basis for that space (assuming that X has full column rank) whereas the columns of H are generators of the space, but there are too many of them to be a basis. There are N generators in H when we need only r = rank(X), which is typically smaller than N.

One way to get a set of r basis vectors from H is to find its eigen-decomposition. Suppose we do that and that the ordered eigen-values are ρ1 ≥ ρ2 ≥ · · · ≥ ρN ≥ 0. Let the corresponding eigen-vectors be u1, . . . , uN. Then the decomposition gives

$$H = \sum_{i=1}^{N} \rho_i\, u_i u_i^T$$

and so

$$\hat{\boldsymbol\mu} = \sum_{i=1}^{N} \rho_i\, u_i u_i^T \mathbf{y} = \sum_{i=1}^{N} u_i\, \rho_i \langle u_i\,, \mathbf{y} \rangle$$


where ⟨·, ·⟩ indicates the inner product of its vector arguments. An interpretation of the last piece is that the vector y is first decomposed with respect to the orthonormal basis {u1, . . . , uN} for ℜ^N. Each ρi moderates the contribution of each piece to the corresponding basis vector ui. Now, the nature of H is that

$$\rho_i = \begin{cases} 1 & i = 1, \ldots, r \\ 0 & i = (r + 1), \ldots, N. \end{cases}$$

Being a projection matrix (idempotent), the eigen-values ρi of H either select a basis vector (ρi = 1) or do not (ρi = 0). Note that the dimension of this space is the sum of the eigenvalues, or the trace, tr(H), of H.

When fitting cubic splines, instead of X we had a basis matrix B, and corresponding projection matrix

$$H_B = B\left(B^T B\right)^{-1} B^T.$$

When using bs(...) or ns(...) in R we could specify the degrees of freedom df and have the function choose the appropriate number (and location) of the knots with which to build B. The question is whether we can parameterize a smoothing spline in a similar way.

A smoothing spline estimates the mean value at the points x1, . . . , xN by

$$\hat{\boldsymbol\mu} = (\hat\mu(x_1), \hat\mu(x_2), \ldots, \hat\mu(x_N))^T = \mathbf{N}\hat{\boldsymbol\beta} = \mathbf{N}\left(\mathbf{N}^T\mathbf{N} + \lambda\boldsymbol\Omega_N\right)^{-1}\mathbf{N}^T \mathbf{y} = \mathbf{S}_\lambda\, \mathbf{y}\,, \text{ say.}$$

So Sλ acts here like the hat-matrix H_B for a cubic spline. Unfortunately, it is not idempotent and hence not a projection matrix.

We can however see what it's made of. Note that N is N × N and of full rank N.

$$\mathbf{S}_\lambda = \mathbf{N}\left(\mathbf{N}^T\mathbf{N} + \lambda\boldsymbol\Omega_N\right)^{-1}\mathbf{N}^T = \left(\mathbf{N}^{-T}\left(\mathbf{N}^T\mathbf{N} + \lambda\boldsymbol\Omega_N\right)\mathbf{N}^{-1}\right)^{-1} = \left(\mathbf{I}_N + \lambda\,\mathbf{N}^{-T}\boldsymbol\Omega_N\mathbf{N}^{-1}\right)^{-1} = \left(\mathbf{I}_N + \lambda\mathbf{K}\right)^{-1}$$

where $\mathbf{K} = \mathbf{N}^{-T}\boldsymbol\Omega_N\mathbf{N}^{-1}$. Note that K does not involve the smoothing parameter λ. Note also that the objective function being minimized can now be rewritten as

$$\text{RSS}(\boldsymbol\mu, \lambda) = (\mathbf{y} - \boldsymbol\mu)^T (\mathbf{y} - \boldsymbol\mu) + \lambda\, \boldsymbol\mu^T \mathbf{K} \boldsymbol\mu.$$

Consequently, K is sometimes called the penalty matrix.

If $\mathbf{K} = \mathbf{V}\mathbf{D}\mathbf{V}^T$ is the eigen decomposition of the (real symmetric) matrix K with $\mathbf{D} = \text{diag}(d_1, \ldots, d_N)$ and $d_1 \ge \cdots \ge d_N \ge 0$, then it would appear that the components of µ in directions of eigen-vectors of K are to be penalized more when they correspond to large eigen-values di than when they correspond to small eigen-values di.

Note that both any constant function µ(x) = a and any straight line function µ(x) = a + bx are in the space spanned by the natural spline basis functions Ni(x). Hence there is one non-zero linear combination of these basis functions that leads to the constant and another that leads to a straight line. This implies that the two smallest eigen-values of K are d_{N−1} = d_N = 0.

This can be made more apparent by examining the solution vector µ̂ = Sλy. First, note that

$$\mathbf{S}_\lambda = \mathbf{V}\left(\mathbf{I}_N + \lambda\mathbf{D}\right)^{-1}\mathbf{V}^T$$

is the eigen decomposition of Sλ having eigen-values

$$\rho_i(\lambda) = \frac{1}{1 + \lambda d_{N-i+1}}$$

for i = 1, . . . , N. Large values of di will produce small values of ρi(λ). Similarly, large values of the smoothing parameter λ will produce small values of ρi(λ).

A closer look at µ̂ reveals

$$\hat{\boldsymbol\mu} = \sum_{i=1}^{N} \rho_i(\lambda)\, v_i v_i^T \mathbf{y} = \sum_{i=1}^{N} v_i\, \rho_i(\lambda) \langle v_i\,, \mathbf{y} \rangle$$

where the ith eigen-vector vi here is of Sλ and corresponds to its ith largest eigen-value ρi(λ). This is the reverse order of the eigen-vectors of K, in that the largest ρi(λ) corresponds to the smallest di.

This is a lot like the relationship we saw for p-order splines. The difference here is that the eigen-values ρi(λ) are not just zero or one. The two largest are 1 (corresponding to d_{N−1} = d_N = 0) and the rest are less than one.

With H_B of the ordinary pth order splines, the eigen-values of H_B selected the components of y in the directions of the eigen-vectors corresponding to the largest eigen-values (namely 1) and dropped the components in the directions corresponding to the smallest eigen-values (namely 0). That is the nature of a projection operator and so this kind of spline is sometimes called a projection smoother.

In contrast, the effect of Sλ on y is to shrink the components of y in the directions of its eigen-vectors. It shrinks more in those directions with small eigen-values ρi(λ) and less in directions with large ones. Unlike H_B, Sλ × Sλ ≠ Sλ and hence Sλ is not a projection matrix. Instead, Sλ × Sλ ⪯ Sλ: the product has smaller eigen-values than the original. Because of this shrinkage, the smoothing spline is sometimes called a shrinking smoother.

Analogous to the model degrees of freedom of a projection smoother being equivalent to the trace of its projection matrix, tr(H_B), we will take the effective degrees of freedom to be the trace of the smoother matrix, tr(Sλ). Both are the sums of their respective eigen-values. For the smoothing spline, then, the effective degrees of freedom can be expressed as

$$df_\lambda = \text{tr}(\mathbf{S}_\lambda) = \sum_{i=1}^{N} \frac{1}{1 + \lambda d_i}.$$

Which in turn means that rather than specify the smoothing parameter λ, we could specify the effective degrees of freedom dfλ and solve the above equation for λ.

Now as λ → 0, dfλ → N and Sλ → I_N. The result is a perfect (and likely very non-smooth) fit. Similarly, as λ → ∞, dfλ → 2 (corresponding to d_{N−1} = d_N = 0) and Sλ → H, the hat matrix for a straight line regression of y on x.

For λ values in between these two extremes, as λ increases we have greater shrinkage of the eigen-values ρi(λ) and a lower value of the effective degrees of freedom dfλ. Again, the shrinkage is greatest in the direction of the eigen-vectors vi corresponding to the smallest eigen-values of Sλ.

To get some sense of what basis functions these correspond to, we could get the eigen decomposition of Sλ for our fake data example. In R we will first fit the smoothing spline on the data set using dfλ = 11 to try to match our earlier natural spline fit. This is accomplished by using the function smooth.spline(...) as follows.

df

ypred.sm

S[,i]
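The fit and the construction of the smoother matrix are only sketched above; one way to do both (an assumption, not the original code; building S by smoothing unit vectors is one common device) is:

df <- 11
fit.sm   <- smooth.spline(x, y, df = df)       # smoothing spline with df_lambda = 11
ypred.sm <- predict(fit.sm, x = x)$y           # fitted values at the observed x

# Build the smoother matrix S column by column: column i is the result of
# smoothing the ith unit vector with the same effective degrees of freedom.
N <- length(y)
S <- matrix(0, N, N)
for (i in 1:N) {
  e_i    <- numeric(N); e_i[i] <- 1
  S[, i] <- predict(smooth.spline(x, e_i, df = df), x = x)$y
}

eigS <- eigen(S, symmetric = TRUE)             # eigen-values rho_i(lambda) and eigen-vectors
sum(eigS$values)                               # approximately the effective df (about 11)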

We see that the ρi(λ) values drop off very quickly, indicating that the components of y in the direction of the smallest eigen-values are shrunk a great deal (effectively made zero). Note also that the eigen-values do seem to sum to our intended effective degrees of freedom dfλ, or sum(eigS$values) = 11.0018443.

So, the question is, what do the eigen-vectors corresponding to the largest eigen-values look like? And how do they compare to those corresponding to the small eigen-values? To that end, we plot them as a curve evaluated at the x values in our data set. First, let's look at the curves for some small values of i.

plotEigenBases

[Figure: eigen-functions of Sλ for i = 1, 2, 3, 4 (ρ = 1, 1, 0.9986, 0.9913), each plotted against x.]

par(parOptions)

The next few are:

parOptions

[Figure: eigen-functions of Sλ for i = 5, 6, 7, 8 (ρ = 0.9747, 0.9373, 0.8749, 0.7851), each plotted against x.]

par(parOptions)

And how about a few farther out?

parOptions

[Figure: eigen-functions of Sλ for i = 10, 20, 30, 40 (ρ = 0.5529, 0.0551, 0.0079, 0.002), each plotted against x.]

par(parOptions)
parOptions

[Figure: eigen-functions of Sλ for i = 50, 60, 150, 200 (ρ = 6e−04, 2e−04, 0, 0), each plotted against x.]

par(parOptions)

As these plots indicate, the eigen-vectors contribute bumpier and bumpier basis vectors as the associated eigen-values diminish. The smoother radically downweights them according to the corresponding eigen-value ρi(λ). It does seem that the smaller the effective degrees of freedom, the smoother is the fitted function µ̂(x), just as is the case with the usual model degrees of freedom.

More formally, axioms for an effective dimension of a finite collection of vectors, say the N column vectors of a matrix M = [m1, . . . , mN], have been proposed (Oldford 1985) which are satisfied by the function

$$d_\alpha\left(\{m_1, \ldots, m_N\}\right) = d_\alpha(M) = \sum_{i=1}^{N} \left(\frac{\gamma_i}{\gamma_{\max}}\right)^{\alpha}$$

for at least α = 1, 2. Here γi is the ith largest singular value of M and γ_max = max_i γi. This "d_α effective dimension" also arises in a variety of estimation and diagnostic problems related to linear response models. For example, d_α(H) = p for all α ≠ 0.


    https://www.researchgate.net/publication/275521797_NEW_GEOMETRIC_THEORY_FOR_THE_LINEAR_MODEL

The effective degrees of freedom are therefore also an effective dimension of the set of column (or row) vectors of Sλ when α = 1.

    1.2 Multidimensional splines

The splines we have used above have been designed to fit a curve to a single explanatory variate x. But what if we have more than one explanatory variate? Suppose for example that we have only two explanatory variates, x and z say. One way to proceed would be to use an additive model

$$\mu(x, z) = \beta_0 + f_1(x) + f_2(z)$$

and use a penalty function

$$\left(\int f_1''(t)^2\, dt\right) + \left(\int f_2''(t)^2\, dt\right).$$

The resulting minimization is obtained when each fi(·) is itself a univariate spline. This result extends to any number of explanatory variates.

Another suggestion would be to choose a set of basis functions from existing univariate ones. For example, if we have a set of m basis functions b1(x), . . . , bm(x) for x and another set of n basis functions c1(z), . . . , cn(z) for z, then we could introduce basis functions

$$g_{jk}(x, z) = b_j(x) \times c_k(z)$$

for j = 1, . . . , m and k = 1, . . . , n, a so-called tensor product basis. Then

$$\mu(x, z) = \beta_0 + \sum_{j=1}^{m}\sum_{k=1}^{n} \beta_{jk}\, g_{jk}(x, z)$$

would be fitted via least-squares as before.

Clearly both of these methods generalize to any number of dimensions, but the effective model degrees of freedom grows multiplicatively with the number of explanatory variates using the tensor product bases and only linearly using additive bases.

One dimensional splines can also be generalized to higher dimensions via an appropriate penalty for high curvature in µ(x) with x ∈ ℜ^d. For example, when d = 2 we could choose our roughness penalty as

$$\int\!\!\int \left[\left(\frac{\partial^2 \mu(x, z)}{\partial x^2}\right)^2 + 2\left(\frac{\partial^2 \mu(x, z)}{\partial x\, \partial z}\right)^2 + \left(\frac{\partial^2 \mu(x, z)}{\partial z^2}\right)^2\right] dx\, dz.$$

Minimizing the residual sum of squares plus λ times this penalty function leads to a smooth two dimensional surface called a thin-plate spline.

    1.3 Local neighbourhoods

Compared to fitting splines with prespecified knots, smoothing splines seem to fit the data more locally in that they had knots at every unique point xi in the data. Another approach to finding a flexible function to fit the data would be to focus on each point x and fit a mean function at that point based on its local neighbourhood.


1.3.1 K nearest neighbour fitting

The simplest way to proceed might again be to fit a local average at every value x = xi in the data set. Values for other values of x might be found by simple linear interpolation between these fitted values (i.e. simply "connect the dots").

One way to define a local neighbourhood would be to find the k nearest neighbours of that x value. There exists a function called knn.reg in the R package FNN that will compute this average for every point xi in the data.

require(FNN)

# Let's try a few values for k
#
knn.fit5
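The calls themselves are only sketched here; a sketch consistent with knn.fit5$pred, knn.fit21$pred and knn.fit51$pred being plotted below (the exact arguments are an assumption) might be:

knn.fit5  <- knn.reg(train = matrix(x, ncol = 1), y = y, k = 5)
knn.fit21 <- knn.reg(train = matrix(x, ncol = 1), y = y, k = 21)
knn.fit51 <- knn.reg(train = matrix(x, ncol = 1), y = y, k = 51)

Xorder <- order(x)   # for drawing the fitted values as a line from left to right

plot(x, y, col = "grey80", pch = 19, cex = 0.5,
     main = "5 nearest neighbours")
lines(x[Xorder], knn.fit5$pred[Xorder],
      col = adjustcolor("firebrick", 0.75), lwd = 2)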

[Figure: 5 nearest neighbours, y versus x.]

plot(x, y, col = "grey80", pch = 19, cex = 0.5,
     main = "21 nearest neighbours")
lines(x[Xorder], knn.fit21$pred[Xorder],
      col = adjustcolor("firebrick", 0.75),
      lwd = 2, lty = 2)


[Figure: 21 nearest neighbours, y versus x.]

plot(x, y, col = "grey80", pch = 19, cex = 0.5,
     main = "51 nearest neighbours")
lines(x[Xorder], knn.fit51$pred[Xorder],
      col = adjustcolor("firebrick", 0.75),
      lwd = 2, lty = 5)


[Figure: 51 nearest neighbours, y versus x.]

As was the case with piecewise neighbourhoods, as the size of these local neighbourhoods increases, the smoother becomes the fitted function.

Just as we did for the case of piecewise defined neighbourhoods, we might also replace the averages with any fitted model based on the k nearest neighbours of any location x. To that end, here is a little function that will allow us to experiment a bit.

library(FNN)
library(robust)

# This function allows us to see how
# the local fits behave as we change
# various elements
#
KNNfit
                      "hampel", "bisquare", "lms", "lts"),
       newplot = TRUE,        # create a new plot or
                              # add to an existing one
       showPoints = FALSE,    # highlight points used?
       pcol = "red",          # highlight colour
       showWeights = FALSE,   # the weights used to
                              # define the nbhd for
                              # the fit
       wcol = "pink",         # weight colour
       showLine = TRUE,       # show the fitted
                              # line at xloc
       fullLine = TRUE,       # full or partial line
       lcol = "steelblue",    # line colour
       cex = 0.5,             # point size
       pch = 19,              # point character
       col = "grey80",        # point colour
       lwd = 2,               # line width
       ...                    # other plot parameters
       )
{
  if (newplot) {
    plot(x = x, y = y, cex = cex, pch = pch, col = col, ...)
  }
  data

  # We can select these points via weights
  if (showWeights) {
    N
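The core computation inside such a function can be sketched independently (localLineFit is a hypothetical helper, not the original KNNfit): select the k nearest xi to xloc, fit the requested line to just those points, and return its value at xloc.

# Sketch of the core idea: a local straight-line fit at xloc
# using only the k nearest neighbours in x.
localLineFit <- function(x, y, xloc, k = 30, method = c("lm", "lms")) {
  method <- match.arg(method)
  nbhd_i <- order(abs(x - xloc))[1:k]             # indices of the k nearest neighbours
  data   <- data.frame(x = x[nbhd_i], y = y[nbhd_i])
  fit    <- switch(method,
                   lm  = lm(y ~ x, data = data),
                   lms = MASS::lqs(y ~ x, data = data, method = "lms"))
  list(fit  = fit,
       pred = predict(fit, newdata = data.frame(x = xloc)))   # fitted value at xloc
}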

KNNfit(x, y, xloc = xloc, k = k, method = "lm",
       fullLine = TRUE, showPoints = TRUE,
       main = paste("least-squares", "fit at x =", xloc, "with k =", k))

KNNfit(x, y, xloc = xloc, k = k, method = "lms",
       fullLine = TRUE, showPoints = TRUE,
       main = paste("LMS", "fit at x =", xloc, "with k =", k))

[Figure: least-squares fit at x = 0.25 with k = 5, and LMS fit at x = 0.25 with k = 5, y versus x.]

par(parOptions)

Note that the fit might produce a good location, even though the slope of the line used to get that location is somewhat surprising. That is because it is responding to very local structure (it is based only on the red points in the plot).

So what happens if we increase the size of the neighbourhood? How about k = 30?

xloc

[Figure: least-squares fit at x = 0.25 with k = 30, and LTS fit at x = 0.25 with k = 30, y versus x.]

par(parOptions)

Although the lines are still different, there is much better agreement in this case. When the neighbourhood is sized right, so that a few points do not influence the outcome, these fitting methods should be in near agreement (especially as regards the fitted line at the single location xloc). If it is too small, a few points could dominate a least-squares fit. If it is too large, the underlying function may have changed enough that a simple straight line model might be too simple.

We might compare the fits over a wide range of x locations. We'll do that now, but using only a short line segment to represent the fitted line at each location. The horizontal range of each line segment covers the neighbourhood on which the line was constructed.

First for least-squares and k = 30:

k

[Figure: lm fit with k = 30, local line segments over y versus x.]

    And now for lts and k = 30:


[Figure: lts fit with k = 30, local line segments over y versus x.]

We can see that both sets of fits find the middle peak of the plot, but that the peak is higher for "lts" here because it ignores the rightmost points in that neighbourhood as outliers.

    1.3.2 Local weighting

Another way to think about this local fitting is to imagine that all points in the data set are being used, except that those outside of the local neighbourhood have a zero weight (in the sense of weighted least squares).

For example, our least-squares fit only on the k nearest neighbours looks like a least-squares fit on all of the data but with weights that are 1 for points in the neighbourhood and zero for points outside the neighbourhood. The two lines could be seen as follows:

#
# Our data here are simply x and y
#
ourdata
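The setup is only sketched here; an assumption consistent with the names alpha_hat and beta_hat used in the plotting code below is that they come from the straight-line fit to all of the data:

ourdata <- data.frame(x = x, y = y)

# least-squares line using all of the data (weight 1 for every point)
fit.all   <- lm(y ~ x, data = ourdata)
alpha_hat <- coef(fit.all)[1]   # intercept used by abline() below
beta_hat  <- coef(fit.all)[2]   # slope used by abline() below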

plot(x, y, col = "grey80", pch = 19, cex = 0.5,
     main = "Locally weighted least-squares")

abline(a = alpha_hat, b = beta_hat,
       col = adjustcolor("steelblue", 0.75),
       lty = 2, lwd = 3)

KNNfit(x, y, xloc = xloc, k = 30, method = "lm",
       fullLine = TRUE, showPoints = FALSE,
       showWeights = TRUE, newplot = FALSE,
       lcol = adjustcolor("steelblue", 0.75),
       lwd = 3)

points(x, y, col = "grey80", pch = 19, cex = 0.5)

legend("topleft",
       legend = c("Weight 1 for all data", "Non-zero red weights"),
       lty = c(2, 1), lwd = 3,
       col = c(adjustcolor("steelblue", 0.75),
               adjustcolor("steelblue", 0.75)))


[Figure: Locally weighted least-squares, y versus x, with "Weight 1 for all data" and "Non-zero red weights" lines.]

When we plot the weights like this a few things become obvious. µ̂(x) is seen to be the value that minimizes a weighted sum of squares

$$\sum_{i=1}^{N} w(x, x_i)\, r_i^2 = \sum_{i=1}^{N} w(x, x_i)\,(y_i - \mu(x_i))^2$$

where

$$w(x, x_i) = \begin{cases} 1 & x_i \in \text{Nbhd}(x) \\ 0 & \text{otherwise.} \end{cases}$$

And second, that the choice of a weight function is a bit peculiar, namely the indicator function $I_{\text{Nbhd}(x)}(x_i)$.

Note also that the neighbourhood of the indicator function is defined by the k nearest neighbours of x, which has some obvious consequences. For example, for some x the neighbourhood could be very unbalanced, inasmuch as most (even all) neighbours could be on one side of x and fewer (even none) could be on the other. If x happens to be located in a place where there are few data points, the neighbourhood could stretch some distance to find its k near neighbours. Clearly this would stabilize µ̂(x) there, but it would be at the expense of relying on points possibly quite far away.

Another approach we might try would be to base our neighbourhood on the Euclidean distance from the location of interest, x. For example, we could choose our neighbourhood to be all points xi within a distance h, say, of x, that is

$$\text{Nbhd}(x, h) = \{x_i : |x - x_i| \le h\}.$$


Just as with k nearest neighbours, then, we might give a weight of one to all points xi within this neighbourhood and a weight of zero to points outside the neighbourhood.

One problem with this neighbourhood is that depending on its size it might include many points or few points. We'll return to this later.

Alternatively, we might not worry so much about the neighbourhood but rather simply choose the weights more judiciously. Since we are trying to fit locally, we could choose higher weights for the closer points and lower weights for those farther away.

For example, we might consider weights that are proportional to a function, say K(t), having the following properties:

$$\int K(t)\, dt = 1, \qquad \int t\, K(t)\, dt = 0, \qquad \text{and} \qquad \int t^2\, K(t)\, dt < \infty.$$

The first two of these standardize the K(t); the last makes sure that there is some spread in the weight along the real line but also that there is not too much weight in the extremes. The function K(t) maps K : ℜ → ℜ and is called a kernel function. (Aside: this is not to be confused with the "kernel" functions you may have seen in other courses, e.g. of reproducing kernel Hilbert space methods, which map K : ℜ × ℜ → ℜ.)

To help get some intuition on these, imagine that we also have K(t) ≥ 0 for all t. Then K(t) could be a density function (integrating to 1), with mean 0, and finite variance. Note that it need not be symmetric, though that is all that we will consider.

A gaussian kernel would be one such example, defined as

$$K(t) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{t^2}{2}\right).$$

Some other examples include:

1. An Epanechnikov kernel

$$K(t) = \begin{cases} \frac{3}{4}\left(1 - t^2\right) & \text{for } |t| < 1 \\ 0 & \text{otherwise.} \end{cases}$$

2. Tukey's tri-cube weight

$$K(t) = \begin{cases} \left(1 - |t|^3\right)^3 & \text{for } |t| \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Since we are interested in applying these functions locally, the kernel function K(·) is applied not to each xi, but rather to the difference xi − x. Similarly, a means of controlling how quickly the weights diminish for any kernel is to introduce a scale parameter, say h > 0. Our weight function would be calculated as

$$w(x, x_i) = \frac{K\!\left(\frac{x_i - x}{h}\right)}{\sum_{j=1}^{N} K\!\left(\frac{x_j - x}{h}\right)}$$

1.3.2.1 Illustration

To illustrate how this might work, we can construct a naive locally weighted sum of squares estimator as follows.

# Construct a weight function shaped like
# a Gaussian or Normal density.
#
GaussWeight

  # Normal density
  dnorm(x, mean = xloc, sd = h)
}

Now we choose a point in the x range, say xloc = 0.5, and fit a straight line using all of the data but using the above weight function to determine the weight that will be given to each point in the estimation.

# location at which we are estimating.
xloc
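Completing the fragments above as a sketch (the argument names follow the dnorm() call shown, but the value of h and the weighted fit itself are assumptions):

GaussWeight <- function(x, xloc, h) {
  # Normal density centred at xloc with standard deviation h
  dnorm(x, mean = xloc, sd = h)
}

# location at which we are estimating, and an assumed scale
xloc <- 0.5
h    <- 0.2

wts <- GaussWeight(x, xloc = xloc, h = h)

fit.w <- lm(y ~ x, weights = wts)   # weighted least-squares using all of the data
coef(fit.w)                          # compare with the unweighted least-squares coefficients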

for (i in 1:length(wts)) {
  lines(x = rep(x[i], 2),
        y = c(ybottom, ybottom + wts[i] * yheight),
        col = "pink", lty = 1)
}

[Figure: Gaussian weights, y versus x, with the weights drawn along the bottom of the plot.]

Which doesn't look that different from the least-squares lines. For the original least-squares fit we had α̂ = -1.3343939 and β̂ = 2.4593346. With our Gaussian weights, the weighted least-squares estimates were α̂ = -1.4725965 and β̂ = 2.7143856, which is different, but not that different.

This is to be expected when you look at the weights for each point shown across the bottom of the plot. These are not that different from one another and no point has very small weight.

    1.3.3 Scaling also determines locality

To be more responsive to the local structure, we need only consider changing the size of the scale parameter h that determines the standard deviation used in the Gaussian weight function.

To see the effect of changing this, and to illustrate a few other points, we'll first wrap all of the above local fitting and drawing in a single demo function:

# This demo function allows us to see how
# the local fits behave as we change various elements
#
demoLoWeSS

points(xloc, pred, pch = 19, col = pcol)

if (showWeights) {
  weights

[Figure: locally weighted line fits for decreasing spans, span = 0.95, 0.82, 0.68, 0.55, 0.41, 0.28, 0.14, and 0.01 (y versus x).]

par(parOptions)

    1.3.4 Putting the pieces together

We can now easily imagine fitting lines locally at a whole series of x values.

#
#
demoPieces

xrange

[Figure: local line fits at a series of x locations, shown one location at a time (y versus x).]

par(parOptions)

# Here's the plot again
plot(x, y,
     col = "grey80", pch = 19, cex = 0.5,
     main = "Local linear smooth: span=0.8")
#
lines(dotLocs, pch = 19, col = "steelblue", type = "b", lwd = 2)


[Figure: Local linear smooth: span=0.8, y versus x with the connected local fits.]

## Decrease the span
parOptions

[Figure: local line fits at a series of x locations with the smaller span, shown one location at a time (y versus x).]

par(parOptions)

[Figure: further local line fits at successive x locations (y versus x).]

# Connect the dots
plot(x, y,
     col = "grey80", pch = 19, cex = 0.5,
     main = "Local linear smooth: span=0.4")
#
lines(dotLocs, pch = 19, col = "steelblue", type = "b", lwd = 2)


[Figure: Local linear smooth: span=0.4, y versus x with the connected local fits.]

parOptions

[Figure: local line fits at many x locations with a small span, shown a few locations at a time (y versus x).]

par(parOptions)

# Connect the dots
plot(x, y,
     col = "grey80", pch = 19, cex = 0.5,
     main = "Local linear smooth: span=0.1")
lines(dotLocs, pch = 19, col = "steelblue", type = "b", lwd = 2)

[Figure: Local linear smooth: span=0.1 — the smooth overlaid on the scatterplot of y versus x]

## And finally, let's look at the small bandwidth
## with many more points.
## Get the locations BUT do not plot lines yet.
dotLocs

[Figure: 100 locations: span = 0.1 — the local linear smooth evaluated at 100 equi-spaced x locations]

    So the thing to do is make a function that just does this.

1.3.4.1 Putting it all together

Here is one such function that will produce a smooth curve by minimizing the Locally Weighted Sum of Squares, or LoWeSS.

## Given a weight function the corresponding
## weighted least squares estimate at any point(s) x
## is easily constructed.
##
## It requires:
##   x, y     - the data
##   xloc     - x locations at which the estimate is to be calculated
##   span     - a bandwidth
##   weightFn - a weighting function, default will be GaussWeight
##   nlocs    - number of equi-spaced locations at which the estimate
##              will be calculated if xloc = NULL, ignored otherwise.
##
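The body of the function did not survive extraction; only its name, LoWeSS, and the argument list described in the comments above did. A minimal sketch consistent with that description (the Gaussian weight and the details of the local fit are assumptions, not the original code) is:

## Sketch only: an illustrative reconstruction of LoWeSS
GaussWeight <- function(t) dnorm(t)

LoWeSS <- function(x, y, xloc = NULL, span = 0.5,
                   weightFn = GaussWeight, nlocs = 100) {
  if (is.null(xloc)) {
    xloc <- seq(min(x), max(x), length.out = nlocs)
  }
  h <- span * diff(range(x))                # turn the span into a scale h
  yhat <- sapply(xloc, function(x0) {
    w   <- weightFn((x - x0) / h)           # weights fall off with distance from x0
    fit <- lm(y ~ x, weights = w)           # local weighted least-squares line
    predict(fit, newdata = data.frame(x = x0))
  })
  list(x = xloc, y = yhat)
}

## e.g.
## dotLocs <- LoWeSS(x, y, span = 0.1)
## lines(dotLocs, col = "steelblue", lwd = 2)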


[Figure: Our data — scatterplot of y versus x]

Now ourdata was actually generated as

$$Y_i = \mu(x_i) + R_i$$

with $R_i \sim N(0, 0.2)$. So we are in the completely artificial position of being able to compare the fitted smooth with the true µ(xi) that was used to produce these data. Here's how the data were constructed.
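Only the leading comments and the seed of that construction survived extraction; the rest of this block is a sketch, with the true mean function µ(x), the sample size N, and the reading of N(0, 0.2) all labelled as assumptions.

# The fake data
#
# Get some x's
#
set.seed(12032345)
N <- 300                                # assumed sample size
x <- runif(N, 0, 1)                     # assumed uniform x's on [0, 1]
# An assumed true mean function; the original mu(x) was not recoverable
mu <- function(x) sin(2 * pi * x)
y <- mu(x) + rnorm(N, sd = sqrt(0.2))   # R_i ~ N(0, 0.2), reading 0.2 as the variance
ourdata <- data.frame(x = x, y = y)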

[Figure: Our LoWeSS smoother — the fitted smooth overlaid on the scatterplot of y versus x]

What we have constructed here is a naive locally weighted sum of squares estimator. Clearly, there are some difficulties that require attention, such as:

• choosing the bandwidth. What value? Also, ours only looked at x locations; another approach might choose a proportion of the nearest x values.

• what weight function? Perhaps a more robust choice that actually gives zero weight to points that are far away.

• what about the ends? It seems that there you can only estimate using data from one side. Should that affect the choice of bandwidth there? Or the weight function?

    1.3.5 Using both knn and local weights

The naive locally weighted sum of squares above did not define neighbourhoods, but rather used the scale parameter h to determine how weights would diminish. Again, this means that where the data are densest in x, more points will appear in the estimation than where they are sparser.

So we now return to asserting that there will be the same specified number of points, k, in a neighbourhood. Only points in that neighbourhood may have non-zero weights – all points outside of the neighbourhood will have zero weight. We will also use some kernel function to downweight points within the neighbourhood. The kernel will again be evaluated at (xi − x)/h, but now we will choose h to be a function of the maximum distance |xi − x| over all points in the neighbourhood.
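A small sketch of this scheme (the helper names are hypothetical, and the kernel used is the tricube introduced in the next subsection):

## Sketch: weights for a k-nearest-neighbour, kernel-weighted local fit at x0.
tricube <- function(t) ifelse(abs(t) <= 1, (1 - abs(t)^3)^3, 0)

knnWeights <- function(x, x0, k, kernel = tricube) {
  d    <- abs(x - x0)             # distances from the focal point x0
  nbhd <- order(d)[1:k]           # indices of the k nearest x's
  h    <- max(d[nbhd])            # h = largest distance within the neighbourhood
  w    <- numeric(length(x))      # points outside the neighbourhood get weight zero
  w[nbhd] <- kernel(d[nbhd] / h)
  w
}

## e.g. fit a weighted line at x0 = 0.5 using the 30 nearest points:
## w <- knnWeights(ourdata$x, 0.5, k = 30)
## lm(y ~ x, data = ourdata, weights = w)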

1.3.5.1 loess: locally weighted sums of squares

In R there is a function called loess that fits a LOcally wEighted Sum of Squares estimate and pays a little more attention to some of these problems.
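The example call shown in the original notes was lost to extraction; a typical call on these data, with an assumed span and degree, would be:

# The span and degree here are illustrative choices, not the original values.
fit <- loess(y ~ x, data = ourdata, span = 0.4, degree = 1)

# Evaluate the smooth on a fine grid and overlay it on the data
xnew <- seq(min(ourdata$x), max(ourdata$x), length.out = 200)
plot(ourdata$x, ourdata$y, col = "grey80", pch = 19, cex = 0.5,
     xlab = "x", ylab = "y")
lines(xnew, predict(fit, newdata = data.frame(x = xnew)),
      col = "steelblue", lwd = 2)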

Given the parameter a < 1, the default weighting is given by Tukey's tricube weight

$$K(t) = \begin{cases} (1 - |t|^3)^3 & \text{for } |t| \le 1 \\ 0 & \text{elsewhere.} \end{cases}$$

For the ith point in the neighbourhood, we take

$$t_i = \frac{|x_i - x|}{\max_{j \in \mathrm{Nbhd}(x)} |x_j - x|}.$$

If a > 1, the maximum distance in the above denominator is taken to be $a^{1/p}$ times the maximum distance to x of all the xi s. Note that here p is the number of explanatory variates in case there is more than 1 (in this case, |xi − x| is everywhere replaced by the Euclidean distance ||xi − x||).

loess is not restricted to fitting local lines; it can fit any degree polynomial locally, though typically only degree 1 or 2 is used in practice. Its default is 2.

The fitting mechanism is given by the parameter family and can be either gaussian (the default), which will use least-squares to fit the local polynomial, or symmetric, which will begin with least squares and then perform a few iterations of an M-estimation using Tukey's bisquare weight function.
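To make the degree and family choices concrete, here are two hedged example calls (the span value is an arbitrary illustration):

# Local quadratics (the default degree) fitted by least squares
fit_gauss <- loess(y ~ x, data = ourdata, span = 0.4, degree = 2,
                   family = "gaussian")

# Local lines with a few robustifying M-estimation iterations
fit_robust <- loess(y ~ x, data = ourdata, span = 0.4, degree = 1,
                    family = "symmetric")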

    1.4 Multidimensional x

As was suggested in the above discussion of loess, all of these local fitting methods are easily extended to cases with more than one explanatory variate simply by using distances ||xi − x|| in place of |xi − x| wherever the latter appears. For example, the kernel weighting functions all become

$$K\left(\frac{||x_i - x||}{h}\right)$$

for some span parameter h.

    2 Linear smoothers

A smoother is sometimes called a linear smoother if its vector of fitted values µ̂ at the xis can be written as

$$\hat{\mu} = S y$$

for some N × N matrix S whose elements depend only on the values of x1, . . . , xN. The smoothers are called linear because they are linear in y.

Nearly all of the smoothers discussed above are linear smoothers. The only exceptions are those where the fitting (locally or globally) has been done using weights that depend on the values of y. This would be the case, for example, if any of the iteratively reweighted M-estimates or the high breakdown estimates were used in the fitting – the resulting smooths would no longer be strictly linear in y. There are other examples as well, but we will not be considering them here.

We have really only considered three different classes of smoother. First are the splines based on a fixed set of K knots (or equivalently a specified number of effective degrees of freedom). Of these we considered two possibilities (p-degree splines and natural splines). For p-degree splines we had

$$\hat{\mu} = S_b y = B\left(B^T B\right)^{-1} B^T y$$

and similarly for natural splines, we had

$$\hat{\mu} = S_n y = N\left(N^T N\right)^{-1} N^T y.$$

Both of these were called "projection smoothers" since they orthogonally project y onto the space spanned by the columns of B for p-degree splines, or onto the space spanned by the columns of N for natural splines.

The second class was that of the "smoothing splines". For these we had

$$\hat{\mu} = S_\lambda y$$

and showed that

$$S_\lambda = N\left(N^T N + \lambda \Omega_N\right)^{-1} N^T = \left(I_N + \lambda K\right)^{-1}.$$

The third class is the set of local polynomial regression smoothers. We include kernel smoothers in this class (where the local polynomial is simply a constant). In this case, at each point x we fit a polynomial of some degree (typically 0, 1, or 2) by weighted least-squares, with weights depending on the distance of the observed points xi from the point of interest x. When we look at the fitted values at the observed xi we have

$$\hat{\mu}_i = \hat{\mu}(x_i) = x_i^T \hat{\beta}_i$$

where $\hat{\beta}_i = \left(X^T W_i X\right)^{-1} X^T W_i y$ is the weighted least-squares estimate of the coefficient vector, with a diagonal matrix Wi of weights that are peculiar to each observation i. This means that each element µ̂i of µ̂ can be written as

$$\hat{\mu}_i = s_i^T y$$

where

$$s_i^T = x_i^T \left(X^T W_i X\right)^{-1} X^T W_i$$

is a 1 × N vector dependent only on the values of the xi. This in turn means we can write any local polynomial regression smoother's estimate of the fitted values as

$$\hat{\mu} = S_w y$$

for

$$S_w = \begin{bmatrix} s_1^T \\ s_2^T \\ \vdots \\ s_N^T \end{bmatrix}.$$

Note that this matrix, unlike the others, need not be symmetric.

All of these smoothers are therefore linear smoothers in that they share this common structure:

$$\hat{\mu} = S y$$

for variously defined S, each being defined independently of the values of y.
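A short sketch that instantiates this structure for the kernel-weighted local line of earlier sections (the Gaussian weight function and the value of h are assumptions):

## Sketch: the rows s_i^T of the smoother matrix for a naive
## Gaussian-kernel local linear fit.
localLinearSmootherMatrix <- function(x, h) {
  X <- cbind(1, x)                               # local linear "design" matrix
  t(sapply(seq_along(x), function(i) {
    w   <- dnorm((x - x[i]) / h)                 # diagonal of W_i
    XtW <- t(X * w)                              # X^T W_i
    as.vector(X[i, ] %*% solve(XtW %*% X, XtW))  # x_i^T (X^T W_i X)^{-1} X^T W_i
  }))
}

## The fitted values are then just a linear map of y:
## S      <- localLinearSmootherMatrix(ourdata$x, h = 0.1)
## mu_hat <- S %*% ourdata$y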

    2.1 Complementary viewpoints

Since all linear smoothers have the same form, we should be able to look at them in much the same way – no matter how they were motivated or derived.

For example, local regression smoothers like loess were built on kernel functions which gave higher weight to observations nearest the point x of interest. We might expect, therefore, that in computing the fit at any given x the coefficients multiplying the ys would be higher for a yi whose corresponding xi was closer to x than for one which was farther away. To check this, we might have a look at the coefficients of each yi as a function of xi.

First we need to have the smoother matrix that corresponds to a loess fit. As we did with the smoothing splines, we could compute this for any particular set of x values as follows.
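The construction used in the original notes was lost to extraction. One standard way to recover the smoother matrix of a loess fit is to smooth each unit response vector in turn, as sketched here (the span is an assumed value, and S_l matches the name used in the comparisons below):

## Sketch: build the smoother matrix of a loess fit column by column.
smootherMatrixLoess <- function(x, span = 0.75, degree = 2) {
  N <- length(x)
  S <- matrix(0, N, N)
  for (j in seq_len(N)) {
    e <- numeric(N); e[j] <- 1        # j-th unit "response" vector
    S[, j] <- predict(loess(e ~ x, span = span, degree = degree))
  }
  S
}

S_l <- smootherMatrixLoess(ourdata$x)

## Pick two rows at random and plot their coefficients against x
set.seed(12231299)
row <- sample(seq_along(ourdata$x), 2)
o   <- order(ourdata$x)
plot(ourdata$x[o], S_l[row[1], o], type = "l",
     xlab = "x", ylab = "Coefficient of y",
     main = "Two rows of smoother matrix",
     lwd = 3, col = adjustcolor("steelblue", 0.5))
abline(v = ourdata$x[row[1]], col = "steelblue")
# And another row
lines(ourdata$x[o], S_l[row[2], o],
      lwd = 3, col = adjustcolor("steelblue", 0.5))
abline(v = ourdata$x[row[2]], col = "steelblue")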


[Figure: Two rows of smoother matrix — coefficient of y plotted against x]

As was the case with the local regression, the coefficient of yi is higher the closer its xi value is to the value of x.

Similarly, when we determined the various spline smoothers we considered them as linear combinations of various basis functions. In particular, we looked at some of the orthogonal basis functions that went into determining any smoother.

We can do the same for any linear smoother. While the spline based smoother matrices S were symmetric, this need not be the case in general, and it is not the case for the local regression estimators. So, rather than working with the eigen decomposition of the smoother matrix, we work with its singular value decomposition. That is, we decompose any smoother matrix S as

$$S = U D_\rho V^T$$

for N × N matrices U = [U1, . . . , UN], V = [V1, . . . , VN], and $D_\rho = \mathrm{diag}(\rho_1, \ldots, \rho_N)$ with $\rho_1 \ge \rho_2 \ge \cdots \ge \rho_N \ge 0$ and

$$U^T U = I_N = V^T V.$$

The smooth can now be written as

$$\hat{\mu} = U D_\rho V^T y = \sum_{i=1}^N U_i \, \rho_i \, \langle V_i, y \rangle$$

which separates the smooth into the basis vectors Ui, the singular values ρi, and the orthogonal components of y along the direction vectors Vi.

If we consider the smoothing spline, as we did before, we can plot these various components as follows.
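The plotting code was lost to extraction; a sketch, assuming the smoothing-spline smoother matrix computed earlier is stored as S_s (the name used in the comparisons further below), is:

## Sketch: svd of the smoothing-spline smoother matrix S_s (assumed name)
svd_s <- svd(S_s)

## singular values and the components of y along the right singular vectors
plot(svd_s$d, type = "b", pch = 19, cex = 0.5,
     col = adjustcolor("firebrick", 0.5),
     main = "singular values", xlab = "index", ylab = "rho")
plot(t(svd_s$v) %*% ourdata$y, type = "b", pch = 19, cex = 0.5,
     col = adjustcolor("firebrick", 0.5),
     main = "y components", xlab = "index", ylab = "y-component")

## the first twelve orthogonal basis functions (columns of U) against x
o <- order(ourdata$x)
for (i in 1:12) {
  plot(ourdata$x[o], svd_s$u[o, i], type = "l",
       main = paste("basis", i), xlab = "x", ylab = "u value")
}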

  • 0 50 100 150 200 250 300

    −2

    02

    46

    810

    y components

    index

    y−co

    mpo

    nent

    parOptions

[Figure: basis 1 to basis 12 — the first twelve orthogonal basis functions (u value versus x) of the smoothing-spline smoother]

    par(parOptions)

As can be seen, the singular values and the y components die off quickly. These are the multipliers of the basis functions. The orthogonal basis functions increase in complexity as i increases; these higher frequency basis functions are largely obliterated by the small singular values and y components.

Now, since loess is also a linear smoother, we can do the same for loess on these data.
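Assuming the loess smoother matrix computed earlier is stored as S_l (an assumed name, matching the comparisons below), the decomposition is:

svd_l <- svd(S_l)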

plot(svd_l$d,
     type = "b", main = "singular values",
     col = adjustcolor("steelblue", 0.5),
     cex = 0.5, pch = 19, xlab = "index", ylab = "rho")

plot(t(svd_l$v) %*% ourdata$y,
     col = adjustcolor("steelblue", 0.5),
     cex = 0.5, pch = 19, type = "b", main = "y components",
     xlab = "index", ylab = "y-component")

[Figure: singular values and y components of the loess smoother matrix, plotted against index]

    par(parOptions)

[Figure: basis 1 to basis 12 — the first twelve orthogonal basis functions (u value versus x) of the loess smoother]

    par(parOptions)

The local weighted least squares estimate shows much the same pattern as the smoothing spline. It too has a set of orthogonal basis functions for which the singular values and the y components die out quickly. Again, the orthogonal basis functions increase in complexity as i increases, and the higher frequency basis functions are largely obliterated by the small singular values and y components.

To make a few direct comparisons, we could overplot some of these as follows.

i <- 1
plot(ourdata$x, S_s[row[i], ],
     type = "l", xlim = extendrange(ourdata$x),
     ylim = extendrange(c(range(S_l[row, ]), range(S_s[row, ]))),
     xlab = "x", ylab = "Coefficient of y",
     main = "Two rows of smoother matrix",
     lwd = 3, col = adjustcolor("firebrick", 0.5))
abline(v = ourdata$x[row[i]], col = "grey80")

# And another row
i <- 2
lines(ourdata$x, S_s[row[i], ],
      lwd = 3, col = adjustcolor("firebrick", 0.5))
abline(v = ourdata$x[row[i]], col = "grey80")

[Figure: Two rows of smoother matrix — coefficient of y versus x for the spline and loess smoothers]

plot(svd_l$d,
     type = "b", main = "singular values",
     col = adjustcolor("steelblue", 0.5),
     cex = 0.5, pch = 19, xlab = "index", ylab = "rho")
points(svd_s$d, type = "b",
       col = adjustcolor("firebrick", 0.5),
       cex = 0.5, pch = 19)

[Figure: singular values of the loess (steelblue) and smoothing spline (firebrick) smoother matrices, plotted against index]

plot(t(svd_s$v) %*% ourdata$y,
     col = adjustcolor("firebrick", 0.5),
     cex = 0.5, pch = 19, ylim = c(-5, 15), type = "b",
     main = "y components", xlab = "index", ylab = "y-component")
points(t(svd_l$v) %*% ourdata$y,
       col = adjustcolor("steelblue", 0.5),
       cex = 0.5, pch = 19, type = "b")

[Figure: y components for the smoothing spline (firebrick) and loess (steelblue) smoothers, plotted against index]

    3 More than one explanatory variate

Suppose we have more than one explanatory variate. For example, consider the ozone data set from the package ElemStatLearn:

library(ElemStatLearn)
head(ozone)

##   ozone radiation temperature wind
## 1    41       190           67  7.4
## 2    36       118           72  8.0
## 3    12       149           74 12.6
## 4    18       313           62 11.5
## 5    23       299           65  8.6
## 6    19        99           59 13.8

Here we have 111 daily measurements taken from May until September 1973 in New York on four variates:

• ozone, the ozone concentration in parts per billion (ppb),
• radiation, the solar radiation energy measured in langleys (https://en.wikipedia.org/wiki/Langley_(unit)),

• temperature, the maximum temperature that day in degrees Fahrenheit (https://en.wikipedia.org/wiki/Fahrenheit), and
• wind, the wind speed in miles per hour.

Interest lies in modelling how ozone depends on the other variates.

pairs(ozone, pch = 19, col = adjustcolor("firebrick", 0.4))

[Figure: pairs plot (scatterplot matrix) of ozone, radiation, temperature, and wind]

To begin, let's just try modelling ozone levels as a function of only two variates. That will give three variates in total and allow us to see what's going on using some three dimensional graphics, namely scatter3d from the car package.

For example, we could use a linear model for the mean ozone level

$$\mu(x_i, z_i) = \alpha + \beta(x_i - \bar{x}) + \gamma(z_i - \bar{z})$$

where xi is the radiation and zi is the temperature. In this case, we are fitting a plane to a three-dimensional point cloud.


Execute the following code and explore.

library(rgl)  # Access to all of OpenGL graphics

# Get some graphing code (very slightly) adapted from John Fox's "car"
# package to accommodate loess.
source("../../Code/scatter3d.R")  # from the course home page

scatter3d(ozone ~ radiation + temperature, data = ozone)

# Get a png snapshot of this using the rgl package's
# snapshot3d("ozoneRegressionPlane.png")

    Figure 1: Fitting a plane to ozone, radiation, and temperature

Trying some power transformations on the ozone variate, it might be better to model its cube root instead, ozone^(1/3).

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data = ozone,
          model.summary = TRUE  # print out a summary of the fitted model
          )

## $linear
##
## Call:
## lm(formula = y ~ x + z)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.23670 -0.08050 -0.00671  0.05930  0.38993
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.10578    0.02951   3.585 0.000508 ***
## x            0.14024    0.04010   3.497 0.000685 ***
## z            0.55325    0.05044  10.968  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1121 on 108 degrees of freedom
## Multiple R-squared:  0.6112, Adjusted R-squared:  0.604
## F-statistic: 84.88 on 2 and 108 DF,  p-value: < 2.2e-16

# snapshot3d("ozone3RegressionPlane.png")

# Note that the above coefficient estimates
# do not match the ones below
lmfit <- lm(ozone^(1/3) ~ radiation + temperature, data = ozone)
summary(lmfit)

## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.1530326  0.4398302  -4.895 3.46e-06 ***
## radiation    0.0021443  0.0006132   3.497 0.000685 ***
## temperature  0.0643317  0.0058653  10.968  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5603 on 108 degrees of freedom
## Multiple R-squared:  0.6112, Adjusted R-squared:  0.604
## F-statistic: 84.88 on 2 and 108 DF,  p-value: < 2.2e-16

# The difference in the coefficients is because
# scatter3d standardizes all of the variates before fitting.
# Note that the R-squared, etc. is the same.

    Figure 2: Fitting a plane to ozone, radiation, and temperature

What happens if we add an interaction term?

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data = ozone,
          fit = "interaction")
# snapshot3d("ozone3interaction.png")

The difference might best be seen when we have these two together on the same plot.

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data = ozone,
          fit = c("linear", "interaction"))
# snapshot3d("ozone3linearandinteraction.png")

    The plane (linear) fit is in blue, the one with an interaction in green.

Figure 3: Fitting a plane to ozone, radiation, and temperature

Figure 4: Fitting an interaction to ozone, radiation, and temperature

The plane fitted a surface of the form

$$\hat{\mu}(x, z) = \hat{\beta}_0 + \hat{\beta}_1(x - \bar{x}) + \hat{\beta}_2(z - \bar{z})$$

which is strictly additive in each of x and z. Moreover, each additive component, for x and for z, is itself linear (in x and in z respectively). Note that we chose to centre each of the explanatory variates. The "intercept" term β̂0 in this model is interpreted as the point on the surface (or average of the ys) where x = x̄ and z = z̄.

When we add an interaction term, we lose the additivity. The surface being fit now has the form

$$\hat{\mu}(x, z) = \hat{\beta}_0 + \hat{\beta}_1(x - \bar{x}) + \hat{\beta}_2(z - \bar{z}) + \hat{\beta}_3(x - \bar{x})(z - \bar{z}).$$

This is no longer additive in x or in z. Instead, the interaction term means that the component for x will depend on the value of z and, conversely, the component for z will depend on the value of x. That is, for any choice of value z0 of z, the component for x is still a straight line

$$\left(\hat{\beta}_0 + \hat{\beta}_2(z_0 - \bar{z})\right) + \left(\hat{\beta}_1 + \hat{\beta}_3(z_0 - \bar{z})\right)(x - \bar{x})$$

except that now both the intercept and the slope depend on the value of z0. A similar line holds for the component of z at any value x0 of x.

Suppose we add quadratic components for each variate.

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data = ozone,
          fit = c("quadratic"))
# snapshot3d("ozone3RegressionQuadratic.png")

The fitted surface has the form

$$\hat{\mu}(x, z) = \hat{\beta}_0 + \hat{\beta}_1(x - \bar{x}) + \hat{\beta}_2(z - \bar{z}) + \hat{\beta}_3(x - \bar{x})^2 + \hat{\beta}_4(z - \bar{z})^2.$$

As was the case with the simple planar surface, this surface is also additive in each of x and z. Instead of a straight line, though, each additive function is now a quadratic.

A non-additive quadratic fit would include a cross product (or interaction) term xz.

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data = ozone,
          fit = c("quadint"))
# snapshot3d("ozone3quadint.png")

The effect of the additional interaction term is best appreciated by having both surfaces appear together:

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data = ozone,
          fit = c("quadratic", "quadint"))

Again, blue is the additive quadratic, green the quadratic with interaction. The effect of the interaction term on the quadratic additive fit is very much like that of the same term on the linear additive fit.
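For reference, the same centred polynomial surfaces can be written as ordinary lm fits; the variable names below are illustrative, not from the original notes:

# Centred explanatory variates (names are illustrative)
xc <- with(ozone, radiation - mean(radiation))
zc <- with(ozone, temperature - mean(temperature))

# Additive quadratic surface
quadAdd <- lm(ozone^(1/3) ~ xc + zc + I(xc^2) + I(zc^2), data = ozone)

# Quadratic surface with the cross-product (interaction) term added
quadInt <- update(quadAdd, . ~ . + I(xc * zc))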

    3.1 Multiple regression smooths.

Rather than use simple polynomial models, we could also use smoothers. And, in the same fashion, we can choose between additive models and non-additive, or interaction, models.

Figure 5: Fitting an additive paraboloid to ozone, radiation, and temperature

Figure 6: Fitting a paraboloid to ozone, radiation, and temperature

Two smoothing approaches which generalize to the multiple regression case are local polynomial regression using loess, and thin-plate splines using the smoothing splines with the basis functions described earlier. We can have a look at each of these in turn.

    3.1.1 loess

We begin with using local polynomial regression via loess. With default settings we have:

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data = ozone,
          fit = c("loess"))
# snapshot3d("ozone3loess.png")

    Figure 7: loess smooth - to ozone, radiation, and temperature


We can see the effect of a more flexible fit by increasing the effective degrees of freedom for the smooth.

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data = ozone,
          fit = c("loess"),
          df.loess = 20)
# snapshot3d("ozone3loessdf20.png")

    Figure 8: loess with 20 df - to ozone, radiation, and temperature

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data = ozone,
          fit = c("loess"),
          df.loess = 30)
# snapshot3d("ozone3loessdf30.png")

Figure 9: loess with 30 df - to ozone, radiation, and temperature

3.1.2 Thin-plate splines

The default smooth here is a "thin-plate spline" as described earlier.

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data = ozone,
          fit = c("smooth"))
# snapshot3d("ozone3smooth.png")

    Figure 10: thin plate smooth - to ozone, radiation, and temperature

We can compare this flexible fit with loess by trying to match the effective degrees of freedom for each smooth.

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data = ozone,
          fit = c("smooth"),
          df.smooth = 20)
# snapshot3d("ozone3smoothdf20.png")

    Figure 11: thin plate smooth with 20 df - to ozone, radiation, and temperature

We can compare this directly with loess by placing them on the same plot.

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data = ozone,
          fit = c("loess", "smooth"),
          df.loess = 20, df.smooth = 20)

Here loess is blue, thin-plate green. The major differences appear to be at the corners where there are no points – the differences reflect different strategies for extrapolation.

Again, for comparison, we look at the thin-plate smooth for 30 effective degrees of freedom.

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data = ozone,
          fit = c("smooth"),
          df.smooth = 30)
# snapshot3d("ozone3smoothdf30.png")

    Figure 12: thin plate smooth with 30 df - to ozone, radiation, and temperature

And compare the thin-plate spline (green) with the analogous loess (blue):

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data = ozone,
          fit = c("loess", "smooth"),
          df.loess = 30, df.smooth = 30)
# snapshot3d("ozone3bothdf30.png")

  • Figure 13: loess and thin plate smooths each with 30 df - to ozone, radiation, and temperature


Again the major differences are in locations with no points in the dataset.

    3.2 Additive models

One of the challenges of multiple regression, particularly for thin-plate splines and local regression methods, is that as dimensionality increases so must the local neighbourhood, in order to get enough points to construct the fit.

One way around this is to use additive models. Since each additive component is a function only of a single variate, the neighbourhoods needed for estimation need not be as large. The model surface fitted now has the form

$$\hat{\mu}(x, z) = \hat{\beta}_0 + \hat{\beta}_1 f_1(x - \bar{x}) + \hat{\beta}_2 f_2(z - \bar{z})$$

where f1(·) and f2(·) are splines of some sort (typically either regression or smoothing splines). The default in our demonstration software scatter3d uses cubic regression splines. These behave much like additive terms in the usual linear model, except that now the functional form of each component is very flexible.

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data = ozone,
          fit = c("additive"))
# snapshot3d("ozone3additive.png")

As with adding an interaction term, we might compare this with, for example, a thin plate spline:

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data = ozone,
          fit = c("additive", "smooth"))

What we give up in flexibility, we gain in estimability and in interpretability.

We can make this a more flexible fit by specifying the span or, to make it comparable to the other smoothers, by specifying the effective degrees of freedom. Here the df.additive argument specifies the effective (or target) degrees of freedom for each component. This means that the total degrees of freedom for the smooth part (aside from the intercept) will be the sum of the degrees of freedom for each smooth component.

scatter3d(ozone^(1/3) ~ radiation + temperature,
          data = ozone,
          fit = c("additive"), df.additive = 10)
# snapshot3d("ozone3additive10.png")

The corresponding thin-plate spline, for example, would have effective degrees of freedom of 20.

scatter3d(ozone^(1/3) ~ radia