generalized linear models-1

Upload: ann-baisa

Post on 05-Apr-2018


  • 7/31/2019 Generalized Linear Models-1

    1/29

    Generalized Linear Models


    Generalized linear models: Exponential family

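    In generalized linear models the response is assumed to follow a probability distribution from the exponential family, with density

    $$f(y \mid \theta, \phi) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\}$$

    where θ is tied to the link function, b(θ) determines the moments, and a(φ) is the dispersion.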


    Generalized linear models: Exponential family

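    The Normal distribution has the form:

    $$f(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}} = e^{-\frac{y^2 - 2y\mu + \mu^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2)} = e^{\frac{y\mu - \mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2)} = e^{\frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)}$$

    $$\theta = \mu, \quad b(\theta) = \frac{\mu^2}{2}, \quad a(\phi) = \sigma^2, \quad c(y,\phi) = -\left(\frac{y^2}{2\sigma^2} + \frac{1}{2}\ln(2\pi\sigma^2)\right)$$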


    Generalized linear models: Exponential family

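    The Poisson distribution has the form:

    $$f(y \mid \mu) = \frac{e^{-\mu}\mu^{y}}{y!} = e^{y\ln(\mu) - \mu - \ln(y!)} = e^{\frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)}$$

    $$\theta = \ln(\mu), \quad b(\theta) = \mu, \quad a(\phi) = 1, \quad c(y,\phi) = -\ln(y!)$$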


    Generalized linear models: Exponential family

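    The Binomial distribution has the form:

    $$f(y \mid p) = \binom{n}{y} p^{y}(1-p)^{n-y} = e^{y\ln(p) + (n-y)\ln(1-p) + \ln\binom{n}{y}} = e^{y\ln\left(\frac{p}{1-p}\right) - \left(-n\ln(1-p)\right) + \ln\binom{n}{y}} = e^{\frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)}$$

    $$\theta = \log\left(\frac{p}{1-p}\right), \quad b(\theta) = -n\ln(1-p), \quad a(\phi) = 1, \quad c(y,\phi) = \ln\binom{n}{y}$$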
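    As a quick numerical check of the three factorisations above, the sketch below evaluates each density directly and again through the exponential-family form, using the θ, b(θ), a(φ) and c(y, φ) listed on the slides. The helper names and the test values are illustrative assumptions, not part of the original deck.

```python
import math

def expfam_density(y, theta, b, a_phi, c):
    """Evaluate exp{(y*theta - b(theta)) / a(phi) + c(y, phi)}."""
    return math.exp((y * theta - b(theta)) / a_phi + c(y))

def normal_pair(y, mu, sigma2):
    # Direct normal density versus theta = mu, b = theta^2/2, a = sigma^2.
    direct = math.exp(-(y - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
    family = expfam_density(
        y, mu, lambda t: t ** 2 / 2, sigma2,
        lambda v: -(v ** 2 / (2 * sigma2) + 0.5 * math.log(2 * math.pi * sigma2)),
    )
    return direct, family

def poisson_pair(y, mu):
    # Direct Poisson pmf versus theta = ln(mu), b = exp(theta), a = 1.
    direct = math.exp(-mu) * mu ** y / math.factorial(y)
    family = expfam_density(y, math.log(mu), math.exp, 1.0,
                            lambda v: -math.log(math.factorial(v)))
    return direct, family

def binomial_pair(y, n, p):
    # Direct binomial pmf versus theta = logit(p), b = n ln(1 + e^theta) = -n ln(1 - p), a = 1.
    direct = math.comb(n, y) * p ** y * (1 - p) ** (n - y)
    family = expfam_density(
        y, math.log(p / (1 - p)), lambda t: n * math.log(1 + math.exp(t)), 1.0,
        lambda v: math.log(math.comb(n, v)),
    )
    return direct, family

for name, pair in [("normal", normal_pair(1.3, 0.5, 2.0)),
                   ("poisson", poisson_pair(3, 2.5)),
                   ("binomial", binomial_pair(4, 10, 0.3))]:
    print(name, pair)  # the two numbers in each pair should agree
```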


    Generalized linear models: Exponential family

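    In generalized linear models the response is assumed to follow a probability distribution from the exponential family:

    $$f(y \mid \theta) = e^{\frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)}$$

    $$E(Y) = b'(\theta), \quad \mathrm{Var}(Y) = b''(\theta)\,a(\phi), \quad g(\mu) = \theta$$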


    Generalized linear models: Exponential family

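    The Normal distribution has the form:

    $$f(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}} = e^{\frac{y\mu - \mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2)} = e^{\frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)}$$

    $$\theta = \mu, \quad b(\theta) = \frac{\mu^2}{2}, \quad a(\phi) = \sigma^2, \quad c(y,\phi) = -\left(\frac{y^2}{2\sigma^2} + \frac{1}{2}\ln(2\pi\sigma^2)\right)$$

    $$E(Y) = b'(\theta) = \mu, \quad \mathrm{Var}(Y) = b''(\theta)\,a(\phi) = \sigma^2, \quad g(\mu) = \mu$$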


    Generalized linear models: Exponential family

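    The Poisson distribution has the form:

    $$f(y \mid \mu) = \frac{e^{-\mu}\mu^{y}}{y!} = e^{y\ln(\mu) - \mu - \ln(y!)} = e^{\frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)}$$

    $$\theta = \ln(\mu), \quad b(\theta) = \mu, \quad a(\phi) = 1, \quad c(y,\phi) = -\ln(y!)$$

    $$\theta = \ln(\mu) \Rightarrow b(\theta) = \exp(\theta), \quad E(Y) = b'(\theta) = \exp(\theta) = \mu$$

    $$\mathrm{Var}(Y) = b''(\theta)\,a(\phi) = \exp(\theta) = \mu, \quad g(\mu) = \ln(\mu)$$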


    Generalized linear models: Exponential family

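    The Binomial distribution has the form:

    $$f(y \mid p) = \binom{n}{y} p^{y}(1-p)^{n-y} = e^{y\ln(p) + (n-y)\ln(1-p) + \ln\binom{n}{y}} = e^{y\ln\left(\frac{p}{1-p}\right) - \left(-n\ln(1-p)\right) + \ln\binom{n}{y}} = e^{\frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)}$$

    $$\theta = \log\left(\frac{p}{1-p}\right), \quad b(\theta) = -n\ln(1-p), \quad a(\phi) = 1, \quad c(y,\phi) = \ln\binom{n}{y}$$

    $$\theta = \ln\left(\frac{p}{1-p}\right) \Rightarrow b(\theta) = n\ln(1 + \exp(\theta)), \quad E(Y) = b'(\theta) = np$$

    $$\mathrm{Var}(Y) = b''(\theta)\,a(\phi) = np(1-p), \quad g(p) = \ln\left(\frac{p}{1-p}\right)$$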
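    The moment identities E(Y) = b'(θ) and Var(Y) = b''(θ) a(φ) can be checked numerically. The sketch below differentiates b(θ) by central finite differences for the Poisson and binomial cases; the step size, the helper names and the parameter values are assumptions made for illustration.

```python
import math

def d1(f, t, h=1e-4):
    """Central-difference approximation of the first derivative."""
    return (f(t + h) - f(t - h)) / (2 * h)

def d2(f, t, h=1e-4):
    """Central-difference approximation of the second derivative."""
    return (f(t + h) - 2 * f(t) + f(t - h)) / h ** 2

# Poisson: theta = ln(mu), b(theta) = exp(theta), a(phi) = 1.
mu = 2.5
theta_pois = math.log(mu)
pois_mean = d1(math.exp, theta_pois)        # should be close to mu
pois_var = d2(math.exp, theta_pois) * 1.0   # should be close to mu

# Binomial: theta = ln(p / (1 - p)), b(theta) = n ln(1 + exp(theta)), a(phi) = 1.
n, p = 10, 0.3
theta_bin = math.log(p / (1 - p))
b_bin = lambda t: n * math.log(1 + math.exp(t))
bin_mean = d1(b_bin, theta_bin)             # should be close to n * p
bin_var = d2(b_bin, theta_bin) * 1.0        # should be close to n * p * (1 - p)

print(pois_mean, pois_var, bin_mean, bin_var)
```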


    Probability distributions

    Normal: (2)

    Inverse Gaussian: (3)

    Gamma: (4)


    Probability distributions

    Negative Binomial: (5)

    Poisson: (6)

    Binomial: (6)


    Generalized Linear Models (GLM)

    A general class of linear models made up of 3 components: Random, Systematic, and Link Function.

    Random component: identifies the dependent variable (Y) and its probability distribution.

    Systematic component: identifies the set of explanatory variables (X1, ..., Xk).

    Link function: identifies a function of the mean that is a linear function of the explanatory variables:

    $$g(\mu) = \alpha + \beta_1 X_1 + \cdots + \beta_k X_k$$


    Generalised linear model

    If the distribution of the observations is one of the distributions from the exponential family, and some function of the expected value of the observations is a linear function of the parameters, then a generalised linear model is used:

    $$g(E(\mathbf{y})) = \big(g(E(y_1)), \ldots, g(E(y_n))\big)^T = \big(g(b'(\theta_1)), \ldots, g(b'(\theta_n))\big)^T = X\beta$$

    The function g is called the link function. Here is a list of popular distributions and their corresponding link functions:

    binomial - logit = ln(p/(1-p))

    normal - identity

    Gamma - inverse

    Poisson - log

    The most natural choice is θ = Xβ (the canonical link). The optimisation for this kind of function is done iteratively.
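    The distribution-to-link pairing listed above can be written down as pairs of a link g and its inverse. The following is a minimal sketch; the dictionary layout, the function name fitted_mean and the example coefficients are hypothetical choices, not something the slides prescribe.

```python
import math

# Each entry is (link g, inverse link g^{-1}) for the pairing listed on the slide.
LINKS = {
    "binomial": (lambda p: math.log(p / (1 - p)), lambda eta: 1 / (1 + math.exp(-eta))),
    "normal": (lambda mu: mu, lambda eta: eta),
    "Gamma": (lambda mu: 1 / mu, lambda eta: 1 / eta),
    "Poisson": (math.log, math.exp),
}

def fitted_mean(family, x_row, beta):
    """Map the linear predictor eta = x . beta back to the mean scale."""
    _, inverse_link = LINKS[family]
    eta = sum(x * b for x, b in zip(x_row, beta))
    return inverse_link(eta)

# Hypothetical design row (intercept, one covariate) and coefficients.
print(fitted_mean("Poisson", [1.0, 2.0], [0.5, 0.25]))   # exp(0.5 + 0.5)
print(fitted_mean("binomial", [1.0, 2.0], [0.5, 0.25]))  # inverse logit of 1.0
```

    Applying g to the fitted mean recovers the linear predictor, which is the relation g(E(y)) = Xβ stated on the slide.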


    Likelihood function

    $$f(\theta \mid X) = \frac{f(X \mid \theta)\, p(\theta)}{p(X)}$$


    Likelihood function

    $$L(x_1, x_2, \ldots, x_n; p) = \prod_{i=1}^{n} f_X(x_i; p)$$

    $$\ln L(x_1, x_2, \ldots, x_n; p) = \sum_{i=1}^{n} \ln f_X(x_i; p)$$
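    To make the product and sum forms of the likelihood concrete, the sketch below evaluates both for a small sample. The Poisson density is used as the per-observation f_X purely as an example, and the sample values are made up.

```python
import math

def poisson_pmf(x, p):
    """Per-observation density f_X(x; p), here a Poisson with mean p."""
    return math.exp(-p) * p ** x / math.factorial(x)

def likelihood(sample, p):
    # L(x_1, ..., x_n; p) as a product over observations.
    return math.prod(poisson_pmf(x, p) for x in sample)

def log_likelihood(sample, p):
    # ln L(x_1, ..., x_n; p) as a sum of log densities.
    return sum(math.log(poisson_pmf(x, p)) for x in sample)

sample = [2, 0, 3, 1, 4]  # made-up counts with sample mean 2.0
for p in (1.0, 2.0, 3.0):
    print(p, likelihood(sample, p), log_likelihood(sample, p))
```

    Working on the log scale avoids the underflow that the raw product runs into for large n, and both forms peak at the same parameter value.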


    Likelihood function: Poisson distribution

    $$L = f(y \mid \mu) = \frac{e^{-\mu}\mu^{y}}{y!} = e^{y\log(\mu) - \mu - \log(y!)}$$

    $$\ln L = y\log(\mu) - \mu - \log(y!)$$

    Newton-Raphson algorithm
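    The slide only names the Newton-Raphson algorithm. One way it could look for a Poisson log-likelihood is sketched below, assuming a log link with a single covariate, ln(μ_i) = b0 + b1 x_i. The model, the function name and the data are illustrative assumptions; the 2x2 linear system is solved by hand to keep the example free of dependencies.

```python
import math

def fit_poisson_log_link(x, y, iterations=25, tol=1e-10):
    """Newton-Raphson for ln(mu_i) = b0 + b1 * x_i with Poisson responses."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0   # start at the intercept-only fit
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: derivatives of sum(y * eta - exp(eta)) with respect to b0, b1.
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        # Information matrix (negative Hessian) entries.
        i00 = sum(mu)
        i01 = sum(mi * xi for mi, xi in zip(mu, x))
        i11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
        det = i00 * i11 - i01 * i01
        # Newton step: beta_new = beta + I^{-1} u, with the 2x2 inverse written out.
        step0 = (i11 * u0 - i01 * u1) / det
        step1 = (-i01 * u0 + i00 * u1) / det
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1

x = [0.0, 1.0, 2.0, 3.0, 4.0]   # made-up covariate values
y = [1, 2, 4, 6, 12]            # made-up counts
b0, b1 = fit_poisson_log_link(x, y)
print(b0, b1)
```

    At convergence both score components are numerically zero, which is the condition the maximum likelihood estimate has to satisfy.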


    Likelihood ratio test

    Let us assume that we have a sample of size n, x = (x1, ..., xn), and we want to estimate a parameter vector θ = (θ1, θ2). Both θ1 and θ2 can also be vectors. We want to test the null hypothesis against the alternative:

    $$H_0: \theta_1 = \theta_{10} \quad \text{against} \quad H_1: \theta_1 \neq \theta_{10}$$

    Let us assume that the likelihood function is L(x | θ). The likelihood ratio test then works as follows: 1) maximise the likelihood function under the null hypothesis (i.e. fix the parameter(s) θ1 equal to θ10) and find the value of the likelihood at the maximum; 2) maximise the likelihood under the alternative hypothesis (i.e. unconstrained maximisation) and find the value of the likelihood at the maximum; then find the ratio:

    $$w = \frac{L(x \mid \theta_{10}, \tilde{\theta}_2)}{L(x \mid \hat{\theta}_1, \hat{\theta}_2)}$$

    Here $\tilde{\theta}_2$ is the value of the parameter after constrained maximisation, and $\hat{\theta}_1, \hat{\theta}_2$ are the values of both parameters after unconstrained maximisation.

    w is the likelihood ratio statistic. Tests carried out using this statistic are called likelihood ratio tests. In this case it is clear that:

    $$0 \le w \le 1$$

    If the value of w is small then the null hypothesis is rejected. If g(w) is the density of the distribution of w, then the critical region for significance level α can be calculated using:

    $$\int_0^{c} g(w)\,dw = \alpha$$
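    A small worked case may help. The sketch below tests H0: μ = μ0 for a Poisson sample, so there is no nuisance parameter θ2 and the constrained fit is simply μ0. Instead of the exact density g(w), it uses the usual large-sample approximation that -2 ln w is chi-squared with 1 degree of freedom; that substitution, the function names and the data are assumptions of this example.

```python
import math

def poisson_loglik(sample, mu):
    """Log-likelihood of an i.i.d. Poisson sample with mean mu."""
    return sum(x * math.log(mu) - mu - math.lgamma(x + 1) for x in sample)

def likelihood_ratio(sample, mu0):
    """Return w = L(mu0) / L(mu_hat) together with the unconstrained estimate."""
    mu_hat = sum(sample) / len(sample)   # unconstrained maximiser (sample mean)
    log_w = poisson_loglik(sample, mu0) - poisson_loglik(sample, mu_hat)
    return math.exp(log_w), mu_hat

sample = [2, 0, 3, 1, 4, 5, 2, 3]        # made-up counts
w, mu_hat = likelihood_ratio(sample, mu0=1.5)
statistic = -2 * math.log(w)
# Upper tail of a chi-squared variable with 1 d.f.: P(X > s) = erfc(sqrt(s / 2)).
p_value = math.erfc(math.sqrt(statistic / 2))
print(w, statistic, p_value)
```

    A small w (large -2 ln w, small p-value) speaks against the null hypothesis, matching the rejection rule on the slide.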


    Deviances

    In the linear model, we maximise the likelihood with the full model and under the null hypothesis. The ratio of the values of the likelihood function under the two hypotheses (null and alternative) is related to the F-distribution. The interpretation is how much the variance would increase if we removed part of the model (the null hypothesis).

    In logistic and log-linear models, the likelihood function is again maximised under the null and alternative hypotheses. Twice the logarithm of the ratio of the likelihood values under these two hypotheses (this is called the deviance) asymptotically has a chi-squared distribution:

    $$\chi^2 \approx 2\left[\log l(y \mid y) - \log l(y \mid \hat{\mu})\right]$$

    That is, twice the difference between the maximum achievable log-likelihood and the log-likelihood at the estimated parameters.

    That is why in log-linear and logistic regression it is usual to talk about deviances and chi-squared statistics instead of variances and F-statistics. Analysis based on log-linear and logistic models (and on generalised linear models in general) is usually called analysis of deviance. The reason is that chi-squared is related to the deviation between the fitted model and the observations.

    Another test is based on Pearson's chi-squared statistic. It asymptotically approaches chi-squared with n-p degrees of freedom (n is the number of observations and p is the number of parameters):

    $$X^2 = \sum_{i=1}^{n} \frac{(y_i - \hat{\mu}_i)^2}{\mathrm{Var}(y_i)}$$
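    Both statistics are easy to compute once fitted means are available. The sketch below does so for a Poisson model, where Var(y_i) equals the fitted mean; the observed counts and the fitted means are made-up numbers, and the function names are hypothetical.

```python
import math

def poisson_loglik(y, mu):
    """Poisson log-likelihood, using the convention 0 * log(0) = 0."""
    return sum((yi * math.log(mi) if yi > 0 else 0.0) - mi - math.lgamma(yi + 1)
               for yi, mi in zip(y, mu))

def deviance(y, mu_hat):
    # The saturated model reproduces the data exactly, so its fitted means are y itself.
    return 2 * (poisson_loglik(y, y) - poisson_loglik(y, mu_hat))

def pearson_chi2(y, mu_hat):
    # For a Poisson response the variance equals the mean.
    return sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu_hat))

y = [0, 2, 3, 6, 9]                  # made-up observed counts
mu_hat = [1.0, 2.0, 4.0, 5.0, 8.0]   # made-up fitted means
print(deviance(y, mu_hat), pearson_chi2(y, mu_hat))
```

    The two statistics are usually of similar size when the model fits reasonably well, and both are compared with a chi-squared distribution on n-p degrees of freedom.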


    Goodness of fit: Deviance

    Deviance = -2[LM - LS]

    where LM is the maximum log likelihood of the model of interest and LS is the maximum log likelihood for the most complex model, which has a separate parameter at each explanatory setting (the saturated model).

    Deviance has approximately a chi-square distribution with df = N - p, where N = number of observations and p = number of parameters (including the intercept).

    Likelihood ratio test for model comparison between M1 and M0 (M0 is a simpler model than M1):

    Likelihood ratio = -2[L0 - L1] = -2[L0 - LS] - {-2[L1 - LS]} = Deviance0 - Deviance1
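    The identity above says the saturated log-likelihood cancels when two nested models are compared. A short arithmetic sketch with made-up log-likelihood values follows; the numbers and parameter counts are hypothetical, and the degrees of freedom p1 - p0 are the standard choice for nested models rather than something stated on the slide.

```python
def deviance(loglik_model, loglik_saturated):
    """Deviance = -2 [L_M - L_S]."""
    return -2.0 * (loglik_model - loglik_saturated)

# Hypothetical maximised log-likelihoods (illustrative numbers only).
L0, L1, LS = -52.3, -48.1, -45.0   # simpler model, larger model, saturated model
p0, p1 = 2, 4                      # hypothetical parameter counts

deviance0 = deviance(L0, LS)
deviance1 = deviance(L1, LS)
lr_statistic = -2.0 * (L0 - L1)    # does not involve LS at all
df = p1 - p0                       # reference chi-squared degrees of freedom
print(deviance0, deviance1, lr_statistic, df)
```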


    Model fit metrics

    Covariance matrix for parameters: computed from the second partial derivatives of the log-likelihood function.

    Likelihood ratio test: ratio of the maximised likelihood to the likelihood under the null hypothesis.

    Deviance: a measure of how overdetermined the system is. It compares the maximum of the full system with the maximum of the saturated model (number of parameters equals number of data points).
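    As an illustration of the first metric, the sketch below approximates the second derivative of a one-parameter Poisson log-likelihood at its maximum and inverts it to obtain a variance. With a single parameter the covariance matrix is just this one number. The data, the step size and the names are assumptions of the example.

```python
import math

def loglik(sample, mu):
    """Log-likelihood of an i.i.d. Poisson sample with mean mu."""
    return sum(x * math.log(mu) - mu - math.lgamma(x + 1) for x in sample)

def second_derivative(f, t, h=1e-3):
    """Central-difference approximation of f''(t)."""
    return (f(t + h) - 2 * f(t) + f(t - h)) / h ** 2

sample = [2, 0, 3, 1, 4, 5, 2, 3]     # made-up counts
mu_hat = sum(sample) / len(sample)    # maximum likelihood estimate of the mean

# Observed information is minus the second derivative at the maximum.
observed_information = -second_derivative(lambda m: loglik(sample, m), mu_hat)
variance = 1.0 / observed_information
std_error = math.sqrt(variance)
print(mu_hat, variance, std_error)    # variance should be close to mu_hat / n
```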


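    Range of plausible models

    Likelihood ratio

    $$\lambda = \frac{f(y; b_0, \sigma^2)}{f(y; \hat{b}, \sigma^2)}$$

    with $b_0$ the specified model and $\hat{b}$ the best model.

    Ratio of the likelihood of any model to the likelihood of the best model:

    $$\lambda = \frac{\exp\left[-\frac{1}{2}\left(\frac{y - b_0}{\sigma}\right)^2\right]}{\exp\left[-\frac{1}{2}\left(\frac{y - \hat{b}}{\sigma}\right)^2\right]} = \exp\left(-\frac{1}{2} z^2\right)$$

    Log-likelihood ratio: $\ln\lambda = -\frac{1}{2} z^2$, so $z^2 = -2\ln\lambda$.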
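    The relation z² = -2 ln λ can be checked numerically. The sketch below assumes a single normal observation with known σ, so that the best model has mean equal to y and the ratio reduces to exp(-z²/2) with z = (y - b0)/σ. That single-observation setting, the function names and the numbers are assumptions of the example.

```python
import math

def normal_density(y, b, sigma):
    """Normal density with mean b and standard deviation sigma."""
    return math.exp(-0.5 * ((y - b) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def relative_likelihood(y, b0, sigma):
    """Likelihood of the specified model b0 relative to the best model."""
    b_hat = y   # with one observation, the likelihood is maximised at mean = y
    return normal_density(y, b0, sigma) / normal_density(y, b_hat, sigma)

y, sigma = 10.0, 2.0   # made-up observation and known standard deviation
for b0 in (10.0, 8.0, 6.0):
    lam = relative_likelihood(y, b0, sigma)
    z2 = -2 * math.log(lam)
    print(b0, lam, z2)  # z2 equals ((y - b0) / sigma) ** 2
```

    Models whose relative likelihood stays above a chosen cut-off form the range of plausible models; the cut-off itself is a judgement call.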


    Example

    Alt = altitude, Sl = slope, Te = temperature, Pp = precipitation, V = vegetation, Lc = land cover.

    Site                  Longitude  Latitude   Alt  Sl  Te   Pp  V  Ndvi  Soil  Lc   S   M    B    P
    Nicolás B.               -104.7     24.38  1920   2  17  450  5    90     7   8   9   3   79   80
    Librado R.              -104.26      24.4  2005   2  17  450  7    84     9   8  24  28   31   32
    La Ermita               -104.33     23.89  2169  10  17  550  6   109     9  11  47   6   13   14
    Madero                  -104.29     24.27  1942   2  17  450  3    74     9   8  16  85  110  111
    Castillo N.             -104.49     24.34  1923   2  17  450  7    78     7   8  20  58   33   34
    Km 188                  -104.61     25.38  1501   3  21  350  4    83     2  10  29  28   27   28
    Km 130                  -104.51     24.99  1733   4  19  450  4    85     3   9  22  20   20   21
    Las Huertas             -104.29     24.27  1930   2  17  450  3    75     9   8  15  13   20   21
    18 de Agosto            -104.15     23.95  1866   1  17  450  7    81     9  12  17  10   20   21
    El Venado               -104.28     23.87  1747   4  17  450  6    83     4   8  20  10   20   21
    Km 23                   -104.46     24.51  2126   4  15  550  7    83     3  11  16   8   14   15
    Km 73                   -104.32     25.13  1284   5  21  350  4    78     9   8   1   0    1    2
    Rodríguez               -104.09     24.32  2064   4  17  550  6    78     2  11  15   0    7    8
    Berros - Tuitan         -104.27     23.97  1855   4  17  450  6    84     9   9  15   0    0    1
    27 de Noviembre         -104.49     24.22  1862   2  17  450  5    91     9   8  17   0   31   32
    Km 86                   -104.64      24.7  1954   6  17  450  3    89     4   8   8   4   10   11
    Morcillo                 -104.7     24.11  1971   3  17  550  8    88     9  10   2   0    5    6
    Km 43                   -104.47     24.65  1908   3  17  450  7    75     3   9  12   0    3    4
    Zarco - Cieneguilla     -104.04      24.1  2143   5  15  550  7    82     2   9  10   0    9   10
    Berros - Saltito        -104.28     23.94  1858  15  17  450  6    83     9   8   6   0    1    2
    Km 36                    -104.7     24.27  1909   2  17  450  3    86     9   9   0   0    0    1
    Zaragoza                -104.16     23.87  1856   1  17  450  7    76     9   8  11  15   21   22
    Entrada Guadiana        -104.34     23.95  1867   8  17  550  6    83     9  11   1   0    2    3
    Carlos R.               -104.44     24.27  1867   1  17  450  5    88     7   8  15   0    5    6
    Km 153                  -104.53     25.12  1360   2  21  350  7    81     9   9   8   6    0    1
    Km 51                   -104.16     25.21  1416  24  21  250  7    79     4   9   1   0    0    1
    Km 29                   -104.16     25.36  1396   7  21  250  7    71     4  10   0   0    0    1
    Km 237                  -104.75     25.76  1905   2  17  450  6    79     9  11   5   1    0    1
    Km 260                  -104.89     25.96  1940   2  17  450  6    82     2   9   0   0    0    1
    Km 84                   -104.42     25.82  1770   3  19  350  7    75     9  11   0   0    0    1
    Francisco Z.            -104.07     24.22  2166   3  15  550  6    74     2   8   2   0    0    1
    Km 61                   -104.29     25.55  1651   7  21  250  7    77     9   9   0   0    0    1
    Km 104                  -104.58     25.79  1942   1  17  450  6    77     9  10   3   0    1    2
    Km 76                   -104.28     25.67  1817   5  19  250  4    77     4  10   1   0    0    1


    Variable distribution

    [Figure: frequency distributions of the grasshopper counts for M, B and P. x-axis: grasshopper count; y-axis: frequency.]


    Correlation

                      Longitude  Latitude  Altitude   Slope  Temperature  Precipitation  Vegetation    Ndvi    Soil  Landcover
    M. lakinus             0.09    -0.42a      0.28   -0.10        -0.19          0.34a       -0.09   0.57a   -0.05       0.03
    B. nubilum             0.03     -0.16      0.07   -0.21        -0.05           0.04       -0.24   -0.16    0.10      -0.26
    P. nebrascensis       -0.04     -0.28      0.17   -0.25        -0.17           0.14       -0.33    0.07    0.13      -0.31


    Multicollinearity

                   Longitude  Latitude  Altitude   Slope  Temperature  Precipitation  Vegetation    Ndvi    Soil  Landcover
    Longitude           1.00    -0.37a      0.00    0.32        -0.01          -0.06        0.22   -0.30    0.09      -0.01
    Latitude          -0.37a      1.00    -0.45a    0.02        0.57a         -0.66a       -0.02  -0.37a   -0.18       0.23
    Altitude            0.00    -0.45a      1.00   -0.28       -0.92a          0.79a        0.04    0.30   -0.08       0.09
    Slope               0.32      0.02     -0.28    1.00         0.32          -0.30        0.16    0.11    0.12       0.03
    Temperature        -0.01      0.57    -0.92a    0.32         1.00         -0.85a       -0.04   -0.22    0.05       0.04
    Precipitation      -0.06     -0.64     0.79a   -0.30       -0.85a           1.00        0.09   0.39a   -0.03       0.08
    Vegetation          0.22     -0.02      0.04    0.15        -0.04           0.06        1.00   -0.10    0.04       0.32
    Ndvi               -0.30    -0.37a      0.30    0.11        -0.22          0.39a       -0.10    1.00    0.16       0.06
    Soil                0.09     -0.18     -0.08    0.12         0.05          -0.03        0.04    0.16    1.00      -0.02
    Landcover          -0.01      0.23      0.09    0.04         0.05           0.08        0.32    0.06   -0.02       1.00
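    One simple way to read a matrix like this is to list the predictor pairs whose absolute correlation exceeds a cut-off. The sketch below does that for a handful of entries copied from the table; the 0.75 threshold is an arbitrary assumption made for illustration, not a rule taken from the slides.

```python
# A few off-diagonal entries copied from the correlation matrix above.
correlations = {
    ("Altitude", "Temperature"): -0.92,
    ("Temperature", "Precipitation"): -0.85,
    ("Altitude", "Precipitation"): 0.79,
    ("Latitude", "Precipitation"): -0.66,
    ("Latitude", "Temperature"): 0.57,
    ("Longitude", "Slope"): 0.32,
}

THRESHOLD = 0.75   # hypothetical cut-off for "strongly correlated"

# Keep the pairs above the cut-off, strongest first.
flagged = sorted(
    (pair for pair, r in correlations.items() if abs(r) >= THRESHOLD),
    key=lambda pair: -abs(correlations[pair]),
)
for pair in flagged:
    print(pair, correlations[pair])
```

    With this cut-off the altitude, temperature and precipitation variables stand out as carrying largely overlapping information.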


    Deviance

    Species           Model   Link function   Deviance value   d.f.   Value/df
    M. lakinus        Gamma   Log                      2.244      7      0.335
    B. nubilum        Gamma   Log                     11.211      9      1.080
    P. nebrascensis   Gamma   Log                      2.835      7      0.715


    Lower and Upper are the bounds of the 95% Wald confidence interval; Wald Chi-Square, df and Sig. report the hypothesis test.

    Parameter                        B  Std. Error     Lower     Upper  Wald Chi-Square  df  Sig.
    (Intercept)                835.919     62.1403   714.126   957.712          180.960   1  .000
    [Temperature=15.00]         -2.627       .5287    -3.663    -1.591           24.692   1  .000
    [Temperature=17.00]         -2.889       .6660    -4.195    -1.584           18.822   1  .000
    [Temperature=19.00]         -5.630       .5807    -6.768    -4.491           93.982   1  .000
    [Temperature=21.00]             0a           .         .         .                .   .     .
    [Precipitation=250.00]      -4.156       .3781    -4.897    -3.415          120.788   1  .000
    [Precipitation=350.00]          0a           .         .         .                .   .     .
    [Precipitation=450.00]       2.332       .4734     1.404     3.260           24.268   1  .000
    [Precipitation=550.00]          0a           .         .         .                .   .     .
    [Vegetation=3.00]           -3.481       .5117    -4.484    -2.478           46.261   1  .000
    [Vegetation=4.00]           -2.383       .8388    -4.027     -.739            8.072   1  .004
    [Vegetation=5.00]           -3.402       .5694    -4.518    -2.286           35.696   1  .000
    [Vegetation=6.00]           -4.161       .5299    -5.199    -3.122           61.647   1  .000
    [Vegetation=7.00]           -4.288       .5890    -5.442    -3.133           52.991   1  .000
    [Vegetation=8.00]               0a           .         .         .                .   .     .
    [Soil=2.00]                  -.103       .3156     -.721      .516             .106   1  .745
    [Soil=3.00]                  1.911       .2833     1.356     2.467           45.522   1  .000
    [Soil=4.00]                   .488       .1837      .128      .848            7.052   1  .008
    [Soil=7.00]                   .533       .2793     -.015     1.080            3.638   1  .056
    [Soil=9.00]                     0a           .         .         .                .   .     .
    [Landcover=8.00]              .436       .2687     -.090      .963            2.636   1  .104
    [Landcover=9.00]              .369       .2951     -.210      .947            1.561   1  .212
    [Landcover=10.00]            -.118       .3949     -.892      .656             .089   1  .765
    [Landcover=11.00]            1.666       .4357      .812     2.520           14.623   1  .000
    [Landcover=12.00]               0a           .         .         .                .   .     .
    Longitude                    8.360       .6374     7.111     9.610          172.047   1  .000
    Latitude                     1.366       .2438      .889     1.844           31.421   1  .000
    Slope                        -.039       .0147     -.067     -.010            6.859   1  .009
    Ndvi                          .122       .0113      .100      .144          117.147   1  .000
    (Scale)                      .048b       .0122      .029      .079

    Dependent Variable: M1

    Model: (Intercept), Precipitation, Temperature, Vegetation, Soil, Landcover, Longitude, Latitude, Slope, Ndvi

    a. Set to zero because this parameter is redundant.

    b. Maximum likelihood estimate.
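    The interval and test columns of this table follow from B and its standard error. The sketch below reproduces the [Temperature=15.00] row, assuming the usual Wald construction: interval B ± 1.96 SE and statistic (B / SE)² referred to a chi-squared distribution with 1 degree of freedom. The function name is hypothetical, and small differences from the printed values come from rounding in the table.

```python
import math

def wald_summary(b, std_error, z=1.959964):
    """95% Wald interval, Wald chi-square and its 1-d.f. p-value."""
    wald = (b / std_error) ** 2
    # Upper tail of a chi-squared variable with 1 d.f.: erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(wald / 2))
    return {
        "lower": b - z * std_error,
        "upper": b + z * std_error,
        "wald": wald,
        "p": p_value,
    }

row = wald_summary(-2.627, 0.5287)   # B and Std. Error of [Temperature=15.00]
print(row)  # compare with Lower -3.663, Upper -1.591, Wald 24.692, Sig. .000
```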


    Residual


    Fit