probability and statistics cookbook


  • Probability and Statistics Cookbook

    Copyright © Matthias Vallentin, [email protected]

    18th May, 2014

  • This cookbook integrates a variety of topics in probability theory and
    statistics. It is based on literature [1, 6, 3] and in-class material from
    courses of the statistics department at the University of California,
    Berkeley, but it is also influenced by other sources [4, 5]. If you find
    errors or have suggestions for further topics, I would appreciate it if you
    sent me an email. The most recent version of this document is available at
    http://matthias.vallentin.net/probability-and-statistics-cookbook/. To
    reproduce, please contact me.

    Contents

    1 Distribution Overview
      1.1 Discrete Distributions
      1.2 Continuous Distributions
    2 Probability Theory
    3 Random Variables
      3.1 Transformations
    4 Expectation
    5 Variance
    6 Inequalities
    7 Distribution Relationships
    8 Probability and Moment Generating Functions
    9 Multivariate Distributions
      9.1 Standard Bivariate Normal
      9.2 Bivariate Normal
      9.3 Multivariate Normal
    10 Convergence
      10.1 Law of Large Numbers (LLN)
      10.2 Central Limit Theorem (CLT)
    11 Statistical Inference
      11.1 Point Estimation
      11.2 Normal-Based Confidence Interval
      11.3 Empirical distribution
      11.4 Statistical Functionals
    12 Parametric Inference
      12.1 Method of Moments
      12.2 Maximum Likelihood
        12.2.1 Delta Method
      12.3 Multiparameter Models
        12.3.1 Multiparameter delta method
      12.4 Parametric Bootstrap
    13 Hypothesis Testing
    14 Bayesian Inference
      14.1 Credible Intervals
      14.2 Function of parameters
      14.3 Priors
        14.3.1 Conjugate Priors
      14.4 Bayesian Testing
    15 Exponential Family
    16 Sampling Methods
      16.1 The Bootstrap
        16.1.1 Bootstrap Confidence Intervals
      16.2 Rejection Sampling
      16.3 Importance Sampling
    17 Decision Theory
      17.1 Risk
      17.2 Admissibility
      17.3 Bayes Rule
      17.4 Minimax Rules
    18 Linear Regression
      18.1 Simple Linear Regression
      18.2 Prediction
      18.3 Multiple Regression
      18.4 Model Selection
    19 Non-parametric Function Estimation
      19.1 Density Estimation
        19.1.1 Histograms
        19.1.2 Kernel Density Estimator (KDE)
      19.2 Non-parametric Regression
      19.3 Smoothing Using Orthogonal Functions
    20 Stochastic Processes
      20.1 Markov Chains
      20.2 Poisson Processes
    21 Time Series
      21.1 Stationary Time Series
      21.2 Estimation of Correlation
      21.3 Non-Stationary Time Series
        21.3.1 Detrending
      21.4 ARIMA models
        21.4.1 Causality and Invertibility
      21.5 Spectral Analysis
    22 Math
      22.1 Gamma Function
      22.2 Beta Function
      22.3 Series
      22.4 Combinatorics

  • 1 Distribution Overview

    1.1 Discrete Distributions

    For each distribution: notation,¹ CDF $F_X(x)$, PMF $f_X(x)$, mean $\mathbb{E}[X]$, variance $\mathbb{V}[X]$, and MGF $M_X(s)$.

    Uniform $\mathrm{Unif}\{a,\dots,b\}$:
      $F_X(x) = 0$ for $x < a$, $\frac{\lfloor x\rfloor - a + 1}{b - a + 1}$ for $a \le x \le b$, $1$ for $x > b$;
      $f_X(x) = \frac{I(a \le x \le b)}{b - a + 1}$;  $\mathbb{E}[X] = \frac{a+b}{2}$;  $\mathbb{V}[X] = \frac{(b-a+1)^2 - 1}{12}$;  $M_X(s) = \frac{e^{as} - e^{(b+1)s}}{s(b-a)}$

    Bernoulli $\mathrm{Bern}(p)$:
      $F_X(x) = 0$ for $x < 0$, $1 - p$ for $0 \le x < 1$, $1$ for $x \ge 1$;
      $f_X(x) = p^x (1-p)^{1-x}$;  $\mathbb{E}[X] = p$;  $\mathbb{V}[X] = p(1-p)$;  $M_X(s) = 1 - p + p e^s$

    Binomial $\mathrm{Bin}(n, p)$:
      $F_X(x) = I_{1-p}(n - x, x + 1)$;  $f_X(x) = \binom{n}{x} p^x (1-p)^{n-x}$;  $\mathbb{E}[X] = np$;  $\mathbb{V}[X] = np(1-p)$;  $M_X(s) = (1 - p + p e^s)^n$

    Multinomial $\mathrm{Mult}(n, p)$:
      $f_X(x) = \frac{n!}{x_1! \cdots x_k!} p_1^{x_1} \cdots p_k^{x_k}$ with $\sum_{i=1}^{k} x_i = n$;
      $\mathbb{E}[X_i] = n p_i$;  $\mathbb{V}[X_i] = n p_i (1 - p_i)$;  $M_X(s) = \left(\sum_{i=1}^{k} p_i e^{s_i}\right)^n$

    Hypergeometric $\mathrm{Hyp}(N, m, n)$:
      $F_X(x) \approx \Phi\!\left(\frac{x - np}{\sqrt{np(1-p)}}\right)$ with $p = m/N$;  $f_X(x) = \frac{\binom{m}{x}\binom{N-m}{n-x}}{\binom{N}{n}}$;  $\mathbb{E}[X] = \frac{nm}{N}$;  $\mathbb{V}[X] = \frac{nm(N-n)(N-m)}{N^2 (N-1)}$

    Negative Binomial $\mathrm{NBin}(r, p)$:
      $F_X(x) = I_p(r, x + 1)$;  $f_X(x) = \binom{x + r - 1}{r - 1} p^r (1-p)^x$;  $\mathbb{E}[X] = r\,\frac{1-p}{p}$;  $\mathbb{V}[X] = r\,\frac{1-p}{p^2}$;  $M_X(s) = \left(\frac{p}{1 - (1-p)e^s}\right)^r$

    Geometric $\mathrm{Geo}(p)$:
      $F_X(x) = 1 - (1-p)^x$, $x \in \mathbb{N}^+$;  $f_X(x) = p(1-p)^{x-1}$, $x \in \mathbb{N}^+$;  $\mathbb{E}[X] = \frac{1}{p}$;  $\mathbb{V}[X] = \frac{1-p}{p^2}$;  $M_X(s) = \frac{p e^s}{1 - (1-p)e^s}$

    Poisson $\mathrm{Po}(\lambda)$:
      $F_X(x) = e^{-\lambda} \sum_{i=0}^{x} \frac{\lambda^i}{i!}$;  $f_X(x) = \frac{\lambda^x e^{-\lambda}}{x!}$;  $\mathbb{E}[X] = \lambda$;  $\mathbb{V}[X] = \lambda$;  $M_X(s) = e^{\lambda(e^s - 1)}$

    [Figure: PMF plots for the discrete Uniform, Binomial (n = 40, p = 0.3; n = 30, p = 0.6; n = 25, p = 0.9), Geometric (p = 0.2, 0.5, 0.8), and Poisson (λ = 1, 4, 10) distributions.]

    ¹ We use the notation $\gamma(s, x)$ and $\Gamma(x)$ to refer to the Gamma functions (see 22.1), and use $B(x, y)$ and $I_x$ to refer to the Beta functions (see 22.2).
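    As a quick sanity check on the discrete formulas above, here is a minimal Python sketch (assuming SciPy is available; parameter values are arbitrary examples) comparing the tabulated mean and variance against scipy.stats.

    ```python
    # Sketch: compare tabulated E[X] and V[X] against scipy.stats (arbitrary parameters).
    from scipy import stats

    n, p, lam = 10, 0.3, 4.0

    checks = {
        "Bernoulli": (stats.bernoulli(p), p,         p * (1 - p)),
        "Binomial":  (stats.binom(n, p),  n * p,     n * p * (1 - p)),
        "Geometric": (stats.geom(p),      1 / p,     (1 - p) / p**2),
        "Poisson":   (stats.poisson(lam), lam,       lam),
    }

    for name, (dist, mean_formula, var_formula) in checks.items():
        print(f"{name:10s} mean {dist.mean():.4f} vs {mean_formula:.4f}, "
              f"var {dist.var():.4f} vs {var_formula:.4f}")
    ```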

  • 1.2 Continuous Distributions

    For each distribution: notation, CDF $F_X(x)$, PDF $f_X(x)$, mean $\mathbb{E}[X]$, variance $\mathbb{V}[X]$, and MGF $M_X(s)$.

    Uniform $\mathrm{Unif}(a, b)$:
      $F_X(x) = 0$ for $x < a$, $\frac{x-a}{b-a}$ for $a < x < b$, $1$ for $x > b$;
      $f_X(x) = \frac{I(a < x < b)}{b - a}$;  $\mathbb{E}[X] = \frac{a+b}{2}$;  $\mathbb{V}[X] = \frac{(b-a)^2}{12}$;  $M_X(s) = \frac{e^{sb} - e^{sa}}{s(b-a)}$

    Normal $\mathcal{N}(\mu, \sigma^2)$:
      $F_X(x) = \Phi\!\left(\frac{x-\mu}{\sigma}\right)$ with $\Phi(x) = \int_{-\infty}^{x} \phi(t)\,dt$;
      $f_X(x) = \phi(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}$;  $\mathbb{E}[X] = \mu$;  $\mathbb{V}[X] = \sigma^2$;  $M_X(s) = \exp\!\left\{\mu s + \frac{\sigma^2 s^2}{2}\right\}$

    Log-Normal $\ln\mathcal{N}(\mu, \sigma^2)$:
      $F_X(x) = \frac{1}{2} + \frac{1}{2}\,\mathrm{erf}\!\left[\frac{\ln x - \mu}{\sqrt{2\sigma^2}}\right]$;
      $f_X(x) = \frac{1}{x\sqrt{2\pi\sigma^2}} \exp\!\left\{-\frac{(\ln x - \mu)^2}{2\sigma^2}\right\}$;  $\mathbb{E}[X] = e^{\mu + \sigma^2/2}$;  $\mathbb{V}[X] = (e^{\sigma^2} - 1)\,e^{2\mu + \sigma^2}$

    Multivariate Normal $\mathrm{MVN}(\mu, \Sigma)$:
      $f_X(x) = (2\pi)^{-k/2} |\Sigma|^{-1/2} e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)}$;  $\mathbb{E}[X] = \mu$;  $\mathbb{V}[X] = \Sigma$;  $M_X(s) = \exp\!\left\{\mu^T s + \frac{1}{2} s^T \Sigma s\right\}$

    Student's t $\mathrm{Student}(\nu)$:
      $f_X(x) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)} \left(1 + \frac{x^2}{\nu}\right)^{-(\nu+1)/2}$;  $\mathbb{E}[X] = 0$ ($\nu > 1$);  $\mathbb{V}[X] = \frac{\nu}{\nu - 2}$ for $\nu > 2$, $\infty$ for $1 < \nu \le 2$

    Chi-square $\chi^2_k$:
      $F_X(x) = \frac{1}{\Gamma(k/2)}\,\gamma\!\left(\frac{k}{2}, \frac{x}{2}\right)$;  $f_X(x) = \frac{1}{2^{k/2}\Gamma(k/2)}\,x^{k/2-1} e^{-x/2}$;  $\mathbb{E}[X] = k$;  $\mathbb{V}[X] = 2k$;  $M_X(s) = (1 - 2s)^{-k/2}$, $s < 1/2$

    F $\mathrm{F}(d_1, d_2)$:
      $F_X(x) = I_{\frac{d_1 x}{d_1 x + d_2}}\!\left(\frac{d_1}{2}, \frac{d_2}{2}\right)$;  $f_X(x) = \frac{\sqrt{\frac{(d_1 x)^{d_1} d_2^{d_2}}{(d_1 x + d_2)^{d_1 + d_2}}}}{x\,B\!\left(\frac{d_1}{2}, \frac{d_2}{2}\right)}$;  $\mathbb{E}[X] = \frac{d_2}{d_2 - 2}$;  $\mathbb{V}[X] = \frac{2 d_2^2 (d_1 + d_2 - 2)}{d_1 (d_2 - 2)^2 (d_2 - 4)}$

    Exponential $\mathrm{Exp}(\beta)$:
      $F_X(x) = 1 - e^{-x/\beta}$;  $f_X(x) = \frac{1}{\beta} e^{-x/\beta}$;  $\mathbb{E}[X] = \beta$;  $\mathbb{V}[X] = \beta^2$;  $M_X(s) = \frac{1}{1 - \beta s}$ ($s < 1/\beta$)

    Gamma $\mathrm{Gamma}(\alpha, \beta)$:
      $F_X(x) = \frac{\gamma(\alpha, x/\beta)}{\Gamma(\alpha)}$;  $f_X(x) = \frac{1}{\Gamma(\alpha)\beta^\alpha}\,x^{\alpha-1} e^{-x/\beta}$;  $\mathbb{E}[X] = \alpha\beta$;  $\mathbb{V}[X] = \alpha\beta^2$;  $M_X(s) = \left(\frac{1}{1 - \beta s}\right)^{\alpha}$ ($s < 1/\beta$)

    Inverse Gamma $\mathrm{InvGamma}(\alpha, \beta)$:
      $F_X(x) = \frac{\Gamma(\alpha, \beta/x)}{\Gamma(\alpha)}$;  $f_X(x) = \frac{\beta^\alpha}{\Gamma(\alpha)}\,x^{-\alpha-1} e^{-\beta/x}$;  $\mathbb{E}[X] = \frac{\beta}{\alpha - 1}$ ($\alpha > 1$);  $\mathbb{V}[X] = \frac{\beta^2}{(\alpha-1)^2(\alpha-2)}$ ($\alpha > 2$);  $M_X(s) = \frac{2(-\beta s)^{\alpha/2}}{\Gamma(\alpha)} K_\alpha\!\left(\sqrt{-4\beta s}\right)$

    Dirichlet $\mathrm{Dir}(\alpha)$:
      $f_X(x) = \frac{\Gamma\!\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1}$;  $\mathbb{E}[X_i] = \frac{\alpha_i}{\sum_{i=1}^{k}\alpha_i}$;  $\mathbb{V}[X_i] = \frac{\mathbb{E}[X_i](1 - \mathbb{E}[X_i])}{\sum_{i=1}^{k}\alpha_i + 1}$

    Beta $\mathrm{Beta}(\alpha, \beta)$:
      $F_X(x) = I_x(\alpha, \beta)$;  $f_X(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}$;  $\mathbb{E}[X] = \frac{\alpha}{\alpha+\beta}$;  $\mathbb{V}[X] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$;  $M_X(s) = 1 + \sum_{k=1}^{\infty}\left(\prod_{r=0}^{k-1}\frac{\alpha+r}{\alpha+\beta+r}\right)\frac{s^k}{k!}$

    Weibull $\mathrm{Weibull}(\lambda, k)$:
      $F_X(x) = 1 - e^{-(x/\lambda)^k}$;  $f_X(x) = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1} e^{-(x/\lambda)^k}$;  $\mathbb{E}[X] = \lambda\,\Gamma\!\left(1 + \frac{1}{k}\right)$;  $\mathbb{V}[X] = \lambda^2\,\Gamma\!\left(1 + \frac{2}{k}\right) - \mu^2$;  $M_X(s) = \sum_{n=0}^{\infty}\frac{s^n\lambda^n}{n!}\Gamma\!\left(1 + \frac{n}{k}\right)$

    Pareto $\mathrm{Pareto}(x_m, \alpha)$:
      $F_X(x) = 1 - \left(\frac{x_m}{x}\right)^{\alpha}$ for $x \ge x_m$;  $f_X(x) = \alpha\,\frac{x_m^\alpha}{x^{\alpha+1}}$ for $x \ge x_m$;  $\mathbb{E}[X] = \frac{\alpha x_m}{\alpha - 1}$ ($\alpha > 1$);  $\mathbb{V}[X] = \frac{x_m^2\,\alpha}{(\alpha-1)^2(\alpha-2)}$ ($\alpha > 2$);  $M_X(s) = \alpha(-x_m s)^\alpha\,\Gamma(-\alpha, -x_m s)$ ($s < 0$)

  • [Figure: PDF plots for the continuous Uniform, Normal (various μ, σ²), Log-Normal, Student's t (ν = 1, 2, 5, ∞), Chi-square (k = 1, …, 5), F, Exponential (β = 2, 1, 0.4), Gamma, Inverse Gamma, Beta, Weibull, and Pareto distributions.]

  • 2 Probability Theory

    Definitions

    - Sample space $\Omega$
    - Outcome (point or element) $\omega \in \Omega$
    - Event $A \subseteq \Omega$
    - $\sigma$-algebra $\mathcal{A}$:
      1. $\emptyset \in \mathcal{A}$
      2. $A_1, A_2, \dots \in \mathcal{A} \implies \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$
      3. $A \in \mathcal{A} \implies \neg A \in \mathcal{A}$
    - Probability distribution $\mathbb{P}$:
      1. $\mathbb{P}[A] \ge 0$ for every $A$
      2. $\mathbb{P}[\Omega] = 1$
      3. For disjoint $A_i$: $\mathbb{P}\left[\bigcup_{i=1}^{\infty} A_i\right] = \sum_{i=1}^{\infty} \mathbb{P}[A_i]$
    - Probability space $(\Omega, \mathcal{A}, \mathbb{P})$

    Properties

    - $\mathbb{P}[\emptyset] = 0$
    - $B = \Omega \cap B = (A \cup \neg A) \cap B = (A \cap B) \cup (\neg A \cap B)$
    - $\mathbb{P}[\neg A] = 1 - \mathbb{P}[A]$
    - $\mathbb{P}[B] = \mathbb{P}[A \cap B] + \mathbb{P}[\neg A \cap B]$
    - $\mathbb{P}[\Omega] = 1$, $\mathbb{P}[\emptyset] = 0$
    - $\neg\left(\bigcup_n A_n\right) = \bigcap_n \neg A_n$, $\neg\left(\bigcap_n A_n\right) = \bigcup_n \neg A_n$ (DeMorgan)
    - $\mathbb{P}\left[\bigcup_n A_n\right] = 1 - \mathbb{P}\left[\bigcap_n \neg A_n\right]$
    - $\mathbb{P}[A \cup B] = \mathbb{P}[A] + \mathbb{P}[B] - \mathbb{P}[A \cap B]$
    - $\mathbb{P}[A \cup B] = \mathbb{P}[A \cap \neg B] + \mathbb{P}[\neg A \cap B] + \mathbb{P}[A \cap B]$
    - $\mathbb{P}[A \cap \neg B] = \mathbb{P}[A] - \mathbb{P}[A \cap B]$

    Continuity of Probabilities

    - $A_1 \subset A_2 \subset \dots \implies \lim_{n\to\infty}\mathbb{P}[A_n] = \mathbb{P}[A]$ where $A = \bigcup_{i=1}^{\infty} A_i$
    - $A_1 \supset A_2 \supset \dots \implies \lim_{n\to\infty}\mathbb{P}[A_n] = \mathbb{P}[A]$ where $A = \bigcap_{i=1}^{\infty} A_i$

    Independence

    $A \perp B \iff \mathbb{P}[A \cap B] = \mathbb{P}[A]\,\mathbb{P}[B]$

    Conditional Probability

    $\mathbb{P}[A \,|\, B] = \frac{\mathbb{P}[A \cap B]}{\mathbb{P}[B]}$, $\quad \mathbb{P}[B] > 0$

    Law of Total Probability

    $\mathbb{P}[B] = \sum_{i=1}^{n} \mathbb{P}[B \,|\, A_i]\,\mathbb{P}[A_i]$, where $\Omega = \bigcup_{i=1}^{n} A_i$ (disjoint)

    Bayes' Theorem

    $\mathbb{P}[A_i \,|\, B] = \frac{\mathbb{P}[B \,|\, A_i]\,\mathbb{P}[A_i]}{\sum_{j=1}^{n} \mathbb{P}[B \,|\, A_j]\,\mathbb{P}[A_j]}$, where $\Omega = \bigcup_{i=1}^{n} A_i$ (disjoint)

    Inclusion-Exclusion Principle

    $\left|\bigcup_{i=1}^{n} A_i\right| = \sum_{r=1}^{n} (-1)^{r-1} \sum_{i_1 < \dots < i_r} \left|A_{i_1} \cap \dots \cap A_{i_r}\right|$
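    To make the last two identities concrete, here is a small Python sketch (the test/condition numbers are made-up illustrative values, not from the source) that computes a posterior probability via the law of total probability and Bayes' theorem.

    ```python
    # Sketch: law of total probability and Bayes' theorem with made-up numbers.
    # Partition: A1 = "has condition", A2 = "does not"; B = "test is positive".
    p_A = [0.01, 0.99]          # P[A1], P[A2]
    p_B_given_A = [0.95, 0.05]  # P[B | A1], P[B | A2]

    # Law of total probability: P[B] = sum_i P[B | Ai] P[Ai]
    p_B = sum(pb * pa for pb, pa in zip(p_B_given_A, p_A))

    # Bayes' theorem: P[A1 | B] = P[B | A1] P[A1] / P[B]
    p_A1_given_B = p_B_given_A[0] * p_A[0] / p_B

    print(f"P[B] = {p_B:.4f}, P[A1 | B] = {p_A1_given_B:.4f}")
    ```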

  • 3.1 Transformations

    Transformation function: $Z = \varphi(X)$

    Discrete

    $f_Z(z) = \mathbb{P}[\varphi(X) = z] = \mathbb{P}[\{x : \varphi(x) = z\}] = \mathbb{P}[X \in \varphi^{-1}(z)] = \sum_{x \in \varphi^{-1}(z)} f(x)$

    Continuous

    $F_Z(z) = \mathbb{P}[\varphi(X) \le z] = \int_{A_z} f(x)\,dx$ with $A_z = \{x : \varphi(x) \le z\}$

    Special case if $\varphi$ is strictly monotone

    $f_Z(z) = f_X(\varphi^{-1}(z)) \left|\frac{d}{dz}\varphi^{-1}(z)\right| = f_X(x) \left|\frac{dx}{dz}\right| = f_X(x)\,\frac{1}{|J|}$

    The Rule of the Lazy Statistician

    $\mathbb{E}[Z] = \int \varphi(x)\,dF_X(x)$

    $\mathbb{E}[I_A(x)] = \int I_A(x)\,dF_X(x) = \int_A dF_X(x) = \mathbb{P}[X \in A]$

    Convolution

    - $Z := X + Y$: $f_Z(z) = \int_{-\infty}^{\infty} f_{X,Y}(x, z - x)\,dx \overset{X,Y \ge 0}{=} \int_{0}^{z} f_{X,Y}(x, z - x)\,dx$
    - $Z := |X - Y|$: $f_Z(z) = 2\int_{0}^{\infty} f_{X,Y}(x, z + x)\,dx$
    - $Z := Y/X$: $f_Z(z) = \int_{-\infty}^{\infty} |x|\,f_{X,Y}(x, xz)\,dx \overset{X \perp Y}{=} \int_{-\infty}^{\infty} |x|\,f_X(x) f_Y(xz)\,dx$

    4 Expectation

    Definition and properties

    - $\mathbb{E}[X] = \mu_X = \int x\,dF_X(x) = \begin{cases} \sum_x x f_X(x) & X \text{ discrete} \\ \int x f_X(x)\,dx & X \text{ continuous} \end{cases}$
    - $\mathbb{P}[X = c] = 1 \implies \mathbb{E}[X] = c$
    - $\mathbb{E}[cX] = c\,\mathbb{E}[X]$
    - $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$
    - $\mathbb{E}[XY] = \int\!\!\int_{X,Y} xy\,f_{X,Y}(x, y)\,dx\,dy$
    - $\mathbb{E}[\varphi(X)] \ne \varphi(\mathbb{E}[X])$ in general (cf. Jensen's inequality)
    - $\mathbb{P}[X \ge Y] = 1 \implies \mathbb{E}[X] \ge \mathbb{E}[Y]$
    - $\mathbb{P}[X = Y] = 1 \implies \mathbb{E}[X] = \mathbb{E}[Y]$
    - $\mathbb{E}[X] = \sum_{x=1}^{\infty} \mathbb{P}[X \ge x]$ (for $X$ taking values in $\mathbb{N}$)

    Sample mean

    $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$

    Conditional expectation

    - $\mathbb{E}[Y \,|\, X = x] = \int y f(y \,|\, x)\,dy$
    - $\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \,|\, Y]]$
    - $\mathbb{E}[\varphi(X, Y) \,|\, X = x] = \int \varphi(x, y) f_{Y|X}(y \,|\, x)\,dy$
    - $\mathbb{E}[\varphi(Y, Z) \,|\, X = x] = \int\!\!\int \varphi(y, z) f_{(Y,Z)|X}(y, z \,|\, x)\,dy\,dz$
    - $\mathbb{E}[Y + Z \,|\, X] = \mathbb{E}[Y \,|\, X] + \mathbb{E}[Z \,|\, X]$
    - $\mathbb{E}[\varphi(X) Y \,|\, X] = \varphi(X)\,\mathbb{E}[Y \,|\, X]$
    - $\mathbb{E}[Y \,|\, X] = c \implies \mathrm{Cov}[X, Y] = 0$

    5 Variance

    Definition and properties

    - $\mathbb{V}[X] = \sigma_X^2 = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$
    - $\mathbb{V}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \mathbb{V}[X_i] + 2\sum_{i \ne j} \mathrm{Cov}[X_i, X_j]$
    - $\mathbb{V}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \mathbb{V}[X_i]$ if $X_i \perp X_j$

    Standard deviation

    $\mathrm{sd}[X] = \sqrt{\mathbb{V}[X]} = \sigma_X$

    Covariance

    - $\mathrm{Cov}[X, Y] = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]$
    - $\mathrm{Cov}[X, a] = 0$
    - $\mathrm{Cov}[X, X] = \mathbb{V}[X]$
    - $\mathrm{Cov}[X, Y] = \mathrm{Cov}[Y, X]$
    - $\mathrm{Cov}[aX, bY] = ab\,\mathrm{Cov}[X, Y]$
    - $\mathrm{Cov}[X + a, Y + b] = \mathrm{Cov}[X, Y]$
    - $\mathrm{Cov}\left[\sum_{i=1}^{n} X_i, \sum_{j=1}^{m} Y_j\right] = \sum_{i=1}^{n}\sum_{j=1}^{m} \mathrm{Cov}[X_i, Y_j]$

    Correlation

    $\rho[X, Y] = \frac{\mathrm{Cov}[X, Y]}{\sqrt{\mathbb{V}[X]\,\mathbb{V}[Y]}}$

    Independence

    $X \perp Y \implies \rho[X, Y] = 0 \iff \mathrm{Cov}[X, Y] = 0 \iff \mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$

    Sample variance

    $S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X}_n)^2$

    Conditional variance

    - $\mathbb{V}[Y \,|\, X] = \mathbb{E}[(Y - \mathbb{E}[Y \,|\, X])^2 \,|\, X] = \mathbb{E}[Y^2 \,|\, X] - \mathbb{E}[Y \,|\, X]^2$
    - $\mathbb{V}[Y] = \mathbb{E}[\mathbb{V}[Y \,|\, X]] + \mathbb{V}[\mathbb{E}[Y \,|\, X]]$

    6 Inequalities

    - Cauchy-Schwarz: $\mathbb{E}[XY]^2 \le \mathbb{E}[X^2]\,\mathbb{E}[Y^2]$
    - Markov: $\mathbb{P}[\varphi(X) \ge t] \le \frac{\mathbb{E}[\varphi(X)]}{t}$
    - Chebyshev: $\mathbb{P}[|X - \mathbb{E}[X]| \ge t] \le \frac{\mathbb{V}[X]}{t^2}$
    - Chernoff: $\mathbb{P}[X \ge (1+\delta)\mu] \le \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^{\mu}$, $\delta > -1$
    - Jensen: $\mathbb{E}[\varphi(X)] \ge \varphi(\mathbb{E}[X])$ for convex $\varphi$

    7 Distribution Relationships

    Binomial

    - $X_i \sim \mathrm{Bern}(p) \implies \sum_{i=1}^{n} X_i \sim \mathrm{Bin}(n, p)$
    - $X \sim \mathrm{Bin}(n, p)$, $Y \sim \mathrm{Bin}(m, p)$, $X \perp Y \implies X + Y \sim \mathrm{Bin}(n + m, p)$
    - $\lim_{n\to\infty} \mathrm{Bin}(n, p) = \mathrm{Po}(np)$ ($n$ large, $p$ small)
    - $\lim_{n\to\infty} \mathrm{Bin}(n, p) = \mathcal{N}(np, np(1-p))$ ($n$ large, $p$ far from 0 and 1)

    Negative Binomial

    - $X \sim \mathrm{NBin}(1, p) = \mathrm{Geo}(p)$
    - $X \sim \mathrm{NBin}(r, p) = \sum_{i=1}^{r} \mathrm{Geo}(p)$
    - $X_i \sim \mathrm{NBin}(r_i, p) \implies \sum_i X_i \sim \mathrm{NBin}\left(\sum_i r_i, p\right)$
    - $X \sim \mathrm{NBin}(r, p)$, $Y \sim \mathrm{Bin}(s + r, p) \implies \mathbb{P}[X \le s] = \mathbb{P}[Y \ge r]$

    Poisson

    - $X_i \sim \mathrm{Po}(\lambda_i)$, $X_i \perp X_j \implies \sum_{i=1}^{n} X_i \sim \mathrm{Po}\left(\sum_{i=1}^{n} \lambda_i\right)$
    - $X_i \sim \mathrm{Po}(\lambda_i)$, $X_i \perp X_j \implies X_i \,\Big|\, \sum_{j=1}^{n} X_j \sim \mathrm{Bin}\left(\sum_{j=1}^{n} X_j, \frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j}\right)$

    Exponential

    - $X_i \sim \mathrm{Exp}(\beta)$, $X_i \perp X_j \implies \sum_{i=1}^{n} X_i \sim \mathrm{Gamma}(n, \beta)$
    - Memoryless property: $\mathbb{P}[X > x + y \,|\, X > y] = \mathbb{P}[X > x]$

    Normal

    - $X \sim \mathcal{N}(\mu, \sigma^2) \implies \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$
    - $X \sim \mathcal{N}(\mu, \sigma^2)$, $Z = aX + b \implies Z \sim \mathcal{N}(a\mu + b, a^2\sigma^2)$
    - $X \sim \mathcal{N}(\mu_1, \sigma_1^2)$, $Y \sim \mathcal{N}(\mu_2, \sigma_2^2)$, $X \perp Y \implies X + Y \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$
    - $X_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$ independent $\implies \sum_i X_i \sim \mathcal{N}\left(\sum_i \mu_i, \sum_i \sigma_i^2\right)$
    - $\mathbb{P}[a < X \le b] = \Phi\!\left(\frac{b-\mu}{\sigma}\right) - \Phi\!\left(\frac{a-\mu}{\sigma}\right)$
    - $\Phi(-x) = 1 - \Phi(x)$, $\quad \phi'(x) = -x\phi(x)$, $\quad \phi''(x) = (x^2 - 1)\phi(x)$
    - Upper quantile of $\mathcal{N}(0, 1)$: $z_\alpha = \Phi^{-1}(1 - \alpha)$

    Gamma

    - $X \sim \mathrm{Gamma}(\alpha, \beta) \iff X/\beta \sim \mathrm{Gamma}(\alpha, 1)$
    - $\mathrm{Gamma}(\alpha, \beta) \sim \sum_{i=1}^{\alpha} \mathrm{Exp}(\beta)$ (for integer $\alpha$)
    - $X_i \sim \mathrm{Gamma}(\alpha_i, \beta)$, $X_i \perp X_j \implies \sum_i X_i \sim \mathrm{Gamma}\left(\sum_i \alpha_i, \beta\right)$
    - $\Gamma(\alpha) = \int_0^{\infty} x^{\alpha-1} e^{-x}\,dx$

    Beta

    - $\frac{1}{B(\alpha, \beta)}\,x^{\alpha-1}(1-x)^{\beta-1} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}$
    - $\mathbb{E}[X^k] = \frac{B(\alpha+k, \beta)}{B(\alpha, \beta)} = \frac{\alpha + k - 1}{\alpha + \beta + k - 1}\,\mathbb{E}[X^{k-1}]$
    - $\mathrm{Beta}(1, 1) \sim \mathrm{Unif}(0, 1)$

  • 8 Probability and Moment Generating Functions

    - $G_X(t) = \mathbb{E}[t^X]$, $|t| < 1$
    - $M_X(t) = G_X(e^t) = \mathbb{E}[e^{Xt}] = \mathbb{E}\left[\sum_{i=0}^{\infty} \frac{(Xt)^i}{i!}\right] = \sum_{i=0}^{\infty} \frac{\mathbb{E}[X^i]}{i!}\,t^i$
    - $\mathbb{P}[X = 0] = G_X(0)$
    - $\mathbb{P}[X = 1] = G_X'(0)$
    - $\mathbb{P}[X = i] = \frac{G_X^{(i)}(0)}{i!}$
    - $\mathbb{E}[X] = G_X'(1^-)$
    - $\mathbb{E}[X^k] = M_X^{(k)}(0)$
    - $\mathbb{E}\left[\frac{X!}{(X-k)!}\right] = G_X^{(k)}(1^-)$
    - $\mathbb{V}[X] = G_X''(1^-) + G_X'(1^-) - \left(G_X'(1^-)\right)^2$
    - $G_X(t) = G_Y(t) \implies X \overset{d}{=} Y$

    9 Multivariate Distributions

    9.1 Standard Bivariate Normal

    Let $X, Z \sim \mathcal{N}(0, 1)$ with $X \perp Z$ and $Y = \rho X + \sqrt{1 - \rho^2}\,Z$.

    Joint density

    $f(x, y) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left\{-\frac{x^2 + y^2 - 2\rho x y}{2(1-\rho^2)}\right\}$

    Conditionals

    $(Y \,|\, X = x) \sim \mathcal{N}(\rho x, 1 - \rho^2)$ and $(X \,|\, Y = y) \sim \mathcal{N}(\rho y, 1 - \rho^2)$

    Independence

    $X \perp Y \iff \rho = 0$

    9.2 Bivariate Normal

    Let $X \sim \mathcal{N}(\mu_x, \sigma_x^2)$ and $Y \sim \mathcal{N}(\mu_y, \sigma_y^2)$.

    $f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\left\{-\frac{z}{2(1-\rho^2)}\right\}$

    $z = \left(\frac{x - \mu_x}{\sigma_x}\right)^2 + \left(\frac{y - \mu_y}{\sigma_y}\right)^2 - 2\rho\left(\frac{x - \mu_x}{\sigma_x}\right)\left(\frac{y - \mu_y}{\sigma_y}\right)$

    Conditional mean and variance

    $\mathbb{E}[X \,|\, Y] = \mathbb{E}[X] + \rho\,\frac{\sigma_X}{\sigma_Y}\,(Y - \mathbb{E}[Y])$

    $\mathbb{V}[X \,|\, Y] = \sigma_X^2\,(1 - \rho^2)$

    9.3 Multivariate Normal

    Covariance matrix $\Sigma$ (precision matrix $\Sigma^{-1}$)

    $\Sigma = \begin{pmatrix} \mathbb{V}[X_1] & \cdots & \mathrm{Cov}[X_1, X_k] \\ \vdots & \ddots & \vdots \\ \mathrm{Cov}[X_k, X_1] & \cdots & \mathbb{V}[X_k] \end{pmatrix}$

    If $X \sim \mathcal{N}(\mu, \Sigma)$,

    $f_X(x) = (2\pi)^{-n/2}\,|\Sigma|^{-1/2}\,\exp\left\{-\frac{1}{2}(x - \mu)^T \Sigma^{-1}(x - \mu)\right\}$

    Properties

    - $Z \sim \mathcal{N}(0, 1)$, $X = \mu + \Sigma^{1/2} Z \implies X \sim \mathcal{N}(\mu, \Sigma)$
    - $X \sim \mathcal{N}(\mu, \Sigma) \implies \Sigma^{-1/2}(X - \mu) \sim \mathcal{N}(0, 1)$
    - $X \sim \mathcal{N}(\mu, \Sigma) \implies AX \sim \mathcal{N}(A\mu, A\Sigma A^T)$
    - $X \sim \mathcal{N}(\mu, \Sigma)$, $a$ a vector of length $k \implies a^T X \sim \mathcal{N}(a^T\mu, a^T\Sigma a)$

    10 Convergence

    Let $\{X_1, X_2, \dots\}$ be a sequence of rvs and let $X$ be another rv. Let $F_n$ denote the cdf of $X_n$ and let $F$ denote the cdf of $X$.

    Types of convergence

    1. In distribution (weakly, in law): $X_n \overset{D}{\to} X$
       $\lim_{n\to\infty} F_n(t) = F(t)$ at all $t$ where $F$ is continuous
    2. In probability: $X_n \overset{P}{\to} X$
       $(\forall \varepsilon > 0)\ \lim_{n\to\infty} \mathbb{P}[|X_n - X| > \varepsilon] = 0$
    3. Almost surely (strongly): $X_n \overset{as}{\to} X$
       $\mathbb{P}\left[\lim_{n\to\infty} X_n = X\right] = \mathbb{P}\left[\omega \in \Omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)\right] = 1$
    4. In quadratic mean ($L_2$): $X_n \overset{qm}{\to} X$
       $\lim_{n\to\infty} \mathbb{E}\left[(X_n - X)^2\right] = 0$

    Relationships

    - $X_n \overset{qm}{\to} X \implies X_n \overset{P}{\to} X \implies X_n \overset{D}{\to} X$
    - $X_n \overset{as}{\to} X \implies X_n \overset{P}{\to} X$
    - $X_n \overset{D}{\to} X$ and $(\exists c \in \mathbb{R})\ \mathbb{P}[X = c] = 1 \implies X_n \overset{P}{\to} X$
    - $X_n \overset{P}{\to} X$ and $Y_n \overset{P}{\to} Y \implies X_n + Y_n \overset{P}{\to} X + Y$
    - $X_n \overset{qm}{\to} X$ and $Y_n \overset{qm}{\to} Y \implies X_n + Y_n \overset{qm}{\to} X + Y$
    - $X_n \overset{P}{\to} X$ and $Y_n \overset{P}{\to} Y \implies X_n Y_n \overset{P}{\to} XY$
    - $X_n \overset{P}{\to} X \implies \varphi(X_n) \overset{P}{\to} \varphi(X)$
    - $X_n \overset{D}{\to} X \implies \varphi(X_n) \overset{D}{\to} \varphi(X)$
    - $X_n \overset{qm}{\to} b \iff \lim_{n\to\infty} \mathbb{E}[X_n] = b$ and $\lim_{n\to\infty} \mathbb{V}[X_n] = 0$

    Let $X_1, \dots, X_n$ be iid with $\mathbb{E}[X_i] = \mu$ and $\mathbb{V}[X_i] = \sigma^2$.

    10.1 Law of Large Numbers (LLN)

    Weak (WLLN): $\bar{X}_n \overset{P}{\to} \mu$ as $n \to \infty$.  Strong (SLLN): $\bar{X}_n \overset{as}{\to} \mu$ as $n \to \infty$.

    10.2 Central Limit Theorem (CLT)

    $Z_n := \frac{\bar{X}_n - \mu}{\sqrt{\mathbb{V}[\bar{X}_n]}} = \frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \overset{D}{\to} Z \sim \mathcal{N}(0, 1)$

  • 11.2 Normal-Based Confidence Interval

    Suppose $\hat{\theta}_n \approx \mathcal{N}(\theta, \widehat{\mathrm{se}}^2)$. Let $z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)$, i.e., $\mathbb{P}[Z > z_{\alpha/2}] = \alpha/2$ and $\mathbb{P}[-z_{\alpha/2} < Z < z_{\alpha/2}] = 1 - \alpha$, where $Z \sim \mathcal{N}(0, 1)$. Then

    $C_n = \hat{\theta}_n \pm z_{\alpha/2}\,\widehat{\mathrm{se}}$

    11.3 Empirical distribution

    Empirical Distribution Function (ECDF)

    $\hat{F}_n(x) = \frac{\sum_{i=1}^{n} I(X_i \le x)}{n}$, $\qquad I(X_i \le x) = \begin{cases} 1 & X_i \le x \\ 0 & X_i > x \end{cases}$

    Properties (for any fixed $x$)

    - $\mathbb{E}[\hat{F}_n(x)] = F(x)$
    - $\mathbb{V}[\hat{F}_n(x)] = \frac{F(x)(1 - F(x))}{n}$
    - $\mathrm{mse} = \frac{F(x)(1 - F(x))}{n} \to 0$
    - $\hat{F}_n(x) \overset{P}{\to} F(x)$

    Dvoretzky-Kiefer-Wolfowitz (DKW) inequality ($X_1, \dots, X_n \sim F$)

    $\mathbb{P}\left[\sup_x \left|F(x) - \hat{F}_n(x)\right| > \varepsilon\right] \le 2 e^{-2n\varepsilon^2}$

    Nonparametric $1 - \alpha$ confidence band for $F$

    $L(x) = \max\{\hat{F}_n(x) - \varepsilon_n, 0\}$, $\quad U(x) = \min\{\hat{F}_n(x) + \varepsilon_n, 1\}$, $\quad \varepsilon_n = \sqrt{\frac{1}{2n}\log\left(\frac{2}{\alpha}\right)}$

    $\mathbb{P}[L(x) \le F(x) \le U(x)\ \forall x] \ge 1 - \alpha$
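    The ECDF and DKW band take only a few lines of NumPy; a minimal sketch with arbitrary example data and $\alpha = 0.05$:

    ```python
    # Sketch: ECDF with the DKW 1 - alpha confidence band.
    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(size=200)
    alpha = 0.05

    xs = np.sort(x)
    n = xs.size
    ecdf = np.arange(1, n + 1) / n               # F_hat evaluated at the sorted points
    eps = np.sqrt(np.log(2 / alpha) / (2 * n))   # DKW half-width

    lower = np.clip(ecdf - eps, 0, 1)
    upper = np.clip(ecdf + eps, 0, 1)

    print("band half-width:", round(eps, 4))
    print("first points (x, L, U):",
          list(zip(xs[:3].round(2), lower[:3].round(3), upper[:3].round(3))))
    ```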

    11.4 Statistical Functionals

    - Statistical functional: $T(F)$
    - Plug-in estimator of $\theta = T(F)$: $\hat{\theta}_n = T(\hat{F}_n)$
    - Linear functional: $T(F) = \int \varphi(x)\,dF_X(x)$
    - Plug-in estimator for a linear functional:
      $T(\hat{F}_n) = \int \varphi(x)\,d\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n} \varphi(X_i)$
    - Often: $T(\hat{F}_n) \approx \mathcal{N}(T(F), \widehat{\mathrm{se}}^2) \implies T(\hat{F}_n) \pm z_{\alpha/2}\,\widehat{\mathrm{se}}$
    - $p^{\text{th}}$ quantile: $F^{-1}(p) = \inf\{x : F(x) \ge p\}$
    - $\hat{\mu} = \bar{X}_n$
    - $\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X}_n)^2$
    - Skewness: $\hat{\kappa} = \frac{\frac{1}{n}\sum_{i=1}^{n} (X_i - \hat{\mu})^3}{\hat{\sigma}^3}$
    - Correlation: $\hat{\rho} = \frac{\sum_{i=1}^{n} (X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X}_n)^2}\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y}_n)^2}}$

    12 Parametric Inference

    Let $\mathcal{F} = \{f(x; \theta) : \theta \in \Theta\}$ be a parametric model with parameter space $\Theta \subseteq \mathbb{R}^k$ and parameter $\theta = (\theta_1, \dots, \theta_k)$.

    12.1 Method of Moments

    $j^{\text{th}}$ moment

    $\alpha_j(\theta) = \mathbb{E}[X^j] = \int x^j\,dF_X(x)$

    $j^{\text{th}}$ sample moment

    $\hat{\alpha}_j = \frac{1}{n}\sum_{i=1}^{n} X_i^j$

    Method of moments estimator (MoM): $\hat{\theta}_n$ solves

    $\alpha_1(\hat{\theta}_n) = \hat{\alpha}_1, \quad \alpha_2(\hat{\theta}_n) = \hat{\alpha}_2, \quad \dots, \quad \alpha_k(\hat{\theta}_n) = \hat{\alpha}_k$

  • Properties of the MoM estimator

    - $\hat{\theta}_n$ exists with probability tending to 1
    - Consistency: $\hat{\theta}_n \overset{P}{\to} \theta$
    - Asymptotic normality: $\sqrt{n}(\hat{\theta} - \theta) \overset{D}{\to} \mathcal{N}(0, \Sigma)$,
      where $\Sigma = g\,\mathbb{E}[Y Y^T]\,g^T$, $Y = (X, X^2, \dots, X^k)^T$, $g = (g_1, \dots, g_k)$ and $g_j = \frac{\partial \alpha_j^{-1}(\theta)}{\partial \theta}$

    12.2 Maximum Likelihood

    Likelihood: $\mathcal{L}_n : \Theta \to [0, \infty)$

    $\mathcal{L}_n(\theta) = \prod_{i=1}^{n} f(X_i; \theta)$

    Log-likelihood

    $\ell_n(\theta) = \log \mathcal{L}_n(\theta) = \sum_{i=1}^{n} \log f(X_i; \theta)$

    Maximum likelihood estimator (mle)

    $\mathcal{L}_n(\hat{\theta}_n) = \sup_\theta \mathcal{L}_n(\theta)$

    Score function

    $s(X; \theta) = \frac{\partial}{\partial\theta}\log f(X; \theta)$

    Fisher information

    $I(\theta) = \mathbb{V}_\theta[s(X; \theta)]$, $\qquad I_n(\theta) = n\,I(\theta)$

    Fisher information (exponential family)

    $I(\theta) = -\mathbb{E}_\theta\left[\frac{\partial}{\partial\theta}\,s(X; \theta)\right]$

    Observed Fisher information

    $I_n^{obs}(\theta) = -\frac{\partial^2}{\partial\theta^2}\sum_{i=1}^{n}\log f(X_i; \theta)$

    Properties of the mle

    - Consistency: $\hat{\theta}_n \overset{P}{\to} \theta$
    - Equivariance: $\hat{\theta}_n$ is the mle $\implies \varphi(\hat{\theta}_n)$ is the mle of $\varphi(\theta)$
    - Asymptotic normality:
      1. $\mathrm{se} \approx \sqrt{1/I_n(\theta)}$ and $\frac{\hat{\theta}_n - \theta}{\mathrm{se}} \overset{D}{\to} \mathcal{N}(0, 1)$
      2. $\widehat{\mathrm{se}} \approx \sqrt{1/I_n(\hat{\theta}_n)}$ and $\frac{\hat{\theta}_n - \theta}{\widehat{\mathrm{se}}} \overset{D}{\to} \mathcal{N}(0, 1)$
    - Asymptotic optimality (or efficiency), i.e., smallest variance for large samples. If $\tilde{\theta}_n$ is any other estimator, the asymptotic relative efficiency is $\mathrm{are}(\tilde{\theta}_n, \hat{\theta}_n) = \frac{\mathbb{V}[\hat{\theta}_n]}{\mathbb{V}[\tilde{\theta}_n]} \le 1$
    - Approximately the Bayes estimator

    12.2.1 Delta Method

    If $\tau = \varphi(\theta)$ where $\varphi$ is differentiable and $\varphi'(\theta) \ne 0$:

    $\frac{\hat{\tau}_n - \tau}{\widehat{\mathrm{se}}(\hat{\tau})} \overset{D}{\to} \mathcal{N}(0, 1)$

    where $\hat{\tau} = \varphi(\hat{\theta}_n)$ is the mle of $\tau$ and

    $\widehat{\mathrm{se}}(\hat{\tau}) = |\varphi'(\hat{\theta}_n)|\,\widehat{\mathrm{se}}(\hat{\theta}_n)$

    12.3 Multiparameter Models

    Let $\theta = (\theta_1, \dots, \theta_k)$ and let $\hat{\theta} = (\hat{\theta}_1, \dots, \hat{\theta}_k)$ be the mle.

    $H_{jj} = \frac{\partial^2 \ell_n}{\partial\theta_j^2} \qquad H_{jk} = \frac{\partial^2 \ell_n}{\partial\theta_j\,\partial\theta_k}$

    Fisher information matrix

    $I_n(\theta) = -\begin{pmatrix} \mathbb{E}_\theta[H_{11}] & \cdots & \mathbb{E}_\theta[H_{1k}] \\ \vdots & \ddots & \vdots \\ \mathbb{E}_\theta[H_{k1}] & \cdots & \mathbb{E}_\theta[H_{kk}] \end{pmatrix}$

    Under appropriate regularity conditions

    $(\hat{\theta} - \theta) \approx \mathcal{N}(0, J_n)$

    with $J_n(\theta) = I_n^{-1}(\theta)$. Further, if $\hat{\theta}_j$ is the $j^{\text{th}}$ component of $\hat{\theta}$, then

    $\frac{\hat{\theta}_j - \theta_j}{\widehat{\mathrm{se}}_j} \overset{D}{\to} \mathcal{N}(0, 1)$

    where $\widehat{\mathrm{se}}_j^2 = J_n(j, j)$ and $\mathrm{Cov}[\hat{\theta}_j, \hat{\theta}_k] = J_n(j, k)$.

    12.3.1 Multiparameter delta method

    Let $\tau = \varphi(\theta_1, \dots, \theta_k)$ and let the gradient of $\varphi$ be

    $\nabla\varphi = \begin{pmatrix} \frac{\partial\varphi}{\partial\theta_1} \\ \vdots \\ \frac{\partial\varphi}{\partial\theta_k} \end{pmatrix}$

    Suppose $\nabla\varphi\big|_{\theta=\hat{\theta}} \ne 0$ and let $\hat{\tau} = \varphi(\hat{\theta})$. Then,

    $\frac{\hat{\tau} - \tau}{\widehat{\mathrm{se}}(\hat{\tau})} \overset{D}{\to} \mathcal{N}(0, 1)$

    where

    $\widehat{\mathrm{se}}(\hat{\tau}) = \sqrt{\left(\widehat{\nabla}\varphi\right)^T \hat{J}_n \left(\widehat{\nabla}\varphi\right)}$

    and $\hat{J}_n = J_n(\hat{\theta})$ and $\widehat{\nabla}\varphi = \nabla\varphi\big|_{\theta=\hat{\theta}}$.

    12.4 Parametric Bootstrap

    Sample from $f(x; \hat{\theta}_n)$ instead of from $\hat{F}_n$, where $\hat{\theta}_n$ could be the mle or the method of moments estimator.

    13 Hypothesis Testing

    $H_0 : \theta \in \Theta_0$ versus $H_1 : \theta \in \Theta_1$

    Definitions

    - Null hypothesis $H_0$
    - Alternative hypothesis $H_1$
    - Simple hypothesis: $\theta = \theta_0$
    - Composite hypothesis: $\theta > \theta_0$ or $\theta < \theta_0$
    - Two-sided test: $H_0 : \theta = \theta_0$ versus $H_1 : \theta \ne \theta_0$
    - One-sided test: $H_0 : \theta \le \theta_0$ versus $H_1 : \theta > \theta_0$
    - Critical value $c$
    - Test statistic $T$
    - Rejection region $R = \{x : T(x) > c\}$
    - Power function $\beta(\theta) = \mathbb{P}_\theta[X \in R]$
    - Power of a test: $1 - \mathbb{P}[\text{Type II error}] = 1 - \beta = \inf_{\theta \in \Theta_1} \beta(\theta)$
    - Test size: $\alpha = \mathbb{P}[\text{Type I error}] = \sup_{\theta \in \Theta_0} \beta(\theta)$

                    Retain H0                     Reject H0
    H0 true         correct                       Type I error (α)
    H1 true         Type II error (β)             correct (power)

    p-value

    - p-value $= \sup_{\theta \in \Theta_0} \mathbb{P}_\theta[T(X) \ge T(x)] = \inf\{\alpha : T(x) \in R_\alpha\}$
    - p-value $= \sup_{\theta \in \Theta_0} \mathbb{P}_\theta[T(X^\star) \ge T(X)] \approx 1 - F(T(X))$ since $T(X^\star) \sim F$

    p-value        evidence
    < 0.01         very strong evidence against H0
    0.01 - 0.05    strong evidence against H0
    0.05 - 0.1     weak evidence against H0
    > 0.1          little or no evidence against H0

    Wald test

    - Two-sided test
    - Reject $H_0$ when $|W| > z_{\alpha/2}$ where $W = \frac{\hat{\theta} - \theta_0}{\widehat{\mathrm{se}}}$
    - $\mathbb{P}[|W| > z_{\alpha/2}] \to \alpha$
    - p-value $= \mathbb{P}_{\theta_0}[|W| > |w|] \approx \mathbb{P}[|Z| > |w|] = 2\Phi(-|w|)$

    Likelihood ratio test (LRT)

    $T(X) = \frac{\sup_{\theta \in \Theta} \mathcal{L}_n(\theta)}{\sup_{\theta \in \Theta_0} \mathcal{L}_n(\theta)} = \frac{\mathcal{L}_n(\hat{\theta}_n)}{\mathcal{L}_n(\hat{\theta}_{n,0})}$

  • $\lambda(X) = 2\log T(X) \overset{D}{\to} \chi^2_{r-q}$, where $\sum_{i=1}^{k} Z_i^2 \sim \chi^2_k$ for $Z_1, \dots, Z_k$ iid $\mathcal{N}(0, 1)$

    p-value $= \mathbb{P}_{\theta_0}[\lambda(X) > \lambda(x)] \approx \mathbb{P}\left[\chi^2_{r-q} > \lambda(x)\right]$

    Multinomial LRT

    - mle: $\hat{p}_n = \left(\frac{X_1}{n}, \dots, \frac{X_k}{n}\right)$
    - $T(X) = \frac{\mathcal{L}_n(\hat{p}_n)}{\mathcal{L}_n(p_0)} = \prod_{j=1}^{k} \left(\frac{\hat{p}_j}{p_{0j}}\right)^{X_j}$
    - $\lambda(X) = 2\sum_{j=1}^{k} X_j \log\left(\frac{\hat{p}_j}{p_{0j}}\right) \overset{D}{\to} \chi^2_{k-1}$
    - The approximate size $\alpha$ LRT rejects $H_0$ when $\lambda(X) \ge \chi^2_{k-1,\alpha}$

    Pearson Chi-square Test

    - $T = \sum_{j=1}^{k} \frac{(X_j - \mathbb{E}[X_j])^2}{\mathbb{E}[X_j]}$ where $\mathbb{E}[X_j] = n p_{0j}$ under $H_0$
    - $T \overset{D}{\to} \chi^2_{k-1}$
    - p-value $= \mathbb{P}\left[\chi^2_{k-1} > T(x)\right]$
    - Faster convergence to $\chi^2_{k-1}$ than the LRT, hence preferable for small $n$

    Independence testing

    - $I$ rows, $J$ columns, $X$ multinomial sample of size $n = I \cdot J$
    - mles unconstrained: $\hat{p}_{ij} = \frac{X_{ij}}{n}$
    - mles under $H_0$: $\hat{p}_{0ij} = \hat{p}_{i\cdot}\,\hat{p}_{\cdot j} = \frac{X_{i\cdot}}{n}\,\frac{X_{\cdot j}}{n}$
    - LRT: $\lambda = 2\sum_{i=1}^{I}\sum_{j=1}^{J} X_{ij}\log\left(\frac{n X_{ij}}{X_{i\cdot} X_{\cdot j}}\right)$
    - Pearson chi-square: $T = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{(X_{ij} - \mathbb{E}[X_{ij}])^2}{\mathbb{E}[X_{ij}]}$
    - LRT and Pearson $\overset{D}{\to} \chi^2_\nu$, where $\nu = (I - 1)(J - 1)$
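    A small sketch of the Pearson independence test on a 2x3 contingency table (made-up counts; assumes NumPy and SciPy):

    ```python
    # Sketch: Pearson chi-square test of independence on a made-up 2x3 table.
    import numpy as np
    from scipy import stats

    X = np.array([[20, 30, 25],
                  [15, 40, 20]])

    n = X.sum()
    expected = np.outer(X.sum(axis=1), X.sum(axis=0)) / n   # E[X_ij] = X_i. * X_.j / n
    T = ((X - expected) ** 2 / expected).sum()
    nu = (X.shape[0] - 1) * (X.shape[1] - 1)
    p_value = stats.chi2.sf(T, df=nu)

    print(f"T = {T:.3f}, df = {nu}, p-value = {p_value:.3f}")
    ```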

    14 Bayesian Inference

    Bayes' Theorem

    $f(\theta \,|\, x^n) = \frac{f(x^n \,|\, \theta) f(\theta)}{f(x^n)} = \frac{f(x^n \,|\, \theta) f(\theta)}{\int f(x^n \,|\, \theta) f(\theta)\,d\theta} \propto \mathcal{L}_n(\theta) f(\theta)$

    Definitions

    - $X^n = (X_1, \dots, X_n)$, $\quad x^n = (x_1, \dots, x_n)$
    - Prior density $f(\theta)$
    - Likelihood $f(x^n \,|\, \theta)$: joint density of the data; in particular, if $X^n$ is iid, $f(x^n \,|\, \theta) = \prod_{i=1}^{n} f(x_i \,|\, \theta) = \mathcal{L}_n(\theta)$
    - Posterior density $f(\theta \,|\, x^n)$
    - Normalizing constant $c_n = f(x^n) = \int f(x^n \,|\, \theta) f(\theta)\,d\theta$
    - Kernel: part of a density that depends on $\theta$
    - Posterior mean $\bar{\theta}_n = \int \theta f(\theta \,|\, x^n)\,d\theta = \frac{\int \theta\,\mathcal{L}_n(\theta) f(\theta)\,d\theta}{\int \mathcal{L}_n(\theta) f(\theta)\,d\theta}$

    14.1 Credible Intervals

    Posterior interval

    $\mathbb{P}[\theta \in (a, b) \,|\, x^n] = \int_a^b f(\theta \,|\, x^n)\,d\theta = 1 - \alpha$

    Equal-tail credible interval

    $\int_{-\infty}^{a} f(\theta \,|\, x^n)\,d\theta = \int_{b}^{\infty} f(\theta \,|\, x^n)\,d\theta = \alpha/2$

    Highest posterior density (HPD) region $R_n$:

    1. $\mathbb{P}[\theta \in R_n] = 1 - \alpha$
    2. $R_n = \{\theta : f(\theta \,|\, x^n) > k\}$ for some $k$

    $R_n$ is unimodal $\implies R_n$ is an interval.

    14.2 Function of parameters

    Let $\tau = \varphi(\theta)$ and $A = \{\theta : \varphi(\theta) \le \tau\}$.

    Posterior CDF for $\tau$

    $H(\tau \,|\, x^n) = \mathbb{P}[\varphi(\theta) \le \tau \,|\, x^n] = \int_A f(\theta \,|\, x^n)\,d\theta$

    Posterior density

    $h(\tau \,|\, x^n) = H'(\tau \,|\, x^n)$

    Bayesian delta method

    $\tau \,|\, X^n \approx \mathcal{N}\left(\varphi(\hat{\theta}),\ \widehat{\mathrm{se}}\,|\varphi'(\hat{\theta})|\right)$

  • 14.3 Priors

    Choice

    - Subjective Bayesianism
    - Objective Bayesianism
    - Robust Bayesianism

    Types

    - Flat: $f(\theta) \propto$ constant
    - Proper: $\int f(\theta)\,d\theta = 1$
    - Improper: $\int f(\theta)\,d\theta = \infty$
    - Jeffreys' prior (transformation-invariant): $f(\theta) \propto \sqrt{I(\theta)}$; multiparameter: $f(\theta) \propto \sqrt{\det(I(\theta))}$
    - Conjugate: $f(\theta)$ and $f(\theta \,|\, x^n)$ belong to the same parametric family

    14.3.1 Conjugate Priors

    Discrete likelihood

    Likelihood        Conjugate prior    Posterior hyperparameters
    Bern(p)           Beta(α, β)         α + Σᵢ xᵢ,  β + n − Σᵢ xᵢ
    Bin(p)            Beta(α, β)         α + Σᵢ xᵢ,  β + Σᵢ Nᵢ − Σᵢ xᵢ
    NBin(p)           Beta(α, β)         α + rn,  β + Σᵢ xᵢ
    Po(λ)             Gamma(α, β)        α + Σᵢ xᵢ,  β + n
    Multinomial(p)    Dir(α)             α + Σᵢ x⁽ⁱ⁾
    Geo(p)            Beta(α, β)         α + n,  β + Σᵢ xᵢ

    Continuous likelihood (subscript c denotes a known constant)

    Likelihood        Conjugate prior                           Posterior hyperparameters
    Unif(0, θ)        Pareto(x_m, k)                            max{x_(n), x_m},  k + n
    Exp(λ)            Gamma(α, β)                               α + n,  β + Σᵢ xᵢ
    N(μ, σ_c²)        N(μ₀, σ₀²)                                (μ₀/σ₀² + Σᵢ xᵢ/σ_c²) / (1/σ₀² + n/σ_c²),  (1/σ₀² + n/σ_c²)⁻¹
    N(μ_c, σ²)        Scaled Inverse Chi-square(ν, σ₀²)         ν + n,  (ν σ₀² + Σᵢ (xᵢ − μ)²) / (ν + n)
    N(μ, σ²)          Normal-scaled Inverse Gamma(λ, ν, α, β)   (νλ + n x̄)/(ν + n),  ν + n,  α + n/2,  β + ½ Σᵢ (xᵢ − x̄)² + n ν (x̄ − λ)² / (2(n + ν))
    MVN(μ, Σ_c)       MVN(μ₀, Σ₀)                               (Σ₀⁻¹ + n Σ_c⁻¹)⁻¹ (Σ₀⁻¹ μ₀ + n Σ_c⁻¹ x̄),  (Σ₀⁻¹ + n Σ_c⁻¹)⁻¹
    MVN(μ_c, Σ)       Inverse-Wishart(κ, Ψ)                     n + κ,  Ψ + Σᵢ (xᵢ − μ_c)(xᵢ − μ_c)ᵀ
    Pareto(x_mc, k)   Gamma(α, β)                               α + n,  β + Σᵢ log(xᵢ / x_mc)
    Pareto(x_m, k_c)  Pareto(x₀, k₀)                            x₀,  k₀ − k n  where k₀ > k n
    Gamma(α_c, β)     Gamma(α₀, β₀)                             α₀ + n α_c,  β₀ + Σᵢ xᵢ

    14.4 Bayesian Testing

    If $H_0 : \theta \in \Theta_0$:

    - Prior probability $\mathbb{P}[H_0] = \int_{\Theta_0} f(\theta)\,d\theta$
    - Posterior probability $\mathbb{P}[H_0 \,|\, x^n] = \int_{\Theta_0} f(\theta \,|\, x^n)\,d\theta$

    Let $H_0, \dots, H_{K-1}$ be $K$ hypotheses. Suppose $\theta \sim f(\theta \,|\, H_k)$:

    $\mathbb{P}[H_k \,|\, x^n] = \frac{f(x^n \,|\, H_k)\,\mathbb{P}[H_k]}{\sum_{k=1}^{K} f(x^n \,|\, H_k)\,\mathbb{P}[H_k]}$

    Marginal likelihood

    $f(x^n \,|\, H_i) = \int_\Theta f(x^n \,|\, \theta, H_i)\,f(\theta \,|\, H_i)\,d\theta$

    Posterior odds (of $H_i$ relative to $H_j$)

    $\frac{\mathbb{P}[H_i \,|\, x^n]}{\mathbb{P}[H_j \,|\, x^n]} = \frac{f(x^n \,|\, H_i)}{f(x^n \,|\, H_j)} \cdot \frac{\mathbb{P}[H_i]}{\mathbb{P}[H_j]}$, i.e., Bayes factor $BF_{ij}$ times the prior odds.

    Bayes factor

    log10 BF10    BF10        evidence
    0 - 0.5       1 - 1.5     Weak
    0.5 - 1       1.5 - 10    Moderate
    1 - 2         10 - 100    Strong
    > 2           > 100       Decisive

    $p^* = \frac{\frac{p}{1-p}\,BF_{10}}{1 + \frac{p}{1-p}\,BF_{10}}$, where $p = \mathbb{P}[H_1]$ and $p^* = \mathbb{P}[H_1 \,|\, x^n]$

    15 Exponential Family

    Scalar parameter

    $f_X(x \,|\, \theta) = h(x)\exp\{\eta(\theta) T(x) - A(\theta)\} = h(x) g(\theta) \exp\{\eta(\theta) T(x)\}$

    Vector parameter

    $f_X(x \,|\, \theta) = h(x)\exp\left\{\sum_{i=1}^{s}\eta_i(\theta) T_i(x) - A(\theta)\right\} = h(x)\exp\{\eta(\theta) \cdot T(x) - A(\theta)\} = h(x) g(\theta) \exp\{\eta(\theta) \cdot T(x)\}$

    Natural form

    $f_X(x \,|\, \eta) = h(x)\exp\{\eta \cdot T(x) - A(\eta)\} = h(x) g(\eta) \exp\{\eta \cdot T(x)\} = h(x) g(\eta) \exp\left\{\eta^T T(x)\right\}$

    16 Sampling Methods

    16.1 The Bootstrap

    Let $T_n = g(X_1, \dots, X_n)$ be a statistic.

    1. Estimate $\mathbb{V}_F[T_n]$ with $\mathbb{V}_{\hat{F}_n}[T_n]$.
    2. Approximate $\mathbb{V}_{\hat{F}_n}[T_n]$ using simulation:
       (a) Repeat the following $B$ times to get $T^*_{n,1}, \dots, T^*_{n,B}$, an iid sample from the sampling distribution implied by $\hat{F}_n$:
           i. Sample uniformly $X^*_1, \dots, X^*_n \sim \hat{F}_n$.
           ii. Compute $T^*_n = g(X^*_1, \dots, X^*_n)$.
       (b) Then
           $v_{boot} = \hat{\mathbb{V}}_{\hat{F}_n} = \frac{1}{B}\sum_{b=1}^{B}\left(T^*_{n,b} - \frac{1}{B}\sum_{r=1}^{B} T^*_{n,r}\right)^2$

    16.1.1 Bootstrap Confidence Intervals

    Normal-based interval

    $T_n \pm z_{\alpha/2}\,\widehat{\mathrm{se}}_{boot}$

    Pivotal interval

    1. Location parameter $\theta = T(F)$
    2. Pivot $R_n = \hat{\theta}_n - \theta$
    3. Let $H(r) = \mathbb{P}[R_n \le r]$ be the cdf of $R_n$
    4. Let $R^*_{n,b} = \hat{\theta}^*_{n,b} - \hat{\theta}_n$. Approximate $H$ using the bootstrap:
       $\hat{H}(r) = \frac{1}{B}\sum_{b=1}^{B} I(R^*_{n,b} \le r)$
    5. $\theta^*_\beta$ = $\beta$ sample quantile of $(\hat{\theta}^*_{n,1}, \dots, \hat{\theta}^*_{n,B})$
    6. $r^*_\beta$ = $\beta$ sample quantile of $(R^*_{n,1}, \dots, R^*_{n,B})$, i.e., $r^*_\beta = \theta^*_\beta - \hat{\theta}_n$
    7. Approximate $1 - \alpha$ confidence interval $C_n = (\hat{a}, \hat{b})$ where
       $\hat{a} = \hat{\theta}_n - \hat{H}^{-1}\left(1 - \frac{\alpha}{2}\right) = \hat{\theta}_n - r^*_{1-\alpha/2} = 2\hat{\theta}_n - \theta^*_{1-\alpha/2}$
       $\hat{b} = \hat{\theta}_n - \hat{H}^{-1}\left(\frac{\alpha}{2}\right) = \hat{\theta}_n - r^*_{\alpha/2} = 2\hat{\theta}_n - \theta^*_{\alpha/2}$

    Percentile interval

    $C_n = \left(\theta^*_{\alpha/2},\ \theta^*_{1-\alpha/2}\right)$

  • 16.2 Rejection Sampling

    Setup

    - We can easily sample from $g(\theta)$
    - We want to sample from $h(\theta)$, but it is difficult
    - We know $h(\theta)$ up to a proportional constant: $h(\theta) = \frac{k(\theta)}{\int k(\theta)\,d\theta}$
    - Envelope condition: we can find $M > 0$ such that $k(\theta) \le M g(\theta)$ for all $\theta$

    Algorithm

    1. Draw $\theta^{cand} \sim g(\theta)$
    2. Generate $u \sim \mathrm{Unif}(0, 1)$
    3. Accept $\theta^{cand}$ if $u \le \frac{k(\theta^{cand})}{M g(\theta^{cand})}$
    4. Repeat until $B$ values of $\theta^{cand}$ have been accepted

    Example

    - We can easily sample from the prior, $g(\theta) = f(\theta)$
    - Target is the posterior, $h(\theta) \propto k(\theta) = f(x^n \,|\, \theta) f(\theta)$
    - Envelope condition: $f(x^n \,|\, \theta) \le f(x^n \,|\, \hat{\theta}_n) = \mathcal{L}_n(\hat{\theta}_n) \equiv M$
    - Algorithm:
      1. Draw $\theta^{cand} \sim f(\theta)$
      2. Generate $u \sim \mathrm{Unif}(0, 1)$
      3. Accept $\theta^{cand}$ if $u \le \frac{\mathcal{L}_n(\theta^{cand})}{\mathcal{L}_n(\hat{\theta}_n)}$
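    A generic rejection-sampling sketch (assuming NumPy; the target here is a Beta(2, 5) density known only up to its kernel, with a Unif(0, 1) proposal as an arbitrary example):

    ```python
    # Sketch: rejection sampling from a density known up to a constant.
    # Target kernel k(theta) = theta * (1 - theta)^4 (Beta(2, 5) up to a constant),
    # proposal g = Unif(0, 1), envelope constant M chosen so k(theta) <= M * g(theta).
    import numpy as np

    rng = np.random.default_rng(7)

    def k(theta):
        return theta * (1 - theta) ** 4

    M = 0.1                       # any M >= max k (here max k is about 0.082)
    accepted = []
    while len(accepted) < 5_000:
        cand = rng.uniform()      # theta_cand ~ g
        u = rng.uniform()
        if u <= k(cand) / M:      # accept with probability k(cand) / (M g(cand))
            accepted.append(cand)

    samples = np.array(accepted)
    print("sample mean:", samples.mean(), "(Beta(2, 5) mean = 2/7 ~ 0.286)")
    ```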

    16.3 Importance Sampling

    Sample from an importance function $g$ rather than the target density $h$.

    Algorithm to obtain an approximation to $\mathbb{E}[q(\theta) \,|\, x^n]$:

    1. Sample from the prior $\theta_1, \dots, \theta_B \overset{iid}{\sim} f(\theta)$
    2. $w_i = \frac{\mathcal{L}_n(\theta_i)}{\sum_{i=1}^{B}\mathcal{L}_n(\theta_i)}$, $\quad i = 1, \dots, B$
    3. $\mathbb{E}[q(\theta) \,|\, x^n] \approx \sum_{i=1}^{B} q(\theta_i)\,w_i$

  • 17 Decision Theory

    Definitions

    - Unknown quantity affecting our decision: $\theta \in \Theta$
    - Decision rule: synonymous with an estimator $\hat{\theta}$
    - Action $a \in \mathcal{A}$: possible value of the decision rule; in the estimation context, the action is just an estimate of $\theta$, $\hat{\theta}(x)$
    - Loss function $L$: consequences of taking action $a$ when the true state is $\theta$, or the discrepancy between $\theta$ and $a$; $L : \Theta \times \mathcal{A} \to [-k, \infty)$

    Loss functions

    - Squared error loss: $L(\theta, a) = (\theta - a)^2$
    - Linear loss: $L(\theta, a) = \begin{cases} K_1(\theta - a) & a - \theta < 0 \\ K_2(a - \theta) & a - \theta \ge 0 \end{cases}$
    - Absolute error loss: $L(\theta, a) = |\theta - a|$ (linear loss with $K_1 = K_2 = 1$)
    - $L_p$ loss: $L(\theta, a) = |\theta - a|^p$
    - Zero-one loss: $L(\theta, a) = \begin{cases} 0 & a = \theta \\ 1 & a \ne \theta \end{cases}$

    17.1 Risk

    Posterior risk

    $r(\hat{\theta} \,|\, x) = \int L(\theta, \hat{\theta}(x))\,f(\theta \,|\, x)\,d\theta = \mathbb{E}_{\theta|X}\left[L(\theta, \hat{\theta}(x))\right]$

    (Frequentist) risk

    $R(\theta, \hat{\theta}) = \int L(\theta, \hat{\theta}(x))\,f(x \,|\, \theta)\,dx = \mathbb{E}_{X|\theta}\left[L(\theta, \hat{\theta}(X))\right]$

    Bayes risk

    $r(f, \hat{\theta}) = \int\!\!\int L(\theta, \hat{\theta}(x))\,f(x, \theta)\,dx\,d\theta = \mathbb{E}_{\theta,X}\left[L(\theta, \hat{\theta}(X))\right]$

    $r(f, \hat{\theta}) = \mathbb{E}_\theta\left[\mathbb{E}_{X|\theta}\left[L(\theta, \hat{\theta}(X))\right]\right] = \mathbb{E}_\theta\left[R(\theta, \hat{\theta})\right]$

    $r(f, \hat{\theta}) = \mathbb{E}_X\left[\mathbb{E}_{\theta|X}\left[L(\theta, \hat{\theta}(X))\right]\right] = \mathbb{E}_X\left[r(\hat{\theta} \,|\, X)\right]$

    17.2 Admissibility

    - $\hat{\theta}'$ dominates $\hat{\theta}$ if $\forall\theta : R(\theta, \hat{\theta}') \le R(\theta, \hat{\theta})$ and $\exists\theta : R(\theta, \hat{\theta}') < R(\theta, \hat{\theta})$
    - $\hat{\theta}$ is inadmissible if there is at least one other estimator $\hat{\theta}'$ that dominates it. Otherwise it is called admissible.

  • 17.3 Bayes Rule

    Bayes rule (or Bayes estimator)

    - $r(f, \hat{\theta}) = \inf_{\tilde{\theta}} r(f, \tilde{\theta})$
    - $\hat{\theta}(x) = \inf_a r(a \,|\, x)$ for every $x \implies r(f, \hat{\theta}) = \int r(\hat{\theta}(x) \,|\, x)\,f(x)\,dx$

    Theorems

    - Squared error loss: posterior mean
    - Absolute error loss: posterior median
    - Zero-one loss: posterior mode

    17.4 Minimax Rules

    Maximum risk

    $\bar{R}(\hat{\theta}) = \sup_\theta R(\theta, \hat{\theta}) \qquad \bar{R}(a) = \sup_\theta R(\theta, a)$

    Minimax rule

    $\sup_\theta R(\theta, \hat{\theta}) = \inf_{\tilde{\theta}} \bar{R}(\tilde{\theta}) = \inf_{\tilde{\theta}} \sup_\theta R(\theta, \tilde{\theta})$

    $\hat{\theta}$ = Bayes rule with constant risk ($\exists c : R(\theta, \hat{\theta}) = c$) $\implies \hat{\theta}$ is minimax

    Least favorable prior

    $\hat{\theta}^f$ = Bayes rule and $R(\theta, \hat{\theta}^f) \le r(f, \hat{\theta}^f)\ \forall\theta \implies \hat{\theta}^f$ is minimax and $f$ is least favorable

    18 Linear Regression

    Definitions

    - Response variable $Y$
    - Covariate $X$ (aka predictor variable or feature)

    18.1 Simple Linear Regression

    Model

    $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \qquad \mathbb{E}[\varepsilon_i \,|\, X_i] = 0,\ \mathbb{V}[\varepsilon_i \,|\, X_i] = \sigma^2$

    Fitted line

    $\hat{r}(x) = \hat{\beta}_0 + \hat{\beta}_1 x$

    Predicted (fitted) values

    $\hat{Y}_i = \hat{r}(X_i)$

    Residuals

    $\hat{\varepsilon}_i = Y_i - \hat{Y}_i = Y_i - \left(\hat{\beta}_0 + \hat{\beta}_1 X_i\right)$

    Residual sums of squares (rss)

    $\mathrm{rss}(\hat{\beta}_0, \hat{\beta}_1) = \sum_{i=1}^{n} \hat{\varepsilon}_i^2$

    Least square estimates

    $\hat{\beta}^T = (\hat{\beta}_0, \hat{\beta}_1)^T : \min_{\beta_0, \beta_1} \mathrm{rss}$

    $\hat{\beta}_0 = \bar{Y}_n - \hat{\beta}_1 \bar{X}_n$

    $\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sum_{i=1}^{n} (X_i - \bar{X}_n)^2} = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}$

    $\mathbb{E}[\hat{\beta} \,|\, X^n] = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} \qquad \mathbb{V}[\hat{\beta} \,|\, X^n] = \frac{\sigma^2}{n s_X^2}\begin{pmatrix} \frac{1}{n}\sum_{i=1}^{n} X_i^2 & -\bar{X}_n \\ -\bar{X}_n & 1 \end{pmatrix}$

    $\widehat{\mathrm{se}}(\hat{\beta}_0) = \frac{\hat{\sigma}}{s_X\sqrt{n}}\sqrt{\frac{\sum_{i=1}^{n} X_i^2}{n}} \qquad \widehat{\mathrm{se}}(\hat{\beta}_1) = \frac{\hat{\sigma}}{s_X\sqrt{n}}$

    where $s_X^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X}_n)^2$ and $\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^{n} \hat{\varepsilon}_i^2$ (unbiased estimate).

    Further properties:

    - Consistency: $\hat{\beta}_0 \overset{P}{\to} \beta_0$ and $\hat{\beta}_1 \overset{P}{\to} \beta_1$
    - Asymptotic normality: $\frac{\hat{\beta}_0 - \beta_0}{\widehat{\mathrm{se}}(\hat{\beta}_0)} \overset{D}{\to} \mathcal{N}(0, 1)$ and $\frac{\hat{\beta}_1 - \beta_1}{\widehat{\mathrm{se}}(\hat{\beta}_1)} \overset{D}{\to} \mathcal{N}(0, 1)$
    - Approximate $1 - \alpha$ confidence intervals for $\beta_0$ and $\beta_1$: $\hat{\beta}_0 \pm z_{\alpha/2}\,\widehat{\mathrm{se}}(\hat{\beta}_0)$ and $\hat{\beta}_1 \pm z_{\alpha/2}\,\widehat{\mathrm{se}}(\hat{\beta}_1)$
    - Wald test for $H_0 : \beta_1 = 0$ vs. $H_1 : \beta_1 \ne 0$: reject $H_0$ if $|W| > z_{\alpha/2}$ where $W = \hat{\beta}_1 / \widehat{\mathrm{se}}(\hat{\beta}_1)$

    $R^2$

    $R^2 = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} = 1 - \frac{\sum_{i=1}^{n} \hat{\varepsilon}_i^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} = 1 - \frac{\mathrm{rss}}{\mathrm{tss}}$

  • Likelihood

    $\mathcal{L} = \prod_{i=1}^{n} f(X_i, Y_i) = \prod_{i=1}^{n} f_X(X_i) \prod_{i=1}^{n} f_{Y|X}(Y_i \,|\, X_i) = \mathcal{L}_1 \cdot \mathcal{L}_2$

    $\mathcal{L}_1 = \prod_{i=1}^{n} f_X(X_i)$

    $\mathcal{L}_2 = \prod_{i=1}^{n} f_{Y|X}(Y_i \,|\, X_i) \propto \sigma^{-n}\exp\left\{-\frac{1}{2\sigma^2}\sum_i\left(Y_i - (\beta_0 + \beta_1 X_i)\right)^2\right\}$

    Under the assumption of normality, the least squares parameter estimators are also the MLEs, but the least squares variance estimator is not the MLE:

    $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} \hat{\varepsilon}_i^2$
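    A minimal NumPy sketch of the closed-form least squares estimates above (simulated data with arbitrary true coefficients):

    ```python
    # Sketch: closed-form simple linear regression estimates and standard errors.
    import numpy as np

    rng = np.random.default_rng(8)
    n = 200
    x = rng.uniform(0, 10, size=n)
    y = 1.5 + 0.8 * x + rng.normal(scale=1.0, size=n)   # beta0 = 1.5, beta1 = 0.8

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()

    resid = y - (b0 + b1 * x)
    sigma2_hat = resid @ resid / (n - 2)                # unbiased variance estimate
    s2_x = np.mean((x - x.mean()) ** 2)
    se_b1 = np.sqrt(sigma2_hat) / (np.sqrt(s2_x) * np.sqrt(n))
    se_b0 = se_b1 * np.sqrt(np.mean(x ** 2))

    print(f"b0 = {b0:.3f} (se {se_b0:.3f}), b1 = {b1:.3f} (se {se_b1:.3f})")
    ```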

    18.2 Prediction

    Observe a value $X = x_*$ of the covariate and predict the outcome $Y_*$.

    $\hat{Y}_* = \hat{\beta}_0 + \hat{\beta}_1 x_*$

    $\mathbb{V}[\hat{Y}_*] = \mathbb{V}[\hat{\beta}_0] + x_*^2\,\mathbb{V}[\hat{\beta}_1] + 2 x_*\,\mathrm{Cov}[\hat{\beta}_0, \hat{\beta}_1]$

    Prediction interval

    $\hat{\xi}_n^2 = \hat{\sigma}^2\left(\frac{\sum_{i=1}^{n} (X_i - x_*)^2}{n\sum_{i=1}^{n} (X_i - \bar{X})^2} + 1\right) \qquad \hat{Y}_* \pm z_{\alpha/2}\,\hat{\xi}_n$

    18.3 Multiple Regression

    $Y = X\beta + \varepsilon$

    where

    $X = \begin{pmatrix} X_{11} & \cdots & X_{1k} \\ \vdots & \ddots & \vdots \\ X_{n1} & \cdots & X_{nk} \end{pmatrix} \qquad \beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} \qquad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}$

    Likelihood

    $\mathcal{L}(\beta, \sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\,\mathrm{rss}\right\}$

    $\mathrm{rss} = (y - X\beta)^T(y - X\beta) = \|Y - X\beta\|^2 = \sum_{i=1}^{N}(Y_i - x_i^T\beta)^2$

    If the $(k \times k)$ matrix $X^T X$ is invertible,

    - $\hat{\beta} = (X^T X)^{-1} X^T Y$
    - $\mathbb{V}[\hat{\beta} \,|\, X^n] = \sigma^2 (X^T X)^{-1}$
    - $\hat{\beta} \approx \mathcal{N}\left(\beta, \sigma^2 (X^T X)^{-1}\right)$

    Estimate regression function

    $\hat{r}(x) = \sum_{j=1}^{k} \hat{\beta}_j x_j$

    Unbiased estimate for $\sigma^2$

    $\hat{\sigma}^2 = \frac{1}{n - k}\sum_{i=1}^{n} \hat{\varepsilon}_i^2 \qquad \hat{\varepsilon} = X\hat{\beta} - Y$

    mle

    $\hat{\mu} = X\hat{\beta} \qquad \hat{\sigma}^2_{mle} = \frac{n - k}{n}\,\hat{\sigma}^2$

    $1 - \alpha$ confidence interval

    $\hat{\beta}_j \pm z_{\alpha/2}\,\widehat{\mathrm{se}}(\hat{\beta}_j)$
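    The normal-equations estimate above takes a few lines of NumPy; a sketch with a simulated design matrix (lstsq is used for numerical stability rather than forming the inverse explicitly for the fit):

    ```python
    # Sketch: multiple regression via least squares and the normal equations.
    import numpy as np

    rng = np.random.default_rng(9)
    n, k = 500, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # intercept column
    beta_true = np.array([2.0, -1.0, 0.5])
    y = X @ beta_true + rng.normal(scale=0.8, size=n)

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # solves min ||y - X b||^2
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - k)               # unbiased sigma^2 estimate
    cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)     # V[beta_hat | X]

    print("beta_hat:", beta_hat.round(3))
    print("se:      ", np.sqrt(np.diag(cov_beta)).round(3))
    ```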

    18.4 Model Selection

    Consider predicting a new observation $Y_*$ for covariates $X_*$ and let $S \subseteq J$ denote a subset of the covariates in the model, where $|S| = k$ and $|J| = n$.

    Issues

    - Underfitting: too few covariates yields high bias
    - Overfitting: too many covariates yields high variance

    Procedure

    1. Assign a score to each model
    2. Search through all models to find the one with the highest score

    Hypothesis testing

    $H_0 : \beta_j = 0$ vs. $H_1 : \beta_j \ne 0 \quad \forall j \in J$

    Mean squared prediction error (mspe)

    $\mathrm{mspe} = \mathbb{E}\left[(\hat{Y}(S) - Y_*)^2\right]$

    Prediction risk

    $R(S) = \sum_{i=1}^{n} \mathrm{mspe}_i = \sum_{i=1}^{n} \mathbb{E}\left[(\hat{Y}_i(S) - Y^*_i)^2\right]$

  • Training error

    $\hat{R}_{tr}(S) = \sum_{i=1}^{n} (\hat{Y}_i(S) - Y_i)^2$

    $R^2$

    $R^2(S) = 1 - \frac{\mathrm{rss}(S)}{\mathrm{tss}} = 1 - \frac{\hat{R}_{tr}(S)}{\mathrm{tss}} = 1 - \frac{\sum_{i=1}^{n} (\hat{Y}_i(S) - Y_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$

    The training error is a downward-biased estimate of the prediction risk:

    $\mathbb{E}[\hat{R}_{tr}(S)] < R(S)$

    $\mathrm{bias}(\hat{R}_{tr}(S)) = \mathbb{E}[\hat{R}_{tr}(S)] - R(S) = -2\sum_{i=1}^{n}\mathrm{Cov}[\hat{Y}_i, Y_i]$

    Adjusted $R^2$

    $\bar{R}^2(S) = 1 - \frac{n - 1}{n - k}\,\frac{\mathrm{rss}}{\mathrm{tss}}$

    Mallows' $C_p$ statistic

    $\hat{R}(S) = \hat{R}_{tr}(S) + 2 k \hat{\sigma}^2 = \text{lack of fit} + \text{complexity penalty}$

    Akaike Information Criterion (AIC)

    $\mathrm{AIC}(S) = \ell_n(\hat{\beta}_S, \hat{\sigma}^2_S) - k$

    Bayesian Information Criterion (BIC)

    $\mathrm{BIC}(S) = \ell_n(\hat{\beta}_S, \hat{\sigma}^2_S) - \frac{k}{2}\log n$

    Validation and training

    $\hat{R}_V(S) = \sum_{i=1}^{m} (\hat{Y}^*_i(S) - Y^*_i)^2 \qquad m = |\{\text{validation data}\}|$, often $\frac{n}{4}$ or $\frac{n}{2}$

    Leave-one-out cross-validation

    $\hat{R}_{CV}(S) = \sum_{i=1}^{n} (Y_i - \hat{Y}_{(i)})^2 = \sum_{i=1}^{n}\left(\frac{Y_i - \hat{Y}_i(S)}{1 - U_{ii}(S)}\right)^2$

    $U(S) = X_S (X_S^T X_S)^{-1} X_S^T$ (hat matrix)

  • 19 Non-parametric Function Estimation

    19.1 Density Estimation

    Estimate $f(x)$, where $\mathbb{P}[X \in A] = \int_A f(x)\,dx$.

    Integrated square error (ise)

    $L(f, \hat{f}_n) = \int\left(f(x) - \hat{f}_n(x)\right)^2 dx = J(h) + \int f^2(x)\,dx$

    Frequentist risk

    $R(f, \hat{f}_n) = \mathbb{E}\left[L(f, \hat{f}_n)\right] = \int b^2(x)\,dx + \int v(x)\,dx$

    $b(x) = \mathbb{E}[\hat{f}_n(x)] - f(x) \qquad v(x) = \mathbb{V}[\hat{f}_n(x)]$

    19.1.1 Histograms

    Definitions

    - Number of bins $m$
    - Binwidth $h = \frac{1}{m}$
    - Bin $B_j$ has $\nu_j$ observations
    - Define $\hat{p}_j = \nu_j / n$ and $p_j = \int_{B_j} f(u)\,du$

    Histogram estimator

    $\hat{f}_n(x) = \sum_{j=1}^{m}\frac{\hat{p}_j}{h}\,I(x \in B_j)$

    - $\mathbb{E}[\hat{f}_n(x)] = \frac{p_j}{h}$
    - $\mathbb{V}[\hat{f}_n(x)] = \frac{p_j(1 - p_j)}{n h^2}$
    - $R(\hat{f}_n, f) \approx \frac{h^2}{12}\int (f'(u))^2\,du + \frac{1}{nh}$
    - $h^* = \frac{1}{n^{1/3}}\left(\frac{6}{\int (f'(u))^2\,du}\right)^{1/3}$
    - $R^*(\hat{f}_n, f) \approx \frac{C}{n^{2/3}}$, $\quad C = \left(\frac{3}{4}\right)^{2/3}\left(\int (f'(u))^2\,du\right)^{1/3}$

    Cross-validation estimate of $\mathbb{E}[J(h)]$

    $\hat{J}_{CV}(h) = \int \hat{f}_n^2(x)\,dx - \frac{2}{n}\sum_{i=1}^{n}\hat{f}_{(-i)}(X_i) = \frac{2}{(n-1)h} - \frac{n+1}{(n-1)h}\sum_{j=1}^{m}\hat{p}_j^2$

  • 19.1.2 Kernel Density Estimator (KDE)

    Kernel $K$

    - $K(x) \ge 0$
    - $\int K(x)\,dx = 1$
    - $\int x K(x)\,dx = 0$
    - $\int x^2 K(x)\,dx \equiv \sigma^2_K > 0$

    KDE

    $\hat{f}_n(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h}K\!\left(\frac{x - X_i}{h}\right)$

    $R(f, \hat{f}_n) \approx \frac{1}{4}(h\sigma_K)^4\int (f''(x))^2\,dx + \frac{1}{nh}\int K^2(x)\,dx$

    $h^* = \frac{c_1^{-2/5} c_2^{1/5} c_3^{-1/5}}{n^{1/5}}$, $\quad c_1 = \sigma^2_K$, $c_2 = \int K^2(x)\,dx$, $c_3 = \int (f''(x))^2\,dx$

    $R^*(f, \hat{f}_n) = \frac{c_4}{n^{4/5}}$, $\quad c_4 = \frac{5}{4}\left(\sigma^2_K\right)^{2/5}\left(\int K^2(x)\,dx\right)^{4/5}\left(\int (f'')^2\,dx\right)^{1/5} \equiv \frac{5}{4}\,C(K)\left(\int (f'')^2\,dx\right)^{1/5}$
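    A minimal Gaussian-kernel KDE sketch (assuming NumPy; the bandwidth here is fixed by hand rather than by the optimal-h formula above):

    ```python
    # Sketch: Gaussian kernel density estimate evaluated on a grid.
    import numpy as np

    rng = np.random.default_rng(10)
    x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
    h = 0.4                                   # bandwidth, chosen by hand for this sketch

    grid = np.linspace(-6, 6, 200)
    # f_hat(x) = (1/n) sum_i (1/h) K((x - X_i)/h) with a standard normal kernel K
    u = (grid[:, None] - x[None, :]) / h
    f_hat = np.exp(-0.5 * u**2).sum(axis=1) / (x.size * h * np.sqrt(2 * np.pi))

    dx = grid[1] - grid[0]
    print("integral of f_hat ~", (f_hat * dx).sum())    # should be close to 1
    ```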

    Epanechnikov Kernel

    $K(x) = \begin{cases} \frac{3}{4\sqrt{5}}\left(1 - x^2/5\right) & |x| < \sqrt{5} \\ 0 & \text{otherwise} \end{cases}$

    20 Stochastic Processes

    20.1 Markov Chains

    Transition matrix $P$ with elements $p_{ij} = \mathbb{P}[X_{n+1} = j \,|\, X_n = i]$, $\quad p_{ij} \ge 0$, $\quad \sum_j p_{ij} = 1$

    Chapman-Kolmogorov

    $p_{ij}(m + n) = \sum_{k} p_{ik}(m)\,p_{kj}(n)$

    $P_{m+n} = P_m P_n \qquad P_n = P \cdot P \cdots P = P^n$

    Marginal probability

    $\mu_n = (\mu_n(1), \dots, \mu_n(N))$ where $\mu_n(i) = \mathbb{P}[X_n = i]$

    $\mu_0$: initial distribution

    $\mu_n = \mu_0 P^n$
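    A small sketch of $\mu_n = \mu_0 P^n$ for a made-up 3-state chain (assuming NumPy):

    ```python
    # Sketch: evolving the marginal distribution of a 3-state Markov chain.
    import numpy as np

    P = np.array([[0.9, 0.1, 0.0],     # row-stochastic transition matrix
                  [0.2, 0.7, 0.1],
                  [0.0, 0.3, 0.7]])
    mu0 = np.array([1.0, 0.0, 0.0])    # start in state 1 with probability one

    mu = mu0.copy()
    for n in range(50):                # mu_n = mu_0 P^n, computed iteratively
        mu = mu @ P

    print("mu_50:", mu.round(4))       # approaches the stationary distribution
    ```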

    20.2 Poisson Processes

    Poisson process

    - $\{X_t : t \in [0, \infty)\}$ = number of events up to and including time $t$
    - $X_0 = 0$
    - Independent increments: $\forall t_0 < \dots < t_n$: $X_{t_1} - X_{t_0} \perp \dots \perp X_{t_n} - X_{t_{n-1}}$
    - Intensity function $\lambda(t)$:
      $\mathbb{P}[X_{t+h} - X_t = 1] = \lambda(t)h + o(h)$
      $\mathbb{P}[X_{t+h} - X_t = 2] = o(h)$
    - $X_{s+t} - X_s \sim \mathrm{Po}(m(s + t) - m(s))$ where $m(t) = \int_0^t \lambda(s)\,ds$

    Homogeneous Poisson process

    $\lambda(t) \equiv \lambda \implies X_t \sim \mathrm{Po}(\lambda t)$, $\lambda > 0$

    Waiting times

    $W_t$ := time at which $X_t$ occurs; $\quad W_t \sim \mathrm{Gamma}\left(t, \frac{1}{\lambda}\right)$

    Interarrival times

    $S_t = W_{t+1} - W_t$; $\quad S_t \sim \mathrm{Exp}\left(\frac{1}{\lambda}\right)$

  • 21 Time Series

    Mean function

    $\mu_{xt} = \mathbb{E}[x_t] = \int_{-\infty}^{\infty} x f_t(x)\,dx$

    Autocovariance function

    $\gamma_x(s, t) = \mathbb{E}[(x_s - \mu_s)(x_t - \mu_t)] = \mathbb{E}[x_s x_t] - \mu_s\mu_t$

    $\gamma_x(t, t) = \mathbb{E}[(x_t - \mu_t)^2] = \mathbb{V}[x_t]$

    Autocorrelation function (ACF)

    $\rho(s, t) = \frac{\mathrm{Cov}[x_s, x_t]}{\sqrt{\mathbb{V}[x_s]\,\mathbb{V}[x_t]}} = \frac{\gamma(s, t)}{\sqrt{\gamma(s, s)\,\gamma(t, t)}}$

    Cross-covariance function (CCV)

    $\gamma_{xy}(s, t) = \mathbb{E}[(x_s - \mu_{xs})(y_t - \mu_{yt})]$

    Cross-correlation function (CCF)

    $\rho_{xy}(s, t) = \frac{\gamma_{xy}(s, t)}{\sqrt{\gamma_x(s, s)\,\gamma_y(t, t)}}$

    Backshift operator

    $B^k(x_t) = x_{t-k}$

    Difference operator

    $\nabla^d = (1 - B)^d$

    White noise

    - $w_t \sim \mathrm{wn}(0, \sigma_w^2)$
    - Gaussian: $w_t \overset{iid}{\sim} \mathcal{N}(0, \sigma_w^2)$
    - $\mathbb{E}[w_t] = 0$ for all $t$, $\quad \mathbb{V}[w_t] = \sigma_w^2$ for all $t$, $\quad \gamma_w(s, t) = 0$ for $s \ne t$

    Random walk

    - Drift $\delta$: $x_t = \delta t + \sum_{j=1}^{t} w_j$
    - $\mathbb{E}[x_t] = \delta t$

    Symmetric moving average

    $m_t = \sum_{j=-k}^{k} a_j x_{t-j}$ where $a_j = a_{-j} \ge 0$ and $\sum_{j=-k}^{k} a_j = 1$

    21.1 Stationary Time Series

    Strictly stationary

    $\mathbb{P}[x_{t_1} \le c_1, \dots, x_{t_k} \le c_k] = \mathbb{P}[x_{t_1+h} \le c_1, \dots, x_{t_k+h} \le c_k]$ for all $k \in \mathbb{N}$, $t_k$, $c_k$, $h \in \mathbb{Z}$

    Weakly stationary

    - $\mathbb{E}[x_t^2] < \infty$ for all $t$
    - $\mathbb{E}[x_t] = m$ for all $t$
    - $\gamma_x(s, t)$ depends on $s$ and $t$ only through $|s - t|$

  • 21.2 Estimation of Correlation

    Sample mean

    $\bar{x} = \frac{1}{n}\sum_{t=1}^{n} x_t$

    Sample variance

    $\mathbb{V}[\bar{x}] = \frac{1}{n}\sum_{h=-n}^{n}\left(1 - \frac{|h|}{n}\right)\gamma_x(h)$

    Sample autocovariance function

    $\hat{\gamma}(h) = \frac{1}{n}\sum_{t=1}^{n-h}(x_{t+h} - \bar{x})(x_t - \bar{x})$

    Sample autocorrelation function

    $\hat{\rho}(h) = \frac{\hat{\gamma}(h)}{\hat{\gamma}(0)}$

    Sample cross-covariance function

    $\hat{\gamma}_{xy}(h) = \frac{1}{n}\sum_{t=1}^{n-h}(x_{t+h} - \bar{x})(y_t - \bar{y})$

    Sample cross-correlation function

    $\hat{\rho}_{xy}(h) = \frac{\hat{\gamma}_{xy}(h)}{\sqrt{\hat{\gamma}_x(0)\,\hat{\gamma}_y(0)}}$

    Properties

    - $\sigma_{\hat{\rho}_x(h)} = \frac{1}{\sqrt{n}}$ if $x_t$ is white noise
    - $\sigma_{\hat{\rho}_{xy}(h)} = \frac{1}{\sqrt{n}}$ if $x_t$ or $y_t$ is white noise
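    The sample ACF in a few lines of NumPy (a white-noise example, so $|\hat{\rho}(h)|$ should mostly stay within $\pm 1/\sqrt{n}$ for $h > 0$):

    ```python
    # Sketch: sample autocovariance and autocorrelation of a series.
    import numpy as np

    rng = np.random.default_rng(11)
    x = rng.normal(size=500)            # white-noise example
    n = x.size
    xbar = x.mean()

    def gamma_hat(h):
        # (1/n) sum_{t=1}^{n-h} (x_{t+h} - xbar)(x_t - xbar)
        return np.sum((x[h:] - xbar) * (x[: n - h] - xbar)) / n

    rho_hat = np.array([gamma_hat(h) / gamma_hat(0) for h in range(11)])
    print("rho_hat(0..10):", rho_hat.round(3))
    print("white-noise band: +/-", round(1 / np.sqrt(n), 3))
    ```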

    21.3 Non-Stationary Time Series

    Classical decomposition model

    $x_t = \mu_t + s_t + w_t$

    - $\mu_t$ = trend
    - $s_t$ = seasonal component
    - $w_t$ = random noise term

    21.3.1 Detrending

    Least squares

    1. Choose a trend model, e.g., $\mu_t = \beta_0 + \beta_1 t + \beta_2 t^2$
    2. Minimize rss to obtain the trend estimate $\hat{\mu}_t = \hat{\beta}_0 + \hat{\beta}_1 t + \hat{\beta}_2 t^2$
    3. Residuals $\mathrel{\widehat{=}}$ noise $w_t$

    Moving average

    The low-pass filter $v_t$ is a symmetric moving average $m_t$ with $a_j = \frac{1}{2k+1}$:

    $v_t = \frac{1}{2k+1}\sum_{i=-k}^{k} x_{t-i}$

    If $\frac{1}{2k+1}\sum_{j=-k}^{k} w_{t-j} \approx 0$, a linear trend function $\mu_t = \beta_0 + \beta_1 t$ passes without distortion.

    Differencing

    $\mu_t = \beta_0 + \beta_1 t \implies \nabla\mu_t = \beta_1$

    21.4 ARIMA models

    Autoregressive polynomial

    $\phi(z) = 1 - \phi_1 z - \dots - \phi_p z^p$, $\quad z \in \mathbb{C}$, $\phi_p \ne 0$

    Autoregressive operator

    $\phi(B) = 1 - \phi_1 B - \dots - \phi_p B^p$

    Autoregressive model of order $p$, AR($p$)

    $x_t = \phi_1 x_{t-1} + \dots + \phi_p x_{t-p} + w_t \iff \phi(B) x_t = w_t$

    AR(1)

    $x_t = \phi^k x_{t-k} + \sum_{j=0}^{k-1}\phi^j w_{t-j} \overset{k\to\infty,\ |\phi|<1}{=} \sum_{j=0}^{\infty}\phi^j w_{t-j}$ (linear process)

    Moving average polynomial

    $\theta(z) = 1 + \theta_1 z + \dots + \theta_q z^q$, $\quad z \in \mathbb{C}$, $\theta_q \ne 0$

    Moving average operator

    $\theta(B) = 1 + \theta_1 B + \dots + \theta_q B^q$

    MA($q$) (moving average model of order $q$)

    $x_t = w_t + \theta_1 w_{t-1} + \dots + \theta_q w_{t-q} \iff x_t = \theta(B) w_t$

    $\mathbb{E}[x_t] = \sum_{j=0}^{q}\theta_j\,\mathbb{E}[w_{t-j}] = 0$

    $\gamma(h) = \mathrm{Cov}[x_{t+h}, x_t] = \begin{cases} \sigma_w^2\sum_{j=0}^{q-h}\theta_j\theta_{j+h} & 0 \le h \le q \\ 0 & h > q \end{cases}$

    MA(1)

    $x_t = w_t + \theta w_{t-1}$

    $\gamma(h) = \begin{cases} (1 + \theta^2)\sigma_w^2 & h = 0 \\ \theta\sigma_w^2 & h = 1 \\ 0 & h > 1 \end{cases} \qquad \rho(h) = \begin{cases} \frac{\theta}{1+\theta^2} & h = 1 \\ 0 & h > 1 \end{cases}$

    ARMA($p, q$)

    $x_t = \phi_1 x_{t-1} + \dots + \phi_p x_{t-p} + w_t + \theta_1 w_{t-1} + \dots + \theta_q w_{t-q} \iff \phi(B) x_t = \theta(B) w_t$

    Partial autocorrelation function (PACF)

    - $x_i^{h-1}$: regression of $x_i$ on $\{x_{h-1}, x_{h-2}, \dots, x_1\}$
    - $\phi_{hh} = \mathrm{corr}(x_h - x_h^{h-1}, x_0 - x_0^{h-1})$ for $h \ge 2$
    - E.g., $\phi_{11} = \mathrm{corr}(x_1, x_0) = \rho(1)$

    ARIMA($p, d, q$)

    $\nabla^d x_t = (1 - B)^d x_t$ is ARMA($p, q$): $\quad \phi(B)(1 - B)^d x_t = \theta(B) w_t$

    Exponentially Weighted Moving Average (EWMA)

    $x_t = x_{t-1} + w_t - \lambda w_{t-1}$

    $x_t = \sum_{j=1}^{\infty}(1 - \lambda)\lambda^{j-1} x_{t-j} + w_t$ when $|\lambda| < 1$

    $\tilde{x}_{n+1} = (1 - \lambda)x_n + \lambda\tilde{x}_n$

    Seasonal ARIMA

    Denoted by ARIMA($p, d, q$) × ($P, D, Q$)$_s$:

    $\Phi_P(B^s)\,\phi(B)\,\nabla_s^D\nabla^d x_t = \delta + \Theta_Q(B^s)\,\theta(B)\,w_t$

    21.4.1 Causality and Invertibility

    ARMA($p, q$) is causal (future-independent) $\iff \exists\{\psi_j\}$ with $\sum_{j=0}^{\infty}|\psi_j| < \infty$ such that $x_t = \sum_{j=0}^{\infty}\psi_j w_{t-j} = \psi(B) w_t$

  • 21.5 Spectral Analysis

    Periodic process

    - Amplitude $A$, phase $\phi$; $U_1 = A\cos\phi$ and $U_2 = A\sin\phi$ are often taken to be normally distributed rvs

    Periodic mixture

    $x_t = \sum_{k=1}^{q}\left(U_{k1}\cos(2\pi\omega_k t) + U_{k2}\sin(2\pi\omega_k t)\right)$

    - $U_{k1}, U_{k2}$, for $k = 1, \dots, q$, are independent zero-mean rvs with variances $\sigma_k^2$
    - $\gamma(h) = \sum_{k=1}^{q}\sigma_k^2\cos(2\pi\omega_k h)$
    - $\gamma(0) = \mathbb{E}[x_t^2] = \sum_{k=1}^{q}\sigma_k^2$

    Spectral representation of a periodic process

    $\gamma(h) = \sigma^2\cos(2\pi\omega_0 h) = \frac{\sigma^2}{2}e^{-2\pi i\omega_0 h} + \frac{\sigma^2}{2}e^{2\pi i\omega_0 h} = \int_{-1/2}^{1/2}e^{2\pi i\omega h}\,dF(\omega)$

    Spectral distribution function

    $F(\omega) = \begin{cases} 0 & \omega < -\omega_0 \\ \sigma^2/2 & -\omega_0 \le \omega < \omega_0 \\ \sigma^2 & \omega \ge \omega_0 \end{cases}$

    - $F(-\infty) = F(-1/2) = 0$
    - $F(\infty) = F(1/2) = \gamma(0)$

    Spectral density

    $f(\omega) = \sum_{h=-\infty}^{\infty}\gamma(h)e^{-2\pi i\omega h}$, $\quad -\frac{1}{2} \le \omega \le \frac{1}{2}$

    - Needs $\sum_{h=-\infty}^{\infty}|\gamma(h)| < \infty$

    22 Math

    22.1 Gamma Function

    - $\Gamma(n) = (n - 1)!$ for $n \in \mathbb{N}$
    - $\Gamma(1/2) = \sqrt{\pi}$

    22.2 Beta Function

    - Ordinary: $B(x, y) = B(y, x) = \int_0^1 t^{x-1}(1 - t)^{y-1}\,dt = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x + y)}$
    - Incomplete: $B(x; a, b) = \int_0^x t^{a-1}(1 - t)^{b-1}\,dt$
    - Regularized incomplete:
      $I_x(a, b) = \frac{B(x; a, b)}{B(a, b)} \overset{a,b\in\mathbb{N}}{=} \sum_{j=a}^{a+b-1}\frac{(a + b - 1)!}{j!\,(a + b - 1 - j)!}\,x^j(1 - x)^{a+b-1-j}$
    - $I_0(a, b) = 0$, $\quad I_1(a, b) = 1$
    - $I_x(a, b) = 1 - I_{1-x}(b, a)$

    22.3 Series

    Finite

    - $\sum_{k=1}^{n} k = \frac{n(n+1)}{2}$
    - $\sum_{k=1}^{n} (2k - 1) = n^2$
    - $\sum_{k=1}^{n} k^2 = \frac{n(n+1)(2n+1)}{6}$
    - $\sum_{k=1}^{n} k^3 = \left(\frac{n(n+1)}{2}\right)^2$
    - $\sum_{k=0}^{n} c^k = \frac{c^{n+1} - 1}{c - 1}$, $c \ne 1$

    Binomial

    - $\sum_{k=0}^{n}\binom{n}{k} = 2^n$
    - $\sum_{k=0}^{n}\binom{r + k}{k} = \binom{r + n + 1}{n}$
    - $\sum_{k=0}^{n}\binom{k}{m} = \binom{n + 1}{m + 1}$
    - Vandermonde's identity: $\sum_{k=0}^{r}\binom{m}{k}\binom{n}{r - k} = \binom{m + n}{r}$
    - Binomial theorem: $\sum_{k=0}^{n}\binom{n}{k}a^{n-k}b^k = (a + b)^n$

    Infinite

    - $\sum_{k=0}^{\infty} p^k = \frac{1}{1 - p}$, $\quad \sum_{k=1}^{\infty} p^k = \frac{p}{1 - p}$, $\quad |p| < 1$
    - $\sum_{k=0}^{\infty} k p^{k-1} = \frac{d}{dp}\left(\sum_{k=0}^{\infty} p^k\right) = \frac{d}{dp}\left(\frac{1}{1 - p}\right) = \frac{1}{(1 - p)^2}$, $\quad |p| < 1$
    - $\sum_{k=0}^{\infty}\binom{r + k - 1}{k}x^k = (1 - x)^{-r}$, $\quad r \in \mathbb{N}^+$
    - $\sum_{k=0}^{\infty}\binom{\alpha}{k}p^k = (1 + p)^\alpha$, $\quad |p| < 1$, $\alpha \in \mathbb{C}$

    22.4 Combinatorics

    Sampling: draw $k$ out of $n$

                   without replacement                                                        with replacement
    ordered        $n^{\underline{k}} = \prod_{i=0}^{k-1}(n - i) = \frac{n!}{(n - k)!}$       $n^k$
    unordered      $\binom{n}{k} = \frac{n^{\underline{k}}}{k!} = \frac{n!}{k!(n - k)!}$      $\binom{n - 1 + r}{r} = \binom{n - 1 + r}{n - 1}$

    Stirling numbers, 2nd kind

    $\left\{{n \atop k}\right\} = k\left\{{n - 1 \atop k}\right\} + \left\{{n - 1 \atop k - 1}\right\}$, $\quad 1 \le k \le n$, $\qquad \left\{{n \atop 0}\right\} = \begin{cases} 1 & n = 0 \\ 0 & \text{else} \end{cases}$

    Partitions

    $P_{n+k,k} = \sum_{i=1}^{k} P_{n,i}$, $\qquad k > n : P_{n,k} = 0$, $\qquad n \ge 1 : P_{n,0} = 0$, $\quad P_{0,0} = 1$

    Balls and Urns: $f : B \to U$ with $|B| = n$, $|U| = m$; D = distinguishable, ¬D = indistinguishable.

    B, U            $f$ arbitrary                                $f$ injective                            $f$ surjective                        $f$ bijective
    B: D,  U: D     $m^n$                                        $m^{\underline{n}}$ if $m \ge n$, else 0  $m!\left\{{n \atop m}\right\}$        $n!$ if $m = n$, else 0
    B: ¬D, U: D     $\binom{m + n - 1}{n}$                       $\binom{m}{n}$                            $\binom{n - 1}{m - 1}$                1 if $m = n$, else 0
    B: D,  U: ¬D    $\sum_{k=1}^{m}\left\{{n \atop k}\right\}$   1 if $m \ge n$, else 0                    $\left\{{n \atop m}\right\}$          1 if $m = n$, else 0
    B: ¬D, U: ¬D    $\sum_{k=1}^{m} P_{n,k}$                     1 if $m \ge n$, else 0                    $P_{n,m}$                             1 if $m = n$, else 0

    References

    [1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole, 1972.

    [2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45-53, 2008.

    [3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications: With R Examples. Springer, 2006.

    [4] A. Steger. Diskrete Strukturen, Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.

    [5] A. Steger. Diskrete Strukturen, Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.

    [6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.

  • [Figure: Univariate distribution relationships, courtesy of Leemis and McQueston [2].]
