probability and statistics cookbook


  • Probability and Statistics Cookbook

    Copyright © Matthias Vallentin, [email protected]

    18th May, 2014

  • This cookbook integrates a variety of topics in probability theory and
    statistics. It is based on literature [1, 6, 3] and in-class material from
    courses of the statistics department at the University of California,
    Berkeley, but it is also influenced by other sources [4, 5]. If you find
    errors or have suggestions for further topics, I would appreciate it if you
    sent me an email. The most recent version of this document is available at
    http://matthias.vallentin.net/probability-and-statistics-cookbook/. To
    reproduce, please contact me.

    Contents

    1 Distribution Overview
      1.1 Discrete Distributions
      1.2 Continuous Distributions
    2 Probability Theory
    3 Random Variables
      3.1 Transformations
    4 Expectation
    5 Variance
    6 Inequalities
    7 Distribution Relationships
    8 Probability and Moment Generating Functions
    9 Multivariate Distributions
      9.1 Standard Bivariate Normal
      9.2 Bivariate Normal
      9.3 Multivariate Normal
    10 Convergence
      10.1 Law of Large Numbers (LLN)
      10.2 Central Limit Theorem (CLT)
    11 Statistical Inference
      11.1 Point Estimation
      11.2 Normal-Based Confidence Interval
      11.3 Empirical distribution
      11.4 Statistical Functionals
    12 Parametric Inference
      12.1 Method of Moments
      12.2 Maximum Likelihood
        12.2.1 Delta Method
      12.3 Multiparameter Models
        12.3.1 Multiparameter delta method
      12.4 Parametric Bootstrap
    13 Hypothesis Testing
    14 Bayesian Inference
      14.1 Credible Intervals
      14.2 Function of parameters
      14.3 Priors
        14.3.1 Conjugate Priors
      14.4 Bayesian Testing
    15 Exponential Family
    16 Sampling Methods
      16.1 The Bootstrap
        16.1.1 Bootstrap Confidence Intervals
      16.2 Rejection Sampling
      16.3 Importance Sampling
    17 Decision Theory
      17.1 Risk
      17.2 Admissibility
      17.3 Bayes Rule
      17.4 Minimax Rules
    18 Linear Regression
      18.1 Simple Linear Regression
      18.2 Prediction
      18.3 Multiple Regression
      18.4 Model Selection
    19 Non-parametric Function Estimation
      19.1 Density Estimation
        19.1.1 Histograms
        19.1.2 Kernel Density Estimator (KDE)
      19.2 Non-parametric Regression
      19.3 Smoothing Using Orthogonal Functions
    20 Stochastic Processes
      20.1 Markov Chains
      20.2 Poisson Processes
    21 Time Series
      21.1 Stationary Time Series
      21.2 Estimation of Correlation
      21.3 Non-Stationary Time Series
        21.3.1 Detrending
      21.4 ARIMA models
        21.4.1 Causality and Invertibility
      21.5 Spectral Analysis
    22 Math
      22.1 Gamma Function
      22.2 Beta Function
      22.3 Series
      22.4 Combinatorics

  • 1 Distribution Overview

    1.1 Discrete Distributions

    For each distribution: notation,¹ CDF $F_X(x)$, PMF $f_X(x)$, mean $\mathbb{E}[X]$, variance $\mathbb{V}[X]$, and MGF $M_X(s)$.

    Uniform $\mathrm{Unif}\{a,\dots,b\}$:
      $F_X(x) = 0$ for $x < a$, $\frac{\lfloor x\rfloor - a + 1}{b - a + 1}$ for $a \le x \le b$, $1$ for $x > b$;
      $f_X(x) = \frac{I(a \le x \le b)}{b - a + 1}$;  $\mathbb{E}[X] = \frac{a+b}{2}$;  $\mathbb{V}[X] = \frac{(b-a+1)^2 - 1}{12}$;  $M_X(s) = \frac{e^{as} - e^{(b+1)s}}{s(b-a)}$

    Bernoulli $\mathrm{Bern}(p)$:
      $F_X(x) = 0$ for $x < 0$, $1 - p$ for $0 \le x < 1$, $1$ for $x \ge 1$;
      $f_X(x) = p^x (1-p)^{1-x}$;  $\mathbb{E}[X] = p$;  $\mathbb{V}[X] = p(1-p)$;  $M_X(s) = 1 - p + p e^s$

    Binomial $\mathrm{Bin}(n, p)$:
      $F_X(x) = I_{1-p}(n - x, x + 1)$;  $f_X(x) = \binom{n}{x} p^x (1-p)^{n-x}$;  $\mathbb{E}[X] = np$;  $\mathbb{V}[X] = np(1-p)$;  $M_X(s) = (1 - p + p e^s)^n$

    Multinomial $\mathrm{Mult}(n, p)$:
      $f_X(x) = \frac{n!}{x_1! \cdots x_k!} p_1^{x_1} \cdots p_k^{x_k}$ with $\sum_{i=1}^{k} x_i = n$;
      $\mathbb{E}[X_i] = n p_i$;  $\mathbb{V}[X_i] = n p_i (1 - p_i)$;  $M_X(s) = \left(\sum_{i=1}^{k} p_i e^{s_i}\right)^n$

    Hypergeometric $\mathrm{Hyp}(N, m, n)$:
      $F_X(x) \approx \Phi\!\left(\frac{x - np}{\sqrt{np(1-p)}}\right)$ with $p = m/N$;  $f_X(x) = \frac{\binom{m}{x}\binom{N-m}{n-x}}{\binom{N}{n}}$;  $\mathbb{E}[X] = \frac{nm}{N}$;  $\mathbb{V}[X] = \frac{nm(N-n)(N-m)}{N^2 (N-1)}$

    Negative Binomial $\mathrm{NBin}(r, p)$:
      $F_X(x) = I_p(r, x + 1)$;  $f_X(x) = \binom{x + r - 1}{r - 1} p^r (1-p)^x$;  $\mathbb{E}[X] = r\,\frac{1-p}{p}$;  $\mathbb{V}[X] = r\,\frac{1-p}{p^2}$;  $M_X(s) = \left(\frac{p}{1 - (1-p)e^s}\right)^r$

    Geometric $\mathrm{Geo}(p)$:
      $F_X(x) = 1 - (1-p)^x$, $x \in \mathbb{N}^+$;  $f_X(x) = p(1-p)^{x-1}$, $x \in \mathbb{N}^+$;  $\mathbb{E}[X] = \frac{1}{p}$;  $\mathbb{V}[X] = \frac{1-p}{p^2}$;  $M_X(s) = \frac{p e^s}{1 - (1-p)e^s}$

    Poisson $\mathrm{Po}(\lambda)$:
      $F_X(x) = e^{-\lambda} \sum_{i=0}^{x} \frac{\lambda^i}{i!}$;  $f_X(x) = \frac{\lambda^x e^{-\lambda}}{x!}$;  $\mathbb{E}[X] = \lambda$;  $\mathbb{V}[X] = \lambda$;  $M_X(s) = e^{\lambda(e^s - 1)}$

    [Figure: PMF plots for the discrete Uniform, Binomial (n = 40, p = 0.3; n = 30, p = 0.6; n = 25, p = 0.9), Geometric (p = 0.2, 0.5, 0.8), and Poisson (λ = 1, 4, 10) distributions.]

    ¹ We use the notation $\gamma(s, x)$ and $\Gamma(x)$ to refer to the Gamma functions (see 22.1), and use $B(x, y)$ and $I_x$ to refer to the Beta functions (see 22.2).
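    As a quick sanity check on the discrete formulas above, here is a minimal Python sketch (assuming SciPy is available; parameter values are arbitrary examples) comparing the tabulated mean and variance against scipy.stats.

    ```python
    # Sketch: compare tabulated E[X] and V[X] against scipy.stats (arbitrary parameters).
    from scipy import stats

    n, p, lam = 10, 0.3, 4.0

    checks = {
        "Bernoulli": (stats.bernoulli(p), p,         p * (1 - p)),
        "Binomial":  (stats.binom(n, p),  n * p,     n * p * (1 - p)),
        "Geometric": (stats.geom(p),      1 / p,     (1 - p) / p**2),
        "Poisson":   (stats.poisson(lam), lam,       lam),
    }

    for name, (dist, mean_formula, var_formula) in checks.items():
        print(f"{name:10s} mean {dist.mean():.4f} vs {mean_formula:.4f}, "
              f"var {dist.var():.4f} vs {var_formula:.4f}")
    ```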

  • 1.2 Continuous Distributions

    For each distribution: notation, CDF $F_X(x)$, PDF $f_X(x)$, mean $\mathbb{E}[X]$, variance $\mathbb{V}[X]$, and MGF $M_X(s)$.

    Uniform $\mathrm{Unif}(a, b)$:
      $F_X(x) = 0$ for $x < a$, $\frac{x-a}{b-a}$ for $a < x < b$, $1$ for $x > b$;
      $f_X(x) = \frac{I(a < x < b)}{b - a}$;  $\mathbb{E}[X] = \frac{a+b}{2}$;  $\mathbb{V}[X] = \frac{(b-a)^2}{12}$;  $M_X(s) = \frac{e^{sb} - e^{sa}}{s(b-a)}$

    Normal $\mathcal{N}(\mu, \sigma^2)$:
      $F_X(x) = \Phi\!\left(\frac{x-\mu}{\sigma}\right)$ with $\Phi(x) = \int_{-\infty}^{x} \phi(t)\,dt$;
      $f_X(x) = \phi(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}$;  $\mathbb{E}[X] = \mu$;  $\mathbb{V}[X] = \sigma^2$;  $M_X(s) = \exp\!\left\{\mu s + \frac{\sigma^2 s^2}{2}\right\}$

    Log-Normal $\ln\mathcal{N}(\mu, \sigma^2)$:
      $F_X(x) = \frac{1}{2} + \frac{1}{2}\,\mathrm{erf}\!\left[\frac{\ln x - \mu}{\sqrt{2\sigma^2}}\right]$;
      $f_X(x) = \frac{1}{x\sqrt{2\pi\sigma^2}} \exp\!\left\{-\frac{(\ln x - \mu)^2}{2\sigma^2}\right\}$;  $\mathbb{E}[X] = e^{\mu + \sigma^2/2}$;  $\mathbb{V}[X] = (e^{\sigma^2} - 1)\,e^{2\mu + \sigma^2}$

    Multivariate Normal $\mathrm{MVN}(\mu, \Sigma)$:
      $f_X(x) = (2\pi)^{-k/2} |\Sigma|^{-1/2} e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)}$;  $\mathbb{E}[X] = \mu$;  $\mathbb{V}[X] = \Sigma$;  $M_X(s) = \exp\!\left\{\mu^T s + \frac{1}{2} s^T \Sigma s\right\}$

    Student's t $\mathrm{Student}(\nu)$:
      $f_X(x) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)} \left(1 + \frac{x^2}{\nu}\right)^{-(\nu+1)/2}$;  $\mathbb{E}[X] = 0$ ($\nu > 1$);  $\mathbb{V}[X] = \frac{\nu}{\nu - 2}$ for $\nu > 2$, $\infty$ for $1 < \nu \le 2$

    Chi-square $\chi^2_k$:
      $F_X(x) = \frac{1}{\Gamma(k/2)}\,\gamma\!\left(\frac{k}{2}, \frac{x}{2}\right)$;  $f_X(x) = \frac{1}{2^{k/2}\Gamma(k/2)}\,x^{k/2-1} e^{-x/2}$;  $\mathbb{E}[X] = k$;  $\mathbb{V}[X] = 2k$;  $M_X(s) = (1 - 2s)^{-k/2}$, $s < 1/2$

    F $\mathrm{F}(d_1, d_2)$:
      $F_X(x) = I_{\frac{d_1 x}{d_1 x + d_2}}\!\left(\frac{d_1}{2}, \frac{d_2}{2}\right)$;  $f_X(x) = \frac{\sqrt{\frac{(d_1 x)^{d_1} d_2^{d_2}}{(d_1 x + d_2)^{d_1 + d_2}}}}{x\,B\!\left(\frac{d_1}{2}, \frac{d_2}{2}\right)}$;  $\mathbb{E}[X] = \frac{d_2}{d_2 - 2}$;  $\mathbb{V}[X] = \frac{2 d_2^2 (d_1 + d_2 - 2)}{d_1 (d_2 - 2)^2 (d_2 - 4)}$

    Exponential $\mathrm{Exp}(\beta)$:
      $F_X(x) = 1 - e^{-x/\beta}$;  $f_X(x) = \frac{1}{\beta} e^{-x/\beta}$;  $\mathbb{E}[X] = \beta$;  $\mathbb{V}[X] = \beta^2$;  $M_X(s) = \frac{1}{1 - \beta s}$ ($s < 1/\beta$)

    Gamma $\mathrm{Gamma}(\alpha, \beta)$:
      $F_X(x) = \frac{\gamma(\alpha, x/\beta)}{\Gamma(\alpha)}$;  $f_X(x) = \frac{1}{\Gamma(\alpha)\beta^\alpha}\,x^{\alpha-1} e^{-x/\beta}$;  $\mathbb{E}[X] = \alpha\beta$;  $\mathbb{V}[X] = \alpha\beta^2$;  $M_X(s) = \left(\frac{1}{1 - \beta s}\right)^{\alpha}$ ($s < 1/\beta$)

    Inverse Gamma $\mathrm{InvGamma}(\alpha, \beta)$:
      $F_X(x) = \frac{\Gamma(\alpha, \beta/x)}{\Gamma(\alpha)}$;  $f_X(x) = \frac{\beta^\alpha}{\Gamma(\alpha)}\,x^{-\alpha-1} e^{-\beta/x}$;  $\mathbb{E}[X] = \frac{\beta}{\alpha - 1}$ ($\alpha > 1$);  $\mathbb{V}[X] = \frac{\beta^2}{(\alpha-1)^2(\alpha-2)}$ ($\alpha > 2$);  $M_X(s) = \frac{2(-\beta s)^{\alpha/2}}{\Gamma(\alpha)} K_\alpha\!\left(\sqrt{-4\beta s}\right)$

    Dirichlet $\mathrm{Dir}(\alpha)$:
      $f_X(x) = \frac{\Gamma\!\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1}$;  $\mathbb{E}[X_i] = \frac{\alpha_i}{\sum_{i=1}^{k}\alpha_i}$;  $\mathbb{V}[X_i] = \frac{\mathbb{E}[X_i](1 - \mathbb{E}[X_i])}{\sum_{i=1}^{k}\alpha_i + 1}$

    Beta $\mathrm{Beta}(\alpha, \beta)$:
      $F_X(x) = I_x(\alpha, \beta)$;  $f_X(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}$;  $\mathbb{E}[X] = \frac{\alpha}{\alpha+\beta}$;  $\mathbb{V}[X] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$;  $M_X(s) = 1 + \sum_{k=1}^{\infty}\left(\prod_{r=0}^{k-1}\frac{\alpha+r}{\alpha+\beta+r}\right)\frac{s^k}{k!}$

    Weibull $\mathrm{Weibull}(\lambda, k)$:
      $F_X(x) = 1 - e^{-(x/\lambda)^k}$;  $f_X(x) = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1} e^{-(x/\lambda)^k}$;  $\mathbb{E}[X] = \lambda\,\Gamma\!\left(1 + \frac{1}{k}\right)$;  $\mathbb{V}[X] = \lambda^2\,\Gamma\!\left(1 + \frac{2}{k}\right) - \mu^2$;  $M_X(s) = \sum_{n=0}^{\infty}\frac{s^n\lambda^n}{n!}\Gamma\!\left(1 + \frac{n}{k}\right)$

    Pareto $\mathrm{Pareto}(x_m, \alpha)$:
      $F_X(x) = 1 - \left(\frac{x_m}{x}\right)^{\alpha}$ for $x \ge x_m$;  $f_X(x) = \alpha\,\frac{x_m^\alpha}{x^{\alpha+1}}$ for $x \ge x_m$;  $\mathbb{E}[X] = \frac{\alpha x_m}{\alpha - 1}$ ($\alpha > 1$);  $\mathbb{V}[X] = \frac{x_m^2\,\alpha}{(\alpha-1)^2(\alpha-2)}$ ($\alpha > 2$);  $M_X(s) = \alpha(-x_m s)^\alpha\,\Gamma(-\alpha, -x_m s)$ ($s < 0$)

  • [Figure: PDF plots for the continuous Uniform, Normal (various μ, σ²), Log-Normal, Student's t (ν = 1, 2, 5, ∞), Chi-square (k = 1, …, 5), F, Exponential (β = 2, 1, 0.4), Gamma, Inverse Gamma, Beta, Weibull, and Pareto distributions.]

  • 2 Probability Theory

    Definitions

    - Sample space $\Omega$
    - Outcome (point or element) $\omega \in \Omega$
    - Event $A \subseteq \Omega$
    - $\sigma$-algebra $\mathcal{A}$:
      1. $\emptyset \in \mathcal{A}$
      2. $A_1, A_2, \dots \in \mathcal{A} \implies \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$
      3. $A \in \mathcal{A} \implies \neg A \in \mathcal{A}$
    - Probability distribution $\mathbb{P}$:
      1. $\mathbb{P}[A] \ge 0$ for every $A$
      2. $\mathbb{P}[\Omega] = 1$
      3. For disjoint $A_i$: $\mathbb{P}\left[\bigcup_{i=1}^{\infty} A_i\right] = \sum_{i=1}^{\infty} \mathbb{P}[A_i]$
    - Probability space $(\Omega, \mathcal{A}, \mathbb{P})$

    Properties

    - $\mathbb{P}[\emptyset] = 0$
    - $B = \Omega \cap B = (A \cup \neg A) \cap B = (A \cap B) \cup (\neg A \cap B)$
    - $\mathbb{P}[\neg A] = 1 - \mathbb{P}[A]$
    - $\mathbb{P}[B] = \mathbb{P}[A \cap B] + \mathbb{P}[\neg A \cap B]$
    - $\mathbb{P}[\Omega] = 1$, $\mathbb{P}[\emptyset] = 0$
    - $\neg\left(\bigcup_n A_n\right) = \bigcap_n \neg A_n$, $\neg\left(\bigcap_n A_n\right) = \bigcup_n \neg A_n$ (DeMorgan)
    - $\mathbb{P}\left[\bigcup_n A_n\right] = 1 - \mathbb{P}\left[\bigcap_n \neg A_n\right]$
    - $\mathbb{P}[A \cup B] = \mathbb{P}[A] + \mathbb{P}[B] - \mathbb{P}[A \cap B]$
    - $\mathbb{P}[A \cup B] = \mathbb{P}[A \cap \neg B] + \mathbb{P}[\neg A \cap B] + \mathbb{P}[A \cap B]$
    - $\mathbb{P}[A \cap \neg B] = \mathbb{P}[A] - \mathbb{P}[A \cap B]$

    Continuity of Probabilities

    - $A_1 \subset A_2 \subset \dots \implies \lim_{n\to\infty}\mathbb{P}[A_n] = \mathbb{P}[A]$ where $A = \bigcup_{i=1}^{\infty} A_i$
    - $A_1 \supset A_2 \supset \dots \implies \lim_{n\to\infty}\mathbb{P}[A_n] = \mathbb{P}[A]$ where $A = \bigcap_{i=1}^{\infty} A_i$

    Independence

    $A \perp B \iff \mathbb{P}[A \cap B] = \mathbb{P}[A]\,\mathbb{P}[B]$

    Conditional Probability

    $\mathbb{P}[A \,|\, B] = \frac{\mathbb{P}[A \cap B]}{\mathbb{P}[B]}$, $\quad \mathbb{P}[B] > 0$

    Law of Total Probability

    $\mathbb{P}[B] = \sum_{i=1}^{n} \mathbb{P}[B \,|\, A_i]\,\mathbb{P}[A_i]$, where $\Omega = \bigcup_{i=1}^{n} A_i$ (disjoint)

    Bayes' Theorem

    $\mathbb{P}[A_i \,|\, B] = \frac{\mathbb{P}[B \,|\, A_i]\,\mathbb{P}[A_i]}{\sum_{j=1}^{n} \mathbb{P}[B \,|\, A_j]\,\mathbb{P}[A_j]}$, where $\Omega = \bigcup_{i=1}^{n} A_i$ (disjoint)

    Inclusion-Exclusion Principle

    $\left|\bigcup_{i=1}^{n} A_i\right| = \sum_{r=1}^{n} (-1)^{r-1} \sum_{i_1 < \dots < i_r} \left|A_{i_1} \cap \dots \cap A_{i_r}\right|$
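    To make the last two identities concrete, here is a small Python sketch (the test/condition numbers are made-up illustrative values, not from the source) that computes a posterior probability via the law of total probability and Bayes' theorem.

    ```python
    # Sketch: law of total probability and Bayes' theorem with made-up numbers.
    # Partition: A1 = "has condition", A2 = "does not"; B = "test is positive".
    p_A = [0.01, 0.99]          # P[A1], P[A2]
    p_B_given_A = [0.95, 0.05]  # P[B | A1], P[B | A2]

    # Law of total probability: P[B] = sum_i P[B | Ai] P[Ai]
    p_B = sum(pb * pa for pb, pa in zip(p_B_given_A, p_A))

    # Bayes' theorem: P[A1 | B] = P[B | A1] P[A1] / P[B]
    p_A1_given_B = p_B_given_A[0] * p_A[0] / p_B

    print(f"P[B] = {p_B:.4f}, P[A1 | B] = {p_A1_given_B:.4f}")
    ```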

  • 3.1 Transformations

    Transformation function: $Z = \varphi(X)$

    Discrete

    $f_Z(z) = \mathbb{P}[\varphi(X) = z] = \mathbb{P}[\{x : \varphi(x) = z\}] = \mathbb{P}[X \in \varphi^{-1}(z)] = \sum_{x \in \varphi^{-1}(z)} f(x)$

    Continuous

    $F_Z(z) = \mathbb{P}[\varphi(X) \le z] = \int_{A_z} f(x)\,dx$ with $A_z = \{x : \varphi(x) \le z\}$

    Special case if $\varphi$ is strictly monotone

    $f_Z(z) = f_X(\varphi^{-1}(z)) \left|\frac{d}{dz}\varphi^{-1}(z)\right| = f_X(x) \left|\frac{dx}{dz}\right| = f_X(x)\,\frac{1}{|J|}$

    The Rule of the Lazy Statistician

    $\mathbb{E}[Z] = \int \varphi(x)\,dF_X(x)$

    $\mathbb{E}[I_A(x)] = \int I_A(x)\,dF_X(x) = \int_A dF_X(x) = \mathbb{P}[X \in A]$

    Convolution

    - $Z := X + Y$: $f_Z(z) = \int_{-\infty}^{\infty} f_{X,Y}(x, z - x)\,dx \overset{X,Y \ge 0}{=} \int_{0}^{z} f_{X,Y}(x, z - x)\,dx$
    - $Z := |X - Y|$: $f_Z(z) = 2\int_{0}^{\infty} f_{X,Y}(x, z + x)\,dx$
    - $Z := Y/X$: $f_Z(z) = \int_{-\infty}^{\infty} |x|\,f_{X,Y}(x, xz)\,dx \overset{X \perp Y}{=} \int_{-\infty}^{\infty} |x|\,f_X(x) f_Y(xz)\,dx$

    4 Expectation

    Definition and properties

    - $\mathbb{E}[X] = \mu_X = \int x\,dF_X(x) = \begin{cases} \sum_x x f_X(x) & X \text{ discrete} \\ \int x f_X(x)\,dx & X \text{ continuous} \end{cases}$
    - $\mathbb{P}[X = c] = 1 \implies \mathbb{E}[X] = c$
    - $\mathbb{E}[cX] = c\,\mathbb{E}[X]$
    - $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$
    - $\mathbb{E}[XY] = \int\!\!\int_{X,Y} xy\,f_{X,Y}(x, y)\,dx\,dy$
    - $\mathbb{E}[\varphi(X)] \ne \varphi(\mathbb{E}[X])$ in general (cf. Jensen's inequality)
    - $\mathbb{P}[X \ge Y] = 1 \implies \mathbb{E}[X] \ge \mathbb{E}[Y]$
    - $\mathbb{P}[X = Y] = 1 \implies \mathbb{E}[X] = \mathbb{E}[Y]$
    - $\mathbb{E}[X] = \sum_{x=1}^{\infty} \mathbb{P}[X \ge x]$ (for $X$ taking values in $\mathbb{N}$)

    Sample mean

    $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$

    Conditional expectation

    - $\mathbb{E}[Y \,|\, X = x] = \int y f(y \,|\, x)\,dy$
    - $\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \,|\, Y]]$
    - $\mathbb{E}[\varphi(X, Y) \,|\, X = x] = \int \varphi(x, y) f_{Y|X}(y \,|\, x)\,dy$
    - $\mathbb{E}[\varphi(Y, Z) \,|\, X = x] = \int\!\!\int \varphi(y, z) f_{(Y,Z)|X}(y, z \,|\, x)\,dy\,dz$
    - $\mathbb{E}[Y + Z \,|\, X] = \mathbb{E}[Y \,|\, X] + \mathbb{E}[Z \,|\, X]$
    - $\mathbb{E}[\varphi(X) Y \,|\, X] = \varphi(X)\,\mathbb{E}[Y \,|\, X]$
    - $\mathbb{E}[Y \,|\, X] = c \implies \mathrm{Cov}[X, Y] = 0$

    5 Variance

    Definition and properties

    - $\mathbb{V}[X] = \sigma_X^2 = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$
    - $\mathbb{V}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \mathbb{V}[X_i] + 2\sum_{i \ne j} \mathrm{Cov}[X_i, X_j]$
    - $\mathbb{V}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \mathbb{V}[X_i]$ if $X_i \perp X_j$

    Standard deviation

    $\mathrm{sd}[X] = \sqrt{\mathbb{V}[X]} = \sigma_X$

    Covariance

    - $\mathrm{Cov}[X, Y] = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]$
    - $\mathrm{Cov}[X, a] = 0$
    - $\mathrm{Cov}[X, X] = \mathbb{V}[X]$
    - $\mathrm{Cov}[X, Y] = \mathrm{Cov}[Y, X]$
    - $\mathrm{Cov}[aX, bY] = ab\,\mathrm{Cov}[X, Y]$
    - $\mathrm{Cov}[X + a, Y + b] = \mathrm{Cov}[X, Y]$
    - $\mathrm{Cov}\left[\sum_{i=1}^{n} X_i, \sum_{j=1}^{m} Y_j\right] = \sum_{i=1}^{n}\sum_{j=1}^{m} \mathrm{Cov}[X_i, Y_j]$

    Correlation

    $\rho[X, Y] = \frac{\mathrm{Cov}[X, Y]}{\sqrt{\mathbb{V}[X]\,\mathbb{V}[Y]}}$

    Independence

    $X \perp Y \implies \rho[X, Y] = 0 \iff \mathrm{Cov}[X, Y] = 0 \iff \mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$

    Sample variance

    $S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X}_n)^2$

    Conditional variance

    - $\mathbb{V}[Y \,|\, X] = \mathbb{E}[(Y - \mathbb{E}[Y \,|\, X])^2 \,|\, X] = \mathbb{E}[Y^2 \,|\, X] - \mathbb{E}[Y \,|\, X]^2$
    - $\mathbb{V}[Y] = \mathbb{E}[\mathbb{V}[Y \,|\, X]] + \mathbb{V}[\mathbb{E}[Y \,|\, X]]$

    6 Inequalities

    - Cauchy-Schwarz: $\mathbb{E}[XY]^2 \le \mathbb{E}[X^2]\,\mathbb{E}[Y^2]$
    - Markov: $\mathbb{P}[\varphi(X) \ge t] \le \frac{\mathbb{E}[\varphi(X)]}{t}$
    - Chebyshev: $\mathbb{P}[|X - \mathbb{E}[X]| \ge t] \le \frac{\mathbb{V}[X]}{t^2}$
    - Chernoff: $\mathbb{P}[X \ge (1+\delta)\mu] \le \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^{\mu}$, $\delta > -1$
    - Jensen: $\mathbb{E}[\varphi(X)] \ge \varphi(\mathbb{E}[X])$ for convex $\varphi$

    7 Distribution Relationships

    Binomial

    - $X_i \sim \mathrm{Bern}(p) \implies \sum_{i=1}^{n} X_i \sim \mathrm{Bin}(n, p)$
    - $X \sim \mathrm{Bin}(n, p)$, $Y \sim \mathrm{Bin}(m, p)$, $X \perp Y \implies X + Y \sim \mathrm{Bin}(n + m, p)$
    - $\lim_{n\to\infty} \mathrm{Bin}(n, p) = \mathrm{Po}(np)$ ($n$ large, $p$ small)
    - $\lim_{n\to\infty} \mathrm{Bin}(n, p) = \mathcal{N}(np, np(1-p))$ ($n$ large, $p$ far from 0 and 1)

    Negative Binomial

    - $X \sim \mathrm{NBin}(1, p) = \mathrm{Geo}(p)$
    - $X \sim \mathrm{NBin}(r, p) = \sum_{i=1}^{r} \mathrm{Geo}(p)$
    - $X_i \sim \mathrm{NBin}(r_i, p) \implies \sum_i X_i \sim \mathrm{NBin}\left(\sum_i r_i, p\right)$
    - $X \sim \mathrm{NBin}(r, p)$, $Y \sim \mathrm{Bin}(s + r, p) \implies \mathbb{P}[X \le s] = \mathbb{P}[Y \ge r]$

    Poisson

    - $X_i \sim \mathrm{Po}(\lambda_i)$, $X_i \perp X_j \implies \sum_{i=1}^{n} X_i \sim \mathrm{Po}\left(\sum_{i=1}^{n} \lambda_i\right)$
    - $X_i \sim \mathrm{Po}(\lambda_i)$, $X_i \perp X_j \implies X_i \,\Big|\, \sum_{j=1}^{n} X_j \sim \mathrm{Bin}\left(\sum_{j=1}^{n} X_j, \frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j}\right)$

    Exponential

    - $X_i \sim \mathrm{Exp}(\beta)$, $X_i \perp X_j \implies \sum_{i=1}^{n} X_i \sim \mathrm{Gamma}(n, \beta)$
    - Memoryless property: $\mathbb{P}[X > x + y \,|\, X > y] = \mathbb{P}[X > x]$

    Normal

    - $X \sim \mathcal{N}(\mu, \sigma^2) \implies \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$
    - $X \sim \mathcal{N}(\mu, \sigma^2)$, $Z = aX + b \implies Z \sim \mathcal{N}(a\mu + b, a^2\sigma^2)$
    - $X \sim \mathcal{N}(\mu_1, \sigma_1^2)$, $Y \sim \mathcal{N}(\mu_2, \sigma_2^2)$, $X \perp Y \implies X + Y \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$
    - $X_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$ independent $\implies \sum_i X_i \sim \mathcal{N}\left(\sum_i \mu_i, \sum_i \sigma_i^2\right)$
    - $\mathbb{P}[a < X \le b] = \Phi\!\left(\frac{b-\mu}{\sigma}\right) - \Phi\!\left(\frac{a-\mu}{\sigma}\right)$
    - $\Phi(-x) = 1 - \Phi(x)$, $\quad \phi'(x) = -x\phi(x)$, $\quad \phi''(x) = (x^2 - 1)\phi(x)$
    - Upper quantile of $\mathcal{N}(0, 1)$: $z_\alpha = \Phi^{-1}(1 - \alpha)$

    Gamma

    - $X \sim \mathrm{Gamma}(\alpha, \beta) \iff X/\beta \sim \mathrm{Gamma}(\alpha, 1)$
    - $\mathrm{Gamma}(\alpha, \beta) \sim \sum_{i=1}^{\alpha} \mathrm{Exp}(\beta)$ (for integer $\alpha$)
    - $X_i \sim \mathrm{Gamma}(\alpha_i, \beta)$, $X_i \perp X_j \implies \sum_i X_i \sim \mathrm{Gamma}\left(\sum_i \alpha_i, \beta\right)$
    - $\Gamma(\alpha) = \int_0^{\infty} x^{\alpha-1} e^{-x}\,dx$

    Beta

    - $\frac{1}{B(\alpha, \beta)}\,x^{\alpha-1}(1-x)^{\beta-1} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}$
    - $\mathbb{E}[X^k] = \frac{B(\alpha+k, \beta)}{B(\alpha, \beta)} = \frac{\alpha + k - 1}{\alpha + \beta + k - 1}\,\mathbb{E}[X^{k-1}]$
    - $\mathrm{Beta}(1, 1) \sim \mathrm{Unif}(0, 1)$

  • 8 Probability and Moment Generating Functions

    - $G_X(t) = \mathbb{E}[t^X]$, $|t| < 1$
    - $M_X(t) = G_X(e^t) = \mathbb{E}[e^{Xt}] = \mathbb{E}\left[\sum_{i=0}^{\infty} \frac{(Xt)^i}{i!}\right] = \sum_{i=0}^{\infty} \frac{\mathbb{E}[X^i]}{i!}\,t^i$
    - $\mathbb{P}[X = 0] = G_X(0)$
    - $\mathbb{P}[X = 1] = G_X'(0)$
    - $\mathbb{P}[X = i] = \frac{G_X^{(i)}(0)}{i!}$
    - $\mathbb{E}[X] = G_X'(1^-)$
    - $\mathbb{E}[X^k] = M_X^{(k)}(0)$
    - $\mathbb{E}\left[\frac{X!}{(X-k)!}\right] = G_X^{(k)}(1^-)$
    - $\mathbb{V}[X] = G_X''(1^-) + G_X'(1^-) - \left(G_X'(1^-)\right)^2$
    - $G_X(t) = G_Y(t) \implies X \overset{d}{=} Y$

    9 Multivariate Distributions

    9.1 Standard Bivariate Normal

    Let $X, Z \sim \mathcal{N}(0, 1)$ with $X \perp Z$ and $Y = \rho X + \sqrt{1 - \rho^2}\,Z$.

    Joint density

    $f(x, y) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left\{-\frac{x^2 + y^2 - 2\rho x y}{2(1-\rho^2)}\right\}$

    Conditionals

    $(Y \,|\, X = x) \sim \mathcal{N}(\rho x, 1 - \rho^2)$ and $(X \,|\, Y = y) \sim \mathcal{N}(\rho y, 1 - \rho^2)$

    Independence

    $X \perp Y \iff \rho = 0$

    9.2 Bivariate Normal

    Let $X \sim \mathcal{N}(\mu_x, \sigma_x^2)$ and $Y \sim \mathcal{N}(\mu_y, \sigma_y^2)$.

    $f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\left\{-\frac{z}{2(1-\rho^2)}\right\}$

    $z = \left(\frac{x - \mu_x}{\sigma_x}\right)^2 + \left(\frac{y - \mu_y}{\sigma_y}\right)^2 - 2\rho\left(\frac{x - \mu_x}{\sigma_x}\right)\left(\frac{y - \mu_y}{\sigma_y}\right)$

    Conditional mean and variance

    $\mathbb{E}[X \,|\, Y] = \mathbb{E}[X] + \rho\,\frac{\sigma_X}{\sigma_Y}\,(Y - \mathbb{E}[Y])$

    $\mathbb{V}[X \,|\, Y] = \sigma_X^2\,(1 - \rho^2)$

    9.3 Multivariate Normal

    Covariance matrix $\Sigma$ (precision matrix $\Sigma^{-1}$)

    $\Sigma = \begin{pmatrix} \mathbb{V}[X_1] & \cdots & \mathrm{Cov}[X_1, X_k] \\ \vdots & \ddots & \vdots \\ \mathrm{Cov}[X_k, X_1] & \cdots & \mathbb{V}[X_k] \end{pmatrix}$

    If $X \sim \mathcal{N}(\mu, \Sigma)$,

    $f_X(x) = (2\pi)^{-n/2}\,|\Sigma|^{-1/2}\,\exp\left\{-\frac{1}{2}(x - \mu)^T \Sigma^{-1}(x - \mu)\right\}$

    Properties

    - $Z \sim \mathcal{N}(0, 1)$, $X = \mu + \Sigma^{1/2} Z \implies X \sim \mathcal{N}(\mu, \Sigma)$
    - $X \sim \mathcal{N}(\mu, \Sigma) \implies \Sigma^{-1/2}(X - \mu) \sim \mathcal{N}(0, 1)$
    - $X \sim \mathcal{N}(\mu, \Sigma) \implies AX \sim \mathcal{N}(A\mu, A\Sigma A^T)$
    - $X \sim \mathcal{N}(\mu, \Sigma)$, $a$ a vector of length $k \implies a^T X \sim \mathcal{N}(a^T\mu, a^T\Sigma a)$

    10 Convergence

    Let $\{X_1, X_2, \dots\}$ be a sequence of rvs and let $X$ be another rv. Let $F_n$ denote the cdf of $X_n$ and let $F$ denote the cdf of $X$.

    Types of convergence

    1. In distribution (weakly, in law): $X_n \overset{D}{\to} X$
       $\lim_{n\to\infty} F_n(t) = F(t)$ at all $t$ where $F$ is continuous
    2. In probability: $X_n \overset{P}{\to} X$
       $(\forall \varepsilon > 0)\ \lim_{n\to\infty} \mathbb{P}[|X_n - X| > \varepsilon] = 0$
    3. Almost surely (strongly): $X_n \overset{as}{\to} X$
       $\mathbb{P}\left[\lim_{n\to\infty} X_n = X\right] = \mathbb{P}\left[\omega \in \Omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)\right] = 1$
    4. In quadratic mean ($L_2$): $X_n \overset{qm}{\to} X$
       $\lim_{n\to\infty} \mathbb{E}\left[(X_n - X)^2\right] = 0$

    Relationships

    - $X_n \overset{qm}{\to} X \implies X_n \overset{P}{\to} X \implies X_n \overset{D}{\to} X$
    - $X_n \overset{as}{\to} X \implies X_n \overset{P}{\to} X$
    - $X_n \overset{D}{\to} X$ and $(\exists c \in \mathbb{R})\ \mathbb{P}[X = c] = 1 \implies X_n \overset{P}{\to} X$
    - $X_n \overset{P}{\to} X$ and $Y_n \overset{P}{\to} Y \implies X_n + Y_n \overset{P}{\to} X + Y$
    - $X_n \overset{qm}{\to} X$ and $Y_n \overset{qm}{\to} Y \implies X_n + Y_n \overset{qm}{\to} X + Y$
    - $X_n \overset{P}{\to} X$ and $Y_n \overset{P}{\to} Y \implies X_n Y_n \overset{P}{\to} XY$
    - $X_n \overset{P}{\to} X \implies \varphi(X_n) \overset{P}{\to} \varphi(X)$
    - $X_n \overset{D}{\to} X \implies \varphi(X_n) \overset{D}{\to} \varphi(X)$
    - $X_n \overset{qm}{\to} b \iff \lim_{n\to\infty} \mathbb{E}[X_n] = b$ and $\lim_{n\to\infty} \mathbb{V}[X_n] = 0$

    Let $X_1, \dots, X_n$ be iid with $\mathbb{E}[X_i] = \mu$ and $\mathbb{V}[X_i] = \sigma^2$.

    10.1 Law of Large Numbers (LLN)

    Weak (WLLN): $\bar{X}_n \overset{P}{\to} \mu$ as $n \to \infty$.  Strong (SLLN): $\bar{X}_n \overset{as}{\to} \mu$ as $n \to \infty$.

    10.2 Central Limit Theorem (CLT)

    $Z_n := \frac{\bar{X}_n - \mu}{\sqrt{\mathbb{V}[\bar{X}_n]}} = \frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \overset{D}{\to} Z \sim \mathcal{N}(0, 1)$

  • 11.2 Normal-Based Confidence Interval

    Suppose $\hat{\theta}_n \approx \mathcal{N}(\theta, \widehat{\mathrm{se}}^2)$. Let $z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)$, i.e., $\mathbb{P}[Z > z_{\alpha/2}] = \alpha/2$ and $\mathbb{P}[-z_{\alpha/2} < Z < z_{\alpha/2}] = 1 - \alpha$, where $Z \sim \mathcal{N}(0, 1)$. Then

    $C_n = \hat{\theta}_n \pm z_{\alpha/2}\,\widehat{\mathrm{se}}$

    11.3 Empirical distribution

    Empirical Distribution Function (ECDF)

    $\hat{F}_n(x) = \frac{\sum_{i=1}^{n} I(X_i \le x)}{n}$, $\qquad I(X_i \le x) = \begin{cases} 1 & X_i \le x \\ 0 & X_i > x \end{cases}$

    Properties (for any fixed $x$)

    - $\mathbb{E}[\hat{F}_n(x)] = F(x)$
    - $\mathbb{V}[\hat{F}_n(x)] = \frac{F(x)(1 - F(x))}{n}$
    - $\mathrm{mse} = \frac{F(x)(1 - F(x))}{n} \to 0$
    - $\hat{F}_n(x) \overset{P}{\to} F(x)$

    Dvoretzky-Kiefer-Wolfowitz (DKW) inequality ($X_1, \dots, X_n \sim F$)

    $\mathbb{P}\left[\sup_x \left|F(x) - \hat{F}_n(x)\right| > \varepsilon\right] \le 2 e^{-2n\varepsilon^2}$

    Nonparametric $1 - \alpha$ confidence band for $F$

    $L(x) = \max\{\hat{F}_n(x) - \varepsilon_n, 0\}$, $\quad U(x) = \min\{\hat{F}_n(x) + \varepsilon_n, 1\}$, $\quad \varepsilon_n = \sqrt{\frac{1}{2n}\log\left(\frac{2}{\alpha}\right)}$

    $\mathbb{P}[L(x) \le F(x) \le U(x)\ \forall x] \ge 1 - \alpha$
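    The ECDF and DKW band take only a few lines of NumPy; a minimal sketch with arbitrary example data and $\alpha = 0.05$:

    ```python
    # Sketch: ECDF with the DKW 1 - alpha confidence band.
    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(size=200)
    alpha = 0.05

    xs = np.sort(x)
    n = xs.size
    ecdf = np.arange(1, n + 1) / n               # F_hat evaluated at the sorted points
    eps = np.sqrt(np.log(2 / alpha) / (2 * n))   # DKW half-width

    lower = np.clip(ecdf - eps, 0, 1)
    upper = np.clip(ecdf + eps, 0, 1)

    print("band half-width:", round(eps, 4))
    print("first points (x, L, U):",
          list(zip(xs[:3].round(2), lower[:3].round(3), upper[:3].round(3))))
    ```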

    11.4 Statistical Functionals

    - Statistical functional: $T(F)$
    - Plug-in estimator of $\theta = T(F)$: $\hat{\theta}_n = T(\hat{F}_n)$
    - Linear functional: $T(F) = \int \varphi(x)\,dF_X(x)$
    - Plug-in estimator for a linear functional:
      $T(\hat{F}_n) = \int \varphi(x)\,d\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n} \varphi(X_i)$
    - Often: $T(\hat{F}_n) \approx \mathcal{N}(T(F), \widehat{\mathrm{se}}^2) \implies T(\hat{F}_n) \pm z_{\alpha/2}\,\widehat{\mathrm{se}}$
    - $p^{\text{th}}$ quantile: $F^{-1}(p) = \inf\{x : F(x) \ge p\}$
    - $\hat{\mu} = \bar{X}_n$
    - $\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X}_n)^2$
    - Skewness: $\hat{\kappa} = \frac{\frac{1}{n}\sum_{i=1}^{n} (X_i - \hat{\mu})^3}{\hat{\sigma}^3}$
    - Correlation: $\hat{\rho} = \frac{\sum_{i=1}^{n} (X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X}_n)^2}\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y}_n)^2}}$

    12 Parametric Inference

    Let $\mathcal{F} = \{f(x; \theta) : \theta \in \Theta\}$ be a parametric model with parameter space $\Theta \subseteq \mathbb{R}^k$ and parameter $\theta = (\theta_1, \dots, \theta_k)$.

    12.1 Method of Moments

    $j^{\text{th}}$ moment

    $\alpha_j(\theta) = \mathbb{E}[X^j] = \int x^j\,dF_X(x)$

    $j^{\text{th}}$ sample moment

    $\hat{\alpha}_j = \frac{1}{n}\sum_{i=1}^{n} X_i^j$

    Method of moments estimator (MoM): $\hat{\theta}_n$ solves

    $\alpha_1(\hat{\theta}_n) = \hat{\alpha}_1, \quad \alpha_2(\hat{\theta}_n) = \hat{\alpha}_2, \quad \dots, \quad \alpha_k(\hat{\theta}_n) = \hat{\alpha}_k$

  • Properties of the MoM estimator

    - $\hat{\theta}_n$ exists with probability tending to 1
    - Consistency: $\hat{\theta}_n \overset{P}{\to} \theta$
    - Asymptotic normality: $\sqrt{n}(\hat{\theta} - \theta) \overset{D}{\to} \mathcal{N}(0, \Sigma)$,
      where $\Sigma = g\,\mathbb{E}[Y Y^T]\,g^T$, $Y = (X, X^2, \dots, X^k)^T$, $g = (g_1, \dots, g_k)$ and $g_j = \frac{\partial \alpha_j^{-1}(\theta)}{\partial \theta}$

    12.2 Maximum Likelihood

    Likelihood: $\mathcal{L}_n : \Theta \to [0, \infty)$

    $\mathcal{L}_n(\theta) = \prod_{i=1}^{n} f(X_i; \theta)$

    Log-likelihood

    $\ell_n(\theta) = \log \mathcal{L}_n(\theta) = \sum_{i=1}^{n} \log f(X_i; \theta)$

    Maximum likelihood estimator (mle)

    $\mathcal{L}_n(\hat{\theta}_n) = \sup_\theta \mathcal{L}_n(\theta)$

    Score function

    $s(X; \theta) = \frac{\partial}{\partial\theta}\log f(X; \theta)$

    Fisher information

    $I(\theta) = \mathbb{V}_\theta[s(X; \theta)]$, $\qquad I_n(\theta) = n\,I(\theta)$

    Fisher information (exponential family)

    $I(\theta) = -\mathbb{E}_\theta\left[\frac{\partial}{\partial\theta}\,s(X; \theta)\right]$

    Observed Fisher information

    $I_n^{obs}(\theta) = -\frac{\partial^2}{\partial\theta^2}\sum_{i=1}^{n}\log f(X_i; \theta)$

    Properties of the mle

    - Consistency: $\hat{\theta}_n \overset{P}{\to} \theta$
    - Equivariance: $\hat{\theta}_n$ is the mle $\implies \varphi(\hat{\theta}_n)$ is the mle of $\varphi(\theta)$
    - Asymptotic normality:
      1. $\mathrm{se} \approx \sqrt{1/I_n(\theta)}$ and $\frac{\hat{\theta}_n - \theta}{\mathrm{se}} \overset{D}{\to} \mathcal{N}(0, 1)$
      2. $\widehat{\mathrm{se}} \approx \sqrt{1/I_n(\hat{\theta}_n)}$ and $\frac{\hat{\theta}_n - \theta}{\widehat{\mathrm{se}}} \overset{D}{\to} \mathcal{N}(0, 1)$
    - Asymptotic optimality (or efficiency), i.e., smallest variance for large samples. If $\tilde{\theta}_n$ is any other estimator, the asymptotic relative efficiency is $\mathrm{are}(\tilde{\theta}_n, \hat{\theta}_n) = \frac{\mathbb{V}[\hat{\theta}_n]}{\mathbb{V}[\tilde{\theta}_n]} \le 1$
    - Approximately the Bayes estimator

    12.2.1 Delta Method

    If $\tau = \varphi(\theta)$ where $\varphi$ is differentiable and $\varphi'(\theta) \ne 0$:

    $\frac{\hat{\tau}_n - \tau}{\widehat{\mathrm{se}}(\hat{\tau})} \overset{D}{\to} \mathcal{N}(0, 1)$

    where $\hat{\tau} = \varphi(\hat{\theta}_n)$ is the mle of $\tau$ and

    $\widehat{\mathrm{se}}(\hat{\tau}) = |\varphi'(\hat{\theta}_n)|\,\widehat{\mathrm{se}}(\hat{\theta}_n)$

    12.3 Multiparameter Models

    Let $\theta = (\theta_1, \dots, \theta_k)$ and let $\hat{\theta} = (\hat{\theta}_1, \dots, \hat{\theta}_k)$ be the mle.

    $H_{jj} = \frac{\partial^2 \ell_n}{\partial\theta_j^2} \qquad H_{jk} = \frac{\partial^2 \ell_n}{\partial\theta_j\,\partial\theta_k}$

    Fisher information matrix

    $I_n(\theta) = -\begin{pmatrix} \mathbb{E}_\theta[H_{11}] & \cdots & \mathbb{E}_\theta[H_{1k}] \\ \vdots & \ddots & \vdots \\ \mathbb{E}_\theta[H_{k1}] & \cdots & \mathbb{E}_\theta[H_{kk}] \end{pmatrix}$

    Under appropriate regularity conditions

    $(\hat{\theta} - \theta) \approx \mathcal{N}(0, J_n)$

    with $J_n(\theta) = I_n^{-1}(\theta)$. Further, if $\hat{\theta}_j$ is the $j^{\text{th}}$ component of $\hat{\theta}$, then

    $\frac{\hat{\theta}_j - \theta_j}{\widehat{\mathrm{se}}_j} \overset{D}{\to} \mathcal{N}(0, 1)$

    where $\widehat{\mathrm{se}}_j^2 = J_n(j, j)$ and $\mathrm{Cov}[\hat{\theta}_j, \hat{\theta}_k] = J_n(j, k)$.

    12.3.1 Multiparameter delta method

    Let $\tau = \varphi(\theta_1, \dots, \theta_k)$ and let the gradient of $\varphi$ be

    $\nabla\varphi = \begin{pmatrix} \frac{\partial\varphi}{\partial\theta_1} \\ \vdots \\ \frac{\partial\varphi}{\partial\theta_k} \end{pmatrix}$

    Suppose $\nabla\varphi\big|_{\theta=\hat{\theta}} \ne 0$ and let $\hat{\tau} = \varphi(\hat{\theta})$. Then,

    $\frac{\hat{\tau} - \tau}{\widehat{\mathrm{se}}(\hat{\tau})} \overset{D}{\to} \mathcal{N}(0, 1)$

    where

    $\widehat{\mathrm{se}}(\hat{\tau}) = \sqrt{\left(\widehat{\nabla}\varphi\right)^T \hat{J}_n \left(\widehat{\nabla}\varphi\right)}$

    and $\hat{J}_n = J_n(\hat{\theta})$ and $\widehat{\nabla}\varphi = \nabla\varphi\big|_{\theta=\hat{\theta}}$.

    12.4 Parametric Bootstrap

    Sample from $f(x; \hat{\theta}_n)$ instead of from $\hat{F}_n$, where $\hat{\theta}_n$ could be the mle or the method of moments estimator.

    13 Hypothesis Testing

    $H_0 : \theta \in \Theta_0$ versus $H_1 : \theta \in \Theta_1$

    Definitions

    - Null hypothesis $H_0$
    - Alternative hypothesis $H_1$
    - Simple hypothesis: $\theta = \theta_0$
    - Composite hypothesis: $\theta > \theta_0$ or $\theta < \theta_0$
    - Two-sided test: $H_0 : \theta = \theta_0$ versus $H_1 : \theta \ne \theta_0$
    - One-sided test: $H_0 : \theta \le \theta_0$ versus $H_1 : \theta > \theta_0$
    - Critical value $c$
    - Test statistic $T$
    - Rejection region $R = \{x : T(x) > c\}$
    - Power function $\beta(\theta) = \mathbb{P}_\theta[X \in R]$
    - Power of a test: $1 - \mathbb{P}[\text{Type II error}] = 1 - \beta = \inf_{\theta \in \Theta_1} \beta(\theta)$
    - Test size: $\alpha = \mathbb{P}[\text{Type I error}] = \sup_{\theta \in \Theta_0} \beta(\theta)$

                    Retain H0                     Reject H0
    H0 true         correct                       Type I error (α)
    H1 true         Type II error (β)             correct (power)

    p-value

    - p-value $= \sup_{\theta \in \Theta_0} \mathbb{P}_\theta[T(X) \ge T(x)] = \inf\{\alpha : T(x) \in R_\alpha\}$
    - p-value $= \sup_{\theta \in \Theta_0} \mathbb{P}_\theta[T(X^\star) \ge T(X)] \approx 1 - F(T(X))$ since $T(X^\star) \sim F$

    p-value        evidence
    < 0.01         very strong evidence against H0
    0.01 - 0.05    strong evidence against H0
    0.05 - 0.1     weak evidence against H0
    > 0.1          little or no evidence against H0

    Wald test

    - Two-sided test
    - Reject $H_0$ when $|W| > z_{\alpha/2}$ where $W = \frac{\hat{\theta} - \theta_0}{\widehat{\mathrm{se}}}$
    - $\mathbb{P}[|W| > z_{\alpha/2}] \to \alpha$
    - p-value $= \mathbb{P}_{\theta_0}[|W| > |w|] \approx \mathbb{P}[|Z| > |w|] = 2\Phi(-|w|)$

    Likelihood ratio test (LRT)

    $T(X) = \frac{\sup_{\theta \in \Theta} \mathcal{L}_n(\theta)}{\sup_{\theta \in \Theta_0} \mathcal{L}_n(\theta)} = \frac{\mathcal{L}_n(\hat{\theta}_n)}{\mathcal{L}_n(\hat{\theta}_{n,0})}$

  • $\lambda(X) = 2\log T(X) \overset{D}{\to} \chi^2_{r-q}$, where $\sum_{i=1}^{k} Z_i^2 \sim \chi^2_k$ for $Z_1, \dots, Z_k$ iid $\mathcal{N}(0, 1)$

    p-value $= \mathbb{P}_{\theta_0}[\lambda(X) > \lambda(x)] \approx \mathbb{P}\left[\chi^2_{r-q} > \lambda(x)\right]$

    Multinomial LRT

    - mle: $\hat{p}_n = \left(\frac{X_1}{n}, \dots, \frac{X_k}{n}\right)$
    - $T(X) = \frac{\mathcal{L}_n(\hat{p}_n)}{\mathcal{L}_n(p_0)} = \prod_{j=1}^{k} \left(\frac{\hat{p}_j}{p_{0j}}\right)^{X_j}$
    - $\lambda(X) = 2\sum_{j=1}^{k} X_j \log\left(\frac{\hat{p}_j}{p_{0j}}\right) \overset{D}{\to} \chi^2_{k-1}$
    - The approximate size $\alpha$ LRT rejects $H_0$ when $\lambda(X) \ge \chi^2_{k-1,\alpha}$

    Pearson Chi-square Test

    - $T = \sum_{j=1}^{k} \frac{(X_j - \mathbb{E}[X_j])^2}{\mathbb{E}[X_j]}$ where $\mathbb{E}[X_j] = n p_{0j}$ under $H_0$
    - $T \overset{D}{\to} \chi^2_{k-1}$
    - p-value $= \mathbb{P}\left[\chi^2_{k-1} > T(x)\right]$
    - Faster convergence to $\chi^2_{k-1}$ than the LRT, hence preferable for small $n$

    Independence testing

    - $I$ rows, $J$ columns, $X$ multinomial sample of size $n = I \cdot J$
    - mles unconstrained: $\hat{p}_{ij} = \frac{X_{ij}}{n}$
    - mles under $H_0$: $\hat{p}_{0ij} = \hat{p}_{i\cdot}\,\hat{p}_{\cdot j} = \frac{X_{i\cdot}}{n}\,\frac{X_{\cdot j}}{n}$
    - LRT: $\lambda = 2\sum_{i=1}^{I}\sum_{j=1}^{J} X_{ij}\log\left(\frac{n X_{ij}}{X_{i\cdot} X_{\cdot j}}\right)$
    - Pearson chi-square: $T = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{(X_{ij} - \mathbb{E}[X_{ij}])^2}{\mathbb{E}[X_{ij}]}$
    - LRT and Pearson $\overset{D}{\to} \chi^2_\nu$, where $\nu = (I - 1)(J - 1)$
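    A small sketch of the Pearson independence test on a 2x3 contingency table (made-up counts; assumes NumPy and SciPy):

    ```python
    # Sketch: Pearson chi-square test of independence on a made-up 2x3 table.
    import numpy as np
    from scipy import stats

    X = np.array([[20, 30, 25],
                  [15, 40, 20]])

    n = X.sum()
    expected = np.outer(X.sum(axis=1), X.sum(axis=0)) / n   # E[X_ij] = X_i. * X_.j / n
    T = ((X - expected) ** 2 / expected).sum()
    nu = (X.shape[0] - 1) * (X.shape[1] - 1)
    p_value = stats.chi2.sf(T, df=nu)

    print(f"T = {T:.3f}, df = {nu}, p-value = {p_value:.3f}")
    ```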

    14 Bayesian Inference

    Bayes' Theorem

    $f(\theta \,|\, x^n) = \frac{f(x^n \,|\, \theta) f(\theta)}{f(x^n)} = \frac{f(x^n \,|\, \theta) f(\theta)}{\int f(x^n \,|\, \theta) f(\theta)\,d\theta} \propto \mathcal{L}_n(\theta) f(\theta)$

    Definitions

    - $X^n = (X_1, \dots, X_n)$, $\quad x^n = (x_1, \dots, x_n)$
    - Prior density $f(\theta)$
    - Likelihood $f(x^n \,|\, \theta)$: joint density of the data; in particular, if $X^n$ is iid, $f(x^n \,|\, \theta) = \prod_{i=1}^{n} f(x_i \,|\, \theta) = \mathcal{L}_n(\theta)$
    - Posterior density $f(\theta \,|\, x^n)$
    - Normalizing constant $c_n = f(x^n) = \int f(x^n \,|\, \theta) f(\theta)\,d\theta$
    - Kernel: part of a density that depends on $\theta$
    - Posterior mean $\bar{\theta}_n = \int \theta f(\theta \,|\, x^n)\,d\theta = \frac{\int \theta\,\mathcal{L}_n(\theta) f(\theta)\,d\theta}{\int \mathcal{L}_n(\theta) f(\theta)\,d\theta}$

    14.1 Credible Intervals

    Posterior interval

    $\mathbb{P}[\theta \in (a, b) \,|\, x^n] = \int_a^b f(\theta \,|\, x^n)\,d\theta = 1 - \alpha$

    Equal-tail credible interval

    $\int_{-\infty}^{a} f(\theta \,|\, x^n)\,d\theta = \int_{b}^{\infty} f(\theta \,|\, x^n)\,d\theta = \alpha/2$

    Highest posterior density (HPD) region $R_n$:

    1. $\mathbb{P}[\theta \in R_n] = 1 - \alpha$
    2. $R_n = \{\theta : f(\theta \,|\, x^n) > k\}$ for some $k$

    $R_n$ is unimodal $\implies R_n$ is an interval.

    14.2 Function of parameters

    Let $\tau = \varphi(\theta)$ and $A = \{\theta : \varphi(\theta) \le \tau\}$.

    Posterior CDF for $\tau$

    $H(\tau \,|\, x^n) = \mathbb{P}[\varphi(\theta) \le \tau \,|\, x^n] = \int_A f(\theta \,|\, x^n)\,d\theta$

    Posterior density

    $h(\tau \,|\, x^n) = H'(\tau \,|\, x^n)$

    Bayesian delta method

    $\tau \,|\, X^n \approx \mathcal{N}\left(\varphi(\hat{\theta}),\ \widehat{\mathrm{se}}\,|\varphi'(\hat{\theta})|\right)$

  • 14.3 Priors

    Choice

    - Subjective Bayesianism
    - Objective Bayesianism
    - Robust Bayesianism

    Types

    - Flat: $f(\theta) \propto$ constant
    - Proper: $\int f(\theta)\,d\theta = 1$
    - Improper: $\int f(\theta)\,d\theta = \infty$
    - Jeffreys' prior (transformation-invariant): $f(\theta) \propto \sqrt{I(\theta)}$; multiparameter: $f(\theta) \propto \sqrt{\det(I(\theta))}$
    - Conjugate: $f(\theta)$ and $f(\theta \,|\, x^n)$ belong to the same parametric family

    14.3.1 Conjugate Priors

    Discrete likelihood

    Likelihood        Conjugate prior    Posterior hyperparameters
    Bern(p)           Beta(α, β)         α + Σᵢ xᵢ,  β + n − Σᵢ xᵢ
    Bin(p)            Beta(α, β)         α + Σᵢ xᵢ,  β + Σᵢ Nᵢ − Σᵢ xᵢ
    NBin(p)           Beta(α, β)         α + rn,  β + Σᵢ xᵢ
    Po(λ)             Gamma(α, β)        α + Σᵢ xᵢ,  β + n
    Multinomial(p)    Dir(α)             α + Σᵢ x⁽ⁱ⁾
    Geo(p)            Beta(α, β)         α + n,  β + Σᵢ xᵢ

    Continuous likelihood (subscript c denotes a known constant)

    Likelihood        Conjugate prior                           Posterior hyperparameters
    Unif(0, θ)        Pareto(x_m, k)                            max{x_(n), x_m},  k + n
    Exp(λ)            Gamma(α, β)                               α + n,  β + Σᵢ xᵢ
    N(μ, σ_c²)        N(μ₀, σ₀²)                                (μ₀/σ₀² + Σᵢ xᵢ/σ_c²) / (1/σ₀² + n/σ_c²),  (1/σ₀² + n/σ_c²)⁻¹
    N(μ_c, σ²)        Scaled Inverse Chi-square(ν, σ₀²)         ν + n,  (ν σ₀² + Σᵢ (xᵢ − μ)²) / (ν + n)
    N(μ, σ²)          Normal-scaled Inverse Gamma(λ, ν, α, β)   (νλ + n x̄)/(ν + n),  ν + n,  α + n/2,  β + ½ Σᵢ (xᵢ − x̄)² + n ν (x̄ − λ)² / (2(n + ν))
    MVN(μ, Σ_c)       MVN(μ₀, Σ₀)                               (Σ₀⁻¹ + n Σ_c⁻¹)⁻¹ (Σ₀⁻¹ μ₀ + n Σ_c⁻¹ x̄),  (Σ₀⁻¹ + n Σ_c⁻¹)⁻¹
    MVN(μ_c, Σ)       Inverse-Wishart(κ, Ψ)                     n + κ,  Ψ + Σᵢ (xᵢ − μ_c)(xᵢ − μ_c)ᵀ
    Pareto(x_mc, k)   Gamma(α, β)                               α + n,  β + Σᵢ log(xᵢ / x_mc)
    Pareto(x_m, k_c)  Pareto(x₀, k₀)                            x₀,  k₀ − k n  where k₀ > k n
    Gamma(α_c, β)     Gamma(α₀, β₀)                             α₀ + n α_c,  β₀ + Σᵢ xᵢ

    14.4 Bayesian Testing

    If $H_0 : \theta \in \Theta_0$:

    - Prior probability $\mathbb{P}[H_0] = \int_{\Theta_0} f(\theta)\,d\theta$
    - Posterior probability $\mathbb{P}[H_0 \,|\, x^n] = \int_{\Theta_0} f(\theta \,|\, x^n)\,d\theta$

    Let $H_0, \dots, H_{K-1}$ be $K$ hypotheses. Suppose $\theta \sim f(\theta \,|\, H_k)$:

    $\mathbb{P}[H_k \,|\, x^n] = \frac{f(x^n \,|\, H_k)\,\mathbb{P}[H_k]}{\sum_{k=1}^{K} f(x^n \,|\, H_k)\,\mathbb{P}[H_k]}$

    Marginal likelihood

    $f(x^n \,|\, H_i) = \int_\Theta f(x^n \,|\, \theta, H_i)\,f(\theta \,|\, H_i)\,d\theta$

    Posterior odds (of $H_i$ relative to $H_j$)

    $\frac{\mathbb{P}[H_i \,|\, x^n]}{\mathbb{P}[H_j \,|\, x^n]} = \frac{f(x^n \,|\, H_i)}{f(x^n \,|\, H_j)} \cdot \frac{\mathbb{P}[H_i]}{\mathbb{P}[H_j]}$, i.e., Bayes factor $BF_{ij}$ times the prior odds.

    Bayes factor

    log10 BF10    BF10        evidence
    0 - 0.5       1 - 1.5     Weak
    0.5 - 1       1.5 - 10    Moderate
    1 - 2         10 - 100    Strong
    > 2           > 100       Decisive

    $p^* = \frac{\frac{p}{1-p}\,BF_{10}}{1 + \frac{p}{1-p}\,BF_{10}}$, where $p = \mathbb{P}[H_1]$ and $p^* = \mathbb{P}[H_1 \,|\, x^n]$

    15 Exponential Family

    Scalar parameter

    $f_X(x \,|\, \theta) = h(x)\exp\{\eta(\theta) T(x) - A(\theta)\} = h(x) g(\theta) \exp\{\eta(\theta) T(x)\}$

    Vector parameter

    $f_X(x \,|\, \theta) = h(x)\exp\left\{\sum_{i=1}^{s}\eta_i(\theta) T_i(x) - A(\theta)\right\} = h(x)\exp\{\eta(\theta) \cdot T(x) - A(\theta)\} = h(x) g(\theta) \exp\{\eta(\theta) \cdot T(x)\}$

    Natural form

    $f_X(x \,|\, \eta) = h(x)\exp\{\eta \cdot T(x) - A(\eta)\} = h(x) g(\eta) \exp\{\eta \cdot T(x)\} = h(x) g(\eta) \exp\left\{\eta^T T(x)\right\}$

    16 Sampling Methods

    16.1 The Bootstrap

    Let $T_n = g(X_1, \dots, X_n)$ be a statistic.

    1. Estimate $\mathbb{V}_F[T_n]$ with $\mathbb{V}_{\hat{F}_n}[T_n]$.
    2. Approximate $\mathbb{V}_{\hat{F}_n}[T_n]$ using simulation:
       (a) Repeat the following $B$ times to get $T^*_{n,1}, \dots, T^*_{n,B}$, an iid sample from the sampling distribution implied by $\hat{F}_n$:
           i. Sample uniformly $X^*_1, \dots, X^*_n \sim \hat{F}_n$.
           ii. Compute $T^*_n = g(X^*_1, \dots, X^*_n)$.
       (b) Then
           $v_{boot} = \hat{\mathbb{V}}_{\hat{F}_n} = \frac{1}{B}\sum_{b=1}^{B}\left(T^*_{n,b} - \frac{1}{B}\sum_{r=1}^{B} T^*_{n,r}\right)^2$

    16.1.1 Bootstrap Confidence Intervals

    Normal-based interval

    $T_n \pm z_{\alpha/2}\,\widehat{\mathrm{se}}_{boot}$

    Pivotal interval

    1. Location parameter $\theta = T(F)$
    2. Pivot $R_n = \hat{\theta}_n - \theta$
    3. Let $H(r) = \mathbb{P}[R_n \le r]$ be the cdf of $R_n$
    4. Let $R^*_{n,b} = \hat{\theta}^*_{n,b} - \hat{\theta}_n$. Approximate $H$ using the bootstrap:
       $\hat{H}(r) = \frac{1}{B}\sum_{b=1}^{B} I(R^*_{n,b} \le r)$
    5. $\theta^*_\beta$ = $\beta$ sample quantile of $(\hat{\theta}^*_{n,1}, \dots, \hat{\theta}^*_{n,B})$
    6. $r^*_\beta$ = $\beta$ sample quantile of $(R^*_{n,1}, \dots, R^*_{n,B})$, i.e., $r^*_\beta = \theta^*_\beta - \hat{\theta}_n$
    7. Approximate $1 - \alpha$ confidence interval $C_n = (\hat{a}, \hat{b})$ where
       $\hat{a} = \hat{\theta}_n - \hat{H}^{-1}\left(1 - \frac{\alpha}{2}\right) = \hat{\theta}_n - r^*_{1-\alpha/2} = 2\hat{\theta}_n - \theta^*_{1-\alpha/2}$
       $\hat{b} = \hat{\theta}_n - \hat{H}^{-1}\left(\frac{\alpha}{2}\right) = \hat{\theta}_n - r^*_{\alpha/2} = 2\hat{\theta}_n - \theta^*_{\alpha/2}$

    Percentile interval

    $C_n = \left(\theta^*_{\alpha/2},\ \theta^*_{1-\alpha/2}\right)$

  • 16.2 Rejection Sampling

    Setup

    - We can easily sample from $g(\theta)$
    - We want to sample from $h(\theta)$, but it is difficult
    - We know $h(\theta)$ up to a proportional constant: $h(\theta) = \frac{k(\theta)}{\int k(\theta)\,d\theta}$
    - Envelope condition: we can find $M > 0$ such that $k(\theta) \le M g(\theta)$ for all $\theta$

    Algorithm

    1. Draw $\theta^{cand} \sim g(\theta)$
    2. Generate $u \sim \mathrm{Unif}(0, 1)$
    3. Accept $\theta^{cand}$ if $u \le \frac{k(\theta^{cand})}{M g(\theta^{cand})}$
    4. Repeat until $B$ values of $\theta^{cand}$ have been accepted

    Example

    - We can easily sample from the prior, $g(\theta) = f(\theta)$
    - Target is the posterior, $h(\theta) \propto k(\theta) = f(x^n \,|\, \theta) f(\theta)$
    - Envelope condition: $f(x^n \,|\, \theta) \le f(x^n \,|\, \hat{\theta}_n) = \mathcal{L}_n(\hat{\theta}_n) \equiv M$
    - Algorithm:
      1. Draw $\theta^{cand} \sim f(\theta)$
      2. Generate $u \sim \mathrm{Unif}(0, 1)$
      3. Accept $\theta^{cand}$ if $u \le \frac{\mathcal{L}_n(\theta^{cand})}{\mathcal{L}_n(\hat{\theta}_n)}$
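    A generic rejection-sampling sketch (assuming NumPy; the target here is a Beta(2, 5) density known only up to its kernel, with a Unif(0, 1) proposal as an arbitrary example):

    ```python
    # Sketch: rejection sampling from a density known up to a constant.
    # Target kernel k(theta) = theta * (1 - theta)^4 (Beta(2, 5) up to a constant),
    # proposal g = Unif(0, 1), envelope constant M chosen so k(theta) <= M * g(theta).
    import numpy as np

    rng = np.random.default_rng(7)

    def k(theta):
        return theta * (1 - theta) ** 4

    M = 0.1                       # any M >= max k (here max k is about 0.082)
    accepted = []
    while len(accepted) < 5_000:
        cand = rng.uniform()      # theta_cand ~ g
        u = rng.uniform()
        if u <= k(cand) / M:      # accept with probability k(cand) / (M g(cand))
            accepted.append(cand)

    samples = np.array(accepted)
    print("sample mean:", samples.mean(), "(Beta(2, 5) mean = 2/7 ~ 0.286)")
    ```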

    16.3 Importance Sampling

    Sample from an importance function $g$ rather than the target density $h$.

    Algorithm to obtain an approximation to $\mathbb{E}[q(\theta) \,|\, x^n]$:

    1. Sample from the prior $\theta_1, \dots, \theta_B \overset{iid}{\sim} f(\theta)$
    2. $w_i = \frac{\mathcal{L}_n(\theta_i)}{\sum_{i=1}^{B}\mathcal{L}_n(\theta_i)}$, $\quad i = 1, \dots, B$
    3. $\mathbb{E}[q(\theta) \,|\, x^n] \approx \sum_{i=1}^{B} q(\theta_i)\,w_i$

  • 17 Decision Theory

    Definitions

    - Unknown quantity affecting our decision: $\theta \in \Theta$
    - Decision rule: synonymous with an estimator $\hat{\theta}$
    - Action $a \in \mathcal{A}$: possible value of the decision rule; in the estimation context, the action is just an estimate of $\theta$, $\hat{\theta}(x)$
    - Loss function $L$: consequences of taking action $a$ when the true state is $\theta$, or the discrepancy between $\theta$ and $a$; $L : \Theta \times \mathcal{A} \to [-k, \infty)$

    Loss functions

    - Squared error loss: $L(\theta, a) = (\theta - a)^2$
    - Linear loss: $L(\theta, a) = \begin{cases} K_1(\theta - a) & a - \theta < 0 \\ K_2(a - \theta) & a - \theta \ge 0 \end{cases}$
    - Absolute error loss: $L(\theta, a) = |\theta - a|$ (linear loss with $K_1 = K_2 = 1$)
    - $L_p$ loss: $L(\theta, a) = |\theta - a|^p$
    - Zero-one loss: $L(\theta, a) = \begin{cases} 0 & a = \theta \\ 1 & a \ne \theta \end{cases}$

    17.1 Risk

    Posterior risk

    $r(\hat{\theta} \,|\, x) = \int L(\theta, \hat{\theta}(x))\,f(\theta \,|\, x)\,d\theta = \mathbb{E}_{\theta|X}\left[L(\theta, \hat{\theta}(x))\right]$

    (Frequentist) risk

    $R(\theta, \hat{\theta}) = \int L(\theta, \hat{\theta}(x))\,f(x \,|\, \theta)\,dx = \mathbb{E}_{X|\theta}\left[L(\theta, \hat{\theta}(X))\right]$

    Bayes risk

    $r(f, \hat{\theta}) = \int\!\!\int L(\theta, \hat{\theta}(x))\,f(x, \theta)\,dx\,d\theta = \mathbb{E}_{\theta,X}\left[L(\theta, \hat{\theta}(X))\right]$

    $r(f, \hat{\theta}) = \mathbb{E}_\theta\left[\mathbb{E}_{X|\theta}\left[L(\theta, \hat{\theta}(X))\right]\right] = \mathbb{E}_\theta\left[R(\theta, \hat{\theta})\right]$

    $r(f, \hat{\theta}) = \mathbb{E}_X\left[\mathbb{E}_{\theta|X}\left[L(\theta, \hat{\theta}(X))\right]\right] = \mathbb{E}_X\left[r(\hat{\theta} \,|\, X)\right]$

    17.2 Admissibility

    - $\hat{\theta}'$ dominates $\hat{\theta}$ if $\forall\theta : R(\theta, \hat{\theta}') \le R(\theta, \hat{\theta})$ and $\exists\theta : R(\theta, \hat{\theta}') < R(\theta, \hat{\theta})$
    - $\hat{\theta}$ is inadmissible if there is at least one other estimator $\hat{\theta}'$ that dominates it. Otherwise it is called admissible.

  • 17.3 Bayes Rule

    Bayes rule (or Bayes estimator)

    - $r(f, \hat{\theta}) = \inf_{\tilde{\theta}} r(f, \tilde{\theta})$
    - $\hat{\theta}(x) = \inf_a r(a \,|\, x)$ for every $x \implies r(f, \hat{\theta}) = \int r(\hat{\theta}(x) \,|\, x)\,f(x)\,dx$

    Theorems

    - Squared error loss: posterior mean
    - Absolute error loss: posterior median
    - Zero-one loss: posterior mode

    17.4 Minimax Rules

    Maximum risk

    $\bar{R}(\hat{\theta}) = \sup_\theta R(\theta, \hat{\theta}) \qquad \bar{R}(a) = \sup_\theta R(\theta, a)$

    Minimax rule

    $\sup_\theta R(\theta, \hat{\theta}) = \inf_{\tilde{\theta}} \bar{R}(\tilde{\theta}) = \inf_{\tilde{\theta}} \sup_\theta R(\theta, \tilde{\theta})$

    $\hat{\theta}$ = Bayes rule with constant risk ($\exists c : R(\theta, \hat{\theta}) = c$) $\implies \hat{\theta}$ is minimax

    Least favorable prior

    $\hat{\theta}^f$ = Bayes rule and $R(\theta, \hat{\theta}^f) \le r(f, \hat{\theta}^f)\ \forall\theta \implies \hat{\theta}^f$ is minimax and $f$ is least favorable

    18 Linear Regression

    Definitions

    - Response variable $Y$
    - Covariate $X$ (aka predictor variable or feature)

    18.1 Simple Linear Regression

    Model

    $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \qquad \mathbb{E}[\varepsilon_i \,|\, X_i] = 0,\ \mathbb{V}[\varepsilon_i \,|\, X_i] = \sigma^2$

    Fitted line

    $\hat{r}(x) = \hat{\beta}_0 + \hat{\beta}_1 x$

    Predicted (fitted) values

    $\hat{Y}_i = \hat{r}(X_i)$

    Residuals

    $\hat{\varepsilon}_i = Y_i - \hat{Y}_i = Y_i - \left(\hat{\beta}_0 + \hat{\beta}_1 X_i\right)$

    Residual sums of squares (rss)

    $\mathrm{rss}(\hat{\beta}_0, \hat{\beta}_1) = \sum_{i=1}^{n} \hat{\varepsilon}_i^2$

    Least square estimates

    $\hat{\beta}^T = (\hat{\beta}_0, \hat{\beta}_1)^T : \min_{\beta_0, \beta_1} \mathrm{rss}$

    $\hat{\beta}_0 = \bar{Y}_n - \hat{\beta}_1 \bar{X}_n$

    $\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sum_{i=1}^{n} (X_i - \bar{X}_n)^2} = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}$

    $\mathbb{E}[\hat{\beta} \,|\, X^n] = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} \qquad \mathbb{V}[\hat{\beta} \,|\, X^n] = \frac{\sigma^2}{n s_X^2}\begin{pmatrix} \frac{1}{n}\sum_{i=1}^{n} X_i^2 & -\bar{X}_n \\ -\bar{X}_n & 1 \end{pmatrix}$

    $\widehat{\mathrm{se}}(\hat{\beta}_0) = \frac{\hat{\sigma}}{s_X\sqrt{n}}\sqrt{\frac{\sum_{i=1}^{n} X_i^2}{n}} \qquad \widehat{\mathrm{se}}(\hat{\beta}_1) = \frac{\hat{\sigma}}{s_X\sqrt{n}}$

    where $s_X^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X}_n)^2$ and $\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^{n} \hat{\varepsilon}_i^2$ (unbiased estimate).

    Further properties:

    - Consistency: $\hat{\beta}_0 \overset{P}{\to} \beta_0$ and $\hat{\beta}_1 \overset{P}{\to} \beta_1$
    - Asymptotic normality: $\frac{\hat{\beta}_0 - \beta_0}{\widehat{\mathrm{se}}(\hat{\beta}_0)} \overset{D}{\to} \mathcal{N}(0, 1)$ and $\frac{\hat{\beta}_1 - \beta_1}{\widehat{\mathrm{se}}(\hat{\beta}_1)} \overset{D}{\to} \mathcal{N}(0, 1)$
    - Approximate $1 - \alpha$ confidence intervals for $\beta_0$ and $\beta_1$: $\hat{\beta}_0 \pm z_{\alpha/2}\,\widehat{\mathrm{se}}(\hat{\beta}_0)$ and $\hat{\beta}_1 \pm z_{\alpha/2}\,\widehat{\mathrm{se}}(\hat{\beta}_1)$
    - Wald test for $H_0 : \beta_1 = 0$ vs. $H_1 : \beta_1 \ne 0$: reject $H_0$ if $|W| > z_{\alpha/2}$ where $W = \hat{\beta}_1 / \widehat{\mathrm{se}}(\hat{\beta}_1)$

    $R^2$

    $R^2 = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} = 1 - \frac{\sum_{i=1}^{n} \hat{\varepsilon}_i^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} = 1 - \frac{\mathrm{rss}}{\mathrm{tss}}$

  • Likelihood

    $\mathcal{L} = \prod_{i=1}^{n} f(X_i, Y_i) = \prod_{i=1}^{n} f_X(X_i) \prod_{i=1}^{n} f_{Y|X}(Y_i \,|\, X_i) = \mathcal{L}_1 \cdot \mathcal{L}_2$

    $\mathcal{L}_1 = \prod_{i=1}^{n} f_X(X_i)$

    $\mathcal{L}_2 = \prod_{i=1}^{n} f_{Y|X}(Y_i \,|\, X_i) \propto \sigma^{-n}\exp\left\{-\frac{1}{2\sigma^2}\sum_i\left(Y_i - (\beta_0 + \beta_1 X_i)\right)^2\right\}$

    Under the assumption of normality, the least squares parameter estimators are also the MLEs, but the least squares variance estimator is not the MLE:

    $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} \hat{\varepsilon}_i^2$
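    A minimal NumPy sketch of the closed-form least squares estimates above (simulated data with arbitrary true coefficients):

    ```python
    # Sketch: closed-form simple linear regression estimates and standard errors.
    import numpy as np

    rng = np.random.default_rng(8)
    n = 200
    x = rng.uniform(0, 10, size=n)
    y = 1.5 + 0.8 * x + rng.normal(scale=1.0, size=n)   # beta0 = 1.5, beta1 = 0.8

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()

    resid = y - (b0 + b1 * x)
    sigma2_hat = resid @ resid / (n - 2)                # unbiased variance estimate
    s2_x = np.mean((x - x.mean()) ** 2)
    se_b1 = np.sqrt(sigma2_hat) / (np.sqrt(s2_x) * np.sqrt(n))
    se_b0 = se_b1 * np.sqrt(np.mean(x ** 2))

    print(f"b0 = {b0:.3f} (se {se_b0:.3f}), b1 = {b1:.3f} (se {se_b1:.3f})")
    ```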

    18.2 Prediction

    Observe a value $X = x_*$ of the covariate and predict the outcome $Y_*$.

    $\hat{Y}_* = \hat{\beta}_0 + \hat{\beta}_1 x_*$

    $\mathbb{V}[\hat{Y}_*] = \mathbb{V}[\hat{\beta}_0] + x_*^2\,\mathbb{V}[\hat{\beta}_1] + 2 x_*\,\mathrm{Cov}[\hat{\beta}_0, \hat{\beta}_1]$

    Prediction interval

    $\hat{\xi}_n^2 = \hat{\sigma}^2\left(\frac{\sum_{i=1}^{n} (X_i - x_*)^2}{n\sum_{i=1}^{n} (X_i - \bar{X})^2} + 1\right) \qquad \hat{Y}_* \pm z_{\alpha/2}\,\hat{\xi}_n$

    18.3 Multiple Regression

    $Y = X\beta + \varepsilon$

    where

    $X = \begin{pmatrix} X_{11} & \cdots & X_{1k} \\ \vdots & \ddots & \vdots \\ X_{n1} & \cdots & X_{nk} \end{pmatrix} \qquad \beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} \qquad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}$

    Likelihood

    $\mathcal{L}(\beta, \sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\,\mathrm{rss}\right\}$

    $\mathrm{rss} = (y - X\beta)^T(y - X\beta) = \|Y - X\beta\|^2 = \sum_{i=1}^{N}(Y_i - x_i^T\beta)^2$

    If the $(k \times k)$ matrix $X^T X$ is invertible,

    - $\hat{\beta} = (X^T X)^{-1} X^T Y$
    - $\mathbb{V}[\hat{\beta} \,|\, X^n] = \sigma^2 (X^T X)^{-1}$
    - $\hat{\beta} \approx \mathcal{N}\left(\beta, \sigma^2 (X^T X)^{-1}\right)$

    Estimate regression function

    $\hat{r}(x) = \sum_{j=1}^{k} \hat{\beta}_j x_j$

    Unbiased estimate for $\sigma^2$

    $\hat{\sigma}^2 = \frac{1}{n - k}\sum_{i=1}^{n} \hat{\varepsilon}_i^2 \qquad \hat{\varepsilon} = X\hat{\beta} - Y$

    mle

    $\hat{\mu} = X\hat{\beta} \qquad \hat{\sigma}^2_{mle} = \frac{n - k}{n}\,\hat{\sigma}^2$

    $1 - \alpha$ confidence interval

    $\hat{\beta}_j \pm z_{\alpha/2}\,\widehat{\mathrm{se}}(\hat{\beta}_j)$
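    The normal-equations estimate above takes a few lines of NumPy; a sketch with a simulated design matrix (lstsq is used for numerical stability rather than forming the inverse explicitly for the fit):

    ```python
    # Sketch: multiple regression via least squares and the normal equations.
    import numpy as np

    rng = np.random.default_rng(9)
    n, k = 500, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # intercept column
    beta_true = np.array([2.0, -1.0, 0.5])
    y = X @ beta_true + rng.normal(scale=0.8, size=n)

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # solves min ||y - X b||^2
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - k)               # unbiased sigma^2 estimate
    cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)     # V[beta_hat | X]

    print("beta_hat:", beta_hat.round(3))
    print("se:      ", np.sqrt(np.diag(cov_beta)).round(3))
    ```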

    18.4 Model Selection

    Consider predicting a new observation $Y_*$ for covariates $X_*$ and let $S \subseteq J$ denote a subset of the covariates in the model, where $|S| = k$ and $|J| = n$.

    Issues

    - Underfitting: too few covariates yields high bias
    - Overfitting: too many covariates yields high variance

    Procedure

    1. Assign a score to each model
    2. Search through all models to find the one with the highest score

    Hypothesis testing

    $H_0 : \beta_j = 0$ vs. $H_1 : \beta_j \ne 0 \quad \forall j \in J$

    Mean squared prediction error (mspe)

    $\mathrm{mspe} = \mathbb{E}\left[(\hat{Y}(S) - Y_*)^2\right]$

    Prediction risk

    $R(S) = \sum_{i=1}^{n} \mathrm{mspe}_i = \sum_{i=1}^{n} \mathbb{E}\left[(\hat{Y}_i(S) - Y^*_i)^2\right]$

  • Training error

    $\hat{R}_{tr}(S) = \sum_{i=1}^{n} (\hat{Y}_i(S) - Y_i)^2$

    $R^2$

    $R^2(S) = 1 - \frac{\mathrm{rss}(S)}{\mathrm{tss}} = 1 - \frac{\hat{R}_{tr}(S)}{\mathrm{tss}} = 1 - \frac{\sum_{i=1}^{n} (\hat{Y}_i(S) - Y_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$

    The training error is a downward-biased estimate of the prediction risk:

    $\mathbb{E}[\hat{R}_{tr}(S)] < R(S)$

    $\mathrm{bias}(\hat{R}_{tr}(S)) = \mathbb{E}[\hat{R}_{tr}(S)] - R(S) = -2\sum_{i=1}^{n}\mathrm{Cov}[\hat{Y}_i, Y_i]$

    Adjusted $R^2$

    $\bar{R}^2(S) = 1 - \frac{n - 1}{n - k}\,\frac{\mathrm{rss}}{\mathrm{tss}}$

    Mallows' $C_p$ statistic

    $\hat{R}(S) = \hat{R}_{tr}(S) + 2 k \hat{\sigma}^2 = \text{lack of fit} + \text{complexity penalty}$

    Akaike Information Criterion (AIC)

    $\mathrm{AIC}(S) = \ell_n(\hat{\beta}_S, \hat{\sigma}^2_S) - k$

    Bayesian Information Criterion (BIC)

    $\mathrm{BIC}(S) = \ell_n(\hat{\beta}_S, \hat{\sigma}^2_S) - \frac{k}{2}\log n$

    Validation and training

    $\hat{R}_V(S) = \sum_{i=1}^{m} (\hat{Y}^*_i(S) - Y^*_i)^2 \qquad m = |\{\text{validation data}\}|$, often $\frac{n}{4}$ or $\frac{n}{2}$

    Leave-one-out cross-validation

    $\hat{R}_{CV}(S) = \sum_{i=1}^{n} (Y_i - \hat{Y}_{(i)})^2 = \sum_{i=1}^{n}\left(\frac{Y_i - \hat{Y}_i(S)}{1 - U_{ii}(S)}\right)^2$

    $U(S) = X_S (X_S^T X_S)^{-1} X_S^T$ (hat matrix)

  • 19 Non-parametric Function Estimation

    19.1 Density Estimation

    Estimate $f(x)$, where $\mathbb{P}[X \in A] = \int_A f(x)\,dx$.

    Integrated square error (ise)

    $L(f, \hat{f}_n) = \int\left(f(x) - \hat{f}_n(x)\right)^2 dx = J(h) + \int f^2(x)\,dx$

    Frequentist risk

    $R(f, \hat{f}_n) = \mathbb{E}\left[L(f, \hat{f}_n)\right] = \int b^2(x)\,dx + \int v(x)\,dx$

    $b(x) = \mathbb{E}[\hat{f}_n(x)] - f(x) \qquad v(x) = \mathbb{V}[\hat{f}_n(x)]$

    19.1.1 Histograms

    Definitions

    - Number of bins $m$
    - Binwidth $h = \frac{1}{m}$
    - Bin $B_j$ has $\nu_j$ observations
    - Define $\hat{p}_j = \nu_j / n$ and $p_j = \int_{B_j} f(u)\,du$

    Histogram estimator

    $\hat{f}_n(x) = \sum_{j=1}^{m}\frac{\hat{p}_j}{h}\,I(x \in B_j)$

    - $\mathbb{E}[\hat{f}_n(x)] = \frac{p_j}{h}$
    - $\mathbb{V}[\hat{f}_n(x)] = \frac{p_j(1 - p_j)}{n h^2}$
    - $R(\hat{f}_n, f) \approx \frac{h^2}{12}\int (f'(u))^2\,du + \frac{1}{nh}$
    - $h^* = \frac{1}{n^{1/3}}\left(\frac{6}{\int (f'(u))^2\,du}\right)^{1/3}$
    - $R^*(\hat{f}_n, f) \approx \frac{C}{n^{2/3}}$, $\quad C = \left(\frac{3}{4}\right)^{2/3}\left(\int (f'(u))^2\,du\right)^{1/3}$

    Cross-validation estimate of $\mathbb{E}[J(h)]$

    $\hat{J}_{CV}(h) = \int \hat{f}_n^2(x)\,dx - \frac{2}{n}\sum_{i=1}^{n}\hat{f}_{(-i)}(X_i) = \frac{2}{(n-1)h} - \frac{n+1}{(n-1)h}\sum_{j=1}^{m}\hat{p}_j^2$

  • 19.1.2 Kernel Density Estimator (KDE)

    Kernel $K$

    - $K(x) \ge 0$
    - $\int K(x)\,dx = 1$
    - $\int x K(x)\,dx = 0$
    - $\int x^2 K(x)\,dx \equiv \sigma^2_K > 0$

    KDE

    $\hat{f}_n(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h}K\!\left(\frac{x - X_i}{h}\right)$

    $R(f, \hat{f}_n) \approx \frac{1}{4}(h\sigma_K)^4\int (f''(x))^2\,dx + \frac{1}{nh}\int K^2(x)\,dx$

    $h^* = \frac{c_1^{-2/5} c_2^{1/5} c_3^{-1/5}}{n^{1/5}}$, $\quad c_1 = \sigma^2_K$, $c_2 = \int K^2(x)\,dx$, $c_3 = \int (f''(x))^2\,dx$

    $R^*(f, \hat{f}_n) = \frac{c_4}{n^{4/5}}$, $\quad c_4 = \frac{5}{4}\left(\sigma^2_K\right)^{2/5}\left(\int K^2(x)\,dx\right)^{4/5}\left(\int (f'')^2\,dx\right)^{1/5} \equiv \frac{5}{4}\,C(K)\left(\int (f'')^2\,dx\right)^{1/5}$
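    A minimal Gaussian-kernel KDE sketch (assuming NumPy; the bandwidth here is fixed by hand rather than by the optimal-h formula above):

    ```python
    # Sketch: Gaussian kernel density estimate evaluated on a grid.
    import numpy as np

    rng = np.random.default_rng(10)
    x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
    h = 0.4                                   # bandwidth, chosen by hand for this sketch

    grid = np.linspace(-6, 6, 200)
    # f_hat(x) = (1/n) sum_i (1/h) K((x - X_i)/h) with a standard normal kernel K
    u = (grid[:, None] - x[None, :]) / h
    f_hat = np.exp(-0.5 * u**2).sum(axis=1) / (x.size * h * np.sqrt(2 * np.pi))

    dx = grid[1] - grid[0]
    print("integral of f_hat ~", (f_hat * dx).sum())    # should be close to 1
    ```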

    Epanechnikov Kernel

    $K(x) = \begin{cases} \frac{3}{4\sqrt{5}}\left(1 - x^2/5\right) & |x| < \sqrt{5} \\ 0 & \text{otherwise} \end{cases}$

    20 Stochastic Processes

    20.1 Markov Chains

    Transition matrix $P$ with elements $p_{ij} = \mathbb{P}[X_{n+1} = j \,|\, X_n = i]$, $\quad p_{ij} \ge 0$, $\quad \sum_j p_{ij} = 1$

    Chapman-Kolmogorov

    $p_{ij}(m + n) = \sum_{k} p_{ik}(m)\,p_{kj}(n)$

    $P_{m+n} = P_m P_n \qquad P_n = P \cdot P \cdots P = P^n$

    Marginal probability

    $\mu_n = (\mu_n(1), \dots, \mu_n(N))$ where $\mu_n(i) = \mathbb{P}[X_n = i]$

    $\mu_0$: initial distribution

    $\mu_n = \mu_0 P^n$
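    A small sketch of $\mu_n = \mu_0 P^n$ for a made-up 3-state chain (assuming NumPy):

    ```python
    # Sketch: evolving the marginal distribution of a 3-state Markov chain.
    import numpy as np

    P = np.array([[0.9, 0.1, 0.0],     # row-stochastic transition matrix
                  [0.2, 0.7, 0.1],
                  [0.0, 0.3, 0.7]])
    mu0 = np.array([1.0, 0.0, 0.0])    # start in state 1 with probability one

    mu = mu0.copy()
    for n in range(50):                # mu_n = mu_0 P^n, computed iteratively
        mu = mu @ P

    print("mu_50:", mu.round(4))       # approaches the stationary distribution
    ```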

    20.2 Poisson Processes

    Poisson process

    - $\{X_t : t \in [0, \infty)\}$ = number of events up to and including time $t$
    - $X_0 = 0$
    - Independent increments: $\forall t_0 < \dots < t_n$: $X_{t_1} - X_{t_0} \perp \dots \perp X_{t_n} - X_{t_{n-1}}$
    - Intensity function $\lambda(t)$:
      $\mathbb{P}[X_{t+h} - X_t = 1] = \lambda(t)h + o(h)$
      $\mathbb{P}[X_{t+h} - X_t = 2] = o(h)$
    - $X_{s+t} - X_s \sim \mathrm{Po}(m(s + t) - m(s))$ where $m(t) = \int_0^t \lambda(s)\,ds$

    Homogeneous Poisson process

    $\lambda(t) \equiv \lambda \implies X_t \sim \mathrm{Po}(\lambda t)$, $\lambda > 0$

    Waiting times

    $W_t$ := time at which $X_t$ occurs; $\quad W_t \sim \mathrm{Gamma}\left(t, \frac{1}{\lambda}\right)$

    Interarrival times

    $S_t = W_{t+1} - W_t$; $\quad S_t \sim \mathrm{Exp}\left(\frac{1}{\lambda}\right)$

  • 21 Time Series

    Mean function

    $\mu_{xt} = \mathbb{E}[x_t] = \int_{-\infty}^{\infty} x f_t(x)\,dx$

    Autocovariance function

    $\gamma_x(s, t) = \mathbb{E}[(x_s - \mu_s)(x_t - \mu_t)] = \mathbb{E}[x_s x_t] - \mu_s\mu_t$

    $\gamma_x(t, t) = \mathbb{E}[(x_t - \mu_t)^2] = \mathbb{V}[x_t]$

    Autocorrelation function (ACF)

    $\rho(s, t) = \frac{\mathrm{Cov}[x_s, x_t]}{\sqrt{\mathbb{V}[x_s]\,\mathbb{V}[x_t]}} = \frac{\gamma(s, t)}{\sqrt{\gamma(s, s)\,\gamma(t, t)}}$

    Cross-covariance function (CCV)

    $\gamma_{xy}(s, t) = \mathbb{E}[(x_s - \mu_{xs})(y_t - \mu_{yt})]$

    Cross-correlation function (CCF)

    $\rho_{xy}(s, t) = \frac{\gamma_{xy}(s, t)}{\sqrt{\gamma_x(s, s)\,\gamma_y(t, t)}}$

    Backshift operator

    $B^k(x_t) = x_{t-k}$

    Difference operator

    $\nabla^d = (1 - B)^d$

    White noise

    - $w_t \sim \mathrm{wn}(0, \sigma_w^2)$
    - Gaussian: $w_t \overset{iid}{\sim} \mathcal{N}(0, \sigma_w^2)$
    - $\mathbb{E}[w_t] = 0$ for all $t$, $\quad \mathbb{V}[w_t] = \sigma_w^2$ for all $t$, $\quad \gamma_w(s, t) = 0$ for $s \ne t$

    Random walk

    - Drift $\delta$: $x_t = \delta t + \sum_{j=1}^{t} w_j$
    - $\mathbb{E}[x_t] = \delta t$

    Symmetric moving average

    $m_t = \sum_{j=-k}^{k} a_j x_{t-j}$ where $a_j = a_{-j} \ge 0$ and $\sum_{j=-k}^{k} a_j = 1$

    21.1 Stationary Time Series

    Strictly stationary

    $\mathbb{P}[x_{t_1} \le c_1, \dots, x_{t_k} \le c_k] = \mathbb{P}[x_{t_1+h} \le c_1, \dots, x_{t_k+h} \le c_k]$ for all $k \in \mathbb{N}$, $t_k$, $c_k$, $h \in \mathbb{Z}$

    Weakly stationary

    - $\mathbb{E}[x_t^2] < \infty$ for all $t$
    - $\mathbb{E}[x_t] = m$ for all $t$
    - $\gamma_x(s, t)$ depends on $s$ and $t$ only through $|s - t|$

  • 21.2 Estimation of Correlation

    Sample mean

    $\bar{x} = \frac{1}{n}\sum_{t=1}^{n} x_t$

    Sample variance

    $\mathbb{V}[\bar{x}] = \frac{1}{n}\sum_{h=-n}^{n}\left(1 - \frac{|h|}{n}\right)\gamma_x(h)$

    Sample autocovariance function

    $\hat{\gamma}(h) = \frac{1}{n}\sum_{t=1}^{n-h}(x_{t+h} - \bar{x})(x_t - \bar{x})$

    Sample autocorrelation function

    $\hat{\rho}(h) = \frac{\hat{\gamma}(h)}{\hat{\gamma}(0)}$

    Sample cross-covariance function

    $\hat{\gamma}_{xy}(h) = \frac{1}{n}\sum_{t=1}^{n-h}(x_{t+h} - \bar{x})(y_t - \bar{y})$

    Sample cross-correlation function

    $\hat{\rho}_{xy}(h) = \frac{\hat{\gamma}_{xy}(h)}{\sqrt{\hat{\gamma}_x(0)\,\hat{\gamma}_y(0)}}$

    Properties

    - $\sigma_{\hat{\rho}_x(h)} = \frac{1}{\sqrt{n}}$ if $x_t$ is white noise
    - $\sigma_{\hat{\rho}_{xy}(h)} = \frac{1}{\sqrt{n}}$ if $x_t$ or $y_t$ is white noise
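    The sample ACF in a few lines of NumPy (a white-noise example, so $|\hat{\rho}(h)|$ should mostly stay within $\pm 1/\sqrt{n}$ for $h > 0$):

    ```python
    # Sketch: sample autocovariance and autocorrelation of a series.
    import numpy as np

    rng = np.random.default_rng(11)
    x = rng.normal(size=500)            # white-noise example
    n = x.size
    xbar = x.mean()

    def gamma_hat(h):
        # (1/n) sum_{t=1}^{n-h} (x_{t+h} - xbar)(x_t - xbar)
        return np.sum((x[h:] - xbar) * (x[: n - h] - xbar)) / n

    rho_hat = np.array([gamma_hat(h) / gamma_hat(0) for h in range(11)])
    print("rho_hat(0..10):", rho_hat.round(3))
    print("white-noise band: +/-", round(1 / np.sqrt(n), 3))
    ```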

    21.3 Non-Stationary Time Series

    Classical decomposition model

    $x_t = \mu_t + s_t + w_t$

    - $\mu_t$ = trend
    - $s_t$ = seasonal component
    - $w_t$ = random noise term

    21.3.1 Detrending

    Least squares

    1. Choose a trend model, e.g., $\mu_t = \beta_0 + \beta_1 t + \beta_2 t^2$
    2. Minimize rss to obtain the trend estimate $\hat{\mu}_t = \hat{\beta}_0 + \hat{\beta}_1 t + \hat{\beta}_2 t^2$
    3. Residuals $\mathrel{\widehat{=}}$ noise $w_t$

    Moving average

    The low-pass filter $v_t$ is a symmetric moving average $m_t$ with $a_j = \frac{1}{2k+1}$:

    $v_t = \frac{1}{2k+1}\sum_{i=-k}^{k} x_{t-i}$

    If $\frac{1}{2k+1}\sum_{j=-k}^{k} w_{t-j} \approx 0$, a linear trend function $\mu_t = \beta_0 + \beta_1 t$ passes without distortion.

    Differencing

    $\mu_t = \beta_0 + \beta_1 t \implies \nabla\mu_t = \beta_1$

    21.4 ARIMA models

    Autoregressive polynomial

    $\phi(z) = 1 - \phi_1 z - \dots - \phi_p z^p$, $\quad z \in \mathbb{C}$, $\phi_p \ne 0$

    Autoregressive operator

    $\phi(B) = 1 - \phi_1 B - \dots - \phi_p B^p$

    Autoregressive model of order $p$, AR($p$)

    $x_t = \phi_1 x_{t-1} + \dots + \phi_p x_{t-p} + w_t \iff \phi(B) x_t = w_t$

    AR(1)

    $x_t = \phi^k x_{t-k} + \sum_{j=0}^{k-1}\phi^j w_{t-j} \overset{k\to\infty,\ |\phi|<1}{=} \sum_{j=0}^{\infty}\phi^j w_{t-j}$ (linear process)

    Moving average polynomial

    $\theta(z) = 1 + \theta_1 z + \dots + \theta_q z^q$, $\quad z \in \mathbb{C}$, $\theta_q \ne 0$

    Moving average operator

    $\theta(B) = 1 + \theta_1 B + \dots + \theta_q B^q$

    MA($q$) (moving average model of order $q$)

    $x_t = w_t + \theta_1 w_{t-1} + \dots + \theta_q w_{t-q} \iff x_t = \theta(B) w_t$

    $\mathbb{E}[x_t] = \sum_{j=0}^{q}\theta_j\,\mathbb{E}[w_{t-j}] = 0$

    $\gamma(h) = \mathrm{Cov}[x_{t+h}, x_t] = \begin{cases} \sigma_w^2\sum_{j=0}^{q-h}\theta_j\theta_{j+h} & 0 \le h \le q \\ 0 & h > q \end{cases}$

    MA(1)

    $x_t = w_t + \theta w_{t-1}$

    $\gamma(h) = \begin{cases} (1 + \theta^2)\sigma_w^2 & h = 0 \\ \theta\sigma_w^2 & h = 1 \\ 0 & h > 1 \end{cases} \qquad \rho(h) = \begin{cases} \frac{\theta}{1+\theta^2} & h = 1 \\ 0 & h > 1 \end{cases}$

    ARMA($p, q$)

    $x_t = \phi_1 x_{t-1} + \dots + \phi_p x_{t-p} + w_t + \theta_1 w_{t-1} + \dots + \theta_q w_{t-q} \iff \phi(B) x_t = \theta(B) w_t$

    Partial autocorrelation function (PACF)

    - $x_i^{h-1}$: regression of $x_i$ on $\{x_{h-1}, x_{h-2}, \dots, x_1\}$
    - $\phi_{hh} = \mathrm{corr}(x_h - x_h^{h-1}, x_0 - x_0^{h-1})$ for $h \ge 2$
    - E.g., $\phi_{11} = \mathrm{corr}(x_1, x_0) = \rho(1)$

    ARIMA($p, d, q$)

    $\nabla^d x_t = (1 - B)^d x_t$ is ARMA($p, q$): $\quad \phi(B)(1 - B)^d x_t = \theta(B) w_t$

    Exponentially Weighted Moving Average (EWMA)

    $x_t = x_{t-1} + w_t - \lambda w_{t-1}$

    $x_t = \sum_{j=1}^{\infty}(1 - \lambda)\lambda^{j-1} x_{t-j} + w_t$ when $|\lambda| < 1$

    $\tilde{x}_{n+1} = (1 - \lambda)x_n + \lambda\tilde{x}_n$

    Seasonal ARIMA

    Denoted by ARIMA($p, d, q$) × ($P, D, Q$)$_s$:

    $\Phi_P(B^s)\,\phi(B)\,\nabla_s^D\nabla^d x_t = \delta + \Theta_Q(B^s)\,\theta(B)\,w_t$

    21.4.1 Causality and Invertibility

    ARMA($p, q$) is causal (future-independent) $\iff \exists\{\psi_j\}$ with $\sum_{j=0}^{\infty}|\psi_j| < \infty$ such that $x_t = \sum_{j=0}^{\infty}\psi_j w_{t-j} = \psi(B) w_t$

  • 21.5 Spectral Analysis

    Periodic process

    - Amplitude $A$, phase $\phi$; $U_1 = A\cos\phi$ and $U_2 = A\sin\phi$ are often taken to be normally distributed rvs

    Periodic mixture

    $x_t = \sum_{k=1}^{q}\left(U_{k1}\cos(2\pi\omega_k t) + U_{k2}\sin(2\pi\omega_k t)\right)$

    - $U_{k1}, U_{k2}$, for $k = 1, \dots, q$, are independent zero-mean rvs with variances $\sigma_k^2$
    - $\gamma(h) = \sum_{k=1}^{q}\sigma_k^2\cos(2\pi\omega_k h)$
    - $\gamma(0) = \mathbb{E}[x_t^2] = \sum_{k=1}^{q}\sigma_k^2$

    Spectral representation of a periodic process

    $\gamma(h) = \sigma^2\cos(2\pi\omega_0 h) = \frac{\sigma^2}{2}e^{-2\pi i\omega_0 h} + \frac{\sigma^2}{2}e^{2\pi i\omega_0 h} = \int_{-1/2}^{1/2}e^{2\pi i\omega h}\,dF(\omega)$

    Spectral distribution function

    $F(\omega) = \begin{cases} 0 & \omega < -\omega_0 \\ \sigma^2/2 & -\omega_0 \le \omega < \omega_0 \\ \sigma^2 & \omega \ge \omega_0 \end{cases}$

    - $F(-\infty) = F(-1/2) = 0$
    - $F(\infty) = F(1/2) = \gamma(0)$

    Spectral density

    $f(\omega) = \sum_{h=-\infty}^{\infty}\gamma(h)e^{-2\pi i\omega h}$, $\quad -\frac{1}{2} \le \omega \le \frac{1}{2}$

    - Needs $\sum_{h=-\infty}^{\infty}|\gamma(h)| < \infty$

    22 Math

    22.1 Gamma Function

    - $\Gamma(n) = (n - 1)!$ for $n \in \mathbb{N}$
    - $\Gamma(1/2) = \sqrt{\pi}$

    22.2 Beta Function

    - Ordinary: $B(x, y) = B(y, x) = \int_0^1 t^{x-1}(1 - t)^{y-1}\,dt = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x + y)}$
    - Incomplete: $B(x; a, b) = \int_0^x t^{a-1}(1 - t)^{b-1}\,dt$
    - Regularized incomplete:
      $I_x(a, b) = \frac{B(x; a, b)}{B(a, b)} \overset{a,b\in\mathbb{N}}{=} \sum_{j=a}^{a+b-1}\frac{(a + b - 1)!}{j!\,(a + b - 1 - j)!}\,x^j(1 - x)^{a+b-1-j}$
    - $I_0(a, b) = 0$, $\quad I_1(a, b) = 1$
    - $I_x(a, b) = 1 - I_{1-x}(b, a)$

    22.3 Series

    Finite

    - $\sum_{k=1}^{n} k = \frac{n(n+1)}{2}$
    - $\sum_{k=1}^{n} (2k - 1) = n^2$
    - $\sum_{k=1}^{n} k^2 = \frac{n(n+1)(2n+1)}{6}$
    - $\sum_{k=1}^{n} k^3 = \left(\frac{n(n+1)}{2}\right)^2$
    - $\sum_{k=0}^{n} c^k = \frac{c^{n+1} - 1}{c - 1}$, $c \ne 1$

    Binomial

    - $\sum_{k=0}^{n}\binom{n}{k} = 2^n$
    - $\sum_{k=0}^{n}\binom{r + k}{k} = \binom{r + n + 1}{n}$
    - $\sum_{k=0}^{n}\binom{k}{m} = \binom{n + 1}{m + 1}$
    - Vandermonde's identity: $\sum_{k=0}^{r}\binom{m}{k}\binom{n}{r - k} = \binom{m + n}{r}$
    - Binomial theorem: $\sum_{k=0}^{n}\binom{n}{k}a^{n-k}b^k = (a + b)^n$

    Infinite

    - $\sum_{k=0}^{\infty} p^k = \frac{1}{1 - p}$, $\quad \sum_{k=1}^{\infty} p^k = \frac{p}{1 - p}$, $\quad |p| < 1$
    - $\sum_{k=0}^{\infty} k p^{k-1} = \frac{d}{dp}\left(\sum_{k=0}^{\infty} p^k\right) = \frac{d}{dp}\left(\frac{1}{1 - p}\right) = \frac{1}{(1 - p)^2}$, $\quad |p| < 1$
    - $\sum_{k=0}^{\infty}\binom{r + k - 1}{k}x^k = (1 - x)^{-r}$, $\quad r \in \mathbb{N}^+$
    - $\sum_{k=0}^{\infty}\binom{\alpha}{k}p^k = (1 + p)^\alpha$, $\quad |p| < 1$, $\alpha \in \mathbb{C}$

    22.4 Combinatorics

    Sampling: draw $k$ out of $n$

                   without replacement                                                        with replacement
    ordered        $n^{\underline{k}} = \prod_{i=0}^{k-1}(n - i) = \frac{n!}{(n - k)!}$       $n^k$
    unordered      $\binom{n}{k} = \frac{n^{\underline{k}}}{k!} = \frac{n!}{k!(n - k)!}$      $\binom{n - 1 + r}{r} = \binom{n - 1 + r}{n - 1}$

    Stirling numbers, 2nd kind

    $\left\{{n \atop k}\right\} = k\left\{{n - 1 \atop k}\right\} + \left\{{n - 1 \atop k - 1}\right\}$, $\quad 1 \le k \le n$, $\qquad \left\{{n \atop 0}\right\} = \begin{cases} 1 & n = 0 \\ 0 & \text{else} \end{cases}$

    Partitions

    $P_{n+k,k} = \sum_{i=1}^{k} P_{n,i}$, $\qquad k > n : P_{n,k} = 0$, $\qquad n \ge 1 : P_{n,0} = 0$, $\quad P_{0,0} = 1$

    Balls and Urns: $f : B \to U$ with $|B| = n$, $|U| = m$; D = distinguishable, ¬D = indistinguishable.

    B, U            $f$ arbitrary                                $f$ injective                            $f$ surjective                        $f$ bijective
    B: D,  U: D     $m^n$                                        $m^{\underline{n}}$ if $m \ge n$, else 0  $m!\left\{{n \atop m}\right\}$        $n!$ if $m = n$, else 0
    B: ¬D, U: D     $\binom{m + n - 1}{n}$                       $\binom{m}{n}$                            $\binom{n - 1}{m - 1}$                1 if $m = n$, else 0
    B: D,  U: ¬D    $\sum_{k=1}^{m}\left\{{n \atop k}\right\}$   1 if $m \ge n$, else 0                    $\left\{{n \atop m}\right\}$          1 if $m = n$, else 0
    B: ¬D, U: ¬D    $\sum_{k=1}^{m} P_{n,k}$                     1 if $m \ge n$, else 0                    $P_{n,m}$                             1 if $m = n$, else 0

    References

    [1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole, 1972.

    [2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45-53, 2008.

    [3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications: With R Examples. Springer, 2006.

    [4] A. Steger. Diskrete Strukturen, Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.

    [5] A. Steger. Diskrete Strukturen, Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.

    [6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.

  • [Figure: Univariate distribution relationships, courtesy of Leemis and McQueston [2].]
