Chapter 5: Sampling and Statistics
Math 6203, Fall 2009
Instructor: Ayona Chatterjee
5.1 Sampling and Statistics
• Typical statistical problem: we have a random variable X whose pdf f(x) or pmf p(x) is unknown.
– Either f(x) or p(x) is completely unknown,
– or the form of f(x) or p(x) is known down to a parameter θ, where θ may be a vector.
• Here we will consider the second option.
• Example: X has an exponential distribution with θ unknown.
• Since θ is unknown, we want to estimate it. Estimation is based on a sample.
• We will formalize the sampling plan:
– Sampling with replacement: each draw is independent and the X's have the same distribution.
– Sampling without replacement: the draws are not independent, but the X's still have the same distribution.
Random Sample
• The random variables X1, X2, …, Xn constitute a random sample on the random variable X if they are independent and each has the same distribution as X. We will abbreviate this by saying that X1, X2, …, Xn are iid; i.e., independent and identically distributed.
– The joint pdf can be given as

f_{X_1,\dots,X_n}(x_1,\dots,x_n) = \prod_{i=1}^{n} f(x_i)
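As a quick numerical illustration of the product formula (a sketch only; the exponential pdf and the particular values of theta and the sample are illustrative assumptions, not from the lecture):

```python
import math

# Joint pdf of an iid sample as the product of the marginal pdfs.
# Illustration: exponential pdf f(x) = (1/theta) * exp(-x/theta),
# with theta as a scale parameter (an assumed example, not from the text).

def exp_pdf(x, theta):
    """Exponential pdf with scale parameter theta."""
    return math.exp(-x / theta) / theta

def joint_pdf(xs, theta):
    """Joint pdf of an iid sample x1,...,xn: the product of f(x_i)."""
    prod = 1.0
    for x in xs:
        prod *= exp_pdf(x, theta)
    return prod

sample = [0.5, 1.2, 2.0]
print(joint_pdf(sample, theta=1.0))  # exp(-(0.5 + 1.2 + 2.0)) = exp(-3.7)
```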
Statistic
• Suppose the n random variables X1, X2, …., Xn constitute a sample from the distribution of a random variable X. Then any function T=T(X1, X2, …., Xn ) of the sample is called a statistic.
• A statistic T = T(X1, X2, …, Xn) may convey information about the unknown parameter θ. We call such a statistic a point estimator of θ.
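For instance, the sample mean is a statistic that serves as a point estimator of the mean of an exponential distribution. A minimal sketch (the value theta = 2.0, the seed, and the sample size are illustrative assumptions):

```python
import random

# A statistic is any function of the sample. Here the sample mean
# T(X1,...,Xn) is used as a point estimator of the exponential mean theta.

random.seed(0)
theta = 2.0  # assumed "true" parameter for the simulation
sample = [random.expovariate(1.0 / theta) for _ in range(10_000)]

def sample_mean(xs):
    """The statistic T = (X1 + ... + Xn) / n."""
    return sum(xs) / len(xs)

estimate = sample_mean(sample)
print(round(estimate, 2))  # should be close to theta = 2.0
```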
5.2 Order Statistics
Notation
• Let X1, X2, …, Xn denote a random sample from a distribution of the continuous type having a pdf f(x) with support S = (a, b), where -∞ ≤ a < x < b ≤ ∞. Let Y1 be the smallest of these Xi, Y2 the next Xi in order of magnitude, …, and Yn the largest of the Xi. That is, Y1 < Y2 < … < Yn represent X1, X2, …, Xn when the latter are arranged in ascending order of magnitude. We call Yi the ith order statistic of the random sample X1, X2, …, Xn.
Theorem 5.2.1
• Let Y1 < Y2 < … < Yn denote the n order statistics based on the random sample X1, X2, …, Xn from a continuous distribution with pdf f(x) and support (a, b). Then the joint pdf of Y1, Y2, …, Yn is given by

g(y_1, y_2, \dots, y_n) =
\begin{cases}
n!\, f(y_1) f(y_2) \cdots f(y_n), & a < y_1 < y_2 < \cdots < y_n < b \\
0, & \text{elsewhere}
\end{cases}

• The marginal pdf of any order statistic Y_k can be given as

g_k(y_k) =
\begin{cases}
\dfrac{n!}{(k-1)!\,(n-k)!}\, [F(y_k)]^{k-1}\, [1 - F(y_k)]^{n-k}\, f(y_k), & a < y_k < b \\
0, & \text{elsewhere}
\end{cases}
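The marginal formula can be checked by simulation. For Y1 (the minimum, k = 1) of a Uniform(0, 1) sample, the formula gives pdf n(1 - y)^(n-1), hence P(Y1 ≤ t) = 1 - (1 - t)^n. A Monte Carlo sketch (n = 5, t = 0.2, the seed, and the trial count are illustrative choices):

```python
import random

# Monte Carlo check of the marginal distribution of the smallest order
# statistic of a Uniform(0,1) sample: P(Y1 <= t) = 1 - (1 - t)^n.

random.seed(1)
n, t, trials = 5, 0.2, 200_000

hits = 0
for _ in range(trials):
    y1 = min(random.random() for _ in range(n))  # smallest order statistic
    if y1 <= t:
        hits += 1

empirical = hits / trials
theoretical = 1 - (1 - t) ** n
print(round(empirical, 3), round(theoretical, 3))
```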
Note
• The joint pdf of any two order statistics Y_i < Y_j (i < j) can be written as

g_{ij}(y_i, y_j) =
\begin{cases}
\dfrac{n!}{(i-1)!\,(j-i-1)!\,(n-j)!}\, [F(y_i)]^{i-1}\, [F(y_j) - F(y_i)]^{j-i-1}\, [1 - F(y_j)]^{n-j}\, f(y_i) f(y_j), & a < y_i < y_j < b \\
0, & \text{elsewhere}
\end{cases}
Note
• Yn − Y1 is called the range of the random sample.
• (Y1 + Yn)/2 is called the mid-range.
• If n is odd, then Y_{(n+1)/2} is called the median of the random sample.
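These quantities are straightforward to compute once the sample is sorted. A small sketch (the data values are illustrative; the even-n median convention shown is the usual average of the two middle order statistics, which goes beyond the odd-n case stated above):

```python
# Order statistics of a sample and the derived quantities described above.

data = [3.1, 0.7, 2.4, 5.9, 1.8]
y = sorted(data)                 # y[0] = Y1, ..., y[-1] = Yn

rng = y[-1] - y[0]               # range: Yn - Y1
mid_range = (y[0] + y[-1]) / 2   # mid-range: (Y1 + Yn) / 2
n = len(y)
# median: Y_{(n+1)/2} for odd n; average of middle two for even n (a
# common convention, assumed here)
median = y[(n + 1) // 2 - 1] if n % 2 == 1 else (y[n // 2 - 1] + y[n // 2]) / 2

print(rng, mid_range, median)
```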
5.4 More on Confidence Intervals
The Statistical Problem
• We have a random variable X with density f(x,θ), where θ is unknown and belongs to the family of parameters Ω.
• We estimate θ with some statistic T, where T is a function of the random sample X1, X2, …, Xn.
• It is unlikely that the value of T gives the true value of θ.
– If T has a continuous distribution, then P(T = θ) = 0.
• What is needed is an estimate of the error of estimation: by how much did we miss θ?
Central Limit Theorem
• Let θ0 denote the true, unknown value of the parameter θ. Suppose T is an estimator of θ such that

\sqrt{n}\,(T - \theta_0) \xrightarrow{D} N(0, \sigma_T^2)

• Assume that σ_T^2 is known.
• Let Z = \sqrt{n}\,(T - \theta_0)/\sigma_T. Then Z is asymptotically N(0, 1).
• Hence P(−1.96 ≤ Z ≤ 1.96) ≈ 0.95.
• We can show algebraically that

P\left(T - 1.96\,\frac{\sigma_T}{\sqrt{n}} < \theta_0 < T + 1.96\,\frac{\sigma_T}{\sqrt{n}}\right) \approx 0.95
Note
• When σ_T is unknown, we use the sample standard deviation s to estimate it.
• We obtain an interval similar to the one above, with σ_T replaced by s.
• Note that t is the observed value of the statistic T.
Confidence Interval for Mean μ
• Let X1 , X2 , ….Xn be a random sample from the distribution with unknown mean μ and unknown standard deviation σ.
• Let X̄ and S² denote the sample mean and sample variance respectively.
• By the Central Limit Theorem, \sqrt{n}\,(\bar{X} - \mu)/S is approximately N(0, 1).
• An approximate 95% confidence interval for μ is

(\bar{x} - 1.96\, s/\sqrt{n},\ \bar{x} + 1.96\, s/\sqrt{n})
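The interval is easy to compute directly from the formula. A sketch (the data values are illustrative, not from the lecture):

```python
import math

# Approximate 95% confidence interval for the mean, as above:
# (xbar - 1.96*s/sqrt(n), xbar + 1.96*s/sqrt(n)).

def mean_ci_95(xs):
    n = len(xs)
    xbar = sum(xs) / n
    # sample standard deviation (divisor n - 1)
    s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    half = 1.96 * s / math.sqrt(n)
    return xbar - half, xbar + half

data = [4.2, 5.1, 3.8, 4.9, 5.6, 4.4, 5.0, 4.7]
lo, hi = mean_ci_95(data)
print(round(lo, 3), round(hi, 3))
```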
Note
• We can find confidence intervals for any confidence level.
• Let z_{α/2} denote the upper α/2 quantile of a standard normal random variable.
• Then the approximate (1 − α)100% confidence interval for θ0 is

(t - z_{\alpha/2}\, s/\sqrt{n},\ t + z_{\alpha/2}\, s/\sqrt{n})
Confidence Interval for Proportions
• Let X be a Bernoulli random variable with probability of success p.
• Let X1 , X2 , ….Xn be a random sample from the distribution of X.
• Then the approximate (1 − α)100% confidence interval for p is

\left(\hat{p} - z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n},\ \hat{p} + z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}\right)

Note
• \hat{p} = (number of successes)/(sample size) = \frac{1}{n}\sum_i x_i
• \hat{p} is approximately N(p,\ p(1-p)/n).
• \sqrt{\hat{p}(1-\hat{p})/n} is called the standard error.
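A short sketch of the proportion interval (the counts 60 successes out of n = 100 are illustrative):

```python
import math

# Approximate (1 - alpha)100% confidence interval for a proportion,
# using the standard-error formula above; z = 1.96 gives a 95% interval.

def prop_ci(successes, n, z=1.96):
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error
    return p_hat - z * se, p_hat + z * se

lo, hi = prop_ci(successes=60, n=100)
print(round(lo, 3), round(hi, 3))  # roughly (0.504, 0.696)
```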
5.5 Introduction to Hypothesis Testing
Introduction
• Our interest centers on a random variable X which has density function f(x,θ), where θ belongs to Ω.
• Due to theory or a preliminary experiment, suppose we believe that θ ∈ ω0 or θ ∈ ω1, where ω0 and ω1 are subsets of Ω and ω0 ∪ ω1 = Ω.
• We label these hypotheses as

H_0: \theta \in \omega_0 \quad \text{versus} \quad H_1: \theta \in \omega_1
• The hypothesis H0 is referred to as the null hypothesis while H1 is referred to as the alternative hypothesis.
• The null hypothesis represents 'no change'.
• The alternative hypothesis is referred to as the research worker's hypothesis.
Error in Hypothesis Testing
• The decision rule to take H0 or H1 is based on a sample X1 , X2 , ….Xn from the distribution of X and hence the decision could be wrong.
True State of Nature
Decision | H0 is true | H1 is true
Reject H0 | Type I Error | Correct Decision
Accept H0 | Correct Decision | Type II Error
• The goal is to select a critical region from all possible critical regions which minimizes the probabilities of these errors.
• In general this is not possible; the probabilities of these errors have a see-saw effect.
– For example, if the critical region is the empty set, we would never reject the null, so the probability of a Type I error would be zero, but then the probability of a Type II error would be 1.
• The Type I error is considered the worse of the two.
Critical Region
• We fix the probability of type I error and we try and select a critical region that minimizes type II error.
• We say a critical region C is of size α if

\alpha = \max_{\theta \in \omega_0} P_\theta[(X_1, X_2, \dots, X_n) \in C]

• Over all critical regions of size α, we want to consider critical regions that have lower probabilities of Type II error.
• We want to maximize

1 - P_\theta[\text{Type II Error}] = P_\theta[(X_1, X_2, \dots, X_n) \in C], \quad \text{for } \theta \in \omega_1

• The probability on the right-hand side is called the power of the test at θ.
• It is the probability that the test detects the alternative when θ ∈ ω1 is the true parameter.
• So maximizing power is the same as minimizing the Type II error.
Power of a test
• We define the power function of a critical region C to be

\gamma_C(\theta) = P_\theta[(X_1, X_2, \dots, X_n) \in C], \quad \theta \in \omega_1

• Hence, given two critical regions C1 and C2 which are both of size α, C1 is better than C2 if

\gamma_{C_1}(\theta) \ge \gamma_{C_2}(\theta) \quad \text{for all } \theta \in \omega_1
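The power function can be estimated by simulation. A sketch for the one-sided z-test of H0: μ = 0 versus H1: μ > 0 with σ = 1 known (the sample size, seed, trial count, and the alternative μ = 0.5 are all illustrative assumptions):

```python
import math
import random

# Monte Carlo estimate of the power function gamma_C(theta) for the
# critical region C = { sqrt(n) * xbar >= z_alpha }, alpha = 0.05.

random.seed(2)
n, z_alpha, trials = 25, 1.645, 5_000

def power(mu):
    """Fraction of samples falling in the critical region when mu is true."""
    rejections = 0
    for _ in range(trials):
        xbar = sum(random.gauss(mu, 1.0) for _ in range(n)) / n
        if math.sqrt(n) * xbar >= z_alpha:   # reject H0
            rejections += 1
    return rejections / trials

p_null = power(0.0)   # power at the null: should be near alpha = 0.05
p_alt = power(0.5)    # power under the alternative: much larger
print(round(p_null, 2), round(p_alt, 2))
```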
Note
• A hypothesis of the form H0: p = p0 is called a simple hypothesis.
• A hypothesis of the form H1: p < p0 is called a composite hypothesis.
• Also remember that α is called the significance level of the test associated with that critical region.
Test Statistics for Mean
• To test H_0: \mu = \mu_0 versus H_1: \mu > \mu_0 at level α:
• Reject H0 in favor of H1 if

Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \ge z_\alpha \quad (\sigma \text{ known})

• Reject H0 in favor of H1 if

T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \ge t_{\alpha,\, n-1} \quad (\sigma \text{ unknown})
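A minimal sketch of the σ-known decision rule (the numbers xbar = 10.4, μ0 = 10, σ = 1, n = 25 are illustrative; 1.645 is the standard z quantile for α = 0.05):

```python
import math

# One-sided z-test for the mean with sigma known, per the rule above:
# reject H0: mu = mu0 in favor of H1: mu > mu0 when
# (xbar - mu0) / (sigma / sqrt(n)) >= z_alpha.

def z_test(xbar, mu0, sigma, n, z_alpha=1.645):
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    return z, z >= z_alpha

z, reject = z_test(xbar=10.4, mu0=10.0, sigma=1.0, n=25)
print(round(z, 2), reject)  # 2.0 True
```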
5.7 Chi-Square Tests
Introduction
• Originally proposed by Karl Pearson in 1900.
• Used to check for goodness of fit and independence.
Goodness of fit test
• Consider the simple hypothesis
– H_0: p_1 = p_{10},\ p_2 = p_{20},\ \dots,\ p_{k-1} = p_{k-1,0}
• If the hypothesis H0 is true, the random variable

Q_{k-1} = \sum_{i=1}^{k} \frac{(X_i - n\, p_{i0})^2}{n\, p_{i0}}

has an approximate chi-square distribution with k − 1 degrees of freedom.
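A small sketch of the goodness-of-fit statistic (the counts and hypothesized probabilities are illustrative; 7.815 is the standard chi-square table value with 3 degrees of freedom at α = 0.05):

```python
# Chi-square goodness-of-fit statistic Q = sum (X_i - n*p_i0)^2 / (n*p_i0),
# compared with the chi-square critical value with k - 1 = 3 df.

counts = [30, 20, 25, 25]          # observed X_i; n = 100
p0 = [0.25, 0.25, 0.25, 0.25]      # hypothesized p_i0 under H0
n = sum(counts)

q = sum((x - n * p) ** 2 / (n * p) for x, p in zip(counts, p0))
print(q)            # (25 + 25 + 0 + 0) / 25 = 2.0
print(q > 7.815)    # False: do not reject H0 at alpha = 0.05
```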
Test for Independence
• Let the result of a random experiment be classified by two attributes.
• Let Ai denote the outcomes of the first kind and Bj denote the outcomes for the second kind.
• Let p_ij = P(A_i ∩ B_j).
• The random experiment is repeated n independent times, and X_ij denotes the frequency of the event A_i ∩ B_j.
• The hypothesis of independence is

H_0: P(A_i \cap B_j) = P(A_i)\, P(B_j), \quad \text{i.e. } p_{ij} = p_{i\cdot}\, p_{\cdot j}, \quad i = 1, 2, \dots, a;\ j = 1, 2, \dots, b

• The marginal probabilities are estimated by

\hat{p}_{i\cdot} = \frac{X_{i\cdot}}{n} = \frac{1}{n} \sum_{j=1}^{b} X_{ij}, \quad \text{for } i = 1, 2, \dots, a

\hat{p}_{\cdot j} = \frac{X_{\cdot j}}{n} = \frac{1}{n} \sum_{i=1}^{a} X_{ij}, \quad \text{for } j = 1, 2, \dots, b
• The random variable

Q_{(a-1)(b-1)} = \sum_{i=1}^{a} \sum_{j=1}^{b} \frac{\left(X_{ij} - n\,(X_{i\cdot}/n)(X_{\cdot j}/n)\right)^2}{n\,(X_{i\cdot}/n)(X_{\cdot j}/n)}

has an approximate chi-square distribution with (a − 1)(b − 1) degrees of freedom, provided n is large.
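A sketch of the independence statistic for a 2 × 2 table (the counts are illustrative; 3.841 is the standard chi-square table value with (a−1)(b−1) = 1 df at α = 0.05):

```python
# Chi-square statistic for independence in an a x b contingency table,
# using the estimated expected counts n * (X_i./n) * (X_.j/n) as above.

table = [[20, 30],   # X_ij counts
         [30, 20]]

a, b = len(table), len(table[0])
n = sum(sum(row) for row in table)
row_tot = [sum(row) for row in table]                             # X_i.
col_tot = [sum(table[i][j] for i in range(a)) for j in range(b)]  # X_.j

q = 0.0
for i in range(a):
    for j in range(b):
        expected = row_tot[i] * col_tot[j] / n
        q += (table[i][j] - expected) ** 2 / expected

print(q)          # 4.0
print(q > 3.841)  # True: reject independence at alpha = 0.05
```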