Chapter (II)
Definitions and Notation
This chapter is concerned with some important definitions and notation that will be used in this study. The first section reviews several approaches to estimation, the second section is devoted to some topics in hypothesis testing, the third section focuses on measures of information, the fourth section deals with optimization subject to constraints via the Lagrange multiplier, and finally the fifth section describes, in brief, some important distributions.
2.1 Methods of Estimation
The problem of point estimation of a distribution's parameters plays a vital role in the statistical literature; therefore many methods of estimation have been proposed. This section is concerned with three methods of estimation.
1 Method of Moments
It is difficult to trace back who introduced the method of moments (MOM), but Johann Bernoulli (1667-1748) was the first to use the method in his work, see Gelder (1997). The idea of this method is to express the unknown parameters in terms of the unobserved population moments (for instance the mean, variance, skewness, kurtosis and coefficient of variation), and then to estimate these unobserved moments by the observed sample moments. Typically, the observed sample moments can take the following forms:
1. The moments about zero (raw moments): $E(X^r)$
2. The central moments: $E(X-\mu)^r$
3. The standard moments: $E\left(\dfrac{X-\mu}{\sigma}\right)^r$
where $r = 1, \ldots, k$ and $\mu$, $\sigma$ and $k$ refer to the mean, the standard deviation and the number of estimated parameters of the distribution, respectively.
Hence the method works by simultaneously solving a system of k equations in the k unknown parameters and the k observed sample moments.
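As a minimal numerical sketch of the method (not taken from this study), the following code matches the first two population moments of an assumed two-parameter gamma model, E[X] = kθ and Var(X) = kθ², to their sample counterparts; the simulated data and variable names are purely illustrative:

```python
import numpy as np

# Hypothetical sample; in practice x would be the observed data.
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=3.0, size=1000)

# Match the first two population moments of the Gamma(k, theta) model,
# E[X] = k*theta and Var(X) = k*theta**2, to the sample moments.
m1 = x.mean()                 # first sample moment (mean)
m2_central = x.var()          # second central sample moment (variance)

theta_hat = m2_central / m1   # scale estimate
k_hat = m1 / theta_hat        # shape estimate

print(k_hat, theta_hat)       # should be close to the true values 2 and 3
```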
2 Method of Maximum Likelihood
It is difficult to trace who discovered this tool, but Bernoulli in 1700 was the first to report on it, see Gelder (1997). The idea is that the specified sample should have a high probability of being drawn, so one searches for the parameter values that maximize the likelihood function for that sample.
The likelihood function is the joint density function of a completely random sample and takes the following form:
$$L(x_1,\ldots,x_n;\theta) = \prod_{i=1}^{n} f(x_i;\theta)$$
The method of maximum likelihood estimates $\theta$ by finding the value $\hat{\theta}$ that maximizes $L(x_1,\ldots,x_n;\theta)$; hence $\hat{\theta}$ is called the maximum likelihood estimator (MLE). In many cases $\hat{\theta}$ is obtained by solving the following equation:
$$\frac{dL(x_1,\ldots,x_n;\theta)}{d\theta} = 0 \qquad (2.1.1)$$
The maximum likelihood method can also be used to estimate k unknown parameters, in which case one solves a system of k equations in the k unknown parameters. It can be shown that the estimator defined in equation (2.1.1) cannot be obtained if the following conditions (often called regularity conditions) are not valid:
1. The first and second derivatives of the log-likelihood function must be defined.
2. The range of the Xs does not depend on the unknown parameter $\theta$.
Note: In many situations solving (2.1.1) is not easy; thus one can use a monotonic transformation that makes the calculation easier with no loss of information:
$$\frac{d\,\ln L(x_1,\ldots,x_n;\theta)}{d\theta} = \sum_{i=1}^{n}\frac{d\,\ln f(x_i;\theta)}{d\theta}$$
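As an illustrative sketch (not part of this study), the following code maximizes the log-likelihood of an assumed exponential(λ) sample numerically and compares the result with the closed-form MLE 1/x̄; the simulated data and the use of scipy's optimizer are assumptions made only for this example:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical exponential sample with true rate 1/2.
rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=500)

# Log-likelihood of an Exponential(rate=lam) sample:
# ln L(lam) = n*ln(lam) - lam*sum(x); we minimize its negative.
def neg_log_lik(lam):
    return -(x.size * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 10.0), method="bounded")
print(res.x, 1.0 / x.mean())   # numerical MLE vs the closed-form solution
```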
3 Method of Least Squares
The method of least squares, or ordinary least squares (OLS), plays a vital role in statistical research, particularly regression analysis, and was proposed by Gauss, see Gelder (1997). Typically OLS is used to estimate the relation between two variables, known as the independent and the dependent variable. Least squares problems fall into two categories, linear and non-linear. The linear least squares problem has a closed-form solution, while the non-linear problem does not and is usually solved by an iterative process. Furthermore, OLS can be applied with one or more independent variables; this study will focus on one independent variable.
Suppose $Y_1, Y_2, \ldots, Y_n$ are pairwise uncorrelated random variables representing the dependent variable and $X_1, X_2, \ldots, X_n$ represent the fixed independent variable, and suppose the relation between the $Y$s and the $X$s is expressed as:
$$Y_i = B_0 + B_1 X_i + U_i, \qquad i = 1,\ldots,n$$
where the $U$s refer to the residuals of the model. Thus OLS states that one should pick the values of the $B$s which make the sum of squared residuals as small as possible:
$$\min_{B_0,B_1}\; \sum_{i=1}^{n} U_i^2 = \sum_{i=1}^{n}\left(y_i - B_0 - B_1 x_i\right)^2$$
Differentiating the sum of squared residuals with respect to $B_0$ and $B_1$ yields:
$$-2\sum_{i=1}^{n}\left(y_i - B_0 - B_1 x_i\right) = 0$$
$$-2\sum_{i=1}^{n} x_i\left(y_i - B_0 - B_1 x_i\right) = 0 \qquad (2.1.2)$$
It is easy to check that (2.1.2) gives a minimum; hence, solving (2.1.2), one obtains:
$$b_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2}, \qquad b_0 = \bar{y} - b_1\,\bar{x}$$
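The closed-form solution above can be checked numerically; the following sketch (with simulated, purely illustrative data) computes $b_0$ and $b_1$ directly and cross-checks them against numpy's polynomial fit:

```python
import numpy as np

# Hypothetical data for the single-regressor model y = b0 + b1*x + u.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y = 1.5 + 0.8 * x + rng.normal(scale=0.5, size=100)

# Closed-form OLS solution derived above:
# b1 = (sum(x*y) - n*xbar*ybar) / (sum(x**2) - n*xbar**2),  b0 = ybar - b1*xbar
n = x.size
b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
b0 = y.mean() - b1 * x.mean()

print(b0, b1)
print(np.polyfit(x, y, deg=1))   # cross-check: returns [slope, intercept]
```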
So far, it is not obvious which of these methods is more efficient than the others; to address this question, some topics related to the properties of point estimators and to confidence intervals should be discussed.
Definition (2.1.1): In statistics, point estimation refers to the use of sample data to calculate a single value, well known as a statistic (an observed function of the sample, where the function itself does not depend on the parameter), which is to serve as a best guess for an unknown population parameter.
Definition (2.1.2) Unbiased Estimator: The first criterion by which estimators can be classified is unbiasedness. Suppose $\hat{\theta}$ is a statistic from an observed random sample, considered as a point estimator for $\theta$; $\hat{\theta}$ is called an unbiased estimator for $\theta$ iff $E(\hat{\theta}) = \theta$. If this condition holds only as the sample size becomes large, $\hat{\theta}$ is called an asymptotically unbiased estimator for $\theta$.
Definition (2.1.3) Relative Efficiency: Suppose $\hat{\theta}_1$ and $\hat{\theta}_2$ are two estimators for $\theta$; $\hat{\theta}_1$ is said to be relatively more efficient than $\hat{\theta}_2$ iff
$$\frac{\operatorname{Var}(\hat{\theta}_1)}{\operatorname{Var}(\hat{\theta}_2)} \le 1$$
Reject $H_0$ if $\lambda \le k$, accept $H_0$ if $\lambda > k$, where
$$\lambda = \frac{\prod_{i=1}^{n} f(x_i;\theta_0)}{\prod_{i=1}^{n} f(x_i;\theta_1)}$$
and $k$ is a positive constant.
The idea is that we calculate the ratio between the likelihood functions under $H_0$ and $H_1$; a high value of this ratio indicates acceptance of $H_0$, otherwise $H_0$ is rejected. This ratio is well known as the simple likelihood ratio, or the Neyman-Pearson lemma.
Definition (2.2.2): If it is required to test a simple hypothesis versus a composite alternative hypothesis, then among all tests of size $\alpha$ or less, the test that is most powerful against all alternative hypotheses is called the Uniformly Most Powerful Test, and it takes the following form:
Reject $H_0$ if $\Lambda \le c$, accept $H_0$ if $\Lambda > c$, where
$$\Lambda = \frac{\prod_{i=1}^{n} f(x_i;\hat{\theta}_0)}{\prod_{i=1}^{n} f(x_i;\hat{\theta})}$$
and $c$ is a positive constant.
The idea is that we calculate the ratio between the likelihood function under $H_0$ and the likelihood function maximized over the whole parameter space of $\theta$ (that is, $f(x_i;\hat{\theta})$ refers to maximization over all of the parameter space); this ratio is typically called the Generalized Likelihood Ratio.
It is obvious that the simple likelihood ratio $\lambda$ is a special case of $\Lambda$. The distribution of $\Lambda$ corresponding to a particular null and alternative hypothesis has to be obtained from the sampling distribution of the test statistic, which in many cases is not easy; fortunately, it has been proved that for any particular null and alternative hypothesis, $-2\ln\Lambda$ has approximately a $\chi^2$ distribution with degrees of freedom equal to the number of parameters tested in the null hypothesis.
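As a small numerical illustration of the $\chi^2$ approximation (assumed here for the mean of a normal with known variance, not an example taken from the thesis), $-2\ln\Lambda$ reduces to $n(\bar{x}-\mu_0)^2/\sigma^2$, and both forms can be compared directly:

```python
import numpy as np
from scipy.stats import chi2, norm

# Hypothetical sample; sigma is treated as known.
rng = np.random.default_rng(3)
sigma = 2.0
x = rng.normal(loc=5.3, scale=sigma, size=200)
mu0 = 5.0                                   # H0: mu = mu0

def log_lik(mu):
    return norm.logpdf(x, loc=mu, scale=sigma).sum()

# Lambda = L(mu0)/L(mu_hat) with mu_hat = x-bar; for this model
# -2*ln(Lambda) equals n*(x-bar - mu0)**2 / sigma**2.
stat = -2 * (log_lik(mu0) - log_lik(x.mean()))
print(stat, x.size * (x.mean() - mu0) ** 2 / sigma ** 2)  # the two agree
print(chi2.sf(stat, df=1))                                # approximate p-value
```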
2.3 Measures of Information
A great variety of information measures have been proposed in the literature recently, see Esteban (1995). Since Shannon (1948) made a huge contribution to the development of information theory, this section deals with Shannon's entropy and some measures related to Shannon's (1948) entropy.
Definition (2.3.1): The origin of the entropy concept goes back to Ludwig Boltzmann (1877); it is a Greek notion meaning transformation, and it was given a probabilistic interpretation in information theory by Shannon (1948). He considered entropy as an index of the uncertainty associated with a random variable, expressed in nats, where a nat (sometimes nit or nepit) is a unit of information or entropy based on natural logarithms. Let there be n events with probabilities $p_1, p_2, \ldots, p_n$ adding up to 1; Shannon (1948) stated that the entropy corresponding to these events takes the following form:
$$H(X) = -\sum_{i=1}^{n} p(x_i)\,\ln p(x_i) \qquad (2.3.1)$$
Hence, Shannon (1948) claimed that via (2.3.1) one can transform the information in the sample from an invisible form into a numerical, physical form, so that comparisons can easily be made and understood. Frenken (2003) mentioned that (2.3.1) can be regarded as the variance for qualitative data.
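A minimal sketch of (2.3.1) in code (the probability vectors are illustrative assumptions, not data from this study) shows the two extreme cases, zero entropy for a certain event and maximum entropy for equal probabilities:

```python
import numpy as np

def shannon_entropy(p):
    """Entropy in nats, H(X) = -sum p_i * ln(p_i), taking 0*ln(0) as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # drop zero-probability events
    return -np.sum(p * np.log(p))

print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # equal probabilities: ln(4) ~ 1.386
print(shannon_entropy([1.0, 0.0, 0.0, 0.0]))      # a certain event: 0
```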
To show how Shannon (1948) arrived at (2.3.1), let $n_1, n_2, \ldots, n_k$ be the number of times each category occurs in an experiment of length n, where:
$$\sum_{i=1}^{k} n_i = n \quad\text{and}\quad p_i = \frac{n_i}{n}$$
According to Golan (1996), Shannon (1948) mentioned that the number of all possible combinations that partition n into k categories of sizes $n_1, \ldots, n_k$ can be an indicator of the accuracy of any decision associated with this sample; this number of combinations can be presented as:
$$W = C^{\,n}_{\,n_1, n_2, \ldots, n_k} = \frac{n!}{n_1!\, n_2! \cdots n_k!} \qquad (2.3.2)$$
It is obvious that (2.3.2) is always greater than or equal to one; if (2.3.2) equals one, the sample has only one category, which corresponds to maximum accuracy and minimum uncertainty. For simplicity, Shannon (1948) preferred to deal with the logarithm of W, as follows:
$$\ln(W) = \ln n! - \sum_{i=1}^{k} \ln n_i!$$
Using Stirling's approximation, which states that
$$\ln x! \approx x\ln x - x \quad \text{as } x \to \infty,$$
ln(W) becomes:
$$\ln(W) \approx n\ln n - n - \sum_{i=1}^{k}\left(n_i\ln n_i - n_i\right)$$
$$= n\ln n - \sum_{i=1}^{k} n_i\ln n_i$$
$$= n\ln n - \sum_{i=1}^{k} n_i\ln(n\,p_i)$$
$$= n\ln n - \sum_{i=1}^{k} n_i\left(\ln n + \ln p_i\right)$$
$$= n\ln n - \ln n\sum_{i=1}^{k} n_i - \sum_{i=1}^{k} n_i\ln p_i$$
$$= -\sum_{i=1}^{k} n_i\ln p_i$$
$$\frac{1}{n}\ln(W) \approx -\sum_{i=1}^{k} p_i\ln p_i = H(p)$$
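The Stirling-based result above can be checked numerically; the following sketch (with hypothetical category counts, chosen only for illustration) compares the exact ln W from (2.3.2) with n·H(p):

```python
import numpy as np
from scipy.special import gammaln

# Hypothetical counts n_1..n_k for an experiment of length n = sum(counts).
counts = np.array([400, 300, 200, 100])
n = counts.sum()
p = counts / n

# Exact ln W = ln(n!) - sum ln(n_i!), computed via gammaln(m + 1) = ln(m!).
log_W = gammaln(n + 1) - gammaln(counts + 1).sum()

# Stirling-based approximation derived above: ln W ~ n * H(p) = -n * sum p_i ln p_i
approx = -n * np.sum(p * np.log(p))
print(log_W, approx)   # the two values are close for moderately large counts
```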
Therefore, Shannon's (1948) entropy can be regarded as a measure of the average accuracy associated with decisions about the sample. Indeed, Shannon (1948) mentioned that (2.3.1) satisfies the following properties:
1. The quantity $H(X)$ reaches a minimum, equal to zero, when one of the events is a certainty, assuming $0\ln(0) = 0$; and $H(X)$ reaches its maximum when all the probabilities are equal, hence $H(X)$ can be regarded as a concave function.
2. If some events have zero probability, they can just as well be left out of the entropy when we evaluate the uncertainty.
3. Entropy must be symmetric, that is, it does not depend on the order of the probabilities.
For a continuous distribution, (2.3.1) takes the following form:
$$H(X) = -\int f(x,\theta)\,\ln f(x,\theta)\,dx$$
Definition (2.3.2): Joint entropy is a measure of the uncertainty of two variables and takes the following form:
$$H(X,Y) = -\sum_{i,j} p(x_i,y_j)\,\ln p(x_i,y_j)$$
It is obvious that:
$$H(X,Y) \le H(X) + H(Y)$$
According to Shannon (1948), the uncertainty of a joint event is less than or equal to the sum of the individual uncertainties, with equality only if the events are independent.
Definition (2.3.3): Mutual information measures the information that X and Y share, and takes the following form:
$$M(X,Y) = \sum_{i,j} p(x_i,y_j)\,\ln\frac{p(x_i,y_j)}{p(x_i)\,p(y_j)}$$
It is obvious that $M(X,Y) = 0$ if the two variables are independent.
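A small numerical sketch (the 2×2 joint probability table is an illustrative assumption, not data from this study) computes M(X,Y) directly from the definition:

```python
import numpy as np

# Hypothetical joint probability table p(x, y).
pxy = np.array([[0.30, 0.10],
                [0.20, 0.40]])
px = pxy.sum(axis=1, keepdims=True)   # marginal distribution of X
py = pxy.sum(axis=0, keepdims=True)   # marginal distribution of Y

# M(X,Y) = sum p(x,y) * ln( p(x,y) / (p(x)*p(y)) ), in nats
mask = pxy > 0
mi = np.sum(pxy[mask] * np.log(pxy[mask] / (px * py)[mask]))
print(mi)   # 0 only when the table factorizes, i.e. X and Y are independent
```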
Definition (2.3.4): Conditional entropy $H(X/Y)$ is a measure of what Y does not say about X, meaning how much of the information in X is not contained in Y; it takes the following form:
$$H(X/Y) = H(X,Y) - H(Y)$$
Remark: the definitions above can be extended to continuous variables if the summation symbol is replaced with the integration symbol.
If the two variables are independent, the conditional entropy $H(X/Y)$ equals $H(X)$. The relation between the measures of information can be illustrated as follows:
[Figure] Venn diagram: the relation between the information measures.
Definition (2.3.5): Kullback and Leibler (1951) introduced relative entropy, or information divergence, which measures the distance between two distributions of a random variable. This information measure is also known as KL-entropy and takes the following form:
$$KL(X/Y) = \sum_{i=1}^{n} p(x_i)\,\ln\frac{p(x_i)}{q(y_i)} \qquad (2.3.3)$$
Typically (2.3.3) is also regarded as the relative entropy for using Y instead of X, since (2.3.3) can be expressed in another form:
$$KL(X/Y) = \sum_{i=1}^{n} p(x_i)\,\ln p(x_i) - \sum_{i=1}^{n} p(x_i)\,\ln q(y_i)$$
$$= -H(X) - \sum_{i=1}^{n} p(x_i)\,\ln q(y_i)$$
For more simplicity, take the following example: suppose we have five events in the specified sample with probabilities (.2, .1, .3, .25, .15), and we want to know the divergence between these events and the probabilities of the uniform distribution. Substituting in (2.3.3) yields:
$$KL(X/Y) = \sum_{i=1}^{n} p(x_i)\,\ln\frac{p(x_i)}{q(y_i)} = .2\ln\frac{.2}{.2} + .1\ln\frac{.1}{.2} + .3\ln\frac{.3}{.2} + .25\ln\frac{.25}{.2} + .15\ln\frac{.15}{.2} = .065$$
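The same figure can be reproduced with a short sketch (the function name and data layout are illustrative choices, not from the thesis):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.2, 0.1, 0.3, 0.25, 0.15]   # probabilities of the five events above
q = [0.2] * 5                     # discrete uniform over the five events
print(kl_divergence(p, q))        # ~0.065 nats, as computed above
```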
Therefore, it can be concluded that if we replace the distribution of the sample with the uniform distribution, .065 nats of information are lost; thus (2.3.3) can be considered a good tool for discriminating between two distributions, see Gokhale (1983). One assumes that whenever $q(y_i) = 0$, the corresponding $p(x_i) = 0$ and $0\ln\frac{0}{0} = 0$, see Dukkipati (2006). Indeed, KL-entropy is not symmetric, that is:
$$KL(X/Y) \ne KL(Y/X)$$
Furthermore, $KL(X/Y)$ is a non-negative measure and it equals zero iff X and Y are identical:
$$KL(X/Y) \ge 0 \qquad (2.3.4)$$
According to Liu (2007), (2.3.4) can be established using the following inequality:
$$x\ln\frac{x}{y} \ge x - y \quad\text{for } x, y > 0 \qquad (2.3.5)$$
Hence, one can rewrite (2.3.3) according to (2.3.5) as:
$$\sum_{i=1}^{n} p(x_i)\,\ln\frac{p(x_i)}{q(y_i)} \;\ge\; \sum_{i=1}^{n} p(x_i) - \sum_{i=1}^{n} q(y_i) \qquad\text{for } p(x_i), q(y_i) > 0,$$
and since
$$\sum_{i=1}^{n} p(x_i) - \sum_{i=1}^{n} q(y_i) = 1 - 1 = 0,$$
it follows that
$$KL(X/Y) \ge 0.$$
Remark: KL can be applied when the variables are continuous by replacing the summation symbol with the integration notation; furthermore, all the properties remain valid, see Dukkipati (2006).
2.4 Lagrange Multiplier
In mathematical optimization, the method of Lagrange multipliers provides a strategy for finding the maximum or minimum of an objective function subject to constraints. To see this point, consider the following example:
$$\min\; f(x,y) = 2x^2 + y^2 \qquad (2.4.1)$$
subject to
$$x + y = 1$$
To solve (2.4.1), one can insert the constraint into the objective function, transforming the restricted optimization into an unrestricted optimization, and then search for the extreme values as follows:
$$y = 1 - x \qquad (2.4.2)$$
Hence, (2.4.1) can be written:
$$\min\; f(x) = 2x^2 + (1-x)^2$$
So the minimum point $(x, y)$ can be obtained as follows:
$$\frac{df(x)}{dx} = 0$$
$$4x - 2(1-x) = 0$$
$$6x - 2 = 0$$
$$x = \frac{1}{3} \qquad (2.4.3)$$
It is obvious that (2.4.3) refers to a minimum point, since the second derivative is positive; to obtain the value of y, substitute (2.4.3) into (2.4.2), which yields:
$$y = \frac{2}{3}$$
Indeed, the values of x and y can be reached via another route, using the principle of the Lagrange multiplier as follows.
To solve (2.4.1), one writes the Lagrangian function as:
$$Lagr(x,y,\lambda) = 2x^2 + y^2 + \lambda(x + y - 1)$$
where the constant $\lambda$ refers to the Lagrange multiplier and $Lagr$ refers to the Lagrangian function. The method works as follows:
$$\frac{\partial Lagr(x,y,\lambda)}{\partial x} = 4x + \lambda = 0$$
$$\frac{\partial Lagr(x,y,\lambda)}{\partial y} = 2y + \lambda = 0$$
$$\frac{\partial Lagr(x,y,\lambda)}{\partial \lambda} = x + y - 1 = 0 \qquad (2.4.4)$$
Since (2.4.4) represents a system of three equations in three variables (in general a nonlinear system), solving these equations yields the solution of (2.4.1):
$$x = \frac{1}{3}, \qquad y = \frac{2}{3}, \qquad \lambda = -\frac{4}{3} \qquad (2.4.5)$$
One can conclude that transforming (2.4.1) from a constrained optimization into an unconstrained optimization is equivalent to using the Lagrange multiplier principle. Indeed, there is another approach, known as the dual problem, to solve (2.4.1): the constrained problem is transformed into an unconstrained problem by replacing all the variables in the objective function with expressions in the Lagrange multiplier. From (2.4.4) one can conclude:
$$x = -\frac{\lambda}{4}, \qquad y = -\frac{\lambda}{2} \qquad (2.4.6)$$
Substituting (2.4.6) into the Lagrangian yields an objective function that contains only the Lagrange multiplier; therefore minimizing (2.4.1) with respect to $x, y$ implies maximizing the objective function with respect to $\lambda$. Since $\lambda$ enters with a negative sign, there is usually an opposite relation between the Lagrange multiplier and the objective function; hence (2.4.1) can be rewritten as the unrestricted problem:
$$\max_{\lambda}\; 2\left(\frac{\lambda}{4}\right)^2 + \left(\frac{\lambda}{2}\right)^2 + \lambda\left(-\frac{\lambda}{4} - \frac{\lambda}{2} - 1\right) = \max_{\lambda}\; -\frac{3\lambda^2}{8} - \lambda \qquad (2.4.7)$$
Taking the first derivative of (2.4.7) to obtain the extreme value:
$$\frac{d}{d\lambda}\left(-\frac{3\lambda^2}{8} - \lambda\right) = -\frac{3\lambda}{4} - 1 = 0 \;\Rightarrow\; \lambda = -\frac{4}{3} \qquad (2.4.8)$$
Substituting (2.4.8) into (2.4.6) yields the same solution as in (2.4.5).
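Both routes can be verified numerically; the sketch below (using scipy's general-purpose constrained optimizer, which is an assumption of this illustration rather than the method used in the thesis) solves (2.4.1) directly and also solves the linear system (2.4.4):

```python
import numpy as np
from scipy.optimize import minimize

# Numerical check of the worked example: min 2x^2 + y^2 subject to x + y = 1.
objective = lambda v: 2 * v[0] ** 2 + v[1] ** 2
constraint = {"type": "eq", "fun": lambda v: v[0] + v[1] - 1}

res = minimize(objective, x0=[0.0, 0.0], constraints=[constraint], method="SLSQP")
print(res.x)   # ~[1/3, 2/3], matching (2.4.3) and (2.4.5)

# The Lagrange conditions 4x + lam = 0, 2y + lam = 0, x + y = 1 solved directly:
A = np.array([[4.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
b = np.array([0.0, 0.0, 1.0])
print(np.linalg.solve(A, b))   # x = 1/3, y = 2/3, lambda = -4/3
```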
According to (later), some remarks should be taken into consideration when searching for a solution using the Lagrange multiplier principle:
1. The number of constraints must be less than or equal to the number of variables.
2. The constraints in the optimization problem must be independent.
In statistical inference there is a well-known test related to the Lagrange multiplier for testing hypotheses concerning the parameters of a distribution, see Engle (1984). Aitchison and Silvey (1958) proposed the Lagrange multiplier test, which derives from a restricted maximum likelihood estimation using a Lagrange multiplier.
Suppose it is required to maximize $L(x_1,\ldots,x_n;\theta)$ with respect to $\theta$ subject to the hypothesis that $\theta = \theta_0$; as mentioned above, the Lagrangian function can take the form:
$$Lagr(\theta,\lambda) = L(x_1,\ldots,x_n;\theta) - \lambda(\theta - \theta_0)$$
Differentiating $Lagr(\theta,\lambda)$ with respect to $\theta$ and $\lambda$ and setting the derivatives to zero yields:
$$\frac{\partial Lagr(\theta,\lambda)}{\partial\theta} = \frac{dL(x_1,\ldots,x_n;\theta)}{d\theta} - \lambda = 0 \qquad (2.4.9)$$
$$\frac{\partial Lagr(\theta,\lambda)}{\partial\lambda} = \theta_0 - \theta = 0 \qquad (2.4.10)$$
To solve (2.4.9) and (2.4.10) simultaneously, one can take the derivative of $L(x_1,\ldots,x_n;\theta)$ and then substitute (2.4.10) into that derivative, which gives:
$$\lambda = \frac{dL(x_1,\ldots,x_n;\theta_0)}{d\theta} \qquad (2.4.11)$$
Typically (2.4.11) is known as the score function $S(\theta_0)$. Since $\theta$ is often unknown, it is estimated by the MLE, see Section (2.1); a small value of $S(\theta_0)$ agrees with $\theta_0$ being close to the MLE, in which case the null hypothesis is accepted, otherwise it is rejected. Thus the score test measures the distance between the tested value and the MLE. It is obvious that zero and the Fisher information $I(\theta)$ represent the mean and the variance of $S(\theta_0)$, respectively; thus the Lagrange multiplier (LM) statistic can be written as:
$$LM = \frac{\left(S(\theta_0)\right)^2}{I(\theta_0)}$$
Under the null hypothesis, for a large sample, LM has a Chi-square distribution with one degree of freedom, for more details see Judge et al. (1982). Indeed, the LM test can be extended to test k parameters simultaneously as follows:
$$LM = S(\theta)^{t}\, I(\theta)^{-1}\, S(\theta) \qquad (2.4.12)$$
where $S(\theta)$ refers to the score function of the vector $\theta$ and $I(\theta)^{-1}$ refers to the inverse of the information matrix of order k, taking the following forms, respectively:
$$S(\theta) = \begin{pmatrix} \dfrac{dL(x_1,\ldots,x_n;\theta)}{d\theta_1} \\ \vdots \\ \dfrac{dL(x_1,\ldots,x_n;\theta)}{d\theta_k} \end{pmatrix}$$
$$I(\theta) = -\begin{pmatrix}
E\!\left[\dfrac{\partial^2 \ln L(x_1,\ldots,x_n;\theta)}{\partial\theta_1^2}\right] & \cdots & E\!\left[\dfrac{\partial^2 \ln L(x_1,\ldots,x_n;\theta)}{\partial\theta_1\,\partial\theta_k}\right] \\
\vdots & \ddots & \vdots \\
E\!\left[\dfrac{\partial^2 \ln L(x_1,\ldots,x_n;\theta)}{\partial\theta_k\,\partial\theta_1}\right] & \cdots & E\!\left[\dfrac{\partial^2 \ln L(x_1,\ldots,x_n;\theta)}{\partial\theta_k^2}\right]
\end{pmatrix}$$
Note: (2.4.12) also has a Chi-square distribution, with k degrees of freedom. For more simplicity, take the following example:
Let $X_1, X_2, \ldots, X_n$ be random variables from a sample of size n following a Normal($\mu, \sigma^2$) distribution, see Section (2.5), and suppose it is required to test:
$$H_0:\; \mu = \mu_0,\; \sigma^2 = \sigma_0^2$$
Using the LM test, the logarithm of the normal distribution's likelihood function is:
$$\ln L(x_1,\ldots,x_n;\mu,\sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2$$
The score function will be:
$$S_{normal}(\mu,\sigma^2) = \begin{pmatrix} \dfrac{\partial \ln L(x_1,\ldots,x_n;\mu,\sigma^2)}{\partial\mu} \\[6pt] \dfrac{\partial \ln L(x_1,\ldots,x_n;\mu,\sigma^2)}{\partial\sigma^2} \end{pmatrix} = \begin{pmatrix} \dfrac{1}{\sigma^2}\displaystyle\sum_{i=1}^{n}(x_i-\mu) \\[6pt] -\dfrac{n}{2\sigma^2} + \dfrac{1}{2\sigma^4}\displaystyle\sum_{i=1}^{n}(x_i-\mu)^2 \end{pmatrix}$$
Hence the score function under the null hypothesis is:
$$S_{normal}(\mu_0,\sigma_0^2) = \begin{pmatrix} \dfrac{1}{\sigma_0^2}\displaystyle\sum_{i=1}^{n}(x_i-\mu_0) \\[6pt] -\dfrac{n}{2\sigma_0^2} + \dfrac{1}{2\sigma_0^4}\displaystyle\sum_{i=1}^{n}(x_i-\mu_0)^2 \end{pmatrix}$$
The information matrix under the null hypothesis associated with the normal distribution is:
$$I_{normal}(\mu_0,\sigma_0^2) = \begin{pmatrix} \dfrac{n}{\sigma_0^2} & 0 \\ 0 & \dfrac{n}{2\sigma_0^4} \end{pmatrix} \quad\text{and}\quad I_{normal}(\mu_0,\sigma_0^2)^{-1} = \begin{pmatrix} \dfrac{\sigma_0^2}{n} & 0 \\ 0 & \dfrac{2\sigma_0^4}{n} \end{pmatrix}$$
Hence, the LM test can take the following form:
$$LM_{normal} = S_{normal}(\mu_0,\sigma_0^2)^{t}\, I_{normal}(\mu_0,\sigma_0^2)^{-1}\, S_{normal}(\mu_0,\sigma_0^2) = \frac{\left(a - n\mu_0\right)^2}{n\,\sigma_0^2} + \frac{\left(b - 2\mu_0 a + n\mu_0^2 - n\sigma_0^2\right)^2}{2n\,\sigma_0^4}$$
where:
$$a = \sum_{i=1}^{n} x_i, \qquad b = \sum_{i=1}^{n} x_i^2$$
Remark: as mentioned above, $LM_{normal}$ has a Chi-square distribution with 2 degrees of freedom. Suppose that, instead of testing the mean and the variance of the normal distribution simultaneously, it is required to test the mean only; then the only change will be in the score function, as follows:
$$S_{normal}(\mu_0,\sigma_0^2) = \begin{pmatrix} \dfrac{1}{\sigma_0^2}\displaystyle\sum_{i=1}^{n}(x_i-\mu_0) \\[6pt] 0 \end{pmatrix}$$
Therefore the LM test will be:
$$LM_{normal} = \frac{\left(a - n\mu_0\right)^2}{n\,\sigma_0^2} \;\sim\; \chi^2(1)$$
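As a numerical sketch of this special case (the simulated data and p-value computation are illustrative assumptions, not from the thesis), the LM statistic for the mean with $\sigma_0$ known can be computed directly from the score and the Fisher information:

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical sample; H0: mu = mu0 with sigma0 treated as known.
rng = np.random.default_rng(4)
sigma0 = 1.0
x = rng.normal(loc=0.2, scale=sigma0, size=100)
mu0 = 0.0

score = np.sum(x - mu0) / sigma0 ** 2      # S(mu0)
info = x.size / sigma0 ** 2                # Fisher information I(mu0)

lm = score ** 2 / info                     # = n * (x-bar - mu0)**2 / sigma0**2
print(lm, x.size * (x.mean() - mu0) ** 2 / sigma0 ** 2)   # identical by algebra
print(chi2.sf(lm, df=1))                   # approximate p-value under H0
```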
2.5 Some Important Distributions
In this section, some famous distributions which will be used in this thesis are briefly presented.
1 Normal Distribution:
The normal distribution, also called the Gaussian distribution, is an important family of continuous probability distributions, applicable in many fields. Each member of the family may be defined by two parameters, location and scale. The standard normal distribution is the normal distribution with a mean of zero and a variance of one. The importance of the normal distribution as a model of quantitative phenomena in the natural and behavioral sciences is due in part to the central limit theorem.
If X has a normal distribution with mean $\mu$ and variance $\sigma^2$, the density function takes the following form:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$
The normal distribution has important properties: the mean, median and mode are all equal, and the skewness and the excess kurtosis equal zero. In fact, the normal distribution has the maximum entropy among all distributions with a fixed variance, this entropy being equal to $\ln\left(\sigma\sqrt{2\pi e}\right)$, and its moment generating function is
$$M_X(t) = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right)$$
2 Uniform Distribution
In probability theory and statistics, the continuous uniform distribution is a family of probability distributions such that, for each member of the family, all intervals of the same length on the distribution's support are equally probable. This distribution is defined by two parameters, a and b, which are its minimum and maximum values respectively. It has an important role in random number generation techniques. The distribution is often abbreviated U(a, b).
If X has a uniform distribution with minimum a and maximum b, the density function takes the following form:
$$f(x) = \frac{1}{b-a}, \qquad a \le x \le b$$
3 Exponential Distribution
The exponential distribution describes the time between events in a Poisson process; indeed the exponential distribution can be seen as a special case of the Gamma distribution, and it has wide application in lifetime models, biology, mechanics, etc.
If X has an exponential distribution with rate parameter $\lambda > 0$, the density function takes the following form:
$$f(x) = \lambda e^{-\lambda x}, \qquad x \ge 0$$