Chapter (II)
Definitions and Notation
This chapter is concerned with some important definitions and notation that will be used in this study. The first section reviews several approaches to estimation, the second section is devoted to some topics in hypothesis testing, the third section focuses on measures of information, the fourth section deals with optimization subject to constraints via the Lagrange multiplier, and finally the fifth section describes, in brief, some important distributions.
2.1 Methods of Estimation
The problem of point estimation of a distribution's parameters plays a vital role in the statistical literature; therefore many methods of estimation have been proposed. This section is concerned with three methods of estimation.
1 Method of Moments
It is difficult to trace back who introduced the method of moments (MOM), but Johann Bernoulli (1667-1748) was the first to use the method in his work, see Gelder (1997). The idea of this method is to express the unknown parameters in terms of the unobserved population moments (for instance the mean, variance, skewness, kurtosis and coefficient of variation), and then to estimate these unobserved moments by the observed sample moments. Typically, the observed sample moments can take the following forms:
1. The moments about zero (raw moments): $E(X^r)$
2. The central moments: $E(X-\mu)^r$
3. The standard moments: $E\left(\dfrac{X-\mu}{\sigma}\right)^r$
where $r = 1, \ldots, k$ and $\mu$, $\sigma$ and $k$ refer to the mean, the standard deviation and the number of estimated parameters of the distribution, respectively.
Hence the method works by simultaneously solving a system of k equations in the k unknown parameters and the k observed sample moments.
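As a minimal numerical sketch of the method (not taken from this study), the following code matches the first two population moments of an assumed two-parameter gamma model, E[X] = kθ and Var(X) = kθ², to their sample counterparts; the simulated data and variable names are purely illustrative:

```python
import numpy as np

# Hypothetical sample; in practice x would be the observed data.
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=3.0, size=1000)

# Match the first two population moments of the Gamma(k, theta) model,
# E[X] = k*theta and Var(X) = k*theta**2, to the sample moments.
m1 = x.mean()                 # first sample moment (mean)
m2_central = x.var()          # second central sample moment (variance)

theta_hat = m2_central / m1   # scale estimate
k_hat = m1 / theta_hat        # shape estimate

print(k_hat, theta_hat)       # should be close to the true values 2 and 3
```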
2 Method of Maximum Likelihood
It is difficult to trace who discovered this tool, but Bernoulli in 1700 was the first to report on it, see Gelder (1997). The idea is that the specified sample should have a high probability of being drawn, so one searches for the parameter values that maximize the likelihood function for that sample.
The likelihood function is the joint density function of a completely random sample and takes the following form:
$$L(x_1,\ldots,x_n;\theta) = \prod_{i=1}^{n} f(x_i;\theta)$$
The method of maximum likelihood estimates $\theta$ by finding the value $\hat{\theta}$ that maximizes $L(x_1,\ldots,x_n;\theta)$; hence $\hat{\theta}$ is called the maximum likelihood estimator (MLE). In many cases $\hat{\theta}$ is obtained by solving the following equation:
$$\frac{dL(x_1,\ldots,x_n;\theta)}{d\theta} = 0 \qquad (2.1.1)$$
The maximum likelihood method can also be used to estimate k unknown parameters, in which case one solves a system of k equations in the k unknown parameters. It can be shown that the estimator defined in equation (2.1.1) cannot be obtained if the following conditions (often called regularity conditions) are not valid:
1. The first and second derivatives of the log-likelihood function must be defined.
2. The range of the Xs does not depend on the unknown parameter $\theta$.
Note: In many situations solving (2.1.1) is not easy; thus one can use a monotonic transformation that makes the calculation easier with no loss of information:
$$\frac{d\,\ln L(x_1,\ldots,x_n;\theta)}{d\theta} = \sum_{i=1}^{n}\frac{d\,\ln f(x_i;\theta)}{d\theta}$$
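As an illustrative sketch (not part of this study), the following code maximizes the log-likelihood of an assumed exponential(λ) sample numerically and compares the result with the closed-form MLE 1/x̄; the simulated data and the use of scipy's optimizer are assumptions made only for this example:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical exponential sample with true rate 1/2.
rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=500)

# Log-likelihood of an Exponential(rate=lam) sample:
# ln L(lam) = n*ln(lam) - lam*sum(x); we minimize its negative.
def neg_log_lik(lam):
    return -(x.size * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 10.0), method="bounded")
print(res.x, 1.0 / x.mean())   # numerical MLE vs the closed-form solution
```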
3 Method of Least Squares
The method of least squares, or ordinary least squares (OLS), plays a vital role in statistical research, particularly regression analysis, and was proposed by Gauss, see Gelder (1997). Typically OLS is used to estimate the relation between two variables, known as the independent and the dependent variable. Least squares problems fall into two categories, linear and non-linear. The linear least squares problem has a closed-form solution, while the non-linear problem does not and is usually solved by an iterative process. Furthermore, OLS can be applied with one or more independent variables; this study will focus on one independent variable.
Suppose $Y_1, Y_2, \ldots, Y_n$ are pairwise uncorrelated random variables representing the dependent variable and $X_1, X_2, \ldots, X_n$ represent the fixed independent variable, and suppose the relation between the $Y$s and the $X$s is expressed as:
$$Y_i = B_0 + B_1 X_i + U_i, \qquad i = 1,\ldots,n$$
where the $U$s refer to the residuals of the model. Thus OLS states that one should pick the values of the $B$s which make the sum of squared residuals as small as possible:
$$\min_{B_0,B_1}\; \sum_{i=1}^{n} U_i^2 = \sum_{i=1}^{n}\left(y_i - B_0 - B_1 x_i\right)^2$$
Differentiating the sum of squared residuals with respect to $B_0$ and $B_1$ yields:
$$-2\sum_{i=1}^{n}\left(y_i - B_0 - B_1 x_i\right) = 0$$
$$-2\sum_{i=1}^{n} x_i\left(y_i - B_0 - B_1 x_i\right) = 0 \qquad (2.1.2)$$
It is easy to check that (2.1.2) gives a minimum; hence, solving (2.1.2), one obtains:
$$b_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2}, \qquad b_0 = \bar{y} - b_1\,\bar{x}$$
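The closed-form solution above can be checked numerically; the following sketch (with simulated, purely illustrative data) computes $b_0$ and $b_1$ directly and cross-checks them against numpy's polynomial fit:

```python
import numpy as np

# Hypothetical data for the single-regressor model y = b0 + b1*x + u.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y = 1.5 + 0.8 * x + rng.normal(scale=0.5, size=100)

# Closed-form OLS solution derived above:
# b1 = (sum(x*y) - n*xbar*ybar) / (sum(x**2) - n*xbar**2),  b0 = ybar - b1*xbar
n = x.size
b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
b0 = y.mean() - b1 * x.mean()

print(b0, b1)
print(np.polyfit(x, y, deg=1))   # cross-check: returns [slope, intercept]
```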
So far, it is not obvious which of these methods is more efficient than the others; to address this question, some topics related to the properties of point estimators and to confidence intervals should be discussed.
Definition (2.1.1): In statistics, point estimation refers to the use of sample data to calculate a single value, well known as a statistic (an observed function of the sample, where the function itself does not depend on the parameter), which is to serve as a best guess for an unknown population parameter.
Definition (2.1.2) Unbiased Estimator: The first criterion by which estimators can be classified is unbiasedness. Suppose $\hat{\theta}$ is a statistic from an observed random sample, considered as a point estimator for $\theta$; $\hat{\theta}$ is called an unbiased estimator for $\theta$ iff $E(\hat{\theta}) = \theta$. If this condition holds only as the sample size becomes large, $\hat{\theta}$ is called an asymptotically unbiased estimator for $\theta$.
Definition (2.1.3) Relative Efficiency: Suppose $\hat{\theta}_1$ and $\hat{\theta}_2$ are two estimators for $\theta$; $\hat{\theta}_1$ is said to be relatively more efficient than $\hat{\theta}_2$ iff
$$\frac{\operatorname{Var}(\hat{\theta}_1)}{\operatorname{Var}(\hat{\theta}_2)} \le 1$$
Reject $H_0$ if $\lambda \le k$, accept $H_0$ if $\lambda > k$, where
$$\lambda = \frac{\prod_{i=1}^{n} f(x_i;\theta_0)}{\prod_{i=1}^{n} f(x_i;\theta_1)}$$
and $k$ is a positive constant.
The idea is that we calculate the ratio between the likelihood functions under $H_0$ and $H_1$; a high value of this ratio indicates acceptance of $H_0$, otherwise $H_0$ is rejected. This ratio is well known as the simple likelihood ratio, or the Neyman-Pearson lemma.
Definition (2.2.2): If it is required to test a simple hypothesis versus a composite alternative hypothesis, then among all tests of size $\alpha$ or less, the test that is most powerful against all alternative hypotheses is called the Uniformly Most Powerful Test, and it takes the following form:
Reject $H_0$ if $\Lambda \le c$, accept $H_0$ if $\Lambda > c$, where
$$\Lambda = \frac{\prod_{i=1}^{n} f(x_i;\hat{\theta}_0)}{\prod_{i=1}^{n} f(x_i;\hat{\theta})}$$
and $c$ is a positive constant.
The idea is that we calculate the ratio between the likelihood function under $H_0$ and the likelihood function maximized over the whole parameter space of $\theta$ (that is, $f(x_i;\hat{\theta})$ refers to maximization over all of the parameter space); this ratio is typically called the Generalized Likelihood Ratio.
It is obvious that the simple likelihood ratio $\lambda$ is a special case of $\Lambda$. The distribution of $\Lambda$ corresponding to a particular null and alternative hypothesis has to be obtained from the sampling distribution of the test statistic, which in many cases is not easy; fortunately, it has been proved that for any particular null and alternative hypothesis, $-2\ln\Lambda$ has approximately a $\chi^2$ distribution with degrees of freedom equal to the number of parameters tested in the null hypothesis.
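As a small numerical illustration of the $\chi^2$ approximation (assumed here for the mean of a normal with known variance, not an example taken from the thesis), $-2\ln\Lambda$ reduces to $n(\bar{x}-\mu_0)^2/\sigma^2$, and both forms can be compared directly:

```python
import numpy as np
from scipy.stats import chi2, norm

# Hypothetical sample; sigma is treated as known.
rng = np.random.default_rng(3)
sigma = 2.0
x = rng.normal(loc=5.3, scale=sigma, size=200)
mu0 = 5.0                                   # H0: mu = mu0

def log_lik(mu):
    return norm.logpdf(x, loc=mu, scale=sigma).sum()

# Lambda = L(mu0)/L(mu_hat) with mu_hat = x-bar; for this model
# -2*ln(Lambda) equals n*(x-bar - mu0)**2 / sigma**2.
stat = -2 * (log_lik(mu0) - log_lik(x.mean()))
print(stat, x.size * (x.mean() - mu0) ** 2 / sigma ** 2)  # the two agree
print(chi2.sf(stat, df=1))                                # approximate p-value
```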
2.3 Measures of Information
A great variety of information measures have been proposed in the literature recently, see Esteban (1995). Since Shannon (1948) made a huge contribution to the development of information theory, this section deals with Shannon's entropy and some measures related to Shannon's (1948) entropy.
Definition (2.3.1): The origin of the entropy concept goes back to Ludwig Boltzmann (1877); it is a Greek notion meaning transformation, and it was given a probabilistic interpretation in information theory by Shannon (1948). He considered entropy as an index of the uncertainty associated with a random variable, expressed in nats, where a nat (sometimes nit or nepit) is a unit of information or entropy based on natural logarithms. Let there be n events with probabilities $p_1, p_2, \ldots, p_n$ adding up to 1; Shannon (1948) stated that the entropy corresponding to these events takes the following form:
$$H(X) = -\sum_{i=1}^{n} p(x_i)\,\ln p(x_i) \qquad (2.3.1)$$
Hence, Shannon (1948) claimed that via (2.3.1) one can transform the information in the sample from an invisible form into a numerical, physical form, so that comparisons can easily be made and understood. Frenken (2003) mentioned that (2.3.1) can be regarded as the variance for qualitative data.
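A minimal sketch of (2.3.1) in code (the probability vectors are illustrative assumptions, not data from this study) shows the two extreme cases, zero entropy for a certain event and maximum entropy for equal probabilities:

```python
import numpy as np

def shannon_entropy(p):
    """Entropy in nats, H(X) = -sum p_i * ln(p_i), taking 0*ln(0) as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # drop zero-probability events
    return -np.sum(p * np.log(p))

print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # equal probabilities: ln(4) ~ 1.386
print(shannon_entropy([1.0, 0.0, 0.0, 0.0]))      # a certain event: 0
```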
To show how Shannon (1948) arrived at (2.3.1), let $n_1, n_2, \ldots, n_k$ be the number of times each category occurs in an experiment of length n, where:
$$\sum_{i=1}^{k} n_i = n \quad\text{and}\quad p_i = \frac{n_i}{n}$$
According to Golan (1996), Shannon (1948) mentioned that the number of all possible combinations that partition n into k categories of sizes $n_1, \ldots, n_k$ can be an indicator of the accuracy of any decision associated with this sample; this number of combinations can be presented as:
$$W = C^{\,n}_{\,n_1, n_2, \ldots, n_k} = \frac{n!}{n_1!\, n_2! \cdots n_k!} \qquad (2.3.2)$$
It is obvious that (2.3.2) is always greater than or equal to one; if (2.3.2) equals one, the sample has only one category, which corresponds to maximum accuracy and minimum uncertainty. For simplicity, Shannon (1948) preferred to deal with the logarithm of W, as follows:
$$\ln(W) = \ln n! - \sum_{i=1}^{k} \ln n_i!$$
Using Stirling's approximation, which states that
$$\ln x! \approx x\ln x - x \quad \text{as } x \to \infty,$$
ln(W) becomes:
$$\ln(W) \approx n\ln n - n - \sum_{i=1}^{k}\left(n_i\ln n_i - n_i\right)$$
$$= n\ln n - \sum_{i=1}^{k} n_i\ln n_i$$
$$= n\ln n - \sum_{i=1}^{k} n_i\ln(n\,p_i)$$
$$= n\ln n - \sum_{i=1}^{k} n_i\left(\ln n + \ln p_i\right)$$
$$= n\ln n - \ln n\sum_{i=1}^{k} n_i - \sum_{i=1}^{k} n_i\ln p_i$$
$$= -\sum_{i=1}^{k} n_i\ln p_i$$
$$\frac{1}{n}\ln(W) \approx -\sum_{i=1}^{k} p_i\ln p_i = H(p)$$
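The Stirling-based result above can be checked numerically; the following sketch (with hypothetical category counts, chosen only for illustration) compares the exact ln W from (2.3.2) with n·H(p):

```python
import numpy as np
from scipy.special import gammaln

# Hypothetical counts n_1..n_k for an experiment of length n = sum(counts).
counts = np.array([400, 300, 200, 100])
n = counts.sum()
p = counts / n

# Exact ln W = ln(n!) - sum ln(n_i!), computed via gammaln(m + 1) = ln(m!).
log_W = gammaln(n + 1) - gammaln(counts + 1).sum()

# Stirling-based approximation derived above: ln W ~ n * H(p) = -n * sum p_i ln p_i
approx = -n * np.sum(p * np.log(p))
print(log_W, approx)   # the two values are close for moderately large counts
```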
Therefore, Shannon's (1948) entropy can be regarded as a measure of the average accuracy associated with decisions about the sample. Indeed, Shannon (1948) mentioned that (2.3.1) satisfies the following properties:
1. The quantity $H(X)$ reaches a minimum, equal to zero, when one of the events is a certainty, assuming $0\ln(0) = 0$; and $H(X)$ reaches its maximum when all the probabilities are equal, hence $H(X)$ can be regarded as a concave function.
2. If some events have zero probability, they can just as well be left out of the entropy when we evaluate the uncertainty.
3. Entropy must be symmetric, that is, it does not depend on the order of the probabilities.
For a continuous distribution, (2.3.1) takes the following form:
$$H(X) = -\int f(x,\theta)\,\ln f(x,\theta)\,dx$$
Definition (2.3.2): Joint entropy is a measure of the uncertainty of two variables and takes the following form:
$$H(X,Y) = -\sum_{i,j} p(x_i,y_j)\,\ln p(x_i,y_j)$$
It is obvious that:
$$H(X,Y) \le H(X) + H(Y)$$
According to Shannon (1948), the uncertainty of a joint event is less than or equal to the sum of the individual uncertainties, with equality only if the events are independent.
Definition (2.3.3): Mutual information measures the information that X and Y share, and takes the following form:
$$M(X,Y) = \sum_{i,j} p(x_i,y_j)\,\ln\frac{p(x_i,y_j)}{p(x_i)\,p(y_j)}$$
It is obvious that $M(X,Y) = 0$ if the two variables are independent.
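A small numerical sketch (the 2×2 joint probability table is an illustrative assumption, not data from this study) computes M(X,Y) directly from the definition:

```python
import numpy as np

# Hypothetical joint probability table p(x, y).
pxy = np.array([[0.30, 0.10],
                [0.20, 0.40]])
px = pxy.sum(axis=1, keepdims=True)   # marginal distribution of X
py = pxy.sum(axis=0, keepdims=True)   # marginal distribution of Y

# M(X,Y) = sum p(x,y) * ln( p(x,y) / (p(x)*p(y)) ), in nats
mask = pxy > 0
mi = np.sum(pxy[mask] * np.log(pxy[mask] / (px * py)[mask]))
print(mi)   # 0 only when the table factorizes, i.e. X and Y are independent
```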
Definition (2.3.4): Conditional entropy $H(X/Y)$ is a measure of what Y does not say about X, meaning how much of the information in X is not contained in Y; it takes the following form:
$$H(X/Y) = H(X,Y) - H(Y)$$
Remark: the definitions above can be extended to continuous variables if the summation symbol is replaced with the integration symbol.
If the two variables are independent, the conditional entropy $H(X/Y)$ equals $H(X)$. The relation between the measures of information can be illustrated as follows:
[Figure] Venn diagram: the relation between the information measures.
Definition (2.3.5): Kullback and Leibler (1951) introduced relative entropy, or information divergence, which measures the distance between two distributions of a random variable. This information measure is also known as KL-entropy and takes the following form:
$$KL(X/Y) = \sum_{i=1}^{n} p(x_i)\,\ln\frac{p(x_i)}{q(y_i)} \qquad (2.3.3)$$
Typically (2.3.3) is also regarded as the relative entropy for using Y instead of X, since (2.3.3) can be expressed in another form:
$$KL(X/Y) = \sum_{i=1}^{n} p(x_i)\,\ln p(x_i) - \sum_{i=1}^{n} p(x_i)\,\ln q(y_i)$$
$$= -H(X) - \sum_{i=1}^{n} p(x_i)\,\ln q(y_i)$$
For more simplicity, take the following example: suppose we have five events in the specified sample with probabilities (.2, .1, .3, .25, .15), and we want to know the divergence between these events and the probabilities of the uniform distribution. Substituting in (2.3.3) yields:
$$KL(X/Y) = \sum_{i=1}^{n} p(x_i)\,\ln\frac{p(x_i)}{q(y_i)} = .2\ln\frac{.2}{.2} + .1\ln\frac{.1}{.2} + .3\ln\frac{.3}{.2} + .25\ln\frac{.25}{.2} + .15\ln\frac{.15}{.2} = .065$$
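The same figure can be reproduced with a short sketch (the function name and data layout are illustrative choices, not from the thesis):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.2, 0.1, 0.3, 0.25, 0.15]   # probabilities of the five events above
q = [0.2] * 5                     # discrete uniform over the five events
print(kl_divergence(p, q))        # ~0.065 nats, as computed above
```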
Therefore, it can be concluded that if we replace the distribution of the sample with the uniform distribution, .065 nats of information are lost; thus (2.3.3) can be considered a good tool for discriminating between two distributions, see Gokhale (1983). One assumes that whenever $q(y_i) = 0$, the corresponding $p(x_i) = 0$ and $0\ln\frac{0}{0} = 0$, see Dukkipati (2006). Indeed, KL-entropy is not symmetric, that is:
$$KL(X/Y) \ne KL(Y/X)$$
Furthermore, $KL(X/Y)$ is a non-negative measure and it equals zero iff X and Y are identical:
$$KL(X/Y) \ge 0 \qquad (2.3.4)$$
According to Liu (2007), (2.3.4) can be established using the following inequality:
$$x\ln\frac{x}{y} \ge x - y \quad\text{for } x, y > 0 \qquad (2.3.5)$$
Hence, one can rewrite (2.3.3) according to (2.3.5) as:
$$\sum_{i=1}^{n} p(x_i)\,\ln\frac{p(x_i)}{q(y_i)} \;\ge\; \sum_{i=1}^{n} p(x_i) - \sum_{i=1}^{n} q(y_i) \qquad\text{for } p(x_i), q(y_i) > 0,$$
and since
$$\sum_{i=1}^{n} p(x_i) - \sum_{i=1}^{n} q(y_i) = 1 - 1 = 0,$$
it follows that
$$KL(X/Y) \ge 0.$$
Remark: KL can be applied when the variables are continuous by replacing the summation symbol with the integration notation; furthermore, all the properties remain valid, see Dukkipati (2006).
2.4 Lagrange Multiplier
In mathematical optimization, the method of Lagrange multipliers provides a strategy for finding the maximum or minimum of an objective function subject to constraints. To see this point, consider the following example:
$$\min\; f(x,y) = 2x^2 + y^2 \qquad (2.4.1)$$
subject to
$$x + y = 1$$
To solve (2.4.1), one can insert the constraint into the objective function, transforming the restricted optimization into an unrestricted optimization, and then search for the extreme values as follows:
$$y = 1 - x \qquad (2.4.2)$$
Hence, (2.4.1) can be written:
$$\min\; f(x) = 2x^2 + (1-x)^2$$
So the minimum point $(x, y)$ can be obtained as follows:
$$\frac{df(x)}{dx} = 0$$
$$4x - 2(1-x) = 0$$
$$6x - 2 = 0$$
$$x = \frac{1}{3} \qquad (2.4.3)$$
It is obvious that (2.4.3) refers to a minimum point, since the second derivative is positive; to obtain the value of y, substitute (2.4.3) into (2.4.2), which yields:
$$y = \frac{2}{3}$$
Indeed, the values of x and y can be reached via another route, using the principle of the Lagrange multiplier as follows.
To solve (2.4.1), one writes the Lagrangian function as:
$$Lagr(x,y,\lambda) = 2x^2 + y^2 + \lambda(x + y - 1)$$
where the constant $\lambda$ refers to the Lagrange multiplier and $Lagr$ refers to the Lagrangian function. The method works as follows:
$$\frac{\partial Lagr(x,y,\lambda)}{\partial x} = 4x + \lambda = 0$$
$$\frac{\partial Lagr(x,y,\lambda)}{\partial y} = 2y + \lambda = 0$$
$$\frac{\partial Lagr(x,y,\lambda)}{\partial \lambda} = x + y - 1 = 0 \qquad (2.4.4)$$
Since (2.4.4) represents a system of three equations in three variables (in general a nonlinear system), solving these equations yields the solution of (2.4.1):
$$x = \frac{1}{3}, \qquad y = \frac{2}{3}, \qquad \lambda = -\frac{4}{3} \qquad (2.4.5)$$
One can conclude that transforming (2.4.1) from a constrained optimization into an unconstrained optimization is equivalent to using the Lagrange multiplier principle. Indeed, there is another approach, known as the dual problem, to solve (2.4.1): the constrained problem is transformed into an unconstrained problem by replacing all the variables in the objective function with expressions in the Lagrange multiplier. From (2.4.4) one can conclude:
$$x = -\frac{\lambda}{4}, \qquad y = -\frac{\lambda}{2} \qquad (2.4.6)$$
Substituting (2.4.6) into the Lagrangian yields an objective function that contains only the Lagrange multiplier; therefore minimizing (2.4.1) with respect to $x, y$ implies maximizing the objective function with respect to $\lambda$. Since $\lambda$ enters with a negative sign, there is usually an opposite relation between the Lagrange multiplier and the objective function; hence (2.4.1) can be rewritten as the unrestricted problem:
$$\max_{\lambda}\; 2\left(\frac{\lambda}{4}\right)^2 + \left(\frac{\lambda}{2}\right)^2 + \lambda\left(-\frac{\lambda}{4} - \frac{\lambda}{2} - 1\right) = \max_{\lambda}\; -\frac{3\lambda^2}{8} - \lambda \qquad (2.4.7)$$
Taking the first derivative of (2.4.7) to obtain the extreme value:
$$\frac{d}{d\lambda}\left(-\frac{3\lambda^2}{8} - \lambda\right) = -\frac{3\lambda}{4} - 1 = 0 \;\Rightarrow\; \lambda = -\frac{4}{3} \qquad (2.4.8)$$
Substituting (2.4.8) into (2.4.6) yields the same solution as in (2.4.5).
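Both routes can be verified numerically; the sketch below (using scipy's general-purpose constrained optimizer, which is an assumption of this illustration rather than the method used in the thesis) solves (2.4.1) directly and also solves the linear system (2.4.4):

```python
import numpy as np
from scipy.optimize import minimize

# Numerical check of the worked example: min 2x^2 + y^2 subject to x + y = 1.
objective = lambda v: 2 * v[0] ** 2 + v[1] ** 2
constraint = {"type": "eq", "fun": lambda v: v[0] + v[1] - 1}

res = minimize(objective, x0=[0.0, 0.0], constraints=[constraint], method="SLSQP")
print(res.x)   # ~[1/3, 2/3], matching (2.4.3) and (2.4.5)

# The Lagrange conditions 4x + lam = 0, 2y + lam = 0, x + y = 1 solved directly:
A = np.array([[4.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
b = np.array([0.0, 0.0, 1.0])
print(np.linalg.solve(A, b))   # x = 1/3, y = 2/3, lambda = -4/3
```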
According to (later), some remarks should be taken into consideration when searching for a solution using the Lagrange multiplier principle:
1. The number of constraints must be less than or equal to the number of variables.
2. The constraints in the optimization problem must be independent.
In statistical inference there is a well-known test related to the Lagrange multiplier for testing hypotheses concerning the parameters of a distribution, see Engle (1984). Aitchison and Silvey (1958) proposed the Lagrange multiplier test, which derives from a restricted maximum likelihood estimation using a Lagrange multiplier.
Suppose it is required to maximize $L(x_1,\ldots,x_n;\theta)$ with respect to $\theta$ subject to the hypothesis that $\theta = \theta_0$; as mentioned above, the Lagrangian function can take the form:
$$Lagr(\theta,\lambda) = L(x_1,\ldots,x_n;\theta) - \lambda(\theta - \theta_0)$$
Differentiating $Lagr(\theta,\lambda)$ with respect to $\theta$ and $\lambda$ and setting the derivatives to zero yields:
$$\frac{\partial Lagr(\theta,\lambda)}{\partial\theta} = \frac{dL(x_1,\ldots,x_n;\theta)}{d\theta} - \lambda = 0 \qquad (2.4.9)$$
$$\frac{\partial Lagr(\theta,\lambda)}{\partial\lambda} = \theta_0 - \theta = 0 \qquad (2.4.10)$$
To solve (2.4.9) and (2.4.10) simultaneously, one can take the derivative of $L(x_1,\ldots,x_n;\theta)$ and then substitute (2.4.10) into that derivative, which gives:
$$\lambda = \frac{dL(x_1,\ldots,x_n;\theta_0)}{d\theta} \qquad (2.4.11)$$
Typically (2.4.11) is known as the score function $S(\theta_0)$. Since $\theta$ is often unknown, it is estimated by the MLE, see Section (2.1); a small value of $S(\theta_0)$ agrees with $\theta_0$ being close to the MLE, in which case the null hypothesis is accepted, otherwise it is rejected. Thus the score test measures the distance between the tested value and the MLE. It is obvious that zero and the Fisher information $I(\theta)$ represent the mean and the variance of $S(\theta_0)$, respectively; thus the Lagrange multiplier (LM) statistic can be written as:
$$LM = \frac{\left(S(\theta_0)\right)^2}{I(\theta_0)}$$
Under the null hypothesis, for a large sample, LM has a Chi-square distribution with one degree of freedom, for more details see Judge et al. (1982). Indeed, the LM test can be extended to test k parameters simultaneously as follows:
$$LM = S(\theta)^{t}\, I(\theta)^{-1}\, S(\theta) \qquad (2.4.12)$$
where $S(\theta)$ refers to the score function of the vector $\theta$ and $I(\theta)^{-1}$ refers to the inverse of the information matrix of order k, taking the following forms, respectively:
$$S(\theta) = \begin{pmatrix} \dfrac{dL(x_1,\ldots,x_n;\theta)}{d\theta_1} \\ \vdots \\ \dfrac{dL(x_1,\ldots,x_n;\theta)}{d\theta_k} \end{pmatrix}$$
$$I(\theta) = -\begin{pmatrix}
E\!\left[\dfrac{\partial^2 \ln L(x_1,\ldots,x_n;\theta)}{\partial\theta_1^2}\right] & \cdots & E\!\left[\dfrac{\partial^2 \ln L(x_1,\ldots,x_n;\theta)}{\partial\theta_1\,\partial\theta_k}\right] \\
\vdots & \ddots & \vdots \\
E\!\left[\dfrac{\partial^2 \ln L(x_1,\ldots,x_n;\theta)}{\partial\theta_k\,\partial\theta_1}\right] & \cdots & E\!\left[\dfrac{\partial^2 \ln L(x_1,\ldots,x_n;\theta)}{\partial\theta_k^2}\right]
\end{pmatrix}$$
Note: (2.4.12) also has a Chi-square distribution, with k degrees of freedom. For more simplicity, take the following example:
Let $X_1, X_2, \ldots, X_n$ be random variables from a sample of size n following a Normal($\mu, \sigma^2$) distribution, see Section (2.5), and suppose it is required to test:
$$H_0:\; \mu = \mu_0,\; \sigma^2 = \sigma_0^2$$
Using the LM test, the logarithm of the normal distribution's likelihood function is:
$$\ln L(x_1,\ldots,x_n;\mu,\sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2$$
The score function will be:
$$S_{normal}(\mu,\sigma^2) = \begin{pmatrix} \dfrac{\partial \ln L(x_1,\ldots,x_n;\mu,\sigma^2)}{\partial\mu} \\[6pt] \dfrac{\partial \ln L(x_1,\ldots,x_n;\mu,\sigma^2)}{\partial\sigma^2} \end{pmatrix} = \begin{pmatrix} \dfrac{1}{\sigma^2}\displaystyle\sum_{i=1}^{n}(x_i-\mu) \\[6pt] -\dfrac{n}{2\sigma^2} + \dfrac{1}{2\sigma^4}\displaystyle\sum_{i=1}^{n}(x_i-\mu)^2 \end{pmatrix}$$
Hence the score function under the null hypothesis is:
$$S_{normal}(\mu_0,\sigma_0^2) = \begin{pmatrix} \dfrac{1}{\sigma_0^2}\displaystyle\sum_{i=1}^{n}(x_i-\mu_0) \\[6pt] -\dfrac{n}{2\sigma_0^2} + \dfrac{1}{2\sigma_0^4}\displaystyle\sum_{i=1}^{n}(x_i-\mu_0)^2 \end{pmatrix}$$
The information matrix under the null hypothesis associated with the normal distribution is:
$$I_{normal}(\mu_0,\sigma_0^2) = \begin{pmatrix} \dfrac{n}{\sigma_0^2} & 0 \\ 0 & \dfrac{n}{2\sigma_0^4} \end{pmatrix} \quad\text{and}\quad I_{normal}(\mu_0,\sigma_0^2)^{-1} = \begin{pmatrix} \dfrac{\sigma_0^2}{n} & 0 \\ 0 & \dfrac{2\sigma_0^4}{n} \end{pmatrix}$$
Hence, the LM test can take the following form:
$$LM_{normal} = S_{normal}(\mu_0,\sigma_0^2)^{t}\, I_{normal}(\mu_0,\sigma_0^2)^{-1}\, S_{normal}(\mu_0,\sigma_0^2) = \frac{\left(a - n\mu_0\right)^2}{n\,\sigma_0^2} + \frac{\left(b - 2\mu_0 a + n\mu_0^2 - n\sigma_0^2\right)^2}{2n\,\sigma_0^4}$$
where:
$$a = \sum_{i=1}^{n} x_i, \qquad b = \sum_{i=1}^{n} x_i^2$$
Remark: as mentioned above, $LM_{normal}$ has a Chi-square distribution with 2 degrees of freedom. Suppose that, instead of testing the mean and the variance of the normal distribution simultaneously, it is required to test the mean only; then the only change will be in the score function, as follows:
$$S_{normal}(\mu_0,\sigma_0^2) = \begin{pmatrix} \dfrac{1}{\sigma_0^2}\displaystyle\sum_{i=1}^{n}(x_i-\mu_0) \\[6pt] 0 \end{pmatrix}$$
Therefore the LM test will be:
$$LM_{normal} = \frac{\left(a - n\mu_0\right)^2}{n\,\sigma_0^2} \;\sim\; \chi^2(1)$$
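As a numerical sketch of this special case (the simulated data and p-value computation are illustrative assumptions, not from the thesis), the LM statistic for the mean with $\sigma_0$ known can be computed directly from the score and the Fisher information:

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical sample; H0: mu = mu0 with sigma0 treated as known.
rng = np.random.default_rng(4)
sigma0 = 1.0
x = rng.normal(loc=0.2, scale=sigma0, size=100)
mu0 = 0.0

score = np.sum(x - mu0) / sigma0 ** 2      # S(mu0)
info = x.size / sigma0 ** 2                # Fisher information I(mu0)

lm = score ** 2 / info                     # = n * (x-bar - mu0)**2 / sigma0**2
print(lm, x.size * (x.mean() - mu0) ** 2 / sigma0 ** 2)   # identical by algebra
print(chi2.sf(lm, df=1))                   # approximate p-value under H0
```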
2.5 Some Important Distributions
In this section, some famous distributions which will be used in this thesis are briefly presented.
1 Normal Distribution:
The normal distribution, also called the Gaussian distribution, is an important family of continuous probability distributions, applicable in many fields. Each member of the family may be defined by two parameters, location and scale. The standard normal distribution is the normal distribution with a mean of zero and a variance of one. The importance of the normal distribution as a model of quantitative phenomena in the natural and behavioral sciences is due in part to the central limit theorem.
If X has a normal distribution with mean $\mu$ and variance $\sigma^2$, the density function takes the following form:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$
The normal distribution has important properties: the mean, median and mode are all equal, and the skewness and the excess kurtosis equal zero. In fact, the normal distribution has the maximum entropy among all distributions with a fixed variance, this entropy being equal to $\ln\left(\sigma\sqrt{2\pi e}\right)$, and its moment generating function is
$$M_X(t) = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right)$$
2 Uniform Distribution
In probability theory and statistics, the continuous uniform distribution is a family of probability distributions such that, for each member of the family, all intervals of the same length on the distribution's support are equally probable. This distribution is defined by two parameters, a and b, which are its minimum and maximum values respectively. It has an important role in random number generation techniques. The distribution is often abbreviated U(a, b).
If X has a uniform distribution with minimum a and maximum b, the density function takes the following form:
$$f(x) = \frac{1}{b-a}, \qquad a \le x \le b$$
3 Exponential Distribution
The exponential distribution describes the time between events in a Poisson process; indeed the exponential distribution can be seen as a special case of the Gamma distribution, and it has wide application in lifetime models, biology, mechanics, etc.
If X has an exponential distribution with rate parameter $\lambda > 0$, the density function takes the following form:
$$f(x) = \lambda e^{-\lambda x}, \qquad x \ge 0$$