Theory of Differentiation in Statistics
Mohammed Nasser, Department of Statistics


TRANSCRIPT

Page 1:

Theory of Differentiation in Statistics
Mohammed Nasser
Department of Statistics

Page 2:

Relation between Statistics and Differentiation

Statistical Concept/Technique → Use of Differentiation Theory

• Study of the shapes of univariate pdfs → an easy application of first-order and second-order derivatives
• Calculation/stabilization of the variance of a random variable → an application of Taylor's theorem
• Calculation of moments from the MGF/CF → differentiating the MGF/CF

Page 3:

Relation between Statistics and Differentiation

• Description of a density / a model → differential equations such as dy/dx = k or dy/dx = kx
• Optimization of a risk functional / regularized functional / empirical risk functional, with or without constraints → needs heavy tools of nonlinear optimization, techniques that depend on multivariate differential calculus and functional differential calculus
• Influence function to assess robustness of a statistical functional → an easy application of the directional derivative in function space

Page 4:

Relation between Statistics and Differentiation

• Classical delta theorem to find an asymptotic distribution → an application of the ordinary Taylor's theorem
• Von Mises calculus → extensive application of functional differential calculus
• Relation between probability measures and probability density functions → Radon–Nikodym theorem

Page 5:

Monotone Functions

A monotone function f(x) is either

• monotone increasing: strictly increasing or non-decreasing, or
• monotone decreasing: strictly decreasing or non-increasing.

Page 6:

Increasing/Decreasing Test

• If f′(x) > 0 for all x in an interval, then f is increasing on that interval.
• If f′(x) < 0 for all x in an interval, then f is decreasing on that interval.

Example: f : R → R, f(x) = x³.

Page 7:

Example of a Monotone Increasing Function

f : R → R, f(x) = x³, with f′(x) = 3x² ≥ 0 for all x.

Page 8:

Maximum/Minimum

Is there any sufficient condition that guarantees the existence of a global max, a global min, or both?

Page 9:

Some Results to Mention

• If a function is continuous and its domain is compact, the function attains its extrema. This is a very general result: it holds for any compact space, not only compact subsets of Rⁿ.
• Any convex (concave) function that has a local minimum (maximum) attains its global minimum (maximum) there.
• Some functions have a global min (max) without satisfying any of the above conditions.

First prove the existence of an extremum, then calculate it.

Page 10:

What does f′(x₀) = 0 say about f?

Fermat's theorem: if f has a local maximum or minimum at c, and if f′(c) exists, then f′(c) = 0. The converse is not true.

Page 11:

Concavity

• If f″(x) > 0 for all x in (a, b), then the graph of f is convex (concave up) on (a, b).
• If f″(x) < 0 for all x in (a, b), then the graph of f is concave (concave down) on (a, b).
• If f″(c) = 0 and f″ changes sign at c, then f has a point of inflection at c.

Page 12:

Maximum/Minimum

Let f(x) be a differentiable function on an interval I.

• f has a maximum at c ∈ I if f′(c) = 0 and f″(c) < 0.
• f has a minimum at c ∈ I if f′(c) = 0 and f″(c) > 0.
• If f′(x) < 0 for all x in an interval, then f is maximized at the first end point of the interval if the left side is closed, and minimized at the last end point if the right side is closed.
• If f′(x) > 0 for all x in an interval, then f is minimized at the first end point of the interval if the left side is closed, and maximized at the last end point if the right side is closed.

Page 13:

Normal Distribution

The probability density function is given as

f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad x \in \mathbb{R}.

• f is continuous on R and f(x) ≥ 0.
• f is differentiable on R.
• lim_{x → ±∞} f(x) = 0.

(Figure: the curve is convex in the tails and concave in the middle, with points of inflection in between.)

Page 14:

Normal Distribution

Take logs on both sides and set the first derivative equal to zero. Now,

\log f(x) = \log\frac{1}{\sigma\sqrt{2\pi}} - \frac{(x-\mu)^2}{2\sigma^2}

\frac{f'(x)}{f(x)} = -\frac{x-\mu}{\sigma^2}

Since f(x) > 0, f'(x) = 0 implies x - \mu = 0, i.e. x = \mu.

Page 15:

Normal Distribution

Differentiating once more,

f''(x) = f(x)\left(\frac{(x-\mu)^2}{\sigma^4} - \frac{1}{\sigma^2}\right),

so at x = \mu,

f''(\mu) = -\frac{f(\mu)}{\sigma^2} < 0.

Therefore f is maximum at x = \mu.

Page 16:

Normal Distribution

Set the second derivative equal to zero:

f''(x) = f(x)\left(\frac{(x-\mu)^2}{\sigma^4} - \frac{1}{\sigma^2}\right) = 0
\;\Rightarrow\; (x-\mu)^2 = \sigma^2
\;\Rightarrow\; x = \mu \pm \sigma.

Therefore f has points of inflection at x = \mu \pm \sigma.
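The critical point and the inflection points can be checked symbolically; a minimal sketch, assuming sympy is available (the symbols x, mu, sigma are just the illustrative names used here):

import sympy as sp

x, mu = sp.symbols('x mu', real=True)
sigma = sp.symbols('sigma', positive=True)
f = 1/(sigma*sp.sqrt(2*sp.pi)) * sp.exp(-(x - mu)**2/(2*sigma**2))

critical = sp.solve(sp.diff(f, x), x)        # critical point: [mu]
inflection = sp.solve(sp.diff(f, x, 2), x)   # inflection points: [mu - sigma, mu + sigma]
print(critical, inflection)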

Page 17:

Logistic Distribution

The distribution function is defined as

F(x) = \frac{e^{x}}{1 + e^{x}}, \qquad x \in \mathbb{R}.

(Figure: the curve is convex to the left of 0 and concave to the right.)

Page 18:

Logistic Distribution

Take the first derivative with respect to x:

F'(x) = \frac{e^{x}}{(1 + e^{x})^{2}} > 0, \qquad x \in \mathbb{R},

therefore F is strictly increasing.

Take the second derivative and set it equal to zero:

F''(x) = \frac{e^{x}(1 - e^{x})}{(1 + e^{x})^{3}} = 0
\;\Rightarrow\; 1 - e^{x} = 0
\;\Rightarrow\; x = 0.

Therefore F has a point of inflection at x = 0.

Page 19:

Logistic Distribution

Since

F''(x) > 0 \text{ for } x \in (-\infty, 0) \quad\text{and}\quad F''(x) < 0 \text{ for } x \in (0, \infty),

F is convex on (−∞, 0) and concave on (0, ∞). We also note that F has no maximum and no minimum.

Page 20:

Variance of a Function of a Poisson Variate Using Taylor's Theorem

We know that for Y ~ Poisson(λ), the mean E(Y) = λ and the variance V(Y) = λ.

We are interested in finding the variance of g(Y) = √Y:

V(\sqrt{Y}) = ?

Given that g(Y) = Y^{1/2},

g'(Y) = \frac{1}{2} Y^{-1/2}, \qquad g'(\lambda) = \frac{1}{2\sqrt{\lambda}}, \qquad \left(g'(\lambda)\right)^{2} = \frac{1}{4\lambda}.

Page 21:

Variance of a Function of a Poisson Variate Using Taylor's Theorem

The Taylor series expansion about λ gives

g(Y) = g(\lambda) + g'(\lambda)(Y - \lambda) + o(Y - \lambda),

so that

V\big(g(Y)\big) \approx \left(g'(\lambda)\right)^{2} V(Y) = \frac{1}{4\lambda}\cdot\lambda = \frac{1}{4}.

Therefore the approximate variance of √Y is 1/4, free of λ: the square root acts as a variance-stabilizing transformation for the Poisson.
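A minimal numerical check of this approximation, assuming numpy is available (lam = 25 is an arbitrary illustrative rate):

import numpy as np

rng = np.random.default_rng(0)
lam = 25.0                           # illustrative Poisson rate
y = rng.poisson(lam, size=1_000_000)

print(np.var(np.sqrt(y)))            # close to 0.25 for moderately large lam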

Page 22:

Risk Functional

Risk functional:

R_{L,P}(g) = \int_{X \times Y} L\big(x, y, g(x)\big)\, dP(x, y)
           = \int_{X}\int_{Y} L\big(x, y, g(x)\big)\, dP(y \mid x)\, dP_X(x).

Population regression functional / classifier g*:

R_{L,P}(g^{*}) = \inf_{g : X \to Y} R_{L,P}(g).

From a sample D we will select g_D by a learning method (???).

P is chosen by nature, L is chosen by the scientist. Both R_{L,P}(g*) and g* are unknown.

Page 23:

Problems of Empirical Risk Minimization

Empirical risk minimization replaces P by the empirical measure P_n of the sample. The empirical risk functional is

R_{L,P_n}(g) = \int_{X \times Y} L\big(x, y, g(x)\big)\, dP_n(x, y)
             = \frac{1}{n}\sum_{i=1}^{n} L\big(x_i, y_i, g(x_i)\big).

Page 24:

What Can We Do?

• We can restrict the set of functions over which we minimize the empirical risk functional (structural risk minimization).
• We can modify the criterion to be minimized, e.g. by adding a penalty for "complicated" functions (regularization).
• We can combine the two.

Page 25:

Regularized Error Function

In linear regression, we minimize the error function

\frac{1}{2}\sum_{i=1}^{l}\big(f(x_i) - y_i\big)^2 + \frac{\lambda}{2}\|w\|^{2}.

Replace the quadratic error function by the ε-insensitive error function:

C\sum_{i=1}^{l} E_{\epsilon}\big(f(x_i) - y_i\big) + \frac{1}{2}\|w\|^{2}.

An example of an ε-insensitive error function:

E_{\epsilon}(z) = \begin{cases} 0, & |z| < \epsilon \\ |z| - \epsilon, & \text{otherwise.} \end{cases}
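A minimal sketch of this regularized objective for a linear f, assuming numpy (epsilon, C and the toy arrays are illustrative values, not from the slides):

import numpy as np

def eps_insensitive(residual, eps=0.1):
    # E_eps(z) = 0 if |z| < eps, else |z| - eps
    return np.maximum(np.abs(residual) - eps, 0.0)

def regularized_error(w, b, x, y, C=1.0, eps=0.1):
    # C * sum_i E_eps(f(x_i) - y_i) + 0.5 * ||w||^2, with f(x) = <w, x> + b
    residual = (x @ w + b) - y
    return C * np.sum(eps_insensitive(residual, eps)) + 0.5 * np.dot(w, w)

x = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.05, 1.1, 1.95])
print(regularized_error(np.array([1.0]), 0.0, x, y))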

Page 26:

Linear SVR: Derivation

(Figure: meaning of equation 3.)

Page 27:

Linear SVR: Derivation

Complexity vs. sum of errors

Case I: "tube" complexity
Case II: "tube" complexity

Page 28:

Linear SVR: Derivation

Case I: "tube" complexity
Case II: "tube" complexity

• The role of C
(Figure: the fitted tube when C is small vs. when C is big.)

Page 29:

Linear SVR: Derivation

Minimize:

\frac{1}{2}\|w\|^{2} + C\sum_{n=1}^{l}(\xi_n + \xi_n^{*})

Subject to:

y_n - \langle w, x_n\rangle - b \le \epsilon + \xi_n, \qquad
\langle w, x_n\rangle + b - y_n \le \epsilon + \xi_n^{*}, \qquad
\xi_n, \xi_n^{*} \ge 0.

Page 30:

Lagrangian

Minimize:

L = \frac{1}{2}\|w\|^{2} + C\sum_{n=1}^{l}(\xi_n + \xi_n^{*})
    - \sum_{n=1}^{l}(\mu_n \xi_n + \mu_n^{*}\xi_n^{*})
    - \sum_{n=1}^{l}\alpha_n\big(\epsilon + \xi_n - y_n + \langle w, x_n\rangle + b\big)
    - \sum_{n=1}^{l}\alpha_n^{*}\big(\epsilon + \xi_n^{*} + y_n - \langle w, x_n\rangle - b\big)

Setting the partial derivatives to zero:

\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{n=1}^{l}(\alpha_n - \alpha_n^{*})\,x_n

\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{n=1}^{l}(\alpha_n - \alpha_n^{*}) = 0

\frac{\partial L}{\partial \xi_n} = 0 \;\Rightarrow\; \alpha_n + \mu_n = C

\frac{\partial L}{\partial \xi_n^{*}} = 0 \;\Rightarrow\; \alpha_n^{*} + \mu_n^{*} = C

so that

f(x) = \langle w, x\rangle = \sum_{n=1}^{l}(\alpha_n - \alpha_n^{*})\langle x_n, x\rangle.

Dual variables: \alpha_n, \alpha_n^{*}, \mu_n, \mu_n^{*} \ge 0.

Page 31:

Dual Form of Lagrangian

Maximize:

W(\alpha, \alpha^{*}) = -\frac{1}{2}\sum_{n=1}^{l}\sum_{m=1}^{l}(\alpha_n - \alpha_n^{*})(\alpha_m - \alpha_m^{*})\langle x_n, x_m\rangle
 - \epsilon\sum_{n=1}^{l}(\alpha_n + \alpha_n^{*}) + \sum_{n=1}^{l} y_n(\alpha_n - \alpha_n^{*})

subject to

0 \le \alpha_n \le C, \qquad 0 \le \alpha_n^{*} \le C, \qquad \sum_{n=1}^{l}(\alpha_n - \alpha_n^{*}) = 0.

Prediction can be made using:

f(x) = \sum_{n=1}^{l}(\alpha_n - \alpha_n^{*})\langle x_n, x\rangle + b.

Page 32:

How to determine b?

The Karush–Kuhn–Tucker (KKT) conditions imply, at the optimal solution:

\alpha_n\big(\epsilon + \xi_n - y_n + \langle w, x_n\rangle + b\big) = 0

\alpha_n^{*}\big(\epsilon + \xi_n^{*} + y_n - \langle w, x_n\rangle - b\big) = 0

(C - \alpha_n)\,\xi_n = 0

(C - \alpha_n^{*})\,\xi_n^{*} = 0

Support vectors are points that lie on the boundary of or outside the tube. These equations imply many important things.

Page 33:

Important Interpretations

• \alpha_i\,\alpha_i^{*} = 0, i.e. \alpha_i and \alpha_i^{*} cannot both be nonzero (why??).

• If \alpha_n \in (0, C), then \xi_n = 0 and y_n - \langle w, x_n\rangle - b = \epsilon; similarly, if \alpha_n^{*} \in (0, C), then \xi_n^{*} = 0 and \langle w, x_n\rangle + b - y_n = \epsilon. Such points lie exactly on the tube boundary, and b can be computed from them.

• If \alpha_i = 0 and \alpha_i^{*} = 0, the point lies strictly inside the tube.

Page 34:

Support Vectors: The Sparsity of the SV Expansion

From the KKT conditions,

\alpha_i\big(\epsilon - y_i + f(x_i)\big) = 0 \quad\text{and}\quad \alpha_i^{*}\big(\epsilon - f(x_i) + y_i\big) = 0,

so whenever the point lies strictly inside the tube,

\epsilon - y_i + f(x_i) > 0 \;\Rightarrow\; \alpha_i = 0, \qquad
\epsilon - f(x_i) + y_i > 0 \;\Rightarrow\; \alpha_i^{*} = 0.

Only the support vectors, the points on or outside the tube, appear in the expansion of w.

Page 35:

Dual Form of Lagrangian (Nonlinear Case)

Maximize:

W(\alpha, \alpha^{*}) = -\frac{1}{2}\sum_{n=1}^{l}\sum_{m=1}^{l}(\alpha_n - \alpha_n^{*})(\alpha_m - \alpha_m^{*})\,k(x_n, x_m)
 - \epsilon\sum_{n=1}^{l}(\alpha_n + \alpha_n^{*}) + \sum_{n=1}^{l} y_n(\alpha_n - \alpha_n^{*})

subject to

0 \le \alpha_n \le C, \qquad 0 \le \alpha_n^{*} \le C, \qquad \sum_{n=1}^{l}(\alpha_n - \alpha_n^{*}) = 0.

Prediction can be made using:

f(x) = \sum_{n=1}^{l}(\alpha_n - \alpha_n^{*})\,k(x_n, x) + b.
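In practice this dual problem is solved by standard software; a minimal sketch assuming scikit-learn and numpy are available (the RBF kernel, C, epsilon and the toy sine data are illustrative choices, not from the slides):

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)
y = np.sin(x).ravel() + 0.1 * rng.standard_normal(200)

# Nonlinear SVR with kernel k(x_n, x), regularization C, tube width epsilon
model = SVR(kernel='rbf', C=1.0, epsilon=0.1).fit(x, y)

print(model.support_.shape)           # number of support vectors (sparsity of the expansion)
print(model.predict([[np.pi / 2]]))   # prediction of the form sum_n (a_n - a_n*) k(x_n, x) + b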

Page 36:

Non-linear SVR: Derivation

Subject to:

Page 37:

Non-linear SVR: Derivation

Subject to:

A saddle point of L has to be found: minimize with respect to the primal variables (w, b, ξ, ξ*), maximize with respect to the dual variables (α, α*, μ, μ*).

Page 38:

Non-linear SVR: Derivation

...

Page 39:

What is Differentiation?

U, a Banach space; V, another Banach space; f : U → V, a nonlinear function.

Differentiation is nothing but local linearization: in differentiation we approximate a nonlinear function locally by a (continuous) linear function.

Page 40:

Fréchet Derivative

Definition 1. For f : R → R, the derivative f'(x) at x is the number satisfying

\lim_{h \to 0} \frac{f(x+h) - f(x) - f'(x)h}{h} = 0,
\qquad\text{i.e.}\qquad
\lim_{h \to 0} \frac{|f(x+h) - f(x) - f'(x)h|}{|h|} = 0.

It can be easily generalized to Banach-space-valued functions f : B_1 \to B_2 (with norms \|\cdot\|_1 and \|\cdot\|_2):

\lim_{\|h\|_1 \to 0} \frac{\|f(x+h) - f(x) - df(x)(h)\|_2}{\|h\|_1} = 0.

Here df(x) : B_1 \to B_2 is a (continuous) linear map. It can be shown that a linear map between infinite-dimensional spaces is not necessarily continuous.

Page 41:

Fréchet Derivative

We have just mentioned that Fréchet recognized that Definition 1 could be easily generalized to normed spaces in the following way:

\lim_{\|h\|_1 \to 0} \frac{\|f(x+h) - f(x) - df(x)(h)\|_2}{\|h\|_1} = 0, \qquad (2)

where df(x) \in L(B_1, B_2), the set of all continuous linear functions from B_1 to B_2. If we write Rem(x+h) for the remainder of f at x + h,

\mathrm{Rem}(x+h) = f(x+h) - f(x) - df(x)(h).

Page 42:

S-Derivative

Then (2) becomes

\lim_{\|h\|_1 \to 0} \frac{\|\mathrm{Rem}(x+h)\|_2}{\|h\|_1} = 0. \qquad (3)

Soon the definition was generalized (S-differentiation) to general topological vector spaces in such a way that: (i) a particular case of the definition becomes equivalent to the previous definition when the domain of f is a normed space; (ii) the Gâteaux derivative remains the weakest derivative among all types of S-differentiation.

Page 43:

S-Derivatives

Definition 2. Let S be a collection of subsets of B_1 and let t \in R. Then f is S-differentiable at x with derivative df(x) \in L(B_1, B_2) if, for every A \in S,

\frac{\mathrm{Rem}(x + t h)}{t} \to 0 \quad\text{as } t \to 0, \text{ uniformly in } h \in A.

Definition 3.
• When S = all singletons of B_1, f is called Gâteaux differentiable, with Gâteaux derivative df(x).
• When S = all compact subsets of B_1, f is called Hadamard or compactly differentiable, with Hadamard or compact derivative df(x).
• When S = all bounded subsets of B_1, f is called Fréchet or boundedly differentiable, with Fréchet or bounded derivative df(x).

Page 44:

Equivalent Definitions of the Fréchet Derivative

(a) For each bounded set E \subset B_1,

\frac{R(x + t h)}{t} \to 0 \quad\text{as } t \to 0 \text{ in } R, \text{ uniformly in } h \in E.

(b) For each bounded sequence \{h_n\} \subset B_1 and each sequence \{t_n\} \subset R \setminus \{0\} with t_n \to 0,

\frac{R(x + t_n h_n)}{t_n} \to 0 \quad\text{as } n \to \infty.

Page 45:

(c) \frac{R(x+h)}{\|h\|_1} \to 0 \quad\text{as } h \to 0.

(d) \frac{R(x+th)}{t} \to 0 \quad\text{as } t \to 0, uniformly in h \in \{h \in B_1 : \|h\|_1 = 1\}.

(e) \frac{R(x+th)}{t} \to 0 \quad\text{as } t \to 0, uniformly in h \in \{h \in B_1 : \|h\|_1 \le 1\}.

Statisticians generally use this form or some slight modification of it.

Page 46:

Relations among the Usual Forms of the Definitions

The set of Gâteaux differentiable functions at x ⊇ the set of Hadamard differentiable functions at x ⊇ the set of Fréchet differentiable functions at x.

In applications, to find a Fréchet or Hadamard derivative we generally first try to determine the form of the derivative by deducing the Gâteaux derivative acting on h, df(x)(h), for a collection of directions h which span B_1. This reduces to computing the ordinary derivative (with respect to t \in R) of the mapping t \mapsto f(x + th) at t = 0, which is closely related to the influence function, one of the central concepts in robust statistics.

It can be easily shown that:
(i) When B_1 = R with the usual norm, the three derivatives coincide.
(ii) When B_1 is a finite-dimensional Banach space, the Fréchet and Hadamard derivatives are equal; the two coincide with the familiar total derivative.

Page 47:

Properties of the Fréchet Derivative

• Hadamard differentiability implies continuity, but Gâteaux differentiability does not.
• Hadamard differentiability satisfies the chain rule, but Gâteaux differentiability does not.
• Meaningful versions of the Mean Value Theorem, the Inverse Function Theorem, Taylor's Theorem and the Implicit Function Theorem have been proved for the Fréchet derivative.

Page 48:

Influence Function

IF(x; T, F) = \lim_{\epsilon \to 0} \frac{T\big((1-\epsilon)F + \epsilon\,\delta_x\big) - T(F)}{\epsilon}
            = \lim_{\epsilon \to 0} \frac{T\big(F + \epsilon(\delta_x - F)\big) - T(F)}{\epsilon}.

Page 49:

T(F) = \int x \, dF(x).

With respect to Lebesgue measure (F absolutely continuous with density f):

T(F) = \int x \, dF(x) = \int x f(x)\, dx.

With respect to counting measure (the empirical distribution F_n of X_1, \ldots, X_n \sim F):

T(F_n) = \int x \, dF_n(x) = \frac{1}{n}\sum_{i=1}^{n} x_i.

Page 50:

Mathematical Foundations of Robust Statistics

T(G) \approx T(F) + \int T'_F \, d(G - F)

d_1(F, G) < \delta \;\Rightarrow\; d_2\big(T(F), T(G)\big) < \epsilon

T(G_n) - T(F) \approx \int T'_F \, d(G_n - F)

Page 51:

Mathematical Foundations of Robust Statistics

Page 52:

Mathematical Foundations of Robust Statistics

Page 53:

Mathematical Foundations of Robust Statistics

Page 54:

Given a Measurable Space (Ω, F):

There exist many measures on F.

If Ω is the real line, the standard measure is "length". That is, the measure of each interval is its length. This is known as "Lebesgue measure".

The σ-algebra must contain intervals. The smallest σ-algebra that contains all open sets (and hence intervals) is called the "Borel" σ-algebra and is denoted B.

A course in real analysis will deal a lot with the measurable space (R, B).

Page 55:

Given a Measurable Space (Ω, F):

A measurable space combined with a measure is called a measure space. If we denote the measure by μ, we write the triple (Ω, F, μ).

Given a measure space (Ω, F, μ), if we decide instead to use a different measure, say ν, then we call this a "change of measure". (We should just call this using another measure!)

Let μ and ν be two measures on (Ω, F). Then

• ν is "absolutely continuous" with respect to μ (notation ν ≪ μ) if μ(A) = 0 ⟹ ν(A) = 0;
• ν and μ are "equivalent" if μ(A) = 0 ⟺ ν(A) = 0.

Page 56:

The Radon–Nikodym Theorem

If ν ≪ μ, then ν is actually the integral of a function with respect to μ:

\nu(A) = \int_A g \, d\mu \qquad \text{for all } A \in F.

g is known as the Radon–Nikodym derivative and is denoted

g = \frac{d\nu}{d\mu}.
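A familiar special case (a standard illustration, not from the slides): for a probability measure P on (R, B) with continuous density f, P ≪ λ (Lebesgue measure) and

P(A) = \int_A f(x)\, dx, \qquad \frac{dP}{d\lambda}(x) = f(x),

which is exactly the sense in which a pdf is the derivative of a probability measure (cf. Page 62).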

Page 57:

The Radon–Nikodym Theorem

If ν ≪ μ, then ν is actually the integral of a function with respect to μ.

Idea of proof: create the function through its superlevel sets.

Consider the set function (this is actually a signed measure)

(\nu - \alpha\mu)(A) = \nu(A) - \alpha\,\mu(A).

Choose \alpha and let A_\alpha be the largest set such that

\nu(A') - \alpha\,\mu(A') \ge 0 \quad \text{for all } A' \subseteq A_\alpha

(you must prove such an A_\alpha exists). Then A_\alpha is the \alpha-superlevel set of g.

Now, given the superlevel sets, we can construct the function by

g(\omega) = \sup\{\alpha \mid \omega \in A_\alpha\}.

Page 58:

The Riesz Representation Theorem

All continuous linear functionals on L^p are given by integration against a function g \in L^q with \frac{1}{p} + \frac{1}{q} = 1.

That is, let L : L^p \to \mathbb{R} be a continuous linear functional. Then

L(f) = \int f g \, d\mu.

Note, in L^2 this becomes

L(f) = \int f g \, d\mu = \langle f, g \rangle.

Page 59:

The Riesz Representation Theorem

All continuous linear functionals on L^p are given by integration against a function g \in L^q with \frac{1}{p} + \frac{1}{q} = 1.

What is the idea behind the proof?

Linearity allows you to break things into building blocks, operate on them, then add them all together.

What are the building blocks of measurable functions? Indicator functions! Of course!

Let's define a set function from indicator functions:

\nu(A) = L(1_A).

Page 60:

The Riesz Representation Theorem

All continuous linear functionals on L^p are given by integration against a function g \in L^q with \frac{1}{p} + \frac{1}{q} = 1.

\nu(A) = L(1_A), a set function.

How does L operate on simple functions?

L\Big(\sum_{i=1}^{n} a_i 1_{A_i}\Big) = \sum_{i=1}^{n} a_i L(1_{A_i}) = \sum_{i=1}^{n} a_i \nu(A_i).

This looks like an integral with ν the measure!

L(\varphi) = \int \varphi \, d\nu.

But it is not too hard to show that ν is a (signed) measure (countable additivity follows from continuity). Furthermore, ν ≪ μ. Radon–Nikodym then says dν = g dμ.

Page 61:

The Riesz Representation Theorem

All continuous linear functionals on L^p are given by integration against a function g \in L^q with \frac{1}{p} + \frac{1}{q} = 1.

For simple functions,

L\Big(\sum_{i=1}^{n} a_i 1_{A_i}\Big) = \sum_{i=1}^{n} a_i L(1_{A_i}) = \sum_{i=1}^{n} a_i \nu(A_i) = \int \varphi \, g \, d\mu.

For measurable functions it follows from limits and continuity:

L(f) = \int f g \, d\mu.

The details are left as an "easy" exercise for the reader...

Page 62:

A random variable is a measurable function X(ω).

The expectation of a random variable is its integral:

E(X) = \int X \, dP.

A density function is the Radon–Nikodym derivative with respect to Lebesgue measure:

f_X = \frac{dP_X}{dx}, \qquad E(X) = \int X \, dP = \int x f_X(x)\, dx.

A probability measure P is a measure that satisfies P(Ω) = 1; that is, the measure of the whole space is 1.

Page 63:

In finance we will talk about expectations with respect to different measures,

E_P(X) = \int X \, dP, \qquad E_Q(X) = \int X \, dQ,

and write expectations in terms of the different measures:

E_Q(X) = \int X \, dQ = \int X \frac{dQ}{dP}\, dP = E_P\Big(X \frac{dQ}{dP}\Big),

where \frac{dQ}{dP} is the Radon–Nikodym derivative of Q with respect to P.