random variable distribution. 200 trials where i flipped the coin 50 times and counted heads...

Random variable Distribution

Upload: tiffany-newman

Post on 17-Dec-2015


Page 1: Random variable Distribution. 200 trials where I flipped the coin 50 times and counted heads no_of_heads in a trial

Random variable

Distribution

Page 2:

200 trials where I flipped the coin 50 times and counted heads

[Figure: histogram (dataset hazenikorunou, 52v*200c) of the number of heads per trial. x-axis: no_of_heads in a trial (13 to 32); y-axis: number of observations (0 to 24). Fitted curve: 200 × 1 × normal(x; 24.26, 3.8335).]
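A minimal sketch of how such data could be generated, assuming a fair coin and Python's standard pseudo-random generator. The function name `simulate_heads` and the fixed seed are illustrative assumptions, not part of the original experiment.

```python
import random
from collections import Counter

def simulate_heads(n_trials=200, n_flips=50, seed=0):
    """Return the number of heads in each of n_trials trials of n_flips flips."""
    rng = random.Random(seed)  # fixed seed: an assumption, only for reproducibility
    return [sum(rng.random() < 0.5 for _ in range(n_flips)) for _ in range(n_trials)]

heads = simulate_heads()
counts = Counter(heads)  # absolute frequency of each number of heads

# Print the histogram as a table: absolute and percentage frequencies.
for k in sorted(counts):
    print(f"{k:2d} heads: {counts[k]:3d} trials ({100 * counts[k] / len(heads):.1f}%)")
```

The same counts serve for both the absolute and the percentage histogram; only the scale of the vertical axis changes.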

Page 3:

or I can describe frequencies in percentages

[Figure: the same histogram with the y-axis as percent of observations (0% to 12%). x-axis: no_of_ones (13 to 32). Fitted curve: 200 × 1 × normal(x; 24.26, 3.8335).]

Page 4:

or as a cumulative histogram

[Figure: cumulative histogram (dataset hazenikorunou, 52v*200c). x-axis: no_of_ones (13 to 32); y-axis: number of observations (0 to 220). Fitted curve: 200 × iNormal(x; 24.26, 3.8335).]

Page 5:

which can also be on a percentage scale

[Figure: cumulative histogram with the y-axis as percent of observations (0% to 100%). x-axis: no_of_ones (13 to 32). Fitted curve: 200 × iNormal(x; 24.26, 3.8335).]

Page 6:

When my population is infinite

• Then the number of elements is infinite, but I can characterize the distribution by the proportion of all observations that fall in any given interval (i.e. the probability that a randomly chosen value will be in that interval).

• For a discrete variable: an enumeration of all values and their probabilities $p_i = P(X = x_i)$, given as a table or a formula.
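As an illustration of "a table or a formula", the sketch below tabulates the probabilities for the number of heads in 50 flips. It assumes a fair coin, so the formula is the binomial one; the name `binomial_pmf` is just a label chosen for this example.

```python
from math import comb

def binomial_pmf(k, n=50, p=0.5):
    """P(X = k) for the number of heads in n independent flips with P(head) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# The "table": every possible value x_i together with its probability p_i.
table = {k: binomial_pmf(k) for k in range(51)}

for k in range(20, 31):  # print only the central part of the table
    print(f"P(X = {k}) = {table[k]:.4f}")
```

The probabilities over all 51 possible values sum to 1, which is what makes the table a complete description of the variable.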

Page 7:

[Figure, two panels. Probability density: x from 0 to 24, y from 0.00 to 0.16. Distribution function: x from 0 to 24, y from 0.0 to 1.0.]

A continuous variable is characterized by its distribution function and its probability density.

Page 8:

The distribution function $F(x) = P(X < x)$ has these basic properties:

1. $P(a \le X < b) = F(b) - F(a)$;

2. $F(x_1) \le F(x_2)$ for $x_1 < x_2$;

3. $\lim_{x \to \infty} F(x) = 1$;

4. $\lim_{x \to -\infty} F(x) = 0$.

It is actually an idealized cumulative histogram with infinitely narrow columns.

Page 9:

How to "idealize" an ordinary (non-cumulative) histogram

If my columns are infinitely narrow, there will be "nothing" in them; therefore the percentage of observations in the interval is divided by the "width" of the column. In the limit I get the probability density:

$$f(x) = \lim_{\Delta x \to 0} \frac{P(x < X < x + \Delta x)}{\Delta x}$$

Page 10:

For a probability distribution the following holds:

Page 11:

From the distribution, the mean and the variance can be computed

Discrete variable

$$\mu = \sum_{i=1}^{n} x_i p_i, \qquad \sigma^2 = \sum_{i=1}^{n} (x_i - \mu)^2 p_i$$

Continuous variable

$$\mu = \int_{-\infty}^{\infty} x f(x)\,dx, \qquad \sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$$
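The discrete formulas can be tried on the coin experiment. The sketch below assumes a fair coin and the binomial probabilities for 50 flips; the variable names are chosen for this example only.

```python
from math import comb, sqrt

N_FLIPS = 50
P_HEADS = 0.5  # assumption: the coin is fair

# p_i for every possible number of heads x_i = 0, 1, ..., 50
pmf = [comb(N_FLIPS, k) * P_HEADS**k * (1 - P_HEADS) ** (N_FLIPS - k)
       for k in range(N_FLIPS + 1)]

mean = sum(k * p for k, p in enumerate(pmf))                    # sum of x_i * p_i
variance = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))  # sum of (x_i - mean)^2 * p_i

print(f"mean = {mean:.2f}, variance = {variance:.2f}, sd = {sqrt(variance):.2f}")
```

Under these assumptions the mean is 25 and the variance 12.5 (standard deviation about 3.54), which can be compared with the values 24.26 and 3.8335 of the normal curve fitted to the 200 simulated trials.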

Page 12:

Quartile

When this area is 0.75 (75%), then 12.54 is the 75% quantile of the distribution (i.e. the upper quartile).

[Figure: probability distribution]

Page 13:

Hypothesis testing

+ $\chi^2$ test

Page 14:

I cannot prove any hypothesis

• That's why I formulate a null hypothesis (H0), and by rejecting it I prove its opposite.

• The alternative hypothesis (H1 or HA) is the negation of the null hypothesis.

• I, the biologist, am the one who formulates the null hypothesis; that's why the null hypothesis should be constructed so that its rejection is interesting.

Page 15:

Errors in decision

• When the data are random (and in biology they practically always are), I have to take into account that I can make a wrong decision. Statistics recognizes the Type I error and the Type II error, which are an unavoidable part of our decisions.

• In addition, we can make an error through a mistake in computation, but this one isn't necessary :-) .

Page 16:

Recipe for hypothesis testing

• 1. I formulate the null hypothesis.

• 2. I choose the level of significance and from it obtain the critical value (from tables).

• 3. I compute the test statistic from my data.

• 4. When the value of the test statistic is higher than the critical value, I reject the null hypothesis.

Page 17:

$\chi^2$ test (goodness-of-fit test)

• Example: I hybridize peas. I expect

F1:

F2:

I have 80 offspring. I expect 60:20, I have 70:10.

Is it just random variability, or do Mendel's ratios not hold in this case?

Page 18:

• 1. Rejecting the null hypothesis of a 3:1 ratio is interesting from the biological point of view. I could statistically test a null hypothesis of a 4.2371:1 ratio, but rejecting it would not bring any biologically interesting information.

• 2. Formally, the null hypothesis is: the probability that the dominant phenotype is manifested is 0.75 (in an infinitely large population of potential offspring, the ratio of phenotypes is 3:1).

Page 19:

Calculation

$$\chi^2 = \frac{(70 - 60)^2}{60} + \frac{(10 - 20)^2}{20} = 6.67$$

DF = 1 (number of categories − 1 for an a priori hypothesis), critical value = 3.84

The value of the test statistic > critical value, so I reject the null hypothesis. I say that the ratio in F2 is statistically significantly different from the expected 3:1 at $\alpha = 0.05$, or I write ($\chi^2 = 6.67$, df = 1, P < 0.05).

In general:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O - E)^2}{E} = \sum_{i=1}^{k} \frac{(f_i - \hat{f}_i)^2}{\hat{f}_i}$$

$f$: absolute frequency, i.e. the number of random independent observations; $\hat{f}$: the expected frequency.
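The pea calculation can be reproduced in a few lines. This is only a sketch: the function name `chi_square` is an illustrative choice, and the critical value 3.84 is taken from the text rather than computed.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [70, 10]                # dominant : recessive phenotype in F2
n = sum(observed)                  # 80 offspring
expected = [n * 0.75, n * 0.25]    # 3:1 ratio under the null hypothesis, i.e. 60:20

CRITICAL_5PCT_DF1 = 3.84           # critical value for alpha = 0.05, DF = 1 (from the text)

stat = chi_square(observed, expected)
reject = stat > CRITICAL_5PCT_DF1
print(f"chi-square = {stat:.2f}, reject H0: {reject}")
```

The expected frequencies are derived from the total of 80, so the same function works unchanged for other ratios such as 1:1 or 9:3:3:1, given the matching critical value.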

Page 20:

What can happen: flipping the coin

Reality: the coin is OK, i.e. P0 = P1 = 0.5 (BUT WE DON'T KNOW THIS). 100 flips, I get 55:45.

Then $\chi^2 = (55-50)^2/50 + (45-50)^2/50 = 1.0 < 3.84$.

I cannot reject the null hypothesis.

Right decision.

Page 21:

What can happen: flipping the coin

Reality: the coin is OK, i.e. P0 = P1 = 0.5 (BUT WE DON'T KNOW THIS). 100 flips, I get 60:40.

Then $\chi^2 = (60-50)^2/50 + (40-50)^2/50 = 4.0 > 3.84$. I reject the null hypothesis at the 5% level of significance. I have made a Type I error (and I hang an innocent). We know the probability of this error: it is $\alpha$. The level of significance is the probability of rejecting the null hypothesis given that it is true.
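As a rough check of that probability, the sketch below enumerates every possible outcome of 100 flips of a fair coin and adds up the probabilities of those that would lead to rejection at the critical value 3.84. The function names are illustrative assumptions.

```python
from math import comb

def chi_square_coin(heads, n):
    """Chi-square statistic for `heads` heads in n flips against a 1:1 expectation."""
    expected = n / 2
    tails = n - heads
    return (heads - expected) ** 2 / expected + (tails - expected) ** 2 / expected

def exact_type_one_error(n=100, critical=3.84):
    """Probability that a fair coin gives an outcome with chi-square above `critical`."""
    rejecting = sum(comb(n, h) for h in range(n + 1) if chi_square_coin(h, n) > critical)
    return rejecting / 2**n

alpha_actual = exact_type_one_error()
print(f"Actual Type I error rate for 100 flips: {alpha_actual:.4f}")
```

Because the counts are discrete while the chi-square distribution is continuous, the actual rate is close to the nominal 5% but not exactly equal to it; for 100 flips it comes out slightly above.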

Page 22:

What can happen: flipping the coin

Reality: the coin is biased, i.e. P0 ≠ P1 (BUT WE DON'T KNOW THIS)

100 flips, I get 60:40

Then $\chi^2 = (60-50)^2/50 + (40-50)^2/50 = 4.0 > 3.84$. I reject the null hypothesis at the 5% level of significance. Right decision (and I hang the blackguard).

Page 23:

What can happen: flipping the coin

Reality: the coin is biased, i.e. P0 ≠ P1 (BUT WE DON'T KNOW THIS)

100 flips, I get 55:45

Then $\chi^2 = (55-50)^2/50 + (45-50)^2/50 = 1.0 < 3.84$. I cannot reject the null hypothesis (and the blackguard goes free). I have committed a Type II error. Its probability is denoted $\beta$ and it is mostly unknown. $1 - \beta$ is the power of the test. Generally, the power of the test increases with the deviation from the null hypothesis and with the number of observations. As we don't know $\beta$, the right formulation of our outcome is: "Based on our data we cannot reject the null hypothesis." The formulation "We have proved the null hypothesis" is wrong!

Page 24:

Decision table

                                   Reality: H0 is true    Reality: H0 is false
Our decision: H0 is rejected       Type I error           Right decision
Our decision: H0 isn't rejected    Right decision         Type II error

For a given number of observations, the better we are protected against one type of error, the more prone the outcome is to the other.

If I decide to test at the 1% level of significance, the critical value is then 6.63.

Page 25:

What can happen: flipping the coin

Reality: the coin is OK, i.e. P0 = P1 = 0.5 (BUT WE DON'T KNOW THIS). 100 flips, I get 60:40.

Then $\chi^2 = (60-50)^2/50 + (40-50)^2/50 = 4.0 < 6.63$. I don't reject the null hypothesis at the 1% level of significance. OK, I didn't hang an innocent.

Page 26:

What can happen: flipping the coin

Reality: the coin is biased, i.e. P0 ≠ P1 (BUT WE DON'T KNOW THIS)

100 flips, I get 60:40

Then $\chi^2 = (60-50)^2/50 + (40-50)^2/50 = 4.0 < 6.63$. I don't reject the null hypothesis at the 1% level of significance. Type II error (the blackguard goes free).

Page 27:

After 20 flips of the coin

heads  tails  chi-square
0      20     20
1      19     16.2
2      18     12.8
3      17     9.8
4      16     7.2
5      15     5
6      14     3.2
7      13     1.8
8      12     0.8
9      11     0.2
10     10     0
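The table can be regenerated with a short loop. This sketch assumes the null hypothesis of a 1:1 ratio, so the expected count is 10 in each category; the function name is an illustrative choice.

```python
def chi_square_20_flips(heads, n=20):
    """Chi-square statistic for `heads` heads out of n flips against a 1:1 expectation."""
    expected = n / 2
    tails = n - heads
    return (heads - expected) ** 2 / expected + (tails - expected) ** 2 / expected

rows = [(h, 20 - h, round(chi_square_20_flips(h), 1)) for h in range(11)]

print("heads  tails  chi-square")
for heads, tails, stat in rows:
    print(f"{heads:<6d} {tails:<6d} {stat:g}")
```

Only outcomes from 0 to 10 heads are listed because the statistic is symmetric: 15 heads and 5 tails give the same value as 5 heads and 15 tails.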

Page 28:

Power of the test

Reality: the coin is biased, i.e. P0 = 0.55; P1 = 0.45 (BUT WE DON'T KNOW THIS). Suppose the outcome follows the probabilities exactly.

100 flips, I get 55:45

Then $\chi^2 = (55-50)^2/50 + (45-50)^2/50 = 1.0 < 3.84$. I don't reject: Type II error.

1000 flips, I get 550:450

Then $\chi^2 = (550-500)^2/500 + (450-500)^2/500 = 10.0 > 3.84$. I reject, and that is right.

Reality: the coin is biased, i.e. P0 = 0.51; P1 = 0.49

100 flips, I get 51:49

Then $\chi^2 = (51-50)^2/50 + (49-50)^2/50 = 0.04 < 3.84$. I don't reject: Type II error.

1000 flips, I get 510:490

Then $\chi^2 = (510-500)^2/500 + (490-500)^2/500 = 0.4 < 3.84$. I don't reject: Type II error.

10000 flips, I get 5100:4900

Then $\chi^2 = (5100-5000)^2/5000 + (4900-5000)^2/5000 = 4 > 3.84$. I reject, and that is right.
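The five cases can be recomputed together, which makes the pattern visible: for a fixed proportion of heads, the statistic grows in direct proportion to the number of flips. This is a sketch with an illustrative function name; the critical value 3.84 is the one used in the text.

```python
def chi_square_fair_coin(heads, tails):
    """Chi-square statistic for the observed counts against a 1:1 expectation."""
    expected = (heads + tails) / 2
    return ((heads - expected) ** 2 + (tails - expected) ** 2) / expected

CRITICAL_5PCT_DF1 = 3.84  # from the text

cases = [(55, 45), (550, 450), (51, 49), (510, 490), (5100, 4900)]
results = [chi_square_fair_coin(h, t) for h, t in cases]

for (h, t), stat in zip(cases, results):
    verdict = "reject H0" if stat > CRITICAL_5PCT_DF1 else "do not reject H0"
    print(f"{h}:{t}  chi-square = {stat:g}  ->  {verdict}")
```

Multiplying both counts by 10 multiplies the statistic by 10, so a deviation as small as 51:49 eventually becomes detectable if the sample is large enough.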

Page 29:

The power of the test grows

• With the number of independent observations

• With the magnitude of the deviation from the null hypothesis

• With lower protection against a Type I error

Page 30:

[Figure: percentage of heads in a sample sufficient to reject the null hypothesis P1 = P2 = 0.5 by the $\chi^2$ test, as a function of the total number of observations. x-axis: N (0 to 100); y-axis: percentage (0 to 1). Regions labelled P > 0.05, 0.01 < P < 0.05, and P < 0.01 (twice).]

Page 31:

Examples of use

• Phenotype ratio

• 3:1

• 9:3:3:1 (number of degrees of freedom = number of categories − 1 for an a priori hypothesis, i.e. DF = 3)

Page 32:

Examples of use

• Sex ratio

• 1:1

• Assumptions!

• Random sampling!

• The same probability

In practice, rejection of the null hypothesis can be a sign of one of three things:

1. The null hypothesis is wrong.

2. The null hypothesis is right, but the decision is a consequence of a Type I error.

3. The null hypothesis is right, but the assumptions of the test were violated.

Page 33:

Examples of use

• Bees' orientation according to the disk colour

• H0: 1:1:1

• How to ensure independence?

• Reasonable sample size

Page 34:

Examples of use

• Hardy-Weinberg equilibrium

• $p^2 + 2pq + q^2$

• Attention: we subtract one more degree of freedom for a parameter that we estimate from the data, so DF = 3 − 1 − 1 = 1

Page 35:

What are critical values?

$$\chi^2 = \sum_{i=1}^{k} \frac{(O - E)^2}{E} = \sum_{i=1}^{k} \frac{(f_i - \hat{f}_i)^2}{\hat{f}_i}$$

The greater the deviation from the null hypothesis, the higher the chi-square.

Page 36:

What are critical values?

When this area is 5%, then 11.1 is the critical value at the 5% level of significance (here DF = 5).

Page 37:

We can use the opposite procedure as well; nowadays it is used more often. We have computed chi-square = 14.

The area of the "tail" is P = 0.014.

The "probability" P is the probability that a result this different or more different from the null hypothesis arises just by chance, if H0 is true.

Page 38:

We usually write

• the result is significant at $\alpha = 0.05$

• or we write ($\chi^2 = 6.67$, df = 1, P < 0.05)
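For one degree of freedom the tail area P can be computed without tables, because a chi-square variable with DF = 1 is the square of a standard normal variable. The sketch below uses that identity; it is valid only for DF = 1, and the function name `p_value_df1` is an illustrative assumption.

```python
from math import erfc, sqrt

def p_value_df1(chi_square):
    """Upper-tail area of the chi-square distribution with 1 degree of freedom.

    Uses P(Z^2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) for standard normal Z.
    """
    return erfc(sqrt(chi_square / 2))

p_critical = p_value_df1(3.84)   # the 5% critical value should give P close to 0.05
p_peas = p_value_df1(20 / 3)     # pea example: (70-60)^2/60 + (10-20)^2/20

print(f"P at the critical value 3.84: {p_critical:.4f}")
print(f"P for the pea example:        {p_peas:.4f}")
```

Reporting the exact P instead of only "P < 0.05" lets the reader see how far the result is from the chosen level of significance.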

Page 39:

And what if the $\chi^2$ value is close to zero?

P > 0.99

Can we take it as evidence that H0 is true?

Page 40:

TOO GOOD TO BE TRUE

Page 41:

The $\chi^2$ distribution is derived purely theoretically, but I simulated these values by flipping the coin.

Problem: chi-square is a continuous distribution, whereas frequencies are discrete by definition.

Page 42:

That's why Yates' correction (for continuity) is sometimes used

$$\chi^2 = \sum_{i=1}^{k} \frac{(|f_i - \hat{f}_i| - 0.5)^2}{\hat{f}_i}$$

But the test is then too conservative (i.e. the probability of a Type I error is usually smaller than $\alpha$, and so the power of the test is smaller too). The correction is not recommended when the expected frequencies are greater than 5, and it often isn't used even when a few of them are smaller.
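To see the size of the correction, the sketch below applies both formulas to the pea data used earlier (70:10 observed, 60:20 expected). The function names are illustrative, and the data are reused here only for comparison.

```python
def chi_square(observed, expected):
    """Uncorrected statistic: sum of (f - f_hat)^2 / f_hat."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_yates(observed, expected):
    """Statistic with Yates' continuity correction: sum of (|f - f_hat| - 0.5)^2 / f_hat."""
    return sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))

observed = [70, 10]
expected = [60, 20]

plain = chi_square(observed, expected)
corrected = chi_square_yates(observed, expected)
print(f"uncorrected = {plain:.2f}, with Yates' correction = {corrected:.2f}")
```

The corrected value is always the smaller of the two whenever each observed count differs from its expectation by more than 0.5, which is why the corrected test rejects less often.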