Biostat HW 09MS009


LS321 - Take Home Assignment

ARITRA KR. MUKHOPADHYAY (09MS009)
3RD YEAR

DEPARTMENT OF PHYSICAL SCIENCES (IISER-K)

April 13, 2012

Software used: OCTAVE and QTIPLOT. The program codes are all attached to the email as a .tar.gz file with the assignment. Only some relevant portions of the codes (not the entire programs) are reproduced here.

Answer 1:

A program is run in OCTAVE which draws 50 normally distributed random samples, each having 136 members, from a population with mean 170 cm and standard deviation 8.2 cm. The mean of each sample is computed and a histogram of the means is plotted using QTIPLOT. I have drawn two histograms, one with step size 0.4 cm and the other with 0.3 cm. The better fit comes from the one with step size 0.3 cm, with an adjusted $R^2$ of 0.944 (here $R^2 = 1 - \mathrm{SSE}/\mathrm{TSE}$, and the adjusted value additionally corrects for the number of fitted parameters). This is close to 1, indicating that the SSE, the sum of squares of errors, is very small compared to the TSE, the total sum of squares, and hence a good fit. The distribution of the sample means is in accordance with the Central Limit Theorem, with most of the samples having a mean quite close to the population mean of 170 cm (from the plot the mean comes to 169.79 cm). The source code is pasted here:

    x = zeros(136, 50);  % defining a 136*50 dimensional zero matrix.
    % Aim: each of the 50 columns will correspond to a sample and will have
    % 136 elements in it (so 136 rows), chosen randomly so that they are
    % consistent with the population parameters 'mean = 170 cm' and
    % 'standard deviation = 8.2 cm'. The function 'randn' generates random
    % numbers distributed normally (mean = 0, s.d. = 1), so I transform it to
    % "s.d.*randn + mean" to generate normally distributed samples with
    % mean = 170 and s.d. = 8.2.
    for i = 1:50
      x(:, i) = 8.2 * randn(136, 1) + 170;
    end
    m = mean(x);  % calculates the mean of each column, that is each sample,
                  % and returns an array of means having 50 elements.
    dlmwrite('m.dat', m, '\n');  % exports the array of means into a data file 'm.dat'

The graphs are shown below:

[Figure: Histogram & normal fit of the means (step size 0.4 cm). x-axis: Means (cm), 167 to 173; y-axis: Counts.]

Fit results: Mean = 169.7920166847662; Standard Deviation = 1.539216886766; Chi^2/doF = 3.119772650346584e+00; R^2 = 0.912529738775329; Adjusted R^2 = 0.842553529795593

[Figure: Histogram & normal fit of the means (step size 0.3 cm). x-axis: Means (cm), 167 to 173; y-axis: Counts.]

Fit results: Mean = 169.7949254594500; Standard Deviation = 1.45410560771153; Chi^2/doF = 6.264486427601768e-01; R^2 = 0.961668352278661; Adjusted R^2 = 0.94463206440251
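As a rough cross-check of the Central Limit Theorem statement in Answer 1, the spread of the simulated sample means can be compared with the predicted standard error $\sigma/\sqrt{n} = 8.2/\sqrt{136} \approx 0.70$ cm. The following is only a sketch and not part of the submitted programs; the variable names (`mu`, `sigma`, `se_theory`, `se_observed`) are my own, and the samples are drawn in one vectorised call instead of a loop.

```octave
% Sketch: compare the spread of the simulated sample means with the
% standard error predicted by the Central Limit Theorem.
mu    = 170;    % population mean (cm), from the text
sigma = 8.2;    % population standard deviation (cm), from the text
n     = 136;    % members per sample
k     = 50;     % number of samples

x = sigma * randn(n, k) + mu;   % one sample per column
m = mean(x);                    % 1-by-k row vector of sample means

se_theory   = sigma / sqrt(n);  % predicted standard deviation of the means
se_observed = std(m);           % spread actually seen in this run

printf('mean of sample means : %.3f cm\n', mean(m));
printf('std of sample means  : %.3f cm\n', se_observed);
printf('sigma / sqrt(n)      : %.3f cm\n', se_theory);
```

Because the samples are random, the observed spread changes from run to run; with only 50 sample means it should be expected to agree with the prediction only approximately.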

Answer 2:

First of all, the sample size of 7 is perhaps too small to comment on goodness of fit. The mean and variance of the sample come to 78.286 and 159.90, which are clearly not equal, whereas equality of mean and variance is a criterion for the Poisson distribution. So a priori the data do not seem to fit a Poisson distribution. However, I draw 10,000 random Poisson samples, each having 7 units, with the mean set to that of the given sample, 78.286. Since the sample mean and the sample variance are unbiased estimators of the population mean and the population variance, their averages over the simulated samples are calculated and come to 78.277 and 78.521 (almost equal, as they should be for Poisson).

But the variance of the sample variances is large, at 2102.9, which corresponds to a standard deviation of about 45.9. The variance of the given sample, 159.90, therefore lies within about two standard deviations of the expected value 78.5. With only 7 points such a value is not an extreme departure, so we can conclude that the data can be modelled by a Poisson distribution. If the number of sample points were large, one could have done a Poisson fit and obtained a chi-square goodness-of-fit result, which would have been more accurate. The code is pasted here:

    x = [87, 53, 72, 90, 78, 85, 83];  % given data
    M = mean(x);  % mean of the given data
    V = var(x);   % variance of the data
    y = zeros(7, 10000);
    for i = 1:10000
      y(:, i) = randp(M, 7, 1);  % generating 10,000 Poisson distributed samples
                                 % having 7 elements each, with the mean as the
                                 % mean of the given data M.
    end
    m = mean(y);  % array of mean of each of the samples
    v = var(y);   % array of variance of each of the samples
    em = sum(m) / 10000  % expectation of sample means
    ev = sum(v) / 10000  % expectation of sample variances
    vv = var(v)          % variance of sample variances
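The comparison between the observed variance and the simulated sample variances can be made more explicit. The sketch below is an illustration under my own assumptions, not the submitted program: it expresses the observed variance in units of the standard deviation of the simulated variances, and also reports the fraction of simulated samples whose variance is at least as large as the observed one. The names `z` and `tail_fraction` are hypothetical.

```octave
x = [87, 53, 72, 90, 78, 85, 83];   % given data, as in the code above
M = mean(x);                        % sample mean
V = var(x);                         % sample variance

nsim = 10000;                       % number of simulated samples, as in the text
y = randp(M, numel(x), nsim);       % each column is one Poisson sample of 7 values
v = var(y);                         % sample variance of every simulated sample

z = (V - mean(v)) / std(v);         % observed variance in units of std(v)
tail_fraction = mean(v >= V);       % share of simulated variances >= observed

printf('observed variance         : %.2f\n', V);
printf('mean simulated variance   : %.2f\n', mean(v));
printf('standardised distance (z) : %.2f\n', z);
printf('fraction of v >= observed : %.4f\n', tail_fraction);
```

A small tail fraction would argue against the Poisson model, a moderate one would not; with only 7 data points either reading should be treated with caution.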

Answer 3:

The two graphs showing the change of the 5% critical value of the t-distribution are attached here: one a normal plot and the other a log-log plot. The value decreases as the degrees of freedom increase. As can be seen from the plot, the value decreases rapidly between 1 and about 10 degrees of freedom, after which the slope flattens and the value ultimately saturates. The log-log plot is more informative, since it shows clearly both the fall of the function at small degrees of freedom and the fall at the largest values close to 100, whereas the normal plot only gives an idea of the steep fall at the beginning and the slow decrease at the end. The relevant portion of the code and the plots are attached here:

    x = 1:100;  % generate an array of degree-of-freedom values
    t = tinv(0.95, x);  % generate the 5% critical values for t statistics with
                        % the degrees of freedom in 'x'. In Octave, to get the
                        % y% critical value one needs to input 1 - y/100 as the
                        % argument for 'tinv', hence the argument 0.95.
    plot(x, t, 'r');  % plots the graph of the change of values with degree of freedom
    figure;           % new window, so that the first plot is not overwritten
    plot(log10(x), log10(t), 'r');  % creates a log-log plot of the same
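The two ends of this curve can be checked without any distribution functions. For one degree of freedom the t-distribution is the Cauchy distribution, whose quantile has a closed form, and for very many degrees of freedom it approaches the standard normal. The sketch below is my own illustration and uses only core functions (`tan`, `erfinv`); the names `t_df1` and `z_inf` are hypothetical. The two values come to about 6.31 and 1.645.

```octave
p = 0.95;                              % cumulative probability used in the text

t_df1 = tan(pi * (p - 0.5));           % df = 1: Cauchy quantile in closed form
z_inf = sqrt(2) * erfinv(2 * p - 1);   % df -> infinity: standard normal quantile

printf('critical value at df = 1   : %.4f\n', t_df1);
printf('limiting value (df -> inf) : %.4f\n', z_inf);
```

The plotted critical values should therefore start near 6.31 at the left edge and level off just above 1.645 on the right, which is the saturation described above.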


Answer 4:

A simple regression model is defined by

$$Y_i = a + bX_i + \varepsilon_i \tag{1}$$

where the $Y_i$ are the response values, the $X_i$ the predictor values, $a$ and $b$ the regression parameters and the $\varepsilon_i$ the errors. For least-squares estimation the target is to minimise the sum of the squares of these individual errors with respect to the regression parameters. That is,

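$$\frac{\partial}{\partial a}\sum_{i=1}^{n} e_i^2 = \frac{\partial}{\partial a}\sum_{i=1}^{n} (Y_i - a - bX_i)^2 = 0 \tag{2}$$

$$\frac{\partial}{\partial b}\sum_{i=1}^{n} e_i^2 = \frac{\partial}{\partial b}\sum_{i=1}^{n} (Y_i - a - bX_i)^2 = 0 \tag{3}$$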

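The first equation gives

$$\sum_{i=1}^{n} (Y_i - a - bX_i) = 0 \tag{4}$$

$$\Rightarrow a = \bar{Y} - b\bar{X} \tag{5}$$

and the second one yields the equation

$$\sum_{i=1}^{n} X_i (Y_i - a - bX_i) = 0 \tag{6}$$

$$\Rightarrow \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - a - bX_i) + \bar{X}\sum_{i=1}^{n} (Y_i - a - bX_i) = 0 \tag{7}$$

$$\Rightarrow \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y} + b\bar{X} - bX_i) = 0 \tag{8}$$

$$\Rightarrow \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) - \sum_{i=1}^{n} b (X_i - \bar{X})^2 = 0 \tag{9}$$

$$\Rightarrow b = \frac{\mathrm{Cov}(X, Y)}{V(X)} \tag{10}$$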

For the maximum likelihood estimation one assumes that the errors $\varepsilon_i$ are independently and identically normally distributed with mean 0 and variance $\sigma^2$. Hence the $Y_i$ are also normally distributed, with mean $a + bX_i$ and variance $\sigma^2$. Under this assumption the joint pdf of the $Y_i$ is

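$$f(Y_1, \ldots, Y_n \mid a, b, \sigma^2) = \prod_{i=1}^{n} f(Y_i \mid a, b, \sigma^2) \tag{11}$$

$$= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n} (Y_i - a - bX_i)^2\right) \tag{12}$$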

Taking the log of the likelihood function one has

$$\log L = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (Y_i - a - bX_i)^2 \tag{13}$$


The target is to maximise this with respect to the parameters $a$, $b$ and $\sigma^2$. Differentiating with respect to $a$ and $b$ and setting the result to zero gives us the equations:

$$\sum_{i=1}^{n} (Y_i - a - bX_i) = 0 \tag{14}$$

$$\sum_{i=1}^{n} X_i (Y_i - a - bX_i) = 0 \tag{15}$$

which are exactly the same equations we obtained in the case of the least-squares estimation of these parameters. Solving them therefore gives the same estimates as before. Hence the maximum likelihood estimates of the simple regression parameters are the same as their least-squares estimates. (Proved)
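The result can be checked numerically. The sketch below uses a small data set invented purely for illustration. It computes $a$ and $b$ from equations (5) and (10), compares them with the least-squares line returned by `polyfit`, and evaluates the log-likelihood (13) at the estimate and at two perturbed points. Fixing $\sigma^2 = 1$ is an assumption made only to keep the example short; it does not change where the maximum in $a$ and $b$ lies.

```octave
% Hypothetical data, chosen only for illustration.
X = [1 2 3 4 5];
Y = [2.1 3.9 6.2 7.8 10.1];
n = numel(X);

% Closed-form estimates from equations (5) and (10).
% The 1/(n-1) factors of Cov(X,Y) and V(X) cancel in the ratio.
b = sum((X - mean(X)) .* (Y - mean(Y))) / sum((X - mean(X)) .^ 2);
a = mean(Y) - b * mean(X);

% Independent check: least-squares line from polyfit.
p = polyfit(X, Y, 1);      % p(1) = slope, p(2) = intercept

% Log-likelihood (13) as a function of a and b, for a fixed sigma^2.
sigma2 = 1;
logL = @(a_, b_) -n/2 * log(2*pi) - n/2 * log(sigma2) ...
                 - sum((Y - a_ - b_ * X) .^ 2) / (2 * sigma2);

printf('closed form : a = %.4f, b = %.4f\n', a, b);
printf('polyfit     : a = %.4f, b = %.4f\n', p(2), p(1));
printf('logL at estimate   : %.4f\n', logL(a, b));
printf('logL at (a+0.1, b) : %.4f\n', logL(a + 0.1, b));
printf('logL at (a, b+0.1) : %.4f\n', logL(a, b + 0.1));
```

The two sets of coefficients agree, and the log-likelihood is lower at the perturbed points than at the least-squares estimate, as the derivation requires.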

Answer 5:

To answer this question I have used a two-step procedure: 1.] ANOVA test, 2.] t-test.

ANOVA: First, ANOVA was performed under the null hypothesis that the mean weights of the control plants, those under treatment 1 and those under treatment 2 are all equal. The alternate hypothesis was that one or more of the means differ. The test was carried out at the 5% level of significance (95% confidence) and the null was rejected; hence the means are not all the same, and the treatments DO HAVE SOME EFFECT on the plant weights.

t-tests: Secondly, it was to be decided what type of influence each treatment has. For this, two t-tests were performed to compare 1.] the means of the control and treatment 1 and 2.] the means of the control and treatment 2. Both tests were performed at the 5% level of significance (95% confidence). The lecture notes of Dr. Partha Sarathi Mazumdar were used for the relevant formulas.

RESULTS FOR t-TEST for treatment 1

[A] Null hypothesis: the mean weights of the control and treatment-1 groups are equal. Alternate hypothesis: the plants under treatment 1 have decreased mean weights. Null hypothesis accepted at 95% confidence limit.

[B] Null hypothesis: the mean weights of the control and treatment-1 groups are equal. Alternate hypothesis: the plants under treatment 1 have increased mean weights. Null hypothesis accepted at 95% confidence limit.

RESULTS FOR t-TEST for treatment 2

[A] Null hypothesis: the mean weights of the control and treatment-2 groups are equal. Alternate hypothesis: the plants under treatment 2 have decreased mean weights. Null hypothesis accepted at 95% confidence limit.

[B] Null hypothesis: the mean weights of the control and treatment-2 groups are equal. Alternate hypothesis: the plants under treatment 2 have increased mean weights. Null hypothesis rejected at 95% confidence limit.

CONCLUSION: Treatment 1 does not affect the mean weights of the plants, whereas treatment 2 seems to increase the mean weights of the plants. I am pasting the relevant part of the codes here, the rest being similar:

    load 'plant.dat';  % this has the data in three columns, treated as
                       % control, treatment 1 and treatment 2 respectively.
    x = plant;
    [p, f, dfb, dfw] = anova(x)
    % ANOVA is performed on the three groups. 'p' gives the 1-CDF value of
    % the F-distribution, 'f' gives the value of the observed F statistic,
    % 'dfb' the degrees of freedom corresponding to the between-mean
    % variance (SSB) and 'dfw' the degrees of freedom corresponding to the
    % within-group variance (SSW).
    if f > finv(0.95, dfb, dfw)
      disp('Null Hypothesis rejected at 95% confidence limit')
    else
      disp('Null Hypothesis accepted at 95% confidence limit')
    end
    % Defining a general function for the two t-tests. 't' gives the observed
    % t statistic value and 'df' the degrees of freedom of the t-distribution
    % that the test statistic follows. It is assumed that the group variances
    % are UNEQUAL, hence the Satterthwaite approximate 'df' value is computed
    % (formula from PSM's lecture slides).
    function [t, df] = ttest(y, z)
      ny = length(y);
      nz = length(z);
      my = mean(y);
      mz = mean(z);
      t = (my - mz) / sqrt(var(y)/ny + var(z)/nz);
      c = (var(y)/ny) / (var(y)/ny + var(z)/nz);
      dfn = (ny-1)*(nz-1) / ((nz-1)*c^2 + (ny-1)*(1-c)^2);
      df = round(dfn);
    end
    [t1, df1] = ttest(x(:,1), x(:,2))
    if t1 > tinv(0.95, df1)
      disp('Null Hypothesis rejected at 95% confidence limit')
    else
      disp('Null Hypothesis accepted at 95% confidence limit')
    end
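To make the quantities returned by the ANOVA step concrete, the sketch below computes SSB, SSW, their degrees of freedom and the F statistic by hand, using core functions only. The weights are invented for illustration and are not the plant data of the assignment; the layout (one column per group, equal group sizes) is an assumption.

```octave
% Hypothetical weights: one column per group (control, treatment 1, treatment 2).
g = [4.2 4.8 5.5;
     4.6 4.1 5.3;
     5.0 4.4 6.1;
     4.5 4.7 5.8];
[n, k] = size(g);                 % n observations per group, k groups

group_means = mean(g);            % 1-by-k row of group means
grand_mean  = mean(g(:));         % mean of all observations

SSB = n * sum((group_means - grand_mean) .^ 2);         % between-group sum of squares
SSW = sum(sum((g - repmat(group_means, n, 1)) .^ 2));   % within-group sum of squares
SST = sum((g(:) - grand_mean) .^ 2);                    % total sum of squares

dfb = k - 1;                      % between-group degrees of freedom
dfw = n * k - k;                  % within-group degrees of freedom
F = (SSB / dfb) / (SSW / dfw);    % observed F statistic

printf('SSB = %.4f (df %d), SSW = %.4f (df %d), F = %.4f\n', SSB, dfb, SSW, dfw, F);
```

The observed F would then be compared with the critical value `finv(0.95, dfb, dfw)` as in the listing above; depending on the Octave version, `finv` may need the statistics package to be loaded.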


Answer 6:

A linear regression model was fit to the data with the record time as the dependent variable and distance and climb as the independent ones (in this order), separately for males and females. The regression coefficients for the males were $1.6554\times10^{-1}$ and $4.4466\times10^{-5}$ respectively, and for the females $2.9002\times10^{-1}$ and $-1.1677\times10^{-4}$ respectively.

Observation: This shows that record times for males increase with increasing distance and climb (as the regression coefficients are positive), which agrees with our general notion. But for females the record times decrease with climb, although they increase with distance.

Transformation: The data were transformed and the record time was regressed on the logarithms of distance and climb so that the coefficients can be compared easily. The coefficients for the males turned out to be 0.79306 and 0.31727. For females they were 0.79124 and 0.31602. This is reasonable, and shows that times do increase with both distance and climb for both males and females.

Presence of outlier: A glance at the plots of the data shows that one of the values is quite far from the trend for both males and females. This is the 19th data point (46 7500 8.3069 13.5478; the columns are distance, climb, time for males and time for females). Also, the regression coefficient corresponding to 'climb' was negative for females, which is counterintuitive. So naturally one suspects that this point may be an outlier. I have repeated the same program with a modified data set without this data point.

Observation: The results were promising. The regression coefficients for the males were $9.4252\times10^{-2}$ and $1.7034\times10^{-4}$. For females they were $1.1061\times10^{-1}$ and $2.0217\times10^{-4}$. In the log-transformed case, the regression coefficients for the males were 0.75692 and 0.31848, and for females 0.72183 and 0.31874.

Final Conclusions: After the outlier elimination it is found that the record times for both males and females increase with distance and climb (on account of the positive regression coefficients, hence positive slope). The $R^2$ value and the F statistic were also compared, and the analysis without the outlier gave better results. The details are in the program files 'prb6.m' and 'prb6a.m'. Comparing the coefficients between males and females, it is found that females take more time to cover a given distance and climb.

The relevant portion of the code is given here:

    load 'hills.dat'  % data file with the outlier
    x = hills;  % matrix with 1st column 'distance', 2nd column 'climb',
                % 3rd column 'time for males' and 4th column 'time for females'
    [b1, bint1, r1, rint1, stats1] = regress(x(:,3), [ones(size(x(:,1))), x(:,1), x(:,2)]);
    % fits a regression of the 'time (males)' with the 'distance' and 'climb'.
    % * b1 is the beta matrix (or the regression coefficients) in the model
    % * bint1 is the confidence interval for b
    % * r1 is a column vector of residuals
    % * rint1 is the confidence interval for r
    % * stats1 is a row vector containing:
    %     o The R^2 statistic
    %     o The F statistic
    %     o The p value for the full model
    %     o The estimated error variance
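The effect of removing a suspect point can be illustrated without `regress`, using the backslash operator to solve the least-squares problem directly. The sketch below is not the content of 'prb6.m' or 'prb6a.m'; the numbers are invented for illustration, with the last row playing the role of an odd point, and the names `b_all`, `b_clean`, `R2_all` and `R2_clean` are hypothetical.

```octave
% Hypothetical data (illustrative only): distance, climb, record time.
d = [ 2.5   650  0.27;
      6.0  2500  0.80;
      6.0   900  0.56;
      7.5   800  0.73;
      8.0  3070  1.04;
     16.0  7500  3.41;
     14.0  2200  1.64;
     10.0  3000  9.50];          % last row is a deliberately odd point

X = [ones(size(d, 1), 1), d(:, 1), d(:, 2)];   % intercept, distance, climb
y = d(:, 3);                                   % response: record time

% Fit with every row.
b_all  = X \ y;
R2_all = 1 - sum((y - X * b_all) .^ 2) / sum((y - mean(y)) .^ 2);

% Fit again with the suspect row dropped.
keep = true(size(d, 1), 1);
keep(end) = false;
Xk = X(keep, :);
yk = y(keep);
b_clean  = Xk \ yk;
R2_clean = 1 - sum((yk - Xk * b_clean) .^ 2) / sum((yk - mean(yk)) .^ 2);

printf('all rows     : b = [%g %g %g], R^2 = %.4f\n', b_all, R2_all);
printf('row removed  : b = [%g %g %g], R^2 = %.4f\n', b_clean, R2_clean);
```

Here the first entry of each coefficient vector is the intercept, followed by the coefficients of distance and climb; comparing the two fits shows how strongly a single point can pull the coefficients.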

The relevant graphs are attached here:

Figure 1: Data with outlier

Figure 2: Data with outlier

Figure 3: Data without outlier

Figure 4: Data without outlier