Transcript
Page 1: Statistical analysis of bluegill sunfish data using linear

STATISTICAL ANALYSIS OF BLUEGILL SUNFISH DATA

USING LINEAR LOGISTIC REGRESSION

Susan Ng

B.A. (Honors), University of Hong Kong, 1973

PROJECT SUBMITTED IN PARTIAL FULFILLMENT OF

THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

in the Department

of

Mathematics and Statistics

© Susan Ng 1986

SIMON FRASER UNIVERSITY

August 1986

All rights reserved. This work may not be reproduced in whole or in part, by photocopy

or other means, without permission of the author.

Name: Susan Ng

Degree: Master of Science

Title of project: Statistical Analysis of Bluegill Sunfish Data

Using Linear Logistic Regression

Examining Committee:

Chairman: Dr. A.R. Freedman

Dr. R. Lockhart, Senior Supervisor

Dr. D. Eaves

Dr. R. Routledge, External Examiner, Mathematics and Statistics Department, Simon Fraser University

Date Approved: 8 August 1986

PARTIAL COPYRIGHT LICENSE

I hereby grant to Simon Fraser University the right to lend my thesis, project or extended essay (the title of which is shown below) to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users. I further agree that permission for multiple copying of this work for scholarly purposes may be granted by me or the Dean of Graduate Studies. It is understood that copying or publication of this work for financial gain shall not be allowed without my written permission.

Title of Thesis/Project/Extended Essay

Author:

(signature)

(name)

(date)

ABSTRACT

A data set from an artificial breeding experiment on the

bluegill sunfish is analysed. The aim of the experiment is to

test whether alternative reproductive patterns in the bluegill

sunfish are genetically inherited, and to identify factors that

contribute to the different reproductive patterns. Linear

logistic regression models using the maximum likelihood

estimation method are employed. 'Improvement' chi-square

statistics are used to select explanatory variables. Various

goodness-of-fit statistics are used to check the adequacy of the

fitted models. Finally, a Monte-Carlo study is carried out to

check the appropriateness of the selection criterion and the

sensitivity of the goodness-of-fit statistics. The model

identified before the Monte-Carlo study is then reassessed.

ACKNOWLEDGMENTS

I would like to take this opportunity to thank my Senior

supervisor, Dr. R. Lockhart, for his invaluable advice and for

his many helpful suggestions during my M.Sc. studies.

I would also like to thank the members of my Supervising

committee for their valuable comments and advice.

In particular, I am grateful to Dr. D. Eaves for his

supervision during the initial analysis of the data set.

Finally, my special thanks to Dr. M. Gross for supplying the

detail of the sunfish experiment.

TABLE OF CONTENTS

Approval .................................................... ii
Abstract ................................................... iii
Acknowledgments ............................................. iv
List of Figures ............................................ vii
List of Tables ............................................ viii
FOREWORD .................................................... ix
I. EXPERIMENT ................................................ 1
    Data ..................................................... 3
II. CHOICE OF MODEL .......................................... 4
    Problem Set-up ........................................... 4
    Choice of Link Function .................................. 6
    Linear Logistic Model .................................... 7
    BMDP Statistical Software ............................... 11
III. EXPLANATORY VARIABLES .................................. 13
    Improvement Chi-square Statistic ........................ 15
    Result of Selection of Explanatory Variables ............ 16
    Monte-Carlo Study of Improvement Chi-square Statistic ... 19
    Reassessment of Fitted Model ............................ 23
    Genetic-Environmental Interaction ....................... 24
IV. CHECKING ADEQUACY OF MODELS: RESIDUAL ANALYSIS .......... 25
    Residuals for Linear Logistic Models .................... 25
    Checking Adequacy of Models ............................. 28
V. CHECKING ADEQUACY OF MODELS: GOODNESS-OF-FIT TEST ........ 31
    Hosmer's Goodness-of-fit Statistic ...................... 32
    C.C. Brown's Goodness-of-fit Statistic .................. 33
    Likelihood Ratio Statistic .............................. 34
    Pearson's Chi-square Statistic .......................... 36
    Monte-Carlo Study of Quality of Chi-square Approximation  36
    Checking Adequacy of Models ............................. 41
    Comparison of Model-A and Model-B ....................... 43
VI. TESTS ................................................... 45
VII. Analysis of Environmental Effects ...................... 51
    Results ................................................. 52
VIII. Conclusion ............................................ 54
Appendix A .................................................. 74
Bibliography ................................................ 76

LIST OF FIGURES

Figure                                                                         Page
A.1   Plot of Residual versus Predicted Proportion Precocious: Model-A ......... 56
A.2   Normal Probability Plot: Model-A ......................................... 57
A.3   Plot of Observed versus Predicted Proportion Precocious: Model-A ......... 58
B1.1  Plot of Residual versus Predicted Proportion Precocious: Model-B (Collapsed Cells) ... 59
B1.2  Normal Probability Plot: Model-B (Collapsed Cells) ....................... 60
B1.3  Plot of Observed versus Predicted Proportion Precocious: Model-B (Collapsed Cells) ... 61
B2.1  Plot of Residual versus Predicted Proportion Precocious: Model-B (96 Cells) ... 62
B2.2  Normal Probability Plot: Model-B (96 Cells) .............................. 63
B2.3  Plot of Observed versus Predicted Proportion Precocious: Model-B (96 Cells) ... 64

LIST OF TABLES

Table                                                              Page
2     Variable Selection by Improvement Chi-square Statistic ...... 17
3     Simulation Results of Improvement Chi-square Statistic ...... 22
4     Simulation Results of Goodness-of-fit Statistics ............ 37
1.1   Number of Fish Survived Including Unsacrificed .............. 65
      Number of Fish Survived and Sacrificed ...................... 66
      Number of Female Fish Sacrificed ............................ 67
      Number of Male Fish Sacrificed .............................. 68
      Number of Precocious Males Sacrificed ....................... 69
      Ranking of Individual Cuckolder Fathers: Model-A ............ 70
      Ranking of Individual Cuckolder Fathers: Model-B ............ 71
      Ranking of Individual Mothers: Model-A ...................... 72
      Ranking of Individual Mothers: Model-B ...................... 73


FOREWORD

An important recent discovery in biology is the existence of

alternative mating strategies in some fish populations. In these

fish populations, some males mature precociously while only 5-25% of the

body size of 'normal' adult males and mate by sneaking into the

latter's nests.

In the bluegill sunfish population in Lake Opinicon in

Ontario, 'normal' males (called parentals) mature at the age of

seven to eight, build nests and provide care for the offspring.

The 'precocious' males (called cuckolders) mature at the age of

one to two years and fertilize eggs in nests of parental males

either by sneaking or mimicking the behavior of females. These

small cuckolder males provide no parental care for their

offspring.

An artificial breeding experiment on the bluegill sunfish

was conducted in 1982/1983 by Dr. M. Gross of the Biological

Sciences Department of Simon Fraser University and Dr. D.

Philipp of the Illinois Natural History Survey. The aim of the

experiment was to test whether alternative reproductive patterns

in the bluegill sunfish (precocial maturity) are genetically

inherited, and to identify factors, genetic or environmental,

which contribute to the different mating patterns.

The experiment is described briefly in Chapter 1. Various

models for analysing the data are studied in Chapter 2. Linear

logistic regression was found to be the most appropriate model

and the PLR program of the BMDP Statistical Software was used

for analysing the data set. Chapter 3 describes how the

explanatory variables are selected. It also discusses the result

of a small-scale Monte-Carlo study aimed at investigating the

appropriateness of the selection criterion. Two models were

found appropriate. They are checked for adequacy by residual

analysis in Chapter 4 and by goodness-of-fit tests in Chapter 5.

Also in Chapter 5, the results of a Monte-Carlo study to test

the quality of the chi-square approximation to the distributions

of various goodness-of-fit statistics are discussed. Results of

various tests and inferences are summarized in Chapter 6.

Lastly, in Chapter 7, an environmental factor, the pond effect,

is further analysed.

CHAPTER I

EXPERIMENT

Reproductively mature parental males, cuckolder males, and

female bluegill sunfish were collected on 27 June 1982 from an

active breeding colony in Sandy Bay, Lake Opinicon. Eight

females of age five were crossed with six cuckolder males of age

three and six parental males of age eight. They were selected to

be of similar body size and the age that is the mean of their

type. Sperm from each of the parental and cuckolder males were

used to fertilize approximately 200 eggs from each female.

Approximately 19200 progeny of the 96 crosses were reared

for 24 hours in a laboratory at Queen's University Biological

Station, Lake Opinicon. They were then transferred to a

laboratory at the Illinois Natural History Survey (INHS),

Champaign, Illinois. The eggs and emerging fry from each cross

were placed in 96 separate 40-litre tanks. In August, the

crosses were transferred to a greenhouse containing 96 200-litre

tanks.

In September 1982, about 30 progeny per tank were chosen

randomly and tagged to identify both mother and father. They

were then transferred to four ponds at the INHS Aquatic Research

Field Laboratory where they were reared together on a natural

diet. Each pond contained all the progeny of two females. The

ponds were also stocked with six mature parental and twelve

mature female bluegill sunfish from a local Illinois population

to simulate natural breeding conditions.

In June 1983, about one year after establishing the crosses

and at the height of natural bluegill breeding activity, the

ponds were drained and the progeny captured.

190 fish from Pond 2 were chosen randomly and kept alive for

future experiments and the remaining 1159 progeny were killed.

Their parents were identified and their weight and length were

measured. Their gonads, while fresh, were examined under a

dissecting microscope. The sex and reproductive state of all but

58 (31 from parental fathers and 27 from cuckolder fathers) were

readily determined. Of the remaining 1101 fish, 432 were male.

166 males were identified and categorized as precociously mature

by the presence of free sperm in their testes.

The 190 fish which were retained alive were tested for the

presence of sperm by light pressure on the lower abdomen body

wall. Individuals that shed sperm were tagged by left opercular

clips and those that did not with right opercular clips. Then

all fish were released into a pond containing mature adult male

and female bluegill sunfish. Observation of spawning activity in

the ponds in the next year revealed that only those tagged by

left opercular clips were behaving as cuckolders. Thus, in this

experiment, the presence of free sperm at the age of one could

be used to clearly identify cuckolder males from parental males.

Data

The counts (number of fish survived, number of fish sacrificed, number of females sacrificed, number of males sacrificed and number of precocious males sacrificed) classified by crosses (families) are summarised in Tables 1.1 to 1.5.

There were altogether 432 male offspring killed and

identified, that is, an average of 4.5 male offspring from each

family. However, they were not evenly distributed. There were 9

families with no male offspring, 46 with one to four, 34 with

five to nine and 7 with ten or more.

Of these 432 male offspring, 195 were fathered by parental

males. Of these 195, 55 or 28% were precocious. On the other

hand, 111 or 47% out of 237 sons fathered by cuckolders were

precocious.
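
The two proportions quoted above can be checked directly from the counts. The short sketch below is an illustration only, not part of the original analysis: it recomputes the percentages and also reports the crude odds ratio between the two father types, which ignores pond, mother and individual father effects. The names `counts`, `proportion` and `odds` are introduced here for illustration.

```python
import math

# Counts quoted in the text: (precocious sons, all identified sons) by father type.
counts = {
    "parental": (55, 195),
    "cuckolder": (111, 237),
}

def proportion(successes, total):
    """Observed proportion of precocious sons."""
    return successes / total

def odds(successes, total):
    """Observed odds of being precocious: successes over failures."""
    return successes / (total - successes)

props = {name: proportion(*pair) for name, pair in counts.items()}
# Crude odds ratio, pooling over ponds, mothers and individual fathers.
odds_ratio = odds(*counts["cuckolder"]) / odds(*counts["parental"])

for father_type, p in props.items():
    print(f"{father_type}: {100 * p:.1f}% precocious")
print(f"crude odds ratio (cuckolder vs parental): {odds_ratio:.2f}")
print(f"crude log odds ratio: {math.log(odds_ratio):.3f}")
```

The pooled figures only motivate the modelling that follows; they are not a substitute for it, since the families are unevenly sized and spread over four ponds.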

CHAPTER II

CHOICE OF MODEL

Problem Set-up

The objective of the experiment is to test whether precocial

maturity in the bluegill sunfish is genetically inherited. This

hypothesis predicts a greater proportion of precocious sons

among the male progeny of cuckolder fathers as compared to those

of parental fathers.

Within family or cell $i$ ($i = 1, \ldots, N$), there are $n_i$ male offspring, $Y_{ij}$ ($j = 1, \ldots, n_i$). Each of these $Y_{ij}$'s takes one of two possible forms: precocious or non-precocious. Let $Y_{ij}$ take on value 1 if it is precocious and 0 if it is non-precocious. Let

$$Y_i = \sum_{j=1}^{n_i} Y_{ij}.$$

Then $Y_i$ is the number of precocious sons in the $i$th family. Assuming the probability of becoming precocious, $\theta_i$, is constant for all individuals in the same family and the observations on all individuals are independent, then the distribution of $Y_i$ given $n_i$ is binomial with index $n_i$ and parameter $\theta_i$.

The saturated model describing the data has $N$ parameters $\theta_1, \ldots, \theta_N$, where $\theta_i$ is estimated by

$$\hat\theta_i = Y_i / n_i.$$

The saturated model yields a perfect fit to the data but

does not help us understand the underlying process generating

the data, or the underlying structure of the data. To get this

understanding, we need to replace the individual data values,

the $\theta_i$'s, by a summary (model) that describes their general

characteristics in terms of fewer parameters.

We suppose that the probability of becoming precocious, $\theta_i$, when suitably transformed depends linearly on some explanatory variables or covariates. That is,

$$h(\theta_i) = X_i \beta$$

where $h$ is a function of $\theta_i$ and is usually called the link function, $X_i$ is the $i$th row of the $N \times p$ matrix of explanatory variables (known constants), and $\beta$ is a vector of $p$ unknown parameters.

Choice of Link Function

An obvious candidate for the link function is the identity transform $h(\theta_i) = \theta_i$. However it has the serious restriction that the model may lead to some fitted values of $X_i\beta$, and thus some fitted values of $\theta_i$, outside the range $[0, 1]$.

The variance-stabilizing angular transform

$$\lambda_i = \arcsin\sqrt{\theta_i}$$

has the nice property that $\arcsin\sqrt{y_i/n_i}$ (where $y_i$ is the number of successes in cell $i$) has asymptotic normal distribution with mean $\lambda_i$ and variance $1/(4n_i)$, which is independent of $\theta_i$. However, similar to the identity transformation, it is limited for general usefulness by its finite range.

Two other transformations have received considerable attention. One is the integral normal transform

$$\lambda_i = \Phi^{-1}(\theta_i)$$

where $\Phi(t)$ is the value of the cumulative normal curve at the point $t$.

The other is the log-odds transform

$$\lambda_i = \log\frac{\theta_i}{1-\theta_i}.$$

Both transformations always satisfy the constraint $0 \le \theta_i \le 1$ and are very similar over the range $0.1 \le \theta_i \le 0.9$ when appropriately standardized. However the integrated normal transform has the disadvantage over the log-odds transform of the absence of sufficient statistics. Also, the log-odds transform is preferred in this analysis because of its simple interpretation as the logarithm of the odds. Models involving such a transformation are logistic models.
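
The claim that the two transforms are very similar once standardized can be illustrated numerically. The sketch below is not from the original project: it evaluates both transforms on a grid over 0.1 to 0.9 and prints their ratio, which stays within a narrow band, so that one is close to a constant multiple of the other on this range. The grid and the function names `logit`, `probit` and `angular` are choices made here for illustration.

```python
import math
from statistics import NormalDist

def logit(theta):
    """Log-odds transform."""
    return math.log(theta / (1.0 - theta))

def probit(theta):
    """Integral normal transform: inverse of the standard normal CDF."""
    return NormalDist().inv_cdf(theta)

def angular(theta):
    """Angular transform; its range is the finite interval [0, pi/2]."""
    return math.asin(math.sqrt(theta))

# 0.5 is left out because both transforms are exactly 0 there.
grid = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ratios = [logit(t) / probit(t) for t in grid]

for t, r in zip(grid, ratios):
    print(f"theta={t:.1f}  logit={logit(t):+.3f}  probit={probit(t):+.3f}  ratio={r:.3f}")
print(f"angular transform range: [{angular(0.0):.3f}, {angular(1.0):.3f}]")
```

Because the ratio varies so little over this range, data with fitted probabilities mostly between 0.1 and 0.9 will rarely distinguish the two links, which is why the choice here rests on sufficiency and interpretability instead.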

Linear Logistic Model

The model

$$\lambda_i = \log\frac{\theta_i}{1-\theta_i} = X_i\beta$$

where $X_i = [x_{i1} \ldots x_{ip}]$ and $\beta = [\beta_1 \ldots \beta_p]^T$ is a linear logistic model.

There are two ways of estimating the unknown parameters $\beta$.

One is based on non-iterative weighted least squares and the

other is the maximum likelihood method.

1. Non-iterative Weighted Least Squares Method

The number of successes $Y_i$ in each cell $i$ of size $n_i$ is distributed as binomial with index $n_i$ and parameter $\theta_i$. For large $n_i$, provided that $\theta_i$ is not too close to 0 or 1,

$$u_i = \log\frac{y_i/n_i}{1 - y_i/n_i}$$

is nearly normally distributed. As $n_i$ tends to infinity, the asymptotic mean and variance are respectively

$$\log\frac{\theta_i}{1-\theta_i} \quad\text{and}\quad \frac{1}{n_i\theta_i(1-\theta_i)}.$$

The asymptotic variance is consistently estimated by

$$\frac{n_i}{y_i(n_i - y_i)}.$$

These estimated variances can be used to obtain weights and thus the weighted least squares method can be applied.

The expression for $u_i$ is undefined when $y_i$ equals 0 or $n_i$. The $u_i$'s need to be transformed to

$$u_i = \log\frac{y_i/n_i + 1/(2n_i)}{1 - y_i/n_i + 1/(2n_i)}. \qquad (*)$$

The estimate for the asymptotic variance is also modified to

[equation not recoverable from the scan]

The computation in the weighted least squares method is less complex and no iteration is involved. Also various graphical analyses are possible by treating the $u_i$'s as approximately normally distributed with known variances. However if some $n_i$'s are small, the Central Limit Theorem cannot be applied. Moreover, if some $n_i$'s are small and some $y_i$'s are equal to 0 or $n_i$, the addition of the term $1/(2n_i)$ in the expression (*) might change the relative magnitude of the data in an undesirable way. Lastly, since this method is not based on sufficient statistics, some efficiency may be lost.
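
As a rough illustration of the non-iterative method, the sketch below computes adjusted logits with the $1/(2n_i)$ correction and fits a straight line by weighted least squares. Several things here are assumptions and not taken from the source: the cell counts are hypothetical, the single covariate (1 for a cuckolder father, 0 for a parental father) is chosen only to keep the algebra closed-form, and the variance estimate used for the weights is the form commonly paired with the adjusted logit in the literature, which may differ from the one used in the project.

```python
import math

def empirical_logit(y, n):
    """Adjusted logit: 1/(2n) is added to both the success and failure
    proportions, so the value stays finite when y is 0 or n."""
    return math.log((y / n + 0.5 / n) / (1.0 - y / n + 0.5 / n))

def empirical_variance(y, n):
    """Assumed variance estimate for the adjusted logit (not from the source)."""
    return (n + 1.0) * (n + 2.0) / (n * (y + 1.0) * (n - y + 1.0))

def weighted_line(x, u, w):
    """Closed-form weighted least squares fit of u = b0 + b1 * x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ubar = sum(wi * ui for wi, ui in zip(w, u)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxu = sum(wi * (xi - xbar) * (ui - ubar) for wi, xi, ui in zip(w, x, u))
    b1 = sxu / sxx
    return ubar - b1 * xbar, b1

# Hypothetical cells: (x, precocious sons y, male offspring n).
cells = [(0, 0, 4), (0, 2, 7), (0, 3, 9), (1, 2, 5), (1, 5, 8), (1, 6, 6)]
x = [c[0] for c in cells]
u = [empirical_logit(c[1], c[2]) for c in cells]
w = [1.0 / empirical_variance(c[1], c[2]) for c in cells]  # weights = inverse variances

b0, b1 = weighted_line(x, u, w)
print(f"intercept={b0:.3f}  slope={b1:.3f}")
```

Note how the cells with $y_i = 0$ or $y_i = n_i$ still contribute finite logits; with cells this small, the correction term is a large share of the proportion, which is the distortion the text warns about.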

2. Maximum Likelihood Method

The second method for estimating the parameter $\beta$ is the maximum likelihood method. This method uses the sufficient statistics

[The displayed equations on this page of the scan (the sufficient statistics and the likelihood function) are rotated and not recoverable.]

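The $jk$th entry of the information matrix $I(\beta)$ is given by

$$I_{jk}(\beta) = \sum_{i=1}^{N} n_i x_{ij} x_{ik} \theta_i (1 - \theta_i).$$

Under regularity conditions, the estimate $\hat\beta$ has an approximate multivariate normal distribution with covariance matrix $I^{-1}(\beta)$, which can be estimated by $I^{-1}(\hat\beta)$, where $I(\hat\beta)$ has $jk$th entry

$$\sum_{i=1}^{N} n_i x_{ij} x_{ik} \hat\theta_i (1 - \hat\theta_i).$$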

There are nice asymptotic results for some test

statistics. They can be used to select the 'best' set of

explanatory variables and to test whether a model fits

the data adequately.

The main problem in using the maximum likelihood

method is a computational one of maximizing the log

likelihood. This can be solved by using statistical

packages like BMDP.
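
To make the computational step concrete, the following is a minimal sketch of maximum likelihood fitting by Newton-Raphson for grouped binomial data with an intercept and one covariate. It is not the BMDP algorithm and not the model fitted in this project: the two cells are the pooled father-type totals quoted in Chapter I, used only because the answer can be checked by hand, and the function names are introduced here for illustration.

```python
import math

def score_and_information(cells, b0, b1):
    """Score vector and information matrix for logit(theta_i) = b0 + b1 * x_i."""
    s0 = s1 = i00 = i01 = i11 = 0.0
    for x, y, n in cells:
        theta = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        resid = y - n * theta              # observed minus expected successes
        wt = n * theta * (1.0 - theta)     # binomial variance of the cell
        s0 += resid
        s1 += resid * x
        i00 += wt
        i01 += wt * x
        i11 += wt * x * x
    return (s0, s1), (i00, i01, i11)

def fit_logistic(cells, n_iter=25):
    """Newton-Raphson iterations; returns estimates and their covariance matrix."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        (s0, s1), (i00, i01, i11) = score_and_information(cells, b0, b1)
        det = i00 * i11 - i01 * i01
        # beta <- beta + I^{-1} * score, with the 2x2 inverse written out.
        b0 += (i11 * s0 - i01 * s1) / det
        b1 += (i00 * s1 - i01 * s0) / det
    _, (i00, i01, i11) = score_and_information(cells, b0, b1)
    det = i00 * i11 - i01 * i01
    cov = ((i11 / det, -i01 / det), (-i01 / det, i00 / det))
    return (b0, b1), cov

# (x, precocious sons, male offspring): x = 0 parental fathers, x = 1 cuckolder fathers.
cells = [(0, 55, 195), (1, 111, 237)]
(b0, b1), cov = fit_logistic(cells)
se_b1 = math.sqrt(cov[1][1])
print(f"b0={b0:.4f}  b1={b1:.4f}  se(b1)={se_b1:.4f}")
```

With one parameter per cell the fit reproduces the observed proportions, so `b1` equals the crude log odds ratio; the inverse information at the estimate supplies the approximate standard error.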

BMDP Statistical Software

The PLR program of the BMDP Statistical Software was used

for analysing this data set. It computes maximum likelihood

estimates using an iteratively reweighted Gauss-Newton method.

It also calculates the asymptotic variance-covariance matrix of

the estimates.

It can handle continuous or categorical explanatory

variables and generates design variables for the categorical

ones. It can select explanatory variables to be included in the

model in a stepwise manner based on maximum likelihood ratio.

It also provides, at each step, the log-likelihood, the

change in log-likelihood from the previous step, and three

goodness-of-fit statistics, namely the likelihood ratio

statistic, Hosmer's statistic and Brown's statistic.

Predicted probabilities, standardized residuals and various

scatter plots are also available on request.

CHAPTER III

EXPLANATORY VARIABLES

The biologists believe that the factors affecting whether a

male offspring will become precocious or not can be classified

into environmental and genetic.

The only environmental explanatory factor for which the

experiment provides data was the pond effect. The genetic factor

was accounted for by the father-type effect, an individual

father effect, an individual mother effect and possibly

individual father crossed individual mother interaction effects.

The individual father effect was nested in father-type

because there were six fathers of father-type parental and six

fathers of father-type cuckolder. Also, due to the design of the

experiment, the individual mother effect was nested in pond

because the offspring of two mothers were stocked in each of the

four ponds.

If $n_{ijkl}$ and $Y_{ijkl}$ are respectively the number of males and the number of precocious males in the family of father $k$ of father-type $i$ and mother $l$ of pond $j$, then the full model suggested was

$$Y_{ijkl} \sim \mathrm{Bin}(n_{ijkl}, \theta_{ijkl})$$

[The linear predictor of the full model and its constraints are set on a rotated page of the scan and are not recoverable.]

Apart from 'pond' and 'father-type', the rest were nested effects. Therefore

the design variables for these nested effects had to be

generated manually. As a result, they were treated as separate

continuous variables by the program and the stepwise selection

feature could not be used. Rather, the explanatory variables to

be included in the model had to be specified every time the

program was run.

The selection process of the explanatory variables was

sequential and the order in which they were considered was a

hierarchical one, that is, main effects followed by nested

effects and lastly by interaction effects. The variables that

significantly improved the fit of the model were retained while

those that did not were excluded from the model.

Improvement Chi-square Statistic

Whether a variable or a group of variables significantly

improve the fit of the model can be quantified by the

'improvement chi-square statistic'. It is twice the logarithm of

the ratio of the current versus previous likelihood function

values. Suppose there are two models, model(1) and model(2) and

that model(2) contains a variable or a group of variables in

addition to the variables in model(1). The likelihood ratio

statistic of model(2) is

$$G^2(2) = 2[L(\hat\theta) - L(\hat\theta_2)]$$

where $L(\hat\theta)$ and $L(\hat\theta_2)$ are the log likelihoods of the saturated model and of model(2) respectively. Then

$$G^2[(2) \mid (1)] = G^2(1) - G^2(2) = 2[L(\hat\theta_2) - L(\hat\theta_1)],$$

where $G^2(1)$ and $L(\hat\theta_1)$ are the corresponding quantities for model(1), and $G^2[(2) \mid (1)]$ is asymptotically distributed as chi-square with degrees of freedom equal to the number of parameters that are in model(2) but not in model(1). To test the hypothesis that model(1) holds against the alternate hypothesis that model(2) holds but model(1) does not, we use $G^2[(2) \mid (1)]$ as the test statistic. A small p-value indicates that the new variable or group of variables added significantly improves the fit of the model to the data. $G^2[(2) \mid (1)]$ is called the improvement chi-square statistic or the deviance.
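
The calculation can be sketched in a few lines. The example below is an illustration under stated assumptions, not the BMDP output: it takes log-likelihood values reported in Table 2 and recomputes the improvement chi-square statistic and its p-value. To stay self-contained it uses the closed-form chi-square tail probability that holds only for an even number of degrees of freedom, which covers the 4 and 20 degree-of-freedom comparisons but not the 3 and 5 degree-of-freedom ones. The function names are introduced here.

```python
import math

def improvement_chi_square(loglik_smaller, loglik_larger):
    """Twice the gain in log likelihood from the smaller to the larger model."""
    return 2.0 * (loglik_larger - loglik_smaller)

def chi_square_sf_even(x, df):
    """Upper tail probability of a chi-square variable with even df.

    Uses P(X > x) = exp(-x/2) * sum_{k < df/2} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

# Log likelihoods as reported in Table 2.
g2_mother = improvement_chi_square(-255.459, -247.071)   # Model-F over Model-E, 4 d.f.
g2_pf_m = improvement_chi_square(-247.071, -227.386)     # Model-G over Model-F, 20 d.f.

p_mother = chi_square_sf_even(g2_mother, 4)
p_pf_m = chi_square_sf_even(g2_pf_m, 20)
print(f"mother effect:      G2={g2_mother:.3f}  p={p_mother:.4f}")
print(f"PF x M interaction: G2={g2_pf_m:.3f}  p={p_pf_m:.4f}")
```

Both p-values agree with the table entries to the precision printed there, which is a useful check on the transcription of the log likelihoods.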

Result of Selection of Explanatory Variables

The data in Tables 1.4 and 1.5 are the $n_{ijkl}$'s and the $Y_{ijkl}$'s respectively in this analysis. There were altogether 96 families, 9 of which had no sacrificed male offspring. Thus there were 87 effective covariate patterns.

The log-likelihood of each model considered and the

corresponding improvement chi-square statistic for each group of

variables added are presented in Table 2.

Table 2

Variable Selection by Improvement Chi-square Statistic

MODEL  PARAMETERS IN MODEL                       PARAMETERS ADDED             LOG LIKELIHOOD  D.F.  IMPROV. CHI-SQ. P-VALUE
B      C, Pond (P)                               Pond                         -270.079         3    < 0.001
D      C, P, T, Parental Father (PF)             Parental Father              -259.099         5    0.220
E      C, P, T, Cuckolder Father (CF)            Cuckolder Father             -255.459         5    0.014 (over Model-C)
F      C, P, T, CF, Mother (M)                   Mother                       -247.071         4    0.002 (over Model-E)
G      C, P, T, CF, M, PF/M Interaction          Parental Fa./Mother Inter.   -227.386        20    0.006 (over Model-F)
H      C, P, T, CF, M, CF/M Interaction          Cuckolder Fa./Mother Inter.  -230.467        20    0.032 (over Model-F)
I      C, P, T, CF, M, PF/M Inter., CF/M Inter.  Parental Fa./Mother Inter.   -210.787        20    0.006 (over Model-H)
                                                 Cuckolder Fa./Mother Inter.  -210.787        20    0.032 (over Model-G)

The first variable to enter the model was 'pond'. The

improvement chi-square statistic over the null model with the

constant term was very significant with p-value less than 0.005.

'Father-type' was the next variable considered. The

improvement chi-square statistic was again very significant

(p-value = 0.0001) indicating that it should be in the model.

However when 'individual parental father' was added to the

model, the improvement chi-square statistic had a p-value of 0.22,

meaning that it did not contribute significant improvement to

the model; therefore it was left out of the model.

Next, 'individual cuckolder father' was tried and the

improvement chi-square statistic was significant (p-value =

0.014).

The 'individual mother' effect was incorporated into the

model. The improvement chi-square statistic showed a highly

significant p-value of 0.002 indicating that it should be

included.

To see whether the order in which the explanatory variables

were entered would affect the result of the variable selection,

the 'parental father', 'cuckolder father', and 'mother' effects

were entered in different orders into the model after 'pond' and

'father-type'. In all cases, the 'parental father' effect was

insignificant while the other two were significant.

The next group of variables to enter the model was the

interaction effect. Both the 20-parameter 'parental father

crossed mother interactions' and the 20-parameter 'cuckolder

father crossed mother interactions' were significant when they

were entered separately into the model.

Also the 'parental father crossed mother interactions' were

significant when 'cuckolder father crossed mother interactions'

were in the model and vice versa.

Based on the improvement chi-square statistics, the model

selected includes the explanatory variables 'pond',

'father-type', 'individual cuckolder father', 'individual

mother', 'parental father crossed mother interaction' and

'cuckolder father crossed mother interaction' effects. The

number of parameters is 54 which is not a small fraction of the

number of effective covariate patterns. Thus it is desirable to

study the appropriateness of the improvement chi-square

statistic in analysing this data set.
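
The count of 54 parameters can be traced back to the design. The sketch below is a bookkeeping illustration only; it assumes that each father-by-mother interaction contributes (fathers per type minus one) times (mother degrees of freedom) terms, which reproduces the 20 parameters per interaction quoted above but is not spelled out in the source. The dictionary keys are labels chosen here.

```python
# Design constants taken from the description of the experiment.
n_ponds = 4
mothers_per_pond = 2
fathers_per_type = 6
n_father_types = 2

df = {
    "constant": 1,
    "pond": n_ponds - 1,
    "father-type": n_father_types - 1,
    "individual cuckolder father": fathers_per_type - 1,   # nested in father-type
    "individual mother": n_ponds * (mothers_per_pond - 1),  # nested in pond
}
# Assumption: interaction d.f. = (fathers - 1) * (mother d.f.) within each father type.
df["parental father x mother"] = (fathers_per_type - 1) * df["individual mother"]
df["cuckolder father x mother"] = (fathers_per_type - 1) * df["individual mother"]

n_parameters = sum(df.values())
n_families = n_father_types * fathers_per_type * n_ponds * mothers_per_pond
effective_cells = n_families - 9  # nine families had no sacrificed male offspring

for name, value in df.items():
    print(f"{name:32s}{value:3d}")
print(f"total parameters: {n_parameters}")
print(f"effective covariate patterns: {effective_cells}")
print(f"parameters per pattern: {n_parameters / effective_cells:.2f}")
```

With well over half as many parameters as effective covariate patterns, large-sample chi-square results cannot be taken for granted, which motivates the simulation study that follows.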

Monte-Carlo Study of Improvement Chi-square Statistic

1. Objective

The objective of this study was to investigate the

reliability of the improvement chi-square statistic in the

selection of explanatory variables. The decision of whether

to include a parameter or a group of parameters in the

current model was based on the improvement chi-square

statistic which has approximately a chi-square distribution.

If the approximation is good, then one is more confident

about the selected model. Otherwise, this selection

procedure may have included more variables than it should,

or may have left out some variables which are important.

Moreover the quality of the chi-square approximation may

differ with the number of degrees of freedom relative to the

number of covariate patterns.

2. Simulation I

The original data set was fitted to the model with 14

parameters : 1 for 'father-type', 3 for 'pond', 5 for

'individual cuckolder father', 4 for 'individual mother' and

1 for the general mean. The proportion precocious for each

cell predicted from this model was taken as the true

probability of success. These 'true' probabilities of

success, $\theta_i$, together with the total number of male fish in

each cell, $n_i$, were used to simulate 200 binomial random

variables with index $n_i$ and parameter $\theta_i$ for each cell. The GGBN

routine of the IMSL library was used.

For each simulated sample, four goodness-of-fit

statistics were calculated for further analysis. Then the

data were fitted separately to two extended models. Extended

Model 1 has 20 additional parameters (cuckolder father

crossed mother interactions) and Extended Model 2 has 40

additional terms (parental father and cuckolder father

crossed mother interactions). The improvement chi-square

statistic was calculated for each of the 200 data sets for

each model. As can be seen from Table 3 below, when the

nominal 0.05 level test was used, the Extended Model 1 (20

additional parameters) was significant 12% of the time while

the Extended Model 2 (40 additional parameters) was

significant 24% of the time. The result suggests that for

this analysis, the selection criterion using improvement

chi-square statistics tends to include more explanatory

variables than it actually should. A possible reason is that

the chi-square approximation to its distribution is not good

when the number of degrees of freedom is large.

3. Simulation II

To investigate further, another Monte-Carlo study was

carried out. This time, the original dataset was fitted to

the model with 10 parameters: 1 for

'father-type', 3 for 'pond', 5 for 'individual cuckolder

father' and 1 for the grand mean. Following the same

procedure in Simulation I, simulated samples were generated.

For each simulated sample, the goodness-of-fit statistics

were computed. Also the data were fitted to an extended

model with 4 additional parameters ('individual mother'

effect). The improvement chi-square statistic was then

calculated for each data set. Since the analyses of these

simpler models cost much less than the ones involving

interaction terms, altogether 500 samples were simulated and

analysed. The nominal 5% level test was used. The

improvement chi-square statistics this time suggest that the

extra 4 parameters are needed 5.2% of the time.

Table 3

Simulation Results of Improvement Chi-square Statistic

                                                              Observed Significance Level
                                                              at a Nominal 5% Level
Simulation I    Extended Model 1 (20 additional parameters)   12.0%
                Extended Model 2 (40 additional parameters)   24.0%
Simulation II   Extended Model (4 additional parameters)      5.2%

4. Summary - of Results

The two Monte-Carlo studies suggest that the chi-square

approximation for the improvement chi-square statistic works

well when based on 4 degrees of freedom but is poor with 20

or more degrees of freedom. Thus it is a good selection

criterion for explanatory variables when the number of

additional parameters is small. However it tends to include

more parameters than necessary when the number of additional


parameters is large.
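To judge whether the observed levels differ from 5% by more than simulation noise, one can compare each with the Monte-Carlo standard error of a proportion. The sketch below assumes independent simulated samples; the rejection counts (24, 48 and 26) are the ones implied by the reported percentages and sample sizes, and the function name is invented for this illustration.

```python
import math

def excess_over_nominal(rejections, n_samples, nominal=0.05):
    """Return the observed rejection rate and its distance from the
    nominal level, measured in Monte-Carlo standard errors."""
    observed = rejections / n_samples
    std_error = math.sqrt(nominal * (1.0 - nominal) / n_samples)
    return observed, (observed - nominal) / std_error

# Counts implied by the reported results: 12% and 24% of 200, 5.2% of 500.
cases = [
    ("Simulation I, 20 additional parameters", 24, 200),
    ("Simulation I, 40 additional parameters", 48, 200),
    ("Simulation II, 4 additional parameters", 26, 500),
]
for label, rejections, n_samples in cases:
    observed, z = excess_over_nominal(rejections, n_samples)
    print(f"{label}: observed {observed:.3f}, {z:.1f} standard errors from 0.05")
```

Under these assumptions the two Simulation I rates lie several standard errors above 5%, while the Simulation II rate is well within one standard error of it.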

Reassessment of Fitted Model

The selected model for the actual dataset was a logistic

model with explanatory variables 'pond', 'father-type',

'individual cuckolder father', 'mother', 'parental father

crossed mother interaction' and 'cuckolder father crossed mother

interaction'.

Judging from the result of the Monte-Carlo studies, the

'individual parental father' effect should definitely be

excluded from the model while 'pond', 'father-type', 'individual

cuckolder father' and 'mother' should be in the model. However

there is doubt whether the 40 interaction terms should be

included since the Monte-Carlo studies showed that the

chi-square approximation of the improvement statistic tends to

understate the p-value when degrees of freedom are large. The

p-value for including the 'cuckolder father crossed mother

interaction' was 0.032. Assuming a 0.03 level of significance

test, the Monte-Carlo studies indicated that 9.2% of the time

the 20 more parameters were unnecessarily included. Thus there

is more evidence that 'cuckolder father crossed mother

interaction' can be left out of the model.

On the other hand, the p-value of the improvement of fit by

including the 'parental father crossed mother interaction' was

0.006. With a significance level of 0.006, the Monte-Carlo


studies indicated that 6 out of 200 times, or 3% of the time,

the 20 more parameters were unnecessarily included. Thus the

'parental father crossed mother interaction' effect was not as

significant as it appeared. It may be desirable to exclude it

from the model as well.
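The recalibration used here amounts to asking how often a statistic simulated under the smaller model is at least as extreme as the observed one. A minimal sketch, assuming the simulated improvement statistics are available as a list; the numbers below are hypothetical and merely arranged to reproduce the "6 out of 200" case, and the function name is invented for this illustration.

```python
def observed_level(simulated_stats, observed_stat):
    """Fraction of simulated statistics at least as large as the observed one."""
    exceed = sum(1 for s in simulated_stats if s >= observed_stat)
    return exceed / len(simulated_stats)

# Hypothetical statistics: 200 simulated values, 6 of which reach the observed one.
simulated = [10.0] * 194 + [50.0] * 6
level = observed_level(simulated, observed_stat=40.0)
print(level)
```

With only 200 simulated samples such a proportion is itself imprecise, so it is best read as a rough recalibration of the nominal p-value.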

Therefore there are two possible models, A and B. Model-A

has larger improvement in log likelihood and has 54 parameters.

Model-B is more parsimonious with 14 parameters.

Genetic-Environmental Interaction

With Model-B, genetic-environmental interaction effect was

studied by including pond by father-type interaction parameters

in the model. The improvement chi-square statistic was not

significant at the 5% level (p-value = 0.09). Thus this data set

cannot be said to reveal an effect of interactions upon

precociality.


CHAPTER IV

CHECKING ADEQUACY OF MODELS - RESIDUAL ANALYSIS

In the last Chapter, two models A and B were identified.

Model-A has explanatory variables 'pond', 'father-type',

'individual cuckolder father', 'individual mother', 'parental

father crossed mother interaction' and 'cuckolder father crossed

mother interaction'. Model-B is similar to Model-A but with no

interaction effects.

In this chapter and the next, we examine whether the two

models adequately fit the data. Residual analysis and

goodness-of-fit tests are employed. The former is dealt with in

the following paragraphs and the latter will be discussed in

Chapter 5.

Residual analysis plays an important role in checking the

adequacy of a model. Residual plots can reveal points with large

residuals or patterns in the residuals. Points with large

residuals may indicate outliers that should be carefully checked

in the original data. Patterns in the plots suggest the need of

a better model.

Residuals for Linear Logistic Models

There are different ways of constructing residuals for

linear logistic models. McCullagh and Nelder [1983] suggest the

Pearson residual and adjusted deviance residual. The PLR program

Page 37: Statistical analysis of bluegill sunfish data using linear

adjusted estimate of the variance.

1. Pearson Residual

The Pearson residual is the simple residual scaled by the
estimated standard deviation of $Y_i/n_i$:

\[ r_i = \frac{y_i/n_i - \hat\theta_i}{\sqrt{\hat\theta_i(1-\hat\theta_i)/n_i}} \]

It ignores the variation in the estimate of $\theta_i$.

2. Residual used in PLR

The residual used in the PLR program is the residual
scaled by the standard error of the residual:

\[ \frac{y_i/n_i - \hat\theta_i}{\text{standard error of residual}} \]

It takes into account the variation in $\hat\theta_i$. To the first
order of approximation, the square of the standard error of
residual is given by

\[ \frac{\hat\theta_i(1-\hat\theta_i)}{n_i} - \left(\frac{\partial\theta_i}{\partial\beta}\right)^{T} I^{-1} \left(\frac{\partial\theta_i}{\partial\beta}\right) \]

where

\[ \theta_i = \frac{\exp(X_i\beta)}{1+\exp(X_i\beta)} \]

and I is the information matrix of the data.


The standardized residuals are approximately standard

normal.
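The difference between the two residuals can be sketched in a few lines. This is an illustration under stated assumptions, not the PLR implementation: the gradient of the fitted probability is taken to be theta(1 - theta) times the covariate row, which holds for the logistic form, and the inverse information matrix is supplied by the caller. The covariate row and matrix below are hypothetical.

```python
import math

def pearson_residual(y, n, theta):
    """Simple residual scaled by the estimated s.d. of y/n."""
    return (y / n - theta) / math.sqrt(theta * (1.0 - theta) / n)

def standardized_residual(y, n, theta, x, info_inv):
    """Residual scaled by its first-order standard error.

    x is the covariate row of the cell and info_inv the inverse of the
    information matrix (both assumed to come from a fitted model).
    """
    # Gradient of theta with respect to the parameters (logistic form).
    grad = [theta * (1.0 - theta) * xj for xj in x]
    k = len(grad)
    quad = sum(grad[a] * info_inv[a][b] * grad[b]
               for a in range(k) for b in range(k))
    variance = theta * (1.0 - theta) / n - quad
    return (y / n - theta) / math.sqrt(variance)

# Hypothetical cell: 5 successes out of 10, fitted probability 0.3.
x_row = [1.0, 0.0]
info_inv = [[0.02, 0.0], [0.0, 0.05]]
print(pearson_residual(5, 10, 0.3))
print(standardized_residual(5, 10, 0.3, x_row, info_inv))
```

Because the correction term is subtracted from the variance, the standardized residual is never smaller in absolute value than the Pearson residual for the same cell.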

3. Adjusted Deviance Residual

McCullagh and Nelder [1983] recommend that if $n_i$ is
small or $\hat\theta_i$ is near 0 or 1, the adjusted deviance residual
defined as follows should be used:

where the sign is that of $y_i - n_i\hat\theta_i$.

A program was written to compute these residuals for
Model-A. It was found that when $n_i$ is small and $\hat\theta_i$ is close
to 0 or 1, the term

dominates. As a result the residual becomes very large even
when $\hat\theta_i$ is very close to $y_i/n_i$. Thus the adjusted deviance

residual is not a very appropriate measure of model adequacy

for this data set.


Checking Adequacy of Models

In view of the above comparison, the standardized residual

provided by PLR was used in residual analysis for this data set.

Two residual plots were used. One plots residuals against fitted
values, $\hat\theta_i$. It can be used to check if the spread of residuals
is approximately constant and independent of $\hat\theta_i$. The other plots

ordered residuals against the quantiles of a standard normal

distribution (normal probability plot). It can be used to check

whether the residuals are approximately normal as the asymptotic

theory predicts.
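The coordinates of a normal probability plot can be computed without a plotting library. The sketch below assumes the plotting positions (i - 0.5)/m for the i-th ordered residual out of m, which is one common convention and not necessarily the one used by PLR; the residual values are hypothetical.

```python
from statistics import NormalDist

def normal_plot_points(residuals):
    """Pair each ordered residual with a standard normal quantile."""
    ordered = sorted(residuals)
    m = len(ordered)
    std_normal = NormalDist()
    # Zero-based index i corresponds to plotting position (i + 0.5) / m.
    quantiles = [std_normal.inv_cdf((i + 0.5) / m) for i in range(m)]
    return list(zip(quantiles, ordered))

# Hypothetical standardized residuals.
points = normal_plot_points([0.3, -1.2, 2.0])
for quantile, residual in points:
    print(f"{quantile:8.3f} {residual:8.3f}")
```

If the residuals are approximately standard normal, these points should fall close to a straight line through the origin with unit slope.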

1. Model-A

The plot of the residuals versus $\hat\theta_i$ for Model-A is shown
in Figure A.1. Except for two points in the upper left
corner, the residuals do not exhibit any significant pattern
of variation with the fitted values.

The two cells with large variances were both fathered by

parental fish and were stocked in the same pond. The

residuals of these two cells are also the only residuals

that deviate significantly from the other points in the

normal probability plot (Figure A.2).

There is no evidence that these two cells are outliers

in the sense of misrecording or misclassification. They

indicate that the model under investigation cannot remove

Page 40: Statistical analysis of bluegill sunfish data using linear

Other than these two points, the residual plots show that

Model-A is adequate.

2. Model-B

The explanatory variables in this model were 'pond',

'father-type', 'individual cuckolder father' and 'individual

mother'. Since there was no parental father effect involved,

the PLR program automatically collapsed data of individual

parental fathers resulting in 56 covariate patterns rather

than 96. As a result only 56 residuals were computed by the

PLR program and large errors, if they existed, in some of

the cells collapsed would not be revealed. Thus residuals

for those cells collapsed were computed individually using

the expression (**). Plots with 56 cells and 96 cells are

presented in Figures B1.1, B1.2, B2.1, B2.2.

With 56 cells, the plot of residuals versus $\hat\theta_i$ does not

demonstrate any obvious pattern. Also on the normal

probability plot, all residuals lie very close to a

straight line.

With 96 cells, there is a point which lies far away from
the rest of the points in the plot of residuals versus $\hat\theta_i$.

Also it does not lie close to a straight line as the others

do in the normal probability plot. This point is one of the

two points which have large variances in Model-A.


Other than this point, the residual plots suggest that

Model-B is also adequate.


CHAPTER V

CHECKING ADEQUACY OF MODELS - GOODNESS-OF-FIT TEST

Another powerful tool in checking the adequacy of a model is

the use of goodness-of-fit tests. Goodness of fit in this

context refers to a measure of how well the model fits the data,

that is, how well the $\theta_i$'s are modelled in the form

\[ \theta_i = \frac{\exp(X_i\beta)}{1+\exp(X_i\beta)}. \]

It does not check whether the sampling distribution is a

binomial distribution or not.

In this chapter, some summary measures of goodness-of-fit

are described. Then the results of a Monte-Carlo study to assess

the quality of the chi-square approximation to the distribution

of these statistics are discussed. Lastly, the Models A and B

are checked using these goodness-of-fit statistics.

There are three summary statistics provided by the PLR

program for testing goodness-of-fit of the model to the data.

They are Hosmer's goodness-of-fit test, Brown's goodness-of-fit

test and likelihood ratio statistic. Apart from these three

statistics, Pearson's chi-square statistic was also included in

this process of checking the adequacy of the model.


Hosmer's Goodness-of-fit Statistic

Hosmer's goodness-of-fit statistic is similar to Pearson's
chi-square statistic except that it groups the N cells into
fewer cells, say g. The cells whose predicted probabilities $\hat\theta_i$
lie between $C_{j-1}$ and $C_j$ are grouped into one cell, where

The $C_j$'s can depend on the data such that N/g values of $\hat\theta_i$ fall
in each interval or the $C_j$'s can be fixed constants.

In the PLR program, g is taken to be 10 and $C_j = j/g$ for
j = 0, 1, ..., g.

The expected probability of each of these g cells is
estimated by a weighted average of the $\hat\theta_i$ in the grouped cell.

Thus a 2xg contingency table can be formed:

(2 x g contingency table with columns for cells 1, 2, ..., g and a Total row.)
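The grouping step with fixed cut points can be sketched as follows. This illustration makes two assumptions that the text does not settle: a cell whose predicted probability falls exactly on a cut point is placed in the upper interval (except for a probability of exactly 1, which goes to the last group), and the weights in the weighted average are the cell sizes. The data are hypothetical and the function name is invented for this sketch; the statistic itself is not computed here.

```python
def group_cells(y, n, theta, g=10):
    """Group cells by predicted probability using cut points j/g.

    Returns one dictionary per non-empty group with the observed number
    of successes, the group size, and the size-weighted mean of the
    predicted probabilities.
    """
    groups = [{"observed": 0, "size": 0, "expected": 0.0} for _ in range(g)]
    for yi, ni, ti in zip(y, n, theta):
        j = min(int(ti * g), g - 1)  # a probability of exactly 1 joins the last group
        groups[j]["observed"] += yi
        groups[j]["size"] += ni
        groups[j]["expected"] += ni * ti
    table = []
    for group in groups:
        if group["size"] > 0:
            group["mean_prob"] = group["expected"] / group["size"]
            table.append(group)
    return table

# Hypothetical cells: successes, cell sizes, predicted probabilities.
table = group_cells(y=[1, 0, 3, 4], n=[10, 5, 6, 4],
                    theta=[0.05, 0.08, 0.55, 0.95])
for row in table:
    print(row)
```

With fixed cut points some groups may be empty or very small, which is one reason data-dependent cut points are sometimes preferred.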

Page 44: Statistical analysis of bluegill sunfish data using linear
Page 45: Statistical analysis of bluegill sunfish data using linear

where

\[ \frac{d}{d\theta} F(\theta, m_1, m_2) = \frac{\exp(\theta m_1)\,(1+\exp\theta)^{-(m_1+m_2)}}{B(m_1, m_2)} \]

and $B(m_1, m_2)$ is the beta function.

This class of models includes the logistic, probit, extreme

minimum value, extreme maximum value, double exponential,

exponential and reflected exponential models.

Brown's test assumes that $X_i\beta$ is correct but that the link

function g might be wrong.

It uses a score test in which the test statistic

asymptotically has a chi-square distribution with two degrees of

freedom. Readers are referred to a paper by Prentice [1976].

Likelihood Ratio Statistic

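The likelihood ratio statistic is defined as

\[ G^2 = 2\sum_i \left[ y_i \log\frac{y_i}{n_i\hat\theta_i} + (n_i - y_i)\log\frac{n_i - y_i}{n_i(1-\hat\theta_i)} \right] \]

where $y_i$ is the number of successes in cell i of size $n_i$,
and $\hat\theta_i$ is the predicted probability of success in cell i.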


$G^2$ is twice the difference between the maximum log

likelihood achievable and that achieved by the model under

investigation.

As discussed in the previous chapter, the PLR program

automatically collapses cells based on the parameters specified

in the model. The likelihood ratio statistic provided by the

program is then based on the collapsed data. The statistic thus

computed may have a better chi-square approximation when there

are many sparse cells. However, it may not be able to reflect

the true structure of the data. Very different observed

probabilities of success may be averaged out and thus important

information may be lost. Therefore a program has been written to

obtain the likelihood ratio statistic (and also the Pearson's

chi-square statistic) based on all 96 cells, irrespective of

the parameters in the model.

The likelihood ratio statistic is distributed asymptotically

as a chi-square variate with degrees of freedom M-p where p is

the number of unknown parameters. When the statistic is computed

using the collapsed data, M is the number of effective covariate

patterns. Based on the uncollapsed data, M is the number of

non-empty cells.


Pearson's Chi-square Statistic

Pearson's chi-square statistic is defined as

\[ X^2 = \sum_i \frac{(y_i - n_i\hat\theta_i)^2}{n_i\hat\theta_i(1-\hat\theta_i)} \]

where $y_i$, $n_i$ and $\hat\theta_i$ are the same as for the likelihood ratio
statistic.

The asymptotic distribution of $X^2$ is chi-square with degrees
of freedom M-p, the same as for $G^2$.
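Both statistics are straightforward to compute over all cells once the fitted probabilities are known. The sketch below uses the standard binomial forms of the likelihood ratio (deviance) and Pearson statistics, with the usual convention that a term with zero count contributes nothing to the logarithmic sum; the function names and data are hypothetical, and fitted probabilities are assumed to lie strictly between 0 and 1.

```python
import math

def likelihood_ratio_statistic(y, n, theta):
    """Twice the log-likelihood gap between the saturated and fitted models."""
    total = 0.0
    for yi, ni, ti in zip(y, n, theta):
        if yi > 0:
            total += yi * math.log(yi / (ni * ti))
        if yi < ni:
            total += (ni - yi) * math.log((ni - yi) / (ni * (1.0 - ti)))
    return 2.0 * total

def pearson_statistic(y, n, theta):
    """Sum of squared Pearson residuals over the cells."""
    return sum((yi - ni * ti) ** 2 / (ni * ti * (1.0 - ti))
               for yi, ni, ti in zip(y, n, theta))

# Hypothetical sparse cells, including one with no successes.
y = [0, 3, 1]
n = [2, 4, 1]
theta = [0.5, 0.6, 0.7]
print(likelihood_ratio_statistic(y, n, theta))
print(pearson_statistic(y, n, theta))
```

For a cell with no successes the two statistics can differ noticeably, which is consistent with their different behaviour on sparse data reported in the Monte-Carlo study.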

Monte-Carlo Study of Quality of Chi-square Approximation

All four goodness-of-fit statistics are approximated

asymptotically by a chi-square distribution. Thus it is

desirable to assess the quality of the approximation using

Monte-Carlo simulation.

As described in Chapter 2, 500 samples were simulated using

the explanatory variables 'pond', 'father-type', 'individual

cuckolder father'. Another 200 samples were simulated using the

same set of explanatory variables together with 'individual

mother' effect. All four goodness-of-fit statistics were

computed and a nominal level of 0.05 was used to assess the

quality of the chi-square approximation.


The results of the Monte-Carlo study are summarized in Table 4
below.

Table 4

Simulation Results of Goodness-of-fit Statistics:
Observed Significance Level at a Nominal 5% Level

Goodness-of-fit statistic                    Simulation I   Simulation II
Brown's Statistic                            5.0%           5.4%
Pearson's Statistic
  Based on Collapsed Cells                   5.5%           4.4%
  Based on 96 Cells                          6.5%           5.2%
Likelihood Ratio Statistic
  Based on Collapsed Cells   Original        9.0%           9.4%
                             Corrected       0.5%           1.8%
  Based on 96 Cells          Original        19.5%          27.4%
                             Corrected       0%             0.2%


1. Hosmer's Statistic

Results in Table 4 indicate that the Hosmer's statistic

is not a sensitive test for this data set. However it must

be noted that the PLR program assumes the statistic is

approximately chi-square distributed with g-2 degrees of

freedom. It ignores the fact that when the number of

parameters p is greater than g, the first term in the

expression for the asymptotic distribution of the Hosmer's

statistic, $\chi^2(g-p)$, is meaningless.

The model used to simulate the first 200 samples has 14

parameters, which exceeds the value of 10 set for g by the PLR

program. Thus these 200 simulated datasets cannot be used to

judge the quality of Hosmer's statistic.

The model used to simulate the second 500 samples has 10

parameters. Hosmer's test rejects the true model only once

out of 500 times when the significance level is set to be

0.05. Thus Hosmer's statistic as provided by the PLR program

should not be used to test models with 10 or more

parameters.

2. Brown's Statistic

Brown's statistic achieved the desired level in the

Monte-Carlo study. At a nominal level of 0.05, it rejected

the true model 10 times out of the first 200 simulated

samples and 27 times out of the second 500 simulated


samples. The results of this small-scale Monte-Carlo study

showed that the chi-square approximation to Brown's

statistic was very good. Thus it is an appropriate statistic

to check adequacy of a model.

3. Pearson's Statistic

Based on the collapsed cells and at a nominal level of

5%, Pearson's chi-square test rejected the true model 11

times out of the first 200 simulated samples and 22 times

out of the second 500 simulated samples. Based on all 96

cells, the observed significance levels were 6.5% and 5.2%

in Simulations I and II respectively. Expected cell sizes as

small as 0.2 did not seem to jeopardize its performance in

this Monte-Carlo study. Thus a general guideline suggested

by some statisticians that Pearson's statistic be used only

when the minimal expected cell size exceeds 5 may appear too

conservative in this context.

4. Likelihood Ratio Statistic

Based on the collapsed cells and at a significance level

of 5%, the likelihood ratio test rejected the true model 9%

of the time in the first 200 simulated data sets and 9.4% of

the time in the second 500 simulated data sets. Based on all

96 cells, the respective observed significance levels were

19.5% and 27.4%.


It has been suggested in McCullagh and Nelder
that the chi-square approximation of the likelihood ratio
statistic can be improved by means of a first-order
correction term $(1+c)^{-1}$ where

and N, p, $n_i$, $\hat\theta_i$ are defined as before.

This correction was included in the Monte-Carlo study. Based

on collapsed cells, the corrected likelihood ratio statistic

rejected the true model only 0.5% and 1.8% of the time in Simulations I
and II respectively. Based on all 96 cells, the respective

observed significance levels were 0% and 0.2%. The results

suggest that the corrected statistic is not appropriate for

analysing this dataset.

5. Summary of Simulation Results

The following is a summary of the result of the

Monte-Carlo study. Hosmer's statistic should not be used

when the number of parameters in the model is 10 or more.

The PLR program should be able to adjust the value of g for

different models. Brown's goodness-of-fit statistic and

Pearson's chi-square statistic are well approximated by the

chi-square distribution. They provide reliable tests for

checking adequacy of a model. On the other hand, the

likelihood ratio statistic tends to reject a model more than


it should. Its p-value tends to be understated.

Unfortunately the correction factor does not seem to improve

the quality of the chi-square approximation. Thus the

likelihood ratio statistic should be used with caution.

Checking Adequacy of Models

Taking into consideration the results of the Monte-Carlo

study, the various goodness-of-fit statistics are now used to

check the Models A and B.

1. Model-A

Brown's goodness-of-fit statistic was not significant

(p-value = 0.65). This indicates that with the same set of

explanatory variables, a logistic model is appropriate

relative to the class of model with density given on Page

33.

Hosmer's goodness-of-fit statistic provided by the PLR

program could not be used because the number of parameters

is 54 which is much larger than the value of g of 10 set by

the PLR program.

The likelihood ratio statistic had a p-value of 0.08

suggesting that the logistic model with 54 parameters fits

the data.


Pearson's chi-square statistic was significant (p-value

= 0.03). Individual cells' contributions to the Pearson's

statistic were examined. The two cells with large errors

that were identified in the residual analysis account for

nearly 50% of the statistic. If these two cells were

ignored, the Pearson's statistic would have been

insignificant.

To conclude, other than the two cells with large errors,

Model-A can be considered adequate.

2. Model-B

Brown's goodness-of-fit statistic was again not

significant (p-value = 0.16) suggesting the logistic model

is appropriate.

Hosmer's goodness-of-fit statistic could not be used to

test this model as well because the number of parameters,

14, is greater than the number of grouped cells of 10.

The likelihood ratio statistic provided by the PLR

program was based on collapsed cells (56) and had a p-value

of 0.04. This suggests the model does not fit the data

adequately. The corresponding Pearson's goodness-of-fit

statistic had a p-value of 0.15.

When these statistics were computed using 96 cells, they

were both very significant. Both p-values were about 0.001


with degrees of freedom of 73.

It may appear that Model-B is not adequate since

Pearson's statistic was significant. However, a careful

examination of individual cells showed that the cell

detected in the residual analysis to have a large error

accounted for 22% of Pearson's statistic. Ignoring this

cell, the Pearson's statistic would be insignificant.

The likelihood ratio statistic also had a highly

significant p-value of 0.001. The Monte-Carlo study results

showed that its p-value tends to be understated. The

observed significance level was 19.5% as compared to the

expected level of 5%. At a 0.001 significance level, the

observed level was 0.005. Moreover, the cell that had large

error accounted for "7% of the statistic. Thus there is

evidence that the Model-B under study is adequate as

suggested by Brown's and Pearson's statistics.

Comparison of Model-A and Model-B

From residual analyses and goodness-of-fit tests performed,

it was found that Model-A and Model-B were both adequate.

Model-A has a larger improvement in log-likelihood than Model-B.

Also by examining the plots of the observed versus the predicted

proportion precocious in Figures A.3, B1.3 and B2.3, Model-A

produced points which lie more closely to a straight line than

Model-B. Thus Model-A resulted in a closer fit to the data but


at the expense of using 54 parameters whereas Model-B is more

parsimonious with 14 parameters. They were both accepted and

tests were carried out on each of them.


CHAPTER VI

TESTS

Using the same notation as in Chapter 3, Model-A is:

and Model-B is:

If $\alpha_1$ and $\alpha_2$ denote parental and cuckolder father-types
respectively, then $\beta_{k(1)} = 0$ for all k because there is no
individual parental father effect.

The following tests are performed individually for each

model.

1. Estimation of Ratio of Odds of being Precocious given Father was a Cuckolder to Odds given Father was a Parental

One of the main objectives of the analysis is to test

whether cuckolder fathers produce more precocious sons than

parental fathers. This is equivalent to determining whether

the odds of being precocious given that the father was a

cuckolder is significantly higher than the odds of being

precocious given that the father was a parental. Let


Then $\omega_{ijkl}$ is the odds of being precocious for a male of
father k of father-type i and of mother l in pond j. Also

let

0 = n u i"' j A 1 i jkl

Then $\omega_{1\cdots}$ is the odds of being precocious given father was
a parental. Similarly $\omega_{2\cdots}$ is the odds of being precocious
given father was a cuckolder.

For both Models,

a. Model-A

An estimate of $\alpha_2$ is $\hat\alpha_2 = 0.4154$. Thus the ratio
$\omega_{2\cdots}/\omega_{1\cdots}$ is estimated to be 2.30. In other words the
odds of becoming precocious are approximately 2.3 times
as high with a cuckolder father as with a parental
father.
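The estimate of 2.30 can be checked by hand. The worked step below is a sketch that assumes the two father-type effects are constrained to sum to zero, so that $\alpha_1 = -\alpha_2$. That constraint is not stated in this section, but it is consistent with the reported figures.

```latex
\[
\frac{\hat\omega_{2\cdots}}{\hat\omega_{1\cdots}}
  = \exp(\hat\alpha_2 - \hat\alpha_1)
  = \exp(2\hat\alpha_2)
  = \exp(2 \times 0.4154)
  = \exp(0.8308)
  \approx 2.30
\]
```

Under the same assumption, the endpoints of a confidence interval for $\alpha_2$ are carried to the odds-ratio scale by doubling and exponentiating them.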


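An approximate 95% confidence interval for $\alpha_2$ is
given by

\[ \hat\alpha_2 \pm t_{df,\,0.025}\ \mathrm{s.e.}(\hat\alpha_2) \]

where df = 96-9-54 = 33,
and $\mathrm{s.e.}(\hat\alpha_2)$ is the estimated standard error of $\hat\alpha_2$.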

Thus an approximate 95% confidence interval for
$\omega_{2\cdots}/\omega_{1\cdots}$ is (1.27, 4.15). Since it does not include
1, it can be concluded that cuckolder males produce more
precocious progeny than parental males.

b. Model-B

The ratio $\omega_{2\cdots}/\omega_{1\cdots}$ is estimated to be 2.37 and

its approximate 95% confidence interval is given by

(1.50, 3.74). Since the confidence interval does not

include 1, it can be concluded that cuckolder males

produce more precocious progeny than parental males.

As compared to the confidence interval using the

Model-A, this one is tighter because of smaller standard

error and larger degrees of freedom.

c. Summary of Results

Results of both models suggest that cuckolder males

produce more precocious progeny than parental males.


Thus there is evidence that precocial maturity in the

bluegill sunfish is genetically inherited.

2. Ranking Individual Cuckolder Fathers

For the purpose of identifying 'high lines' and 'low

lines' of cuckolder fathers for future breeding experiments,

it is desirable to rank individual cuckolder fathers by

their contribution to the odds of being precocious.

The estimated value of the parameter $\beta_{k(2)}$ can be
regarded as a measure of the contribution of the k-th
cuckolder father to the odds of being precocious.

To determine whether a cuckolder father is statistically

different from another in terms of their contribution to

odds of being precocious, Bonferroni confidence intervals

are calculated.
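The Bonferroni adjustment splits the family error rate equally across all pairwise comparisons. The sketch below shows the arithmetic; it assumes a normal critical value for simplicity, whereas the analysis may well have used a t critical value, and the number of groups passed in is a hypothetical input.

```python
from statistics import NormalDist

def bonferroni_critical_value(n_groups, family_confidence=0.90):
    """Per-comparison error rate and two-sided normal critical value
    for all pairwise comparisons among n_groups groups."""
    n_pairs = n_groups * (n_groups - 1) // 2
    alpha_each = (1.0 - family_confidence) / n_pairs
    critical = NormalDist().inv_cdf(1.0 - alpha_each / 2.0)
    return n_pairs, alpha_each, critical

# Hypothetical example with six groups compared pairwise.
n_pairs, alpha_each, critical = bonferroni_critical_value(6)
print(n_pairs, alpha_each, critical)
```

Two groups would then be declared different when the interval for the difference of their estimated parameters, built with this critical value, excludes zero.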

By comparing the estimated $\beta_{k(2)}$'s, the individual
cuckolder fathers can be ranked, from largest to
smallest:

From the 90% Bonferroni confidence intervals

presented in Table 5, we can conclude that C5 is


different from C3 and C6 is different from C3.

Similarly by comparing the $\beta_{k(2)}$'s estimated for
Model-B, the individual cuckolder fathers can be ranked,
from largest to smallest:

In this model, C4 ranks higher than C2 while C2 ranks

higher than C4 in Model-A.

The 90% Bonferroni confidence intervals in Table 6

suggest that no pairs of cuckolder fathers are

significantly different. To have some indication of
difference among the cuckolder fathers, the 80%
Bonferroni confidence intervals were also calculated. This time, C5

is significantly different from C3.

3. Ranking Individual Mothers

For future breeding purposes, the individual mothers

were also ranked according to their contribution to odds of

being precocious.

The sum of the parameters $\tau_j$ and $\gamma_{l(j)}$ is used as a
measure of the contribution to odds of being precocious of
the l-th mother in the j-th pond. By comparing the estimated


values of these parameters, both models yield the same

ranking from largest to smallest :

The Bonferroni confidence intervals are also calculated to

determine whether two mothers are statistically different.

The 90% Bonferroni intervals in Table 7 show that M1

is significantly different from M7, M4, M6, M5 and M3

whereas M3 is significantly different from M2, M8, M7

and M4.

The 90% Bonferroni intervals in Table 8 show that M3
is significantly different from M1, M2, M8 and M7.

From the results in ranking of cuckolder fathers and

mothers, Model-A seems to suggest more variation within

the fathers and the mothers than Model-B.


CHAPTER VII

ANALYSIS OF ENVIRONMENTAL EFFECTS

With either Model-A or Model-B, the environmental effect,

namely the 'pond' effect, is a very important effect. In this

chapter, it is further analysed.

Three candidates are explored as agents for the 'pond'

effect:

1. the density of fish in the pond,

2. the sex ratio of fish in the pond in terms of number of

males per 100 fish, and

3. growth rate due to the pond environment.

A true density variable is not available as the density of

fish in the pond changes with time. However, since the four

ponds are approximately of the same size, initial and final

populations for each pond can be used as substitutes for

density. Moreover it is likely that most of the mortality in the

pond occurred shortly after the initial stocking. Therefore the

final population may reflect the true densities in the ponds.

Previous analysis by Dr. M. Gross has indicated that

precocious sons are significantly larger than other offspring at

the age of two. Variables such as average body length of all

fish in a pond cannot be used as a measure of growth rate due to

the pond environment. This is because ponds with a large

proportion of precocious sons would be expected to have a longer


average body length. However by assuming that the effects of the

environment on how fast fish grow is the same for both males and

females, the average body length of females in a pond can be

used as a measure of growth rate for that pond. The average body

lengths of females in the four ponds were 55.50 cm., 55.97 cm.,

59.12 cm. and 54.87 cm. respectively.

For each of the Models A and B, the 'pond' effect is taken

out of the model, and replaced by 'initial population', 'final

population', 'sex ratio' and 'growth rate' one at a time.

The results for the two models are similar. 'Final

population' is found to be the most significant effect. With

'final population' in the model, are the other three factors

needed in the model? Adding in 'initial population', 'sex ratio'

or 'growth rate' after 'final population' is in the model

results in no significant improvement. But adding in 'final

population' after any of these three effects is in the model

results in a very significant improvement. Therefore, 'final

population' is definitely in the model and there is no point in

including any of the other three effects.
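Each of these comparisons is an improvement chi-square test for adding one covariate to a model. The sketch below assumes each pond-level covariate enters as a single parameter, so that the test has one degree of freedom; in that case the upper tail of the chi-square distribution can be written with the complementary error function. The log-likelihood values are hypothetical.

```python
import math

def improvement_chi_square(loglik_base, loglik_extended):
    """Twice the gain in maximized log-likelihood from the extra term."""
    return 2.0 * (loglik_extended - loglik_base)

def p_value_one_df(statistic):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))

# Hypothetical maximized log-likelihoods with and without one extra covariate.
stat = improvement_chi_square(loglik_base=-310.2, loglik_extended=-309.4)
print(stat, p_value_one_df(stat))
```

A large p-value for the extra covariate, together with a small one when the order of entry is reversed, corresponds to the pattern described above for 'final population'.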

The result suggests that whether a male becomes precocious

or not depends on the number of fish in the pond, but does not

depend on how fast it grows or the sex ratio in the pond.


Since the estimated final population parameter in both

models is negative, the odds of being precocious increase as the

final population decreases.


CHAPTER VIII

CONCLUSION

Linear logistic regression is an appropriate model for

analysing this data set. The PLR program of the BMDP Statistical

Software is a powerful program for logistic models. However it

has the drawback of collapsing cells, sometimes undesirably.

Moreover, in calculating the Hosmer's goodness-of-fit statistic,

it should be able to adjust the number of grouped cells, g, to

cater for different models.

From the results of the analysis of the data set, it is

found that precocial maturity in the bluegill sunfish is

genetically inherited, that is, cuckolder fathers tend to

produce more precocious sons than parental fathers. Findings

show that maternal and paternal genes are important and perhaps

maternal and paternal genetic effects interact. Environmental

effects exist and there is evidence that the odds of being

precocious increase as the final population decreases. However

there is no significant evidence for environmental crossed

genetic interaction.

The small scale Monte-Carlo studies suggest that the

improvement chi-square statistic is a good criterion for

selecting variables when the number of additional parameters is

small. However it tends to include too many high order

interactions because the chi-square approximation to its

distribution is not good when the number of degrees of freedom


is large.

Hosmer's goodness-of-fit statistic provided by the PLR

program of the BMDP Statistical Software is not sensitive when

the number of parameters is large. Brown's goodness-of-fit

statistic and Pearson's chi-square statistic are well-behaved.

On the other hand, the likelihood ratio statistic tends to

reject a true model more than it should.

Thus the improvement chi-square statistic and the

goodness-of-fit statistics should be used with caution. The

appropriateness of these statistics is likely to depend on the

data set and the complexity of the model. It is advisable to

perform Monte-Carlo studies, whenever possible, to investigate

the appropriateness of these statistics to the data set being

analysed.

Page 67: Statistical analysis of bluegill sunfish data using linear

Fi g u r e A. 1

Plot of Residual versus Predicted Proportion Precocious -- Model-A

PREDICTED PROPORTI ON PRECOCIOUS

Page 68: Statistical analysis of bluegill sunfish data using linear

Fi gur e A. 2

Normal Probability Plot

QUANTILE OF NORMAL DISTRIBUTION

Page 69: Statistical analysis of bluegill sunfish data using linear

Figure A.3

Plot of Observed versus Predicted Proportion Precocious

[Plot omitted; horizontal axis: Predicted Proportion Precocious, 0.0 to 1.00]

Page 70: Statistical analysis of bluegill sunfish data using linear

Figure B1.1

Plot of Residual versus Predicted Proportion Precocious -- Model-B (Collapsed Cells)

[Plot omitted; horizontal axis: Predicted Proportion Precocious]

Page 71: Statistical analysis of bluegill sunfish data using linear

Figure B1.2

Normal Probability Plot -- Model-B (Collapsed Cells)

[Plot omitted; horizontal axis: Quantile of Normal Distribution]

Page 72: Statistical analysis of bluegill sunfish data using linear

Figure B1.3

Plot of Observed versus Predicted Proportion Precocious -- Model-B (Collapsed Cells)

[Plot omitted; horizontal axis: Predicted Proportion Precocious]

Page 73: Statistical analysis of bluegill sunfish data using linear

Figure B2.1

Plot of Residual versus Predicted Proportion Precocious -- Model-B (96 Cells)

[Plot omitted; horizontal axis: Predicted Proportion Precocious]

Page 74: Statistical analysis of bluegill sunfish data using linear

Figure B2.2

Normal Probability Plot -- Model-B (96 Cells)

[Plot omitted; horizontal axis: Quantile of Normal Distribution]

Page 75: Statistical analysis of bluegill sunfish data using linear

Figure B2.3

Plot of Observed versus Predicted Proportion Precocious -- Model-B (96 Cells)

[Plot omitted; horizontal axis: Predicted Proportion Precocious]

Page 76: Statistical analysis of bluegill sunfish data using linear

Table 1.1

Number of Fish Survived Including Unsacrificed

MOTHER

1 2 3 4 5 6 7 8 TOTAL

Page 77: Statistical analysis of bluegill sunfish data using linear

Table 1.2

Number of Fish Survived and Sacrificed

MOTHER

1 2 3 4 5 6 7 8 TOTAL

Page 78: Statistical analysis of bluegill sunfish data using linear

Number of Female Fish Sacrificed

MOTHER

1 2 3 4 5 6 7 8 TOTAL

Page 79: Statistical analysis of bluegill sunfish data using linear

Number of Male Fish Sacrificed

MOTHER

1 2 3 4 5 6 7 8 TOTAL

Page 80: Statistical analysis of bluegill sunfish data using linear

Table 1.5

Number of Precocious Males Sacrificed

MOTHER

1 2 3 4 5 6 7 8 TOTAL

Page 81: Statistical analysis of bluegill sunfish data using linear

Table 5

Ranking of Individual Cuckolder Fathers: Model-A

90% Bonferroni Confidence Intervals

Page 82: Statistical analysis of bluegill sunfish data using linear

Table 6

Ranking of Individual Cuckolder Fathers: Model-B

Bonferroni Confidence Intervals

90% Confidence 80% Confidence

Page 83: Statistical analysis of bluegill sunfish data using linear

Ranking of Individual Mothers: Model-A

90% Bonferroni Confidence Intervals

Page 84: Statistical analysis of bluegill sunfish data using linear

Table 8

Ranking of Individual Mothers: Model-B

90% Bonferroni Confidence Intervals

Page 85: Statistical analysis of bluegill sunfish data using linear

APPENDIX A

To Find Variance of Residual used in PLR

Using the same notation as before, the variance of the $i$th

residual used in PLR is

Since

$$\theta_i = \frac{\exp(X_i\beta)}{1+\exp(X_i\beta)} \quad\text{and}\quad \hat\theta_i = \frac{\exp(X_i\hat\beta)}{1+\exp(X_i\hat\beta)},$$

by Taylor's Theorem,

Therefore,

and

The log of the likelihood of $y_1, \ldots, y_N$ is

$$\log L = \text{constant} + \sum_{i=1}^{N} y_i \log\theta_i + \sum_{i=1}^{N} (n_i - y_i)\log(1-\theta_i).$$
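As a hedged illustration of the quantities in this derivation, the sketch below evaluates the log-likelihood for the logistic model, its gradient with respect to the coefficients (the score), and the matrix $\sum_i n_i\theta_i(1-\theta_i)X_i'X_i$, which for this model is the information matrix. The function names and the small data set are hypothetical.

```python
import math

def theta(x, beta):
    """Logistic probability exp(x.beta) / (1 + exp(x.beta))."""
    eta = sum(xj * bj for xj, bj in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-eta))

def log_lik(X, y, n, beta):
    """Binomial log-likelihood up to the additive constant."""
    total = 0.0
    for xi, yi, ni in zip(X, y, n):
        t = theta(xi, beta)
        total += yi * math.log(t) + (ni - yi) * math.log(1.0 - t)
    return total

def score(X, y, n, beta):
    """Gradient of log_lik: sum over i of (y_i - n_i * theta_i) * x_i."""
    u = [0.0] * len(beta)
    for xi, yi, ni in zip(X, y, n):
        r = yi - ni * theta(xi, beta)
        for j, xij in enumerate(xi):
            u[j] += r * xij
    return u

def information(X, n, beta):
    """Sum over i of n_i * theta_i * (1 - theta_i) * x_i x_i'; y does not enter."""
    p = len(beta)
    info = [[0.0] * p for _ in range(p)]
    for xi, ni in zip(X, n):
        t = theta(xi, beta)
        w = ni * t * (1.0 - t)
        for j in range(p):
            for k in range(p):
                info[j][k] += w * xi[j] * xi[k]
    return info

# Hypothetical data: intercept plus one covariate, three binomial cells.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1, 2, 3]
n = [4, 4, 4]
beta = [0.1, -0.2]
print(log_lik(X, y, n, beta), score(X, y, n, beta), information(X, n, beta))
```

Because the information matrix here does not involve the observed counts, the negative derivative of the score and its expectation coincide for the logistic model.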

Page 86: Statistical analysis of bluegill sunfish data using linear

Again by Taylor's Theorem,

where the information matrix is

$$I = -E[\partial U(\beta)/\partial\beta] = -\partial U(\beta)/\partial\beta,$$

and finally,

Page 87: Statistical analysis of bluegill sunfish data using linear

BIBLIOGRAPHY

Bishop, Y.M.M., Fienberg, S.E. and Holland, P.W. [1975]. Discrete Multivariate Analysis, MIT Press, Cambridge, MA.

Cox, D.R. [1970]. The Analysis of Binary Data, Methuen, London.

Cox, D.R. and Snell, E.J. [1968]. "A General Definition of Residuals", J. R. Statist. Soc., B, 30, 248-275.

Fienberg, S.E. [1980]. The Analysis of Cross-Classified Data, 2nd Ed., MIT Press, Cambridge, MA.

Haberman, S.J. [1974]. The Analysis of Frequency Data, University of Chicago Press, Chicago.

Hosmer, D.W. and Lemeshow, S. [1980]. "Goodness of Fit Tests for the Multiple Logistic Regression Model", Commun. Statist. - Part A Theor. Meth., A9(10), 1043-1069.

Larntz, K. [1973]. "Small Sample Comparisons of Exact Levels for Chi-Squared Goodness-of-Fit Statistics", J. Amer. Statist. Assoc., 73, 253-263.

McCullagh, P. and Nelder, J.A. [1983]. Generalized Linear Models, Chapman and Hall, London.

Plackett, R.L. [1981]. The Analysis of Categorical Data, Charles Griffin, London.

Prentice, R.L. [1976]. "A Generalization of the Probit and Logit Methods for Dose Response Curves", Biometrics, 32, 761-768.

UCLA [1983]. BMDP Statistical Software, University of California Press.

