
L.Camilleri 1

Introductory Course

in

SPSS

by

Prof. Liberato Camilleri


1. Methods of Sampling

Sampling Theory

• Sampling theory is the study of the relationships between a population and a sample drawn from it.
• Sampling theory is useful in estimating unknown quantities, such as the population mean and variance, from knowledge of the corresponding sample quantities.
• Sampling theory is also useful in determining whether the observed differences between two samples are due to chance variation or whether they are really significant.
• Statistical inference is the study of the inferences made about a population by using samples drawn from it.

Random Sampling

A sample of n items is said to be chosen by random sampling from a population if:
• Every member of the population has the same chance of being included in the sample.
• The members of the sample are chosen independently of each other (the choice of one member is not influenced by the other chosen members).

Proper random sampling requires that we have a list of all N items in the population, so that we can assign each item one of the numbers from 1 to N. Such a list is called a sampling frame. The use of sampling frames makes it easy to draw random samples with the aid of computers or random number tables. Unfortunately, there are many situations in which it is not possible to construct a sampling frame. For instance, if we want to use a sample to estimate the mean height of the trees in a forest, it would be impossible to number the trees, choose random numbers and then locate and measure the corresponding trees. In these situations, the elements of a random sample must be chosen haphazardly. That is, we must not select or reject any element of a population because of its seeming typicalness or lack of it, nor must we favour or ignore any part of a population because of its accessibility or lack of it.
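When a sampling frame is available, drawing the sample is mechanical. The following Python sketch is an illustration outside SPSS; the frame of N = 1000 numbered items, the sample size of 20 and the seed are all hypothetical.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct items from the sampling frame, each equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical sampling frame: items numbered 1..N
N = 1000
frame = list(range(1, N + 1))
sample = simple_random_sample(frame, n=20, seed=1)
print(sorted(sample))
```

Fixing the seed makes the draw reproducible, which is useful when the selection has to be documented.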

Systematic Sampling

In some instances, the most practical way of sampling is to select, say, every 20th name on a list, every 12th house on one side of a street, every 50th component coming off an assembly line, and so on. This is called systematic sampling, and an element of randomness can be introduced into this kind of sampling by using random numbers to pick the unit with which to start. Although a systematic sample may not be a random sample in accordance with the definition, in some instances it actually provides an improvement over a random sample inasmuch as the sample is spread more evenly over the entire population. The real danger in systematic sampling lies in the possible presence of hidden periodicities. For instance, if we inspect every 40th piece made by a particular machine, the results would be very misleading if, because of a regularly recurring failure, every 10th piece produced by the machine has blemishes.
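The danger of hidden periodicities can be seen in a small Python sketch of the machine example above. It is an illustration only; the production run of 4000 pieces and the seed are hypothetical, while the sampling step of 40 and the defect on every 10th piece come from the text.

```python
import random

def systematic_sample(population_size, step, seed=None):
    """Pick a random start in 1..step, then take every step-th unit."""
    rng = random.Random(seed)
    start = rng.randint(1, step)
    return list(range(start, population_size + 1, step))

def blemished(piece):
    """Defect pattern from the text: every 10th piece is blemished."""
    return piece % 10 == 0

# Hypothetical production run of 4000 pieces, inspecting every 40th
pieces = systematic_sample(population_size=4000, step=40, seed=3)
share = sum(blemished(p) for p in pieces) / len(pieces)
print(f"sampled {len(pieces)} pieces, blemished share = {share:.0%}")
```

Because 40 is a multiple of 10, every inspected piece falls at the same position in the defect cycle, so the sample shows either no blemished pieces or only blemished pieces, whereas the true share is 10%.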


Stratified Sampling

If we have information about the composition of a population and this is of relevance to our investigation, we may be able to improve on random sampling by stratification. This is a procedure that consists of dividing the population into a number of non-overlapping sub-populations, or strata, and then taking a random sample from each stratum. In stratified sampling the strata are often sampled in proportion to their size, which means that the sizes of the samples from different strata are proportional to the sizes of the strata.

Stratification is not restricted to a single variable of classification. Populations are often stratified according to several characteristics. For instance, in a survey to determine the opinion of the Maltese people towards the European Union, one might stratify the sample not only with respect to the 13 districts, but also with respect to the people’s sex and age. This is called cross stratification and is widely used in opinion sampling and market research because it increases the reliability (precision) of estimates.

In stratified sampling, the cost of taking random samples from the individual strata is often so high that interviewers are simply given quotas to be filled from the different strata with few restrictions. This is called quota sampling and is very convenient; however, the resulting samples do not normally have the essential features of random samples. In the absence of any controls, interviewers tend to select individuals who are most readily available: persons who work in the same building or perhaps reside in the same area. Quota samples are essentially judgement samples, and inferences based on such samples generally do not lend themselves to any sort of formal evaluation.

Cluster Sampling

Cluster sampling is particularly useful when the population members are widely scattered geographically. In cluster sampling, the total population is divided into a number of small subdivisions (clusters) and some of these clusters are randomly selected for inclusion in the overall sample. If the clusters are geographic subdivisions, this kind of sampling is also called area sampling. This kind of sampling is effective when random sampling is impossible because suitable lists are not available and the cost of contacting people scattered over a wide area is very high. It is easier and cheaper to interview people living close together in clusters than to select them at random over a wide area. Although estimates based on cluster samples are usually not as reliable as estimates based on random samples, they are often more feasible with regard to cost.

Multistage Sampling

Most large-scale surveys combine different types of sampling. For instance, if a government wants to study the attitude of teachers towards certain federal programs, statisticians might first stratify the country by towns. They might then use cluster sampling, subdividing each stratum into a number of subdivisions (schools), and finally they might use random sampling or systematic sampling to select a sample of teachers within each cluster.
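Proportional allocation, as described under stratified sampling, can be sketched in a few lines of Python. This is an illustration only: the strata, their sizes and the total sample size are hypothetical, and the largest-remainder rounding is one reasonable choice rather than a rule stated in this course.

```python
def proportional_allocation(strata_sizes, total_sample):
    """Allocate total_sample across strata in proportion to stratum size.

    Largest-remainder rounding keeps the allocations summing to total_sample.
    """
    population = sum(strata_sizes.values())
    exact = {k: total_sample * v / population for k, v in strata_sizes.items()}
    alloc = {k: int(x) for k, x in exact.items()}
    shortfall = total_sample - sum(alloc.values())
    # Give the remaining units to the strata with the largest fractional parts
    by_remainder = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:shortfall]:
        alloc[k] += 1
    return alloc

# Hypothetical strata (age groups) and their population sizes
strata = {"18-34": 5000, "35-54": 3000, "55+": 2000}
allocation = proportional_allocation(strata, total_sample=200)
print(allocation)
```

A random sample of the allocated size would then be drawn independently within each stratum.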


2. Determining Sample Size

The sample size calculator can be used to estimate the confidence interval (the margin of error, expressed as plus or minus a percentage) given the sample size, confidence level and population size. If the population size is unknown, leave the value blank.

Example
Find the confidence interval when conducting a study on a sample of 350 respondents selected from a very large population, assuming a 95% confidence level.
Confidence interval = 5.24%

Example
Find the confidence interval when conducting a study on a sample of 211 respondents selected from a population of 1008 individuals, assuming a 95% confidence level.
Confidence interval = 6%


The sample size calculator can also be used to estimate the sample size given the confidence interval, confidence level and population size. If the population size is unknown, leave the value blank.

Example
Find the sample size that should be selected from a very large population if the requested confidence interval is 3%. Assume a 95% confidence level.
Required sample size = 1067

Example
Find the sample size that should be selected from a population of 1800 individuals if the requested confidence interval is 5%. Assume a 99% confidence level.
Required sample size = 486
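The notes do not state which formula the calculator uses. As a hedged check, the standard normal-approximation formula for a proportion, taking the most conservative proportion p = 0.5 and applying a finite population correction when the population size is known, appears to reproduce the four results quoted above. In this Python sketch the function names are hypothetical, and the values z = 1.96 (95%) and the rounded z = 2.58 (99%) are assumptions.

```python
import math

def margin_of_error(n, z=1.96, population=None, p=0.5):
    """Half-width of the confidence interval for a proportion.

    p = 0.5 is the most conservative choice; a finite population
    correction is applied when the population size is given.
    """
    se = math.sqrt(p * (1 - p) / n)
    if population is not None:
        se *= math.sqrt((population - n) / (population - 1))
    return z * se

def sample_size(margin, z=1.96, population=None, p=0.5):
    """Sample size needed to reach the requested margin of error."""
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)
    return round(n0)

print(round(100 * margin_of_error(350), 2))                   # 5.24
print(round(100 * margin_of_error(211, population=1008), 2))  # 6.0
print(sample_size(0.03))                                      # 1067
print(sample_size(0.05, z=2.58, population=1800))             # 486
```

With the more precise z = 2.576 the last example gives 485 rather than 486, so the quoted figure seems to rely on the rounded value.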


2. Overview of Data Analysis in SPSS

2.1 Data Entry

In SPSS, variables are defined in the Variable View. Each variable is created by specifying a name for it. Factors are declared by specifying a label and a value for each level of the categorical variable. As an illustration we provide the following case study.

In a study two groups of respondents were picked at random. The experimental group suffered from cardiac problems and the control group was not known to suffer from heart problems. All the members of the two groups were known to make daily use of a treadmill. These 22 respondents were asked to fill in a questionnaire specifying their age, gender, weight and mean duration of daily treadmill use, measured in minutes. They were also asked to indicate whether or not they had cardiac problems.

What is the mean duration of your daily use of a treadmill? ________ (minutes)
Do you have cardiac problems? ______
Gender: _______
What is your weight? _____ (kg)
What is your age? _____ (years)

Gender and cardiac condition are two factors (qualitative variables) each having two levels (categories). These levels have to be labelled and enumerated. For the cardiac condition, the value 1 represents an unhealthy respondent with heart problems and the value 2 represents a healthy respondent. For the gender, the value 1 represents a male respondent and the value 2 represents a female. No levels have to be specified for the variables age, weight and duration of daily treadmill use because they are covariates.


In SPSS the data is entered in the Data View. Data files are presented in a rectangular arrangement where the rows represent the respondents and the columns represent the variables. A row contains the information provided by a particular respondent for all the variables, and a column contains the information for a particular variable provided by all the respondents.

A further task was to generate another factor by classifying the respondents’ ages into three age categories. This could be done explicitly by SPSS using the Recode option. This option recodes any age value into an appropriate age category and then saves it in the generated factor ‘Age groups’.
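The logic of such a recode can be sketched outside SPSS. In this Python illustration the cut points, the labels and the ages are all hypothetical, since the notes do not state the three age categories that were used.

```python
from bisect import bisect_right

# Hypothetical cut points and labels; the course does not state the ones used
CUT_POINTS = [40, 60]          # below 40, 40 to 59, 60 and over
LABELS = ["Under 40", "40-59", "60 and over"]

def recode_age(age):
    """Return the age-group label for a single age value."""
    return LABELS[bisect_right(CUT_POINTS, age)]

ages = [23, 39, 40, 59, 60, 71]   # hypothetical ages
age_groups = [recode_age(a) for a in ages]
print(age_groups)
```

Each age falls into exactly one category, and the boundary values (40 and 60 here) go to the upper category.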


2.2 Graphical Presentations

A histogram is an important graphical presentation which shows the distribution of values of a covariate (quantitative variable). These values are first divided into equally spaced intervals and then the frequency (count) of cases in each interval is plotted as a bar. A histogram can be created by choosing Graphs and Histogram from the menus. A normal curve with the same mean and variance as the data can be superimposed onto the histogram. It can be used to assess the symmetry of the distribution.

The distribution (histogram) of the mean duration of daily treadmill use can be generated by moving the covariate ‘duration’ into the variable list and selecting Display Normal Curve. It is evident that the largest proportion of respondents use the treadmill for 12 to 14 minutes daily. The distribution of the mean duration of daily treadmill use is fairly normal. It is possible, in SPSS, to modify the minimum, maximum and increment of the scale values. To make these modifications, activate the chart editor by double clicking on the graph, highlight the values on the axis and then select Scale from the Properties tab.


It is also possible to modify the number of bars in the histogram and change the style and colour of the inside of the histogram. To conduct this alteration, activate the chart editor by double clicking on the graph, highlight the histogram and then select Fill and Border from the Properties tab.

The number of intervals or the interval widths is modified by selecting the Histogram Option from the Properties tab.

It is possible to generate two separate histograms of the mean duration of daily treadmill use for the healthy and sick groups. This is done by moving the factor ‘cardiac’ into Panel by rows. It is evident that healthy respondents use the treadmill for a longer period of time compared to sick respondents.


Pie charts are used to analyze factors (qualitative variables). In a pie chart the different levels of a factor are represented by the sectors of a circle. The size of each slice is proportional to the size of its respective category. For example, a pie chart showing the percentage of respondents in the experimental and control groups can be created by selecting pie charts from the graphs menu. Slices can represent either frequencies or percentages of cases. To define the slices, drag the categorical variable ‘Cardiac’ to slice by. Pie chart properties can be modified by clicking the right mouse button. The counts or percentages can be displayed on the pie chart by selecting data labels.

[Figure: pie chart of ‘Do you have cardiac problems?’; No 54.55%, Yes 45.45%]

Using this property window it is possible to separate slices by selecting explode chart.

[Figure: pie chart with separated (exploded) slices; No 54.55%, Yes 45.45%]

From this property window it is possible to change a pie chart to a bar chart, line chart or area chart. In a bar chart the frequency or percentage value of each factor level is represented by a vertical bar. Larger values are represented by longer bars. In a line chart each frequency or percentage value is represented by a point. These points are connected by straight lines. An area chart is a line chart with the space below the line filled in. Bar, line and area charts can also be created by selecting bar, line and area from the graphs menu. All graphs show a higher proportion of respondents in the control group.

[Figures: two charts of the percentage of respondents by cardiac problems (No 54.55%, Yes 45.45%); vertical axis from 0 to 60]

One can also create pie, bar and area charts to display the proportion of respondents in the control and experimental groups separately for males and females. These charts can be generated by moving ‘Cardiac’ to the category axis and ‘Gender’ to the column panel.

[Figures: pie and bar charts of cardiac problems (No, Yes) panelled by Gender (female, male); vertical axis Percent; labelled values 22.73%, 36.36%, 31.82% and 9.09%]

All graphs show that a higher proportion of males than females are in the experimental group (respondents with cardiac problems). The graphs also show that a higher proportion of females than males are in the control group (respondents who do not report any cardiac problems).

[Figure: chart of Cardiac (No, Yes) panelled by Gender (female, male); vertical axis Percent from 0.0% to 40.0%; labelled values 22.7%, 36.4%, 31.8% and 9.1%]

Another way of representing categories is with clustered charts. Clustered area charts can be generated by selecting area and stacked from the chart menu. In these graphs the areas for all the factor levels have the same baseline. In the following clustered area chart the category axis is defined by ‘Cardiac’ and area is defined by ‘Gender’.

[Figure: clustered area chart; category axis Cardiac (No, Yes), areas defined by Gender (female, male); vertical axis Percent from 0.0% to 120.0%; labelled values 77.8%, 38.5%, 22.2% and 61.5%]


Clustered bar charts can be generated by selecting bar and clustered from the chart menu. In the following clustered bar chart the category axis is defined by ‘Cardiac’ and the clusters are defined by ‘Gender’. Error bars can be displayed on bar charts from the option menu. The bars display the 95% confidence interval and help the analyst visualize distributions and dispersion by indicating the variability of the measure being displayed.

[Figure: clustered bar chart; category axis Cardiac (No, Yes), clusters defined by Gender (female, male); vertical axis Percent from 0.0% to 100.0%]

A bivariate scatter plot is used to analyze two covariates simultaneously and is plotted along two axes. This graphical presentation of data points reveals important relationships between the covariates. It can also reveal outliers and unusual combinations of data points. Points that do not fit a relationship well stand out in the plot. The procedure is to click on Graphs and Scatter/Dot… and select Simple Scatter. The axes of the scatter plot are defined by moving ‘duration’ to the y-axis and the respondent’s ‘age’ to the x-axis. The line of best fit can be obtained by double clicking on the graph to activate the chart editor and then selecting Add fit line at Total to produce the regression line.


It is evident from the plot that young respondents are more likely to use the treadmill for a longer daily duration compared to elderly ones.

The data points can be clustered either by cardiac condition or by gender. These two graphs can be produced by moving the factors ‘cardiac’ and ‘gender’, in turn, into set markers by. Separate fit lines for the two clusters can be obtained by double clicking on the graph to activate the chart editor and then selecting Add fit line at Subgroups to produce separate regression lines.

The daily treadmill use is on average longer for healthy respondents compared to sick ones. This difference becomes more conspicuous with an increase in the respondents’ age. The reduction of the mean daily treadmill use as the respondents get older applies to both experimental and control groups.


The second scatter plot does not demonstrate any gender bias with regards to the daily treadmill use. Duration of treadmill use decreases with age for both male and female respondents.

It is also possible to produce scatter plots for three covariates simultaneously; however, it is very difficult to visualize the relationships between all three covariates unless the scatter plot is rotated about an axis. This is carried out by clicking on 3D scatter and then defining the axes by moving ‘duration’, ‘age’ and ‘weight’ to the y, x and z-axis respectively. In a 3D space we get a plane of best fit rather than a line.


Simple box plots, sometimes called box-and-whisker plots, characterize the distribution and dispersion of a covariate, displaying its median and quartiles across the levels of a factor. The median is the 50th percentile and the interquartile range ranges from the 25th to the 75th percentile. Whiskers at the ends of the box show the distance from the end of the box to the largest and smallest observed values that are less than 1.5 box lengths from either end of the box. Data points that fall outside this range are labelled as outliers or extreme values and their position is identified. Box plots are created by choosing Graphs and Boxplot… from the menus. In a simple box plot the selected variable must be a covariate and the category axis must be defined by a factor. This simple box plot demonstrates the distribution of respondents’ weights for both the experimental and control groups. The median weights for the two groups are respectively 91.5kg and 73kg. This implies that half the respondents in the experimental group weigh more than 91.5kg and half the respondents in the control group weigh less than 73kg. An interesting observation is that the lower quartile (25th percentile) for the experimental group and the upper quartile (75th percentile) of the control group are almost equal. This implies that 75% of the respondents in the experimental group weigh more than 75% of the respondents in the control group.


It is possible to generate two separate box plots showing the distribution of respondents’ weights for both males and females. This is conducted by moving the factor ‘gender’ in the Panel by rows. An interesting observation is that male respondents weigh more than females for both the experimental and control groups. It is also evident that sick male respondents weigh significantly more than healthy ones but this is not so evident for females. Three data points are marked as outliers because they lie between 1.5 box lengths and 3 box lengths from the end of the box. Any data point which lies beyond 3 box lengths is marked with an asterisk.
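The rule used to mark outliers and extreme values can be sketched in Python. This is an illustration only: the weights are hypothetical rather than the course data, the function name is an assumption, and the quartiles come from Python's `statistics.quantiles`, which may differ slightly from the box ends that SPSS draws.

```python
import statistics

def flag_outliers(values):
    """Split values into outliers (1.5 to 3 box lengths) and extremes (beyond 3)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    box = q3 - q1  # box length = interquartile range
    outliers, extremes = [], []
    for v in values:
        # Distance beyond the nearer end of the box (negative inside the box)
        distance = max(q1 - v, v - q3)
        if distance > 3 * box:
            extremes.append(v)
        elif distance > 1.5 * box:
            outliers.append(v)
    return outliers, extremes

# Hypothetical weights in kg, not the course data
weights = [60, 62, 63, 64, 65, 66, 67, 68, 69, 80, 90]
outliers, extremes = flag_outliers(weights)
print(outliers, extremes)
```

For these values the box runs from 63 to 69, so 80 lies between 1.5 and 3 box lengths from the box and 90 lies beyond 3 box lengths.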

It is possible to combine the two plots in a single clustered box plot. Clustered box plots display the distribution of a covariate across two factors. In the subsequent clustered box plot the selected variable is the respondent’s weight whereas the category axis and the clusters are respectively defined by cardiac condition and gender. The plot exhibits the same contrasts displayed in the preceding plot.


2.3 Analyzing multiple responses

In the case study presented, the 22 respondents were further asked to indicate the type of food that they prefer eating given four possible food categories. These food options were pasta, fish, meat and vegetables, and the respondents were allowed to select more than one option. The four food options have to be defined explicitly by four categorical variables because each cell can hold only one data entry. The first categorical variable indicates whether the respondents prefer pasta or not. For instance, the second respondent prefers pasta and fish whereas the fourth respondent prefers pasta, meat and vegetables.

Multiple responses are analyzed through a multiple response frequency procedure. This produces frequency tables for multiple response sets.

To generate a single combined set of these four food categories, choose Multiple Responses from the menus. The new set of food categories is defined by moving pasta, fish, meat and vegetables into the new set, which is labelled ‘Preferred Food’. The levels of this factor are defined by entering the range of categories from 1 to 4.

Crosstabs are very useful when analyzing associations between factors. It is also possible to get cross-tabulations of any number of factors by choosing Multiple Responses and Crosstabs from the menus. To examine the association between the respondent’s health and preferred food, one needs to specify which of these two factors defines the crosstab rows and which defines the columns. In this example we define the levels of preferred food by the crosstab rows and the health categories by the crosstab columns. It is evident from the crosstab that respondents with cardiac problems are more likely to eat vegetables and fish whereas healthy respondents are more likely to eat pasta and meat.

Crosstab of preferred food by cardiac problems (Yes, No); cell entries are counts

PreferredFood     Yes     No   Total
Pasta               4      9      13
Fish                8      7      15
Meat                3      9      12
Vegetables          9      2      11
Total              10     12      22

An alternative method is to stack the entries of these four categorical variables pasta, fish, meat and vegetables to explicitly generate this new factor ‘Preferred Food’. Stack also the entries of the factor ‘Cardiac’ four times to generate a new expanded factor such that both factors have 88 entries. To obtain a crosstab select Descriptive Statistics and Crosstab from the menus. Since the numbers of respondents in the two health categories are unequal it is advisable to produce column percentage to make correct associations of the preferred food categories for the two health groups. A clustered bar graph can also be produced to display, graphically, these associations.

Crosstab of preferred food by cardiac problems (Yes, No) with column percentages

PreferredFood                    Yes       No    Total
Pasta         Count                4        9       13
              Percentage       16.7%    33.3%    25.5%
Fish          Count                8        7       15
              Percentage       33.3%    25.9%    29.4%
Meat          Count                3        9       12
              Percentage       12.5%    33.3%    23.5%
Vegetables    Count                9        2       11
              Percentage       37.5%     7.4%    21.6%
Total         Count               24       27       51
              Percentage      100.0%   100.0%     100%
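The column percentages in this crosstab can be reproduced from the counts alone. The Python sketch below is an illustration outside SPSS; the function name is hypothetical, and the counts are those reported in the crosstab.

```python
# Counts of each preferred food by cardiac problems (Yes, No), from the crosstab
counts = {
    "Pasta": (4, 9),
    "Fish": (8, 7),
    "Meat": (3, 9),
    "Vegetables": (9, 2),
}

def column_percentages(table):
    """Express each cell as a percentage of its column total."""
    n_cols = len(next(iter(table.values())))
    totals = [sum(row[j] for row in table.values()) for j in range(n_cols)]
    return {
        label: tuple(round(100 * row[j] / totals[j], 1) for j in range(n_cols))
        for label, row in table.items()
    }

percentages = column_percentages(counts)
for food, (yes, no) in percentages.items():
    print(f"{food:<12}{yes:>6}%{no:>6}%")
```

Dividing by the column totals (24 and 27 responses) rather than the overall total puts the two unequal health groups on the same footing.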

For each preferred food category the bar lengths vary considerably between the two health groups, demonstrating graphically the association described above.

2.4 Methods for describing data sets

Numerical descriptive measures computed from a sample are very useful for making inferences about the corresponding population measures. A number of numerical methods are available to describe quantitative data sets. Each of these methods measures one of four data characteristics.

1. Measures of central tendency (location)

Central tendency is the tendency of the data to cluster about a certain numerical value. The most popular measure of central tendency is the sample mean. The sample mean $\bar{x}$ is simply the average of the n observations $x_i$.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The median is another measure of central tendency. This is the middle observation when all the observations are arranged in ascending order (or the average of the two middle observations when n is even). The third measure of central tendency is the mode. This is the value in the sample which occurs most frequently.

2. Measures of dispersion (variability)

Dispersion is the extent to which the data are spread about the mean. The sample standard deviation, s, is the most popular measure of dispersion. It is the square root of the sample variance given by

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$


The range is another measure of dispersion and it is the difference between the largest and the smallest observations. This is a rather plain, insensitive measure of dispersion and is hardly ever used.

3. Measures of relative standing

Measures of relative standing describe the position of an observation relative to the rest of the data. One measure of the relative standing of an observation is its percentile ranking. The observations are ranked from smallest to largest and the pth percentile is the number such that p% of the observations fall below this value. The lower quartile, median and upper quartile are respectively the 25th, 50th and 75th percentiles. The interquartile range is the distance between the lower and upper quartiles. Percentile rankings are of practical value only for large data sets.

4. Measures of the distribution of the data set

The skewness characterizes the degree of asymmetry of a distribution around its mean. Negative skewness indicates a distribution which is skewed to the left. Positive skewness indicates a distribution which is skewed to the right. Many naturally occurring continuous variables, such as people’s heights and examination marks, have a normal distribution, which is symmetric. This is the most widely used distribution in Statistics.

The kurtosis characterizes the relative peakedness or flatness of a distribution compared with the normal distribution. The skewness and kurtosis of the normal distribution are both zero. Negative kurtosis indicates a relatively flat distribution compared to the normal distribution, whereas positive kurtosis indicates a relatively peaked distribution.

The Frequencies procedure of SPSS provides the most important summary statistics. Some of these statistics require that the data follow a normal distribution (or at least that the shape of the variable’s histogram be symmetric). In particular, the mean, standard deviation, variance and skewness should be used with caution unless the distribution is fairly symmetric and has no extreme outliers. A descriptive statistic is called robust if the calculations are insensitive to violations of the assumption of normality. This category includes the median, mode, minimum and maximum values, range and quartiles. It is necessary to use graphics, such as histograms with a normal curve, to determine whether the variables summarized have approximately a normal distribution.

 9  12  15  21  24  26  31  31  38  39
43  44  45  47  47  49  52  52  54  56
56  56  57  58  58  63  64  64  65  67
68  68  70  73  73  74  77  79  80  82
84  84  86  87  88  88  90  93  95  96


The above table shows the marks obtained by 50 students in a Mathematics examination. The sample was chosen randomly from a large school population.

From the menus select Descriptive Statistics and Frequencies and move the vector of raw marks into the variables list. From the dialogue box select the statistics mean, median, mode, quartiles, standard deviation, variance, range, skewness and kurtosis to measure the central tendency, variability, symmetry and peakedness of the distribution of marks.

Statistics: marks

Mean                  59.56
Median                60.50
Mode                     56
Std. Deviation       23.037
Variance            530.700
Skewness              -.404
Kurtosis              -.574
Range                    87
Percentiles   25      44.75
              50      60.50
              75      79.25

The three measures of central tendency indicate that the average mark is approximately 60. In a perfectly symmetric distribution the mean, median and mode should be equal. The fact that these three measures differ from each other indicates that the distribution is skewed. The bigger the difference between these three measures, the less symmetric the distribution will be. The marks range from 9 to 96, explaining why the standard deviation is large. If the marks were clustered closer to the mean, one would expect a smaller standard deviation. Both the skewness and the kurtosis have a negative value, indicating that the distribution of marks is skewed to the left and is flatter than the normal distribution. This can be verified by plotting a histogram and displaying the normal curve.
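Most of the summary statistics reported above can be checked with Python's standard library. This sketch is an illustration outside SPSS. The default method of `statistics.quantiles` happens to give the same quartiles as the SPSS output for these data. Skewness and kurtosis are left out because the standard library has no function for them.

```python
import statistics

# The 50 examination marks listed in the table above
marks = [
    9, 12, 15, 21, 24, 26, 31, 31, 38, 39,
    43, 44, 45, 47, 47, 49, 52, 52, 54, 56,
    56, 56, 57, 58, 58, 63, 64, 64, 65, 67,
    68, 68, 70, 73, 73, 74, 77, 79, 80, 82,
    84, 84, 86, 87, 88, 88, 90, 93, 95, 96,
]

summary = {
    "mean": statistics.mean(marks),
    "median": statistics.median(marks),
    "mode": statistics.mode(marks),
    "std_dev": statistics.stdev(marks),        # sample standard deviation (n - 1)
    "variance": statistics.variance(marks),    # sample variance (n - 1)
    "range": max(marks) - min(marks),
    "quartiles": statistics.quantiles(marks, n=4),
}

for name, value in summary.items():
    print(f"{name:<10}{value}")
```

The results agree with the SPSS table: mean 59.56, median 60.5, mode 56, standard deviation 23.037, variance 530.700, range 87 and quartiles 44.75, 60.5 and 79.25.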


The lower and upper quartiles are respectively 44.75 and 79.25. This implies that 25% of the students got a mark less than 45 and another 25% of the sample got a mark higher than 79.

2.5 Types of Reliability

There are two classes of reliability testing and each estimates reliability in a different way. These include:

• Inter-rater, intra-rater and test-retest reliability
• Internal consistency reliability

Inter-rater, intra-rater and test-retest reliability

Inter-rater reliability, inter-rater agreement, or concordance is the degree of agreement among raters and assesses the homogeneity in the raters’ evaluations of the same item. In other words, it assesses the consistency with which different raters produce similar evaluations in judging the same abilities or characteristics in the same target person or object.

Intra-rater reliability is the degree of agreement among repeated administrations of a test performed by a single rater. Test-retest reliability is a form of intra-rater reliability and assesses the homogeneity or agreement in the rater’s evaluations when the same test is administered on two different occasions.

There are several tests to assess inter-rater, intra-rater and test-retest reliability; however, the choice of test depends on the evaluation scale of the raters rather than the type of reliability. If the evaluation scale is nominal (true, false or good, bad or present, absent) then the Kappa test is recommended. If the evaluation scale is ordinal (poor, moderate, good, excellent or never, rarely, sometimes, often, always) then the Gamma and Kendall tau b and c tests are all appropriate and yield similar results. If the evaluations have an interval or metric scale then the absolute agreement intra-class correlation is recommended to assess inter-rater, intra-rater or test-retest reliability. The following examples illustrate the procedure.

Example 1 Two doctors A and B assess 30 patients independently on 4 mental disorders (anxiety, psychotic, personality and obsessive compulsive disorder).

Patient   Doctor A               Doctor B
1         Anxiety disorder       Psychotic disorder
2         Psychotic disorder     Psychotic disorder
3         Personality disorder   Personality disorder
4         Obsessive disorder     Obsessive disorder
…
8         Anxiety disorder       Anxiety disorder
9         Obsessive disorder     Psychotic disorder
…
21        Anxiety disorder       Obsessive disorder
…
29        Obsessive disorder     Personality disorder
…

To assess inter-rater reliability the Kappa test will be used because the doctors’ evaluations have a nominal scale. To get the output, click on Analyze, Descriptive Statistics and Crosstabs. Move the variables ‘A’ and ‘B’ in the row and column slots, click on Statistics and select Kappa. Click on Continue and OK to get the output.

Crosstab of Doctor A (rows) by Doctor B (columns)

                                        Doctor B
Doctor A               Anxiety    Psychotic   Personality   Obsessive
                       disorder   disorder    disorder      disorder
Anxiety disorder          4          1            0             1
Psychotic disorder        0          9            0             0
Personality disorder      0          0            7             0
Obsessive disorder        0          1            1             6

Symmetric Measures

                                 Value   Std. Error   Approx. T   P-value
Measure of Agreement   Kappa     .820    0.083        7.715       0.000
Number of valid cases            30

The crosstab shows a large percentage agreement (26/30 x 100% = 86.7%), indicating strong inter-rater reliability. Kappa values greater than 0.75 indicate excellent agreement beyond chance; values in the range 0.4 to 0.75 indicate fair to good agreement; and values below 0.4 indicate poor agreement. The p-value (approximately 0) is less than the 0.05 criterion and indicates that the Kappa value (0.820) is significantly different from 0, indicating excellent inter-rater reliability.
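The Kappa value can be reproduced from the crosstab with the usual definition of Cohen's kappa, which compares the observed agreement with the agreement expected by chance from the row and column totals. The Python sketch below is an illustration outside SPSS; the function name is hypothetical and the counts are those of the crosstab above.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square agreement table (rows: rater A, columns: rater B)."""
    n = sum(sum(row) for row in table)
    # Observed agreement: share of cases on the diagonal
    observed = sum(table[i][i] for i in range(len(table))) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Agreement expected by chance from the marginal totals
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (observed - expected) / (1 - expected)

# Doctor A (rows) by Doctor B (columns):
# anxiety, psychotic, personality, obsessive
crosstab = [
    [4, 1, 0, 1],
    [0, 9, 0, 0],
    [0, 0, 7, 0],
    [0, 1, 1, 6],
]
kappa = cohen_kappa(crosstab)
print(round(kappa, 3))  # 0.82
```

Here the observed agreement is 26/30 and the chance agreement is 235/900, which gives the 0.820 reported by SPSS.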

Example 2 A teacher was asked to assess a child on 20 socio-emotional behaviour difficulties using a 5-point Likert (ordinal) scale, ranging from ‘Strongly disagree’ to ‘Strongly agree’. The child was assessed by the teacher on two separate occasions, one week apart.

Socio-emotional behaviour difficulties               Before               After
Often complains of headaches, stomach-aches          Strongly agree       Agree
Many worries, often seems worried                    Agree                Strongly agree
Often unhappy, downhearted or tearful                Strongly agree       Agree
Nervous or clingy in new situations                  Neutral              Agree
Many fears, easily scared                            Agree                Neutral
Often has temper tantrums or hot tempers             Disagree             Strongly disagree
Generally disobedient                                Strongly disagree    Strongly disagree
Often fights with other children or bullies them     Strongly disagree    Disagree
Often lies or cheats                                 Disagree             Strongly disagree
Steals from home, school or elsewhere                Strongly disagree    Disagree
Restless, overactive, cannot stay still for long     Neutral              Agree
Constantly fidgeting or squirming                    Strongly disagree    Disagree
Easily distracted, concentration wanders             Disagree             Disagree
Acts hastily without thinking                        Disagree             Strongly disagree
Never finishes a task, poor attention span           Strongly disagree    Strongly disagree
Rather solitary, tends to play alone                 Agree                Strongly agree
Has no friends                                       Disagree             Disagree
Generally disliked by other children                 Strongly disagree    Disagree
Picked on or bullied by other children               Agree                Neutral
Gets on better with adults than children his age     Agree                Neutral


To assess test-retest (intra-rater) reliability, the Kendall tau-b, Kendall tau-c and Gamma tests will be used because the teacher’s evaluations have an ordinal scale. To get the output, click on Analyze, Descriptive Statistics and Crosstabs. Move the variables ‘Before’ and ‘After’ into the row and column slots, click on Statistics and select Kendall’s tau-b, Kendall’s tau-c and Gamma. Click on Continue and OK to get the output.

These three measures are based on concordant and discordant pairs of cases. A pair is concordant if one case has larger values than the other case on both variables. A pair is discordant if one case has the larger value on the first variable but the smaller value on the second. When the two cases have the same value for one or both variables, the pair is tied.

If the number of concordant pairs is similar to the number of discordant pairs, then the Kendall tau-b, Kendall tau-c and Gamma values will be close to 0 and their respective p-values will exceed the 0.05 criterion, indicating poor test-retest reliability. If the number of concordant pairs is considerably larger than the number of discordant pairs, then the three values will be close to 1 and their respective p-values will be less than the 0.05 criterion, indicating satisfactory test-retest reliability.

                                        Second evaluation
First evaluation     Strongly disagree   Disagree   Neutral   Agree   Strongly agree
Strongly disagree            2               4         0        0           0
Disagree                     3               2         0        0           0
Neutral                      0               0         0        2           0
Agree                        0               0         3        0           2
Strongly agree               0               0         0        2           0

Symmetric Measures

                                       Value   Std. Error   Approx. T   P-value
Ordinal by Ordinal   Kendall's tau-b   0.591   0.067        7.696       0.000
                     Kendall's tau-c   0.569   0.074        7.696       0.000
                     Gamma             0.674   0.081        7.696       0.000
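The three values in the table can be reproduced from the crosstab by counting concordant and discordant pairs. The following is a minimal sketch in Python (not part of the SPSS procedure), assuming the usual textbook definitions: Gamma = (C - D) / (C + D), tau-b corrects the denominator for pairs tied on each variable, and tau-c = 2m(C - D) / (n²(m - 1)), where m is the smaller of the number of rows and columns. The function and variable names are illustrative only.

```python
from math import sqrt

# Crosstab from the example: rows are the first evaluation, columns the second.
# Category order: Strongly disagree, Disagree, Neutral, Agree, Strongly agree.
table = [
    [2, 4, 0, 0, 0],
    [3, 2, 0, 0, 0],
    [0, 0, 0, 2, 0],
    [0, 0, 3, 0, 2],
    [0, 0, 0, 2, 0],
]


def ordinal_measures(table):
    """Return (concordant, discordant, tau_b, tau_c, gamma) for a crosstab."""
    rows, cols = len(table), len(table[0])
    n = sum(map(sum, table))
    concordant = discordant = 0
    # Compare every cell with all cells in later rows: cells further right
    # give concordant pairs, cells further left give discordant pairs.
    for i in range(rows):
        for j in range(cols):
            for k in range(i + 1, rows):
                for l in range(cols):
                    if l > j:
                        concordant += table[i][j] * table[k][l]
                    elif l < j:
                        discordant += table[i][j] * table[k][l]
    total_pairs = n * (n - 1) // 2
    # Pairs tied on the row variable and on the column variable.
    tied_rows = sum(t * (t - 1) // 2 for t in map(sum, table))
    tied_cols = sum(t * (t - 1) // 2 for t in map(sum, zip(*table)))
    diff = concordant - discordant
    gamma = diff / (concordant + discordant)
    tau_b = diff / sqrt((total_pairs - tied_rows) * (total_pairs - tied_cols))
    m = min(rows, cols)
    tau_c = 2 * m * diff / (n ** 2 * (m - 1))
    return concordant, discordant, tau_b, tau_c, gamma


C, D, tau_b, tau_c, gamma = ordinal_measures(table)
print(f"concordant = {C}, discordant = {D}")
print(f"tau-b = {tau_b:.3f}, tau-c = {tau_c:.3f}, gamma = {gamma:.3f}")
```

For this table there are 113 concordant and 22 discordant pairs, so the concordant pairs clearly dominate. Gamma is the largest of the three measures because it ignores tied pairs altogether.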

The heaviest concentration of responses occurs near the principal diagonal. There is no item on which the teacher agreed in one evaluation and disagreed in the other, as evidenced by the zero counts near the lower left and upper right corners. All three tests indicate good test-retest reliability since the p-values are all less than the 0.05 criterion.

Example 3

Two examiners were asked to correct the scripts of twenty students, where the marks ranged from 0 to 100. To assess inter-rater reliability, the intraclass correlation will be used to measure both consistency and absolute agreement.


Examiner A Examiner B Examiner A Examiner B

67 36 83 56

49 48 53 28

91 84 98 76

84 71 90 60

97 52 69 42

49 19 82 50

38 29 45 33

59 27 40 35

65 38 61 34

79 46 76 41

To get the output, click on Analyze, Scale and Reliability Analysis. Move the variables ‘A’ and ‘B’ into the items slot, click on Statistics, select Intraclass correlation coefficient and choose Consistency for type. Click on Continue and OK to get the output. Repeat the whole procedure but choose Absolute Agreement for type.

Intraclass Correlation Coefficient measuring consistency

                   Intraclass     95% Confidence Interval
                   Correlation    Lower Bound   Upper Bound
Single Measures    0.793          0.549         0.913
Average Measures   0.885          0.709         0.954

Intraclass Correlation Coefficient measuring absolute agreement

                   Intraclass     95% Confidence Interval
                   Correlation    Lower Bound   Upper Bound
Single Measures    0.435          -0.087        0.794
Average Measures   0.606          -0.192        0.885
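The four intraclass correlations can be reproduced from a two-way analysis of variance of the marks. The following is a minimal sketch in Python (not part of the SPSS procedure), assuming the usual two-way formulas based on the mean squares for students (rows), examiners (columns) and error. The notes do not state which model SPSS was set to, so the choice of formulas is an assumption; the function and variable names are illustrative only.

```python
# Marks from the example; the order of the twenty students does not matter.
examiner_a = [67, 49, 91, 84, 97, 49, 38, 59, 65, 79,
              83, 53, 98, 90, 69, 82, 45, 40, 61, 76]
examiner_b = [36, 48, 84, 71, 52, 19, 29, 27, 38, 46,
              56, 28, 76, 60, 42, 50, 33, 35, 34, 41]


def intraclass_correlations(ratings):
    """ratings: one row per subject, one column per rater.

    Returns single and average measures for consistency and absolute agreement.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(col) / n for col in zip(*ratings)]
    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    between = ms_rows - ms_error
    return {
        "consistency_single": between / (ms_rows + (k - 1) * ms_error),
        "consistency_average": between / ms_rows,
        # Absolute agreement also penalises differences between rater means.
        "agreement_single": between / (
            ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n),
        "agreement_average": between / (ms_rows + (ms_cols - ms_error) / n),
    }


icc = intraclass_correlations(list(zip(examiner_a, examiner_b)))
for name, value in icc.items():
    print(f"{name:20s} {value:.3f}")
```

Under these assumptions the sketch gives values that agree with the SPSS output to three decimal places. The only difference between the two types is the extra term involving the mean square for examiners, which is large here because the two examiners' average marks differ substantially.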

Although the two examiners differ markedly in the marks they award, their marks follow a similar pattern. Both examiners give higher marks for good performances and lower marks for poor performances; however, they differ in the precise mark that should be assigned to a particular performance. Examiner B is more stringent in his marking scheme than A. The average measure intraclass correlation for consistency (0.885) is larger than the average measure intraclass correlation for absolute agreement (0.606), which implies that the marks provided by the examiners are correlated but differ in size.

Internal Consistency Reliability

In internal consistency reliability estimation, a single measurement instrument is administered to a group of respondents on one occasion to estimate reliability. Basically, we judge the reliability of the instrument by estimating how well the items that reflect the same construct yield similar results. There are two main measures of internal consistency reliability: Cronbach’s Alpha and the Guttman Split-Half coefficient.


Cronbach’s Alpha is equal to the average measure intraclass correlation for consistency. In Split-Half Reliability the items, which are assumed to measure the same construct, are divided into two sets. The Guttman split-half coefficient is computed using the same formula as Cronbach’s Alpha, applied to the two sets as if they were two items. For both measures, values greater than 0.9 indicate excellent reliability; values between 0.8 and 0.9 imply good reliability; values between 0.7 and 0.8 indicate acceptable reliability; values between 0.5 and 0.7 indicate questionable reliability; and values less than 0.5 imply unacceptable reliability.

Consider the following example as an illustration of the procedure. An observer was asked to assess 24 children on 6 items related to prosocial behaviour. Not true corresponds to a score of 0; somewhat true corresponds to 1; and certainly true corresponds to 2. Variables P1, P2, P3, P4, P5 and P6 comprise the scores assigned by the observer for the 6 items. The task is to measure internal consistency reliability using the two methods described above.

Prosocial Behaviour Not True Somewhat True Certainly True

Considerate of other people’s feelings

Shares readily with other children

Helpful if someone is hurt

Upset of feeling ill

Kind to younger children

Often volunteers to help others

P1              P2              P3              P4              P5              P6
Not true        Somewhat true   Not true        Somewhat true   Not true        Not true
Somewhat true   Somewhat true   Not true        Somewhat true   Not true        Not true
Certainly true  Certainly true  Somewhat true   Certainly true  Certainly true  Certainly true
Certainly true  Somewhat true   Somewhat true   Certainly true  Certainly true  Somewhat true
Somewhat true   Not true        Somewhat true   Not true        Not true        Somewhat true
Not true        Not true        Somewhat true   Somewhat true   Somewhat true   Not true
Somewhat true   Somewhat true   Somewhat true   Somewhat true   Somewhat true   Not true
Not true        Not true        Not true        Somewhat true   Not true        Somewhat true
Somewhat true   Not true        Not true        Not true        Somewhat true   Not true
Not true        Somewhat true   Somewhat true   Somewhat true   Not true        Not true
Certainly true  Somewhat true   Certainly true  Certainly true  Certainly true  Somewhat true
Somewhat true   Not true        Somewhat true   Somewhat true   Somewhat true   Not true
Not true        Somewhat true   Somewhat true   Not true        Not true        Somewhat true
Somewhat true   Not true        Not true        Not true        Not true        Not true
Not true        Not true        Not true        Not true        Not true        Somewhat true
Somewhat true   Not true        Somewhat true   Somewhat true   Somewhat true   Somewhat true
Certainly true  Certainly true  Certainly true  Somewhat true   Certainly true  Certainly true
Not true        Somewhat true   Not true        Somewhat true   Somewhat true   Not true
Somewhat true   Somewhat true   Not true        Not true        Somewhat true   Not true
Not true        Not true        Somewhat true   Not true        Not true        Somewhat true
Certainly true  Certainly true  Certainly true  Somewhat true   Certainly true  Certainly true
Somewhat true   Certainly true  Certainly true  Certainly true  Somewhat true   Certainly true
Not true        Not true        Not true        Somewhat true   Somewhat true   Somewhat true
Somewhat true   Certainly true  Somewhat true   Certainly true  Certainly true  Somewhat true


To compute a split-half reliability measure between items 1, 2 and 3 (set 1) and items 4, 5 and 6 (set 2), click on Analyze, Scale and Reliability Analysis. Move P1, P2, P3, P4, P5 and P6 into the items list and select Split-half for the model option. Click on Statistics and select Inter-item correlation. A similar procedure is used to compute Cronbach’s Alpha, but select Alpha for the model option.

Inter-Item Correlation Matrix

       P1a     P2a     P3a     P4a     P5a     P6a
P1a    1.000   0.525   0.568   0.451   0.752   0.465
P2a    0.525   1.000   0.539   0.590   0.586   0.511
P3a    0.568   0.539   1.000   0.469   0.558   0.634
P4a    0.451   0.590   0.469   1.000   0.665   0.370
P5a    0.752   0.586   0.558   0.665   1.000   0.462
P6a    0.465   0.511   0.634   0.370   0.462   1.000


Reliability Statistics

Guttman Split-Half Coefficient 0.902

Cronbach’s Alpha 0.877

All inter-item correlations are positive, indicating that the variables (prosocial behaviours) are positively related. Moreover, both the Cronbach’s Alpha and the Guttman Split-Half coefficients exceed the 0.7 threshold for acceptable reliability, indicating good internal consistency (reliability).
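Both coefficients can be checked directly from the 24 x 6 table of scores. The following is a minimal sketch in Python (not part of the SPSS procedure), assuming the standard formulas: Cronbach's Alpha = k/(k - 1) x (1 - sum of item variances / variance of the total score), and the Guttman split-half coefficient = 2 x (1 - (variance of set 1 + variance of set 2) / variance of the total score), which is the Alpha formula with the two sets treated as two items. Scores are coded 0 (not true), 1 (somewhat true) and 2 (certainly true) as described above; the function and variable names are illustrative only.

```python
from statistics import variance

# One row per child, columns P1..P6, coded 0 / 1 / 2.
scores = [
    [0, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [2, 2, 1, 2, 2, 2],
    [2, 1, 1, 2, 2, 1],
    [1, 0, 1, 0, 0, 1],
    [0, 0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 1, 1, 0, 0],
    [2, 1, 2, 2, 2, 1],
    [1, 0, 1, 1, 1, 0],
    [0, 1, 1, 0, 0, 1],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [1, 0, 1, 1, 1, 1],
    [2, 2, 2, 1, 2, 2],
    [0, 1, 0, 1, 1, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1],
    [2, 2, 2, 1, 2, 2],
    [1, 2, 2, 2, 1, 2],
    [0, 0, 0, 1, 1, 1],
    [1, 2, 1, 2, 2, 1],
]


def cronbach_alpha(rows):
    """Cronbach's Alpha from sample variances of the items and of the total."""
    k = len(rows[0])
    item_variances = sum(variance(col) for col in zip(*rows))
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_variances / total_variance)


def guttman_split_half(rows, first_set):
    """Guttman split-half coefficient; first_set holds the column indices of set 1."""
    set1 = [sum(row[i] for i in first_set) for row in rows]
    set2 = [sum(row) - s for row, s in zip(rows, set1)]
    total_variance = variance([sum(row) for row in rows])
    return 2 * (1 - (variance(set1) + variance(set2)) / total_variance)


alpha = cronbach_alpha(scores)
split_half = guttman_split_half(scores, first_set=[0, 1, 2])  # P1, P2, P3
print(f"Cronbach's Alpha       = {alpha:.3f}")
print(f"Guttman Split-Half     = {split_half:.3f}")
```

The split-half coefficient depends on how the items are divided into two sets, so a different split of the six items would generally give a different value, whereas Cronbach's Alpha uses all items symmetrically.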