
Sung Hyun Kang, Biostatistician WCHRI, University of Alberta

3/5/2014

SPSS Workshop 2014 Tutorial

WCHRI, University of Alberta


Table of Contents

1. Outline: SPSS Workshop 2014
2. What is SPSS?
3. Introducing the SPSS interface
    3.1. SPSS Data Editor: Data View
    3.2. SPSS Data Editor: Variable View
    3.3. SPSS Output window
    3.4. SPSS Syntax window
4. Getting familiar with SPSS Menu and Icon
5. Data Import/Export
    5.1. Create Data File (Entering Data)
    5.2. Opening Data File (Import data)
        5.2.1. Opening SPSS data: File > Open > Data… (Select SPSS Statistics (*.sav) as “Files of type:”)
        5.2.2. Opening Text File: Fixed width
        5.2.3. Opening Text File: (Tab) Delimited
        5.2.4. Opening EXCEL (or CSV) File
        5.2.5. Opening SAS data file
    5.3. Export Data File (Save as different type of data)
    5.4. Saving Data File with selected variables
6. Manipulating data1 (SPSS Menu: Data)
    6.1. Data Menu: Sort Cases…
    6.2. Data Menu: Identify Duplicate Cases…
    6.3. Data Menu: Merge Files > Add Cases…
    6.4. Data Menu: Merge Files > Add Variables…
    6.5. Data Menu: Aggregate…
    6.6. Data Menu: Restructure…
    6.7. Data Menu: Split into Files
    6.8. Data Menu: Split Files…
    6.9. Data Menu: Select Cases…
    6.10. Data Menu: Weight Cases…
7. Manipulating data2 (SPSS Menu: Transform)
    7.1. Transform Menu: Compute Variable…
    7.2. Transform Menu: Recode into Same Variables…
    7.3. Transform Menu: Recode into Different Variables…
    7.4. Transform Menu: Automatic Recode…
    7.5. Transform Menu: Create Dummy Variables
    7.6. Transform Menu: Visual Binning…
    7.7. Transform Menu: Rank Cases…
    7.8. Transform Menu: Date and time Wizard…
    7.9. Transform Menu: Replace missing values…
8. Descriptive statistics
    8.1. Descriptive statistics for continuous data (Interval, Ratio)
    8.2. Descriptive statistics for categorical data (Nominal, Ordinal)
    8.3. Generating graphs (or charts) for continuous data (Interval, Ratio)
    8.4. Generating graphs (or charts) for categorical data (Nominal, Ordinal)
    8.5. Using Chart Builder
    8.6. Chart Edit Window
9. Compare Means (T-test)
    9.1. Independent sample t-test with two groups
    9.2. Paired samples t-test
10. Compare proportions (Analysis of contingency table) and association
    10.1. Pearson’s Chi-Square test
    10.2. Fisher’s exact test
    10.3. Cochran-Mantel-Haenszel (CMH) Statistics
    10.4. McNemar’s test for matched pairs data
    10.5. Measure of Agreement (Cohen’s Kappa)
    10.6. Studies in medical science (Review)
11. ANOVA, ANCOVA, and MANOVA
    11.1. One-way ANOVA
    11.2. Two-way ANOVA (With interaction)
    11.3. ANCOVA (Analysis of Covariance)
    11.4. MANOVA (Multivariate ANOVA)
12. Nonparametric method
    12.1. Wilcoxon rank-sum test (Mann-Whitney U test)
    12.2. Wilcoxon signed-rank test for paired data
    12.3. Kruskal-Wallis test
13. Correlation and Regression analysis
    13.1. Correlation analysis
    13.2. Linear Regression model
14. Logistic regression analysis



1. Outline: SPSS Workshop 2014

The first day objectives are:

• Learning about SPSS

• Opening and reviewing layouts of SPSS

• Becoming familiar with menus and icons

• Manipulating data files

• Calculating descriptive statistics

• Comparing means and proportions

• Calculating association

• Creating graphs

• Working with SPSS syntax

The second day objectives are:

• ANOVA, ANCOVA, MANOVA

• Repeated measures ANOVA

• Correlation analysis & (ordinary) linear regression

• Logistic regression

• Nonparametric method

Note: this tutorial was created using IBM SPSS Statistics Version 22.



2. What is SPSS?

• Windows-based program that can be used to perform data entry and analysis and to create tables and graphs.
• Capable of handling large amounts of data and can perform all of the analyses covered in the text and much more.
• Commonly used in the social sciences and in the business world.
• SPSS is updated often.

Figure 1 Running SPSS with welcome dialogue

• SPSS file name extensions:
    o SPSS data file: *.sav
    o SPSS output file: *.spv (*.spo in SPSS versions before 16)
    o SPSS syntax file: *.sps



3. Introducing the SPSS interface

3.1. SPSS Data Editor: Data View

Many of the features of Data View are similar to the features that are found in spreadsheet applications. There are, however, several important distinctions:

• Rows are cases. Each row represents a case or an observation. For example, each individual respondent to a questionnaire is a case.
• Columns are variables. Each column represents a variable or characteristic that is being measured. For example, each item on a questionnaire is a variable.
• Cells contain values. Each cell contains a single value of a variable for a case. The cell is where the case and the variable intersect. Cells contain only data values. Unlike spreadsheet programs, cells in the Data Editor cannot contain formulas.
• The data file is rectangular. The dimensions of the data file are determined by the number of cases and variables. You can enter data in any cell. If you enter data in a cell outside the boundaries of the defined data file, the data rectangle is extended to include any rows and/or columns between that cell and the file boundaries. There are no "empty" cells within the boundaries of the data file. For numeric variables, blank cells are converted to the system-missing value. For string variables, a blank is considered a valid value.

Figure 2 SPSS Data Editor: Data View

3.2. SPSS Data Editor: Variable View

Variable View contains descriptions of the attributes of each variable in the data file. In Variable View:

• Rows are variables.
• Columns are variable attributes.



You can add or delete variables and modify attributes of variables, including the following attributes:

• Variable name
• Data type
• Number of digits or characters
• Number of decimal places
• Descriptive variable and value labels
• User-defined missing values
• Column width
• Measurement level

All of these attributes are saved when you save the data file.

In addition to defining variable properties in Variable View, there are two other methods for defining variable properties:

• The Copy Data Properties Wizard provides the ability to use an external IBM® SPSS® Statistics data file or another dataset that is available in the current session as a template for defining file and variable properties in the active dataset. You can also use variables in the active dataset as templates for other variables in the active dataset. Copy Data Properties is available on the Data menu in the Data Editor window. See the topic Copying Data Properties for more information.

• Define Variable Properties (also available on the Data menu in the Data Editor window) scans your data and lists all unique data values for any selected variables, identifies unlabeled values, and provides an auto-label feature. This method is particularly useful for categorical variables that use numeric codes to represent categories--for example, 0 = Male, 1 = Female. See the topic Defining Variable Properties for more information.

Figure 3 SPSS Data Editor: Variable View



3.3. SPSS Output window

Figure 4 SPSS output

3.4. SPSS Syntax window

Figure 5 SPSS syntax



4. Getting familiar with SPSS Menu and Icon

Figure 6 SPSS Menu and Icon

SPSS MENU

• File includes all of the options you typically use in other programs, such as open, save, exit. Notice that you can open or create new files of multiple types as illustrated to the right.
• Edit includes the typical cut, copy, and paste commands, and allows you to specify various options for displaying data and output.
    o Click on Options, and you will see the dialog box to the left. You can use this to format the data, output, charts, etc. These choices are rather overwhelming, and you can simply take the default options for now. The author of your text (me) was too dumb to even know these options could easily be set.

Figure 7 SPSS Menu - Edit (Options)




• View allows you to select which toolbars you want to show, select font size, add or remove the gridlines that separate each piece of data, and to select whether or not to display your raw data or the data labels.

• Data allows you to select several options ranging from displaying data that is sorted by a specific variable to selecting certain cases for subsequent analyses.

• Transform includes several options to change current variables. For example, you can change continuous variables to categorical variables, change scores into rank scores, add a constant to variables, etc.

• Analyze includes all of the commands to carry out statistical analyses and to calculate descriptive statistics. Much of this book will focus on using commands located in this menu.

• Graphs includes the commands to create various types of graphs including box plots, histograms, line graphs, and bar charts.

• Utilities allows you to list file information, which is a list of all variables, their labels, values, locations in the data file, and type.

• Add-ons are programs that can be added to the base SPSS package. You probably do not have access to any of those.
• Window can be used to select which window you want to view (i.e., Data Editor, Output Viewer, or Syntax). Since we have a data file and an output file open, let’s try this.
    o Select Window/Data Editor. Then select Window/SPSS Viewer.

• Help has many useful options including a link to the SPSS homepage, a statistics coach, and a syntax guide. Using topics, you can use the index option to type in any key word and get a list of options, or you can view the categories and subcategories available under contents. This is an excellent tool and can be used to troubleshoot most problems.

SPSS ICON

• The Icons directly under the Menu bar provide shortcuts to many common commands that are available in specific menus. Take a moment to review these as well.

STATUS Bar

The status bar at the bottom of each IBM® SPSS® Statistics window provides the following information:

• Command status. For each procedure or command that you run, a case counter indicates the number of cases processed so far. For statistical procedures that require iterative processing, the number of iterations is displayed.
• Filter status. If you have selected a random sample or a subset of cases for analysis, the message Filter on indicates that some type of case filtering is currently in effect and not all cases in the data file are included in the analysis.
• Weight status. The message Weight on indicates that a weight variable is being used to weight cases for analysis.
• Split File status. The message Split File on indicates that the data file has been split into separate groups for analysis, based on the values of one or more grouping variables.



5. Data Import/Export

5.1. Create Data File (Entering Data)

Variable names

The following rules apply to variable names:

• Each variable name must be unique; duplication is not allowed.
• Variable names can be up to 64 bytes long, and the first character must be a letter or one of the characters @, #, or $. Subsequent characters can be any combination of letters, numbers, non-punctuation characters, and a period (.). In code page mode, sixty-four bytes typically means 64 characters in single-byte languages (for example, English, French, German, Spanish, Italian, Hebrew, Russian, Greek, Arabic, and Thai) and 32 characters in double-byte languages (for example, Japanese, Chinese, and Korean). Many string characters that only take one byte in code page mode take two or more bytes in Unicode mode. For example, é is one byte in code page format but is two bytes in Unicode format; so résumé is six bytes in a code page file and eight bytes in Unicode mode.

Note: Letters include any non-punctuation characters used in writing ordinary words in the languages supported in the platform's character set.

• Variable names cannot contain spaces.
• A # character in the first position of a variable name defines a scratch variable. You can only create scratch variables with command syntax. You cannot specify a # as the first character of a variable in dialog boxes that create new variables.

• A $ sign in the first position indicates that the variable is a system variable. The $ sign is not allowed as the initial character of a user-defined variable.

• The period, the underscore, and the characters $, #, and @ can be used within variable names. For example, A._$@#1 is a valid variable name.

• Variable names ending with a period should be avoided, since the period may be interpreted as a command terminator. You can only create variables that end with a period in command syntax. You cannot create variables that end with a period in dialog boxes that create new variables.

• Variable names ending in underscores should be avoided, since such names may conflict with names of variables automatically created by commands and procedures.

• Reserved keywords cannot be used as variable names. Reserved keywords are ALL, AND, BY, EQ, GE, GT, LE, LT, NE, NOT, OR, TO, and WITH.

• Variable names can be defined with any mixture of uppercase and lowercase characters, and case is preserved for display purposes.

• When long variable names need to wrap onto multiple lines in output, lines are broken at underscores, periods, and points where content changes from lower case to upper case.
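The naming rules above can be made concrete with a small check written outside SPSS. The following Python sketch is only an illustration of the rules as listed here: the function name `check_variable_name` is hypothetical, "letter" is approximated by Python's `str.isalpha`, keywords are assumed to be matched regardless of case, and the byte count assumes Unicode (UTF-8) mode unless another encoding is passed.

```python
RESERVED = {"ALL", "AND", "BY", "EQ", "GE", "GT", "LE", "LT",
            "NE", "NOT", "OR", "TO", "WITH"}


def check_variable_name(name, encoding="utf-8"):
    """Return a list of rule violations; an empty list means none were found."""
    if not name:
        return ["name is empty"]
    problems = []
    # Length is counted in bytes, so accented letters can count twice in UTF-8.
    if len(name.encode(encoding)) > 64:
        problems.append("longer than 64 bytes")
    first = name[0]
    if first == "$":
        problems.append("'$' in first position marks a system variable")
    elif first == "#":
        problems.append("'#' in first position defines a scratch variable")
    elif not (first.isalpha() or first == "@"):
        problems.append("first character must be a letter or @")
    # Later characters: letters, digits, period, underscore, $, # and @.
    for ch in name[1:]:
        if not (ch.isalnum() or ch in "._$#@"):
            problems.append(f"character {ch!r} is not allowed")
            break
    # Assumption: reserved keywords are compared without regard to case.
    if name.upper() in RESERVED:
        problems.append("reserved keyword")
    if name[-1] in "._":
        problems.append("avoid a trailing period or underscore")
    return problems


print(check_variable_name("A._$@#1"))   # the valid example from the rules
print(check_variable_name("my var"))    # contains a space
print(len("résumé".encode("latin-1")), len("résumé".encode("utf-8")))
```

The last line reproduces the byte-count example from the rules: the same six-character word occupies a different number of bytes depending on the encoding.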

Variable type

Variable Type specifies the data type for each variable. By default, all new variables are assumed to be numeric. You can use Variable Type to change the data type. The contents of the Variable Type dialog box depend on the selected data type. For some data types, there are text boxes for width and number of decimals; for other data types, you can simply select a format from a scrollable list of examples. The available data types are as follows:

• Numeric. A variable whose values are numbers. Values are displayed in standard numeric format. The Data Editor accepts numeric values in standard format or in scientific notation.



• Comma. A numeric variable whose values are displayed with commas delimiting every three places and displayed with the period as a decimal delimiter. The Data Editor accepts numeric values for comma variables with or without commas or in scientific notation. Values cannot contain commas to the right of the decimal indicator.

• Dot. A numeric variable whose values are displayed with periods delimiting every three places and with the comma as a decimal delimiter. The Data Editor accepts numeric values for dot variables with or without periods or in scientific notation. Values cannot contain periods to the right of the decimal indicator.

• Scientific notation. A numeric variable whose values are displayed with an embedded E and a signed power-of-10 exponent. The Data Editor accepts numeric values for such variables with or without an exponent. The exponent can be preceded by E or D with an optional sign or by the sign alone--for example, 123, 1.23E2, 1.23D2, 1.23E+2, and 1.23+2.

• Date. A numeric variable whose values are displayed in one of several calendar-date or clock-time formats. Select a format from the list. You can enter dates with slashes, hyphens, periods, commas, or blank spaces as delimiters. The century range for two-digit year values is determined by your Options settings (from the Edit menu, choose Options, and then click the Data tab).

• Dollar. A numeric variable displayed with a leading dollar sign ($), commas delimiting every three places, and a period as the decimal delimiter. You can enter data values with or without the leading dollar sign.

• Custom currency. A numeric variable whose values are displayed in one of the custom currency formats that you have defined on the Currency tab of the Options dialog box. Defined custom currency characters cannot be used in data entry but are displayed in the Data Editor.

• String. A variable whose values are not numeric and therefore are not used in calculations. The values can contain any characters up to the defined length. Uppercase and lowercase letters are considered distinct. This type is also known as an alphanumeric variable.

• Restricted numeric. A variable whose values are restricted to non-negative integers. Values are displayed with leading zeros padded to the maximum width of the variable. Values can be entered in scientific notation.

Figure 8 SPSS Variable View: Variable Type

Measure of Variables

• Nominal variable: has two or more categories, but there is no intrinsic ordering to the categories. e.g., gender, ethnicity etc.
• Ordinal variable: similar to a nominal variable, but the categories have a clear ordering; the spacing between the values may not be the same. e.g., socio-economic status, severity of disease etc.
• Interval variable: similar to an ordinal variable, but the intervals between values are equally spaced. e.g., height, weight, age etc.

Figure 9 SPSS Variable View: Measure of Variable

Missing values

• If you do not enter any data in a field, it will be considered missing and SPSS will enter a period for you.
• Alternatively, you can define a specific value as a missing value.

Figure 10 SPSS Variable View: Define missing values



Example) Enter the following data in SPSS and save the file

PatientID  Gender  Age  Weight  Height  Ethnicity
1          1       18   175     155     A
2          2       31   156     150     W
3          1       12   141     136     B
4          9       31   160     177     O

• For Gender,
    o 1=“Male”
    o 2=“Female”
    o 9=“User defined Missing value”
• For Ethnicity,
    o A=“Aboriginal”
    o W=“White”
    o B=“Black”
    o O=“Others”

• Entering data in SPSS (Variable name, define value labels, and define missing value)



• Save data file as SPSS data format (File name: Example1.sav)
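To show what a user-defined missing value does to a simple frequency count, here is a minimal Python sketch using the four example cases above. It is not SPSS code; the names `GENDER_LABELS`, `GENDER_MISSING` and `gender_frequencies` are made up for this illustration.

```python
from collections import Counter

# Example data from this section:
# (PatientID, Gender, Age, Weight, Height, Ethnicity)
rows = [
    (1, 1, 18, 175, 155, "A"),
    (2, 2, 31, 156, 150, "W"),
    (3, 1, 12, 141, 136, "B"),
    (4, 9, 31, 160, 177, "O"),
]

GENDER_LABELS = {1: "Male", 2: "Female"}   # value labels
GENDER_MISSING = {9}                       # user-defined missing code


def gender_frequencies(data):
    """Count valid Gender values by label and tally missing codes separately."""
    valid = Counter()
    missing = 0
    for row in data:
        code = row[1]
        if code in GENDER_MISSING:
            missing += 1        # excluded from the valid counts
        else:
            valid[GENDER_LABELS[code]] += 1
    return dict(valid), missing


freq, n_missing = gender_frequencies(rows)
print(freq, n_missing)
```

Patient 4 is counted as missing rather than as a third gender category, which is the effect of declaring 9 as a missing value.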



5.2. Opening Data File (Import data)

Data files come in a wide variety of formats, and this software is designed to handle many of them, including:

• Spreadsheets created with Excel and Lotus
• Database tables from many database sources, including Oracle, SQLServer, Access, dBASE, and others
• Tab-delimited and other types of simple text files
• Data files in IBM® SPSS® Statistics format created on other operating systems
• SYSTAT data files. SYSTAT SYZ files are not supported.
• SAS data files
• Stata data files
• IBM Cognos Business Intelligence data packages and list reports

5.2.1. Opening SPSS data: File > Open > Data… (Select SPSS Statistics (*.sav) as “Files of type:”)

5.2.2. Opening Text File: Fixed width

o Raw data

Figure 11 Text Data File: Fixed width

o Open in SPSS
    o File > Read Text Data…
    o File > Open > Data…



5.2.3. Opening Text File: (Tab) Delimited

o Raw data

Figure 12 Text Data File: (Tab) Delimited

o Open in SPSS
    o File > Read Text Data…
    o File > Open > Data…



5.2.4. Opening EXCEL (or CSV) File

o Raw data

Figure 13 EXCEL data file

Figure 14 CSV data file

o Open in SPSS
    o File > Open > Data…
        Select EXCEL as “Files of type:” to open EXCEL file
        Select TEXT as “Files of type:” to open CSV file
    o Or simply drag EXCEL (or CSV) file to SPSS program



5.2.5. Opening SAS data file

o Raw data

Figure 15 SAS data file

o Open in SPSS
    o File > Open > Data…
        Select SAS as “Files of type:” to open SAS data file
    o Or simply drag SAS data file to SPSS program



5.3. Export Data File (Save as different type of data)

o Data export in SPSS
    o File > Save as…
        Select “Save as type” to export data

Figure 16 Export data (Save as different data format)

5.4. Saving Data File with selected variables

o File >Save as… (click “Variables…” in Save Data As window)

Figure 17 Save data with selected variables



6. Manipulating data1 (SPSS Menu: Data)

Even with a cleaned final dataset, you will usually need to manipulate the data during analysis. For instance, if you have several datasets from different sources, you might need to merge or subset the data. And no matter how carefully you planned your data design, you’ll probably want to work with some variables in different forms. If you collected income or age data, for example, you might want to group the continuous variables into categories. Or you might want to create a variable that combines various conditions, say, all minority managers by gender. Sections 6 and 7 cover the Data and Transform menus in SPSS.

Figure 18 SPSS Menu: Data

Figure 19 SPSS Menu: Transform

6.1. Data Menu: Sort Cases…

Figure 20 Data Menu: Sort Cases...



6.2. Data Menu: Identify Duplicate Cases… (Sample data: Example_Identify_Duplicated.sav)

Figure 21 Data with duplicate cases

Figure 22 Data Menu: Identify Duplicate Cases...

o SPSS output

Indicator of each last matching case as Primary

                        Frequency  Percent  Valid Percent  Cumulative Percent
Valid  Duplicate Case           1      3.2            3.2                 3.2
       Primary Case            30     96.8           96.8               100.0
       Total                   31    100.0          100.0
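The indicator in this output flags the last case in each group of matching cases as primary and the rest as duplicates. A minimal Python sketch of that idea follows. The case data and the function name `flag_primary_last` are hypothetical, and the sketch keeps the cases in file order rather than reproducing any sorting that the SPSS dialog may apply.

```python
def flag_primary_last(cases, key):
    """Return 1/0 flags: 1 = primary (last case with that key), 0 = duplicate."""
    last_index = {}
    for i, case in enumerate(cases):
        last_index[case[key]] = i   # later cases overwrite earlier ones
    return [1 if last_index[case[key]] == i else 0
            for i, case in enumerate(cases)]


# Hypothetical cases; id 101 was entered twice.
cases = [
    {"id": 101, "age": 18},
    {"id": 102, "age": 31},
    {"id": 101, "age": 18},
    {"id": 103, "age": 12},
]

flags = flag_primary_last(cases, "id")
print(flags)
print("primary:", sum(flags), "duplicate:", len(flags) - sum(flags))
```

Tabulating the flags gives the same kind of frequency table as above: one row for duplicate cases and one for primary cases.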



6.3. Data Menu: Merge Files > Add Cases… (Sample data: Example_MergeByCase_01.sav, Example_MergeByCase_02.sav)

Figure 23 SPSS data files to be merged (Add cases)

o Both SPSS datasets above have the same variables with the same names, but different cases (one contains data for Females, the other for Males).

Figure 24 Data Menu: Merge Files > Add Cases...



6.4. Data Menu: Merge Files > Add Variables… (Sample data: Example_MergeByVariables_01.sav, Example_MergeByVariable_02.sav)

Figure 25 SPSS datasets to be merged (Add variables)

o Note that both SPSS datasets above must have a key variable (unique identifier).
o Note that both SPSS datasets above must be sorted by the key variable before merging files.

Figure 26 Data Menu: Merge Files > Add Variables...
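The role of the key variable can be sketched outside SPSS. The Python example below matches records from two hypothetical files on an `id` key and combines their variables. The data, the function name `add_variables`, and the choice to keep unmatched cases with `None` for the absent variables are assumptions of this sketch, not a description of what the SPSS dialog does internally.

```python
def add_variables(left, right, key):
    """Combine two lists of records on `key`, one output row per key value."""
    left_vars = [name for name in left[0] if name != key]
    right_vars = [name for name in right[0] if name != key]
    by_left = {record[key]: record for record in left}
    by_right = {record[key]: record for record in right}
    merged = []
    # Sorting the keys mirrors the requirement that files be sorted by the key.
    for value in sorted(set(by_left) | set(by_right)):
        row = {key: value}
        for name in left_vars:
            row[name] = by_left.get(value, {}).get(name)
        for name in right_vars:
            row[name] = by_right.get(value, {}).get(name)
        merged.append(row)
    return merged


# Hypothetical files sharing the key variable "id".
demographics = [{"id": 1, "gender": "F"}, {"id": 2, "gender": "M"},
                {"id": 3, "gender": "F"}]
measurements = [{"id": 1, "weight": 150}, {"id": 3, "weight": 141}]

merged = add_variables(demographics, measurements, "id")
for row in merged:
    print(row)
```

Without a unique key there would be no way to decide which weight belongs to which person, which is why both files need the identifier.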



6.5. Data Menu: Aggregate… (Sample data: DataExcel.sav)

Figure 27 Data Menu: Aggregate...

Figure 28 Aggregated data by Gender



6.6. Data Menu: Restructure… (Sample data: Example1_RepeatedMeasureANOVA.sav)

o Data structure for Repeated Measures ANOVA (Wide format)

subject  group  dv1  dv2  dv3  dv4
1        1      3    4    7    3
2        1      6    8    12   9
3        1      7    13   11   11
4        1      0    3    6    6
5        2      5    6    11   7
6        2      10   12   18   15
7        2      10   15   15   14
8        2      5    7    11   9

o Data structure for Mixed Model (Long format)

subject  group  dv  trial
1        1      3   1
1        1      4   2
1        1      7   3
1        1      3   4
:        :      :   :
8        2      5   1
8        2      7   2
8        2      11  3
8        2      9   4

o Using the “Restructure” menu in SPSS (Steps 1 to 7), we can convert wide format data into long format (or vice versa).
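The wide-to-long conversion can also be written out in a few lines of Python, which may help clarify what the Restructure wizard produces. This is a sketch with hypothetical function names (`wide_to_long`, `long_to_wide`), using the eight subjects from the wide-format table above.

```python
# Wide format: (subject, group, dv1, dv2, dv3, dv4)
wide = [
    (1, 1, 3, 4, 7, 3),
    (2, 1, 6, 8, 12, 9),
    (3, 1, 7, 13, 11, 11),
    (4, 1, 0, 3, 6, 6),
    (5, 2, 5, 6, 11, 7),
    (6, 2, 10, 12, 18, 15),
    (7, 2, 10, 15, 15, 14),
    (8, 2, 5, 7, 11, 9),
]


def wide_to_long(rows):
    """One output row per subject and trial: (subject, group, dv, trial)."""
    long_rows = []
    for subject, group, *dvs in rows:
        for trial, dv in enumerate(dvs, start=1):
            long_rows.append((subject, group, dv, trial))
    return long_rows


def long_to_wide(rows):
    """Collect the dv values of each subject back into one row, by trial order."""
    collected = {}
    for subject, group, dv, trial in sorted(rows, key=lambda r: (r[0], r[3])):
        collected.setdefault((subject, group), []).append(dv)
    return [(subject, group, *dvs) for (subject, group), dvs in collected.items()]


long_rows = wide_to_long(wide)
print(long_rows[:4])     # the four trials of subject 1
print(len(long_rows))    # 8 subjects x 4 trials
```

Each subject contributes one row in wide format and four rows in long format; converting back with `long_to_wide` recovers the original table.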



6.7. Data Menu: Split into Files (Sample data: DataExcel.sav)

Figure 29 Data Menu: Split into Files

o This procedure will generate two separate SPSS datasets, one for each gender.



6.8. Data Menu: Split Files… (Sample data: DataExcel.sav)

Figure 30 Data Menu: Split Files...

o After you run the above, “Split by Gender” appears in the Status bar.
o SPSS output before splitting files by Gender.

Descriptive Statistics

                     N   Minimum  Maximum  Mean    Std. Deviation
Age                  30  12       31       21.60   6.360
Weight               30  100      200      147.13  34.309
Height               30  121      188      144.00  17.388
Valid N (listwise)   30

o SPSS output after splitting files by Gender.

Descriptive Statistics

Gender                      N   Minimum  Maximum  Mean    Std. Deviation
F       Age                 18  12       31       20.06   5.955
        Weight              18  100      199      142.61  36.934
        Height              18  121      188      143.94  18.479
        Valid N (listwise)  18
M       Age                 12  12       31       23.92   6.487
        Weight              12  111      200      153.92  30.189
        Height              12  122      177      144.08  16.412
        Valid N (listwise)  12

o To analyze all cases together again, select “Analyze all cases, do not create groups” in the “Split Files…” menu (see Figure 30).
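As a quick consistency check on the two outputs, the overall means can be recovered from the per-gender results as averages weighted by group size. The Python sketch below does this arithmetic with the N and Mean values copied from the split-file output; the function name `pooled_mean` is hypothetical.

```python
# (N, Mean) pairs copied from the split-file output above
groups = {
    "F": {"Age": (18, 20.06), "Weight": (18, 142.61), "Height": (18, 143.94)},
    "M": {"Age": (12, 23.92), "Weight": (12, 153.92), "Height": (12, 144.08)},
}


def pooled_mean(variable):
    """Overall mean = sum of (n * group mean) divided by the total n."""
    total_n = sum(groups[g][variable][0] for g in groups)
    total_sum = sum(n * mean for n, mean in (groups[g][variable] for g in groups))
    return total_sum / total_n


overall = {v: round(pooled_mean(v), 2) for v in ("Age", "Weight", "Height")}
print(overall)   # agrees with the means 21.60, 147.13 and 144.00 before splitting
```

Splitting the file changes how results are reported, not the underlying data: the 18 female and 12 male cases together are the same 30 cases as before.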



6.9. Data Menu: Select Cases… (Sample data: DataExcel.sav)

Figure 31 Data Menu: Select Cases...

Figure 32 SPSS dataset with selected cases

o In Figure 32, the Female cases are selected, so all Male cases will be excluded from the analysis.
o To use all cases again, select “All cases” in the “Select Cases…” menu (Figure 31).



6.10. Data Menu: Weight Cases…

Figure 33 Data Menu: Weight Cases...

Suppose we have the cross (contingency) table below and want to examine the association between X and Y.

                Y
             1     2     Total
X      1     15    20    35
       2     25    35    60
Total        40    55    95

o Data input in SPSS.

Figure 34 Data input for weight cases

o SPSS output before weighting cases:

x * y Crosstabulation (Count)

                y
             1     2     Total
x      1     1     1     2
       2     1     1     2
Total        2     2     4

o SPSS output after weighting cases (“Weight on” in the status bar):

x * y Crosstabulation (Count)

                y
             1     2     Total
x      1     15    20    35
       2     25    35    60
Total        40    55    95
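The effect of weighting can be illustrated with a minimal Python sketch (not SPSS syntax). It assumes the data were entered as one row per cell with a count column, as the two crosstabulations above suggest; the variable names are hypothetical. Without weighting each data row counts once; with weighting each row counts as many times as its count value.

```python
from collections import Counter

# One data row per cell of the contingency table: (x, y, count).
cells = [(1, 1, 15), (1, 2, 20), (2, 1, 25), (2, 2, 35)]

unweighted = Counter()
weighted = Counter()
for x, y, count in cells:
    unweighted[(x, y)] += 1      # weight off: each data row is one case
    weighted[(x, y)] += count    # weight on: each data row is `count` cases

print("N without weighting:", sum(unweighted.values()))
print("N with weighting:   ", sum(weighted.values()))
```

The unweighted total is the number of data rows (4), while the weighted total recovers the 95 subjects of the original table.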



7. Manipulating data 2 (SPSS Menu: Transform)

7.1. Transform Menu: Compute Variable… (Sample data: DataExcel.sav)

Figure 35 Transform Menu: Compute Variable

7.2. Transform Menu: Recode into Same Variables… (Sample data: DataExcel.sav)



7.3. Transform Menu: Recode into Different Variables… (Sample data: DataExcel.sav)

o From the example dataset, we want to generate a categorized age variable (<20 years, 20-29 years, >=30 years).
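The recoding rule behind this step can be written down explicitly. The sketch below is plain Python rather than SPSS; the function name age_group and the sample ages are hypothetical and are not taken from DataExcel.sav. The category labels follow the ones used in this tutorial.

```python
def age_group(age):
    """Map an age in years to the three categories used in this section."""
    if age < 20:
        return "< 20 years"
    if age < 30:
        return "20-29 years"
    return ">= 30 years"


# Hypothetical ages for illustration only.
ages = [12, 19, 20, 29, 30, 31]
for age in ages:
    print(age, "->", age_group(age))
```

Writing the boundaries as "less than 20" and "less than 30" keeps the categories exhaustive and non-overlapping, which is also what the “Old and New Values” ranges in the recode dialog need to achieve.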

7.4. Transform Menu: Automatic Recode… (Sample data: DataExcel.sav)



7.5. Transform Menu: Create Dummy Variables (Sample data: DataExcel.sav)

7.6. Transform Menu: Visual Binning… (Sample data: DataExcel.sav)

o From the example dataset, we want to generate the same categorized age variable (<20 years, 20-29 years, >=30 years) as in Section 7.4.



o Make Cutpoints…



7.7. Transform Menu: Rank Cases…

(Sample data: DataExcel.sav)

7.8. Transform Menu: Date and time Wizard… (Sample data: DataExcel.sav)

o In the example dataset, the “Date” variable is a string variable. Let’s generate a date-type variable from the string (text).



7.9. Transform Menu: Replace missing values… (Sample data: DataExcel.sav)



8. Descriptive statistics

• With the dataset specified and labeled, it is ready for analysis.
• The first step before conducting the analysis is to present descriptive statistics for each of the variables in the study.
• The descriptive statistics presented here include frequency distributions, measures of central tendency, and comparisons of means across groups.

Figure 36 Analyze Menu: Descriptive Statistics

8.1. Descriptive statistics for continuous data (Interval, Ratio)

- Central tendency (Mean, Median, Mode etc.)
- Dispersion (Variance, Standard deviation, Range, IQR etc.)
- Distribution (Skewness, Kurtosis etc.)

• SPSS Menu to perform descriptive analysis for continuous data

o Analyze > Descriptive Statistics > Descriptives…
o Analyze > Descriptive Statistics > Explore…
o Analyze > Compare Means > Means…
o Analyze > Descriptive Statistics > Frequencies…



• Analyze > Descriptive Statistics > Descriptives… (Example dataset: DataExcel.sav)

Figure 37 Analyze>Descriptive Statistics>Descriptives...

o SPSS output:


Descriptive Statistics

                     N          Minimum    Mean       Std. Deviation  Variance   Skewness               Kurtosis
                     Statistic  Statistic  Statistic  Statistic       Statistic  Statistic  Std. Error  Statistic  Std. Error
Age                  30         12         21.60      6.360           40.455     .067       .427        -1.380     .833
Weight               30         100        147.13     34.309          1177.085   .087       .427        -1.414     .833
Height               30         121        144.00     17.388          302.345    .872       .427        .358       .833
Valid N (listwise)   30

o SPSS output (by Gender): Splitting File by Gender


Descriptive Statistics

Gender                       N          Minimum    Mean       Std. Deviation  Variance   Skewness               Kurtosis
                             Statistic  Statistic  Statistic  Statistic       Statistic  Statistic  Std. Error  Statistic  Std. Error
F       Age                  18         12         20.06      5.955           35.467     .527       .536        -.805      1.038
        Weight               18         100        142.61     36.934          1364.134   .314       .536        -1.596     1.038
        Height               18         121        143.94     18.479          341.467    1.119      .536        .902       1.038
        Valid N (listwise)   18
M       Age                  12         12         23.92      6.487           42.083     -.688      .637        -.852      1.232
        Weight               12         111        153.92     30.189          911.356    -.141      .637        -.674      1.232
        Height               12         122        144.08     16.412          269.356    .435       .637        -.338      1.232
        Valid N (listwise)   12



• Analyze > Descriptive Statistics>Explore… (Example dataset: DataExcel.sav)

Figure 38 Analyze>Descriptive Statistics>Explore...

o SPSS output:

(Note: to run the analysis by a grouping variable, add that variable to “Factor List:” in the dialog.)

Descriptives

Height                                                  Statistic  Std. Error
Mean                                                    144.00     3.175
95% Confidence Interval for Mean     Lower Bound        137.51
                                     Upper Bound        150.49
5% Trimmed Mean                                         142.96
Median                                                  144.00
Variance                                                302.345
Std. Deviation                                          17.388
Minimum                                                 121
Maximum                                                 188
Range                                                   67
Interquartile Range                                     24
Skewness                                                .872       .427
Kurtosis                                                .358       .833

Tests of Normality

          Kolmogorov-Smirnov(a)        Shapiro-Wilk
          Statistic  df   Sig.         Statistic  df   Sig.
Height    .111       30   .200*        .927       30   .041

*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction



Height Stem-and-Leaf Plot

 Frequency    Stem &  Leaf
     6.00       12 .  124579
     8.00       13 .  00013568
     6.00       14 .  355679
     5.00       15 .  02455
     2.00       16 .  03
     1.00       17 .  7
     2.00       18 .  08

 Stem width:  10
 Each leaf:   1 case(s)
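Several entries in the Explore “Descriptives” table follow directly from the mean, standard deviation and sample size. The Python sketch below recomputes the standard error, the 95% confidence interval and the range for Height. The critical value 2.045 is an assumption taken from a standard t table (two-sided 95%, 29 degrees of freedom); SPSS computes it internally.

```python
import math

# Summary statistics for Height from the Explore output above.
n, mean, sd = 30, 144.00, 17.388
minimum, maximum = 121, 188

# Assumed two-sided 95% critical value of the t distribution with n - 1 = 29 df.
t_crit = 2.045

std_error = sd / math.sqrt(n)          # standard error of the mean
lower = mean - t_crit * std_error      # lower bound of the 95% CI
upper = mean + t_crit * std_error      # upper bound of the 95% CI
value_range = maximum - minimum        # range = maximum - minimum

print(f"Std. Error = {std_error:.3f}")
print(f"95% CI     = ({lower:.2f}, {upper:.2f})")
print(f"Range      = {value_range}")
```

The results agree with the table (3.175, 137.51 to 150.49, and 67), which is a quick way to check that an output table has been read correctly.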

• Analyze > Compare Means>Means… (Example dataset: DataExcel.sav)

Figure 39 Analyze > Compare Means > Means…



o SPSS output: descriptive statistics by Age (categorized) and Gender

Report: Weight

Age (categorical variable)  Gender  N    Mean    Std. Deviation  Median  Minimum  Maximum
< 20 years                  F       10   131.60  36.056          111.00  100      197
                            M       3    169.00  30.050          167.00  140      200
                            Total   13   140.23  37.343          140.00  100      200
20-29 years                 F       6    162.33  35.770          169.00  115      199
                            M       7    146.29  34.369          161.00  111      198
                            Total   13   153.69  34.541          161.00  111      199
>= 30 years                 F       2    138.50  38.891          138.50  111      166
                            M       2    158.00  2.828           158.00  156      160
                            Total   4    148.25  25.171          158.00  111      166
Total                       F       18   142.61  36.934          133.00  100      199
                            M       12   153.92  30.189          160.50  111      200
                            Total   30   147.13  34.309          158.00  100      200

• Analyze > Descriptive Statistics>Frequencies… (Example dataset: DataExcel.sav)

Figure 40 Analyze > Descriptive Statistics>Frequencies...



o SPSS output:

Statistics

                     Weight   Height
N            Valid   30       30
             Missing 0        0
Mean                 147.13   144.00
Median               158.00   144.00
Percentiles  10      102.80   124.10
             25      111.00   130.00
             30      111.90   130.30
             50      158.00   144.00
             70      166.70   151.40
             75      169.75   154.25

8.2. Descriptive statistics for categorical data (Nominal, Ordinal)

- Frequency table, Cross table

• SPSS Menu to perform descriptive analysis for categorical data

o Analyze > Descriptive Statistics > Frequencies…
o Analyze > Descriptive Statistics > Crosstabs…

• Analyze > Descriptive Statistics>Frequencies…

(Example dataset: DataExcel.sav)

Figure 41 Analyze > Descriptive Statistics>Frequencies...



o SPSS output:

Gender

                 Frequency  Percent  Valid Percent  Cumulative Percent
Valid   F        18         60.0     60.0           60.0
        M        12         40.0     40.0           100.0
        Total    30         100.0    100.0

Ethnicity

                 Frequency  Percent  Valid Percent  Cumulative Percent
Valid   A        11         36.7     36.7           36.7
        B        5          16.7     16.7           53.3
        O        5          16.7     16.7           70.0
        W        9          30.0     30.0           100.0
        Total    30         100.0    100.0

Age (categorical variable)

                       Frequency  Percent  Valid Percent  Cumulative Percent
Valid   < 20 years     13         43.3     43.3           43.3
        20-29 years    13         43.3     43.3           86.7
        >= 30 years    4          13.3     13.3           100.0
        Total          30         100.0    100.0



• Analyze > Descriptive Statistics > Crosstabs… (Example dataset: DataExcel.sav)

Figure 42 Analyze >Descriptive Statistics>Crosstabs...

o SPSS output:

Gender * Ethnicity Crosstabulation

                                   Ethnicity
                                   A        B        O        W        Total
Gender  F   Count                  6        3        3        6        18
            % within Gender        33.3%    16.7%    16.7%    33.3%    100.0%
            % within Ethnicity     54.5%    60.0%    60.0%    66.7%    60.0%
            % of Total             20.0%    10.0%    10.0%    20.0%    60.0%
        M   Count                  5        2        2        3        12
            % within Gender        41.7%    16.7%    16.7%    25.0%    100.0%
            % within Ethnicity     45.5%    40.0%    40.0%    33.3%    40.0%
            % of Total             16.7%    6.7%     6.7%     10.0%    40.0%
Total       Count                  11       5        5        9        30
            % within Gender        36.7%    16.7%    16.7%    30.0%    100.0%
            % within Ethnicity     100.0%   100.0%   100.0%   100.0%   100.0%
            % of Total             36.7%    16.7%    16.7%    30.0%    100.0%



8.3. Generating graphs (or charts) for continuous data (Interval, Ratio)

- Histogram, Box-plot, Stem-and-Leaf plot
- Error bar chart, Scatter plot etc.


Figure 43 SPSS Menu-Graphs

• Histogram: Graphs > Legacy Dialogs > Histogram…

Figure 44 Graph menu - Histogram



• Box-plot: Graphs > Legacy Dialogs > Boxplot…

Figure 45 Graph menu - Boxplot

• Error bar: Graphs > Legacy Dialogs > Errorbar…

Figure 46 Graph menu - Errorbar



• Scatter/Dot plot: Graphs > Legacy Dialogs > Scatter/Dot…

Figure 47 Graph menu – Scatter/Dot

• Stem-and-Leaf: Analyze > Descriptive Statistics > Explore… (see the Explore output in Section 8.1)



8.4. Generating graphs (or charts) for categorical data (Nominal, Ordinal)

- Bar, Pie chart, Line, Area chart etc.


• Bar chart: Graphs > Legacy Dialogs > Bar… Analyze > Descriptive Statistics>Explore…

Figure 48 Graph menu – Bar chart

• Line chart: Graphs > Legacy Dialogs > Line…

Figure 49 Graph menu – Line chart



• Area chart: Graphs > Legacy Dialogs > Area…

Figure 50 Graph menu – Area chart

• Pie chart: Graphs > Legacy Dialogs > Pie…

Figure 51 Graph menu – Pie chart



8.5. Using Chart Builder

o SPSS Menu: Graphs > Chart Builder

o Example (Data: DataExcel.sav)



8.6. Chart Edit Window

• To edit a chart in the SPSS output window, double-click the chart; it then opens in the “Chart Edit Window”, where you can edit it.

Figure 52 Chart Edit Window

• Example:



9. Compare Means (T-test)

9.1. Independent sample t-test with two groups

o SPSS Menu: Analyze > Compare Means > Independent-Samples T Test…
o Example (Graze.sav): Taken from Huntsberger and Billingsley (1989). The example compares two grazing methods using 32 cattle. Half of the cattle are allowed to graze continuously while the other half are subjected to controlled grazing time. The researchers want to know whether these two grazing methods affect weight gain differently.

Figure 53 Independent-samples T test

o SPSS output:

Group Statistics

              GrazeType    N    Mean    Std. Deviation  Std. Error Mean
WeightGain    continuous   16   75.19   33.812          8.453
              controlled   16   83.13   30.535          7.634

Independent Samples Test

                               Levene's Test for
                               Equality of Variances   t-test for Equality of Means
                                                                                                                          95% Confidence Interval
                                                                                                                          of the Difference
WeightGain                     F      Sig.             t      df      Sig. (2-tailed)  Mean Difference  Std. Error Difference  Lower     Upper
Equal variances assumed        .085   .773             -.697  30      .491             -7.938           11.390                 -31.198   15.323
Equal variances not assumed                            -.697  29.694  .491             -7.938           11.390                 -31.208   15.333

o Interpretation: A test statistic for the equality of means is reported for both equal and unequal variances. Both tests indicate a lack of evidence for a significant difference between grazing methods (t=-0.697, p=0.491 for the pooled test with equal variances assumed, and t=-0.697, p=0.491 for the Satterthwaite test with equal variances not assumed). Levene’s test for equality of variances does not indicate a significant difference between the two variances (F=0.085, p=0.773). Note that the t-test assumes that the observations in both groups are normally distributed.
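The two t statistics in the output can be reproduced from the “Group Statistics” table alone. The Python sketch below is an illustration rather than the SPSS implementation: it uses the rounded means and standard deviations printed above, so results can differ from SPSS in the last digit. It computes the pooled (equal variances) test and the Satterthwaite (unequal variances) degrees of freedom.

```python
import math

# Group statistics from the output above (rounded values).
n1, mean1, sd1 = 16, 75.19, 33.812   # continuous grazing
n2, mean2, sd2 = 16, 83.13, 30.535   # controlled grazing

diff = mean1 - mean2

# Pooled test: one common variance estimate with n1 + n2 - 2 df.
pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
se_pooled = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_pooled = diff / se_pooled
df_pooled = n1 + n2 - 2

# Unequal-variance test: separate variances, Satterthwaite df.
v1, v2 = sd1**2 / n1, sd2**2 / n2
se_unequal = math.sqrt(v1 + v2)
t_unequal = diff / se_unequal
df_unequal = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

print(f"pooled:  t = {t_pooled:.3f}, df = {df_pooled}")
print(f"unequal: t = {t_unequal:.3f}, df = {df_unequal:.3f}")
```

Because both groups have the same size, the two standard errors coincide and only the degrees of freedom differ (30 versus about 29.7), which is why the two rows of the SPSS table are nearly identical.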



9.2. Paired samples t-test

o SPSS Menu: Analyze > Compare Means > Paired-Samples T Test…
o Example (Pressure.sav): A stimulus is being examined to determine its effect on systolic blood pressure. Twelve men participate in the study. Each man’s systolic blood pressure is measured both before and after the stimulus is applied.

Figure 54 Paired-samples T test

o SPSS output:

Paired Samples Statistics

                     Mean     N    Std. Deviation  Std. Error Mean
Pair 1   SBPbefore   128.67   12   6.933           2.001
         SBPafter    130.50   12   5.916           1.708

Paired Samples Correlations

                                N    Correlation  Sig.
Pair 1   SBPbefore & SBPafter   12   .598         .040

Paired Samples Test

                                Paired Differences
                                                                          95% Confidence Interval
                                                                          of the Difference
                                Mean     Std. Deviation  Std. Error Mean  Lower     Upper     t        df   Sig. (2-tailed)
Pair 1   SBPbefore - SBPafter   -1.833   5.828           1.683            -5.536    1.870     -1.090   11   .299

o Interpretation: The variables SBPbefore and SBPafter are the paired variables, with a sample size of 12. The summary statistics of the difference are displayed (mean, standard deviation, and standard error) along with the 95% confidence interval of the mean difference. The test is not significant (t=-1.09, p=0.299), indicating that the stimulus did not significantly affect systolic blood pressure.
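The paired t statistic is simply the mean difference divided by its standard error. The Python sketch below recomputes the “Paired Samples Test” row from the printed mean and standard deviation of the differences. The critical value 2.201 is an assumption taken from a standard t table (two-sided 95%, 11 degrees of freedom), and small discrepancies from the SPSS output are expected because the inputs are rounded.

```python
import math

# Paired differences (SBPbefore - SBPafter) from the output above.
n = 12
mean_diff = -1.833
sd_diff = 5.828

# Assumed two-sided 95% critical value of the t distribution with n - 1 = 11 df.
t_crit = 2.201

se_diff = sd_diff / math.sqrt(n)   # standard error of the mean difference
t_stat = mean_diff / se_diff       # paired t statistic
df = n - 1
lower = mean_diff - t_crit * se_diff
upper = mean_diff + t_crit * se_diff

print(f"SE = {se_diff:.3f}, t = {t_stat:.3f}, df = {df}")
print(f"95% CI of the difference = ({lower:.3f}, {upper:.3f})")
```

The confidence interval contains zero, which is the same conclusion as the non-significant p-value in the table.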



10. Compare proportions (Analysis of contingency table) and association

10.1. Pearson’s Chi-Square test

o SPSS Menu: Analyze > Descriptive Statistics > Crosstabs…
o Example (Color.sav): The eye and hair color of children from two different regions of Europe are recorded in the data set Color. Instead of recording one observation per child, the data are recorded as cell counts, where the variable Count contains the number of children exhibiting each of the 15 eye and hair color combinations. The data set does not include missing combinations.

Figure 55 Pearson’s Chi-square test

o SPSS output:

Chi-Square Tests

                               Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square             20.925(a)  8    .007
Likelihood Ratio               25.973     8    .001
Linear-by-Linear Association   3.229      1    .072
N of Valid Cases               762

a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 5.75.

Eye Color * Hair Color Crosstabulation

                                      Hair Color
Eye Color                             black    dark     fair     medium   red      Total
blue    Count                         6        51       69       68       28       222
        Expected Count                6.4      53.0     66.4     63.2     32.9     222.0
        % within Eye Color            2.7%     23.0%    31.1%    30.6%    12.6%    100.0%
        % within Hair Color           27.3%    28.0%    30.3%    31.3%    24.8%    29.1%
brown   Count                         16       94       90       94       47       341
        Expected Count                9.8      81.4     102.0    97.1     50.6     341.0
        % within Eye Color            4.7%     27.6%    26.4%    27.6%    13.8%    100.0%
        % within Hair Color           72.7%    51.6%    39.5%    43.3%    41.6%    44.8%
green   Count                         0        37       69       55       38       199
        Expected Count                5.7      47.5     59.5     56.7     29.5     199.0
        % within Eye Color            0.0%     18.6%    34.7%    27.6%    19.1%    100.0%
        % within Hair Color           0.0%     20.3%    30.3%    25.3%    33.6%    26.1%
Total   Count                         22       182      228      217      113      762
        Expected Count                22.0     182.0    228.0    217.0    113.0    762.0
        % within Eye Color            2.9%     23.9%    29.9%    28.5%    14.8%    100.0%
        % within Hair Color           100.0%   100.0%   100.0%   100.0%   100.0%   100.0%

o Interpretation: The SPSS output displays the chi-square statistics. The alternative hypothesis for this analysis states that eye color is associated with hair color. With p-value=0.007, the null hypothesis of no association is rejected and the alternative hypothesis is supported.
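The Pearson chi-square statistic is built from the observed and expected counts in the crosstabulation. The Python sketch below recomputes it from the observed counts printed above, with expected count = row total × column total / N. It is an illustration of the formula, not the SPSS procedure, and the variable names are hypothetical.

```python
# Observed counts from the Eye Color * Hair Color table above.
# Columns: black, dark, fair, medium, red.
observed = {
    "blue":  [6, 51, 69, 68, 28],
    "brown": [16, 94, 90, 94, 47],
    "green": [0, 37, 69, 55, 38],
}

rows = list(observed.values())
row_totals = [sum(row) for row in rows]
col_totals = [sum(col) for col in zip(*rows)]
n = sum(row_totals)

chi_square = 0.0
min_expected = float("inf")
for row, row_total in zip(rows, row_totals):
    for obs, col_total in zip(row, col_totals):
        expected = row_total * col_total / n
        chi_square += (obs - expected) ** 2 / expected
        min_expected = min(min_expected, expected)

df = (len(rows) - 1) * (len(col_totals) - 1)

print(f"N = {n}, chi-square = {chi_square:.3f}, df = {df}")
print(f"minimum expected count = {min_expected:.2f}")
```

The minimum expected count matters because the chi-square approximation is usually considered reliable only when expected counts are not too small; this is what footnote a in the SPSS table reports.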



10.2. Fisher’s exact test

o SPSS Menu: Analyze > Descriptive Statistics > Crosstabs…
o Example (FatComp.sav): This example computes chi-square tests and Fisher’s exact test to compare the probability of coronary heart disease for two types of diet. It also estimates the relative risks and computes exact confidence limits for the odds ratio. The data set “FatComp.sav” contains hypothetical data for a case-control study of high fat diet and the risk of coronary heart disease. The data are recorded as cell counts, where the variable Count contains the frequencies for each exposure and response combination. The data set is sorted in descending order by the variables Exposure and Response, so that the first cell of the 2 by 2 table contains the frequency of positive exposure and positive response.

Figure 56 Fisher’s exact test

o SPSS output:

Heart Disease * Exposure Crosstabulation

                                                   Exposure
Heart Disease                                      No      Yes     Total
Low Cholesterol Diet     Count                     6       4       10
                         Expected Count            3.5     6.5     10.0
High Cholesterol Diet    Count                     2       11      13
                         Expected Count            4.5     8.5     13.0
Total                    Count                     8       15      23
                         Expected Count            8.0     15.0    23.0

Chi-Square Tests

                               Value     df   Asymp. Sig. (2-sided)  Exact Sig. (2-sided)  Exact Sig. (1-sided)
Pearson Chi-Square             4.960(a)  1    .026
Continuity Correction(b)       3.188     1    .074
Likelihood Ratio               5.098     1    .024
Fisher's Exact Test                                                  .039                  .037
Linear-by-Linear Association   4.744     1    .029
N of Valid Cases               23

a. 2 cells (50.0%) have expected count less than 5. The minimum expected count is 3.48.
b. Computed only for a 2x2 table

Risk Estimate

                                                                               95% Confidence Interval
                                                                      Value    Lower    Upper
Odds Ratio for Heart Disease
(Low Cholesterol Diet / High Cholesterol Diet)                        8.250    1.154    59.003
For cohort Exposure = No                                              3.900    .989     15.373
For cohort Exposure = Yes                                             .473     .214     1.045
N of Valid Cases                                                      23

o Interpretation: The SPSS output displays the chi-square statistics. Because the expected counts in some of the table cells are small, the output gives a warning that the asymptotic chi-square tests might not be appropriate. In this case, the exact tests are appropriate. The alternative hypothesis for this analysis states that coronary heart disease is more likely to be associated with a high fat diet, so a one-sided test is desired. Fisher’s exact right-sided test analyzes whether the probability of heart disease in the high fat group exceeds the probability of heart disease in the low fat group; because this p-value is small (p=0.037), the alternative hypothesis is supported. The odds ratio, displayed in the “Risk Estimate” table, provides an estimate of the relative risk when an event is rare. This estimate indicates that the odds of heart disease are 8.25 times higher in the high fat diet group; however, the wide confidence limits (1.154 to 59.003) indicate that this estimate has low precision.
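Fisher’s exact test and the odds ratio can be checked by hand for a 2 by 2 table. The Python sketch below uses the hypergeometric distribution with all margins fixed. It is an illustration under stated assumptions: the two-sided p-value sums all tables that are no more probable than the observed one, and the confidence interval is the large-sample (Woolf) interval on the log odds ratio with z = 1.96, which happens to agree with the “Risk Estimate” table here. The variable names are hypothetical.

```python
from math import comb, exp, log, sqrt

# Counts from the crosstabulation above, rows = diet, columns = Exposure (No, Yes).
a, b = 6, 4     # Low Cholesterol Diet
c, d = 2, 11    # High Cholesterol Diet

row1, col1, n = a + b, a + c, a + b + c + d


def table_probability(k):
    """Hypergeometric probability that the first cell equals k, margins fixed."""
    return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)


k_min = max(0, row1 - (n - col1))
k_max = min(row1, col1)
p_observed = table_probability(a)

# One-sided: tables with a first cell at least as large as the observed one.
p_one_sided = sum(table_probability(k) for k in range(a, k_max + 1))

# Two-sided: all tables no more probable than the observed table.
p_two_sided = sum(
    table_probability(k)
    for k in range(k_min, k_max + 1)
    if table_probability(k) <= p_observed * (1 + 1e-9)
)

odds_ratio = (a * d) / (b * c)
se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci_lower = exp(log(odds_ratio) - 1.96 * se_log_or)
ci_upper = exp(log(odds_ratio) + 1.96 * se_log_or)

print(f"Fisher exact: two-sided p = {p_two_sided:.3f}, one-sided p = {p_one_sided:.3f}")
print(f"Odds ratio = {odds_ratio:.3f}, 95% CI = ({ci_lower:.3f}, {ci_upper:.3f})")
```

The very wide interval comes from the small cell counts: each count contributes 1/count to the variance of the log odds ratio, so a cell of 2 alone contributes 0.5.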

10.3. Cochran-Mantel-Haenszel (CMH) Statistics

o SPSS Menu: Analyze > Descriptive Statistics > Crosstabs…
o Example (Migraine.sav): The data set Migraine contains hypothetical data for a clinical trial of migraine treatment. Subjects of both genders receive either a new drug therapy or a placebo. Their response to treatment is coded as 'Better' or 'Same'. The data are recorded as cell counts, and the number of subjects for each treatment and response combination is recorded in the variable Count.

Figure 57 CMH test



o SPSS output:

Treatment * Response * Gender Crosstabulation

                                               Response
Gender                                         Better   Same     Total
female   Treatment   Active    Count           16       11       27
                               Expected Count  10.9     16.1     27.0
                     Placebo   Count           5        20       25
         Total                 Expected Count  21.0     31.0     52.0
male     Treatment   Active    Count           12       16       28
         Total                 Expected Count  19.0     35.0     54.0
Total    Treatment   Active    Count           28       27       55
         Total                 Expected Count  40.0     66.0     106.0

Chi-Square Tests

Gender                               Value     df   Asymp. Sig. (2-sided)  Exact Sig. (2-sided)  Exact Sig. (1-sided)
female   Pearson Chi-Square          8.310(c)  1    .004
         Continuity Correction(b)    6.759     1    .009
         Likelihood Ratio            8.633     1    .003
         Fisher's Exact Test                                               .005                  .004
         N of Valid Cases            52
male     Pearson Chi-Square          1.501(d)  1    .221
         Continuity Correction(b)    .884      1    .347
         Likelihood Ratio            1.515     1    .218
         Fisher's Exact Test                                               .264                  .174
         N of Valid Cases            54
Total    Pearson Chi-Square          8.443(a)  1    .004
         Continuity Correction(b)    7.318     1    .007
         Likelihood Ratio            8.626     1    .003
         Fisher's Exact Test                                               .005                  .003
         N of Valid Cases            106

a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 19.25.
b. Computed only for a 2x2 table
c. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 10.10.
d. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 9.15.

Risk Estimate

                                                                 95% Confidence Interval
Gender                                                  Value    Lower    Upper
female   Odds Ratio for Treatment (Active / Placebo)    5.818    1.676    20.203
         For cohort Response = Better                   2.963    1.274    6.891
         For cohort Response = Same                     .509     .310     .836
         N of Valid Cases                               52
male     Odds Ratio for Treatment (Active / Placebo)    2.036    .648     6.398
         For cohort Response = Better                   1.592    .741     3.418
         For cohort Response = Same                     .782     .526     1.163
         N of Valid Cases                               54
Total    Odds Ratio for Treatment (Active / Placebo)    3.370    1.462    7.772
         For cohort Response = Better                   2.164    1.237    3.783
         For cohort Response = Same                     .642     .471     .875
         N of Valid Cases                               106

Tests of Homogeneity of the Odds Ratio

               Chi-Squared  df   Asymp. Sig. (2-sided)
Breslow-Day    1.493        1    .222
Tarone's       1.491        1    .222

Tests of Conditional Independence

                   Chi-Squared  df   Asymp. Sig. (2-sided)
Cochran's          8.465        1    .004
Mantel-Haenszel    7.198        1    .007

Under the conditional independence assumption, Cochran's statistic is asymptotically distributed as a 1 df chi-squared distribution, only if the number of strata is fixed, while the Mantel-Haenszel statistic is always asymptotically distributed as a 1 df chi-squared distribution. Note that the continuity correction is removed from the Mantel-Haenszel statistic when the sum of the differences between the observed and the expected is 0.

Mantel-Haenszel Common Odds Ratio Estimate

Estimate                                                       3.313
ln(Estimate)                                                   1.198
Std. Error of ln(Estimate)                                     .423
Asymp. Sig. (2-sided)                                          .005
Asymp. 95% Confidence   Common Odds Ratio       Lower Bound    1.446
Interval                                        Upper Bound    7.593
                        ln(Common Odds Ratio)   Lower Bound    .369
                                                Upper Bound    2.027

The Mantel-Haenszel common odds ratio estimate is asymptotically normally distributed under the common odds ratio of 1.000 assumption. So is the natural log of the estimate.

o Interpretation: The SPSS output above displays the CMH statistics.

Breslow-Day test: The large p-value for the Breslow-Day test (p-value=0.222) indicates no significant gender difference in the odds ratios.

CMH test: The significant p-value (p=0.004) indicates that the association between treatment and response remains strong after adjusting for gender.

The CMH statistics option in the Statistics window (see Figure 57) also produces a table of overall relative risks. Because this is a prospective study, the relative risk estimate assesses the effectiveness of the new drug; the “For cohort Response = Better” values are the appropriate estimates for the first column (the risk of improvement). The probability of migraine improvement with the new drug is just over two times the probability of improvement with the placebo (Relative risk=2.164).
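The Mantel-Haenszel common odds ratio pools the stratum-specific 2 by 2 tables, weighting each stratum by its size. The Python sketch below recomputes it for the two gender strata. One assumption is labeled in the code: the male placebo counts (7 Better, 19 Same) are not printed in the crosstabulation above and are inferred here from the male column totals (19 and 35) minus the male active counts.

```python
# Each stratum: (active better, active same, placebo better, placebo same).
strata = {
    "female": (16, 11, 5, 20),
    # Assumption: placebo counts inferred from the male column totals 19 and 35.
    "male": (12, 16, 7, 19),
}

numerator = 0.0
denominator = 0.0
stratum_odds_ratios = {}
for gender, (a, b, c, d) in strata.items():
    n = a + b + c + d
    stratum_odds_ratios[gender] = (a * d) / (b * c)
    numerator += a * d / n      # Mantel-Haenszel numerator term
    denominator += b * c / n    # Mantel-Haenszel denominator term

common_odds_ratio = numerator / denominator

for gender, value in stratum_odds_ratios.items():
    print(f"{gender}: odds ratio = {value:.3f}")
print(f"Mantel-Haenszel common odds ratio = {common_odds_ratio:.3f}")
```

The stratum odds ratios reproduce the “Risk Estimate” table (5.818 and 2.036), and the pooled value lies between them, close to the crude odds ratio of 3.370, which suggests that gender does not strongly confound the treatment effect in these data.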



10.4. McNemar’s test for matched pairs data

• Common subjects being observed under 2 conditions (2 treatments, before/after, 2 diagnostic tests) in a crossover setting
• Two possible outcomes (Presence/Absence of Characteristic) on each measurement
• Four possibilities for each subject with respect to outcome:
– Present in both conditions
– Absent in both conditions
– Present in Condition 1, Absent in Condition 2
– Absent in Condition 1, Present in Condition 2

o SPSS Menu: Analyze > Descriptive Statistics > Crosstabs…
o Example (PrimeMinister.sav): From the data, we want to compare the probabilities of approval for the prime minister’s performance at the times of two surveys.

Figure 58 McNemar’s test

o SPSS output:

First survey * Second survey Crosstabulation

                                                  Second survey
First survey                                      Approve   Disapprove   Total
Approve      Count                                794       150          944
             % within First survey                84.1%     15.9%        100.0%
Disapprove   Count                                86        570          656
             % within First survey                13.1%     86.9%        100.0%
Total        Count                                880       720          1600
             % within First survey                55.0%     45.0%        100.0%

Symmetric Measures

                                              Value   Asymp. Std. Error(a)   Approx. T(b)   Approx. Sig.
Interval by Interval   Pearson's R            .702    .018                   39.396         .000(c)
Ordinal by Ordinal     Spearman Correlation   .702    .018                   39.396         .000(c)
N of Valid Cases                              1600

a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
c. Based on normal approximation.

o Interpretation: The SPSS output above displays the result of McNemar’s test for matched pair data. There is a significant difference between the probabilities of approval for the prime minister’s performance at the times of the two surveys (p-value <0.001); i.e., we have strong evidence of a drop in the approval rating (from 944/1600 = 59.0% in the first survey to 880/1600 = 55.0% in the second).
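McNemar’s test uses only the discordant pairs, that is, the respondents who changed their answer between the two surveys. The Python sketch below computes the large-sample chi-square form of the statistic without continuity correction from the crosstabulation above. This is an illustration: SPSS may compute the p-value differently (for example with an exact binomial method), so only the order of magnitude should be compared.

```python
import math

# Discordant cells from the First survey * Second survey table above.
approve_then_disapprove = 150
disapprove_then_approve = 86
n_total = 1600

b, c = approve_then_disapprove, disapprove_then_approve

# McNemar chi-square statistic with 1 df (no continuity correction).
chi_square = (b - c) ** 2 / (b + c)

# Upper tail of a chi-square distribution with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

first_rate = 944 / n_total    # approval rate in the first survey
second_rate = 880 / n_total   # approval rate in the second survey

print(f"chi-square = {chi_square:.3f}, p = {p_value:.6f}")
print(f"approval: first = {first_rate:.1%}, second = {second_rate:.1%}")
```

The 1364 respondents who gave the same answer twice (794 + 570) do not enter the statistic at all; the evidence for a drop comes from 150 people moving from approval to disapproval against 86 moving the other way.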



10.5. Measure of Agreement (Cohen’s Kappa)

o Cohen's kappa measures the agreement between the evaluations of two raters when both are rating the same object. A value of 1 indicates perfect agreement. A value of 0 indicates that agreement is no better than chance. Kappa is based on a square table in which row and column values represent the same scale. Any cell that has observed values for one variable but not the other is assigned a count of 0. Kappa is not computed if the data storage type (string or numeric) is not the same for the two variables. For string variables, both variables must have the same defined length.

o A value of kappa higher than 0.75 indicates excellent agreement, while a value lower than 0.4 indicates poor agreement.

o SPSS Menu: Analyze > Descriptive Statistics > Crosstabs…
o Example (Dermatology.sav): Two dermatologists evaluate the skin condition of 88 people. From the data, we want to know whether the two dermatologists’ evaluations agree.

Figure 59 Cohen’s Kappa

o SPSS output:

Derm1 * Derm2 Crosstabulation (Count)

                     Derm2
Derm1                clear   marginal   poor   terrible   Total
clear                13      6          2      0          21
marginal             5       12         4      2          23
poor                 2       12         10     5          29
terrible             0       1          4      10         15
Total                20      31         20     17         88

Symmetric Measures

                               Value   Asymp. Std. Error(a)   Approx. T(b)   Approx. Sig.
Measure of Agreement   Kappa   .345    .072                   5.637          .000
N of Valid Cases               88

a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.

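o Interpretation: From the SPSS output, the estimated Cohen’s Kappa is 0.345. The test of the null hypothesis that kappa equals 0 is significant (p-value<0.001), so the agreement is better than chance; however, a kappa value below 0.4 indicates poor agreement between the two dermatologists.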
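Kappa compares the observed proportion of agreement (the diagonal of the table) with the agreement expected by chance from the row and column totals. The Python sketch below recomputes it from the Derm1 * Derm2 counts above; the variable names are hypothetical.

```python
# Rows = Derm1, columns = Derm2, categories: clear, marginal, poor, terrible.
table = [
    [13, 6, 2, 0],
    [5, 12, 4, 2],
    [2, 12, 10, 5],
    [0, 1, 4, 10],
]

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

# Observed agreement: proportion of cases on the diagonal.
p_observed = sum(table[i][i] for i in range(len(table))) / n

# Chance agreement: sum over categories of (row share * column share).
p_chance = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2

kappa = (p_observed - p_chance) / (1 - p_chance)

print(f"observed agreement = {p_observed:.3f}")
print(f"chance agreement   = {p_chance:.3f}")
print(f"kappa              = {kappa:.3f}")
```

The two dermatologists agree on 45 of the 88 patients (about 51%), but roughly 25% agreement would be expected by chance alone, which is why kappa is much lower than the raw agreement rate.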



10.6. Studies in medical science (Review)

Prospective studies

A prospective study watches for outcomes, such as the development of a disease, during the study period and relates this to other factors such as suspected risk or protection factor(s). The study usually involves taking a cohort of subjects and watching them over a long period. The outcome of interest should be common; otherwise, the number of outcomes observed will be too small to be statistically meaningful (indistinguishable from those that may have arisen by chance). All efforts should be made to avoid sources of bias such as the loss of individuals to follow up during the study. Prospective studies usually have fewer potential sources of bias and confounding than retrospective studies.

Retrospective studies

A retrospective study looks backwards and examines exposures to suspected risk or protection factors in relation to an outcome that is established at the start of the study. Many valuable case-control studies, such as Lane-Claypon's 1926 investigation of risk factors for breast cancer, were retrospective investigations. Most sources of error due to confounding and bias are more common in retrospective studies than in prospective studies. For this reason, retrospective investigations are often criticised. If the outcome of interest is uncommon, however, the size of prospective investigation required to estimate relative risk is often too large to be feasible. In retrospective studies the odds ratio provides an estimate of relative risk. You should take special care to avoid sources of bias and confounding in retrospective studies. Prospective investigation is required to make precise estimates of either the incidence of an outcome or the relative risk of an outcome based on exposure.

Case-Control studies

Case-Control studies are usually but not exclusively retrospective; the opposite is true for cohort studies. The following notes relate case-control to cohort studies:

• outcome is measured before exposure
• controls are selected on the basis of not having the outcome
• good for rare outcomes
• relatively inexpensive
• smaller numbers required
• quicker to complete
• prone to selection bias
• prone to recall/retrospective bias
• related methods are risk (retrospective), chi-square 2 by 2 test, Fisher's exact test, exact confidence interval for odds ratio, odds ratio meta-analysis and conditional logistic regression.

Cohort studies

Cohort studies are usually but not exclusively prospective; the opposite is true for case-control studies. The following notes relate cohort to case-control studies:

• outcome is measured after exposure
• yields true incidence rates and relative risks
• may uncover unanticipated associations with outcome
• best for common outcomes
• expensive
• requires large numbers
• takes a long time to complete
• prone to attrition bias (compensate by using person-time methods)
• prone to the bias of change in methods over time
• related methods are risk (prospective), relative risk meta-analysis, risk difference meta-analysis and proportions



11. ANOVA, ANCOVA, and MANOVA

11.1. One-way ANOVA

- The One-Way ANOVA procedure produces a one-way analysis of variance for a quantitative dependent variable by a single factor (independent) variable. Analysis of variance is used to test the hypothesis that several means are equal. This technique is an extension of the two-sample t test.

- The assumptions of analysis of variance are that treatment effects are additive and experimental errors are independently random with a normal distribution that has mean zero and constant variance.

- Once you have determined that differences exist among the means, post hoc range tests and pairwise multiple comparisons can determine which means differ. Range tests identify homogeneous subsets of means that are not different from each other. Pairwise multiple comparisons test the difference between each pair of means and yield a matrix where asterisks indicate significantly different group means at an alpha level of 0.05.

o SPSS Menu: Analyze > Compare Means > One-Way ANOVA...
o Example (Clover.sav): The following example studies the effect of bacteria on the nitrogen content of red clover plants. The treatment factor is bacteria strain, and it has six levels. Five of the six levels consist of five different Rhizobium trifolii bacteria cultures combined with a composite of five Rhizobium meliloti strains. The sixth level is a composite of the five Rhizobium trifolii strains with the composite of the Rhizobium meliloti. Red clover plants are inoculated with the treatments, and nitrogen content is later measured in milligrams.

Figure 60 One-way ANOVA

o SPSS output:

Descriptives

Nitrogen
                                                     95% Confidence Interval for Mean
          N    Mean      Std. Deviation  Std. Error  Lower Bound   Upper Bound   Minimum   Maximum
3DOK1     5    28.8200   5.80017         2.59392     21.6181       36.0219       19.40     33.00
3DOK13    5    13.2600   1.42759         .63844      11.4874       15.0326       11.60     14.40
3DOK4     5    14.6400   4.11619         1.84082     9.5291        19.7509       9.10      19.40
3DOK5     5    23.9800   3.77717         1.68920     19.2900       28.6700       17.70     27.90
3DOK7     5    19.9200   1.13004         .50537      18.5169       21.3231       18.60     21.00
COMPOS    5    18.7000   1.60156         .71624      16.7114       20.6886       16.90     20.80
Total     30   19.8867   6.24217         1.13966     17.5558       22.2175       9.10      33.00

Test of Homogeneity of Variances

Nitrogen
Levene Statistic   df1   df2   Sig.
3.145              5     24    .025

ANOVA

Nitrogen
                  Sum of Squares   df   Mean Square   F        Sig.
Between Groups    847.047          5    169.409       14.371   .000
Within Groups     282.928          24   11.789
Total             1129.975         29

Multiple Comparisons

Dependent Variable: Nitrogen
Tukey HSD
                                                                    95% Confidence Interval
(I) Strain   (J) Strain   Mean Difference (I-J)   Std. Error   Sig.   Lower Bound   Upper Bound
3DOK1        3DOK13       15.56000*               2.17151      .000   8.8458        22.2742
             3DOK4        14.18000*               2.17151      .000   7.4658        20.8942
             3DOK5        4.84000                 2.17151      .262   -1.8742       11.5542
             3DOK7        8.90000*                2.17151      .005   2.1858        15.6142
             COMPOS       10.12000*               2.17151      .001   3.4058        16.8342
3DOK13       3DOK1        -15.56000*              2.17151      .000   -22.2742      -8.8458
             3DOK4        -1.38000                2.17151      .987   -8.0942       5.3342
             3DOK5        -10.72000*              2.17151      .001   -17.4342      -4.0058
             3DOK7        -6.66000                2.17151      .053   -13.3742      .0542
             COMPOS       -5.44000                2.17151      .162   -12.1542      1.2742
3DOK4        3DOK1        -14.18000*              2.17151      .000   -20.8942      -7.4658
             3DOK13       1.38000                 2.17151      .987   -5.3342       8.0942
             3DOK5        -9.34000*               2.17151      .003   -16.0542      -2.6258
             3DOK7        -5.28000                2.17151      .185   -11.9942      1.4342
             COMPOS       -4.06000                2.17151      .443   -10.7742      2.6542
3DOK5        3DOK1        -4.84000                2.17151      .262   -11.5542      1.8742
             3DOK13       10.72000*               2.17151      .001   4.0058        17.4342
             3DOK4        9.34000*                2.17151      .003   2.6258        16.0542
             3DOK7        4.06000                 2.17151      .443   -2.6542       10.7742
             COMPOS       5.28000                 2.17151      .185   -1.4342       11.9942
3DOK7        3DOK1        -8.90000*               2.17151      .005   -15.6142      -2.1858
             3DOK13       6.66000                 2.17151      .053   -.0542        13.3742
             3DOK4        5.28000                 2.17151      .185   -1.4342       11.9942
             3DOK5        -4.06000                2.17151      .443   -10.7742      2.6542
             COMPOS       1.22000                 2.17151      .993   -5.4942       7.9342
COMPOS       3DOK1        -10.12000*              2.17151      .001   -16.8342      -3.4058
             3DOK13       5.44000                 2.17151      .162   -1.2742       12.1542
             3DOK4        4.06000                 2.17151      .443   -2.6542       10.7742
             3DOK5        -5.28000                2.17151      .185   -11.9942      1.4342
             3DOK7        -1.22000                2.17151      .993   -7.9342       5.4942

*. The mean difference is significant at the 0.05 level.
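The ANOVA table can be rebuilt from the “Descriptives” table, because the between-groups sum of squares depends only on the group means and the within-groups sum of squares only on the group standard deviations. The Python sketch below does this for the six strains. It is an illustration of the arithmetic, not the SPSS procedure, and uses the printed (rounded) summary values.

```python
import math

# (n, mean, standard deviation) per strain, from the Descriptives table above.
groups = {
    "3DOK1":  (5, 28.82, 5.80017),
    "3DOK13": (5, 13.26, 1.42759),
    "3DOK4":  (5, 14.64, 4.11619),
    "3DOK5":  (5, 23.98, 3.77717),
    "3DOK7":  (5, 19.92, 1.13004),
    "COMPOS": (5, 18.70, 1.60156),
}

n_total = sum(n for n, _, _ in groups.values())
k = len(groups)
grand_mean = sum(n * mean for n, mean, _ in groups.values()) / n_total

# Between groups: spread of the group means around the grand mean.
ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups.values())
# Within groups: pooled spread of the observations around their own group mean.
ss_within = sum((n - 1) * sd**2 for n, _, sd in groups.values())

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_statistic = ms_between / ms_within

# Standard error of a difference between two group means (n = 5 each),
# as used in the Tukey HSD table.
se_difference = math.sqrt(ms_within * (1 / 5 + 1 / 5))

print(f"SS between = {ss_between:.3f}, SS within = {ss_within:.3f}")
print(f"F({k - 1}, {n_total - k}) = {f_statistic:.3f}")
print(f"Std. Error of a pairwise difference = {se_difference:.5f}")
```

Because all groups have the same size, every pairwise comparison in the Tukey table shares the same standard error, which is why the Std. Error column is constant.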



11.2. Two-way ANOVA (With interaction)

o SPSS Menu: Analyze > General Linear Model > Univariate...
o Example (Drug.sav): This example uses data from Kutner (1974, p. 98) to illustrate a two-way analysis of variance. The original data source is Afifi and Azen (1972, p. 166).

Figure 61 Unbalanced two-way ANOVA with interaction

o SPSS output:

Tests of Between-Subjects Effects

Dependent Variable: y

Source Type III Sum of

Squares df Mean Square F Sig. Corrected Model 4259.339a 11 387.213 3.506 .001 Intercept 20037.613 1 20037.613 181.414 .000 drug 2997.472 3 999.157 9.046 .000 disease 415.873 2 207.937 1.883 .164 drug * disease 707.266 6 117.878 1.067 .396 Error 5080.817 46 110.453 Total 30013.000 58 Corrected Total 9340.155 57 a. R Squared = .456 (Adjusted R Squared = .326)



Multiple Comparisons
Dependent Variable: y
Tukey HSD

                                                                  95% Confidence Interval
(I) drug   (J) drug   Mean Difference (I-J)   Std. Error   Sig.   Lower Bound   Upper Bound
1          2             .53                  3.838        .999     -9.70        10.76
           3           17.32*                 4.070        .001      6.47        28.17
           4           12.57*                 3.777        .009      2.50        22.63
2          1            -.53                  3.838        .999    -10.76         9.70
           3           16.78*                 4.070        .001      5.93        27.63
           4           12.03*                 3.777        .013      1.97        22.10
3          1          -17.32*                 4.070        .001    -28.17        -6.47
           2          -16.78*                 4.070        .001    -27.63        -5.93
           4           -4.75                  4.013        .640    -15.45         5.95
4          1          -12.57*                 3.777        .009    -22.63        -2.50
           2          -12.03*                 3.777        .013    -22.10        -1.97
           3            4.75                  4.013        .640     -5.95        15.45

Based on observed means.
The error term is Mean Square(Error) = 110.453.
*. The mean difference is significant at the .05 level.
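The columns of the Tests of Between-Subjects Effects table are tied together by simple arithmetic: Mean Square = Sum of Squares / df, and F = Mean Square / Mean Square(Error). The sketch below is not SPSS syntax; it is a small Python check that re-derives the F ratios and the R Squared footnote from the printed sums of squares. The variable names are illustrative only.

```python
# Values copied from the "Tests of Between-Subjects Effects" table (Drug.sav).
sum_of_squares = {"drug": 2997.472, "disease": 415.873, "drug * disease": 707.266}
degrees_of_freedom = {"drug": 3, "disease": 2, "drug * disease": 6}
ss_error, df_error = 5080.817, 46
ss_model = 4259.339
ss_corrected_total, df_corrected_total = 9340.155, 57

# Mean Square(Error) is the denominator of every F ratio in the table.
ms_error = ss_error / df_error

f_ratio = {}
for source, ss in sum_of_squares.items():
    mean_square = ss / degrees_of_freedom[source]
    f_ratio[source] = mean_square / ms_error
    print(f"{source:15s} MS = {mean_square:8.3f}  F = {f_ratio[source]:6.3f}")

# R Squared is the share of the corrected total explained by the model;
# the adjusted version penalizes the number of model parameters.
r_squared = ss_model / ss_corrected_total
adj_r_squared = 1 - (1 - r_squared) * df_corrected_total / df_error
print(f"R Squared = {r_squared:.3f}, Adjusted R Squared = {adj_r_squared:.3f}")
```

Note that the three Type III sums of squares do not add up to the Corrected Model sum of squares. This is expected here because the design is unbalanced (see the caption of Figure 61).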



11.3. ANCOVA (Analysis of Covariance)

Two general applications exist for ANCOVA:

• Remove error variance in a randomized experiment: Participants are assigned to treatment and control groups in any ANOVA-type design. ANCOVA is then used to remove the variance in y that is explained by the covariate rather than by the treatment, which reduces the error term.

• Equating non-equivalent (intact) groups: A very controversial use of ANCOVA is to correct for initial group differences on y (present before assignment to x) that exist among several intact groups defined by a state variable.

o SPSS Menu: Analyze > General Linear Model > Univariate...
o Example (Cholesterol.sav): Cholesterol levels [mg/ml] for 30 women from two US states, Iowa and Nebraska. Age [years] may be a relevant covariate.

Figure 62 ANCOVA

o SPSS output:

Tests of Between-Subjects Effects
Dependent Variable: cholesterol

Source            Type III Sum of Squares   df   Mean Square   F        Sig.
Corrected Model   54432.754a                2    27216.377     14.965   .000
Intercept         16901.543                 1    16901.543     9.293    .005
age               53820.058                 1    53820.058     29.593   .000
state             5456.450                  1    5456.450      3.000    .095
Error             49103.913                 27   1818.663
Total             1473140.000               30
Corrected Total   103536.667                29

a. R Squared = .526 (Adjusted R Squared = .491)

o Exercise (Goat.sav): Experiments were carried out on six commercial goat farms to determine whether the standard worm drenching program was adequate. Forty goats were used in each experiment. Twenty of these, chosen completely at random, were drenched according to the standard program, while the remaining twenty were drenched more frequently. The goats were individually tagged, and weighed at the start and end of the year-long study. For the first farm in the study the resulting liveweight gains are given along with the initial liveweights. In each experiment the main interest was in the comparison of the liveweight gains between the two treatments.



11.4. MANOVA (Multivariate ANOVA)

MANOVA tests for differences among groups on more than one dependent variable simultaneously. ANOVA is a special case of MANOVA.

• MANOVA - This is a good option if there are two or more continuous dependent variables and one categorical predictor variable.
• Discriminant function analysis - This is a reasonable option and is equivalent to a one-way MANOVA.
• Multilevel model - The data could be reshaped into long format and analyzed as a multilevel model.
• Separate univariate ANOVAs - You could analyze these data using separate univariate ANOVAs for each response variable. The univariate ANOVA will not produce multivariate results utilizing information from all variables simultaneously. In addition, separate univariate tests are generally less powerful because they do not take into account the inter-correlation of the dependent variables.

Assumptions of MANOVA

• The response variables come from group populations that are multivariate normally distributed. This means that each of the dependent variables is normally distributed within each group, that any linear combination of the dependent variables is normally distributed, and that all subsets of the variables are multivariate normal. With respect to the Type I error rate, MANOVA tends to be robust to minor violations of the multivariate normality assumption.

• The population covariance matrices are homogeneous. This implies that the population variances and covariances of all dependent variables must be equal in all groups formed by the independent variables. This assumption is commonly checked with Box's M test.

• Small samples can have low power, but if the multivariate normality assumption is met, the MANOVA is generally more powerful than separate univariate tests.

o SPSS Menu: Analyze > General Linear Model > Multivariate...
o Example (MANOVA_Dietary.sav): A researcher randomly assigns 33 subjects to one of three groups. The first group receives technical dietary information interactively from an on-line website. Group 2 receives the same information from a nurse practitioner, while group 3 receives the information from a video tape made by the same nurse practitioner. The researcher looks at three different ratings of the presentation (difficulty, usefulness and importance) to determine if there is a difference in the modes of presentation. In particular, the researcher is interested in whether the interactive website is superior because that is the most cost-effective way of delivering the information.



Figure 63 MANOVA

o SPSS output:

Descriptive Statistics
             GROUP       Mean      Std. Deviation   N
USEFUL       Treatment   18.1182   3.90380          11
             Control1    15.5273   2.07562          11
             Control2    15.3455   3.13827          11
             Total       16.3303   3.29246          33
DIFFICULTY   Treatment   6.1909    1.89971          11
             Control1    5.5818    2.43426          11
             Control2    5.3727    1.75903          11
             Total       5.7152    2.01760          33
IMPORTANCE   Treatment   8.6818    4.86309          11
             Control1    5.1091    2.53119          11
             Control2    5.6364    3.54691          11
             Total       6.4758    3.98513          33

Multivariate Tests (a)
Effect                           Value    F          Hypothesis df   Error df   Sig.
Intercept   Pillai's Trace       .986     657.857b   3.000           28.000     .000
            Wilks' Lambda        .014     657.857b   3.000           28.000     .000
            Hotelling's Trace    70.485   657.857b   3.000           28.000     .000
            Roy's Largest Root   70.485   657.857b   3.000           28.000     .000
GROUP       Pillai's Trace       .477     3.025      6.000           58.000     .012
            Wilks' Lambda        .526     3.538b     6.000           56.000     .005
            Hotelling's Trace    .897     4.038      6.000           54.000     .002
            Roy's Largest Root   .892     8.623c     3.000           29.000     .000

a. Design: Intercept + GROUP
b. Exact statistic
c. The statistic is an upper bound on F that yields a lower bound on the significance level.
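The four multivariate statistics are different summaries of the same eigenvalues of the hypothesis-by-error matrix product. With three groups and three dependent variables there are at most two non-zero eigenvalues. The Python sketch below backs those two eigenvalues out of the printed GROUP rows, under the assumption that Roy's Largest Root is reported as the largest eigenvalue and Hotelling's Trace as the sum of the eigenvalues, and then recomputes the other two statistics. Because the inputs are rounded to three decimals, agreement is only approximate.

```python
# Printed values for the GROUP effect (MANOVA_Dietary.sav).
roys_largest_root = 0.892   # assumed to be the largest eigenvalue
hotellings_trace = 0.897    # assumed to be the sum of the eigenvalues

# Two non-zero eigenvalues: min(number of DVs, number of groups - 1) = 2.
eigenvalues = [roys_largest_root, hotellings_trace - roys_largest_root]

# Pillai's Trace sums lambda / (1 + lambda) over the eigenvalues.
pillais_trace = sum(ev / (1 + ev) for ev in eigenvalues)

# Wilks' Lambda multiplies 1 / (1 + lambda) over the eigenvalues.
wilks_lambda = 1.0
for ev in eigenvalues:
    wilks_lambda *= 1 / (1 + ev)

print(f"Pillai's Trace ~ {pillais_trace:.3f} (table: .477)")
print(f"Wilks' Lambda  ~ {wilks_lambda:.3f} (table: .526)")
```

The second eigenvalue is close to zero, which is why Roy's Largest Root and Hotelling's Trace are nearly identical for this example.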



Tests of Between-Subjects Effects

Source            Dependent Variable   Type III Sum of Squares   df   Mean Square   F         Sig.
Corrected Model   USEFUL               52.924a                   2    26.462        2.701     .083
                  DIFFICULTY           3.975b                    2    1.988         .472      .628
                  IMPORTANCE           81.830c                   2    40.915        2.879     .072
Intercept         USEFUL               8800.400                  1    8800.400      898.106   .000
                  DIFFICULTY           1077.878                  1    1077.878      256.054   .000
                  IMPORTANCE           1383.869                  1    1383.869      97.371    .000
GROUP             USEFUL               52.924                    2    26.462        2.701     .083
                  DIFFICULTY           3.975                     2    1.988         .472      .628
                  IMPORTANCE           81.830                    2    40.915        2.879     .072
Error             USEFUL               293.965                   30   9.799
                  DIFFICULTY           126.287                   30   4.210
                  IMPORTANCE           426.371                   30   14.212
Total             USEFUL               9147.290                  33
                  DIFFICULTY           1208.140                  33
                  IMPORTANCE           1892.070                  33
Corrected Total   USEFUL               346.890                   32
                  DIFFICULTY           130.262                   32
                  IMPORTANCE           508.201                   32

a. R Squared = .153 (Adjusted R Squared = .096)
b. R Squared = .031 (Adjusted R Squared = -.034)
c. R Squared = .161 (Adjusted R Squared = .105)

o Exercise (Pottery.sav): This example employs multivariate analysis of variance (MANOVA) to measure differences in the chemical characteristics of ancient pottery found at four kiln sites in Great Britain. The data are from Tubb, Parker, and Nickless (1980), as reported in Hand et al. (1994). For each of 26 samples of pottery, the percentages of oxides of five metals are measured.



12. Nonparametric method

• A statistical method is called non-parametric if it makes no assumption about the population distribution or the sample size.
• This is in contrast with most parametric methods in elementary statistics, which assume that the data are quantitative, that the population has a normal distribution, and that the sample size is sufficiently large.
• In general, non-parametric tests have less statistical power than their parametric counterparts when the parametric assumptions hold. However, as non-parametric methods make fewer assumptions, they are more flexible, more robust, and applicable to non-quantitative data.

12.1. Wilcoxon rank-sum test (Mann-Whitney U test)

o SPSS Menu: Analyze > Nonparametric Tests > Legacy Dialogs > 2 Independent Samples...
o Corresponding parametric method: two-sample t-test
o Example (Graze.sav): Taken from Huntsberger and Billingsley (1989). Compares two grazing methods using 32 cattle. Half of the cattle are allowed to graze continuously while the other half are subjected to controlled grazing time. The researchers want to know if these two grazing methods affect weight gain differently.

Figure 64 Wilcoxon rank-sum (or Mann-Whitney U) test

o SPSS output:

Ranks
             GrazeType    N    Mean Rank   Sum of Ranks
WeightGain   continuous   16   15.19       243.00
             controlled   16   17.81       285.00
             Total        32

Test Statistics (a)
                                 WeightGain
Mann-Whitney U                   107.000
Wilcoxon W                       243.000
Z                                -.792
Asymp. Sig. (2-tailed)           .429
Exact Sig. [2*(1-tailed Sig.)]   .445b

a. Grouping Variable: GrazeType
b. Not corrected for ties.
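The Test Statistics values follow from the Ranks table. The Python sketch below (not SPSS syntax) derives U from the smaller rank sum and then the large-sample Z and two-tailed p-value. It uses the standard normal approximation without a tie correction, which is an assumption of this sketch, so small differences from the SPSS values in the third decimal are possible.

```python
import math

# Values copied from the Ranks table (Graze.sav).
n1, n2 = 16, 16
rank_sum_continuous = 243.00   # this is also the "Wilcoxon W" row

# Mann-Whitney U from the rank sum of the first group.
u_statistic = rank_sum_continuous - n1 * (n1 + 1) / 2

# Normal approximation: mean and standard deviation of U under the null.
mean_u = n1 * n2 / 2
sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)   # no tie correction
z_statistic = (u_statistic - mean_u) / sd_u

# Two-tailed p-value from the standard normal distribution.
p_two_tailed = math.erfc(abs(z_statistic) / math.sqrt(2))

print(f"U = {u_statistic:.1f}, Z = {z_statistic:.3f}, p = {p_two_tailed:.3f}")
```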



12.2. Wilcoxon signed-rank test for paired data

o SPSS Menu: Analyze > Nonparametric Tests > Legacy Dialogs > 2 Related Samples...
o Corresponding parametric method: paired t-test
o Example (Pressure.sav): A stimulus is being examined to determine its effect on systolic blood pressure. Twelve men participate in the study. Each man's systolic blood pressure is measured both before and after the stimulus is applied.

Figure 65 Wilcoxon signed-rank test

o SPSS output:

Ranks
                                         N    Mean Rank   Sum of Ranks
SBPafter - SBPbefore   Negative Ranks    3a   8.17        24.50
                       Positive Ranks    9b   5.94        53.50
                       Ties              0c
                       Total             12

a. SBPafter < SBPbefore
b. SBPafter > SBPbefore
c. SBPafter = SBPbefore

Test Statistics (a)
                         SBPafter - SBPbefore
Z                        -1.143b
Asymp. Sig. (2-tailed)   .253

a. Wilcoxon Signed Ranks Test
b. Based on negative ranks.
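As a rough check, the Z statistic can be approximated from the sum of negative ranks in the Ranks table. The Python sketch below uses the textbook normal approximation without a correction for tied ranks. That omission is an assumption of the sketch: the half ranks in the table (24.50, 53.50) suggest tied differences, so the result is expected to be slightly smaller in magnitude than the value SPSS prints.

```python
import math

# Values copied from the Ranks table (Pressure.sav).
n_pairs = 12                   # no zero differences (Ties = 0)
sum_negative_ranks = 24.50     # Z is "based on negative ranks"

# Under the null, the rank sum has this mean and standard deviation.
mean_t = n_pairs * (n_pairs + 1) / 4
sd_t = math.sqrt(n_pairs * (n_pairs + 1) * (2 * n_pairs + 1) / 24)  # no tie correction

z_statistic = (sum_negative_ranks - mean_t) / sd_t
p_two_tailed = math.erfc(abs(z_statistic) / math.sqrt(2))

print(f"Z ~ {z_statistic:.3f} (table: -1.143), p ~ {p_two_tailed:.3f} (table: .253)")
```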



12.3. Kruskal-Wallis test

o SPSS Menu: Analyze > Nonparametric Tests > Legacy Dialogs > K Independent Samples...
o Corresponding parametric method: One-way ANOVA
o Example (Clover.sav): The following example studies the effect of bacteria on the nitrogen content of red clover plants. The treatment factor is bacteria strain, and it has six levels. Five of the six levels consist of five different Rhizobium trifolii bacteria cultures combined with a composite of five Rhizobium meliloti strains. The sixth level is a composite of the five Rhizobium trifolii strains with the composite of the Rhizobium meliloti. Red clover plants are inoculated with the treatments, and nitrogen content is later measured in milligrams.

Figure 66 Kruskal-Wallis test

o SPSS output:

Ranks
           Strain   N    Mean Rank
Nitrogen   3DOK1    5    26.00
           3DOK13   5    4.60
           3DOK4    5    8.00
           3DOK5    5    22.20
           3DOK7    5    17.60
           COMPOS   5    14.60
           Total    30

Test Statistics (a,b)
              Nitrogen
Chi-Square    21.659
df            5
Asymp. Sig.   .001

a. Kruskal Wallis Test
b. Grouping Variable: Strain
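The Chi-Square value can be approximated from the mean ranks alone. The Python sketch below applies the uncorrected Kruskal-Wallis formula H = 12 / (N(N+1)) * sum(n_i * meanrank_i^2) - 3(N+1). Leaving out the tie correction is an assumption of this sketch, so the result is expected to fall slightly below the printed value.

```python
# Group sizes and mean ranks copied from the Ranks table (Clover.sav).
mean_ranks = {
    "3DOK1": 26.00, "3DOK13": 4.60, "3DOK4": 8.00,
    "3DOK5": 22.20, "3DOK7": 17.60, "COMPOS": 14.60,
}
group_size = 5
total_n = group_size * len(mean_ranks)   # 30 plants in total

# Uncorrected Kruskal-Wallis H statistic (no adjustment for tied values).
weighted_sum = sum(group_size * rank ** 2 for rank in mean_ranks.values())
h_statistic = 12 / (total_n * (total_n + 1)) * weighted_sum - 3 * (total_n + 1)

# H is compared with a chi-square distribution on (groups - 1) degrees of freedom.
df = len(mean_ranks) - 1

print(f"H ~ {h_statistic:.3f} on {df} df (table: 21.659 on 5 df)")
```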



13. Correlation and Regression analysis

13.1. Correlation analysis - Linear relation between two variables

o SPSS Menu: Analyze > Correlate > Bivariate...
o Example (Cholesterol.sav): Cholesterol levels [mg/ml] for 30 women from two US states, Iowa and Nebraska.


Figure 67 Correlation analysis

o SPSS output:

Correlations (Pearson)
                                    cholesterol   age
cholesterol   Pearson Correlation   1             .688**
              Sig. (2-tailed)                     .000
              N                     30            30
age           Pearson Correlation   .688**        1
              Sig. (2-tailed)       .000
              N                     30            30

**. Correlation is significant at the 0.01 level (2-tailed).

Correlations (Spearman's rho)
                                        cholesterol   age
cholesterol   Correlation Coefficient   1.000         .749**
              Sig. (2-tailed)           .             .000
              N                         30            30
age           Correlation Coefficient   .749**        1.000
              Sig. (2-tailed)           .000          .
              N                         30            30

**. Correlation is significant at the 0.01 level (2-tailed).
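The two tables differ because Pearson's coefficient measures linear association on the raw values, while Spearman's rho is Pearson's coefficient computed on the ranks. The Python sketch below illustrates the difference on a small made-up data set (not Cholesterol.sav) in which y increases with x but not along a straight line. The helper names are illustrative, and the ranking function assumes there are no tied values.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n; assumes no tied values (enough for this sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: y is a monotone but non-linear function of x.
x = [1, 2, 3, 4, 5, 6]
y = [value ** 3 for value in x]

pearson_r = pearson(x, y)
spearman_rho = spearman(x, y)
print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_rho:.3f}")
```

For a perfectly monotone relation Spearman's rho reaches 1 while Pearson's r stays below 1, which is the same direction of difference seen between .749 and .688 in the output above.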



13.2. Linear Regression model

Linear regression is one of the most widely used statistical techniques: it is the study of linear (i.e., straight-line) relationships between variables, usually under an assumption of normally distributed errors.

o SPSS Menu: Analyze > Regression > Linear...
o Example (Cholesterol.sav): Cholesterol levels [mg/ml] for 30 women from two US states, Iowa and Nebraska.


Figure 68 Linear regression model

o SPSS output:

Model Summary (b)
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .725a   .526       .491                42.646

a. Predictors: (Constant), State1, age
b. Dependent Variable: cholesterol

ANOVA (a)
Model            Sum of Squares   df   Mean Square   F        Sig.
1   Regression   54432.754        2    27216.377     14.965   .000b
    Residual     49103.913        27   1818.663
    Total        103536.667       29

a. Dependent Variable: cholesterol
b. Predictors: (Constant), State1, age

Coefficients (a)
                 Unstandardized Coefficients   Standardized Coefficients
Model            B         Std. Error          Beta                        t        Sig.
1   (Constant)   93.141    24.799                                          3.756    .001
    age          2.698     .496                .738                        5.440    .000
    State1       -28.651   16.541              -.235                       -1.732   .095

a. Dependent Variable: cholesterol
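Each t value in the Coefficients table is the coefficient divided by its standard error. Because this regression uses the same predictors as the ANCOVA in section 11.3, the squared t values should also reproduce the F values printed there for age (29.593) and state (3.000). The Python sketch below checks both relations from the rounded table values; small differences in the last decimals are due to that rounding.

```python
# (B, Std. Error) pairs copied from the Coefficients table (Cholesterol.sav).
coefficients = {
    "(Constant)": (93.141, 24.799),
    "age": (2.698, 0.496),
    "State1": (-28.651, 16.541),
}

# t = B / Std. Error for every row of the table.
t_values = {name: b / se for name, (b, se) in coefficients.items()}

# For a single-df effect, t squared equals the F ratio of the same effect.
f_from_t = {name: t ** 2 for name, t in t_values.items()}

for name in coefficients:
    print(f"{name:11s} t = {t_values[name]:7.3f}  t^2 = {f_from_t[name]:7.3f}")
```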



o Exercise (BrainSize.sav): Are the size and weight of your brain indicators of your mental capacity? In this study by Willerman et al. (1991) the researchers use Magnetic Resonance Imaging (MRI) to determine the brain size of the subjects. The researchers take into account gender and body size to draw conclusions about the connection between brain size and intelligence.

Willerman et al. (1991) conducted their study at a large southwestern university. They selected a sample of 40 right-handed Anglo introductory psychology students who had indicated no history of alcoholism, unconsciousness, brain damage, epilepsy, or heart disease. These subjects were drawn from a larger pool of introductory psychology students with total Scholastic Aptitude Test scores higher than 1350 or lower than 940 who had agreed to satisfy a course requirement by allowing the administration of four subtests (Vocabulary, Similarities, Block Design, and Picture Completion) of the Wechsler (1981) Adult Intelligence Scale-Revised. With prior approval of the University's research review board, students selected for MRI were required to obtain prorated full-scale IQs of greater than 130 or less than 103, and were equally divided by sex and IQ classification.

The MRI scans were performed at the same facility for all 40 subjects. The scans consisted of 18 horizontal MR images. The computer counted all pixels with non-zero gray scale in each of the 18 images and the total count served as an index for brain size.

Variable Information:
Gender: Male or Female
FSIQ: Full Scale IQ scores based on the four Wechsler (1981) subtests
VIQ: Verbal IQ scores based on the four Wechsler (1981) subtests
PIQ: Performance IQ scores based on the four Wechsler (1981) subtests
Weight: body weight in pounds
Height: height in inches
MRI_Count: total pixel count from the 18 MRI scans



14. Logistic regression analysis

Logistic regression, also called a logit model, is used to model dichotomous outcome variables. In the logit model the log odds of the outcome is modeled as a linear combination of the predictor variables.

o SPSS Menu: Analyze > Regression > Binary Logistic...
o Example (Logistic.sav): A researcher is interested in how variables, such as GRE (Graduate Record Exam scores), GPA (grade point average) and prestige of the undergraduate institution, affect admission into graduate school. The outcome variable, admit/don't admit, is binary. This data set has a binary response (outcome, dependent) variable called admit, which is equal to 1 if the individual was admitted to graduate school, and 0 otherwise. There are three predictor variables: gre, gpa, and rank. We will treat the variables gre and gpa as continuous. The variable rank takes on the values 1 through 4. Institutions with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest. We start out by looking at some descriptive statistics.

Figure 69 Logistic regression model

o SPSS output:

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      458.517a            .098                   .138

a. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.



Classification Table (a)
                                  Predicted
                                  ADMIT                       Percentage
Observed                          Not admitted   Admitted     Correct
Step 1   ADMIT   Not admitted     254            19           93.0
                 Admitted         97             30           23.6
         Overall Percentage                                   71.0

a. The cut value is .500
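The Percentage Correct column is computed row by row from the four counts. The Python sketch below reproduces it; the variable names are illustrative. The low value for the Admitted row shows that, at the .500 cut value, the model classifies most admitted applicants as not admitted even though the overall percentage looks acceptable.

```python
# Counts copied from the Classification Table (Logistic.sav).
not_admitted_correct, not_admitted_wrong = 254, 19   # observed "Not admitted"
admitted_wrong, admitted_correct = 97, 30            # observed "Admitted"

total_cases = (not_admitted_correct + not_admitted_wrong
               + admitted_wrong + admitted_correct)

# Percentage correct within each observed category, then overall.
pct_not_admitted = 100 * not_admitted_correct / (not_admitted_correct + not_admitted_wrong)
pct_admitted = 100 * admitted_correct / (admitted_wrong + admitted_correct)
pct_overall = 100 * (not_admitted_correct + admitted_correct) / total_cases

print(f"Not admitted: {pct_not_admitted:.1f}%")
print(f"Admitted:     {pct_admitted:.1f}%")
print(f"Overall:      {pct_overall:.1f}%")
```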

Variables in the Equation
                                                                    95% C.I. for EXP(B)
                     B        S.E.    Wald     df   Sig.   Exp(B)   Lower   Upper
Step 1a   GRE        .002     .001    4.284    1    .038   1.002    1.000   1.004
          GPA        .804     .332    5.872    1    .015   2.235    1.166   4.282
          RANK                        20.895   3    .000
          RANK(1)    1.551    .418    13.787   1    .000   4.718    2.080   10.702
          RANK(2)    .876     .367    5.706    1    .017   2.401    1.170   4.927
          RANK(3)    .211     .393    .289     1    .591   1.235    .572    2.668
          Constant   -5.541   1.138   23.709   1    .000   .004

a. Variable(s) entered on step 1: GRE, GPA, RANK.

o Interpretation: From the output, GRE, GPA, and Rank are each significantly associated with the response variable (admitted or not). In logistic regression, Exp(B) is useful for interpretation. For GPA, the odds ratio is obtained by raising e to the power of the logistic coefficient: OR = e^b = e^0.804 ≈ 2.235. This means that a one-unit increase in GPA multiplies the odds of admission by about 2.235, holding GRE and rank constant.
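The GPA row of the Variables in the Equation table can be re-derived from B and S.E. alone. The Python sketch below computes the odds ratio, a 95% confidence interval as exp(B ± 1.96 × S.E.), and the Wald statistic as (B / S.E.) squared. The 1.96 multiplier and the use of rounded table values are assumptions of this sketch, so the results match the SPSS output only to about two decimals.

```python
import math

# B and S.E. for GPA, copied from the Variables in the Equation table.
b_gpa, se_gpa = 0.804, 0.332
z_critical = 1.96   # normal quantile for a 95% interval

# Odds ratio and its confidence interval on the exponentiated scale.
odds_ratio = math.exp(b_gpa)
ci_lower = math.exp(b_gpa - z_critical * se_gpa)
ci_upper = math.exp(b_gpa + z_critical * se_gpa)

# Wald chi-square statistic on 1 degree of freedom.
wald = (b_gpa / se_gpa) ** 2

print(f"Exp(B) ~ {odds_ratio:.3f} (table: 2.235)")
print(f"95% C.I. ~ ({ci_lower:.3f}, {ci_upper:.3f}) (table: 1.166, 4.282)")
print(f"Wald ~ {wald:.3f} (table: 5.872)")
```

Because the interval for GPA excludes 1, the odds ratio is significantly different from 1 at the 5% level, in line with the Sig. value of .015.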
