Analyzing Patterns of Missing Data

While SPSS contains a rich set of procedures for analyzing patterns of missing data, they are not included in the set of tools licensed by the University. However, we can replicate much of the analysis with other SPSS procedures.

The first set of tasks in the missing data analysis involves the creation of diagnostic variables that support the analysis: first, a variable that counts the number of variables with missing data for each case; second, one new dichotomous variable for each original variable that indicates whether or not the original variable had a missing data value; and third, a single pattern variable for each case that summarizes the missing or valid status of values for all of the variables in the analysis.

Using the diagnostic variable that counts the missing values for each case, we can identify cases with large concentrations of missing data as candidates for elimination from the analysis. After we remove specific cases with large numbers of missing variables, we do a frequency distribution for the remaining cases to see if any variables have so many missing cases that the variable should be considered a candidate for exclusion.

Next, we compute a frequency distribution for the pattern variable to identify patterns that occur often in the data, indicating a problematic missing data process.

Next, using the valid/missing variables as a grouping variable, we examine whether or not the missing cases are statistically different from the valid cases for all of the other variables in the analysis. If the variable is metric, we do a t-test for group differences; if the variable is non-metric, we do a chi-square test of independence to detect group differences.

Finally, we do a correlation matrix of the valid/missing variables to detect concentrations of missing data across multiple variables.

1. Download the data set

Download the HATMISS data set from the course web page and save it in your C:\SW388R7 folder.


2. Tallying the Number of Missing Variables

One of the major information items we need for the missing data analysis is the number of variables that have missing data for each case in the sample.

We will create a new variable, which we will name num_miss, that will contain the number of variables with missing data among the first ten in the data set, x1 through x10. We include only the first ten variables in this calculation to maintain consistency with the text.

The SPSS function NMISS counts the number of variables in its argument list that have missing values. We will use this function to calculate the value of our num_miss variable for each case.


Computing the Number Missing by Case

First, select the 'Compute…' command from the 'Transform' menu.

Second, type the name of the variable we want to create, 'num_miss', in the 'Target Variable: ' text box.

Third, scroll down the list of functions and highlight the 'NMISS' function.

Fourth, click on the move arrow to move the function to the 'Numeric Expression: ' text area.


Specifying the Variables in the Function

First, type the names of the variables to include in the function as a comma-delimited list between the parentheses after the function.

Second, click on the OK button to produce the new variable.

Third, the new variable appears in a column to the right of the existing columns of data.
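
For readers who want to check the logic outside SPSS, here is a minimal Python sketch of what NMISS does for each case. The three cases and their values are hypothetical (they are not taken from HATMISS), and None stands in for an SPSS missing value.

```python
# Hypothetical cases; None stands in for an SPSS missing value.
cases = [
    {"x1": 3.2, "x2": None, "x3": 7.1},
    {"x1": None, "x2": None, "x3": 5.0},
    {"x1": 4.4, "x2": 1.9, "x3": 6.3},
]


def nmiss(case, names):
    """Count how many of the named variables are missing for one case."""
    return sum(1 for name in names if case[name] is None)


names = ["x1", "x2", "x3"]
for case in cases:
    # Add the count as a new variable, as the Compute dialog does.
    case["num_miss"] = nmiss(case, names)

num_miss_values = [case["num_miss"] for case in cases]
print(num_miss_values)
```

In the exercise itself the list of names would run from x1 through x10.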


3. Creating Dichotomous Valid/Missing Variables for Diagnosing Missing Data

To determine whether or not the pattern of missing data is random, we create a special diagnostic variable for each original variable that indicates whether that variable is missing or valid for each case in the data set. Each diagnostic variable is dichotomous, using the value 1 for 'Valid' and the value 0 for 'Missing'.

Since we may need to refer back to the original variables in the course of the missing data analysis, I recommend a naming convention for the diagnostic variables that makes it easy to identify the original variable. If the original variable name is less than eight characters, an underscore is appended to the end of the original variable name, e.g. the diagnostic variable for race would be race_. If the original variable name is eight characters, the last character is replaced with an underscore, e.g. the diagnostic variable name for response would be respons_. If replacing the last character with an underscore duplicates the name assigned to another diagnostic variable for an eight-character variable name, we drop the last two characters from the original name and append an underscore followed by a sequence letter or digit, e.g. the diagnostic variable name for response would be respon_1 if we had already used the name respons_ for a diagnostic variable.
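
The naming convention can be written out as a small Python sketch. The function name diagnostic_name and the set of names already in use are hypothetical helpers for illustration. One assumption goes beyond the text: the sketch applies the sequence-digit fallback to any collision, not only to eight-character names.

```python
def diagnostic_name(original, used):
    """Return a diagnostic variable name of at most eight characters.

    `used` is the set of diagnostic names already assigned; the chosen
    name is added to it.
    """
    if len(original) < 8:
        # Short names: append an underscore.
        candidate = original + "_"
    else:
        # Eight-character names: replace the last character.
        candidate = original[:7] + "_"
    if candidate in used:
        # Collision: keep six characters, then underscore and a digit.
        for seq in range(1, 10):
            candidate = original[:6] + "_" + str(seq)
            if candidate not in used:
                break
        else:
            raise ValueError("no free diagnostic name for " + original)
    used.add(candidate)
    return candidate


used_names = set()
print(diagnostic_name("race", used_names))
print(diagnostic_name("response", used_names))
```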

When we assign variable labels to the diagnostic variables, we can add a keyword to the original variable label to designate it as a missing/valid diagnostic variable, e.g. the variable label for the diagnostic variable that had an original variable label of Grade Level could be Grade Level (Valid/Missing).

We will demonstrate the process of creating dichotomous Valid/Missing variables for diagnosing missing data using the variables in the HATMISS.SAV data set. If the copy of HATMISS.SAV that you are working with does not have variable labels and value labels, do the exercise Applying a Data Dictionary to apply the data labels from the HATCO.SAV data set to the HATMISS.SAV data set. A quick test for the presence of variable labels is to position the mouse over a variable name in the data editor. If a variable label appears in a yellow tips box, a variable label has been added for that variable.


Recoding Diagnostic Variables for Missing Data

First, select the 'Recode | Into Different Variables…' command from the 'Transform' menu.

Second, move the first original variable, x1, from the input list to the list box 'Numeric Variable -> Output Variable.'

Third, type the new variable name, x1_, into the 'Name:' text box on the 'Output Variable' panel.

Fourth, type the variable label for the new x1_ variable, 'Delivery Speed (Valid/Missing)', in the Label text box on the 'Output Variable' panel.

Fifth, click on the Change button to move the new name to the 'Numeric Variable -> Output Variable' list.


Opening the Dialog for Old and New Values

To specify which old values are to be recoded into new values, click on the 'Old and New Values…' button.


Add the Value for Missing Data

First, click on the 'System- or user-missing' option button on the 'Old Value' panel.

Second, type 0 into the 'Value: ' text box in the 'New Value' panel.

Third, to add these value changes to the list of recodes, click on the 'Add' button. The change is added to the 'Old -> New:' list.


Add the Value for Valid Data

First, click on the 'All other values' option button on the 'Old Value' panel.

Second, type 1 into the 'Value: ' text box in the 'New Value' panel.

Third, to add these value changes to the list of recodes, click on the 'Add' button. The change is added to the 'Old -> New:' list.


Completing the Values Dialog Box

Since this is the last value specification, we click on the Continue button to close the dialog box.


Adding Diagnostic Variables for the Remaining Variables

First, add the original name, the new diagnostic variable name, and the variable label for the diagnostic variable for all of the other variables through x14. The same value changes which we specified for x1_ will be applied to these variables.

Second, click on the OK button to complete the recode request.
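
The recode amounts to a simple rule: a missing value becomes 0 and any other value becomes 1. A minimal Python sketch under the same assumptions as before (hypothetical values, None for missing, and a hypothetical helper named add_indicators) might look like this:

```python
# Hypothetical cases; None stands in for an SPSS missing value.
cases = [
    {"x1": 3.2, "x2": None},
    {"x1": None, "x2": 2.5},
]


def add_indicators(cases, names):
    """Add one 0/1 diagnostic variable per original variable.

    0 means 'Missing' and 1 means 'Valid', matching the recode above.
    The new variable name is the original name plus an underscore.
    """
    for case in cases:
        for name in names:
            case[name + "_"] = 0 if case[name] is None else 1


add_indicators(cases, ["x1", "x2"])
print(cases[0]["x1_"], cases[0]["x2_"])
```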


Adding Value Labels to the Diagnostic Variables

To add value labels to the diagnostic variables, first we go to the Variable View worksheet in the Data Editor.

Second, highlight the Values cell for the variable we want to work with.

Third, click on the gray dialogue box which appears in the cell to bring up the Value Labels dialogue box.


Adding the Value Label for Missing

First, we type a 0 in the 'Value' text box on the 'Value Labels' Panel.

Second, we type 'Missing' in the 'Value Label' text box on the 'Value Labels' Panel.

Third, we click on the Add button to add this value label to the list box.


Add the Value Label for Valid

First, we type a 1 in the 'Value' text box on the 'Value Labels' Panel.

Second, we type 'Valid' in the 'Value Label' text box on the 'Value Labels' Panel.

Third, we click on the Add button to add this value label to the list box.


Apply the Value Labels

Click on the OK button to apply the value labels.


Displaying the Value Labels for the Variables

To display the value labels in the SPSS Data Editor window, we first return to the Data View worksheet of the Data Editor. There we select the 'Value Labels' command from the View menu. When the command is in effect, a check mark will appear before the command. To restore the display to the numeric code display, we select the 'Value Labels' command a second time to toggle it off.


The Diagnostic Variables

The value labels for the variables appear in the SPSS Data Editor. The display would be improved by adjusting the width of the data columns. This display can be used to examine the pattern of missing values as the text does in table 2.3.


4. Adding a Pattern Variable to the Data Set

Another indication of a problematic missing data process is the frequent occurrence of the same pattern of missing data among the variables. While patterns can be detected by sorting and scanning the data set, this task is facilitated by the creation of a pattern variable. The pattern variable is a string variable containing one character for each variable in the data set. Each character in the pattern variable is set to a character indicating missing data or a character indicating valid data. To make the pattern more visually intuitive, the characters selected should have the same width when printed. If we do not use same width characters, we cannot scan down values to compare them because the column alignment of the characters is not the same from one value to the next. We will use an X for missing data and a tilde, ~, for valid data, because both are full width characters.

To create the pattern variable, we first create a one-character string variable for each of the original variables. Then, we use the SPSS 'CONCAT' function to add the string variables together into a single variable.


Recode the Original Variables into String Variables

First, select the 'Recode | Into Different Variables…' command from the Transform menu.

Second, click on the Reset button to clear the previously recoded variables.

Third, move the variable 'Delivery Speed [x1]' to the 'Numeric Variable -> Output Variable' list box.

Fourth, type the name for the new variable 'x1_x' in the 'Name:' text box on the 'Output Variable' panel.

Fifth, click on the Change button to move the name 'x1_x' to the 'Numeric Variable -> Output Variable' list box.


Opening the Dialog for Old and New Values

To specify which old values are to be recoded into new values, click on the 'Old and New Values…' button.


Add the Value for Missing Data

First, click on the 'System- or user-missing' option button on the 'Old Value' panel.

Second, click on the 'Output variables are strings' check box.

Third, set the 'Width' of the output variables to 1 character.

Fourth, type 'X' into the 'Value: ' text box in the 'New Value' panel.

Fifth, to add these value changes to the list of recodes, click on the 'Add' button. The change is added to the 'Old -> New:' list.


Add the Value for Valid Data

First, click on the 'All other values' option button on the 'Old Value' panel.

Second, type '~' (a tilde) into the 'Value: ' text box in the 'New Value' panel. I chose a tilde rather than a blank because it will be easier to see.

Third, to add these value changes to the list of recodes, click on the 'Add' button. The change is added to the 'Old -> New:' list.


Completing the Values Dialog Box

Since this is the last value specification, we click on the Continue button to close the dialog box.


Adding String Variables for the Other Original Variables

First, add the original name and the new string variable name for all of the other variables through x10. The same value changes which we specified for x1 will be applied to these variables.

Second, click on the OK button to complete the recode request.


The String Variables

The recoded string variables for variables Delivery Speed (x1) through Satisfaction Level (x10) are added to the data editor window.


Create the Variable Containing the Concatenated Data

First, select the 'Compute…' command from the Transform menu to create a new variable.

Second, after clicking on the Reset button to clear the last recoded variable, type the name for the new variable 'miss_str' into the 'Target Variable: ' text box.

Third, click on the 'Type&Label…' button to set the type of variable to string.

Fourth, in the 'Type' panel mark the 'String' option button.

Fifth, set the 'Width: ' of the new variable to 10 characters, one for each of the ten string variables.

Sixth, click on the 'Continue' button to close the 'Type and Label' dialog box.


Enter the Formula for the Concatenated Variable

First, highlight the 'CONCAT' function in the 'Functions: ' list box and move it to the 'String Expression: ' text area.

Second, type the names of the string variables as a comma delimited list between the parentheses following the CONCAT function name.

Third, click the OK button to complete the compute variable function.


The Missing Data Pattern Variable

One variable now contains a string that has one character for each string variable. This variable contains the pattern of missing and valid data for each case in the data set. We have made a lot of changes to the HATMISS data set that we should save, so we click on the Save File tool. This completes the creation of the diagnostic variables we need to conduct the missing data analysis.
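
The two steps (recode to one-character strings, then concatenate) collapse into a single expression in a general-purpose language. The following Python sketch is only an illustration of the idea. The function name missing_pattern is hypothetical, and None again stands in for a missing value.

```python
def missing_pattern(case, names):
    """Build the pattern string: 'X' for missing, '~' for valid."""
    return "".join("X" if case[name] is None else "~" for name in names)


# Hypothetical case with x1 and x3 missing.
example = {"x1": None, "x2": 4.1, "x3": None}
miss_str = missing_pattern(example, ["x1", "x2", "x3"])
print(miss_str)
```

Because the variable names are passed in order, the position of each character in the string identifies the variable it describes.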


5. Removing Cases with a Large Proportion of Missing Variables

To identify the cases that we should consider removing, we will sort the data set in descending order by the number of missing variables. The candidates for elimination will appear at the top of the data set.

Once we have located the cases that we want to eliminate, we specify a filter condition to eliminate the cases from further analysis. The cases are not deleted from the data set, so we can include them in later analysis should we desire to do so.


Sorting the Cases

It will be easier to identify problem cases if we sort the cases by the 'num_miss' variable. First, select the 'Sort Cases…' command from the Data menu.

Second, click on the 'Descending' option button in the 'Sort Order' panel so that the cases with the largest number of missing values appear at the top of the data set.

Third, in the 'Sort Cases' dialog, move the 'num_miss' variable to the 'Sort by: ' list box.

Fourth, click on the OK button to sort the data set.


The Cases Sorted by Number Missing

At the top of the sorted data set, we see the six cases which had missing values on 5, 6, or 7 of the original ten variables (missing 50% or more of the data). These are the cases that will be excluded from further analysis.


Excluding the Cases

We exclude the cases with too many missing values by not selecting them for inclusion in later analyses. First, we select the 'Select Cases…' command from the Data menu.

Second, we mark the 'If condition is satisfied' option button in the Select panel.

Third, we click on the 'If…' button to specify the condition for inclusion.


Specifying the If Condition

First, move the 'num_miss' variable to the condition text area on the right.

Second, complete the condition by typing '< 5' (less than 5) after the variable name. This 'If' condition specifies that a case will be included if the value of its 'num_miss' variable is less than 5, i.e. 4, 3, 2, 1 or 0. Cases that have a 'num_miss' value of 5 or greater will not be included.

Third, click on the Continue button to signal completion of the If condition.


Specify Filtering for Unselected Cases

We have two options for removing cases that do not satisfy the selection criteria: deletion from the data set and filtering from the data set. Deletion physically removes the cases from the data set permanently. Filtering leaves the cases in the data set, but marks them for exclusion from the analyses. With the cases still in the data set, we can choose to include them in a later analysis.

First, mark the 'Filtered' option button on the 'Unselected Cases Are' panel.

Second, click on the OK button to complete the selection process.


The Data Set with Filtered Cases

The cases that did not meet the selection criteria are marked with a diagonal line or slash through their case number. In addition, SPSS added a new variable to the data set, 'filter_$', which has a value of 1 if the case is included, and a value of 0 if the case is not included. When applying a selection criterion, it is good practice to spot check our cases to make certain we specified the 'If' condition correctly. In this problem, we see that cases 1 through 6, which have num_miss values greater than 4, all have a slash through their case number. Cases 7 through 11, which have num_miss values less than 5, do not have a slash and will still be included in the analyses.
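
The sort-then-filter logic of this section can be sketched in a few lines of Python. The case identifiers and num_miss values below are hypothetical, and the variable name filter_flag is a stand-in for the 'filter_$' variable that SPSS creates.

```python
# Hypothetical cases with the num_miss count already computed.
cases = [
    {"id": 1, "num_miss": 0},
    {"id": 2, "num_miss": 6},
    {"id": 3, "num_miss": 5},
    {"id": 4, "num_miss": 2},
]

# Sort in descending order so the worst cases appear first.
ordered = sorted(cases, key=lambda c: c["num_miss"], reverse=True)

# Flag rather than delete: 1 = included, 0 = filtered out.
for case in ordered:
    case["filter_flag"] = 1 if case["num_miss"] < 5 else 0

selected = [c for c in ordered if c["filter_flag"] == 1]
print([c["id"] for c in ordered])
print([c["id"] for c in selected])
```

Keeping a flag instead of dropping rows mirrors the 'Filtered' option: the excluded cases remain available for later analyses.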


6. Summary Statistics for the Unfiltered Cases

Filtering cases with 50% or more missing data removed six cases from the data set, reducing our effective sample size to 64 cases. We next look at a frequency distribution for each variable to see if any variables have such a high proportion of missing data that they should be considered candidates for removal from the analysis.

We can see the distribution of missing data on each of our variables by using the Frequencies command, which produces the SPSS output equivalent to Table 2.2 on page 56 of the text. We will use a Frequencies command instead of a Descriptives command, because the Frequencies command will provide a count of the remaining missing cases for each variable.


Requesting the Frequency Distributions

First, select the 'Descriptive Statistics | Frequencies…' command from the Analyze menu.

Second, move the variables Delivery Speed through Satisfaction Level (x1 through x10) to the 'Variable(s): ' list box.

Third, clear the check mark from the 'Display frequency tables' check box. Frequency tables for continuous variables would generate a large volume of output that we do not need.

Fourth, click on the 'Statistics…' button to request the mean and standard deviation.


Requesting Specific Statistics

First, mark the check boxes for 'Mean' and 'Std. Deviation'. All other check boxes should be clear.

Second, click on the Continue button to close the 'Frequencies: Statistics' dialog box.

Third, when the 'Frequencies: Statistics' dialog is closed, click on the OK button to request the output.


The Frequencies Output

The frequencies table contains all of the information items in table 2.2 of the text. The horizontal orientation of the table makes it difficult to read. We will change its orientation.


Changing the Orientation of the Table

First, double click on the table to activate it for editing. When the table is activated, it displays a hatched line border.

Second, select the 'Transpose Rows and Columns' command from the Pivot menu.


The Transposed Frequencies Table

The number of cases in the column labeled Valid is the number of cases that are not missing data for that variable. From studying this column, we see that Delivery Speed, Price Level, and Price Flexibility have the lowest number of valid cases, and thus the largest number of missing cases. For each of these variables, there are still a large number of cases that do not have missing data, so we would not automatically eliminate these variables from the analysis. There is no specific number for the proportion of missing cases that would require the variable to be eliminated. A variable that has 50% or more missing data would not have much credibility, and probably a variable with 40% missing data should be eliminated. However, a variable with 20 to 30% missing data might or might not be retained depending on its importance to the research question. Whatever we decide about missing data, we should identify our decisions in the research report.
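
To compare variables against rough thresholds such as these, it helps to express the missing counts as percentages. The following Python sketch does so for hypothetical data; the helper name missing_summary is my own and not an SPSS function.

```python
def missing_summary(cases, names):
    """Return valid count, missing count, and percent missing per variable."""
    total = len(cases)
    summary = {}
    for name in names:
        n_missing = sum(1 for case in cases if case[name] is None)
        summary[name] = {
            "valid": total - n_missing,
            "missing": n_missing,
            "pct_missing": 100.0 * n_missing / total,
        }
    return summary


# Hypothetical cases; None stands in for a missing value.
cases = [
    {"x1": None, "x2": 2.0},
    {"x1": 3.5, "x2": 1.0},
    {"x1": 4.0, "x2": None},
    {"x1": 2.5, "x2": None},
]
summary = missing_summary(cases, ["x1", "x2"])
for name, row in summary.items():
    print(name, row)
```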


7. Tabulating Missing Data Patterns

In a previous exercise, Adding a Pattern Variable to the Data Set, we created a pattern variable that contained a single string of ten characters representing valid or missing data for the first ten variables in the data set. To create table 2.4 on page 58, we do a frequency distribution on the pattern variable. This frequency distribution will tell us if there are one or two patterns of missing data that occur with sufficient frequency to require further investigation.


Request a Frequency Distribution for the Pattern Variable

First, select the 'Descriptive Statistics | Frequencies…' command from the Analyze menu.

Second, in the Frequencies dialog box, move the pattern variable, 'miss_str', to the 'Variable(s): ' list box. Also be sure the 'Display frequency tables' check box is marked in this dialog box.

Third, click on the Format… button in the Frequencies dialog box.

Fourth, in the 'Frequencies: Format' dialog, mark the 'Descending counts' option in the 'Order by' panel. This will order the frequency table to be from highest count to lowest.

Fifth, click on the Continue button to close the 'Frequencies: Format' dialog.

Sixth, click on the OK button to complete the frequency request.


The Frequency of Different Patterns

The results in the frequency table show the incidence of different patterns. They agree with the data in table 2.4 of the text, though the patterns are in a different order. As the text identifies, the most prevalent pattern is X1 missing and all other variables non-missing, with a frequency of 6. The next most frequent pattern is X1 and X3 missing, with a frequency of 4. All other patterns have a lower frequency of occurrence. This analysis tells us that we do not have a single missing-data pattern that occurs with sufficient frequency to impact the statistical analysis.
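
A frequency table ordered by descending counts is easy to reproduce for any list of pattern strings. This Python sketch uses a short hypothetical list of three-character patterns, not the HATMISS results.

```python
from collections import Counter

# Hypothetical pattern strings: 'X' = missing, '~' = valid.
patterns = ["~~~", "X~~", "X~~", "X~X", "~~~", "~~~"]

# most_common() orders the patterns from highest to lowest count,
# like the 'Descending counts' format option.
table = Counter(patterns).most_common()

for pattern, count in table:
    print(pattern, count)
```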


8. T-tests and Chi-square Tests for Diagnosing Randomness of Missing Data

In previous exercises, we created dichotomous grouping variables for the variables X1 through X10, where the grouping variable was assigned a 1 if the data was valid and a 0 if the data was missing. We will use these grouping variables to determine whether the valid and missing groups differ in their relationship to other variables in the data set. If the missing and valid groups are statistically equivalent on other variables, then the missing cases can be characterized as random, and of no consequence to our analysis. If the missing group shows a statistically significant relationship to the other variable, it suggests that there is a missing data process that requires further understanding.

The statistical tests that we use in this analysis are chi-square tests of independence, if the variable to be tested is nonmetric, or t-tests for two independent samples, if the variable to be tested is metric. The authors use the separate variance output for all t-tests instead of examining individual tests of homogeneity. We will follow this practice.

When this analysis is conducted, there are usually a large number of statistical relationships tested. We know that using an alpha level of 0.05 in these tests implies that, when there is no real difference, we can expect about one test in every twenty to be statistically significant by chance alone. With a large number of tests, we will get some statistically significant relationships even when there is no serious problem with our data. We are not looking at the individual test results as much as we are concerned with an overall pattern of relationships.
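
The arithmetic behind this caution can be sketched in Python. The sketch assumes the tests are independent, which is a simplification that will not hold exactly for tests run on the same cases. The function name chance_findings is hypothetical.

```python
def chance_findings(n_tests, alpha=0.05):
    """Return (expected false positives, P(at least one false positive)).

    Assumes every null hypothesis is true and the tests are independent.
    """
    expected = n_tests * alpha
    at_least_one = 1 - (1 - alpha) ** n_tests
    return expected, at_least_one


expected, at_least_one = chance_findings(20)
print(expected, round(at_least_one, 3))
```

With twenty tests, one chance finding is expected on average, which is why we look for an overall pattern rather than reacting to a single significant result.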

NOTE. I cannot reconcile the findings on these tests to the discussion of findings on page 58 of the text. The statistical results are consistent with table 2.5 on page 59, while the text discussion appears to be a carryover from the fourth edition of the text, which does not contain the same statistical results as the fifth edition.


The Statistical Tests to Be Computed

We will use the grouping variable 'Delivery Speed (Valid/Missing)' (X1_) to explore differences among the next nine variables in the data set, 'Price Level' through 'Satisfaction Level' (X2 through X10). In each statistical test, we are testing the null hypothesis of no relationship associated with the grouping variable, 'Delivery Speed (Valid/Missing)'. If we reject the null hypothesis, we would conclude that persons who did not answer the question on Delivery Speed had a different pattern of responses than did persons who did provide Delivery Speed.

The variable 'Firm Size' (x8) is a nonmetric variable and we will do a chi-square test of independence for this variable.

The variables 'Price Level' (x2), 'Price Flexibility' (x3), 'Manufacturer Image' (x4), 'Service' (x5), 'Salesforce Image' (x6), 'Product Quality' (x7), 'Usage Level' (x9), and 'Satisfaction Level' (x10) are all metric and we will do t-tests for these variables.


The Chi-square Test of Independence

First, we select the 'Descriptive Statistics | Crosstabs…' command from the Analyze menu.

Second, we move the dependent variable, 'Firm Size (x8)', to the 'Row(s)' list box.

Third, we move the independent, or grouping, variable 'Delivery Speed (Valid/Missing)' to the 'Column(s)' list box.

Fourth, we click on the 'Statistics…' button at the bottom of the Crosstabs dialog to request the statistical test.


Requesting the Chi-square Test

First, we mark the Chi-square test check box to request the statistical test. For this problem, we clear all of the other check boxes in this dialog box.

Second, click on the Continue button to complete the request for statistical options.


Specifying Cell Contents

First, we click on the 'Cells…' button to specify what we want in the cells of the crosstabs table.

Second, we mark the check boxes for 'Observed' Counts and 'Column' Percentages. If any other check boxes are marked, we clear them.

Third, click on the Continue button to conclude our specifications for cell contents.

Fourth, click on the OK button in the Crosstabs dialog to request the output.


Chi-square Test Results

First, the chi-square test produced a statistically significant Sig. value, so we reject the null hypothesis and conclude that designation of firm size was different for missing cases than for valid cases.

Second, looking at the column percents in the crosstabulation table, we see that subjects who had a missing value for delivery speed were much more likely to be large firms than were subjects who had valid data for delivery speed (68.4% to 17.8%). This relationship requires further consideration as a missing data process that could affect our analysis.
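
For readers curious about what the test computes, here is a minimal Python sketch of the Pearson chi-square statistic for a crosstabulation. The counts are hypothetical and are not the HATMISS results. The sketch omits the continuity correction and the significance value that SPSS also reports.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom.

    `table` is a list of rows of observed counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


# Hypothetical counts: rows = firm size, columns = missing / valid.
stat, df = chi_square([[10, 20], [20, 10]])
print(round(stat, 3), df)
```

With one degree of freedom, a statistic above 3.841 is significant at the 0.05 level.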


Requesting the T-tests

First, we select the 'Compare Means | Independent-Samples T Test…' command from the Analyze menu.

Second, we move the variables 'Price Level' (x2), 'Price Flexibility' (x3), 'Manufacturer Image' (x4), 'Service' (x5), 'Salesforce Image' (x6), 'Product Quality' (x7), 'Usage Level' (x9), and 'Satisfaction Level' (x10) to the list box for 'Test Variable(s): '.

Third, we move the variable 'Delivery Speed (Valid/Missing)' to the text box for 'Grouping Variable: '. SPSS lists the name of the variable, 'x1_ '.

Fourth, we click on the 'Define Groups…' button.


Specifying the Groups by Code Number

First, we enter 0 in the 'Group 1: ' text box. 0 indicates missing data on the original Delivery Speed variable.

Second, we enter 1 in the 'Group 2: ' text box. 1 indicates valid data on the original Delivery Speed variable.

Third, we click on the Continue button to close the 'Define Groups' dialog box.

Fourth, we note that SPSS completed the group identifiers in the 'Grouping Variable: ' text box.

Fifth, we click on the OK button to request the t-test results.


Results of the T-tests

Using the 'Equal variances not assumed' rows of the table, we see that there is a significant difference in average score for the variables 'Manufacturer Image' and 'Service.' There is no significant difference in means for 'Price Level' and 'Price Flexibility.' If we scroll down the list, we find that there are significant relationships also with 'Usage Level' and 'Satisfaction Level.' These significant findings reinforce the notion that 'Delivery Speed' might be involved in a missing data process that requires further understanding before proceeding with the analysis.
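
The 'Equal variances not assumed' rows report the separate-variance (Welch) form of the t-test. The Python sketch below shows how the t statistic and its approximate degrees of freedom are computed for two groups. The scores are hypothetical, and the sketch stops short of computing the significance value.

```python
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Separate-variance t statistic and Welch-Satterthwaite df."""
    # Squared standard error contributed by each group.
    se_a = variance(group_a) / len(group_a)
    se_b = variance(group_b) / len(group_b)
    t = (mean(group_a) - mean(group_b)) / (se_a + se_b) ** 0.5
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(group_a) - 1) + se_b ** 2 / (len(group_b) - 1)
    )
    return t, df


# Hypothetical scores for the missing (0) and valid (1) groups.
missing_group = [1.0, 2.0, 3.0, 4.0, 5.0]
valid_group = [2.0, 4.0, 6.0, 8.0, 10.0]
t, df = welch_t(missing_group, valid_group)
print(round(t, 3), round(df, 2))
```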


9. The Correlation Matrix for Diagnosing Randomness of Missing Data

To continue our missing data analysis, we run a correlation matrix for the dichotomous grouping variables: 'Delivery Speed (Valid/Missing)', 'Price Level (Valid/Missing)', 'Price Flexibility (Valid/Missing)', 'Manufacturer Image (Valid/Missing)', 'Service (Valid/Missing)', 'Salesforce Image (Valid/Missing)', 'Product Quality (Valid/Missing)', 'Usage Level (Valid/Missing)', and 'Satisfaction Level (Valid/Missing)'.

We examine the pattern of correlations to see if there are large correlations among multiple pairs of variables that do not have an obvious explanation. An obvious explanation would be that subjects only answered these questions if their answer to another question were some value, e.g. only answer the question about job satisfaction if you are employed.

If there are variables that show a strong pattern of systematic missing data without an obvious explanation, we should evaluate the impact that this pattern has on our research questions, and make our decision about including, eliminating, or substituting for these variables.


Requesting the Correlation Matrix

First, select the 'Correlate | Bivariate…' command from the Analyze menu.

Second, move the Valid/Missing diagnostic variables for the metric variables to the 'Variables: ' list box.

Third, accept the defaults of 'Pearson' for 'Correlation Coefficients', 'Two-tailed' for 'Test of significance', and 'Flag significant correlations'.

Fourth, click on the OK button to produce the correlation matrix.


The Correlation Matrix Output

Our correlation matrix shows the same pattern as table 2.6 on page 60 of the text. As discussed there, there is only one moderate correlation in this table, between Salesforce Image and Satisfaction Level. The pattern for missing data is restricted to these variables, so we do not have a serious problem.
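
As a final illustration, the Pearson correlation between two valid/missing indicators can be computed directly from the 0/1 codes. The Python sketch below uses hypothetical indicator values. A high positive correlation means the two variables tend to be missing for the same cases.

```python
def pearson(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical valid/missing indicators (1 = valid, 0 = missing).
indicators = {
    "x1_": [1, 1, 0, 0],
    "x2_": [1, 1, 0, 0],
    "x3_": [1, 0, 1, 0],
}

names = list(indicators)
matrix = {
    a: {b: pearson(indicators[a], indicators[b]) for b in names}
    for a in names
}
for a in names:
    print(a, [round(matrix[a][b], 2) for b in names])
```

An indicator with no missing cases is constant and has zero variance, so it would have to be left out of the matrix.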

