ABSTRACT
DO RESEARCHERS INFLUENCE SURVEY RESULTS WITH THEIR QUESTION WORDING CHOICES?
By
David S. Walonick
May, 1994
Abstract
A review of the literature revealed that there can be large differences in the way that
people respond to public opinion surveys depending on the phraseology of the questions.
The purpose of this study was to determine whether survey researchers
unknowingly influence the results of a survey through their question wording choices. A
survey was mailed to 953 people who had some involvement in the survey research
process. A short vignette told respondents that they had been hired to design a public
opinion survey. One third of the respondents were told that the sponsor for their study
was a conservative anti-spending group, another third were told that their sponsor was a
liberal pro-spending group, and for the last third, no sponsor was identified. The survey
presented a pair of question wording alternatives for six different social issues.
Respondents were asked to select the one they would use in a public opinion survey.
They were also asked how they would personally answer the question wording that they
had selected. A total of 361 researchers responded to the survey. A variety of different
statistical techniques were used to test each of the hypotheses, including the chi-square
statistic, gamma, Student's t-test between means, the one-sample t-test between
proportions, and logistic regression analysis. The results indicated that there was a fairly
strong tendency for researchers to consistently favor one form of question wording or the
other. In addition, researchers decidedly selected the question wording that would sway
public response to support their personal opinion. Only minor relationships were found
between question wording choices and the persuasion of the study sponsor or the self-
assessed experience level of the researcher.
Table of Contents
I. Introduction....................................................................................................... 1
Statement of the Problem..................................................................................... 2
Purpose of the Proposed Research....................................................................... 3
Research Questions.............................................................................................. 4
Null Hypotheses.................................................................................................. 4
Significance of the Problem................................................................................. 5
Definitions of Terms............................................................................................ 6
II. Review of the Literature............................................................................ 9
Interviewer Effects.............................................................................................. 9
Interviewer Errors....................................................................................... 10
Social Distance Between the Interviewer and the Respondent..................... 10
Interviewer Verbal Cues.............................................................................. 11
Response Effects.................................................................................................. 12
Question Order............................................................................................ 13
Middle Alternatives..................................................................................... 15
"Don't Know" Alternatives.......................................................................... 16
Presenting One or Two Sides of an Issue.................................................... 17
Assertion Versus Interrogation Format........................................................ 20
Vague Quantifiers....................................................................................... 22
Wording of Questions and Response Alternatives....................................... 26
Attitude of the Survey Designer........................................................................... 29
III. Methodology................................................................................................. 30
Research Design.................................................................................................. 30
Sample Selection................................................................................................. 33
Analysis............................................................................................................... 36
Liberal and Conservative t-Test between Proportions................................. 38
One-Way Classification Chi-Square Test.................................................... 40
Two-Way Classification Chi-Square Test................................................... 42
t-Test between Means................................................................................. 42
Logistic Regression..................................................................................... 43
Multivariate Models.................................................................................... 47
Analyses of the Hypotheses Broken Down by Question Wording Pairs............................................................................................. 49
Methodological Limitations................................................................................. 49
Validity and Reliability........................................................................................ 50
Procedures and Timetable.................................................................................... 53
IV. Results.............................................................................................................. 54
Response and Non-Response............................................................................... 54
Respondents' Comments...................................................................................... 56
Crime.......................................................................................................... 60
Drugs.......................................................................................................... 60
Welfare....................................................................................................... 60
Cities........................................................................................................... 61
Blacks......................................................................................................... 62
Social Security............................................................................................ 62
Summary of Comments............................................................................... 63
Null Hypothesis Testing...................................................................................... 63
Null Hypothesis 1....................................................................................... 64
Null Hypothesis 2....................................................................................... 70
Null Hypothesis 3....................................................................................... 76
Multivariate Models to Test the First Three Hypotheses............................. 81
Null Hypothesis 4....................................................................................... 84
Additional Findings..................................................................................... 93
V. Conclusions and Recommendations.................................................... 97
Summary............................................................................................................. 97
Conclusions......................................................................................................... 99
Discussion and Recommendations....................................................................... 102
VI. References...................................................................................................... 110
VII. Appendices
Appendix A: Cover Letter and Questionnaire that was Sent to Researchers.......................................................................................................... 115
Appendix B: Vitae of David S. Walonick............................................................ 118
Tables
Table 1: A comparison of Simpson's and Hakel's findings on the meaning of vague quantifiers......................................................................................... 24
Table 2: Summary of Rasinski's results on issue labeling (1984-1986)..................... 28
Table 3: Rasinski's (1989) issue labels that were replicated in this study................... 32
Table 4: Construction of the liberal and conservative dichotomous scales................. 39
Table 5: Construction of the unbiased (neutral) scale................................................ 41
Table 6: Construction of the 2 x 3 contingency table for the two-way classification chi-square test........................................................................ 42
Table 7: Summary of nine statistical tests used for testing the first two hypotheses................................................................................................... 47
Table 8: Response rate information broken down by the sponsorship variable.......... 54
Table 9: Number of valid responses and completion rates for all items on the survey......................................................................................................... 55
Table 10: Number of comments and response percent ranked by frequency of mention....................................................................................................... 63
Table 11: Hypothesis 1 - Liberal t-test between proportions....................................... 64
Table 12: Hypothesis 1 - Conservative t-test between proportions.............................. 65
Table 13: Hypothesis 1 - One-way classification chi-square test................................. 65
Table 14: Hypothesis 1 - Two-way classification chi-square test................................ 66
Table 15: Hypothesis 1 - t-test between means............................................................ 67
Table 16: Hypothesis 1 - Logistic regression with dummy IV's (t-test)....................... 67
Table 17: Hypothesis 1 - Logistic regression with an interval IV (t-test)..................... 68
Table 18: Hypothesis 1 - Summary of conclusions for the first null hypothesis.......... 69
Table 19: Hypothesis 1 - Probability levels for each of the nine tests broken down by issue....................................................................................................... 70
Table 20: Hypothesis 2 - Liberal t-test between proportions....................................... 71
Table 21: Hypothesis 2 - Conservative t-test between proportions.............................. 71
Table 22: Hypothesis 2 - One-way classification chi-square test................................. 72
Table 23: Hypothesis 2 - Two-way classification chi-square test................................ 72
Table 24: Hypothesis 2 - t-test between means............................................................ 73
Table 25: Hypothesis 2 - Logistic regression with dummy IV's (t-test) and logistic regression with an interval IV (t-test)............................................. 74
Table 26: Hypothesis 2 - Summary of conclusions for the second null hypothesis...... 75
Table 27: Hypothesis 2 - Probability levels for each of the nine tests broken down by issue............................................................................................. 76
Table 28: Hypothesis 3 - Two-way classification chi-square test................................ 77
Table 29: Hypothesis 3 - t-test between means............................................................ 78
Table 30: Hypothesis 3 - Logistic regression with dummy IV's (t-test)....................... 78
Table 31: Hypothesis 3 - Logistic regression with an interval IV (t-test)...................... 79
Table 32: Hypothesis 3 - Summary of conclusions for the third null hypothesis......... 80
Table 33: Hypothesis 3 - Probability levels for each of the six tests broken down by issue....................................................................................................... 81
Table 34: Hypotheses 1-3 - Multivariate logistic regression with dummy IV's............ 82
Table 35: Hypotheses 1-3 - Multivariate logistic regression with interval IV's............ 83
Table 36: Hypotheses 1-3 - Logistic regression coefficients for the multivariate model with interval IV's.............................................................................. 83
Table 37: Hypothesis 4, part 1 - Two-way classification chi-square test using the liberal definition of the relationship............................................................ 85
Table 38: Hypothesis 4, part 1 - Two-way classification chi-square test using the conservative definition of the relationship................................................... 85
Table 39: Hypothesis 4, part 1 - Two-way classification chi-square test using the unbiased definition of the relationship......................................................... 86
Table 40: Hypothesis 4, part 1 - t-test between means using the unbiased definition of the relationship....................................................................... 87
Table 41: Hypothesis 4, part 1 - One-way ANOVA using the unbiased definition of the relationship....................................................................................... 88
Table 42: Hypothesis 4, part 1 - Summary of conclusions for the fourth null hypothesis................................................................................................... 88
Table 43: Hypothesis 4, part 2 - Two-way classification chi-square test using the liberal definition of the relationship............................................................ 89
Table 44: Hypothesis 4, part 2 - Two-way classification chi-square test using the conservative definition of the relationship................................................... 90
Table 45: Hypothesis 4, part 2 - Two-way classification chi-square test using the unbiased definition of the relationship......................................................... 91
Table 46: Hypothesis 4, part 2 - t-test between means using the unbiased definition of the relationship....................................................................... 92
Table 47: Hypothesis 4, part 2 - One-way ANOVA using the unbiased definition of the relationship....................................................................................... 92
Table 48: Hypothesis 4, part 2 - Summary of conclusions for the fourth null hypothesis................................................................................................... 93
Table 49: Other findings - Two-way contingency table comparing responses to the binomial distribution............................................................................. 95
CHAPTER I
Introduction
Is survey research valid? Most social scientists see the written survey as a way of
measuring the attitudes and beliefs of a population or subpopulation. Great care is taken
in the preparation of the survey instrument. Yet, little is known about the ways in which
a researcher's own beliefs might affect the results of a study.
In the last fifty years, thousands of studies have been conducted to examine the specific
characteristics of survey research. Clearly, the way people respond to questions (or
whether they respond at all) depends to some extent upon the characteristics of the
survey itself.
Do researchers unconsciously incorporate their personal beliefs or those of the sponsor
into a survey? Scholarly research involves a concentrated effort to maintain objectivity,
but in spite of the best of intentions, are researchers subconsciously "rigging" the results?
Is objectivity compromised by the choice or wording of questions?
Objectivity may be an illusion in surveys. Relativity may be a more appropriate model.
Obviously, understanding the relationship between the investigator and the respondent is
of paramount importance to the survey research community. Countless decisions are
made on the basis of survey research. It is imperative that we know if the research is
valid.
Statement of the Problem
To what extent do survey researchers unknowingly incorporate their personal beliefs (or
those of the sponsor) into a survey, through their question wording choices?
There is substantial evidence to demonstrate that people's answers to questions are
influenced by a number of factors, including form effects, interviewer effects, and
response effects. Some studies have shown that these can dramatically alter the
conclusions that a researcher would draw from the results. Recent studies have
demonstrated that even minor changes in question wording can affect the way that people
respond. The implication is unavoidable: The validity of all survey research is in
question. Possibly, the concept of validity itself needs to be examined.
All survey researchers strive for objectivity. It might be considered the signature of the
scientific method. Yet, we are forced to wonder whether or not it is possible to remain
truly objective. How can we be sure that a questionnaire does not reflect the beliefs of
its creator(s)? Is it possible that researchers unknowingly incorporate their own attitudes
or those of the sponsor into the construction of their questionnaires? If so, are there
differences depending on the experience of the researcher? These are important
questions that strike at the heart of survey research.
There are many ways in which investigators might unknowingly influence the results of
their research. The dramatic impact of interviewer effects is well-documented. Form
effects and response effects have also been shown to be present in surveys. Working
together, these factors can create staggering changes in response distributions.
Of particular interest are the effects of question wording. Seemingly minor alterations in
question wording can produce significant differences in respondents' answers. When a
survey is designed to predict behavior, question wording can be validated by correlating
respondents' answers with their behaviors. However, attitude and public opinion surveys
generally lack an objective standard by which to judge the validity of question wording.
In the absence of an objective reference, there is no way for a researcher to determine
which form of question wording produces the most accurate barometer of public opinion.
In other words, it may not be possible to determine which is the "best" form of question
wording.
Survey designers often make choices about question wording without validating
information. Why do they choose one form of a question over another? Do they
subconsciously select the question wording that will support their own views? Question
wording may be a form of modeling, where investigators consciously or subconsciously
project their own views onto those being studied.
Purpose of the Proposed Research
The purpose of this study was to understand the degree to which researchers
unknowingly incorporate their personal beliefs or those of the sponsor into a survey,
through their question wording choices.
Research Questions
Do survey researchers unknowingly influence the results of a survey through their
question wording choices?
Do the personal opinions of researchers influence their question wording choices?
Does the persuasion of the sponsor of a survey influence researchers' question wording
choices?
Are there differences between experienced and inexperienced researchers in their
question wording choices?
Are there differences between beginning and advanced researchers with respect to the
degree to which they incorporate their personal opinions and those of the sponsor into
their question wording choices?
Null Hypotheses
1. Researchers' choices for question wording are not related to their personal
opinions.
2. Researchers' choices for question wording are not related to the persuasion of the
study sponsor.
3. Researchers' choices for question wording are not related to their self-assessed
experience in questionnaire design.
4. There are no differences between beginning and advanced researchers with
respect to the degree to which they incorporate their personal opinions and those
of the sponsor into their question wording choices.
Significance of the Problem
Every day, a large number of decisions are made based on the results of survey research.
Companies use surveys to develop products and marketing strategies. Public service
organizations use surveys to identify subpopulations and to document need. Government
officials rely heavily on surveys to tap public opinion. Most members of our society
have been affected by decisions made on the basis of surveys. The assumption is that
survey research is valid. The "margin of error" is usually reported for surveys and this
might lull the users of survey research into believing the results are valid. However,
sampling error (i.e., the margin of error) captures only one source of error. There are other
sources of bias with far greater potential to reduce our confidence in the results of a survey.
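As an illustrative aside, the conventional 95% margin of error is a simple function of the sample proportion and the sample size, and it reflects sampling variability alone. The following sketch (a hypothetical illustration, not part of the study's analysis; the sample size of 361 is borrowed from this study's response count purely as an example) makes this concrete:

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """95% margin of error for a sample proportion.

    This quantifies sampling error only -- it says nothing about bias
    introduced by question wording, interviewers, or nonresponse.
    """
    return z * math.sqrt(p * (1 - p) / n)

# A 50/50 split among 361 respondents:
moe = margin_of_error(0.5, 361)
print(f"+/- {moe * 100:.1f} percentage points")  # roughly +/- 5.2 points
```

Note that quadrupling the sample size only halves this figure, while a biased question wording can shift responses by far more than the reported margin, which is exactly why the margin of error alone can lull users into overconfidence.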
Many forms of research bias have been identified and studied. Very little is known,
however, about how question wording might reflect the personal beliefs of a researcher,
or those of the study sponsor. Our concept of validity encompasses the idea that the
researcher is independent of the phenomena being studied. The problem is that this
might not be true. This study will investigate whether or not researchers might
unknowingly affect the results of surveys through their question wording choices.
This study will benefit all survey researchers and the recipients of research results. If the
null hypotheses are not rejected, researchers will have more confidence in the validity of
survey research and their abilities to design objective and unbiased surveys. If the null
hypotheses are rejected, then the validity of most surveys will be called into question and
researchers will be forced to closely examine the survey technique itself.
Definitions of Terms
Many terms common to survey research will be used in this study. It may be helpful to
define them at the outset.
Acquiescence: The tendency of respondents to agree more often than disagree.
Anchor: The words used to define the endpoints of an ordinal scale.
Change effect: The act of being interviewed promotes attitude formation or change
which otherwise might not have occurred.
Freezing effect: The act of being interviewed inhibits a change in a respondent which
otherwise might have occurred.
Interviewer effect: A change in a subject's responses due to some characteristic of the
interviewer, such as race, social distance, intonation, gestures, expectations, etc.
Modeling effect: A change in respondents' answers due to the conscious or unconscious
projection of the investigator's (laboratory experimenter, clinician, survey designer,
or interviewer) own views.
Panel survey: A survey in which the same respondents are asked the same questions on
two or more occasions. A panel survey is also referred to as a repeated measures
experiment.
Probing: An attempt by an interviewer to get more information from the respondent by
asking the respondent to clarify or give a more detailed answer.
Response effects: A change in subjects' responses due to some characteristic of the
survey itself.
Response rate: The percentage of respondents who complete the survey, or who answer a
question.
Salient (items): Items on a survey that are meaningful or important to the respondent.
Split-ballot experiment: An experimental design where two or more groups of similar
respondents are exposed to different treatment conditions.
Vague quantifier: An adjective or adverb without precise meaning, that describes
frequency, quantity, or intensity.
Vignette: An elaborated description of a concrete situation.
CHAPTER II
Review of the Literature
The survey process has been the subject of substantial research. Many of these studies
have focused on various ways of increasing response rates. This is important because it
defines an upper limit on the confidence we can place in a survey's results. Other studies
have examined the effects of the survey process itself. These are particularly interesting
because they directly address the issue of validity in survey research.
There is considerable evidence to show that the results of a survey can be influenced by
an interviewer. These include interviewer errors, the social distance between the
interviewer and the respondent, and interviewer verbal and visual cues. There are also
many studies that show that the results of a survey might be influenced by the serial order
and format of the questions and response alternatives. Strong response effects have been
documented from the inclusion of middle and "don't know" alternatives, the format of the
questions (e.g., assertion versus interrogation, or presenting one or both sides of an
issue), and question wording.
Interviewer Effects
"Interviewer effects" refer to changes in subjects' responses due to some characteristic of
the interviewer. The potential of an interviewer to affect people's responses has been
extensively studied. Effects have been documented as a result of an interviewer's race,
social distance from the respondent, intonations, gestures, and expectations. Early
studies concentrated on face-to-face interviews. For example, Skelly (1954) discussed
the potential of bias when an interviewer had a stereotypical appearance. More recent
studies using telephone interviews have confirmed the effect of verbal cues.
Interviewer Errors
The most obvious interviewer effect is interviewer error. Hanson and Marks (1958)
found that interviewers frequently omitted or altered question wording. In addition, they
discovered that interviewer probing tended to alter respondents' initial replies to a
question. Schyberger (1967) reported that interviewers often deviated from instructions
and only small differences were found between experienced and inexperienced
interviewers in the degree of deviation.
Social Distance Between the Interviewer and the Respondent
In 1968, three different studies proposed a social distance model to describe the
relationship between the interviewer and the respondent. All three researchers reported
that respondents tended to acquiesce (i.e., response marginals changed nearly twenty
percent); however, they differed in their findings regarding the social distance condition
that would create the acquiescence. Weiss (1968) interviewed welfare mothers and
found that a smaller social distance between the interviewer and the respondent produced
a greater bias. Williams (1968) found that a greater social distance produced a greater
bias. Dohrenwend, Colombotos, and Dohrenwend (1968) concluded that the best
interview data is obtained when the interviewer is not too close, or not too far away, in
social distance to the respondent. They argued that a deviation in either direction could
introduce bias (i.e., interviewer effects).
One of the most important findings, reported by Williams (1968), was that even when
social distance was held constant, the interviewer's role performance (i.e., rapport and
objectivity) still affected responses. "This suggests that objectivity is not only related to
interview bias but may be as significant as race of interviewer" (p. 291). Social distance
created some interviewer effects, but other interviewer effects were also present.
Interviewer Verbal Cues
Collins (1970) reported that interviewers' verbal idiosyncrasies, such as vocabulary
and verbosity, could strongly influence respondents, and concluded by expressing strong
doubt about the validity of many survey interviews.
Phillips and Clancy (1972) studied 25 white females who were all experienced telephone
interviewers. They developed a survey to measure nine areas of interest to social
scientists--general happiness, religiosity, number of friends, current health status,
prejudice, doctor visits, mental health status, need for social approval, and dissimulation.
Dissimulation was measured through people's responses regarding their use of ten
nonexistent consumer products (e.g., books, movies, products, etc.). Ostensibly, as part
of their training, the interviewers were asked to complete the same interview that they
would be administering to respondents. A sample of 404 telephone customers was
selected and randomly assigned to the interviewers. Because of the small number of
interviewers, the researchers recoded the responses into "low" and "high" dichotomies on
each of the items. In eight of the nine measures, respondents' attitudes showed positive
correlations with the interviewers' attitudes, although the effect was small and not
statistically significant. They concluded that "because modeling effects are such a minor
source of bias, they are not worthy of further consideration" (p. 253).
In contrast, Barath and Cannell (1976) found that interviewers' voice intonations had
strong effects on respondents' answers to questions. However, this effect did not seem to
apply to "yes-no" dichotomous questions (Blair, 1977). Blair's findings suggest that
Phillips and Clancy (1972) may have masked the strength of the interviewer effect by
recoding the interviewer responses into "high-low" dichotomies.
Response Effects
Response effects refer to a change in people's responses because of some characteristic of
the survey itself. The most obvious response effect might be created by the mode of the
survey (e.g., telephone, mail, face-to-face interview). Other examples of response effects
come from the characteristics of the survey instrument (e.g., the length of the
questionnaire, the order of question presentation, question wording, the order of the
response options, the use of no-opinion filters and middle-response options).
Question Order
Many early researchers suggested that surveys should begin with a few non-threatening
and easy-to-answer items (Erdos, 1957; Robinson, 1952; Sletto, 1940). The rationale was
that people would not complete the survey if the first items were too difficult or
threatening.
More recent studies have found that the most important items should appear near the
beginning of a survey (Kraut, Wolfson, and Rothenberg, 1975; Levine and Gordon,
1958; Mullner, Levy, Byre, and Matthews, 1982). The rationale is that people generally
look at the first few questions before deciding whether or not to complete the
questionnaire (Levine and Gordon, 1958). In addition, respondents often send back
partially completed questionnaires, or they decide to terminate an interview before it is
completed. By putting the most important items near the beginning, the partially
completed surveys will still contain important information.
In a study of 5,842 hospital CEO's, Mullner et al. (1982) found that the response rate on
a written questionnaire was significantly affected by the order of the questions.
Questionnaires that began with the most salient items produced greater response rates
than those in the reverse order. Another team of investigators reported that questions in
the latter half of a questionnaire were much less likely to contain extreme responses, and
they were also more likely to be omitted (Kraut, Wolfson, and Rothenberg, 1975).
Carp (1974) suggested that it may be necessary to present general questions before
specific ones in order to avoid response contamination. In contrast, McFarland (1981)
reported that when specific questions were asked before general questions, respondents
showed greater interest in the general questions.
Most investigators have found that the order in which questions are presented can affect
the way that people respond (Noelle-Neumann, 1970; Schuman and Presser, 1981;
Smith, 1982; Sudman and Bradburn, 1974; Turner and Krauss, 1978; Tourangeau and
Rasinski, 1988; Tourangeau, Rasinski, Bradburn, and D'Andrade, 1989). Tourangeau
and Rasinski (1988) proposed a four-stage process to describe how people respond to
attitude questions. They conclude that the order of the questions can affect the entire
process.
Respondents first interpret the attitude question, determining what attitude the question is about. Then they retrieve relevant beliefs and feelings. Next, they apply these beliefs in rendering the appropriate judgment. Finally, they use this judgment to select a response. All four of the component processes can be affected by prior items. (p. 299)
In contrast, other researchers have reported that question order does not affect responses.
Bradburn and Mason (1964) reported that interviews involving self-reports and self-
evaluations were unaffected by question order. Clancey and Wachsler (1971) found that
responses to questions were similar regardless of where the questions appeared in a
questionnaire. Smith (1982) argued that the serial order of the questions was less of a
problem in written surveys because respondents were free to go back and change
previous answers. Bishop, Hippler, Schwarz, and Strack (1988) compared response
effects in self-administered and telephone surveys. They also concluded that the serial
order of questions is less likely to produce response effects in written questionnaires
because respondents are free to read over the entire questionnaire, or re-read selected
portions. In telephone surveys, respondents do not have this option and their responses
are more likely to be "off-the-top of their heads" (Hippler and Schwarz, 1987). In a
recent study, Ayidiya and McClendon (1990) reported that order effects existed, but to
varying degrees, depending on the questions.
Middle Alternatives
Bishop (1987) conducted a series of random-digit-dialed telephone surveys to determine
the effect of offering a middle response to subjects. Three societal issues (social security,
defense spending, and nuclear energy) were used to formulate questions with and without
a middle response. In these experiments, the middle responses were to maintain the
present social security benefits, to maintain the current level for defense spending, and to
operate only those nuclear power plants that are already built. Bishop found that
including or excluding a middle response could sufficiently change the results so that a
researcher would draw different conclusions about people's opinions, although the effect
was inconsistent. Furthermore, when the middle alternative was presented at the end of
the question, more people tended to select it.
Ayidiya and McClendon (1990) also found that more people select a middle response
when it is specifically offered. However, in contrast to Bishop (1987), they concluded
that excluding a middle alternative from an item would not alter the bipolar response
pattern for the item. In other words, a researcher would have come to the same
conclusions regardless of whether there was a middle response option or not.
"Don't Know" Alternatives
Poe, Seeman, McLaughlin, Mehl, and Dietz (1988) studied the effect of including a
"don't know" (DK) option in a questionnaire consisting of factual questions. The sample
of 1,360 subjects was randomly assigned to one of two groups. One group was told to
check the DK box if they didn't know an answer and the other group was told to place a
question mark in the answer space if they didn't know an answer. The response rate to
both forms of their survey was about the same (61.5% with DK boxes and 58.2%
without). These investigators reported that the version with the DK box produced a
significantly higher percentage of "don't know" responses than the form without the DK
box. Furthermore, a telephone follow-up to 22 percent of the sample indicated that there
were no significant differences in error rates between the two question formats. The
authors found that some items produced large differences, while others did not.
However, "there was no single characteristic (e.g., potentially sensitive) that would
typify the questions which had higher substantive responses in the format without DK
boxes" (p. 218).
Ayidiya and McClendon (1990) also studied whether the inclusion of a "don't know"
alternative alters response patterns. Their sample consisted of 532 households drawn
from the Akron, Ohio phone book and their response rate was 63 percent. They reported
that the inclusion of the "don't know" alternative significantly decreased the number of
salient responses.
Presenting One or Two Sides of an Issue
Schuman and Presser (1981) studied the effects of balanced (i.e., one that offers a second
substantive alternative) versus imbalanced questions over a period of several years. Four
social issues (gun control, abortion, unions, and the fuel shortage) were presented to
respondents in a series of split-ballot experiments. Their balanced questions presented
respondents with two legitimate sides of an issue. For example, the balanced version of
their gun control question asked:
Would you favor a law which would require a person to obtain a police permit before he could buy a gun, or do you think such a law would interfere too much with the right of citizens to own guns? (p. 182)
In contrast, their imbalanced question presented only one side of the issue:
Would you favor a law that would require a person to obtain a police permit before he could buy a gun? (p. 182)
Schuman and Presser (1981) concluded that "it does not appear that purely formal
balance of attitude items makes a detectable difference in their univariate distributions"
(p. 184).
In contrast to Schuman and Presser (1981), many other researchers have found that
presenting subjects with a balanced question can significantly change subjects' responses
(Bishop, Oldendick, and Tuchfarber, 1982; Hedges, 1979; Kalton, Collins, and Brook,
1978; Noelle-Neumann, 1970; Payne, 1951; Rugg and Cantril, 1944).
Noelle-Neumann (1970) conducted 300 interviews with nonworking housewives. One
form of the question asked, "Would you like to have a job, if this were possible?" and
the other form asked, "Would you prefer to have a job, or do you prefer to do just your
housework?" With the first form of the question, 19 percent said they did not want a job,
and with the second form, 68 percent said they did not want a job. Noelle-Neumann
stated that the differences were "so staggering that it is apparent that much research needs
to be done to establish the psychological and cognitive reasons for this" (p. 200).
Bishop, Oldendick, and Tuchfarber (1982) examined whether presenting balanced
questions affected the marginal frequencies and the number of "don't know" answers.
Two telephone surveys (six months apart) were conducted and data was collected from
1,218 respondents. The interviewer asked respondents their opinions on nine public
policy and social issues (e.g., the power of the federal government, government
guaranteed employment, fair treatment of blacks, equal opportunities for women,
government involvement in desegregation, etc.). The questions were randomized and
some subjects received only one side of a question, while others received a two-sided
version of the question.
An example of one of their one-sided questions is:
Some people feel that the government in Washington should see to it that every person has a job and a good standard of living. Do you have an opinion on this or not? (IF YES) Do you agree or disagree with the idea that the government should see to it that every person has a job and a good standard of living? (p. 71)
The two-sided form of the same question was:
Some people feel that the government in Washington should see to it that every person has a job and a good standard of living. Others think the government should just let each person get ahead on his own. Have you been interested enough in this to favor one side over the other? (IF YES) Do you agree or disagree with the idea that the government should see to it that every person has a job and a good standard of living, or should it let each person get ahead on his own? (p. 71)
Note that this study attempted to balance both the background information ("Others think
the . . ."), and the question itself (". . . or should it let each person get ahead on his
own?"). The filter to remove the "no opinion" responses was also varied ("Do you have
an opinion on this or not?" versus "Have you been interested enough in this to favor one
side or the other?").
The results indicate that stating two sides of an issue usually (in eight of the nine
comparisons) produced changes in the marginal frequencies; however, the difference was
only significant in five of the nine comparisons. They conclude that offering respondents
another choice on an issue "will not necessarily attract them, though it may make them
more likely to 'think about' expressing an opinion" (p. 75). In addition, they found that
less educated respondents tended to acquiesce (agree) more often. This agreed with
previous findings of Jackman (1973) and Schuman and Presser (1977). The type of filter
question to eliminate "no opinion" responses did not produce significant differences in
opinions.
The researchers discussed their results in terms of information most accessible to
respondents' memories. They hypothesized that presenting a second side to an issue
places it in the most recent memory of the respondent, thus increasing the likelihood that
it will be selected. In other words, subjects tend to respond "with the first thing that
comes to mind" (Bishop et al., 1982, p. 78). Less educated respondents tended to choose
the second side more often because they had less environmental context information to
compete with the most recent memory (i.e., the second side of the issue).
The model proposed by Bishop et al. (1982) is that we "look at the survey interview as a
microcosmic communication and persuasion experiment, in which the interviewer is
presenting one- versus two-sided statements that are more or less persuasive
communication which the respondent is asked to accept or reject" (p. 80). Our
understanding of the survey process would then depend on our ability to assess the
strengths of the alternative positions on an issue. They conclude that presenting both
sides of an issue is not the answer, because the "other side of the issue" is not always
easily defined. Of particular importance was their recognition that the concept of
relativity brought all survey research into question.
. . . it becomes questionable whether we should say that one format or another is "biased" or "unbiased" since that is clearly a relative matter. Biased for whom? Only those with less than a high school education? Or does that depend upon the issue? For that matter, should we even bother to think about whether there is a "true" or "correct" wording and format for a given issue? Or are there only more or less useful ones for a given purpose of explanation or prediction? (p. 82-83)
Assertion Versus Interrogation Format
Several researchers have looked at the effects of using interrogation and assertion
formats. In a typical question using an interrogation format, the respondent is provided
with some brief background information about an issue and then asked a question about
the issue. In an assertion format, the respondent is provided with the same introductory
information; however, they are then asked to indicate their level of agreement or
disagreement with a particular assertion. This issue is of great concern to questionnaire
designers because agree/disagree attitude scales are so common in surveys.
In an early study by Zillmann (1972), subjects listened to a defense attorney's closing
arguments. One group heard assertion statements (e.g., "Johnny was a peaceful boy"),
while the other group heard interrogatories (e.g., "Johnny was a peaceful boy, wasn't
he?"). Subjects who heard the interrogatory version recommended significantly shorter
prison sentences. The act of hearing a question changed the respondents' attitudes.
Schuman and Presser (1981) conducted two studies to compare the assertion and
interrogation formats. In contrast to Zillmann, they concluded that "there is nothing
special about agree-disagree assertions as distinct from interrogative forms that produces
acquiescence" (p. 228).
Petty, Rennier, and Cacioppo (1987) studied the effects of wording survey items as either
questions or assertions. Ninety-one undergraduate students were the subjects of two
experiments. Both experiments involved students' attitudes towards a new product, when
preceded by either weak or strong background information. In one experiment,
background information was presented about a new calculator and the other was for a
new disposable razor. As predicted, strong background information encouraged students
to see the products as more desirable, F(1,45)=86.79, p<.001. However, the researchers
also found that the interrogation format caused greater polarization in subjects' responses,
F(1,45)=4.53, p<.04. When subjects were presented with strong background
information, the interrogation format produced more positive attitudes towards the
products, and when subjects were presented with weak background information, the
interrogation format produced more negative attitudes.
Petty, Rennier, and Cacioppo (1987) discuss their findings by hypothesizing that people
engage in greater cognitive processing when the interrogation format is used. That is,
asking a question produces greater item-relevant thinking than asking a person their level
of agreement or disagreement with a statement. However, when attitudes are used to
predict behavior, this is not necessarily a desirable feature. Wilson, Dunn, Bybee,
Hyman, and Rotondo (1984) found that thinking about one's attitude can actually reduce
the relationship between attitude and behavior. It was hypothesized that this most often
occurs when the attitude object has an affective, rather than cognitive, basis.
Other researchers have demonstrated that rhetorical questions are more effective at
persuasion than assertions (Burnkrant and Howard, 1984; Petty, Cacioppo, and
Heesacker, 1981; Swasy and Munch, 1985; Zillmann, 1972).
Vague Quantifiers
Survey researchers often ask respondents to judge frequency ("how often"), quantity
("how much"), or intensity ("how strongly"). The response alternatives for these
questions are usually presented as an ordinal scale made up of descriptive adjectives or
adverbs. These words have been appropriately dubbed "vague quantifiers," to emphasize
the imprecision of their meanings.
Mosier (1941) conducted one of the first studies of vague quantifiers and reported that
the meaning of these words varied between individuals. Mosier hypothesized that
"meaning" was distributed normally and that the mean represented the average
"meaning."
Simpson (1944) asked subjects to evaluate twenty different vague quantifiers and to give
each one a meaning by assigning a proportion to indicate its absolute frequency. Hakel
repeated the experiment in 1968. Both researchers used the median proportion to rank
the vague quantifiers in order of their perceived meanings. The rank order correlation
between the two experiments was .99; however, large differences were found between
the actual values of the medians. Hakel reported that "Variability is rampant. One man's
'rarely' is another man's 'hardly ever'" (p. 533). Table 1 shows a comparison of
Simpson's and Hakel's findings.
Cliff's (1959) research focused on words that intensify the phrase that they are modifying
(e.g. "quite", "very", "extremely"). These words have no value of their own, but rather,
they act like multipliers to move the meaning of the phrase closer to an extreme. Cliff
attempted to construct numeric coefficients to describe the degree to which intensifiers
altered the meaning of the phrase being modified. For example, Cliff found that "very
often" means 1.317 times as frequently as "often", and "slightly often" means .55 times
as frequently as "often."
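Cliff's multiplicative model lends itself to a simple illustration. The sketch below is hypothetical Python, not Cliff's procedure: the coefficients for "very" (1.317) and "slightly" (.55) come from the findings just cited, while the baseline frequency assigned to "often" is an arbitrary value chosen only for demonstration.

```python
# A minimal sketch of Cliff's (1959) multiplier model of intensifiers.
# The multipliers below are the values cited in the text; the baseline
# frequency for "often" (70 occurrences per 100 opportunities) is a
# hypothetical figure used purely for illustration.

INTENSIFIERS = {
    "very": 1.317,     # "very often" ~ 1.317 x "often"
    "slightly": 0.55,  # "slightly often" ~ 0.55 x "often"
    "": 1.0,           # unmodified adverb carries its own meaning
}

def implied_frequency(baseline: float, intensifier: str = "") -> float:
    """Scale a baseline frequency by an intensifier's multiplier."""
    return baseline * INTENSIFIERS[intensifier]

baseline_often = 70.0  # hypothetical baseline for "often"
print(round(implied_frequency(baseline_often, "very"), 2))      # 92.19
print(round(implied_frequency(baseline_often, "slightly"), 2))  # 38.5
```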
Table 1
A Comparison of Simpson's and Hakel's Findings on the Meaning of Vague Quantifiers
Simpson (1944)                   Hakel (1968)
Word                    Median   Word                    Median
Always                      99   Always                     100
Very often                  88   Very often                  87
Usually                     85   Usually                     79
Often                       78   Often                       74
Generally                   78   Rather often                74
Frequently                  73   Frequently                  72
Rather often                65   Generally                   72
About as often as not       50   About as often as not       50
Now and then                20   Now and then                34
Sometimes                   20   Sometimes                   29
Occasionally                20   Occasionally                28
Once in a while             15   Once in a while             22
Not often                   13   Not often                   16
Usually not                 10   Usually not                 16
Seldom                      10   Seldom                       9
Hardly ever                  7   Hardly ever                  8
Very seldom                  6   Very seldom                  7
Rarely                       5   Rarely                       5
Almost never                 3   Almost never                 2
Never                        0   Never                        0
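The .99 rank-order correlation reported by Hakel can be checked directly against the medians in Table 1. The sketch below computes a Spearman coefficient (a Pearson correlation of average ranks) in plain Python; it is offered as an illustration of the calculation, not as the original authors' method.

```python
# Medians from Table 1, in the row order shown (Simpson 1944 vs. Hakel 1968).
simpson = [99, 88, 85, 78, 78, 73, 65, 50, 20, 20, 20, 15, 13, 10, 10, 7, 6, 5, 3, 0]
hakel   = [100, 87, 79, 74, 74, 72, 72, 50, 34, 29, 28, 22, 16, 16, 9, 8, 7, 5, 2, 0]

def average_ranks(values):
    """Rank values (1 = smallest), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 0-based positions i..j, made 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Spearman's rho = Pearson correlation of the rank vectors.
rho = pearson(average_ranks(simpson), average_ranks(hakel))
print(round(rho, 2))
```

The near-perfect correlation confirms that the two samples agreed on the *ordering* of the quantifiers even though the absolute medians differed.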
Mosier (1941), Simpson (1944), Hakel (1968), and Cliff (1959) proceeded under the
assumption that a continuum could be established and the meaning of a vague quantifier
could be placed at a precise point on the continuum for a given individual. Several
researchers have challenged this notion (Chase, 1969; Parducci, 1968; Pepper and
Prytulak, 1974). Instead, these researchers believe that the meanings of these words are
flexible and come from the context in which they are used.
Chase (1969) used Hakel's list of vague quantifiers to construct two scales. One scale
was made up of high-frequency quantifiers ("occasionally", "now and then", "about as
often as not", "usually", and "very often"). The other scale consisted of low-frequency
quantifiers ("seldom", "not often", "once in a while", "occasionally", and "generally").
Questions containing both scales were presented to 34 students. Chase found no
significant differences in the response distributions, regardless of the scale. In other
words, people judged the meaning of a vague quantifier relative to the other response
alternatives and not according to some absolute meaning.
Pepper and Prytulak (1974) provided additional evidence to suggest that vague
quantifiers lack absolute meaning. They hypothesized that the meaning of vague
quantifiers would be perceived relative to the frequency of the events they are modifying.
Respondents were asked to assign a numerical value to represent the meaning of vague
quantifiers for highly probable events (e.g., gunfire in a Hollywood western), and highly
improbable events (e.g., an airplane crash). As hypothesized, the numerical estimates for
"very often", "frequently", "sometimes", "seldom", and "almost never" changed
depending on the frequency of the event. When describing a high-probability event,
"often" meant more often than when it described a low-probability event.
Bradburn and Miles (1979) asked respondents about the frequency of five positive and
five negative feelings. The response categories were "never", "not too often", "pretty
often", and "very often." After selecting a response category, respondents were asked
how many times a day this meant. The results of this study supported Cliff's (1959)
findings that "very often" is about 1.3 times as frequently as "often." This study also
found that the meaning of "not too often" was different, depending on whether it
described a positive or negative feeling (e.g., excited versus bored). The meanings of
"pretty often" and "very often" seemed to remain stable regardless of the feelings they
were describing.
Schaeffer (1991) used existing data on 1,172 adults to examine whether vague quantifiers
are interpreted differently by demographic groups of race, sex, education, and age.
Absolute frequency reports were compared to grouped response categories. Two
variables were investigated--excitement and boredom. Respondents were first asked,
"How often do you feel particularly... excited or interested in something?...[bored?]"
The response categories were the same as those used by Bradburn and Miles in 1979
("never", "not too often", "pretty often", and "very often"). After choosing a response
category, they were asked, "About how many times a week or a month did you mean?"
The log of the absolute frequencies was used for the analysis--the intent being to reduce
the effect of outliers. Schaeffer found significant differences in the meaning of the
response categories based on race F(1,1056)=4.85, p=.03, education F(2,1089)=20.71,
p<.01, and age F(3,1084)=18.17, p<.01. No differences were found between males and
females F(1,1094)=.46, p=.50. The results suggest that the choice of using absolute
frequencies versus response categories can change the conclusions that a researcher
would draw from the data.
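Schaeffer's use of logged frequencies can be motivated with a toy example. The sketch below uses hypothetical frequency reports (not Schaeffer's data) to show why a log transformation blunts the influence of a single extreme respondent on the group mean.

```python
# Illustration: one extreme respondent shifts an arithmetic mean far more
# than it shifts a mean of logs. The "times per week" values are invented
# solely to demonstrate the effect of the transformation.
import math

typical = [2, 3, 4, 5]          # hypothetical frequency reports
with_outlier = typical + [100]  # add one extreme respondent

def mean(xs):
    return sum(xs) / len(xs)

def mean_of_logs(xs):
    return mean([math.log(x) for x in xs])

shift_raw = mean(with_outlier) - mean(typical)
shift_log = mean_of_logs(with_outlier) - mean_of_logs(typical)
print(shift_raw)  # large shift on the raw scale
print(shift_log)  # much smaller shift on the log scale
```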
Wording of Questions and Response Alternatives
Many studies have shown that slight changes in the wording of a question or the response
alternatives can affect the way that people respond. Bishop et al. (1978) examined the
effect of question wording on several political issues and found that "the gap in the
magnitude of association generated by the two forms can only be described as massive"
(p. 85).
In a set of informal experiments, Krosnick (1989) found that alternative wordings for
response categories can significantly change marginal frequencies. Different labels for
the response scale changed the way people responded. Three different scales were tested:
1) "very acceptable", "somewhat acceptable", "not too acceptable", and "not acceptable at
all"; 2) "strongly favor", "somewhat favor", "favor a little", and "not favor at all";
3) "strongly support", "somewhat support", "support a little", and "not support at all."
The scales using "acceptable" and "favor" produced similar results, but the "support"
scale was strongly skewed toward the "not support at all" anchor.
Rasinski (1989) conducted a series of split-ballot experiments to examine the effects of
question wording (issue labels) on people's attitudes towards government spending
policies. The study asked respondents, "Are we spending too much, too little, or about
the right amount on...", and this was followed by a particular way of identifying a
government program. Large significant differences (p<.001) were noted between
responses for many of the wording variations. Furthermore, this questionnaire was
repeated for three consecutive years (1984-1986) and the differences in responses to the
wording variations remained stable over time. Table 2 summarizes Rasinski's results,
averaged for the three years.
Table 2
Summary of Rasinski's Results on Issue Labeling (1984-1986)
"Are we spending too much, too little, or about the right amount on..."
Issue Label Alternative Issue Label Percent Difference
"halting the rising crime rate" (67.8%) "law enforcement" (55.7%) 12.1%
"dealing with drug addiction" (63.9%) "drug rehabilitation" (54.6%) 9.3%
"assistance to the poor" (64.0%) "welfare" (22.7%) 41.3%
"assistance to big cities" (19.9%) "solving problems of big cities" (48.6%) 28.7%
"assistance to Blacks" (27.6%) "improving conditions of Blacks" (35.6%) 8.0%
"Social Security" (53.2%) "protecting Social Security" (68.2%) 15.0%
Note: Percentages are the proportion of respondents that said "too little" is spent (averaged over three years).
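The "Percent Difference" column in Table 2 is simply the absolute gap between the two labels' "too little" percentages. As a check, the sketch below recomputes it from the tabled values.

```python
# Recompute Table 2's "Percent Difference" column from the label percentages.
rows = [
    ("halting the rising crime rate", 67.8, "law enforcement", 55.7),
    ("dealing with drug addiction", 63.9, "drug rehabilitation", 54.6),
    ("assistance to the poor", 64.0, "welfare", 22.7),
    ("assistance to big cities", 19.9, "solving problems of big cities", 48.6),
    ("assistance to Blacks", 27.6, "improving conditions of Blacks", 35.6),
    ("Social Security", 53.2, "protecting Social Security", 68.2),
]

for label_a, pct_a, label_b, pct_b in rows:
    diff = round(abs(pct_a - pct_b), 1)
    print(f"{label_a!r} vs {label_b!r}: {diff}%")
```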
Rasinski (1989) discussed the results by hypothesizing that different labels "may bring to
mind different associations, actually changing the stimuli to which respondents are
reacting" (p. 392). Smith (1987) had observed similar effects in labeling welfare issues;
however, Rasinski found that the effect was also present for a variety of other social
issues.
Rasinski's (1989) findings are disturbing. How can researchers have confidence in the
validity of their studies when small wording changes can have such a profound impact?
Seemingly, "assistance to the poor" and "welfare" are close in meaning, yet, the
difference in the way people respond to these issue labels exceeds 41 percent. A public
opinion researcher choosing one label might come to a completely different conclusion
than another researcher using a different label. Yet, both labels seem to be reasonable
choices for a survey designer.
Rasinski (1989) reports that his research contributes "new examples of successful and
failed question wording experiments" (p. 394). One might ask the important question,
"successful or failed for whom?" Without taking into account the relativity of the
researcher or sponsor, there is no way to ascertain which label was successful and which
one failed. Rasinski concludes that progress in this area will come from researchers in
the area of cognition and communication. Another possibility is that it will come from
the application of relativity to the research process. Regardless, these results cast strong
doubts on the validity of public opinion and attitude research.
Attitude of the Survey Designer
Many of the above mentioned researchers came to the conclusion that the results of a
survey could be substantially altered by the survey instrument or interviewer. Some
researchers examined the cognitive and contextual processes involved in respondents'
decisions. However, none of them looked at the role of the survey designer in the
creation of the questions.
This researcher could not locate any studies that investigated the relationship between
survey designers' attitudes and the questions they develop. This study will break new
ground in that it recognizes that survey instruments do not just "pop into being." On the
contrary, they are generally carefully constructed by people sincerely interested in
finding the truth on an issue. We know much about the way people respond to questions.
This study will add to our knowledge by looking at how researchers formulate those
questions.
CHAPTER III
Methodology
Do survey researchers unknowingly influence the results of a survey through their
question wording choices? This study tested four null hypotheses. These are:
1. Researchers' choices for question wording are not related to their personal
opinions.
2. Researchers' choices for question wording are not related to the persuasion of the
study sponsor.
3. Researchers' choices for question wording are not related to their self-assessed
experience in questionnaire design.
4. There is no difference between beginning and advanced researchers with respect
to the degree to which they incorporate their personal opinions and those of the
sponsor into their choice of question wording.
Research Design
This study used a mail questionnaire to examine a nonprobability sample of researchers
who have some involvement in survey research. Subjects received a questionnaire that
asked them to "design" a public opinion poll to investigate how people feel about
government spending on six social issues.
The questionnaire itself contained four components. The first component was a short
introductory paragraph that placed the respondents in a hypothetical situation where they
were hired by a sponsoring agency to conduct a public opinion survey. The second
component asked respondents to choose one of the two issue labels for each of the six
issues. The third component asked respondents how they would personally answer the
six questions. The final component asked them to provide a self-rating of their
experience as a designer of surveys (beginner, average, or expert). Appendix A contains
a copy of the cover letter and survey. The response alternatives for questions 2, 4, and 6
were reversed in order to nullify any serial-order or response set effects.
Two different forms of question wording were presented for each of the six social issues.
These are the same issue labels (i.e., question wording alternatives) used by Rasinski
(1989). They were selected for this study because of their demonstrated ability to
produce consistent differential response patterns. Respondents were asked to select the
issue labels that they would use in a public opinion poll.
Each question was prefaced with, "Are we spending too much, too little, or about the
right amount on . . ." The respondent then picked one of the issue labels to complete the
question. Table 3 shows the alternative issue labels for each of the six social issues. In
Rasinski's experiments, the first column of issue labels was more likely to produce a
higher percentage of people who say that "too little is spent."
Table 3
Rasinski's (1989) Issue Labels that were Replicated in this Study
"Are we spending too much, too little, or about the right amount on . . ."
Issue <-------------------------- Alternative Issue Labels -------------------------->
Crime "halting the rising crime rate" "law enforcement"
Drug addiction "dealing with drug addiction" "drug rehabilitation"
Welfare "assistance to the poor" "welfare"
Cities "solving problems of big cities" "assistance to big cities"
Blacks "improving conditions of Blacks" "assistance to Blacks"
Social Security "protecting Social Security" "Social Security"
Note: This table is arranged such that the first column of issue labels is the one that people are more likely to respond that "too little is spent."
Subjects were randomly assigned to receive one of three different forms of the survey.
The only difference between the three forms was in a one paragraph introduction that
described the persuasion of the sponsor of the study.
You have been hired by a [liberal/conservative] organization to find out how the
public really feels about government spending on six social issues. During the
planning meeting, you [discover that the organization favors increased/reduced
spending levels. You also] learn that the organization is considering hiring you to
conduct several other surveys in the future. As you leave the meeting, the
Director says to you, "We really need to know the truth, so be objective."
In one form of the survey, the sponsor was a liberal organization favoring increased
program spending, and in another form of the survey, the sponsor was a
conservative organization favoring reduced spending. In the third form of the survey, the
sponsor was not identified (i.e., the bracketed text was excluded). This served as a
control group for the sponsorship variable.
It is important to note that the opening paragraph also told respondents to "be objective."
This was included because it more closely approximates the "real-world" research
process. On a conscious level, sponsors of research are nearly always interested in
finding the "truth" on an issue, regardless of their own persuasion. If sponsorship effects
exist, they do so in spite of a sponsor's conscious desire to find the truth. In addition, it
was believed that the inclusion of the instructions to "be objective" would inform
respondents that they should not make deliberate attempts to satisfy the desires of the
sponsor.
At the outset of this study, it was assumed that respondents would probably understand
the purpose of this study as soon as they looked at the questionnaire. No attempts were
made to conceal the goals of this research and none were needed. In the "real world", if
researchers' personal opinions (or those of the sponsor) affect their choices of question
wording, then it happens in spite of researchers' conscious efforts to remain objective.
Sample Selection
The population for this study was all researchers who design surveys. This, of course,
includes a wide variety of people from many different disciplines and with varying levels
of survey design experience. Obviously, it was not possible to identify all members of
the population and therefore, a random sample could not be chosen. Since this study was
exploratory in nature, a convenience sample was appropriate. This nonprobability
method is often used during preliminary and exploratory research efforts to get rough
estimates of the results.
Rasinski's (1989) experiments were conducted over a period of three years (1984-1986).
During that time, the minimum observed effect of issue labeling was 5.5 percent in 1985
(28.2% said too little is spent on "assistance to Blacks" versus 33.7% for "improving
conditions of Blacks"). Therefore, the sample size for this study was selected to allow a
5.5 percent difference to be significant at the 95 percent confidence level. The formula
to determine sample size given an expected difference between two percents is:
N = Z² · [(P1 + P2) - D²] / D²
where,
N is the required sample size.
Z is the standard normal deviate for the desired confidence level.
P1 and P2 are the proportions for the first and second groups.
D is the difference between the two proportions (P1 - P2).
This study used a one-tailed test of significance because the directions of the effects were
predicted. The Z value required to produce a one-tailed significance level of 95 percent
is 1.645. Thus, for this study, the required sample size was calculated to be 551.
N = (1.645)² · [(.282 + .337) - (.282 - .337)²] / (.282 - .337)²
  = 2.706 · (.619 - .003025) / .003025
  = 2.706 · (.615975 / .003025)
  = 551
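As a check, the sample-size formula above can be reproduced in a few lines of Python. This is a sketch of the calculation only; the function name is not part of the original study.

```python
# Sample size for detecting a difference between two proportions,
# following the formula in the text: N = Z^2 * ((P1 + P2) - D^2) / D^2,
# where D = P1 - P2.

def sample_size(p1, p2, z):
    d = p1 - p2
    return z ** 2 * ((p1 + p2) - d ** 2) / d ** 2

# Rasinski's 1985 result: 28.2% for "assistance to Blacks" versus
# 33.7% for "improving conditions of Blacks"; one-tailed 95% -> z = 1.645
n = sample_size(0.282, 0.337, z=1.645)
print(round(n))  # 551
```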
The sample was drawn from a list of 1,400 recent purchasers of StatPac Gold IV -
Survey and Marketing Research Edition (Walonick, 1993), a computer software package
written by the author for the purpose of conducting and analyzing surveys. All
purchasers of the software are somehow involved in the process of survey design, data
collection, or analysis. Nearly all purchasers are businesses and roughly half are
marketing research companies. Foreign users (located outside the United States) were
eliminated from the list, leaving a total usable sample of 950 potential respondents.
From this sample, 551 subjects were randomly selected without replacement. Subjects
were assigned to one of the three sponsorship levels by selecting every third name from
the list of subjects. The survey was mailed on January 28, 1994. In this initial mailing,
184 of the surveys described the persuasion of the sponsor as conservative, 184 as liberal,
and 183 named no sponsor.
Two weeks after the initial mailing (February 11, 1994), a projection of the response rate
was made and additional surveys were mailed to the remaining 399 people in the sample.
This consisted of 133 surveys for each of the three sponsorship conditions. Thus, the
total for both mailings included 317 surveys citing a conservative sponsor, 317 citing a
liberal sponsor, and 316 with no defined sponsor.
The data collection phase of this project was completed three weeks after the second
mailing. A total of 953 questionnaires had been mailed. Out of those, 32 were returned
by the Post Office as undeliverable and the remaining 921 surveys were presumed to
have reached their intended destination.
Analysis
This study was concerned with exploring the relationship between researchers and the
survey instruments they create. However, if relativity is an issue in survey research, then
this researcher and this study are also called into question.
Researchers often think of the word "relationship" as if it possessed a single and precise
definition. There are, in fact, many precise and unique ways to define a relationship. To
the scientist, a relationship is defined in terms of a mathematical construct. Some
statistical tests establish the existence or nonexistence of a relationship (e.g., chi-square
and t-test between proportions). Others examine the degree to which variables co-vary
(e.g., gamma and correlation coefficients); and still others look at relationships in terms
of variability that can be explained (e.g., regression and canonical correlation).
In most studies, there are a variety of statistical techniques available to the researcher.
Depending on their choice of statistics, two researchers could draw different conclusions
from the same data. Furthermore, both researchers might be correct. The apparent
paradox can only be solved by examining the underlying assumptions associated with the
researcher's definitions of "relationship." If relativity is an issue in survey research, then
this research had to consider the possibility that the statistical methodology for this study
might also be biased. In an attempt to compensate for this potential source of error, this
study utilized a variety of statistical tests that might be used to answer the research
questions, rather than being limited to a single technique.
The following is a description of the statistical tests that were performed for the first two
hypotheses. Each test represents a different operational definition of the word
"related." While the text only discusses the first hypothesis (researchers' personal
opinions), the same testing procedure was applied to the second (persuasion of the
sponsor).
The various tests were chosen because they represent commonly used statistical
techniques. Some tests may be questionable to the reader because they blur the
distinction between ordinal and interval data. However, they are reported in this study
because they represent the kinds of analyses frequently found in the literature. For
example, Likert scales are often viewed as ordinal data and reported as counts and
percents. Other times, Likert scale items are summed and averaged to create subscales
and reported as means and standard deviations...an assumption of interval data. The
intent of this study is not to debate whether a Likert scale is ordinal or interval. The fact
that many studies report mean averages for Likert scale items was sufficient for its
inclusion in this study.
The astute reader might also notice that several of the statistical techniques are actually
designed to test for differences rather than measure the strength of a relationship directly
(e.g., a t-test). However, in the special case of a dichotomous dependent variable, a
significant difference is a relationship.
Liberal and Conservative t-Tests Between Proportions (Tests 1 and 2)
The first two tests view the relationship from a dichotomous perspective...either the
relationship exists or it doesn't. The relationship between a respondent's opinion and
their choice of question wording was quantified by creating two nominal scales. One
nominal scale utilized a liberal definition of "relationship" and the other adopted a more
conservative measure. While both scales contain bias, it would be acceptable for a
researcher to adopt either scale before beginning the research. For example, a researcher
might argue that a very conservative definition of "relationship" had been adopted in
order to be more assured that the results would not suggest a relationship when there
actually was none. Another researcher might adopt a liberal definition of "relationship"
during an initial exploratory research effort to determine if there were any evidence of a
phenomenon.
"Relationship" refers to the apparent harmony (or consistency) between a respondent's
personal opinion on an issue and their question wording choice. If a positive relationship
exists, then a respondent who believes too much is being spent on crime would be more
likely to select the question wording that favors more "too much is spent" responses.
Similarly, a person who believes too little is being spent on crime would be more likely
to select the question wording that favors "too little is spent." If a negative relationship
exists, a person who believes too much is being spent on crime would be more likely to
select the question wording that favors "too little is spent" responses and vice versa.
Both scales look at each pair of questions in terms of whether or not a respondent's
personal opinion was (or was not) consistent with their question wording choice. In
other words, for any given pair of questions, the relationship was viewed as
dichotomous... either it existed or it did not. If a respondent's personal opinion was
consistent with their question wording choice, the contribution to the scale was one and if
it was inconsistent, the contribution was zero.
The difference between the liberal and conservative dichotomous indicators is in how
they interpret "the right amount" response. The liberal indicator views "the right
amount" response as evidence of a relationship, while the conservative scale interprets it
as no relationship. Table 4 reveals the construction of the scales for the liberal and
conservative dichotomous indicators.
Table 4
Construction of the Liberal and Conservative Dichotomous Scales
Question Wording That         Respondent's     Liberal        Conservative
The Respondent Selected       Opinion          Dichotomous    Dichotomous
                                               Indicator      Indicator
___________________________________________________________________________
favors "too much is spent"    too much             +1             +1
favors "too much is spent"    right amount         +1              0
favors "too much is spent"    too little            0              0
favors "too little is spent"  too much              0              0
favors "too little is spent"  right amount         +1              0
favors "too little is spent"  too little           +1             +1
Using the liberal definition of "relationship", we expected to see a relationship in 66.7
percent (4/6) of the question/opinion pairs, even if there was actually no true relationship.
Using the conservative definition of "relationship", we expected to see a relationship in
33.3 percent (2/6) of the question/opinion pairs. If there were positive or negative
relationships between researchers' opinions and their question wording choices, we
would see deviations away from the values expected by chance.
A t-test between proportions was used to compare the "percent of question/opinion
comparisons that showed a relationship" with the "percent that would be expected by
chance." If the t-statistic was significant, the null hypothesis was rejected and it was
concluded that a relationship did exist between the two variables. The two tests are
reported as the "Liberal t-Test Between Proportions" and the "Conservative t-Test
Between Proportions."
One-Way Classification Chi-Square Test (Test 3)
Construction of an unbiased (or neutral) scale overcame the problem of how to classify a
response of "the right amount." Instead of defining "relationship" as a dichotomous
variable (yes or no), the neutral scale viewed the relationship as a matter of degree,
varying from minus one to plus one. "The right amount" was considered a neutral
response and assigned a value of zero. Table 5 shows the construction of the scale.
Table 5
Construction of the Unbiased (Neutral) Scale
Question Wording That         Respondent's     Unbiased
The Respondent Selected       Opinion          Indicator
________________________________________________________
favors "too much is spent"    too much            +1
favors "too much is spent"    right amount         0
favors "too much is spent"    too little          -1
favors "too little is spent"  too much            -1
favors "too little is spent"  right amount         0
favors "too little is spent"  too little          +1
For each question/opinion comparison, if a respondent's personal opinion was consistent
with their question wording choice, it was classified as "a positive relationship" and the
value for that comparison was plus one. If a respondent answered "the right amount"
(regardless of their question wording choice), it was interpreted as "no relationship" and
assigned a value of zero. Inconsistent responses were assigned a value of minus one.
Thus, the scale was similar to that of a correlation coefficient, ranging from minus one to
plus one. If only chance were operating, we would predict that the scores would be
equally distributed over this range and the mean average would be zero.
Test number three involved a one-way classification chi-square test to determine if the
observed frequencies were significantly different than would be expected by chance.
Since there are three possible scores for each question/opinion comparison (i.e., -1, 0,
and +1), the expected frequency for each cell was 33.3 percent of the sample. A
significant chi-square statistic would indicate that the observed frequencies were different
than would be expected by chance and the null hypothesis would be rejected. This test is
reported as the "One-Way Classification Chi-Square Test."
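The one-way classification chi-square computation can be sketched as follows; the counts of -1, 0, and +1 scores are hypothetical.

```python
# One-way classification chi-square on the neutral (-1, 0, +1) scale.
# Under chance alone, each of the three scores is expected in one third
# of the question/opinion comparisons.

def one_way_chi_square(observed):
    n = sum(observed)
    expected = n / len(observed)  # equal expected frequency per cell
    return sum((o - expected) ** 2 / expected for o in observed)

observed = [40, 35, 105]               # hypothetical counts of -1, 0, +1
stat = one_way_chi_square(observed)
# critical value for df = 2 at the .05 level is 5.991
print(round(stat, 1), stat > 5.991)    # 50.8 True -> reject the null
```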
Two-Way Classification Chi-Square Test (Test 4)
The fourth test involved a two-way (2 x 3) contingency table analysis using the chi-
square statistic and gamma. A significant chi-square statistic indicated that the observed
frequencies were significantly different than the expected frequencies and the null
hypothesis was rejected. Gamma is interpreted like a correlation coefficient and thus
provided an easily understood measure of the strength of a relationship. This test is
reported as the "Two-Way Classification Chi-Square Test." Table 6 reveals the
construction of the 2 x 3 contingency table.
Table 6
Construction of the 2 x 3 Contingency Table for the Two-Way Classification Chi-Square Test
                                          Personal Opinion
                              _____________________________________________
Question Wording Choice       Too Little   The Right Amount   Too Much
___________________________________________________________________________
Favors "too little is spent"     ---             ---             ---
Favors "too much is spent"       ---             ---             ---
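The gamma statistic that accompanies the two-way chi-square test can be sketched from such a 2 x 3 table as follows; the cell counts are hypothetical, and the concordant/discordant-pair formula is the standard Goodman-Kruskal definition, which the text does not spell out.

```python
# Goodman-Kruskal gamma for a 2 x 3 contingency table (cf. Table 6).
# Rows: wording favoring "too little" vs. "too much"; columns ordered
# too little / the right amount / too much. Counts are hypothetical.

def gamma(table):
    concordant = discordant = 0
    rows, cols = len(table), len(table[0])
    for i in range(rows):
        for j in range(cols):
            for k in range(i + 1, rows):       # cells in lower rows
                for l in range(cols):
                    if l > j:
                        concordant += table[i][j] * table[k][l]
                    elif l < j:
                        discordant += table[i][j] * table[k][l]
    return (concordant - discordant) / (concordant + discordant)

table = [[60, 25, 15],   # favors "too little is spent"
         [10, 20, 50]]   # favors "too much is spent"
print(round(gamma(table), 3))  # interpreted like a correlation coefficient
```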
t-Test Between Means (Test 5)
If a researcher assumes that a Likert scale is interval data, then a comparison of mean
averages is an acceptable method of hypothesis testing. A t-test between means was used
to compare the mean personal opinion for those who chose the question wording that
favors "too little is spent", with the mean personal opinion for those who chose the
question wording that favors "too much is spent." If the t-statistic was significant, we
concluded that there was a difference between the means, which implied the existence of
a relationship. This test is reported as the "t-Test Between Means."
Logistic Regression (Tests 6, 7, 8 and 9)
Logistic regression provides a multivariate nonlinear model appropriate for a
dichotomous dependent variable. The null hypothesis implies that knowing the personal
opinions of researchers would not improve our ability to predict which issue label they
would choose. A total of four tests were based on the logistic regression model. For all
four, the dependent variable was the respondents' choice of question wording and the
independent variable was the respondent's personal opinion. Two of the models viewed
the independent variable as ordinal and the other two as interval.
In logistic regression, the chi-square statistic reveals whether there is a significant
relationship between the combined effect of the independent variable(s) and the
dependent variable. Thus, its interpretation is similar to the F-ratio in standard multiple
regression. Unlike standard multiple regression, however, logistic regression lacks an
intuitive measure of how well the regression model actually improves our ability to
predict the dependent variable. In standard multiple regression, the r-squared statistic is
the proportion of variability in the dependent variable that can be explained by the
independent variables. Traditional logistic regression has no similar intuitive statistic.
When there are a large number of cases, independent variables with very small effects
can meet the requirements for significance. Researchers using logistic regression usually
report the chi-square statistic and its associated significance level, but this reveals little
about how much the regression model actually improves our ability to predict the
dependent variable. In fact, in the extreme condition where the sample is sufficiently
large and the effect is relatively small, it is possible for a logistic regression model to
show a significant chi-square statistic, even when there is no actual improvement in our
ability to predict the dependent variable. This apparent paradox can occur because the
logistic growth model is a nonlinear approximation of a dichotomous variable.
The value of a logistic regression model is found in its capacity to improve a researcher's
ability to predict the dependent variable. While it is important to look at the significance
of the model, it is perhaps even more important to look at the magnitude of the effect.
One way to understand the magnitude of effect in logistic regression is to ask the
questions:
1. How well can we predict the dependent variable in the absence of any
information about the independent variables?
2. How well can we predict the dependent variable with the knowledge of the
independent variables?
"How well" is readily quantified as the percent of our predictions that are correct. By
comparing the "percent correct" with and without knowledge of the independent
variables, we can get a quantitative measure of the magnitude of the effect.
The "percent correct" without knowledge of the independent variables is obtained from
the frequency distribution of the dependent variable. For example, if seventy-five
percent of the sample had a dependent variable equal to one, our best guess for the value
of the dependent variable in the absence of all other information would be one, and we
would be wrong twenty-five percent of the time. Thus, the "percent correct" would be
seventy-five percent.
The "percent correct" with knowledge of the independent variables is obtained from the
regression model. For each case, the logistic function yields the probability that the
dependent variable is equal to one. If the probability for a case is .5 or higher, our best
prediction for that case is that the dependent variable is equal to one. If the probability is
less than .5, we would predict that the dependent variable is equal to zero. By comparing
our predictions with the actual values for the dependent variable, we can calculate the
"percent correct" with knowledge of the independent variables.
The magnitude of effect is the difference between the two predictions (with and without
knowledge of the independent variables). A t-test between proportions was used to
determine if the difference is statistically significant. This method was deemed superior
to the chi-square statistic because it provided direct information regarding the magnitude
of the effect, rather than the significance level of the equation. The results relate to our
improved ability to predict the dependent variable and not simply to a mathematical
construct.
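The percent-correct comparison described above can be sketched as follows, assuming the fitted probabilities have already been obtained from a logistic regression model; the values here are hypothetical.

```python
# Magnitude of effect in logistic regression: percent of correct
# predictions with vs. without knowledge of the independent variables.

y     = [1, 1, 1, 0, 1, 0, 1, 0, 1, 1]                      # observed DV
probs = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.55, 0.45, 0.8, 0.35]  # fitted P(y=1)

# Without the IVs: always predict the modal value of the dependent variable.
modal = max(set(y), key=y.count)
baseline = sum(v == modal for v in y) / len(y)

# With the IVs: predict 1 when the fitted probability is .5 or higher.
preds = [1 if p >= 0.5 else 0 for p in probs]
model = sum(p == v for p, v in zip(preds, y)) / len(y)

# The magnitude of effect is the improvement in "percent correct."
print(baseline, model, round(model - baseline, 2))
```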
Tests six and seven viewed researchers' personal opinions as an ordinal scale. The
"researchers' opinion" variable was converted to dummy variables and these became the
independent variables for the regression model. Since there was substantial nonresponse,
all three dummy variables could be included in the regression model without creating a
collinearity problem: "no response" served as the reference category and was excluded
from the model. Thus, both models encompass the idea that "no personal opinion" may also
be predictive of question wording choice. The two tests are reported as "Logistic
Regression with Dummy IV's (Chi-Square)" and "Logistic Regression with Dummy IV's
(t-Test)."
The eighth and ninth tests were the same regression models except that the independent
variable was viewed as an interval scale and therefore, it could be used directly in the
regression without creating dummy variables. If a respondent did not give a personal
opinion to an item, that data pair was excluded from the analyses. They are reported as
"Logistic Regression with an Interval IV (Chi-Square)" and "Logistic Regression with an
Interval IV (t-Test)."
Table 7 contains a summary of the nine tests that were used to examine the first two
hypotheses. All nine were performed for each of the first two hypotheses. Each test
used a different mathematical construct to define the "relationship" between two
variables.
Table 7
Summary of Nine Statistical Tests Used for Testing the First Two Hypotheses
1. Liberal t-Test Between Proportions
2. Conservative t-Test Between Proportions
3. One-Way Classification Chi-Square Test
4. Two-Way Classification Chi-Square Test
5. t-Test Between Means
6. Logistic Regression with Dummy IV's (Chi-Square Test)
7. Logistic Regression with Dummy IV's (t-Test)
8. Logistic Regression with an Interval IV (Chi-Square Test)
9. Logistic Regression with an Interval IV (t-Test)
The first three statistical tests were not appropriate for evaluating the third hypothesis
because there is no way to define what constitutes a positive or negative relationship
between researchers' question wording choices and their self-assessed experience.
However, the other six tests were used to evaluate the third hypothesis.
Multivariate Models
The multivariate model asks whether the combined knowledge of the independent
variables (researchers' opinions, persuasion of the sponsor, and self-assessed experience)
improves our ability to predict the dependent variable (question wording choice).
Logistic regression was used to explore this question. As in the previous logistic
regression models, both the chi-square test and the t-test between proportions were used
to evaluate the model.
The four multivariate tests parallel tests 6, 7, 8, and 9. The only difference is that all
three independent variables were simultaneously included in the model. These tests are
referred to as:
Multivariate Logistic Regression with Dummy IV's (Chi-Square Test)
Multivariate Logistic Regression with Dummy IV's (t-Test)
Multivariate Logistic Regression with Interval IV's (Chi-Square Test)
Multivariate Logistic Regression with Interval IV's (t-Test)
The fourth research question was whether there were differences between beginning and
advanced researchers with respect to the degree to which they incorporate their personal
opinions and those of the sponsor into their choice of question wording. Two sets of
tests were conducted, one to examine the personal opinions of the researchers and the
other to look at the persuasion of the study sponsor.
The liberal, conservative, and unbiased measures of "relationship" were compared for the
different levels of self-assessed experience, using a two-way classification chi-square
test. A significant chi-square statistic would indicate that there was a relationship
between self-assessed experience and the degree to which researchers incorporated their
personal opinions or those of the sponsor. Additionally, the unbiased or neutral scale
was used for two additional tests. The first was a t-test between means, comparing
beginning with advanced researchers. The second was a one-way ANOVA, where the
dependent variable was the unbiased scale and the factor variable was the self-assessed
experience level. These five tests are referred to as:
Two-Way Classif. Chi-Square Test - Liberal Definition
Two-Way Classif. Chi-Square Test - Conservative Definition
Two-Way Classif. Chi-Square Test - Unbiased Definition
t-Test between Means - Unbiased Definition
One-Way ANOVA - Unbiased Definition
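The t-test between means on the unbiased scale can be sketched as follows. The pooled-variance form of the statistic and the scores themselves are assumptions for illustration; the study does not specify which variant of the t-test was used.

```python
import math

# t-test between means on the unbiased (-1 / 0 / +1) scale, comparing
# beginning with advanced researchers. Scores are hypothetical.

def t_between_means(a, b):
    """Two-sample t-statistic with pooled variance."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled * (1 / na + 1 / nb))

beginners = [1, 0, 1, -1, 0, 1, 1, 0]
advanced  = [1, 1, 0, 1, 1, -1, 1, 1]
print(round(t_between_means(beginners, advanced), 3))
```

A one-way ANOVA over more than two experience levels generalizes the same comparison of group means on the unbiased scale.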
Analyses of the Hypotheses Broken Down By Question Wording Pairs
All of the previous analyses were conducted on all issues combined. Each respondent
contributed six records to the final model (one for each issue). The same analyses were
performed for each of the six question wording pairs individually. This was done to
investigate whether the magnitude of the relationship was dependent on the particular
question wording pair.
Methodological Limitations
One shortcoming of all mail-surveys is the possibility of a low response rate. For this
study, it was determined that 551 completed surveys would be required to detect a 5.5
percent difference as significant at the 95 percent confidence level using a one-tailed test.
There is considerable information regarding response rates to mail surveys for the
general public. However, virtually no information exists regarding response rates of the
research community itself. Therefore, this study replaced nonresponders with new
subjects. While this technique had the potential of adding bias to the results, it provided
a sufficient number of respondents for performing the desired analyses.
Another possible shortcoming of this study was that researchers might consistently select
one form of question wording, regardless of the other variables. While this was not
likely to occur for all six issue labels, it could have necessitated eliminating one or more
of the issue label pairs from the analyses, since the data would not provide
discriminating information. A similar (remote) possibility existed for the researcher's
experience variable.
Another possible shortcoming of this study was that researchers might already be
familiar with the results of Rasinski's question wording experiments. This might have
had the effect of influencing their choice of issue labels, but it is not clear what the
direction or magnitude of this bias would be. One might assume that more experienced
researchers were more likely to be familiar with Rasinski's findings, but again, there is no
basis for predicting what this bias might be.
A final limitation of this study was that the sample may not be representative of all
survey researchers. The universe of survey researchers is diverse and difficult to
identify. The sample selection technique that was used in this study taps only a small
segment of the population. Since the quality of the sample could not be determined, this
study does not make inferential statements about the population.
Validity and Reliability
It is difficult to discuss the validity of a survey instrument designed to investigate the
validity of surveys in general. Validity refers to the accuracy or truthfulness of a
measurement. Are we measuring what we think we are? "Validity itself is a simple
concept, but the determination of the validity of a measure is elusive" (Spector, 1981, p.
14).
Most researchers would not knowingly administer a survey with the intention of
distorting the results. Face validity is an issue of integrity. The determination of face
validity is based on the subjective opinion of the researcher. Unless a researcher
knowingly uses faulty procedures or instruments, a study will have face validity. This
study had face validity. This researcher scrutinized and modified the experimental
design and was satisfied that it would accurately measure the desired constructs.
This study also had content validity. A thorough literature review was performed and
this researcher was convinced that the questionnaire would adequately test the four
hypotheses. In addition, other researchers were asked to review the experimental
design and questionnaire before the study design was finalized.
This study attempted to establish concurrent (criterion-related) validity. Logistic
regression and discriminant function analysis produce a mathematical model, where the
dependent variable can be predicted from one or more independent variables, at the same
point in time.
It is less clear whether or not this study has construct validity. Taken at face value, the
theoretical foundations of this study are sound. However, it is important to keep in mind
that researchers were asked to design a hypothetical survey with a hypothetical sponsor.
There is no way to be assured that this was representative of researchers operating in the
"real world." However, the vignette has been shown to be an effective technique that can
increase both validity and reliability in opinion surveys (Alexander and Becker, 1978).
In addition, this study imposed the constraint that the researcher had to choose between
two alternative issue labels. In a real survey, researchers would be free to create their
own question wording; they would not be limited to the issue labels presented in this
study. Nevertheless, this study was exploratory in nature and the results were interpreted
as such.
Rasinski (1989) reported that the six pairs of issue labels produced consistent and stable
differences in people's responses from 1984 to 1986. The downturn in the United States
economy has created a more conservative spending attitude and it would not be
surprising to learn that the mood of the public regarding government spending has
changed since 1986. However, no attempt was made to compare the findings of this
study to Rasinski's. This study used Rasinski's issue labels because they have
demonstrated reliability in their ability to produce different responses. In other words,
this study was concerned with relative response patterns, not absolute values.
The questionnaire for this study covered six different social issues; however, the actual
issues were not important for this study. Each pair of issue labels provided a
measurement of the phenomenon being studied, thus resulting in six measurements for
each respondent. While this was not the same as equivalent-form reliability, it did allow
for repeated measures within individuals, thereby adding to our confidence in the
findings. Cronbach's alpha was used as a measure of the tendency of respondents to
favor one form of question wording.
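Cronbach's alpha can be computed from the item variances and the variance of respondents' total scores. The sketch below uses hypothetical data in which each column is one of the six issues and a 1 indicates that the respondent chose the "too much is spent" wording.

```python
# Cronbach's alpha: k/(k-1) * (1 - sum(item variances) / total variance).

def cronbach_alpha(items):
    """items: one list per questionnaire item, aligned by respondent."""
    k = len(items)
    n = len(items[0])
    item_vars = [sum((x - sum(col) / n) ** 2 for x in col) / (n - 1)
                 for col in items]
    totals = [sum(col[i] for col in items) for i in range(n)]
    mean_total = sum(totals) / n
    total_var = sum((t - mean_total) ** 2 for t in totals) / (n - 1)
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

rows = [  # one row per respondent, one column per issue (hypothetical)
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 0],
]
items = list(map(list, zip(*rows)))  # transpose to per-item columns
print(round(cronbach_alpha(items), 3))  # high alpha -> consistent choices
```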
Procedures and Timetable
This study began immediately following the approval of the research proposal and
approval of the Request for Exemption from Committee Review of Research Involving
Human Subjects.
Selection of the sample took approximately one week. Preparing envelopes and
questionnaires required an additional week. The mailing of the questionnaires began
approximately two weeks after the study began. Three weeks after the initial mailing,
the response rate was calculated and an additional mailing was made to all the remaining
potential subjects.
The analyses and final report were completed approximately three months after the first
mailing.
CHAPTER IV
Results
Response and Non-Response
Response rate was tracked on a daily basis to determine whether or not there was a
difference in response based upon which of the three forms the respondent received. The
total number of returns was similar for the three different forms and it appears that the
sponsorship variable did not influence response rate to the questionnaire.
A total of 953 surveys were mailed. Thirty-two were returned by the post office as
undeliverable and 921 are presumed to have reached their intended destination. The final
number of usable surveys was 361; thus, the overall response rate was 39.2 percent, a
fairly high usable response for mail surveys without follow-up. Table 8 shows the final
returns and response rates for each of the three forms of the survey.
Table 8
Response Rate Information Broken Down by the Sponsorship Variable
                                            Not                   Usable Response
                                 Mailed  Deliverable   Returns        Rate
_________________________________________________________________________________
Overall (all surveys combined)     953       32          361         39.2%
Conservative sponsor               318        9          119         38.5%
Liberal sponsor                    318       13          116         38.0%
No sponsor identified              317       10          126         41.0%
Many respondents failed to answer all the items on the questionnaire. Omitted items
were frequently accompanied by text explaining why they did not choose either form of
question wording and some even suggested a third wording alternative. Table 9 shows
the completion rate for each of the items on the questionnaire. The percentages are based
upon the total number of usable returns (N=361).
Table 9
Number of Valid Responses and Completion Rates for All Items on the Survey
Researcher's self-assessed experience 328 (90.9%)
                    Made a Valid Question      Made a Valid Personal
Issue               Wording Selection          Opinion Selection
Crime 359 (99.4%) 330 (91.4%)
Drugs 352 (97.5%) 321 (88.9%)
Welfare 352 (97.5%) 325 (90.0%)
Cities 357 (98.9%) 326 (90.3%)
Blacks 344 (95.3%) 308 (85.3%)
Social Security 358 (99.2%) 324 (89.8%)
Nine percent of those who returned the survey did not specify their survey design
experience level. This was the first item on the survey and presumably not likely to be
overlooked. Pretesting had not shown a problem and a review of respondents' comments
did not reveal any reasons why they may have omitted the item. We suspect that many
respondents inadvertently left this item blank, possibly because it was not numbered.
Future studies might number the item, or highlight it in some other way, in order to
improve the response rate.
Twenty-three respondents (6.4%) answered most of the six question wording choices, but
left all six of the personal opinion questions blank. A few respondents provided an
explanation why they had omitted the personal opinion questions; however, in most
cases, they were just left blank. The most plausible explanation is that these people
simply did not realize that they were supposed to answer this group of questions.
Although the instructions were printed in bold, they may have interpreted the questions
as part of the questionnaire that they would be sending to the public. We suspect that this
reflects a minor design flaw in the survey instrument. It is not surprising that the
deficiency did not surface in pretesting, since it appeared in such a small proportion of
the surveys.
Respondents' Comments
The cover letter that accompanied the questionnaire asked respondents to write
comments anywhere on the questionnaire. This untraditional approach to the solicitation
of comments proved to be extremely effective. Over a quarter (27.7%) of all the
returned questionnaires contained one or more comments. Counting only those
respondents that made at least one comment, the average number of comments per
questionnaire was 2.4. While it was not the original intent of this study to summarize the
comments, they are presented here because they reveal much about this survey and the
survey design process in general.
Reaction to the survey was mixed. One respondent wrote, "I've been designing surveys
for about 10 years and this is one of the strangest survey's I've ever seen or completed."
Another commented that it was a "very clever study." A few respondents seemed
suspicious of the survey. "Who's paying for this-a government agency? It sure doesn't
sound like a 'question wording survey.'" Fifteen respondents specifically requested a
copy of the results.
A couple of respondents found the survey offensive. One, who refused to complete
the questionnaire, described how tired he/she was of "researcher bashing."
Because I have integrity and am tired of having my profession bashed for "misuse" all the time, I would turn down any offers and rescind any bids. None of the questions are bias free or clear enough to know what the respondent would be answering.
Other respondents stressed the importance of objectivity in survey research. In one case,
a respondent refused to answer the personal opinion questions because "that in itself
would not be objective." It was clear that objectivity was important to many
respondents. Some even expressed open contempt for researchers who would allow their
professional judgment to be influenced by the persuasion of the sponsor or the potential
for additional work.
Your survey is testing the hypothesis that designers are going to worry more about their next design job than a good instrument. These people lack the character needed to maintain credibility to compete for design jobs.
This is nonsense. I attempt to do the most objective job regardless of the client's motivations.
Why do you care if I'm liberal or conservative or whatever? Being objective has to be the first priority-regardless of future research opportunities.
Several respondents wrote flowing narratives describing the methods they use to
overcome the problem of question wording bias. The common thread in these responses
was the use of a panel or focus group to develop the most appropriate question wording.
Each of the alternative casts a different light on the issues. The alternatives give different "set-up" and provoke different emotions. None are any more or less objective than the rest. I find the "truth" is seldom acquired by a single question. Truth is always subject to context. Therefore, I generally probe an issue with several slightly different questions. In other words, I ask the question 3 or 4 different ways and then look at the sum of the results. Actually, if I don't know the right wording to use, I usually do a couple focus groups first. The focus groups usually show which wording to use.
We develop most of our questions by asking a panel of individuals to suggest and cross check the questions for neutrality, clarity, appropriate language level, bias, etc. This is often accomplished by soliciting the opinion of extreme and opposite ideologies (i.e. an adaptation of Likert techniques) and then pre-testing the instrument ... No survey instrument is perfect and I've made my mistakes, but all too many researchers seek the answers desired by the customer. We always tell our customers that we will try our best to obtain objective answers, although the results may not be what is desired.
Only two respondents made statements to suggest that they might knowingly bias the
results, although one (or both) may have been joking.
Being objective depends on how badly I need the job.
Of course, the situation is hypothetical. It's likely I would not get myself in the position of working for such a group. But if for some reason I did, I would word the questions so they got the results they want. Hey, I'm a professional, Ya Know?
Eight respondents indicated that they needed to know whether the "we" in "Are we
spending..." referred to the federal, state, or local government. In order to correct this
ambiguity, future researchers might consider changing the wording to "Is the federal
government spending...".
The most frequent comment was a short general statement indicating that the respondent
had a problem with one or both of the question wording choices. Some comments were
directed at the lack of specificity in the questions. These included words like "vague",
"broad", "unclear", and "ambiguous." Other comments addressed the issue of bias
directly and contained phrases like "loaded words", "opinionated", "leading questions",
"negative connotations", "limiting", and "biased." A large number of respondents
pointed out that the questions were addressing two distinct and separate issues.
Occasionally, respondents refused to answer an item and stated that neither choice was
acceptable. They often suggested a third alternative wording.
A substantial number of respondents wrote in "DK" or "don't know" for some of the
personal opinion questions. Others circled "the right amount" and added a qualifier
pointing out that "the right amount" is not synonymous with appropriate spending, or
they added some other qualifying statement.
The right amount, but misdirected.
This was hard to do. Felt that more information should be provided to obtain accurate perceptions. I really feel that the money is not well spent therefore, one dollar spent is too much!
Crime
Examples of alternative wording suggested by respondents include:
...protecting law-abiding citizens.
...getting and keeping criminals off the street.
...reducing crime.
Two respondents correctly pointed out that crime rate is not rising in many areas.
Another stated that, "You have assumed that spending has a direct effect upon crime rate.
I doubt that it does." Another respondent noted that "law enforcement is only one
method for dealing with crime, and then, after the fact."
Drugs
Examples of alternative wording suggested by respondents include:
...the drug epidemic.
...decreasing the illegal use of drugs.
...the social problems of medicine and chemical misuse.
Several respondents commented that alcohol abuse was also part of the drug problem.
Others focused on the difference between rehabilitation and prevention. "Rehabilitation
is one after-the-fact method. The choice I circled could include prevention." "We need
more education about the hazards of drug and alcohol abuse." Surprisingly, no
comments were made about the emotional connotations of the word "addiction."
Welfare
Examples of alternative wording suggested by respondents include:
...public assistance programs.
...ending poverty.
...helping the homeless.
Many respondents commented on this issue. A large number mentioned that the salient
issue is not how much money, but rather, how the money is used. Typical responses
were that we are "spending the right amount but in the wrong places", or "the right
amount if how the money is used is restructured." One respondent wrote that we "need
to revamp the whole system." Others focused on more specific solutions. "It's not
just a spending issue. What I think 'the poor' need is more quality education which may
or may not require more dollars."
Several respondents commented that "welfare" had severe negative connotations. One
respondent exhibited an almost knee-jerk reaction to the word "welfare" and wrote, "Why
spending? Perhaps pulling, pushing or kicking would be more effective." A few
respondents pointed out that "welfare is only one type of assistance to the poor. What
about inner city job programs, etc."
Cities
Examples of alternative wording suggested by respondents include:
...urban revitalization.
...urban decay.
...solving THE problems of OUR big cities.
There were fewer comments for this question compared to the other issues. One
respondent pointed out that "assistance is different than 'solving' or 'improving'", and then
went on to suggest that multiple questions were needed to explore the issue. Another
respondent, who didn't like either choice, wrote that "the emphasis should be on
partnership, research, and discovering what works". Another stated that both questions
were "assuming that the wrong medicine will cure the patient."
Blacks
Examples of alternative wording suggested by respondents include:
...helping economically and socially disadvantaged persons.
...affirmative actions for all minorities.
...programs targeted to the African American community.
...improving THE ECONOMIC conditions of AFRICAN AMERICANS.
...helping minorities.
...helping disadvantaged people improve their situation.
This question generated more comments than the other issues. About twenty-five
respondents pointed out that one should "never use the term 'Blacks' in today's race
conscious society." Respondents most often referred to this as "politically incorrect", but
others used words such as "offensive", "racist", "discriminatory", "emotional trigger",
"condescending", "paternalistic", and "patronizing." Most suggested using "African
Americans" instead of "Blacks", although a few respondents suggested "minority groups"
and "people of color."
Social Security
No respondents suggested alternative wording for this question, although there were a
few comments stating that the question wording was not sufficiently clear. One
respondent wrote, "Do you mean Social Security payments? Social Security deductions
(FICA)? Too vague. Don't understand this question." Another respondent asked, "Do
you want to measure what's spent on the program, or what's spent by special interest,
lobbyist, etc., to keep it?" One respondent pointed out that the word "protecting" implied
that Social Security is in jeopardy.
Summary of Comments
A total of 239 comments were made by 100 respondents. Table 10 shows the
approximate number of comments made for each of the issues, as well as the percent of
respondents who made a valid question wording selection. As we might expect, there
appears to be a negative relationship between the number of comments and the response
percent: as the response percent goes down, the number of comments goes up.
Table 10
Number of Comments and Response Percent Ranked by Frequency of Mention
Issue Number of Comments Response Percent
General (nonspecific) 64 Not appropriate
Blacks 42 95.3%
Welfare 38 97.5%
Crime 29 99.4%
Drugs 27 97.5%
Cities 23 98.9%
Social Security 16 99.2%
Null Hypothesis Testing
The first two null hypotheses were tested using nine different operational definitions of
"relationship." These are referred to as:
1. Liberal t-Test between Proportions
2. Conservative t-Test between Proportions
3. One-Way Classification Chi-Square Test
4. Two-Way Classification Chi-Square Test
5. t-Test between Means
6. Logistic Regression with Dummy IV's (Chi-Square Test)
7. Logistic Regression with Dummy IV's (t-Test)
8. Logistic Regression with an Interval IV (Chi-Square Test)
9. Logistic Regression with an Interval IV (t-Test)
Null Hypothesis 1: Researchers' choices for question wording are not related to their
personal opinions.
Using the "Liberal t-Test between Proportions", we observed a relationship in 71.6
percent of the question pairs. Table 11 shows that the t-statistic was significant,
t(1932)=1.834, p=.033, thus we reject the null hypothesis and conclude that there was a
significant relationship between researchers' personal opinions and their question
wording choices.
Table 11
Liberal t-Test between Proportions

Total number of comparisons = 1934     N    Percent    t     df     p
Observed relationships               1385    71.6    1.834  1932  .033
Relationships expected by chance     1289    66.7
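The dissertation does not spell out the formula behind its t-tests between proportions, so the sketch below uses a standard one-sample test of an observed proportion against a chance baseline; the resulting statistic and degrees of freedom will not necessarily match the values in Table 11.

```python
import math

def prop_test_statistic(successes, n, p0):
    """Standard one-sample test of a proportion against a chance baseline p0.
    (The dissertation's exact formulation is unstated; this form is an assumption.)"""
    p_hat = successes / n
    se = math.sqrt(p0 * (1.0 - p0) / n)
    return (p_hat - p0) / se

# Observed relationships from Table 11: 1385 of 1934 pairs, vs. 2/3 by chance
z = prop_test_statistic(1385, 1934, 2.0 / 3.0)
```

A positive statistic here means more relationships were observed than chance alone would produce.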
The "Conservative t-Test between Proportions" showed a relationship in 41.0 percent of
the question pairs. Table 12 reveals that the t-test between proportions was significant,
t(1932)=3.926, p<.001. Therefore, we reject the null hypothesis and conclude that a
significant relationship did exist between researchers' personal opinions and their
question wording choices.
Table 12
Conservative t-Test between Proportions

Total number of comparisons = 1934     N    Percent    t     df     p
Observed relationships                792    41.0    3.926  1932  .000
Relationships expected by chance      645    33.3
The "One-Way Classification Chi-Square Test" using the unbiased or neutral measure of
the relationship revealed a strong significant difference between the observed distribution
and one that would be expected by chance if there were no true relationship,
χ²(2)=52.006, p<.001. Therefore, we reject the null hypothesis and conclude that a
relationship did exist between researchers' personal opinions and their question wording
choices. Table 13 shows the counts and percents for the one-way contingency table.
Table 13
One-Way Classification Chi-Square Test
  Positive         No            Negative
Relationship   Relationship   Relationship     χ²     df     p
 792 (41.0%)    593 (30.7%)    549 (28.4%)   52.006    2    .000
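The one-way test compares the observed counts against an equal three-way split expected by chance. A minimal goodness-of-fit computation (plain Python; the dissertation's software is not identified) recovers the reported statistic to within rounding:

```python
def one_way_chi_square(observed):
    """Goodness-of-fit chi-square against an equal split across categories."""
    n = sum(observed)
    expected = n / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

# Counts from Table 13: positive, no, and negative relationships
stat = one_way_chi_square([792, 593, 549])  # ~52.0, close to the reported 52.006
```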
The "Two-Way Classification Chi-Square Test" revealed a significant relationship
between researchers' personal opinions and their question wording choices,
χ²(2)=61.178, p<.001, thus we reject the null hypothesis. Gamma indicated a low to
moderate strength relationship (γ=.287). Therefore, about eight percent of the variability
in question wording choices can be explained by knowing the personal opinions of
researchers. Table 14 shows the counts and row percents for the two-way contingency
table.
Table 14
Two-Way Classification Chi-Square Test
N=1934                                Personal Opinion
Question Wording Choice       Too Little    Right Amount    Too Much
Favors "too little spent"    522 (55.9%)    246 (26.3%)   166 (17.8%)
Favors "too much is spent"   383 (38.3%)    347 (34.7%)   270 (27.0%)
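Both statistics reported for Table 14, the Pearson chi-square and gamma, follow from the cell counts alone; the degrees of freedom for a 2x3 table are (2-1)(3-1)=2, matching χ²(2). This sketch (plain Python; a standard textbook computation, since the dissertation's software is not identified) reproduces both values to within rounding:

```python
def chi_square_table(table):
    """Pearson chi-square for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            stat += (obs - exp) ** 2 / exp
    return stat

def goodman_kruskal_gamma(table):
    """Gamma from concordant and discordant pairs in an ordered table."""
    conc = disc = 0
    rows, cols = len(table), len(table[0])
    for i in range(rows):
        for j in range(cols):
            for i2 in range(i + 1, rows):
                for j2 in range(cols):
                    if j2 > j:
                        conc += table[i][j] * table[i2][j2]
                    elif j2 < j:
                        disc += table[i][j] * table[i2][j2]
    return (conc - disc) / (conc + disc)

# Counts from Table 14 (rows: wording choice; columns: personal opinion)
table = [[522, 246, 166], [383, 347, 270]]
stat = chi_square_table(table)          # ~61.17, close to the reported 61.178
gamma = goodman_kruskal_gamma(table)    # ~0.287, matching the reported value
```

Squaring gamma (.287² ≈ .08) gives the "about eight percent of the variability" figure cited in the text.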
The "t-Test between Means" revealed a strong significant difference in personal opinions
between those who chose the question wording that favors "too little is spent" and those
who chose the wording that favors "too much is spent", t(1932)=7.501, p<.001.
Therefore, we reject the null hypothesis and conclude that there was a significant
relationship between researchers' personal opinions and their question wording choices.
Table 15 reveals that regardless of which question wording choice respondents chose,
their personal opinions tended to fall somewhere between "too little" and "the right
amount" is spent (i.e., between zero and one). A mean of zero would indicate that all
respondents felt that "too little" is spent, while a mean of one would indicate they thought
"the right amount" was being spent.
Table 15
t-Test between Means

Question wording choice          N     Mean    SD      t     df     p
Favors "too little is spent"    934    .619   .769   7.501  1932  .000
Favors "too much is spent"     1000    .887   .801
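Because Table 15 reports group sizes, means, and standard deviations, the pooled two-sample t-statistic can be recomputed from those summaries alone; the small discrepancy from the reported 7.501 reflects rounding in the published means and standard deviations.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t with pooled variance, computed from summary statistics."""
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    return (mean2 - mean1) / se, df

# Summary statistics from Table 15
t, df = pooled_t(.619, .769, 934, .887, .801, 1000)  # t ~ 7.50, df = 1932
```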
The "Logistic Regression with Dummy IV's" models produced a highly significant chi-square statistic, χ²(3)=67.732, p<.001. Thus, we conclude that a significant relationship
existed between researchers' personal opinions and their question wording choices. The
t-statistic was also significant, thereby indicating that knowing researchers' personal
opinions improves our ability to predict which question wording choice a researcher
would choose, t(2120)=2.884, p=.002. Table 16 reveals that the regression model
increases our prediction accuracy by 6.6 percent, from 52.5 percent to 59.1 percent.
Table 16
Logistic Regression with Dummy IV's (t-Test)
DV = Question wording choice
IV = Personal opinion of the researcher

Total variable pairs = 2122                  N    Percent    t     df     p
Correct with knowledge of the
personal opinion of the researcher         1254    59.1    2.884  2120  .002
Correct without knowledge of the
personal opinion of the researcher         1115    52.5
The "Logistic Regression with an Interval IV" model also revealed a significant
relationship, χ²(1)=55.279, p<.001. Table 17 shows that this model was slightly better
than logistic regression with dummy variables. Our ability to predict the dependent
variable increased 7.1 percent (from 51.7 to 58.9 percent) and the t-statistic was highly
significant, t(1932)=3.018, p=.001.
Table 17
Logistic Regression with an Interval IV (t-Test)
DV = Question wording choice
IV = Personal opinion of the researcher

Total variable pairs = 1934                  N    Percent    t     df     p
Correct with knowledge of the
personal opinion of the researcher         1139    58.9    3.018  1932  .001
Correct without knowledge of the
personal opinion of the researcher         1000    51.7
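The logic of Tables 16 and 17, comparing prediction accuracy with and without knowledge of the researcher's opinion against a modal-category baseline, can be illustrated on invented data (the counts below are hypothetical, chosen for the sketch, and are not the study's data):

```python
import math

def fit_logistic(x, y, lr=0.1, steps=2000):
    """One-predictor logistic regression fit by plain gradient descent."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += p - yi
            g1 += (p - yi) * xi
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

def accuracy(x, y, b0, b1):
    """Fraction of cases classified correctly at the 0.5 probability cutoff."""
    preds = [1 if b0 + b1 * xi > 0 else 0 for xi in x]
    return sum(p == yi for p, yi in zip(preds, y)) / len(y)

# Hypothetical data: opinion coded 0/1/2, wording choice coded 0/1.
# Opinion 0 mostly pairs with choice 0; opinion 2 mostly with choice 1.
x = [0] * 100 + [1] * 100 + [2] * 100
y = [0] * 75 + [1] * 25 + [0] * 50 + [1] * 50 + [0] * 25 + [1] * 75

b0, b1 = fit_logistic(x, y)
baseline = max(sum(y), len(y) - sum(y)) / len(y)  # guess the modal category
model_acc = accuracy(x, y, b0, b1)                # beats the 50% baseline
```

The gap between `model_acc` and `baseline` plays the same role as the 6.6 and 7.1 percentage-point improvements reported in Tables 16 and 17.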
Table 18 is a summary of the conclusions that were drawn from each of the statistical
tests. In every test, the null hypothesis was rejected and therefore we conclude that there
was a significant relationship between researchers' personal opinions and their question
wording choices.
Table 18
Summary of Conclusions for the First Null Hypothesis
Null hypothesis: Researchers' choices for question wording are not related to their personal opinions.
Operational Definition of Relationship Conclusion
Liberal t-Test between Proportions                     Reject
Conservative t-Test between Proportions                Reject
One-Way Classification Chi-Square Test                 Reject
Two-Way Classification Chi-Square Test                 Reject
t-Test between Means                                   Reject
Logistic Regression with Dummy IV's (Chi-Square)       Reject
Logistic Regression with Dummy IV's (t-Test)           Reject
Logistic Regression with an Interval IV (Chi-Square)   Reject
Logistic Regression with an Interval IV (t-Test)       Reject
The final step in testing the first hypothesis was to determine whether there were
differences between the six sets of question wording pairs. Was the effect present for
some issues, but not for others? The same nine operational definitions of "relationship"
were used to test each of the six question wording pairs. Table 19 contains a summary of
the significance levels for all 54 statistical tests. For all issues except drugs, researchers'
personal opinions seemed to be significantly related to their question wording choices.
Note that a probability of one for a logistic regression model means that there was no
improvement in our ability to predict the dependent variable as a result of the regression
equation.
Table 19
Probability Levels for Each of the Nine Tests Broken Down by Issue

                                                Crime   Drugs  Welfare  Cities  Blacks  Social Security
                                                N=330   N=321  N=325    N=326   N=308   N=324
Liberal t-Test between Proportions               .101    .216   .109     .098    .050    .012
Conservative t-Test between Proportions          .444    .212   .000     .013    .182    .055
One-Way Classification Chi-Square Test           .003    .004   .000     .000    .000    .000
Two-Way Classification Chi-Square Test           .001    .160   .000     .001    .002    .000
t-Test between Means                             .001    .227   .000     .000    .000    .000
Logistic Regression, Dummy IV's (Chi-Square)     .002    .192   .000     .001    .000    .000
Logistic Regression, Dummy IV's (t-Test)        1.000    .256   .324     .174    .055    .363
Logistic Regression, Interval IV (Chi-Square)    .001    .452   .000     .000    .000    .000
Logistic Regression, Interval IV (t-Test)       1.000   1.000   .319     .285    .151    .355
Null Hypothesis 2: Researchers' choices for question wording are not related to the
persuasion of the study sponsor.
Using the "Liberal t-Test between Proportions", we observed a relationship in 70.1
percent of the question pairs. Table 20 shows that the t-statistic was not significant,
t(2120)=1.353, p=.088, thus we fail to reject the null hypothesis and conclude that there
was no significant relationship between researchers' question wording choices and the
persuasion of the study sponsor.
Table 20
Liberal t-Test between Proportions

Total number of comparisons = 2122     N    Percent    t     df     p
Observed relationships               1488    70.1    1.353  2120  .088
Relationships expected by chance     1414    66.7
The "Conservative t-Test between Proportions" showed a relationship in 34.8 percent of
the question pairs. Table 21 reveals that the t-test between proportions was not
significant, t(2120)=.838, p=.201. Therefore, we fail to reject the null hypothesis and
conclude that there was no significant relationship between researchers' wording choices
and the persuasion of the study sponsor.
Table 21
Conservative t-Test between Proportions

Total number of comparisons = 2122     N    Percent    t     df     p
Observed relationships                739    34.8    0.838  2120  .201
Relationships expected by chance      707    33.3
The "One-Way Classification Chi-Square Test" using the unbiased or neutral measure of
the relationship revealed a strong significant difference between the observed distribution
and one that would have been expected by chance if there were no true relationship,
χ²(2)=11.475, p=.003. Therefore, we reject the null hypothesis and conclude that a
relationship did exist between researchers' question wording choices and the persuasion
of the study sponsor. Table 22 shows the counts and percents for the one-way
contingency table.
Table 22
One-Way Classification Chi-Square Test
  Positive         No            Negative
Relationship   Relationship   Relationship     χ²     df     p
 739 (34.8%)    749 (35.3%)    634 (29.9%)   11.475    2    .003
The "Two-Way Classification Chi-Square Test" revealed a significant relationship
between researchers' question wording choices and the persuasion of the study sponsor,
χ²(2)=8.660, p=.013, thus we reject the null hypothesis. Gamma indicated that the
strength of the relationship was extremely small (γ=.099). Table 23 shows the counts
and row percents from the contingency table.
Table 23
Two-Way Classification Chi-Square Test
N=2122                           Persuasion of the Study Sponsor
Question Wording Choice        Liberal        None       Conservative
Favors "too little spent"    351 (34.9%)  346 (34.4%)   310 (30.8%)
Favors "too much is spent"   324 (29.1%)  403 (36.1%)   388 (34.8%)
The "t-Test between Means" revealed a significant difference in the persuasion of the
study sponsor between those who chose the question wording that favors "too little is
spent" and those who chose the wording that favors "too much is spent", t(2120)=2.810,
p=.003. A one-tailed test of significance was used because the direction of the difference
was predicted. Therefore, we reject the null hypothesis and conclude that there was a
significant relationship between the persuasion of the study sponsor and researchers'
question wording choices. Although the difference was significant, it is relatively small.
In Table 24, a mean of less than one indicates a more liberal sponsor, while a mean
greater than one indicates a more conservative sponsor.
Table 24
t-Test between Means
Question wording choice          N     Mean    SD      t     df     p
Favors "too little is spent"   1007    .959   .810   2.810  2120  .003
Favors "too much is spent"     1115   1.057   .797
Both logistic regression models had significant chi-square statistics. The "Logistic
Regression with Dummy IV's" model, χ²(2)=8.658, p=.003, was only slightly higher
than the "Logistic Regression with an Interval IV" model, χ²(1)=7.881, p=.005. From
the chi-square statistics, we would conclude that a significant relationship existed
between the persuasion of the study sponsor and researchers' question wording choices.
Table 25 reveals that the regression models increased our prediction accuracy by only 1.3
percent (from 52.5 percent to 53.8 percent). The t-statistic was the same for both models
and it was not significant, t(2120)=0.581, p=.281. Therefore, we conclude that knowing
the persuasion of the study sponsor does not significantly improve our ability to predict
which question wording choice a researcher would choose.
Table 25
Logistic Regression with Dummy IV's (t-Test) and
Logistic Regression with an Interval IV (t-Test)

DV = Question wording choice
IV = Persuasion of the study sponsor

Total variable pairs = 2122                  N    Percent    t     df     p
Correct with knowledge of the
persuasion of the study sponsor            1142    53.8    0.581  2120  .281
Correct without knowledge of the
persuasion of the study sponsor            1115    52.5
Table 26 is a summary of the conclusions that were drawn from each of the statistical
tests. The overall results are inconclusive. Whether or not one rejects (or fails to reject)
the null hypothesis depends upon the operational definition adopted by the researcher at
the onset of the study. Therefore, we are unable to draw a summary conclusion
regarding the relationship between researchers' question wording choices and the
persuasion of the study sponsor.
Table 26
Summary of Conclusions for the Second Null Hypothesis
Null hypothesis: Researchers' choices for question wording are not related to the persuasion of the study sponsor.
Operational Definition of Relationship Conclusion
Liberal t-Test between Proportions                     Fail to reject
Conservative t-Test between Proportions                Fail to reject
One-Way Classification Chi-Square Test                 Reject
Two-Way Classification Chi-Square Test                 Reject
t-Test between Means                                   Reject
Logistic Regression with Dummy IV's (Chi-Square)       Reject
Logistic Regression with Dummy IV's (t-Test)           Fail to reject
Logistic Regression with an Interval IV (Chi-Square)   Reject
Logistic Regression with an Interval IV (t-Test)       Fail to reject
The final step in testing the second hypothesis was to determine whether there were
differences between the six sets of question wording pairs. Was the effect present for
some issues, but not for others? The same nine operational definitions of "relationship"
were used to test each of the six question wording pairs. Table 27 contains a summary of
the significance levels for all fifty-four statistical tests. The effect was clearly strongest
for the cities and welfare issues.
Table 27
Probability Levels for Each of the Nine Tests Broken Down by Issue

                                                Crime   Drugs  Welfare  Cities  Blacks  Social Security
                                                N=359   N=352  N=352    N=357   N=344   N=358
Liberal t-Test between Proportions               .346    .478   .217     .183    .267    .289
Conservative t-Test between Proportions          .439    .335   .245     .204    .363    .335
One-Way Classification Chi-Square Test           .604    .651   .142     .075    .289    .384
Two-Way Classification Chi-Square Test           .485    .932   .086     .079    .392    .139
t-Test between Means                             .245    .372   .020     .014    .101    .090
Logistic Regression, Dummy IV's (Chi-Square)     .485    .933   .089     .078    .392    .142
Logistic Regression, Dummy IV's (t-Test)        1.000    .458  1.000     .419    .280   1.000
Logistic Regression, Interval IV (Chi-Square)    .488    .744   .039     .027    .201    .179
Logistic Regression, Interval IV (t-Test)       1.000    .458  1.000     .419    .280   1.000
Null Hypothesis 3: Researchers' choices for question wording are not related to their
self-assessed experience in questionnaire design.
The "Liberal t-Test between Proportions", "Conservative t-Test between Proportions",
and the "One-Way Classification Chi-Square Test" were not used for testing the third
hypothesis because they do not provide meaningful operational definitions of positive or
negative relationships. The other tests used two-tailed probabilities for significance
testing because theory does not predict the direction of the relationship.
The "Two-Way Classification Chi-Square Test" revealed a significant relationship
between researchers' self-assessed experience and their question wording choices,
χ²(2)=8.148, p=.017, thus we reject the null hypothesis. Gamma indicated a low strength
relationship (γ=.111). Closer examination of the contingency table revealed that the
significance was primarily due to the fact that self-assessed expert researchers tended to
choose the question wording that favors "too much is spent." Table 28 shows the
contingency table and column percents.
Table 28
Two-Way Classification Chi-Square Test
N=1925                              Self-Assessed Experience
Question Wording Choice        Beginner       Average        Expert
Favors "too little spent"    191 (51.5%)   460 (47.3%)   246 (42.3%)
Favors "too much is spent"   180 (48.5%)   512 (52.7%)   336 (57.7%)
The "t-Test between Means" revealed a small significant difference in researchers'
experience levels between those who chose the question wording that favors "too little is
spent" and those who chose the wording that favors "too much is spent", t(1923)=2.852,
p=.004. Therefore, we reject the null hypothesis and conclude that there was a
significant relationship between researchers' self-assessed experience and their question
wording choices. Table 29 reveals that researchers who chose the question wording that
favors "too much is spent" tended to rate themselves as more experienced than those who
chose the question wording that favors "too little is spent."
Table 29
t-Test between Means
Question wording choice          N     Mean    SD      t     df     p
Favors "too little is spent"    897   1.061   .769   2.852  1923  .004
Favors "too much is spent"     1028   1.152   .692
The "Logistic Regression with Dummy IV's" model produced a significant chi-square
statistic, χ²(3)=14.275, p=.003, thus indicating that there was a relationship between
researchers' self-assessed experience and their question wording choices. However, our
ability to predict the dependent variable increased by less than two percent (from 52.5
percent to 54.1 percent). As shown in Table 30, the t-test revealed that the improvement
was not significant, t(2120)=0.715, p=.475. Therefore, we fail to reject the null
hypothesis and conclude that knowing a researcher's self-assessed experience does not
significantly improve our ability to predict their question wording choices.
Table 30
Logistic Regression with Dummy IV's (t-Test)
DV = Question wording choice
IV = Self-assessed experience level of the researcher

Total variable pairs = 2122                    N    Percent    t     df     p
Correct with knowledge of the self-
assessed experience of the researcher        1149    54.1    0.715  2120  .475
Correct without knowledge of the self-
assessed experience of the researcher        1115    52.5
The "Logistic Regression with an Interval IV" model also revealed a significant
relationship, χ²(1)=8.119, p=.004. However, as shown in Table 31, this model was even
worse than logistic regression with dummy variables. Our ability to predict the
dependent variable improved by less than one percent (from 53.4 to 54.0 percent). Table
31 shows that this was not a significant improvement, t(1923)=.241, p=.809, and thus we
fail to reject the null hypothesis.
Table 31
Logistic Regression with an Interval IV (t-Test)
DV = Question wording choice
IV = Self-assessed experience level of the researcher
___________________________________________________________________
Total variable pairs = 1934                    N    Percent     t     df     p
___________________________________________________________________
Correct with knowledge of the self-assessed
  experience of the researcher               1139    54.0    0.241   1923  .809
Correct without knowledge of the self-assessed
  experience of the researcher               1028    53.4
___________________________________________________________________
Table 32 is a summary of the conclusions that were drawn from each of the statistical
tests. Regardless of whether researchers' self-assessed experience is viewed as an ordinal
or interval scale (i.e., nonparametric or parametric), the results are mixed. The predictive
models failed miserably, yet the parametric t-test and the nonparametric chi-square test
showed a significant relationship.
Table 32
Summary of Conclusions for the Third Null Hypothesis
Null hypothesis: Researchers' choices for question wording are not related to their
self-assessed experience in questionnaire design.
______________________________________________________________________
Operational Definition of Relationship                      Conclusion
______________________________________________________________________
Two-Way Classification Chi-Square Test                      Reject
t-Test between Means                                        Reject
Logistic Regression with Dummy IV's (Chi-Square)            Fail to reject
Logistic Regression with Dummy IV's (t-Test)                Fail to reject
Logistic Regression with an Interval IV (Chi-Square)        Fail to reject
Logistic Regression with an Interval IV (t-Test)            Fail to reject
______________________________________________________________________
The final step in testing the third hypothesis was to determine whether there were
differences between the six sets of question wording pairs. The same six operational
definitions of "relationship" were used to test each of the six question wording pairs.
Table 33 contains a summary of the significance levels for all thirty-six statistical tests.
The relationship between the self-assessed experience and question wording choice was
strongest (and often significant) for the crime and Social Security issues.
Table 33
Probability Levels for Each of the Six Tests Broken Down by Issue
______________________________________________________________________
                                                               Social
                              Crime   Drugs  Welfare  Cities  Blacks  Security
                              N=326   N=319  N=319    N=324   N=312   N=325
______________________________________________________________________
Two-Way Classification
  Chi-Square Test              .059    .462   .911     .680    .316    .037
t-Test between Means           .020    .272   .696     .831    .174    .010
Logistic Regression with
  Dummy IV's (Chi-Square)      .040    .388   .957     .576    .115    .073
Logistic Regression with
  Dummy IV's (t-Test)         1.000    .567  1.000    1.000    .346   1.000
Logistic Regression with an
  Interval IV (Chi-Square)     .020    .271   .695     .831    .173    .010
Logistic Regression with an
  Interval IV (t-Test)        1.000    .869  1.000    1.000    .660   1.000
______________________________________________________________________
Multivariate Models to Test the First Three Null Hypotheses
As a final test of the first three hypotheses, two multivariate models were developed.
The dependent variable for both models was the question wording choice and the three
independent variables were the 1) personal opinion of the researcher, 2) persuasion of the
study sponsor, and 3) self-assessed experience level of the researcher. One logistic
regression model viewed the independent variables as ordinal data and the other interval
data. Dummy variables were created for the model where the independent variables was
viewed as ordinal. Thus, the models cover both parametric and nonparametric
interpretations of the independent variables. Both the chi-square statistic and the t-test
between proportions were examined.
The "Multivariate Logistic Regression with Dummy IV's" model produced a significant
chi-square statistic, χ²(8)=88.715, p<.001. Table 34 shows that the t-test between
proportions was also significant, t(2120)=2.563, p=.008. We conclude that knowing the
independent variables improved our ability to predict researchers' question wording
choices by about six percent and that this improvement was significant.
Table 34
Multivariate Logistic Regression with Dummy IV's
DV = Question wording choice
IV's = Researchers' personal opinions
       Persuasion of the study sponsor
       Self-assessed experience level of the researcher
___________________________________________________________________
Total variable pairs = 2122                    N    Percent     t     df     p
___________________________________________________________________
Correct with knowledge of all
  three independent variables                1239    58.4    2.563   2120  .008
Correct without knowledge of the
  independent variables                      1115    52.5
___________________________________________________________________
The "Multivariate Logistic Regression with Interval IV's" model also revealed a
significant relationship, χ²(3)=70.498, p<.001. Table 35 shows that this model
performed slightly better than logistic regression with dummy variables. Knowledge of
the independent variables significantly improved our ability to predict the dependent
variable by nearly seven percent, t(1741)=2.746, p=.003.
Table 35
Multivariate Logistic Regression with Interval IV's
DV = Question wording choice
IV's = Researchers' personal opinions
       Persuasion of the study sponsor
       Self-assessed experience level of the researcher
___________________________________________________________________
Total variable pairs = 1743                    N    Percent     t     df     p
___________________________________________________________________
Correct with knowledge of all
  three independent variables                1038    59.6    2.746   1741  .003
Correct without knowledge of the
  independent variables                       917    52.6
___________________________________________________________________
Table 36 shows the logistic regression coefficients and probabilities for each of the
independent variables. Since all three variables were scaled from zero to two, the
coefficients represent the relative importance of the variables in the prediction model.
Researchers' personal opinions were the most important, the persuasion of the study
sponsor was second, and the self-assessed experience was least important in the
prediction model. However, all three variables made significant contributions.
Table 36
Logistic Regression Coefficients for the Multivariate Model with Interval IV's
DV = Question wording choice
______________________________________________________________________
Independent Variable          Coefficient   Std. Error   T-Ratio   Prob.
______________________________________________________________________
Experience Level                 .224          .071       3.157    .002
Persuasion of the sponsor        .249          .061       4.087    .000
Personal opinion                 .402          .063       6.405    .000
______________________________________________________________________
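The T-Ratio column in Table 36 is each coefficient divided by its standard error, and the probability follows from the large-sample normal approximation. The check below recovers those values from the printed coefficients; small discrepancies reflect rounding in the published table.

```python
from scipy.stats import norm

# Coefficients and standard errors as printed in Table 36
rows = {
    "Experience Level":          (0.224, 0.071),
    "Persuasion of the sponsor": (0.249, 0.061),
    "Personal opinion":          (0.402, 0.063),
}

for name, (b, se) in rows.items():
    t_ratio = b / se                  # Wald ratio, e.g. .224/.071
    p = 2 * norm.sf(abs(t_ratio))     # two-tailed probability
    print(f"{name}: t={t_ratio:.3f}, p={p:.3f}")
```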
Null Hypothesis 4: There are no differences between beginning and advanced
researchers with respect to the degree to which they incorporate their personal opinions
and those of the sponsor into their choice of question wording.
There are two parts to this null hypothesis. The first part addresses the degree to which
researchers incorporate their personal opinions and whether there are differences
depending on the self-assessed experience levels of the researchers. The second part
looks at the degree to which researchers incorporate the persuasion of the study sponsor
and whether there are differences depending on the self-assessed experience levels of the
researchers. Seven tests were used to examine both parts of this null hypothesis. These
tests use the same three operational definitions of "relationship" (i.e., liberal,
conservative, and unbiased) that were used to test the first two hypotheses.
A two-way contingency table was prepared using the "Liberal t-Test between
Proportions" and the self-assessed experience level of the researchers. The chi-square
was significant, χ²(2)=6.045, p=.049, thus we reject the null hypothesis and conclude that
there was a significant difference between beginning and advanced researchers with
respect to the degree to which they incorporated their personal opinions into their
question wording choices. Expert researchers were less likely to incorporate their
opinions than beginning or average researchers. Gamma indicated a weak
relationship (γ=-.097). Table 37 shows the counts and column percents for the
contingency table.
Table 37
Two-Way Classification Chi-Square Test using the Liberal Definition of the Relationship
______________________________________________________________________
                                      Self-Assessed Experience
                            _________________________________________
N=1743                        Beginner       Average        Expert
______________________________________________________________________
No observed relationship     90 (26.8%)    240 (26.9%)   168 (32.7%)
Observed relationship       246 (73.2%)    653 (73.1%)   346 (67.3%)
______________________________________________________________________
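Goodman and Kruskal's gamma compares concordant and discordant pairs in an ordered contingency table. A minimal sketch of that computation, applied to the counts in Table 37, reproduces the reported value of -.097:

```python
def goodman_kruskal_gamma(table):
    """Gamma for a contingency table with ordered rows and columns."""
    nr, nc = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(nr):
        for j in range(nc):
            for k in range(i + 1, nr):
                for m in range(nc):
                    if m > j:
                        concordant += table[i][j] * table[k][m]
                    elif m < j:
                        discordant += table[i][j] * table[k][m]
    return (concordant - discordant) / (concordant + discordant)

# Counts from Table 37: rows = (no relationship, relationship),
# columns = (beginner, average, expert)
gamma = goodman_kruskal_gamma([[90, 240, 168], [246, 653, 346]])
print(round(gamma, 3))   # -0.097
```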
A two-way contingency table was prepared using the "Conservative t-Test between
Proportions" and the self-assessed experience level of the researchers. The chi-square
was not significant, χ²(2)=4.285, p=.117, thus we fail to reject the null hypothesis and
conclude that there was no significant difference between beginning and advanced
researchers with respect to the degree to which they incorporated their personal opinions
into their question wording choices. Gamma was very close to zero (γ=.005). Table 38
shows the counts and column percents for the contingency table.
Table 38
Two-Way Classification Chi-Square Test using the Conservative Definition of the
Relationship
______________________________________________________________________
                                      Self-Assessed Experience
                            _________________________________________
N=1743                        Beginner       Average        Expert
______________________________________________________________________
No observed relationship    214 (63.7%)    517 (57.9%)   318 (61.9%)
Observed relationship       122 (36.3%)    376 (42.1%)   196 (38.1%)
______________________________________________________________________
A two-way contingency table using the unbiased or neutral measure of the relationship
revealed a weak significant relationship between researchers' self-assessed experience
levels and the degree to which they incorporated their personal opinions into their
question wording choices. The chi-square was significant, χ²(4)=10.933, p=.027, thus
we reject the null hypothesis and conclude that there was a significant relationship.
Close analysis of Table 39 reveals that the relationship was due mostly to the fact that
experienced researchers more often chose question wording that favored a response
opposite their own opinions, and beginning researchers were more likely to show no
relationship between their personal opinions and their question wording choices.
Table 39
Two-Way Classification Chi-Square Test using the Unbiased Definition of the Relationship
______________________________________________________________________
                                      Self-Assessed Experience
                            _________________________________________
N=1743                        Beginner       Average        Expert
______________________________________________________________________
Negative relationship        90 (26.8%)    240 (26.9%)   168 (32.7%)
No relationship             124 (36.9%)    277 (31.0%)   150 (29.2%)
Positive relationship       122 (36.3%)    376 (42.1%)   196 (38.1%)
______________________________________________________________________
A t-test between means was performed to compare beginning and advanced researchers
on the strength of the relationship between their personal opinions and their question
wording choices. The unbiased or neutral measure of the relationship was compared
between beginning and advanced researchers. Table 40 shows that the t-statistic was not
significant, t(848)=.708, p=.240, thus, we fail to reject the null hypothesis and conclude
that there was no significant difference between beginning and advanced researchers with
respect to the degree to which they incorporated their personal opinions into their question
wording choices. If we were to adopt an ordinal interpretation of the unbiased scale, the
Mann-Whitney U statistic supports the same conclusion, U(848)=88444, p=.550.
Table 40
t-Test between Means using the Unbiased Definition of the Relationship
______________________________________________________________________
Self-Assessed Experience       N     Mean     SD      t      df      p
______________________________________________________________________
Beginner                      336    .095   .790    0.708    848   .240
Advanced                      514    .054   .707
______________________________________________________________________
The same hypothesis was tested using a one-way analysis of variance. The dependent
variable was the unbiased measure of the relationship between researchers' personal
opinions and their question wording choices. The factor variable was the self-assessed
experience level of the researchers and the three levels were beginning, average, and
advanced. Table 41 shows that the F-ratio was not significant, F(2,1740)=2.420, p=.089,
thus, we fail to reject the null hypothesis. If we were to adopt an ordinal interpretation of
the unbiased scale, the Kruskal-Wallis test statistic supports the same conclusion,
KW(2)=4.180, p=.124.
Table 41
One-Way ANOVA using the Unbiased Definition of the Relationship
DV = Unbiased definition of the relationship
______________________________________________________________________
                                     Sum of      Mean
Source of Variation          df      Squares    Squares      F       p
______________________________________________________________________
Self-Assessed Experience       2       3.245     1.623     2.420   .089
Error                       1740    1166.715     0.671
Total                       1742    1169.960
______________________________________________________________________
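The F-ratio and probability in Table 41 can be verified directly from the printed sums of squares: F is the ratio of the mean squares, and p comes from the F distribution with (2, 1740) degrees of freedom.

```python
from scipy.stats import f

# Values as printed in Table 41
ss_between, df_between = 3.245, 2
ss_error, df_error = 1166.715, 1740

ms_between = ss_between / df_between   # mean square between groups
ms_error = ss_error / df_error         # mean square error
F = ms_between / ms_error              # 2.420
p = f.sf(F, df_between, df_error)      # .089
print(round(F, 3), round(p, 3))
```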
A summary of the first five tests is shown in Table 42. The tests looked at the degree to
which researchers incorporated their personal opinions and whether there were
differences depending on their self-assessed experience levels. The results of these tests
are inconclusive. Researchers can draw different conclusions depending on which
statistical method they choose.
Table 42
Summary of Conclusions for the Fourth Null Hypothesis
Null hypothesis: There are no differences between beginning and advanced researchers
with respect to the degree to which they incorporate their personal opinions into their
question wording choices.
______________________________________________________________________
Operational Definition of Relationship                      Conclusion
______________________________________________________________________
Two-Way Classif. Chi-Square Test - Liberal Definition       Reject
Two-Way Classif. Chi-Square Test - Conservative Definition  Fail to reject
Two-Way Classif. Chi-Square Test - Unbiased Definition      Reject
t-Test between Means - Unbiased Definition                  Fail to reject
One-Way ANOVA - Unbiased Definition                         Fail to reject
______________________________________________________________________
The second set of seven tests looked at the degree to which researchers incorporated the
persuasion of the study sponsor and whether there were differences depending on the self-
assessed experience levels of the researchers. These tests use the same three operational
definitions of "relationship" (i.e., liberal, conservative, and unbiased) as the previous
five.
A two-way contingency table was prepared using the "Liberal t-Test between
Proportions" and the self-assessed experience level of the researchers. The chi-square
was not significant, χ²(2)=3.504, p=.173, thus, we fail to reject the null hypothesis and
conclude that there was no significant difference between beginning and advanced
researchers with respect to the degree to which they incorporated the persuasion of the
study sponsor into their question wording choices. Table 43 shows the counts and
column percents for the contingency table.
Table 43
Two-Way Classification Chi-Square Test using the Liberal Definition of the Relationship
______________________________________________________________________
                                      Self-Assessed Experience
                            _________________________________________
N=1925                        Beginner       Average        Expert
______________________________________________________________________
No observed relationship     98 (26.4%)    277 (28.5%)   185 (31.8%)
Observed relationship       273 (73.6%)    695 (71.5%)   397 (68.2%)
______________________________________________________________________
A two-way contingency table was prepared using the "Conservative t-Test between
Proportions" and the self-assessed experience level of the researchers. The chi-square
was significant, χ²(2)=7.944, p=.019, thus we reject the null hypothesis and conclude that
there was a significant difference between beginning and advanced researchers with
respect to the degree to which they incorporated the persuasion of the study sponsor.
Gamma indicated a very weak negative relationship (γ=-.034). Table 44 shows the
counts and column percents for the contingency table.
Table 44
Two-Way Classification Chi-Square Test using the Conservative Definition of the
Relationship
______________________________________________________________________
                                      Self-Assessed Experience
                            _________________________________________
N=1925                        Beginner       Average        Expert
______________________________________________________________________
No observed relationship    247 (66.6%)    592 (60.9%)   392 (67.4%)
Observed relationship       124 (33.4%)    380 (39.1%)   190 (32.6%)
______________________________________________________________________
A two-way contingency table using the unbiased or neutral measure of the relationship
revealed a weak significant relationship between researchers' self-assessed experience
levels and the degree to which they incorporated the persuasion of the study sponsor into
their question wording choices. The chi-square was significant, χ²(4)=12.317, p=.015,
thus we reject the null hypothesis and conclude that there was a significant relationship.
Gamma showed a weak negative relationship (γ=-.048). The counts and column
percents are displayed in contingency Table 45.
Table 45
Two-Way Classification Chi-Square Test using the Unbiased Definition of the Relationship
______________________________________________________________________
                                      Self-Assessed Experience
                            _________________________________________
N=1925                        Beginner       Average        Expert
______________________________________________________________________
Negative relationship        98 (26.4%)    277 (28.5%)   185 (31.8%)
No relationship             149 (40.2%)    315 (32.4%)   207 (35.6%)
Positive relationship       124 (33.4%)    380 (39.1%)   190 (32.6%)
______________________________________________________________________
A t-test between means was performed to compare beginning and advanced researchers
on the strength of the relationship between their question wording choices and the
persuasion of the study sponsor. The unbiased or neutral measure of the relationship was
compared between beginning and advanced researchers. Table 46 shows that the t-
statistic was not significant, t(951)=1.170, p=.121, thus, we fail to reject the null
hypothesis and conclude that there was no significant difference between beginning and
advanced researchers with respect to the degree to which they incorporated the
persuasion of the sponsor into their question wording choices. If we were to adopt an
ordinal interpretation of the unbiased scale, the Mann-Whitney U statistic supports the
same conclusion, U(951)=112439.5, p=.280.
Table 46
t-Test between Means using the Unbiased Definition of the Relationship
______________________________________________________________________
Self-Assessed Experience       N     Mean     SD      t      df      p
______________________________________________________________________
Beginner                      371    .070   .771    1.170    951   .121
Advanced                      582    .009   .803
______________________________________________________________________
The same hypothesis was tested using a one-way analysis of variance. The dependent
variable was the unbiased measure of the relationship between the persuasion of the
study sponsor and researchers' question wording choices. The factor variable was the self-assessed
experience level of the researchers and the three levels were beginning, average, and
advanced. Table 47 shows that the F-ratio was not significant, F(2,1922)=2.673, p=.069,
thus we fail to reject the null hypothesis. If we were to adopt an ordinal interpretation of
the unbiased scale, the Kruskal-Wallis test statistic supports the same conclusion,
KW(2)=4.868, p=.088.
Table 47
One-Way ANOVA using the Unbiased Definition of the Relationship
DV = Unbiased definition of the relationship
______________________________________________________________________
                                     Sum of      Mean
Source of Variation          df      Squares    Squares      F       p
______________________________________________________________________
Self-Assessed Experience       2       3.452     1.726     2.673   .069
Error                       1922    1241.220     0.646
Total                       1924    1244.672
______________________________________________________________________
A summary of the second five tests is shown in Table 48. The tests looked at the
degree to which researchers incorporated the persuasion of the sponsor and whether
there were differences depending on their self-assessed experience levels. The results
of these tests are inconclusive. Researchers can draw different conclusions depending on
which statistical method is chosen.
Table 48
Summary of Conclusions for the Fourth Null Hypothesis
Null hypothesis: There are no differences between beginning and advanced researchers
with respect to the degree to which they incorporate the persuasion of the study sponsor
into their question wording choices.
______________________________________________________________________
Operational Definition of Relationship                      Conclusion
______________________________________________________________________
Two-Way Classif. Chi-Square Test - Liberal Definition       Fail to reject
Two-Way Classif. Chi-Square Test - Conservative Definition  Reject
Two-Way Classif. Chi-Square Test - Unbiased Definition      Reject
t-Test between Means - Unbiased Definition                  Fail to reject
One-Way ANOVA - Unbiased Definition                         Fail to reject
______________________________________________________________________
Additional Findings
A final question that we might ask is whether or not respondents consistently chose one
form of question wording over the other. For example, some respondents might have
consistently selected the question wording that elicits more "too little is spent" responses,
while others selected the question wording that elicits more "too much is spent"
responses.
A scale was constructed to examine whether or not respondents consistently chose the
same form of question wording. The scale was a simple count of the number of times
that a respondent selected the question wording that favored "too little is spent." Thus,
the scale could vary from zero to six, where a zero indicated that a respondent never
selected that wording and a six indicated that they always selected the wording.
If we view each question wording pair as an independent comparison, then it becomes
easy to calculate the expected distribution if only chance is operating. A good analogy is
to view each question wording pair like the flip of a coin. For any given comparison,
there is a fifty-fifty probability that the respondent will select a particular question
wording. The binomial expansion of (a + b)^6 provides coefficients for determining the
expected proportion of respondents at each of the seven possible scale values (zero
through six). A chi-square test was used
to determine whether the distribution of the scale was significantly different than the one
predicted by the binomial distribution.
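Under the chance model, the expected proportion of respondents at scale value k is C(6,k)/2^6. Multiplying by N=361 gives the binomial predictions that appear in Table 49:

```python
from math import comb

N = 361
# Expected count at each scale value 0..6 under a fair-coin model
expected = [N * comb(6, k) / 2**6 for k in range(7)]
print([round(e) for e in expected])   # [6, 34, 85, 113, 85, 34, 6]
```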
Table 49 shows the contingency table for the comparison of the predicted binomial
distribution with the number of times that respondents selected the question wording that
favored "too little is spent." The chi-square was highly significant, χ²(6)=85.216,
p<.001, thus indicating that the distribution was different than would be expected by
chance. Close examination of the table shows that respondents tended to favor one form
of question wording or the other. Instead of approximating the binomial distribution, the
scale was relatively flat.
Table 49
Two-Way Contingency Table Comparing Responses to the Binomial Distribution
______________________________________________________________________
              Number of comparisons where respondent selected
           wording that favors a response of "too little is spent"
         _____________________________________________________________
N=361         0        1        2        3        4        5        6
______________________________________________________________________
Observed     46       59       62       61       49       59       25
          (12.7%)  (16.3%)  (17.2%)  (16.9%)  (13.6%)  (16.3%)   (6.9%)
Binomial
prediction    6       34       85      113       85       34        6
           (1.6%)   (9.4%)  (23.4%)  (31.3%)  (23.4%)   (9.4%)   (1.6%)
______________________________________________________________________
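The chi-square of 85.2 corresponds to treating the observed and predicted counts as the two rows of a 2×7 contingency table. A sketch with scipy, using the rounded predictions from Table 49 (so the statistic comes out slightly below the published 85.216):

```python
from scipy.stats import chi2_contingency

observed = [46, 59, 62, 61, 49, 59, 25]    # from Table 49
predicted = [6, 34, 85, 113, 85, 34, 6]    # rounded binomial counts

# 2 x 7 table: degrees of freedom = (2-1)(7-1) = 6
chi2, p, dof, _ = chi2_contingency([observed, predicted])
print(round(chi2, 1), dof)
```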
Descriptive statistics were also computed for the scale. The mean of the scale
was close to three (M=2.79, SD=1.83) and the skewness was near zero (SK=.08),
therefore, we conclude that the distribution was not heavily lopsided one way or the
other. However, the kurtosis indicated that the distribution was rather flat (k=1.88) and
the Kolmogorov-Smirnov statistic for normality confirmed that the shape of the
distribution was significantly different than the normal bell-shaped curve (KS=2.47,
p<.01).
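These descriptive statistics can be reconstructed from the observed counts in Table 49 by expanding them into individual scale scores. A sketch, assuming the kurtosis is the non-excess form the text reports and that the KS value of 2.47 is the D statistic multiplied by the square root of N:

```python
import numpy as np
from scipy.stats import skew, kurtosis, kstest

counts = [46, 59, 62, 61, 49, 59, 25]       # Table 49, scale values 0..6
data = np.repeat(np.arange(7), counts)      # 361 individual scale scores

m, sd = data.mean(), data.std(ddof=1)       # M=2.79, SD=1.83
sk = skew(data)                             # SK=.08
k = kurtosis(data, fisher=False)            # k=1.88 (flat distribution)
d = kstest(data, "norm", args=(m, sd)).statistic
ks = d * np.sqrt(len(data))                 # compare to the reported KS=2.47
print(round(m, 2), round(sd, 2), round(sk, 2), round(k, 2), round(ks, 2))
```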
Cronbach's alpha provides a measure of the internal consistency among a group of items.
In most testing instruments, we would be pleased to see that a group of items had high
reliability. For example, if a teacher said that a test was highly reliable, it would mean
that most of the items in the instrument discriminated well between students who knew
the material and those who didn't. In this study though, high "reliability" means that
respondents were likely to consistently favor one form of question wording. Thus, a
rather dramatic paradox becomes apparent. For this study, high reliability means high
bias and low reliability means low bias. A researcher who consistently selects a specific
form of question wording would produce a survey that contains the greatest bias. A
researcher who chooses questions at random would produce a survey with the least bias.
Cronbach's alpha was .69, a moderately high value. This provides further support that
researchers favored one form of question wording or the other.
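Cronbach's alpha for the six binary wording-choice items is computed from the item variances and the variance of the summed scale. A minimal sketch (the raw 361×6 response matrix is not reproduced here, so the demonstration data are illustrative):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a respondents-by-items matrix (e.g. 361 x 6)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the summed scale
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Perfectly consistent choices across two items yield alpha = 1.0 --
# maximum "reliability," which in this study would mean maximum bias.
print(cronbach_alpha([[0, 0], [1, 1], [0, 0], [1, 1], [1, 1]]))   # 1.0
```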
CHAPTER V
Conclusions and Recommendations
Summary
A review of the literature revealed that there can be large differences in the way that
people respond to public opinion surveys depending on the phraseology of the questions.
Seemingly minor changes in question wording can have enormous impact on people's
responses. In some cases, researchers would draw different conclusions on an issue
depending on their choice of question wording.
This study examined the degree to which researchers incorporated their own opinions
and those of the study sponsor into their question wording choices. It was hypothesized
that researchers unknowingly select phraseology that produces public response
supportive of their own opinions, or those of the study sponsor. The literature provided
many examples of studies that examined public response to question wording
alternatives, but none were found that looked at the role of the researcher in the creation
of those questions.
The purpose of this study was to determine whether or not survey researchers unknowingly
influence the results of a survey through their question wording choices. This study
tested four null hypotheses using a variety of statistical techniques. The null hypotheses
were:
1. Researchers' choices for question wording are not related to their personal
opinions.
2. Researchers' choices for question wording are not related to the persuasion of the
study sponsor.
3. Researchers' choices for question wording are not related to their self-assessed
experience in questionnaire design.
4. There is no difference between beginning and advanced researchers with respect
to the degree to which they incorporate their personal opinions and those of the
sponsor into their choice of question wording.
A survey was designed using six pairs of social issue labels that had been studied by
Rasinski over a three-year period (1984-86). These question wording pairs were selected
because they were known to evoke different responses from the public.
The survey was mailed to 953 people who had some involvement in the survey research
process. A short vignette told respondents that they had been hired to design a public
opinion survey. One third of the respondents was told that the sponsor for their study
was a conservative anti-spending group, another third was told that their sponsor was a
liberal pro-spending group, and for the last third, no sponsor was identified. The vignette
explained that the respondent might also be hired to conduct additional surveys in the
future; however, the director of the organization specifically told them, "We really need
to know the truth, so be objective."
The survey presented both forms of question wording and asked respondents which one
they would use in their own survey. It also asked how they would personally answer the
question wording that they had selected. This was repeated for each of the six social
issues.
A variety of different statistical techniques were used to test each of the hypotheses.
Each of the tests provided a different operational definition for the "relationship"
between the variables. The primary techniques were contingency table analysis using the
chi-square statistic and gamma, the student's t-test between means, the one-sample t-test
between proportions, and logistic regression analysis.
Conclusions
A total of 361 usable surveys was returned. The response rate was about 40 percent, a
respectable return for mail surveys without follow-up mailings. There were no
significant differences in response rates depending on which of the three forms the
respondent received.
This study used an untraditional approach to elicit comments. The cover letter simply
asked respondents to write comments anywhere on the questionnaire. About 28 percent
of the respondents wrote an average of 2.4 comments on the survey itself. The
comments clearly show that respondents believed that objectivity is important in survey
research. Many of the respondents suggested a third wording alternative, presumable one
that was more objective. Others directly addressed the ways in which the question
wording alternatives were biased.
Nine different statistical tests were used to determine whether researchers' question
wording choices were related to their personal opinions. A one-tailed test of significance
was used because the direction of the relationship was predicted. The results of all nine
tests were in agreement. We therefore conclude that there was a significant positive
relationship between researchers' question wording choices and their personal opinions.
In this study, researchers decidedly selected the question wording that would sway public
response to favor their own opinion and the effect was quite substantial. Knowledge of
the personal opinion of a researcher improved our ability to predict their choice of
question wording by about seven percent when compared to only knowing which
wording was the most popular choice among respondents. The effect was found to exist
for each of the question wording pairs, although it was weaker for the drugs issue.
The same nine tests were performed for the second hypothesis to determine if
researchers' choices for question wording were related to the persuasion of the study
sponsor. The results of these tests were inconclusive. Five tests told us to reject the null
hypothesis and four told us to fail to reject the null hypothesis. Researchers' choices for
question wording may or may not be related to the persuasion of the sponsor. Our
conclusions depend upon the particular mathematical construct selected to evaluate the
relationship. The effect seemed to be most prominent for the cities, welfare, and Social
Security issues. Even though the effect was often significant, it was very small.
Knowledge of the persuasion of the sponsor increased our ability to predict a
respondent's question wording by 1.3 percent when compared to only knowing which
wording was the most popular choice among respondents.
Six statistical tests were used to evaluate the third hypothesis to determine if researchers'
choices for question wording were related to their self-assessed experience in
questionnaire design. The results of these tests were also inconclusive. Two tests told us
to reject the null hypothesis and four told us to fail to reject the null hypothesis. The
effect seemed strongest for the crime and Social Security issues. Even though the effect
was significant, it was very small. Knowledge of the self-assessed experience level of a
respondent increased our ability to predict their question wording choice by 0.6 percent
when compared to only knowing which wording was the most popular choice among
respondents.
A multivariate logistic regression model was created to examine the relationship between
respondents' question wording choices and the combined effects of their personal
opinions, self-assessed experience, and the persuasion of the study sponsor. The model
significantly improved our ability to predict respondents' question wording choices by
about seven percent compared to only knowing which wording was the most popular
choice among respondents, χ²(3)=70.5, p<.001. This was about the same as the logistic
regression model that only used respondents' personal opinions as the independent
variable. While the persuasion of the study sponsor and self-assessed experience were
significant, they did not improve our prediction ability in the multivariate model.
The fourth hypothesis was tested in two parts. The first part was to determine if there
was a difference between beginning and advanced researchers with respect to the degree
to which they incorporated their personal opinions into their choice of question wording.
Five different statistical tests were used to test this hypothesis. The results were
inconclusive. Two tests told us to reject the null hypothesis and three told us to fail to
reject the null hypothesis. The second part was to determine if there was a difference
between beginning and advanced researchers with respect to the degree to which they
incorporated the persuasion of the study sponsor into their choice of question wording.
The results were also inconclusive. Two tests told us to reject the null hypothesis and
three told us to fail to reject the null hypothesis.
A final finding of this study is that respondents tended to favor one form of question
wording over the other. Some respondents consistently selected the question wording that
elicited more "too little is spent" responses, while a slightly greater number selected the
wording that elicited more "too much is spent" responses. The tendency of respondents to
consistently choose one form of question wording was significantly different from what we
would have expected from the binomial expansion, χ2(6) = 85.2, p < .001.
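The consistency test above can be sketched as follows: count, for each respondent, how many of the six items were answered with one wording form, and compare the distribution of those counts to the binomial expectation under independent choices. The observed counts below are made up for illustration; only the method mirrors the text:

```python
# Hypothetical sketch of the binomial consistency test. Each respondent made
# six wording choices; under the null hypothesis of independent 50/50 choices,
# the number of form-"a" selections per respondent follows Binomial(6, 0.5).
from math import comb

def binomial_expected(n_respondents, n_items=6, p=0.5):
    """Expected number of respondents choosing form 'a' on k of n_items."""
    return [n_respondents * comb(n_items, k) * p**k * (1 - p)**(n_items - k)
            for k in range(n_items + 1)]

def chi_square(observed, expected):
    """Pearson chi-square statistic across the k = 0..6 count categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Made-up observed counts for 320 respondents, piled up at the extremes
# (consistent choosers) relative to the binomial's central hump.
observed = [20, 30, 45, 70, 45, 50, 60]
expected = binomial_expected(sum(observed))  # [5, 30, 75, 100, 75, 30, 5]
stat = chi_square(observed, expected)
```

A large statistic, as here, indicates more respondents at the all-"a" and all-"b" extremes than chance would produce, which is the pattern the study reports.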
Discussion and Recommendations
Many studies have demonstrated that question wording differences can evoke radically
different response distributions for public opinion surveys. The effect can be large and
researchers might draw different conclusions about the same issue, depending on their
choice of question wording. This study found that researchers tended to choose question
wording that favored their personal opinions. The effect was less pronounced for
advanced researchers; however, regardless of which statistical tests were used, our
conclusions remained the same.
It is important to point out that this form of bias would not cause a correspondingly linear
effect in a public opinion poll. Once a researcher chooses a particular form of question
wording, the change in public opinion corresponds to the cognitive aspects of that
question. Thus, changes in question wording that seem small to researchers can cause
large changes in public opinion response distributions.
An important finding of this study was that the relationship between researchers' question
wording choices and personal opinions exists in a delicate balance. Some researchers
(more than expected by chance) clearly favored the question wordings that would elicit
more "too much is spent" responses, while a slightly smaller number (also more than
expected by chance) favored the question wordings that would elicit more "too little is
spent" responses. Researchers generally leaned one way or the other, but they did so in
balance with other researchers. If we were to look at a survey created by any individual
researcher, we would most likely find that it consistently favored one particular form of
question wording. However, the "sum" of all such surveys created by "all" researchers
would nearly cancel the bias.
It is the conclusion of this study that a survey created by a single researcher is most likely
to be biased in a way that corresponds with that researcher's own opinion. In the best of
all worlds, a way to overcome that bias might be to have a team of independent
researchers create the questions for a survey and then use them as multiple measures of
the phenomena being studied. Of course, in most situations, it is not practical or
financially feasible to involve a number of researchers in the survey design process, but
there may be other alternatives that are also effective.
One method suggested by a couple of respondents was to hold a focus group of potential
respondents. The idea, of course, was not to assess the overall opinion of the public,
but instead, to use the knowledge from the focus group to help shape the contents of the
public opinion survey. The idea is excellent, although not without problems. The main
problem with focus groups is that the opinions of the extroverted members of the group
tend to muffle the voices of those who are more introverted; shy members' opinions often
remain unspoken. Experienced moderators can minimize this effect, but it is still
present. Second, focus groups are usually quite small, consisting of eight to twelve
members. Minority opinions that are shared by fewer than ten or fifteen percent of the
population are likely to be missed altogether. Third, focus groups are expensive. The
cost of a facility, moderator, and remuneration to respondents is between two and three
thousand dollars per focus group, a substantial burden for many survey budgets.
A purely speculative approach might be to adapt the Delphi forecasting technique to the
design of surveys. A number of researchers could be surveyed through the mail to find
out what questions they would ask on a public opinion survey. The responses would be
collated and a summary would be prepared and returned to the researchers. They would
be asked again to design the questions, this time, with knowledge of the question
wording selected by the other researchers. The second round of the Delphi technique
usually results in a consensus or centering of opinions. There are two advantages that the
Delphi technique might have over the focus group. The first is that each person's opinion
is heard by all the other members; extroverted individuals have no advantage over
introverts. Thus, the Delphi technique is more likely to produce a wider range of
opinions. Second, the Delphi method would be less expensive than an equivalent focus
group. Some readers may argue that a focus group is more effective at capturing the
"flavor of the response"; however, this study found that researchers' comments were
abundant and uninhibited.
Another recommendation of this study is that researchers need to use multiple measures
of a phenomenon. When public opinion researchers ask a single question about an issue,
their conclusions contain the bias inherent in the question wording they use. The only
way to study the degree of bias is to compare it to public response to other questions on
the same issue. Even then, the bias can only be expressed relative to other questions.
There is no absolute measure of this form of bias.
When we are presented with the results from a survey, how do we know the nature of the
question wording bias, in the absence of other information? The answer is that we don't.
If the survey was prepared by a single researcher, then it probably reflects the personal
opinion of the researcher to some degree, but there is no way that we can ascertain the
nature of that bias without conducting additional research. We can only predict that it is
there.
There is also the issue of how to interpret multiple measures of an issue. One respondent
to this survey suggested that multiple measures be summed. This method would be
effective only when the question bias is balanced (i.e., half of the questions are biased
one way and half the other way). If a single researcher had designed a questionnaire, we
would not expect the bias to be balanced. On the contrary, the results of this study
indicate that we would expect a fairly strong bias in one direction or the other, and
without additional research, we could not even determine the direction of the bias, let
alone its strength. The idea of summing multiple measures of an issue is appealing, but
probably inappropriate in most cases. Each form of question wording taps a different
cognitive aspect of an issue. Rather than trying to combine multiple measures into a
single unified answer, it might be better to embrace the idea of complexity and focus on
the interrelationships among the various measures. This study does not provide guidance
in this area, and progress is most likely to come from the field of cognitive psychology.
We were not able to draw conclusions for any of the other hypotheses. Or more
correctly, we would have drawn different conclusions depending on the specific
statistical technique we had chosen to test the hypotheses. Furthermore, it didn't make
any difference whether we had selected a parametric or nonparametric test, or whether
we called a scale ordinal or interval. The results were mixed without any discernible
pattern. This, in itself, is a disturbing finding. At the outset of any research project, the
investigator selects a statistical methodology appropriate to the data to be collected, and
this becomes the operational definition of the phenomenon being studied. Researchers
using different definitions of a phenomenon might come to different conclusions.
When the phenomenon was relatively strong, such as the relationship between question
wording choice and personal opinion, all statistical methods were in agreement.
However, when the phenomenon was weak, some tests called it significant and others
didn't. The explanation, of course, lies in the fact that each of the tests was measuring
something slightly different from the others. The disturbing part is that regardless of
which test researchers selected, their conclusions would be technically correct. Yet, they
were all attempting to measure the same phenomenon.
To the layperson, "significance" and "importance" are synonymous, but to the scientist,
they have entirely different meanings. Significance refers to the confidence that we
have in our findings. If we say that a significant relationship exists, it means that we are
very sure that there really is a relationship. It does not tell us directly about the strength
or magnitude of the relationship. When the sample size becomes sufficiently large (as in
this study), it becomes possible to detect very weak relationships. Furthermore, these
weak relationships might be "statistically significant" because we are very sure that they
exist. The larger the sample size, the smaller the relationship that we can detect as being
significant.
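The dependence of significance on sample size can be illustrated with a short sketch (the numbers are illustrative, not the study's data): the same two-point difference in proportions is nonsignificant with 100 cases per group but highly significant with 50,000 per group.

```python
# Illustrative two-proportion z-test showing that a fixed, tiny effect
# (51% vs. 49%) becomes "statistically significant" purely by raising n.
from math import sqrt, erf

def two_prop_z(p1, p2, n1, n2):
    """z statistic and two-tailed p-value for a difference in proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

_, p_small = two_prop_z(0.51, 0.49, 100, 100)      # 100 cases per group
_, p_large = two_prop_z(0.51, 0.49, 50000, 50000)  # 50,000 cases per group
```

With 100 cases per group the p-value is far above any conventional threshold; with 50,000 per group the identical difference is significant well beyond the .001 level, even though the relationship is no stronger.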
To say that a significant relationship exists only tells half the story. We might be very
sure that a relationship exists, but is it a strong, moderate, or weak relationship? After
finding a significant relationship, it is important to ascertain its strength. This study
attempted to incorporate analytical techniques that would reveal both the significance and
strength of the relationships. For example, the probability of the chi-square statistic in
logistic regression is traditionally used to report the overall significance of the
relationship between the dependent variable and the combined effects of the independent
variables. The chi-square test reveals little about the actual strength of the relationship,
only how sure we are that it exists. Therefore, this study also used a t-test between
proportions based on the regression predictions. If the relationship was strong,
knowledge of the independent variables would significantly improve our ability to
predict the dependent variable, and if it were weak, we might see no actual improvement.
Thus, the t-test provided a significance test that focused on the magnitude of the effect
rather than merely our confidence that a relationship exists.
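A minimal sketch of this style of test, assuming a large-sample z approximation to the t-test between proportions. The hit count below is hypothetical; only the 361-respondent sample size comes from the study:

```python
# Hypothetical sketch: does a model's classification hit rate significantly
# exceed the modal-baseline proportion? One-sample z-test between proportions
# (a large-sample stand-in for the t-test described in the text).
from math import sqrt, erf

def one_sample_prop_test(hits, n, baseline):
    """z and one-tailed p for an observed hit rate vs. a baseline proportion."""
    p_hat = hits / n
    se = sqrt(baseline * (1 - baseline) / n)
    z = (p_hat - baseline) / se
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))
    return z, p_value

# Made-up example: suppose the model predicts 224 of 361 choices correctly
# (about 62 percent) against a hypothetical 55 percent modal baseline.
z, p = one_sample_prop_test(224, 361, 0.55)
```

A significant result here would mean the independent variables genuinely improve prediction, which is exactly the "magnitude-focused" question the text distinguishes from the overall chi-square significance of the regression.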
Whether or not a researcher finds significance is a function of the strength of the phenomenon
being measured, the type of statistical test, and the number of observations in the data set.
This study did not draw conclusions regarding the relationship between researchers'
question wording choices and the persuasion of the study sponsor, or the self-assessed
experience levels of the researchers. However, even if there were significant
relationships, they do not appear to be important considerations to the survey research
design process. If the relationships exist, they are minor, and therefore it is the
recommendation of this study that they are not worthy of additional study.
On the other hand, researchers' personal opinions showed a relatively strong relationship
to their question wording choices. It is our recommendation that further research be
conducted to confirm the results of this study and to more fully investigate the nature of
this phenomenon.
This study randomly assigned individuals to one of the three sponsorship conditions. In
the "real world," however, researchers often have freedom to choose the sponsors they
become involved with. Researchers might gravitate to sponsors that support their own
opinions, or sponsors might have a tendency to hire researchers who support the
institution's goals. Thus, there may be a relationship between researchers' opinions and
sponsorship goals. The covariance between a researcher's opinion and the persuasion of
the sponsor was forced to zero in this study through the random assignment procedure;
however, this may not be the case in the "real world." Future research is needed to
explore this issue.
Finally, the most surprising finding of this study was the tendency of researchers to favor
one form of question wording over the other. This appeared to be the strongest
phenomenon uncovered by this study. Researchers clearly had a "favorite" form of
question wording and they regularly chose that form. Only 17 percent of the respondents
showed no sign of bias one way or the other and over 50 percent showed a moderate to
strong bias. This bias existed in most researchers to varying degrees and it was generally
much stronger than expected. We therefore recommend that this phenomenon be further
studied to explore the scope and magnitude of the effect.
The purpose of this study was to determine whether or not survey researchers unknowingly
influence the results of a survey through their question wording choices. Our conclusion
is that they do, the effect is substantial, and that further research is needed in this area.
References
Alexander, C. S., & Becker, H. J. (1978). The use of vignettes in survey research. Public Opinion Quarterly 42 (1), 93-104.
Ayidiya, S., & McClendon, M. (1990). Response effects in mail surveys. Public Opinion Quarterly 54 (2), 229-247.
Barath, A., & Cannell, C. F. (1976). Effect of interviewer's voice intonation. Public Opinion Quarterly 40 (3), 370-373.
Bishop, G. F. (1987). Experiments with the middle response alternative in survey questions. Public Opinion Quarterly 51 (2), 220-231.
Bishop, G. F., Hippler, H. J., Schwarz, N., & Strack, F. (1988). A comparison of response effects in self administered and telephone surveys. In R. Groves (Ed.), Telephone survey methodology (pp. 321-340). New York: Wiley.
Bishop, G. F., Oldendick, R. W., & Tuchfarber, A. J. (1978). Effects of question wording and format on political attitude consistency. Public Opinion Quarterly 42 (1), 81-92.
Bishop, G. F., Oldendick, R. W., & Tuchfarber, A. J. (1982). Effects of presenting one versus two sides of an issue in survey questions. Public Opinion Quarterly 46 (1), 69-85.
Blair, E. (1977). More on the effects of interviewer's voice intonation. Public Opinion Quarterly 41 (4), 544-548.
Bradburn, N. M., & Mason, W. M. (1964). The effect of question order on response. Journal of Marketing Research 1 (4), 57-61.
Bradburn, N. M., & Miles, C. (1979). Vague quantifiers. Public Opinion Quarterly 43 (1), 92-101.
Burnkrant, R. E., & Howard, D. J. (1984). Effects of the use of introductory rhetorical question versus statements on information processing. Journal of Personality and Social Psychology 47 (6), 1218-1230.
Carp, F. M. (1974). Position effects on interview responses. Journal of Gerontology 29 (5), 581-587.
Chase, C. I. (1969). Often is where you find it. American Psychologist 24 (11), 1043.
Clancey, K. J., & Wachsler, R. A. (1971). Positional effects in shared cost surveys. Public Opinion Quarterly 35 (2), 258-265.
Cliff, N. (1959). Adverbs as multipliers. Psychological Review 66 (1), 27-44.
Collins, W. A. (1970). Interviewers' verbal idiosyncrasies as a source of bias. Public Opinion Quarterly 34 (3), 416-422.
Dohrenwend, B. S., Colombotos, J., & Dohrenwend, B. P. (1968). Social distance and interviewer effects. Public Opinion Quarterly 32 (3), 410-422.
Erdos, P. L. (1957). How to get higher returns from your mail surveys. Printer's Ink 258 (8), 30-31.
Hakel, M. D. (1968). How often is often? American Psychologist 23 (7), 533-534.
Hanson, R. H., & Marks, E. S. (1958). Influence of the interviewer on the accuracy of survey results. Journal of the American Statistical Association 53 (282), 635-655.
Hedges, B. M. (1979). Question wording effects: Presenting one or both sides of a case. The Statistician 28, 83-99.
Hippler, H. J., & Schwarz, N. (1987). Response effects in surveys. In H. J. Hippler, N. Schwarz, & S. Sudman (Eds.), Social information processing and survey methodology (pp. 321-340). New York: Springer-Verlag.
Jackman, M. R. (1973). Education and prejudice or education and response-set? American Sociological Review 38 (3), 327-339.
Kalton, G., Collins, M., & Brook, L. (1978). Experiments in wording opinion questions. Applied Statistics 27 (2), 149-161.
Kraut, A. I., Wolfson, A. D., & Rothenberg, A. (1975) Some effects of position on opinion survey items. Journal of Applied Psychology 60 (6), 774-776.
Krosnick, J. A. (1989). Question wording and reports of survey results: The case of Louis Harris and Associates and Aetna Life and Casualty. Public Opinion Quarterly 53 (1), 107-113.
Levine, S., & Gordon, G. (1958). Maximizing returns on mail questionnaires. Public Opinion Quarterly 22 (4), 568-575.
McFarland, S. G. (1981). Effects of question order on survey responses. Public Opinion Quarterly 45 (2), 208-215.
Mosier, C. I. (1941). A psychometric study of meaning. Journal of Social Psychology 13, 123-140.
Mullner, R. M., Levy, P. S., Byre, C. S., & Matthews, D. (1982, Sept.-Oct.). Effects of characteristics of the survey instrument on response rates to a mail survey of community hospitals. Public Health Reports 97 (5), 465-469.
Noelle-Neumann, E. (1970). Wanted: Rules for wording structured questionnaires. Public Opinion Quarterly 34 (2), 191-201.
Parducci, A. (1968). Often is often. American Psychologist 23 (11), 828.
Payne, S. L. (1951). The art of asking questions. Princeton, NJ: Princeton University Press.
Pepper, S., & Prytulak, L. S. (1974). Sometimes frequently means seldom: Context effects in the interpretation of quantitative expressions. Journal of Research in Personality 8, 95-101.
Petty, R. E., Cacioppo, J. T., & Heesacker, M. (1981). Effects of rhetorical questions on persuasion: A cognitive response analysis. Journal of Personality and Social Psychology 40 (3), 432-440.
Petty, R. E., Rennier, G. A., & Cacioppo, J. T. (1987). Assertion versus interrogation format in opinion surveys: Questions enhance thoughtful responding. Public Opinion Quarterly 51 (4), 481-494.
Phillips, D. L., & Clancy, K. J. (1972). Modeling effects in survey research. Public Opinion Quarterly 36 (2), 246-253.
Poe, G. S., Seeman, I., McLaughlin, J., Mehl, E., & Dietz, M. (1988). Don't know boxes in factual questions in a mail questionnaire: Effects on level and quality of response. Public Opinion Quarterly 52 (2), 212-222.
Rasinski, K. A. (1989). The effect of question wording on public support for government spending. Public Opinion Quarterly 53 (2), 388-394.
Robinson, R. A. (1952). How to boost returns from mail surveys. Printer's Ink 239 (10), 35-37.
Rugg, D., & Cantril, H. (1944). The wording of questions. In H. Cantril (Ed.), Gauging public opinion (pp. 23-50). Princeton, NJ: Princeton University Press.
Schaeffer, N. C. (1991). Hardly ever or constantly? Group comparisons using vague quantifiers. Public Opinion Quarterly 55 (3), 392-423.
Schuman, H., & Presser, S. (1977). Question wording as an independent variable in survey analysis. Sociological Methods and Research 6 (2), 151-170.
Schuman, H., & Presser, S. (1981). Questions and answers in attitude surveys. New York: Academic Press.
Schyberger, A. B. (1967). Study of interviewer behavior. Journal of Marketing Research 4 (1), 32-35.
Simpson, R. (1944). The specific meaning of certain terms indicating differing degrees of frequency. The Quarterly Journal of Speech 30, 328-330.
Skelly, F. R. (1954). Interviewer-appearance stereotypes as a possible source of bias. Journal of Marketing 19 (1), 74-75.
Sletto, R. F. (1940). Pretesting of questionnaires. American Sociological Review 5 (2), 193-200.
Smith, T. W. (1982). Conditional order effects. GSS Technical Report No. 33. Chicago: National Opinion Research Center.
Smith, T. W. (1987). That which we call welfare by any other name would smell sweeter: An analysis of the impact of question wording on response patterns. Public Opinion Quarterly 51 (1), 75-83.
Spector, P. (1981). Research design. Beverly Hills: Sage.
Sudman, S., & Bradburn, N. (1974). Response effects in surveys. Chicago: Aldine.
Swasy, J. L., & Munch, J. M. (1985). Examining the target of receiver elaborations: Rhetorical question effects on source processing and persuasion. Journal of Consumer Research 11, 877-886.
Tourangeau, R., & Rasinski, K. A. (1988). Cognitive processes underlying context effects in attitude measurement. Psychological Bulletin 103 (3), 299-314.
Tourangeau, R., Rasinski, K. A., Bradburn, N., & D'Andrade, R. (1989). Carryover effects in attitude surveys. Public Opinion Quarterly 53 (4), 495-524.
Turner, C. F., & Krauss, E. (1978). Fallible indicators of the subjective state of the nation. American Psychologist 33 (5), 456-470.
Walonick, D. S. (1993). StatPac Gold IV: Survey and marketing research edition. Minneapolis: StatPac, Inc.
Weiss, C. (1968). Validity of welfare mothers' interview responses. Public Opinion Quarterly 32 (2), 287-294.
Williams, J. A., Jr. (1968). Interviewer role performance: A further note on bias in the information interview. Public Opinion Quarterly 32 (2), 287-294.
Wilson, T. D., Dunn, D., Bybee, J., Hyman, D., & Rotondo, J. (1984). Effects of analyzing reasons on attitude-behavior consistency. Journal of Personality and Social Psychology 47 (1), 5-16.
Zillman, D. (1972). Rhetorical elicitation of agreement in persuasion. Journal of Personality and Social Psychology 21 (2), 159-165.
APPENDIX A
Cover Letter & Questionnaire That Was Sent To Researchers
January 25, 1994
Dear Researcher:
Have you ever wondered about the best wording for an item on a questionnaire?
If you're like me, you probably spend a lot of time trying to figure out just the right wording for your survey questions. Sometimes the wording is obvious. Other times, it's not clear which is the best question wording--or even if there is a "best" question wording.
We are conducting a very simple experiment to study how researchers formulate questions for opinion and attitude surveys. I think you'll find it interesting and different from your "run-of-the-mill" survey.
The enclosed questionnaire will take less than five minutes to complete. Your responses are very important, and they will tell us much about how researchers choose question wording options.
There are no right or wrong answers. Please complete the questionnaire as soon as possible, and mail it back in the enclosed pre-stamped envelope. Feel free to write comments anywhere on the questionnaire.
Thank you for your participation in this study.
Sincerely,
David S. Walonick
President
Question Wording Survey
How would you rate your own questionnaire design skills? (Circle one)
a) beginner b) average c) expert
You have been hired by [an, a conservative, a liberal] organization to find out how the public feels about government spending on six social issues. During the planning meeting, you [discover that the organization favors increased, reduced spending levels. You also] learn that the organization is considering hiring you to conduct several other surveys in the future. As you leave the meeting, the Director says to you, "We really need to know the truth, so be objective."
All six of your survey questions begin with: "Are we spending too much, too little, or the right amount on..."

For each item, circle the question wording that you would use in the survey, and circle the response you would personally give to the question you selected.

1. a) "...halting the rising crime rate?"
   b) "...law enforcement?"
      a) too much   b) too little   c) the right amount

2. a) "...drug rehabilitation?"
   b) "...dealing with drug addiction?"
      a) too much   b) too little   c) the right amount

3. a) "...assistance to the poor?"
   b) "...welfare?"
      a) too much   b) too little   c) the right amount

4. a) "...assistance to big cities?"
   b) "...solving problems of big cities?"
      a) too much   b) too little   c) the right amount

5. a) "...improving conditions of Blacks?"
   b) "...assistance to Blacks?"
      a) too much   b) too little   c) the right amount

6. a) "...Social Security?"
   b) "...protecting Social Security?"
      a) too much   b) too little   c) the right amount
Thank you. Please send your completed survey to: StatPac Inc. 4425 Thomas Ave. S.
Minneapolis, MN 55410