ABSTRACT
DO RESEARCHERS INFLUENCE SURVEY RESULTS WITH THEIR QUESTION WORDING CHOICES?
By
David S. Walonick
May, 1994
Abstract
A review of the literature revealed that there can be large differences in the way that
people respond to public opinion surveys depending on the phraseology of the questions.
The purpose of this study was to determine whether survey researchers
unknowingly influence the results of a survey through their question wording choices. A
survey was mailed to 953 people who had some involvement in the survey research
process. A short vignette told respondents that they had been hired to design a public
opinion survey. One third of the respondents were told that the sponsor for their study
was a conservative anti-spending group, another third were told that their sponsor was a
liberal pro-spending group, and for the last third, no sponsor was identified. The survey
presented a pair of question wording alternatives for six different social issues.
Respondents were asked to select the one they would use in a public opinion survey.
They were also asked how they would personally answer the question wording that they
had selected. A total of 361 researchers responded to the survey. A variety of different
statistical techniques were used to test each of the hypotheses, including the chi-square
statistic, gamma, Student's t-test between means, the one-sample t-test between
proportions, and logistic regression analysis. The results indicated that there was a fairly
strong tendency for researchers to consistently favor one form of question wording or the
other. In addition, researchers decidedly selected the question wording that would sway
public response to support their personal opinion. Only minor relationships were found
between question wording choices and the persuasion of the study sponsor or the self-
assessed experience level of the researcher.
Table of Contents
I. Introduction....................................................................................................... 1
Statement of the Problem..................................................................................... 2
Purpose of the Proposed Research....................................................................... 3
Research Questions.............................................................................................. 4
Null Hypotheses.................................................................................................. 4
Significance of the Problem................................................................................. 5
Definitions of Terms............................................................................................ 6
II. Review of the Literature............................................................................ 9
Interviewer Effects.............................................................................................. 9
Interviewer Errors....................................................................................... 10
Social Distance Between the Interviewer and the Respondent..................... 10
Interviewer Verbal Cues.............................................................................. 11
Response Effects.................................................................................................. 12
Question Order............................................................................................ 13
Middle Alternatives..................................................................................... 15
"Don't Know" Alternatives.......................................................................... 16
Presenting One or Two Sides of an Issue.................................................... 17
Assertion Versus Interrogation Format........................................................ 20
Vague Quantifiers....................................................................................... 22
Wording of Questions and Response Alternatives....................................... 26
Attitude of the Survey Designer........................................................................... 29
III. Methodology................................................................................................. 30
Research Design.................................................................................................. 30
Sample Selection................................................................................................. 33
Analysis............................................................................................................... 36
Liberal and Conservative t-Test between Proportions................................. 38
One-Way Classification Chi-Square Test.................................................... 40
Two-Way Classification Chi-Square Test................................................... 42
t-Test between Means................................................................................. 42
Logistic Regression..................................................................................... 43
Multivariate Models.................................................................................... 47
Analyses of the Hypotheses Broken Down by Question Wording Pairs............................................................................................. 49
Methodological Limitations................................................................................. 49
Validity and Reliability........................................................................................ 50
Procedures and Timetable.................................................................................... 53
IV. Results.............................................................................................................. 54
Response and Non-Response............................................................................... 54
Respondents' Comments...................................................................................... 56
Crime.......................................................................................................... 60
Drugs.......................................................................................................... 60
Welfare....................................................................................................... 60
Cities........................................................................................................... 61
Blacks......................................................................................................... 62
Social Security............................................................................................ 62
Summary of Comments............................................................................... 63
Null Hypothesis Testing...................................................................................... 63
Null Hypothesis 1....................................................................................... 64
Null Hypothesis 2....................................................................................... 70
Null Hypothesis 3....................................................................................... 76
Multivariate Models to Test the First Three Hypotheses............................. 81
Null Hypothesis 4....................................................................................... 84
Additional Findings..................................................................................... 93
V. Conclusions and Recommendations.................................................... 97
Summary............................................................................................................. 97
Conclusions......................................................................................................... 99
Discussion and Recommendations....................................................................... 102
VI. References...................................................................................................... 110
VII. Appendices
Appendix A: Cover Letter and Questionnaire that was Sent to Researchers.......................................................................................................... 115
Appendix B: Vitae of David S. Walonick............................................................ 118
Tables
Table 1: A comparison of Simpson's and Hakel's findings on the meaning of vague quantifiers......................................................................................... 24
Table 2: Summary of Rasinski's results on issue labeling (1984-1986)..................... 28
Table 3: Rasinski's (1989) issue labels that were replicated in this study................... 32
Table 4: Construction of the liberal and conservative dichotomous scales................. 39
Table 5: Construction of the unbiased (neutral) scale................................................ 41
Table 6: Construction of the 2 x 3 contingency table for the two-way classification chi-square test........................................................................ 42
Table 7: Summary of nine statistical tests used for testing the first two hypotheses................................................................................................... 47
Table 8: Response rate information broken down by the sponsorship variable.......... 54
Table 9: Number of valid responses and completion rates for all items on the survey......................................................................................................... 55
Table 10: Number of comments and response percent ranked by frequency of mention....................................................................................................... 63
Table 11: Hypothesis 1 - Liberal t-test between proportions....................................... 64
Table 12: Hypothesis 1 - Conservative t-test between proportions.............................. 65
Table 13: Hypothesis 1 - One-way classification chi-square test................................. 65
Table 14: Hypothesis 1 - Two-way classification chi-square test................................ 66
Table 15: Hypothesis 1 - t-test between means............................................................ 67
Table 16: Hypothesis 1 - Logistic regression with dummy IV's (t-test)....................... 67
Table 17: Hypothesis 1 - Logistic regression with an interval IV (t-test)..................... 68
Table 18: Hypothesis 1 - Summary of conclusions for the first null hypothesis.......... 69
Table 19: Hypothesis 1 - Probability levels for each of the nine tests broken down by issue....................................................................................................... 70
Table 20: Hypothesis 2 - Liberal t-test between proportions....................................... 71
Table 21: Hypothesis 2 - Conservative t-test between proportions.............................. 71
Table 22: Hypothesis 2 - One-way classification chi-square test................................. 72
Table 23: Hypothesis 2 - Two-way classification chi-square test................................ 72
Table 24: Hypothesis 2 - t-test between means............................................................ 73
Table 25: Hypothesis 2 - Logistic regression with dummy IV's (t-test) and logistic regression with an interval IV (t-test)............................................. 74
Table 26: Hypothesis 2 - Summary of conclusions for the second null hypothesis...... 75
Table 27: Hypothesis 2 - Probability levels for each of the nine tests broken down by issue............................................................................................. 76
Table 28: Hypothesis 3 - Two-way classification chi-square test................................ 77
Table 29: Hypothesis 3 - t-test between means............................................................ 78
Table 30: Hypothesis 3 - Logistic regression with dummy IV's (t-test)....................... 78
Table 31: Hypothesis 3 - Logistic regression with an interval IV (t-test)...................... 79
Table 32: Hypothesis 3 - Summary of conclusions for the third null hypothesis......... 80
Table 33: Hypothesis 3 - Probability levels for each of the six tests broken down by issue....................................................................................................... 81
Table 34: Hypotheses 1-3 - Multivariate logistic regression with dummy IV's............ 82
Table 35: Hypotheses 1-3 - Multivariate logistic regression with interval IV's............ 83
Table 36: Hypotheses 1-3 - Logistic regression coefficients for the multivariate model with interval IV's.............................................................................. 83
Table 37: Hypothesis 4, part 1 - Two-way classification chi-square test using the liberal definition of the relationship............................................................ 85
Table 38: Hypothesis 4, part 1 - Two-way classification chi-square test using the conservative definition of the relationship................................................... 85
Table 39: Hypothesis 4, part 1 - Two-way classification chi-square test using the unbiased definition of the relationship......................................................... 86
Table 40: Hypothesis 4, part 1 - t-test between means using the unbiased definition of the relationship....................................................................... 87
Table 41: Hypothesis 4, part 1 - One-way ANOVA using the unbiased definition of the relationship....................................................................................... 88
Table 42: Hypothesis 4, part 1 - Summary of conclusions for the fourth null hypothesis................................................................................................... 88
Table 43: Hypothesis 4, part 2 - Two-way classification chi-square test using the liberal definition of the relationship............................................................ 89
Table 44: Hypothesis 4, part 2 - Two-way classification chi-square test using the conservative definition of the relationship................................................... 90
Table 45: Hypothesis 4, part 2 - Two-way classification chi-square test using the unbiased definition of the relationship......................................................... 91
Table 46: Hypothesis 4, part 2 - t-test between means using the unbiased definition of the relationship....................................................................... 92
Table 47: Hypothesis 4, part 2 - One-way ANOVA using the unbiased definition of the relationship....................................................................................... 92
Table 48: Hypothesis 4, part 2 - Summary of conclusions for the fourth null hypothesis................................................................................................... 93
Table 49: Other findings - Two-way contingency table comparing responses to the binomial distribution............................................................................. 95
CHAPTER I
Introduction
Is survey research valid? Most social scientists see the written survey as a way of
measuring the attitudes and beliefs of a population or subpopulation. Great care is taken
in the preparation of the survey instrument. Yet, little is known about the ways in which
a researcher's own beliefs might affect the results of a study.
In the last fifty years, thousands of studies have been conducted to examine the specific
characteristics of survey research. Clearly, the way people respond to questions (or
whether they respond at all) depends to some extent upon the characteristics of the
survey itself.
Do researchers unconsciously incorporate their personal beliefs or those of the sponsor
into a survey? Scholarly research involves a concentrated effort to maintain objectivity,
but in spite of the best of intentions, are researchers subconsciously "rigging" the results?
Is objectivity compromised by the choice or wording of questions?
Objectivity may be an illusion in surveys. Relativity may be a more appropriate model.
Obviously, understanding the relationship between the investigator and the respondent is
of paramount importance to the survey research community. Countless decisions are
made on the basis of survey research. It is imperative that we know if the research is
valid.
Statement of the Problem
To what extent do survey researchers unknowingly incorporate their personal beliefs (or
those of the sponsor) into a survey, through their question wording choices?
There is substantial evidence to demonstrate that people's answers to questions are
influenced by a number of factors, including form effects, interviewer effects, and
response effects. Some studies have shown that these can dramatically alter the
conclusions that a researcher would draw from the results. Recent studies have
demonstrated that even minor changes in question wording can affect the way that people
respond. The implication is unavoidable: The validity of all survey research is in
question. Possibly, the concept of validity itself needs to be examined.
All survey researchers strive for objectivity. It might be considered the signature of the
scientific method. Yet, we are forced to wonder whether or not it is possible to remain
truly objective. How can we be sure that a questionnaire does not reflect the beliefs of
its creator(s)? Is it possible that researchers unknowingly incorporate their own attitudes
or those of the sponsor into the construction of their questionnaires? If so, are there
differences depending on the experience of the researcher? These are important
questions that strike at the heart of survey research.
There are many ways in which investigators might unknowingly influence the results of
their research. The dramatic impact of interviewer effects is well-documented. Form
effects and response effects have also been shown to be present in surveys. Working
together, these factors can create staggering changes in response distributions.
Of particular interest are the effects of question wording. Seemingly minor alterations in
question wording can produce significant differences in respondents' answers. When a
survey is designed to predict behavior, question wording can be validated by correlating
respondents' answers with their behaviors. However, attitude and public opinion surveys
generally lack an objective standard by which to judge the validity of question wording.
In the absence of an objective reference, there is no way for a researcher to determine
which form of question wording produces the most accurate barometer of public opinion.
In other words, it may not be possible to determine which is the "best" form of question
wording.
Survey designers often make choices about question wording without validating
information. Why do they choose one form of a question over another? Do they
subconsciously select the question wording that will support their own views? Question
wording may be a form of modeling, where investigators consciously or subconsciously
project their own views onto those being studied.
Purpose of the Proposed Research
The purpose of this study was to understand the degree to which researchers
unknowingly incorporate their personal beliefs or those of the sponsor into a survey,
through their question wording choices.
Research Questions
Do survey researchers unknowingly influence the results of a survey through their
question wording choices?
Do the personal opinions of researchers influence their question wording choices?
Does the persuasion of the sponsor of a survey influence researchers' question wording
choices?
Are there differences between experienced and inexperienced researchers in their
question wording choices?
Are there differences between beginning and advanced researchers with respect to the
degree to which they incorporate their personal opinions and those of the sponsor into
their question wording choices?
Null Hypotheses
1. Researchers' choices for question wording are not related to their personal
opinions.
2. Researchers' choices for question wording are not related to the persuasion of the
study sponsor.
3. Researchers' choices for question wording are not related to their self-assessed
experience in questionnaire design.
4. There are no differences between beginning and advanced researchers with
respect to the degree to which they incorporate their personal opinions and those
of the sponsor into their question wording choices.
Significance of the Problem
Every day, a large number of decisions are made based on the results of survey research.
Companies use surveys to develop products and marketing strategies. Public service
organizations use surveys to identify subpopulations and to document need. Government
officials rely heavily on surveys to tap public opinion. Most members of our society
have been affected by decisions made on the basis of surveys. The assumption is that
survey research is valid. The "margin of error" is usually reported for surveys and this
might lull the users of survey research into believing the results are valid. However,
sampling error (i.e., the margin of error) captures only one source of error. There are other
sources of bias with far greater potential to reduce our confidence in the results of a survey.
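As an illustrative aside, the conventional 95% margin of error is a simple function of the sample proportion and the sample size, and it reflects sampling variability alone. The following sketch (a hypothetical illustration, not part of the study's analysis; the sample size of 361 is borrowed from this study's response count purely as an example) makes this concrete:

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """95% margin of error for a sample proportion.

    This quantifies sampling error only -- it says nothing about bias
    introduced by question wording, interviewers, or nonresponse.
    """
    return z * math.sqrt(p * (1 - p) / n)

# A 50/50 split among 361 respondents:
moe = margin_of_error(0.5, 361)
print(f"+/- {moe * 100:.1f} percentage points")  # roughly +/- 5.2 points
```

Note that quadrupling the sample size only halves this figure, while a biased question wording can shift responses by far more than the reported margin, which is exactly why the margin of error alone can lull users into overconfidence.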
Many forms of research bias have been identified and studied. Very little is known,
however, about how question wording might reflect the personal beliefs of a researcher,
or those of the study sponsor. Our concept of validity encompasses the idea that the
researcher is independent of the phenomena being studied. The problem is that this
might not be true. This study will investigate whether or not researchers might
unknowingly affect the results of surveys through their question wording choices.
This study will benefit all survey researchers and the recipients of research results. If the
null hypotheses are not rejected, researchers will have more confidence in the validity of
survey research and their abilities to design objective and unbiased surveys. If the null
hypotheses are rejected, then the validity of most surveys will be called into question and
researchers will be forced to closely examine the survey technique itself.
Definitions of Terms
Many terms common to survey research will be used in this study. It may be helpful to
define them at the outset.
Acquiescence: The tendency of respondents to agree more often than disagree.
Anchor: The words used to define the endpoints of an ordinal scale.
Change effect: The act of being interviewed promotes attitude formation or change
which otherwise might not have occurred.
Freezing effect: The act of being interviewed inhibits a change in a respondent which
otherwise might have occurred.
Interviewer effect: A change in a subject's responses due to some characteristic of the
interviewer, such as race, social distance, intonation, gestures, expectations, etc.
Modeling effect: A change in respondents' answers due to the conscious or unconscious
projection of the investigator's (laboratory experimenter, clinician, survey designer,
or interviewer) own views.
Panel survey: A survey in which the same respondents are asked the same questions on
two or more occasions. A panel survey is also referred to as a repeated measures
experiment.
Probing: An attempt by an interviewer to get more information from the respondent by
asking the respondent to clarify or give a more detailed answer.
Response effects: A change in subjects' responses due to some characteristic of the
survey itself.
Response rate: The percentage of respondents who complete the survey, or who answer a
question.
Salient (items): Items on a survey that are meaningful or important to the respondent.
Split-ballot experiment: An experimental design where two or more groups of similar
respondents are exposed to different treatment conditions.
Vague quantifier: An adjective or adverb without precise meaning, that describes
frequency, quantity, or intensity.
Vignette: An elaborated description of a concrete situation.
CHAPTER II
Review of the Literature
The survey process has been the subject of substantial research. Many of these studies
have focused on various ways of increasing response rates. This is important because it
defines an upper limit on the confidence we can place in a survey's results. Other studies
have examined the effects of the survey process itself. These are particularly interesting
because they directly address the issue of validity in survey research.
There is considerable evidence to show that the results of a survey can be influenced by
an interviewer. These include interviewer errors, the social distance between the
interviewer and the respondent, and interviewer verbal and visual cues. There are also
many studies that show that the results of a survey might be influenced by the serial order
and format of the questions and response alternatives. Strong response effects have been
documented from the inclusion of middle and "don't know" alternatives, the format of the
questions (e.g., assertion versus interrogation, or presenting one or both sides of an
issue), and question wording.
Interviewer Effects
"Interviewer effects" refer to changes in subjects' responses due to some characteristic of
the interviewer. The potential of an interviewer to affect people's responses has been
extensively studied. Effects have been documented as a result of an interviewer's race,
social distance from the respondent, intonations, gestures, and expectations. Early
studies concentrated on face-to-face interviews. For example, Skelly (1954) discussed
the potential of bias when an interviewer had a stereotypical appearance. More recent
studies using telephone interviews have confirmed the effect of verbal cues.
Interviewer Errors
The most obvious interviewer effect is interviewer error. Hanson and Marks (1958)
found that interviewers frequently omitted or altered question wording. In addition, they
discovered that interviewer probing tended to alter respondents' initial replies to a
question. Schyberger (1967) reported that interviewers often deviated from instructions
and only small differences were found between experienced and inexperienced
interviewers in the degree of deviation.
Social Distance Between the Interviewer and the Respondent
In 1968, three different studies proposed a social distance model to describe the
relationship between the interviewer and the respondent. All three researchers reported
that respondents tended to acquiesce (i.e., response marginals changed nearly twenty
percent); however, they differed in their findings regarding the social distance condition
that would create the acquiescence. Weiss (1968) interviewed welfare mothers and
found that a smaller social distance between the interviewer and the respondent produced
a greater bias. Williams (1968) found that a greater social distance produced a greater
bias. Dohrenwend, Colombotos, and Dohrenwend (1968) concluded that the best
interview data is obtained when the interviewer is not too close, or not too far away, in
social distance to the respondent. They argued that a deviation in either direction could
introduce bias (i.e., interviewer effects).
One of the most important findings, reported by Williams (1968), was that even when
social distance was held constant, the interviewer's role performance (i.e., rapport and
objectivity) still affected responses. "This suggests that objectivity is not only related to
interview bias but may be as significant as race of interviewer" (p. 291). Social distance
created some interviewer effects, but other interviewer effects were also present.
Interviewer Verbal Cues
Collins (1970) reported that interviewers' verbal idiosyncrasies, such as vocabulary
and verbosity, could strongly influence respondents, and concluded by expressing strong
doubt about the validity of many survey interviews.
Phillips and Clancy (1972) studied 25 white females who were all experienced telephone
interviewers. They developed a survey to measure nine areas of interest to social
scientists--general happiness, religiosity, number of friends, current health status,
prejudice, doctor visits, mental health status, need for social approval, and dissimulation.
Dissimulation was measured through people's responses regarding their use of ten
nonexistent consumer products (e.g., books, movies, products, etc.). Ostensibly, as part
of their training, the interviewers were asked to complete the same interview that they
would be administering to respondents. A sample of 404 telephone customers was
selected and randomly assigned to the interviewers. Because of the small number of
interviewers, the researchers recoded the responses into "low" and "high" dichotomies on
each of the items. In eight of the nine measures, respondents' attitudes showed positive
correlations with the interviewers' attitudes, although the effect was small and not
statistically significant. They concluded that "because modeling effects are such a minor
source of bias, they are not worthy of further consideration" (p. 253).
In contrast, Barath and Cannell (1976) found that interviewers' voice intonations had
strong effects on respondents' answers to questions. However, this effect did not seem to
apply to "yes-no" dichotomous questions (Blair, 1977). Blair's findings suggest that
Phillips and Clancy (1972) may have masked the strength of the interviewer effect by
recoding the interviewer responses into "high-low" dichotomies.
Response Effects
Response effects refer to a change in people's responses because of some characteristic of
the survey itself. The most obvious response effect might be created by the mode of the
survey (e.g., telephone, mail, face-to-face interview). Other examples of response effects
come from the characteristics of the survey instrument (e.g., the length of the
questionnaire, the order of question presentation, question wording, the order of the
response options, the use of no-opinion filters and middle-response options).
Question Order
Many early researchers suggested that surveys should begin with a few non-threatening
and easy-to-answer items (Erdos, 1957; Robinson, 1952; Sletto, 1940). The rationale was
that people would not complete the survey if the first items were too difficult or
threatening.
More recent studies have found that the most important items should appear near the
beginning of a survey (Kraut, Wolfson, and Rothenberg, 1975; Levine and Gordon,
1958; Mullner, Levy, Byre, and Matthews, 1982). The rationale is that people generally
look at the first few questions before deciding whether or not to complete the
questionnaire (Levine and Gordon, 1958). In addition, respondents often send back
partially completed questionnaires, or they decide to terminate an interview before it is
completed. By putting the most important items near the beginning, the partially
completed surveys will still contain important information.
In a study of 5,842 hospital CEO's, Mullner et al. (1982) found that the response rate on
a written questionnaire was significantly affected by the order of the questions.
Questionnaires that began with the most salient items produced greater response rates
than those in the reverse order. Another team of investigators reported that questions in
the latter half of a questionnaire were much less likely to contain extreme responses, and
they were also more likely to be omitted (Kraut, Wolfson, and Rothenberg, 1975).
Carp (1974) suggested that it may be necessary to present general questions before
specific ones in order to avoid response contamination. In contrast, McFarland (1981)
reported that when specific questions were asked before general questions, respondents
showed greater interest in the general questions.
Most investigators have found that the order in which questions are presented can affect
the way that people respond (Noelle-Neumann, 1970; Schuman and Presser, 1981;
Smith, 1982; Sudman and Bradburn, 1974; Turner and Krauss, 1978; Tourangeau and
Rasinski, 1988; Tourangeau, Rasinski, Bradburn, and D'Andrade, 1989). Tourangeau
and Rasinski (1988) proposed a four-stage process to describe how people respond to
attitude questions. They conclude that the order of the questions can affect the entire
process.
Respondents first interpret the attitude question, determining what attitude the question is about. Then they retrieve relevant beliefs and feelings. Next, they apply these beliefs in rendering the appropriate judgment. Finally, they use this judgment to select a response. All four of the component processes can be affected by prior items. (p. 299)
In contrast, other researchers have reported that question order does not affect responses.
Bradburn and Mason (1964) reported that interviews involving self-reports and self-
evaluations were unaffected by question order. Clancey and Wachsler (1971) found that
responses to questions were similar regardless of where the questions appeared in a
questionnaire. Smith (1982) argued that the serial order of the questions was less of a
problem in written surveys because respondents were free to go back and change
previous answers. Bishop, Hippler, Schwarz, and Strack (1988) compared response
effects in self-administered and telephone surveys. They also concluded that the serial
order of questions is less likely to produce response effects in written questionnaires
because respondents are free to read over the entire questionnaire, or re-read selected
portions. In telephone surveys, respondents do not have this option and their responses
are more likely to be "off-the-top of their heads" (Hippler and Schwarz, 1987). In a
recent study, Ayidiya and McClendon (1990) reported that order effects existed, but to
varying degrees, depending on the questions.
Middle Alternatives
Bishop (1987) conducted a series of random-digit-dialed telephone surveys to determine
the effect of offering a middle response to subjects. Three societal issues (social security,
defense spending, and nuclear energy) were used to formulate questions with and without
a middle response. In these experiments, the middle responses were to maintain the
present social security benefits, to maintain the current level for defense spending, and to
operate only those nuclear power plants that are already built. Bishop found that
including or excluding a middle response could sufficiently change the results so that a
researcher would draw different conclusions about people's opinions, although the effect
was inconsistent. Furthermore, when the middle alternative was presented at the end of
the question, more people tended to select it.
Ayidiya and McClendon (1990) also found that more people select a middle response
when it is specifically offered. However, in contrast to Bishop (1987), they concluded
that excluding a middle alternative from an item would not alter the bipolar response
pattern for the item. In other words, a researcher would have come to the same
conclusions regardless of whether there was a middle response option or not.
"Don't Know" Alternatives
Poe, Seeman, McLaughlin, Mehl, and Dietz (1988) studied the effect of including a
"don't know" (DK) option in a questionnaire consisting of factual questions. The sample
of 1,360 subjects was randomly assigned to one of two groups. One group was told to
check the DK box if they didn't know an answer and the other group was told to place a
question mark in the answer space if they didn't know an answer. The response rate to
both forms of their survey was about the same (61.5% with DK boxes and 58.2%
without). These investigators reported that the version with the DK box produced a
significantly higher percentage of "don't know" responses than the form without the DK
box. Furthermore, a telephone follow-up to 22 percent of the sample indicated that there
were no significant differences in error rates between the two question formats. The
authors found that some items produced large differences, while others did not.
However, "there was no single characteristic (e.g., potentially sensitive) that would
typify the questions which had higher substantive responses in the format without DK
boxes" (p. 218).
Ayidiya and McClendon (1990) also studied whether the inclusion of a "don't know"
alternative alters response patterns. Their sample consisted of 532 households drawn
from the Akron, Ohio phone book and their response rate was 63 percent. They reported
that the inclusion of the "don't know" alternative significantly decreased the number of
salient responses.
Presenting One or Two Sides of an Issue
Schuman and Presser (1981) studied the effects of balanced (i.e., one that offers a second
substantive alternative) versus imbalanced questions over a period of several years. Four
social issues (gun control, abortion, unions, and the fuel shortage) were presented to
respondents in a series of split-ballot experiments. Their balanced questions presented
respondents with two legitimate sides of an issue. For example, the balanced version of
their gun control question asked:
Would you favor a law which would require a person to obtain a police permit before he could buy a gun, or do you think such a law would interfere too much with the right of citizens to own guns? (p. 182)
In contrast, their imbalanced question presented only one side of the issue:
Would you favor a law that would require a person to obtain a police permit before he could buy a gun? (p. 182)
Schuman and Presser (1981) concluded that "it does not appear that purely formal
balance of attitude items makes a detectable difference in their univariate distributions"
(p. 184).
In contrast to Schuman and Presser (1981), many other researchers have found that
presenting subjects with a balanced question can significantly change subjects' responses
(Bishop, Oldendick, and Tuchfarber, 1982; Hedges, 1979; Kalton, Collins, and Brook,
1978; Noelle-Neumann, 1970; Payne, 1951; Rugg and Cantril, 1944).
Noelle-Neumann (1970) conducted 300 interviews with nonworking housewives. One
form of the question asked, "Would you like to have a job, if this were possible?" and
the other form asked, "Would you prefer to have a job, or do you prefer to do just your
housework?" With the first form of the question, 19 percent said they did not want a job,
and with the second form, 68 percent said they did not want a job. Noelle-Neumann
stated that the differences were "so staggering that it is apparent that much research needs
to be done to establish the psychological and cognitive reasons for this" (p. 200).
Bishop, Oldendick, and Tuchfarber (1982) examined whether presenting balanced
questions affected the marginal frequencies and the number of "don't know" answers.
Two telephone surveys (six months apart) were conducted and data was collected from
1,218 respondents. The interviewer asked respondents their opinions on nine public
policy and social issues (e.g., the power of the federal government, government
guaranteed employment, fair treatment of blacks, equal opportunities for women,
government involvement in desegregation, etc.). The questions were randomized and
some subjects received only one side of a question, while others received a two-sided
version of the question.
An example of one of their one-sided questions is:
Some people feel that the government in Washington should see to it that every person has a job and a good standard of living. Do you have an opinion on this or not? (IF YES) Do you agree or disagree with the idea that the government should see to it that every person has a job and a good standard of living? (p. 71)
The two-sided form of the same question was:
Some people feel that the government in Washington should see to it that every person has a job and a good standard of living. Others think the government should just let each person get ahead on his own. Have you been interested enough in this to favor one side over the other? (IF YES) Do you agree or disagree with the idea that the government should see to it that every person has a job and a good standard of living, or should it let each person get ahead on his own? (p. 71)
Note that this study attempted to balance both the background information ("Others think
the . . ."), and the question itself (". . . or should it let each person get ahead on his
own?"). The filter to remove the "no opinion" responses was also varied ("Do you have
an opinion on this or not?" versus "Have you been interested enough in this to favor one
side or the other?").
The results indicate that stating two sides of an issue usually (in eight of the nine
comparisons) produced changes in the marginal frequencies; however, the difference was
only significant in five of the nine comparisons. They conclude that offering respondents
another choice on an issue "will not necessarily attract them, though it may make them
more likely to 'think about' expressing an opinion" (p. 75). In addition, they found that
less educated respondents tended to acquiesce (agree) more often. This agreed with
previous findings of Jackman (1973) and Schuman and Presser (1977). The type of filter
question to eliminate "no opinion" responses did not produce significant differences in
opinions.
The researchers discussed their results in terms of information most accessible to
respondents' memories. They hypothesized that presenting a second side to an issue
places it in the most recent memory of the respondent, thus increasing the likelihood that
it will be selected. In other words, subjects tend to respond "with the first thing that
comes to mind" (Bishop et al., 1982, p. 78). Less educated respondents tended to choose
the second side more often because they had less environmental context information to
compete with the most recent memory (i.e., the second side of the issue).
The model proposed by Bishop et al. (1982) is that we "look at the survey interview as a
microcosmic communication and persuasion experiment, in which the interviewer is
presenting one- versus two-sided statements that are more or less persuasive
communication which the respondent is asked to accept or reject" (p. 80). Our
understanding of the survey process would then depend on our ability to assess the
strengths of the alternative positions on an issue. They conclude that presenting both
sides of an issue is not the answer, because the "other side of the issue" is not always
easily defined. Of particular importance was their recognition that the concept of
relativity brought all survey research into question.
. . . it becomes questionable whether we should say that one format or another is "biased" or "unbiased" since that is clearly a relative matter. Biased for whom? Only those with less than a high school education? Or does that depend upon the issue? For that matter, should we even bother to think about whether there is a "true" or "correct" wording and format for a given issue? Or are there only more or less useful ones for a given purpose of explanation or prediction? (p. 82-83)
Assertion Versus Interrogation Format
Several researchers have looked at the effects of using interrogation and assertion
formats. In a typical question using an interrogation format, the respondent is provided
with some brief background information about an issue and then asked a question about
the issue. In an assertion format, the respondent is provided with the same introductory
information; however, they are then asked to indicate their level of agreement or
disagreement with a particular assertion. This issue is of great concern to questionnaire
designers because agree/disagree attitude scales are so common in surveys.
In an early study by Zillmann (1972), subjects listened to a defense attorney's closing
arguments. One group heard assertion statements (e.g., "Johnny was a peaceful boy"),
while the other group heard interrogatories (e.g., "Johnny was a peaceful boy, wasn't
he?"). Subjects who heard the interrogatory version recommended significantly shorter
prison sentences. The act of hearing a question changed the respondents' attitudes.
Schuman and Presser (1981) conducted two studies to compare the assertion and
interrogation formats. In contrast to Zillmann, they concluded that "there is nothing
special about agree-disagree assertions as distinct from interrogative forms that produces
acquiescence" (p. 228).
Petty, Rennier, and Cacioppo (1987) studied the effects of wording survey items as either
questions or assertions. Ninety-one undergraduate students were the subjects of two
experiments. Both experiments involved students' attitudes towards a new product, when
preceded by either weak or strong background information. In one experiment,
background information was presented about a new calculator and the other was for a
new disposable razor. As predicted, strong background information encouraged students
to see the products as more desirable, F(1,45)=86.79, p<.001. However, the researchers
also found that the interrogation format caused greater polarization in subjects' responses,
F(1,45)=4.53, p<.04. When subjects were presented with strong background
information, the interrogation format produced more positive attitudes towards the
products, and when subjects were presented with weak background information, the
interrogation format produced more negative attitudes.
Petty, Rennier, and Cacioppo (1987) discuss their findings by hypothesizing that people
engage in greater cognitive processing when the interrogation format is used. That is,
asking a question produces greater item-relevant thinking than asking a person their level
of agreement or disagreement with a statement. However, when attitudes are used to
predict behavior, this is not necessarily a desirable feature. Wilson, Dunn, Bybee,
Hyman, and Rotondo (1984) found that thinking about one's attitude can actually reduce
the relationship between attitude and behavior. It was hypothesized that this most often
occurs when the attitude object has an affective, rather than cognitive, basis.
Other researchers have demonstrated that rhetorical questions are more effective at
persuasion than assertions (Burnkrant and Howard, 1984; Petty, Cacioppo, and
Heesacker, 1981; Swasy and Munch, 1985; Zillmann, 1972).
Vague Quantifiers
Survey researchers often ask respondents to judge frequency ("how often"), quantity
("how much"), or intensity ("how strongly"). The response alternatives for these
questions are usually presented as an ordinal scale made up of descriptive adjectives or
adverbs. These words have been appropriately dubbed "vague quantifiers," to emphasize
the imprecision of their meanings.
Mosier (1941) conducted one of the first studies of vague quantifiers and reported that
the meaning of these words varied between individuals. Mosier hypothesized that
"meaning" was distributed normally and that the mean represented the average
"meaning."
Simpson (1944) asked subjects to evaluate twenty different vague quantifiers and to give
each one a meaning by assigning a proportion to indicate its absolute frequency. Hakel
repeated the experiment in 1968. Both researchers used the median proportion to rank
the vague quantifiers in order of their perceived meanings. The rank order correlation
between the two experiments was .99; however, large differences were found between
the actual values of the medians. Hakel reported that "Variability is rampant. One man's
'rarely' is another man's 'hardly ever'" (p. 533). Table 1 shows a comparison of
Simpson's and Hakel's findings.
Cliff's (1959) research focused on words that intensify the phrase that they are modifying
(e.g. "quite", "very", "extremely"). These words have no value of their own, but rather,
they act like multipliers to move the meaning of the phrase closer to an extreme. Cliff
attempted to construct numeric coefficients to describe the degree to which intensifiers
altered the meaning of the phrase being modified. For example, Cliff found that "very
often" means 1.317 times as frequently as "often", and "slightly often" means .55 times
as frequently as "often."
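Cliff's multiplicative model lends itself to a simple illustration. The sketch below is hypothetical Python, not Cliff's procedure: the coefficients for "very" (1.317) and "slightly" (.55) come from the findings just cited, while the baseline frequency assigned to "often" is an arbitrary value chosen only for demonstration.

```python
# A minimal sketch of Cliff's (1959) multiplier model of intensifiers.
# The multipliers below are the values cited in the text; the baseline
# frequency for "often" (70 occurrences per 100 opportunities) is a
# hypothetical figure used purely for illustration.

INTENSIFIERS = {
    "very": 1.317,     # "very often" ~ 1.317 x "often"
    "slightly": 0.55,  # "slightly often" ~ 0.55 x "often"
    "": 1.0,           # unmodified adverb carries its own meaning
}

def implied_frequency(baseline: float, intensifier: str = "") -> float:
    """Scale a baseline frequency by an intensifier's multiplier."""
    return baseline * INTENSIFIERS[intensifier]

baseline_often = 70.0  # hypothetical baseline for "often"
print(round(implied_frequency(baseline_often, "very"), 2))      # 92.19
print(round(implied_frequency(baseline_often, "slightly"), 2))  # 38.5
```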
Table 1
A Comparison of Simpson's and Hakel's Findings on the Meaning of Vague Quantifiers
Simpson (1944)                   Hakel (1968)
Word                    Median   Word                    Median
Always                      99   Always                     100
Very often                  88   Very often                  87
Usually                     85   Usually                     79
Often                       78   Often                       74
Generally                   78   Rather often                74
Frequently                  73   Frequently                  72
Rather often                65   Generally                   72
About as often as not       50   About as often as not       50
Now and then                20   Now and then                34
Sometimes                   20   Sometimes                   29
Occasionally                20   Occasionally                28
Once in a while             15   Once in a while             22
Not often                   13   Not often                   16
Usually not                 10   Usually not                 16
Seldom                      10   Seldom                       9
Hardly ever                  7   Hardly ever                  8
Very seldom                  6   Very seldom                  7
Rarely                       5   Rarely                       5
Almost never                 3   Almost never                 2
Never                        0   Never                        0
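The .99 rank-order correlation reported by Hakel can be checked directly against the medians in Table 1. The sketch below computes a Spearman coefficient (a Pearson correlation of average ranks) in plain Python; it is offered as an illustration of the calculation, not as the original authors' method.

```python
# Medians from Table 1, in the row order shown (Simpson 1944 vs. Hakel 1968).
simpson = [99, 88, 85, 78, 78, 73, 65, 50, 20, 20, 20, 15, 13, 10, 10, 7, 6, 5, 3, 0]
hakel   = [100, 87, 79, 74, 74, 72, 72, 50, 34, 29, 28, 22, 16, 16, 9, 8, 7, 5, 2, 0]

def average_ranks(values):
    """Rank values (1 = smallest), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 0-based positions i..j, made 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Spearman's rho = Pearson correlation of the rank vectors.
rho = pearson(average_ranks(simpson), average_ranks(hakel))
print(round(rho, 2))
```

The near-perfect correlation confirms that the two samples agreed on the *ordering* of the quantifiers even though the absolute medians differed.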
Mosier (1941), Simpson (1944), Hakel (1968), and Cliff (1959) proceeded under the
assumption that a continuum could be established and the meaning of a vague quantifier
could be placed at a precise point on the continuum for a given individual. Several
researchers have challenged this notion (Chase, 1969; Parducci, 1968; Pepper and
Prytulak, 1974). Instead, these researchers believe that the meanings of these words are
flexible and come from the context in which they are used.
Chase (1969) used Hakel's list of vague quantifiers to construct two scales. One scale
was made up of high-frequency quantifiers ("occasionally", "now and then", "about as
often as not", "usually", and "very often"). The other scale consisted of low-frequency
quantifiers ("seldom", "not often", "once in a while", "occasionally", and "generally").
Questions containing both scales were presented to 34 students. Chase found no
significant differences in the response distributions, regardless of the scale. In other
words, people judged the meaning of a vague quantifier relative to the other response
alternatives and not according to some absolute meaning.
Pepper and Prytulak (1974) provided additional evidence to suggest that vague
quantifiers lack absolute meaning. They hypothesized that the meaning of vague
quantifiers would be perceived relative to the frequency of the events they are modifying.
Respondents were asked to assign a numerical value to represent the meaning of vague
quantifiers for highly probable events (e.g., gunfire in a Hollywood western), and highly
improbable events (e.g., an airplane crash). As hypothesized, the numerical estimates for
"very often", "frequently", "sometimes", "seldom", and "almost never" changed
depending on the frequency of the event. When describing a high-probability event,
"often" meant more often than when it described a low-probability event.
Bradburn and Miles (1979) asked respondents about the frequency of five positive and
five negative feelings. The response categories were "never", "not too often", "pretty
often", and "very often." After selecting a response category, respondents were asked
how many times a day this meant. The results of this study supported Cliff's (1959)
findings that "very often" is about 1.3 times as frequently as "often." This study also
found that the meaning of "not too often" was different, depending on whether it
described a positive or negative feeling (e.g., excited versus bored). The meanings of
"pretty often" and "very often" seemed to remain stable regardless of the feelings they
were describing.
Schaeffer (1991) used existing data on 1,172 adults to examine whether vague quantifiers
are interpreted differently by demographic groups of race, sex, education, and age.
Absolute frequency reports were compared to grouped response categories. Two
variables were investigated--excitement and boredom. Respondents were first asked,
"How often do you feel particularly... excited or interested in something?...[bored?]"
The response categories were the same as those used by Bradburn and Miles in 1979
("never", "not too often", "pretty often", and "very often"). After choosing a response
category, they were asked, "About how many times a week or a month did you mean?"
The log of the absolute frequencies was used for the analysis--the intent being to reduce
the effect of outliers. Schaeffer found significant differences in the meaning of the
response categories based on race F(1,1056)=4.85, p=.03, education F(2,1089)=20.71,
p<.01, and age F(3,1084)=18.17, p<.01. No differences were found between males and
females F(1,1094)=.46, p=.50. The results suggest that the choice of using absolute
frequencies versus response categories can change the conclusions that a researcher
would draw from the data.
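Schaeffer's use of logged frequencies can be motivated with a toy example. The sketch below uses hypothetical frequency reports (not Schaeffer's data) to show why a log transformation blunts the influence of a single extreme respondent on the group mean.

```python
# Illustration: one extreme respondent shifts an arithmetic mean far more
# than it shifts a mean of logs. The "times per week" values are invented
# solely to demonstrate the effect of the transformation.
import math

typical = [2, 3, 4, 5]          # hypothetical frequency reports
with_outlier = typical + [100]  # add one extreme respondent

def mean(xs):
    return sum(xs) / len(xs)

def mean_of_logs(xs):
    return mean([math.log(x) for x in xs])

shift_raw = mean(with_outlier) - mean(typical)
shift_log = mean_of_logs(with_outlier) - mean_of_logs(typical)
print(shift_raw)  # large shift on the raw scale
print(shift_log)  # much smaller shift on the log scale
```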
Wording of Questions and Response Alternatives
Many studies have shown that slight changes in the wording of a question or the response
alternatives can affect the way that people respond. Bishop et al. (1978) examined the
effect of question wording on several political issues and found that "the gap in the
magnitude of association generated by the two forms can only be described as massive"
(p. 85).
In a set of informal experiments, Krosnick (1989) found that alternative wordings for
response categories can significantly change marginal frequencies. Different labels for
the response scale changed the way people responded. Three different scales were tested:
1) "very acceptable", "somewhat acceptable", "not too acceptable", and "not acceptable at
all"; 2) "strongly favor", "somewhat favor", "favor a little", and "not favor at all";
3) "strongly support", "somewhat support", "support a little", and "not support at all."
The scales using "acceptable" and "favor" produced similar results, but the "support"
scale was strongly skewed toward the "not support at all" anchor.
Rasinski (1989) conducted a series of split-ballot experiments to examine the effects of
question wording (issue labels) on people's attitudes towards government spending
policies. The study asked respondents, "Are we spending too much, too little, or about
the right amount on...", and this was followed by a particular way of identifying a
government program. Large significant differences (p<.001) were noted between
responses for many of the wording variations. Furthermore, this questionnaire was
repeated for three consecutive years (1984-1986) and the differences in responses to the
wording variations remained stable over time. Table 2 summarizes Rasinski's results,
averaged for the three years.
Table 2
Summary of Rasinski's Results on Issue Labeling (1984-1986)
"Are we spending too much, too little, or about the right amount on..."
Issue Label Alternative Issue Label Percent Difference
"halting the rising crime rate" (67.8%) "law enforcement" (55.7%) 12.1%
"dealing with drug addiction" (63.9%) "drug rehabilitation" (54.6%) 9.3%
"assistance to the poor" (64.0%) "welfare" (22.7%) 41.3%
"assistance to big cities" (19.9%) "solving problems of big cities" (48.6%) 28.7%
"assistance to Blacks" (27.6%) "improving conditions of Blacks" (35.6%) 8.0%
"Social Security" (53.2%) "protecting Social Security" (68.2%) 15.0%
Note: Percentages are the proportion of respondents that said "too little" is spent (averaged over three years).
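The "Percent Difference" column in Table 2 is simply the absolute gap between the two labels' "too little" percentages. As a check, the sketch below recomputes it from the tabled values.

```python
# Recompute Table 2's "Percent Difference" column from the label percentages.
rows = [
    ("halting the rising crime rate", 67.8, "law enforcement", 55.7),
    ("dealing with drug addiction", 63.9, "drug rehabilitation", 54.6),
    ("assistance to the poor", 64.0, "welfare", 22.7),
    ("assistance to big cities", 19.9, "solving problems of big cities", 48.6),
    ("assistance to Blacks", 27.6, "improving conditions of Blacks", 35.6),
    ("Social Security", 53.2, "protecting Social Security", 68.2),
]

for label_a, pct_a, label_b, pct_b in rows:
    diff = round(abs(pct_a - pct_b), 1)
    print(f"{label_a!r} vs {label_b!r}: {diff}%")
```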
Rasinski (1989) discussed the results by hypothesizing that different labels "may bring to
mind different associations, actually changing the stimuli to which respondents are
reacting" (p. 392). Smith (1987) had observed similar effects in labeling welfare issues;
however, Rasinski found that the effect was also present for a variety of other social
issues.
Rasinski's (1989) findings are disturbing. How can researchers have confidence in the
validity of their studies when small wording changes can have such a profound impact?
Seemingly, "assistance to the poor" and "welfare" are close in meaning, yet, the
difference in the way people respond to these issue labels exceeds 41 percent. A public
opinion researcher choosing one label might come to a completely different conclusion
than another researcher using a different label. Yet, both labels seem to be reasonable
choices for a survey designer.
Rasinski (1989) reports that his research contributes "new examples of successful and
failed question wording experiments" (p. 394). One might ask the important question,
"successful or failed for whom?" Without taking into account the relativity of the
researcher or sponsor, there is no way to ascertain which label was successful and which
one failed. Rasinski concludes that progress in this area will come from researchers in
the area of cognition and communication. Another possibility is that it will come from
the application of relativity to the research process. Regardless, these results cast strong
doubts on the validity of public opinion and attitude research.
Attitude of the Survey Designer
Many of the above mentioned researchers came to the conclusion that the results of a
survey could be substantially altered by the survey instrument or interviewer. Some
researchers examined the cognitive and contextual processes involved in respondents'
decisions. However, none of them looked at the role of the survey designer in the
creation of the questions.
This researcher could not locate any studies that investigated the relationship between
survey designers' attitudes and the questions they develop. This study will break new
ground in that it recognizes that survey instruments do not just "pop into being." On the
contrary, they are generally carefully constructed by people sincerely interested in
finding the truth on an issue. We know much about the way people respond to questions.
This study will add to our knowledge by looking at how researchers formulate those
questions.
CHAPTER III
Methodology
Do survey researchers unknowingly influence the results of a survey through their
question wording choices? This study tested four null hypotheses. These are:
1. Researchers' choices for question wording are not related to their personal
opinions.
2. Researchers' choices for question wording are not related to the persuasion of the
study sponsor.
3. Researchers' choices for question wording are not related to their self-assessed
experience in questionnaire design.
4. There is no difference between beginning and advanced researchers with respect
to the degree to which they incorporate their personal opinions and those of the
sponsor into their choice of question wording.
Research Design
This study used a mail questionnaire to examine a nonprobability sample of researchers
who have some involvement in survey research. Subjects received a questionnaire that
asked them to "design" a public opinion poll to investigate how people feel about
government spending on six social issues.
The questionnaire itself contained four components. The first component was a short
introductory paragraph that placed the respondents in a hypothetical situation where they
were hired by a sponsoring agency to conduct a public opinion survey. The second
component asked respondents to choose one of the two issue labels for each of the six
issues. The third component asked respondents how they would personally answer the
six questions. The final component asked them to provide a self-rating of their
experience as a designer of surveys (beginner, average, or expert). Appendix A contains
a copy of the cover letter and survey. The response alternatives for questions 2, 4, and 6
were reversed in order to nullify any serial-order or response set effects.
Two different forms of question wording were presented for each of the six social issues.
These are the same issue labels (i.e., question wording alternatives) used by Rasinski
(1989). They were selected for this study because of their demonstrated ability to
produce consistent differential response patterns. Respondents were asked to select the
issue labels that they would use in a public opinion poll.
Each question was prefaced with, "Are we spending too much, too little, or about the
right amount on . . ." The respondent then picked one of the issue labels to complete the
question. Table 3 shows the alternative issue labels for each of the six social issues. In
Rasinski's experiments, the first column of issue labels was more likely to produce a
higher percentage of people who say that "too little is spent."
Table 3
Rasinski's (1989) Issue Labels that were Replicated in this Study
"Are we spending too much, too little, or about the right amount on . . ."
Issue <-------------------------- Alternative Issue Labels -------------------------->
Crime "halting the rising crime rate" "law enforcement"
Drug addiction "dealing with drug addiction" "drug rehabilitation"
Welfare "assistance to the poor" "welfare"
Cities "solving problems of big cities" "assistance to big cities"
Blacks "improving conditions of Blacks" "assistance to Blacks"
Social Security "protecting Social Security" "Social Security"
Note: This table is arranged such that the first column of issue labels is the one that people are more likely to respond that "too little is spent."
Subjects were randomly assigned to receive one of three different forms of the survey.
The only difference between the three forms was in a one paragraph introduction that
described the persuasion of the sponsor of the study.
You have been hired by a [liberal/conservative] organization to find out how the
public really feels about government spending on six social issues. During the
planning meeting, you [discover that the organization favors increased/reduced
spending levels. You also] learn that the organization is considering hiring you to
conduct several other surveys in the future. As you leave the meeting, the
Director says to you, "We really need to know the truth, so be objective."
In one form of the survey, the sponsor was a liberal organization favoring increased
program spending, and in another form of the survey, the sponsor was a
conservative organization favoring reduced spending. In the third form of the survey, the
sponsor was not identified (i.e., the bracketed text was excluded). This served as a
control group for the sponsorship variable.
It is important to note that the opening paragraph also told respondents to "be objective."
This was included because it more closely approximates the "real-world" research
process. On a conscious level, sponsors of research are nearly always interested in
finding the "truth" on an issue, regardless of their own persuasion. If sponsorship effects
exist, they do so in spite of a sponsor's conscious desire to find the truth. In addition, it
was believed that the inclusion of the instructions to "be objective" would inform
respondents that they should not make deliberate attempts to satisfy the desires of the
sponsor.
At the outset of this study, it was assumed that respondents would probably understand
the purpose of this study as soon as they looked at the questionnaire. No attempts were
made to conceal the goals of this research and none were needed. In the "real world", if
researchers' personal opinions (or those of the sponsor) affect their choices of question
wording, then it happens in spite of researchers' conscious efforts to remain objective.
Sample Selection
The population for this study was all researchers who design surveys. This, of course,
includes a wide variety of people from many different disciplines and with varying levels
of survey design experience. Obviously, it was not possible to identify all members of
the population and therefore, a random sample could not be chosen. Since this study was
exploratory in nature, a convenience sample was appropriate. This nonprobability
method is often used during preliminary and exploratory research efforts to get rough
estimates of the results.
Rasinski's (1989) experiments were conducted over a period of three years (1984-1986).
During that time, the minimum observed effect of issue labeling was 5.5 percent in 1985
(28.2% said too little is spent on "assistance to Blacks" versus 33.7% for "improving
conditions of Blacks"). Therefore, the sample size for this study was selected to allow a
5.5 percent difference to be significant at the 95 percent confidence level. The formula
to determine sample size given an expected difference between two percents is:
N = Z² · [(P1 + P2) - D²] / D²
where,
N is the required sample size.
Z is the standard normal deviate for the desired confidence level.
P1 and P2 are the proportions for the first and second groups.
D is the difference between the two proportions (P1 - P2).
This study used a one-tailed test of significance because the directions of the effects were
predicted. The Z value required to produce a one-tailed significance level of 95 percent
is 1.645. Thus, for this study, the required sample size was calculated to be 551.
N = (1.645)² · [(.282 + .337) - (.282 - .337)²] / (.282 - .337)²
  = 2.706 · (.619 - .003025) / .003025
  = 2.706 · (.615975 / .003025)
  = 551
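As a check, the sample-size formula above can be reproduced in a few lines of Python. This is a sketch of the calculation only; the function name is not part of the original study.

```python
# Sample size for detecting a difference between two proportions,
# following the formula in the text: N = Z^2 * ((P1 + P2) - D^2) / D^2,
# where D = P1 - P2.

def sample_size(p1, p2, z):
    d = p1 - p2
    return z ** 2 * ((p1 + p2) - d ** 2) / d ** 2

# Rasinski's 1985 result: 28.2% for "assistance to Blacks" versus
# 33.7% for "improving conditions of Blacks"; one-tailed 95% -> z = 1.645
n = sample_size(0.282, 0.337, z=1.645)
print(round(n))  # 551
```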
The sample was drawn from a list of 1,400 recent purchasers of StatPac Gold IV -
Survey and Marketing Research Edition (Walonick, 1993), a computer software package
written by the author for the purpose of conducting and analyzing surveys. All
purchasers of the software are somehow involved in the process of survey design, data
collection, or analysis. Nearly all purchasers are businesses and roughly half are
marketing research companies. Foreign users (located outside the United States) were
eliminated from the list, leaving a total usable sample of 950 potential respondents.
From this sample, 551 subjects were randomly selected without replacement. Subjects
were assigned to one of the three sponsorship levels by selecting every third name from
the list of subjects. The survey was mailed on January 28, 1994. In this initial mailing,
184 of the surveys described the persuasion of the sponsor as conservative, 184 as liberal,
and 183 named no sponsor.
Two weeks after the initial mailing (February 11, 1994), a projection of the response rate
was made and additional surveys were mailed to the remaining 399 people in the sample.
This consisted of 133 surveys for each of the three sponsorship conditions. Thus, the
total for both mailings included 317 surveys citing a conservative sponsor, 317 citing a
liberal sponsor, and 316 with no defined sponsor.
The data collection phase of this project was completed three weeks after the second
mailing. A total of 953 questionnaires had been mailed. Out of those, 32 were returned
by the Post Office as undeliverable and the remaining 921 surveys were presumed to
have reached their intended destination.
Analysis
This study was concerned with exploring the relationship between researchers and the
survey instruments they create. However, if relativity is an issue in survey research, then
this researcher and this study are also called into question.
Researchers often think of the word "relationship" as if it possessed a single and precise
definition. There are, in fact, many precise and unique ways to define a relationship. To
the scientist, a relationship is defined in terms of a mathematical construct. Some
statistical tests establish the existence or nonexistence of a relationship (e.g., chi-square
and t-test between proportions). Others examine the degree to which variables co-vary
(e.g., gamma and correlation coefficients); and still others look at relationships in terms
of variability that can be explained (e.g., regression and canonical correlation).
In most studies, there are a variety of statistical techniques available to the researcher.
Depending on their choice of statistics, two researchers could draw different conclusions
from the same data. Furthermore, both researchers might be correct. The apparent
paradox can only be solved by examining the underlying assumptions associated with the
researcher's definitions of "relationship." If relativity is an issue in survey research, then
this research had to consider the possibility that the statistical methodology for this study
might also be biased. In an attempt to compensate for this potential source of error, this
study utilized a variety of statistical tests that might be used to answer the research
questions, rather than being limited to a single technique.
The following is a description of the statistical tests that were performed for the first two
hypotheses. Each test represents a different operational definition of the word
"related." While the text only discusses the first hypothesis (researchers' personal
opinions), the same testing procedure was applied to the second (persuasion of the
sponsor).
The various tests were chosen because they represent commonly used statistical
techniques. Some tests may be questionable to the reader because they blur the
distinction between ordinal and interval data. However, they are reported in this study
because they represent the kinds of analyses frequently found in the literature. For
example, Likert scales are often viewed as ordinal data and reported as counts and
percents. Other times, Likert scale items are summed and averaged to create subscales
and reported as means and standard deviations...an assumption of interval data. The
intent of this study is not to debate whether a Likert scale is ordinal or interval. The fact
that many studies report mean averages for Likert scale items was sufficient for its
inclusion in this study.
The astute reader might also notice that several of the statistical techniques are actually
designed to test for differences rather than measure the strength of a relationship directly
(e.g., a t-test). However, in the special case of a dichotomous dependent variable, a
significant difference is a relationship.
Liberal and Conservative t-Tests Between Proportions (Tests 1 and 2)
The first two tests view the relationship from a dichotomous perspective...either the
relationship exists or it doesn't. The relationship between a respondent's opinion and
their choice of question wording was quantified by creating two nominal scales. One
nominal scale utilized a liberal definition of "relationship" and the other adopted a more
conservative measure. While both scales contain bias, it would be acceptable for a
researcher to adopt either scale before beginning the research. For example, a researcher
might argue that a very conservative definition of "relationship" had been adopted in
order to be more assured that the results would not suggest a relationship when there
actually was none. Another researcher might adopt a liberal definition of "relationship"
during an initial exploratory research effort to determine if there were any evidence of a
phenomenon.
"Relationship" refers to the apparent harmony (or consistency) between a respondent's
personal opinion on an issue and their question wording choice. If a positive relationship
exists, then a respondent who believes too much is being spent on crime would be more
likely to select the question wording that favors more "too much is spent" responses.
Similarly, a person who believes too little is being spent on crime would be more likely
to select the question wording that favors "too little is spent." If a negative relationship
exists, a person who believes too much is being spent on crime would be more likely to
select the question wording that favors "too little is spent" responses and vice versa.
Both scales look at each pair of questions in terms of whether or not a respondent's
personal opinion was (or was not) consistent with their question wording choice. In
other words, for any given pair of questions, the relationship was viewed as
dichotomous... either it existed or it did not. If a respondent's personal opinion was
consistent with their question wording choice, the contribution to the scale was one and if
it was inconsistent, the contribution was zero.
The difference between the liberal and conservative dichotomous indicators is in how
they interpret "the right amount" response. The liberal indicator views "the right
amount" response as evidence of a relationship, while the conservative scale interprets it
as no relationship. Table 4 reveals the construction of the scales for the liberal and
conservative dichotomous indicators.
Table 4
Construction of the Liberal and Conservative Dichotomous Scales
Question Wording That         Respondent's     Liberal        Conservative
The Respondent Selected       Opinion          Dichotomous    Dichotomous
                                               Indicator      Indicator
___________________________________________________________________________
favors "too much is spent"    too much             +1             +1
favors "too much is spent"    right amount         +1              0
favors "too much is spent"    too little            0              0
favors "too little is spent"  too much              0              0
favors "too little is spent"  right amount         +1              0
favors "too little is spent"  too little           +1             +1
Using the liberal definition of "relationship", we expected to see a relationship in 66.7
percent (4/6) of the question/opinion pairs, even if there was actually no true relationship.
Using the conservative definition of "relationship", we expected to see a relationship in
33.3 percent (2/6) of the question/opinion pairs. If there were positive or negative
relationships between researchers' opinions and their question wording choices, we
would see deviations away from the values expected by chance.
A t-test between proportions was used to compare the "percent of question/opinion
comparisons that showed a relationship" with the "percent that would be expected by
chance." If the t-statistic was significant, the null hypothesis was rejected and it was
concluded that a relationship did exist between the two variables. The two tests are
reported as the "Liberal t-Test Between Proportions" and the "Conservative t-Test
Between Proportions."
One-Way Classification Chi-Square Test (Test 3)
Construction of an unbiased (or neutral) scale overcame the problem of how to classify a
response of "the right amount." Instead of defining "relationship" as a dichotomous
variable (yes or no), the neutral scale viewed the relationship as a matter of degree,
varying from minus one to plus one. "The right amount" was considered a neutral
response and assigned a value of zero. Table 5 shows the construction of the scale.
Table 5
Construction of the Unbiased (Neutral) Scale
Question Wording That         Respondent's     Unbiased
The Respondent Selected       Opinion          Indicator
________________________________________________________
favors "too much is spent"    too much            +1
favors "too much is spent"    right amount         0
favors "too much is spent"    too little          -1
favors "too little is spent"  too much            -1
favors "too little is spent"  right amount         0
favors "too little is spent"  too little          +1
For each question/opinion comparison, if a respondent's personal opinion was consistent
with their question wording choice, it was classified as "a positive relationship" and the
value for that comparison was plus one. If a respondent answered "the right amount"
(regardless of their question wording choice), it was interpreted as "no relationship" and
assigned a value of zero. Inconsistent responses were assigned a value of minus one.
Thus, the scale was similar to that of a correlation coefficient, ranging from minus one to
plus one. If only chance were operating, we would predict that the scores would be
equally distributed over this range and the mean average would be zero.
Test number three involved a one-way classification chi-square test to determine if the
observed frequencies were significantly different than would be expected by chance.
Since there are three possible scores for each question/opinion comparison (i.e., -1, 0,
and +1), the expected frequency for each cell was 33.3 percent of the sample. A
significant chi-square statistic would indicate that the observed frequencies were different
than would be expected by chance and the null hypothesis would be rejected. This test is
reported as the "One-Way Classification Chi-Square Test."
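The one-way classification chi-square computation can be sketched as follows; the counts of -1, 0, and +1 scores are hypothetical.

```python
# One-way classification chi-square on the neutral (-1, 0, +1) scale.
# Under chance alone, each of the three scores is expected in one third
# of the question/opinion comparisons.

def one_way_chi_square(observed):
    n = sum(observed)
    expected = n / len(observed)  # equal expected frequency per cell
    return sum((o - expected) ** 2 / expected for o in observed)

observed = [40, 35, 105]               # hypothetical counts of -1, 0, +1
stat = one_way_chi_square(observed)
# critical value for df = 2 at the .05 level is 5.991
print(round(stat, 1), stat > 5.991)    # 50.8 True -> reject the null
```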
Two-Way Classification Chi-Square Test (Test 4)
The fourth test involved a two-way (2 x 3) contingency table analysis using the chi-
square statistic and gamma. A significant chi-square statistic indicated that the observed
frequencies were significantly different than the expected frequencies and the null
hypothesis was rejected. Gamma is interpreted like a correlation coefficient and thus
provided an easily understood measure of the strength of a relationship. This test is
reported as the "Two-Way Classification Chi-Square Test." Table 6 reveals the
construction of the 2 x 3 contingency table.
Table 6
Construction of the 2 x 3 Contingency Table for the Two-Way Classification Chi-Square Test
                                          Personal Opinion
                              _____________________________________________
Question Wording Choice       Too Little   The Right Amount   Too Much
___________________________________________________________________________
Favors "too little is spent"     ---             ---             ---
Favors "too much is spent"       ---             ---             ---
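The gamma statistic that accompanies the two-way chi-square test can be sketched from such a 2 x 3 table as follows; the cell counts are hypothetical, and the concordant/discordant-pair formula is the standard Goodman-Kruskal definition, which the text does not spell out.

```python
# Goodman-Kruskal gamma for a 2 x 3 contingency table (cf. Table 6).
# Rows: wording favoring "too little" vs. "too much"; columns ordered
# too little / the right amount / too much. Counts are hypothetical.

def gamma(table):
    concordant = discordant = 0
    rows, cols = len(table), len(table[0])
    for i in range(rows):
        for j in range(cols):
            for k in range(i + 1, rows):       # cells in lower rows
                for l in range(cols):
                    if l > j:
                        concordant += table[i][j] * table[k][l]
                    elif l < j:
                        discordant += table[i][j] * table[k][l]
    return (concordant - discordant) / (concordant + discordant)

table = [[60, 25, 15],   # favors "too little is spent"
         [10, 20, 50]]   # favors "too much is spent"
print(round(gamma(table), 3))  # interpreted like a correlation coefficient
```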
t-Test Between Means (Test 5)
If a researcher assumes that a Likert scale is interval data, then a comparison of mean
averages is an acceptable method of hypothesis testing. A t-test between means was used
to compare the mean personal opinion for those who chose the question wording that
favors "too little is spent", with the mean personal opinion for those who chose the
question wording that favors "too much is spent." If the t-statistic was significant, we
concluded that there was a difference between the means, which implied the existence of
a relationship. This test is reported as the "t-Test Between Means."
Logistic Regression (Tests 6, 7, 8 and 9)
Logistic regression provides a multivariate nonlinear model appropriate for a
dichotomous dependent variable. The null hypothesis implies that knowing the personal
opinions of researchers would not improve our ability to predict which issue label they
would choose. A total of four tests were based on the logistic regression model. For all
four, the dependent variable was the respondents' choice of question wording and the
independent variable was the respondent's personal opinion. Two of the models viewed
the independent variable as ordinal and the other two as interval.
In logistic regression, the chi-square statistic reveals whether there is a significant
relationship between the combined effect of the independent variable(s) and the
dependent variable. Thus, its interpretation is similar to the F-ratio in standard multiple
regression. Unlike standard multiple regression, however, logistic regression lacks an
intuitive measure of how well the regression model actually improves our ability to
predict the dependent variable. In standard multiple regression, the r-squared statistic is
the proportion of variability in the dependent variable that can be explained by the
independent variables. Traditional logistic regression has no similar intuitive statistic.
When there are a large number of cases, independent variables with very small effects
can meet the requirements for significance. Researchers using logistic regression usually
report the chi-square statistic and its associated significance level, but this reveals little
about how much the regression model actually improves our ability to predict the
dependent variable. In fact, in the extreme condition where the sample is sufficiently
large and the effect is relatively small, it is possible for a logistic regression model to
show a significant chi-square statistic, even when there is no actual improvement in our
ability to predict the dependent variable. This apparent paradox can occur because the
logistic growth model is a nonlinear approximation of a dichotomous variable.
The value of a logistic regression model is found in its capacity to improve a researcher's
ability to predict the dependent variable. While it is important to look at the significance
of the model, it is perhaps even more important to look at the magnitude of the effect.
One way to understand the magnitude of effect in logistic regression is to ask the
questions:
1. How well can we predict the dependent variable in the absence of any
information about the independent variables?
2. How well can we predict the dependent variable with the knowledge of the
independent variables?
"How well" is readily quantified as the percent of our predictions that are correct. By
comparing the "percent correct" with and without knowledge of the independent
variables, we can get a quantitative measure of the magnitude of the effect.
The "percent correct" without knowledge of the independent variables is obtained from
the frequency distribution of the dependent variable. For example, if seventy-five
percent of the sample had a dependent variable equal to one, our best guess for the value
of the dependent variable in the absence of all other information would be one, and we
would be wrong twenty-five percent of the time. Thus, the "percent correct" would be
seventy-five percent.
The "percent correct" with knowledge of the independent variables is obtained from the
regression model. For each case, the logistic function yields the probability that the
dependent variable is equal to one. If the probability for a case is .5 or higher, our best
prediction for that case is that the dependent variable is equal to one. If the probability is
less than .5, we would predict that the dependent variable is equal to zero. By comparing
our predictions with the actual values for the dependent variable, we can calculate the
"percent correct" with knowledge of the independent variables.
The magnitude of effect is the difference between the two predictions (with and without
knowledge of the independent variables). A t-test between proportions was used to
determine if the difference is statistically significant. This method was deemed superior
to the chi-square statistic because it provided direct information regarding the magnitude
of the effect, rather than the significance level of the equation. The results relate to our
improved ability to predict the dependent variable and not simply to a mathematical
construct.
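The percent-correct comparison described above can be sketched as follows, assuming the fitted probabilities have already been obtained from a logistic regression model; the values here are hypothetical.

```python
# Magnitude of effect in logistic regression: percent of correct
# predictions with vs. without knowledge of the independent variables.

y     = [1, 1, 1, 0, 1, 0, 1, 0, 1, 1]                      # observed DV
probs = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.55, 0.45, 0.8, 0.35]  # fitted P(y=1)

# Without the IVs: always predict the modal value of the dependent variable.
modal = max(set(y), key=y.count)
baseline = sum(v == modal for v in y) / len(y)

# With the IVs: predict 1 when the fitted probability is .5 or higher.
preds = [1 if p >= 0.5 else 0 for p in probs]
model = sum(p == v for p, v in zip(preds, y)) / len(y)

# The magnitude of effect is the improvement in "percent correct."
print(baseline, model, round(model - baseline, 2))
```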
Tests six and seven viewed researchers' personal opinions as an ordinal scale. The
"researchers' opinion" variable was converted to dummy variables and these became the
independent variables for the regression model. Since there was substantial nonresponse,
all three dummy variables could be included in the regression model without creating a
collinearity problem: "no response" served as the reference category and was excluded
from the model. Thus, both models encompass the idea that "no personal opinion" may also
be predictive of question wording choice. The two tests are reported as "Logistic
Regression with Dummy IV's (Chi-Square)" and "Logistic Regression with Dummy IV's
(t-Test)."
The eighth and ninth tests were the same regression models except that the independent
variable was viewed as an interval scale and therefore, it could be used directly in the
regression without creating dummy variables. If a respondent did not give a personal
opinion to an item, that data pair was excluded from the analyses. They are reported as
"Logistic Regression with an Interval IV (Chi-Square)" and "Logistic Regression with an
Interval IV (t-Test)."
Table 7 contains a summary of the nine tests that were used to examine the first two
hypotheses. All nine were performed for each of the first two hypotheses. Each test
used a different mathematical construct to define the "relationship" between two
variables.
Table 7
Summary of Nine Statistical Tests Used for Testing the First Two Hypotheses
1. Liberal t-Test Between Proportions
2. Conservative t-Test Between Proportions
3. One-Way Classification Chi-Square Test
4. Two-Way Classification Chi-Square Test
5. t-Test Between Means
6. Logistic Regression with Dummy IV's (Chi-Square Test)
7. Logistic Regression with Dummy IV's (t-Test)
8. Logistic Regression with an Interval IV (Chi-Square Test)
9. Logistic Regression with an Interval IV (t-Test)
The first three statistical tests were not appropriate for evaluating the third hypothesis
because there is no way to define what constitutes a positive or negative relationship
between researchers' question wording choices and their self-assessed experience.
However, the other six tests were used to evaluate the third hypothesis.
Multivariate Models
The multivariate model asks whether the combined knowledge of the independent
variables (researchers' opinions, persuasion of the sponsor, and self-assessed experience)
improves our ability to predict the dependent variable (question wording choice).
Logistic regression was used to explore this question. As in the previous logistic
regression models, both the chi-square test and the t-test between proportions were used
to evaluate the model.
The four multivariate tests parallel tests 6, 7, 8, and 9. The only difference is that all
three independent variables were simultaneously included in the model. These tests are
referred to as:
Multivariate Logistic Regression with Dummy IV's (Chi-Square Test)
Multivariate Logistic Regression with Dummy IV's (t-Test)
Multivariate Logistic Regression with Interval IV's (Chi-Square Test)
Multivariate Logistic Regression with Interval IV's (t-Test)
The fourth research question was whether there were differences between beginning and
advanced researchers with respect to the degree to which they incorporate their personal
opinions and those of the sponsor into their choice of question wording. Two sets of
tests were conducted, one to examine the personal opinions of the researchers and the
other to look at the persuasion of the study sponsor.
The liberal, conservative, and unbiased measures of "relationship" were compared for the
different levels of self-assessed experience, using a two-way classification chi-square
test. A significant chi-square statistic would indicate that there was a relationship
between self-assessed experience and the degree to which researchers incorporated their
personal opinions or those of the sponsor. Additionally, the unbiased or neutral scale
was used for two additional tests. The first was a t-test between means, comparing
beginning with advanced researchers. The second was a one-way ANOVA, where the
dependent variable was the unbiased scale and the factor variable was the self-assessed
experience level. These five tests are referred to as:
Two-Way Classif. Chi-Square Test - Liberal Definition
Two-Way Classif. Chi-Square Test - Conservative Definition
Two-Way Classif. Chi-Square Test - Unbiased Definition
t-Test between Means - Unbiased Definition
One-Way ANOVA - Unbiased Definition
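The t-test between means on the unbiased scale can be sketched as follows. The pooled-variance form of the statistic and the scores themselves are assumptions for illustration; the study does not specify which variant of the t-test was used.

```python
import math

# t-test between means on the unbiased (-1 / 0 / +1) scale, comparing
# beginning with advanced researchers. Scores are hypothetical.

def t_between_means(a, b):
    """Two-sample t-statistic with pooled variance."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled * (1 / na + 1 / nb))

beginners = [1, 0, 1, -1, 0, 1, 1, 0]
advanced  = [1, 1, 0, 1, 1, -1, 1, 1]
print(round(t_between_means(beginners, advanced), 3))
```

A one-way ANOVA over more than two experience levels generalizes the same comparison of group means on the unbiased scale.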
Analyses of the Hypotheses Broken Down By Question Wording Pairs
All of the previous analyses were conducted on all issues combined. Each respondent
contributed six records to the final model (one for each issue). The same analyses were
performed for each of the six question wording pairs individually. This was done to
investigate whether the magnitude of the relationship was dependent on the particular
question wording pair.
Methodological Limitations
One shortcoming of all mail-surveys is the possibility of a low response rate. For this
study, it was determined that 551 completed surveys would be required to detect a 5.5
percent difference as significant at the 95 percent confidence level using a one-tailed test.
There is considerable information regarding response rates to mail surveys for the
general public. However, virtually no information exists regarding response rates of the
research community itself. Therefore, this study replaced nonresponders with new
subjects. While this technique had the potential of adding bias to the results, it provided
a sufficient number of respondents for performing the desired analyses.
Another possible shortcoming of this study was that researchers might consistently select
one form of question wording, regardless of the other variables. While this was not
likely to occur for all six issue labels, it could have necessitated eliminating one or more
of the issue label pairs from the analyses, since the data would not provide
discriminating information. A similar (remote) possibility existed for the researcher's
experience variable.
Another possible shortcoming of this study was that researchers might already be
familiar with the results of Rasinski's question wording experiments. This might have
had the effect of influencing their choice of issue labels, but it is not clear what the
direction or magnitude of this bias would be. One might assume that more experienced
researchers were more likely to be familiar with Rasinski's findings, but again, there is no
basis for predicting what this bias might be.
A final limitation of this study was that the sample may not be representative of all
survey researchers. The universe of survey researchers is diverse and difficult to
identify. The sample selection technique that was used in this study taps only a small
segment of the population. Since the quality of the sample could not be determined, this
study does not make inferential statements about the population.
Validity and Reliability
It is difficult to discuss the validity of a survey instrument designed to investigate the
validity of surveys in general. Validity refers to the accuracy or truthfulness of a
measurement. Are we measuring what we think we are? "Validity itself is a simple
concept, but the determination of the validity of a measure is elusive" (Spector, 1981, p.
14).
Most researchers would not knowingly administer a survey with the intention of
distorting the results. Face validity is an issue of integrity. The determination of face
validity is based on the subjective opinion of the researcher. Unless a researcher
knowingly uses faulty procedures or instruments, a study will have face validity. This
study had face validity. This researcher scrutinized and modified the experimental
design and was satisfied that it would accurately measure the desired constructs.
This study also had content validity. A thorough literature review was performed and
this researcher was convinced that the questionnaire would adequately test the four
hypotheses. In addition, other researchers were asked to review the experimental
design and questionnaire before the study design was finalized.
This study attempted to establish concurrent (criterion-related) validity. Logistic
regression and discriminant function analysis produce a mathematical model, where the
dependent variable can be predicted from one or more independent variables, at the same
point in time.
It is less clear whether or not this study has construct validity. Taken at face value, the
theoretical foundations of this study are sound. However, it is important to keep in mind
that researchers were asked to design a hypothetical survey with a hypothetical sponsor.
There is no way to be assured that this was representative of researchers operating in the
"real world." However, the vignette has been shown to be an effective technique that can
increase both validity and reliability in opinion surveys (Alexander and Becker, 1978).
In addition, this study imposed the constraint that the researcher had to choose between
two alternative issue labels. In a real survey, researchers would be free to create their
own question wording; they would not be limited to the issue labels presented in this
study. Nevertheless, this study was exploratory in nature and the results were interpreted
as such.
Rasinski (1989) reported that the six pairs of issue labels produced consistent and stable
differences in people's responses from 1984 to 1986. The downturn in the United States
economy has created a more conservative spending attitude and it would not be
surprising to learn that the mood of the public regarding government spending has
changed since 1986. However, no attempt was made to compare the findings of this
study to Rasinski's. This study used Rasinski's issue labels because they have
demonstrated reliability in their ability to produce different responses. In other words,
this study was concerned with relative response patterns, not absolute values.
The questionnaire for this study covered six different social issues; however, the actual
issues were not important for this study. Each pair of issue labels provided a
measurement of the phenomenon being studied, thus resulting in six measurements for
each respondent. While this was not the same as equivalent-form reliability, it did allow
for repeated measures within individuals, thereby adding to our confidence in the
findings. Cronbach's alpha was used as a measure of the tendency of respondents to
favor one form of question wording.
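Cronbach's alpha can be computed from the item variances and the variance of respondents' total scores. The sketch below uses hypothetical data in which each column is one of the six issues and a 1 indicates that the respondent chose the "too much is spent" wording.

```python
# Cronbach's alpha: k/(k-1) * (1 - sum(item variances) / total variance).

def cronbach_alpha(items):
    """items: one list per questionnaire item, aligned by respondent."""
    k = len(items)
    n = len(items[0])
    item_vars = [sum((x - sum(col) / n) ** 2 for x in col) / (n - 1)
                 for col in items]
    totals = [sum(col[i] for col in items) for i in range(n)]
    mean_total = sum(totals) / n
    total_var = sum((t - mean_total) ** 2 for t in totals) / (n - 1)
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

rows = [  # one row per respondent, one column per issue (hypothetical)
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 0],
]
items = list(map(list, zip(*rows)))  # transpose to per-item columns
print(round(cronbach_alpha(items), 3))  # high alpha -> consistent choices
```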
Procedures and Timetable
This study began immediately following the approval of the research proposal and
approval of the Request for Exemption from Committee Review of Research Involving
Human Subjects.
Selection of the sample took approximately one week. Preparing envelopes and
questionnaires required an additional week. The mailing of the questionnaires began
approximately two weeks after the study began. Three weeks after the initial mailing,
the response rate was calculated and an additional mailing was made to all the remaining
potential subjects.
The analyses and final report were completed approximately three months after the first
mailing.
CHAPTER IV
Results
Response and Non-Response
Response rate was tracked on a daily basis to determine whether or not there was a
difference in response based upon which of the three forms the respondent received. The
total number of returns was similar for the three different forms and it appears that the
sponsorship variable did not influence response rate to the questionnaire.
A total of 953 surveys were mailed. Thirty-two were returned by the post office as
undeliverable and 921 are presumed to have reached their intended destination. The final
number of usable surveys was 361; thus, the overall response rate was 39.2 percent, a
fairly high usable response for mail surveys without follow-up. Table 8 shows the final
returns and response rates for each of the three forms of the survey.
Table 8
Response Rate Information Broken Down by the Sponsorship Variable
                                            Not                   Usable Response
                                 Mailed  Deliverable   Returns        Rate
_________________________________________________________________________________
Overall (all surveys combined)     953       32          361         39.2%
Conservative sponsor               318        9          119         38.5%
Liberal sponsor                    318       13          116         38.0%
No sponsor identified              317       10          126         41.0%
Many respondents failed to answer all the items on the questionnaire. Omitted items
were frequently accompanied by text explaining why they did not choose either form of
question wording and some even suggested a third wording alternative. Table 9 shows
the completion rate for each of the items on the questionnaire. The percentages are based
upon the total number of usable returns (N=361).
Table 9
Number of Valid Responses and Completion Rates for All Items on the Survey
Researcher's self-assessed experience 328 (90.9%)
                    Made a Valid Question      Made a Valid Personal
Issue               Wording Selection          Opinion Selection
Crime 359 (99.4%) 330 (91.4%)
Drugs 352 (97.5%) 321 (88.9%)
Welfare 352 (97.5%) 325 (90.0%)
Cities 357 (98.9%) 326 (90.3%)
Blacks 344 (95.3%) 308 (85.3%)
Social Security 358 (99.2%) 324 (89.8%)
Nine percent of those who returned the survey did not specify their survey design
experience level. This was the first item on the survey and presumably not likely to be
overlooked. Pretesting had not shown a problem and a review of respondents' comments
did not reveal any reasons why they may have omitted the item. We suspect that many
respondents inadvertently left this item blank, possibly because it was not numbered.
Future studies might number the item, or highlight it in some other way, in order to
improve the response rate.
Twenty-three respondents (6.4%) answered most of the six question wording choices, but
left all six of the personal opinion questions blank. A few respondents provided an
explanation why they had omitted the personal opinion questions; however, in most
cases, they were just left blank. The most plausible explanation is that these people
simply did not realize that they were supposed to answer this group of questions.
Although the instructions were printed in bold, they may have interpreted the questions
as part of the questionnaire that they would be sending to the public. We suspect that this
reflects a minor design flaw in the survey instrument. It is not surprising that the
deficiency did not surface in pretesting, since it appeared in such a small proportion of
the surveys.
Respondents' Comments
The cover letter that accompanied the questionnaire asked respondents to write
comments anywhere on the questionnaire. This untraditional approach to the solicitation
of comments proved to be extremely effective. Over a quarter (27.7%) of all the
returned questionnaires contained one or more comments. Counting only those
respondents that made at least one comment, the average number of comments per
questionnaire was 2.4. While it was not the original intent of this study to summarize the
comments, they are presented here because they reveal much about this survey and the
survey design process in general.
Reaction to the survey was mixed. One respondent wrote, "I've been designing surveys
for about 10 years and this is one of the strangest survey's I've ever seen or completed."
Another commented that it was a "very clever study." A few respondents seemed
suspicious of the survey. "Who's paying for this-a government agency? It sure doesn't
sound like a 'question wording survey.'" Fifteen respondents specifically requested a
copy of the results.
A couple of respondents found the survey offensive. One, who refused to complete
the questionnaire, described how tired he/she was of "researcher bashing."
Because I have integrity and am tired of having my profession bashed for "misuse" all the time, I would turn down any offers and rescind any bids. None of the questions are bias free or clear enough to know what the respondent would be answering.
Other respondents stressed the importance of objectivity in survey research. In one case,
a respondent refused to answer the personal opinion questions because "that in itself
would not be objective." It was clear that objectivity was important to many
respondents. Some even expressed open contempt for researchers who would allow their
professional judgment to be influenced by the persuasion of the sponsor or the potential
for additional work.
Your survey is testing the hypothesis that designers are going to worry more about their next design job than a good instrument. These people lack the character needed to maintain credibility to compete for design jobs.
This is nonsense. I attempt to do the most objective job regardless of the client's motivations.
Why do you care if I'm liberal or conservative or whatever? Being objective has to be the first priority-regardless of future research opportunities.
Several respondents wrote flowing narratives describing the methods they use to
overcome the problem of question wording bias. The common thread in these responses
was the use of a panel or focus group to develop the most appropriate question wording.
Each of the alternative casts a different light on the issues. The alternatives give different "set-up" and provoke different emotions. None are any more or less objective than the rest. I find the "truth" is seldom acquired by a single question. Truth is always subject to context. Therefore, I generally probe an issue with several slightly different questions. In other words, I ask the question 3 or 4 different ways and then look at the sum of the results. Actually, if I don't know the right wording to use, I usually do a couple focus groups first. The focus groups usually show which wording to use.
We develop most of our questions by asking a panel of individuals to suggest and cross check the questions for neutrality, clarity, appropriate language level, bias, etc. This is often accomplished by soliciting the opinion of extreme and opposite ideologies (i.e. an adaptation of Likert techniques) and then pre-testing the instrument ... No survey instrument is perfect and I've made my mistakes, but all too many researchers seek the answers desired by the customer. We always tell our customers that we will try our best to obtain objective answers, although the results may not be what is desired.
Only two respondents made statements to suggest that they might knowingly bias the
results, although one (or both) may have been joking.
Being objective depends on how badly I need the job.
Of course, the situation is hypothetical. It's likely I would not get myself in the position of working for such a group. But if for some reason I did, I would word the questions so they got the results they want. Hey, I'm a professional, Ya Know?
Eight respondents indicated that they needed to know whether the "we" in "Are we
spending..." referred to the federal, state, or local government. In order to correct this
ambiguity, future researchers might consider changing the wording to "Is the federal
government spending...".
The most frequent comment was a short general statement indicating that the respondent
had a problem with one or both of the question wording choices. Some comments were
directed at the lack of specificity in the questions. These included words like "vague",
"broad", "unclear", and "ambiguous." Other comments addressed the issue of bias
directly and contained phrases like "loaded words", "opinionated", "leading questions",
"negative connotations", "limiting", and "biased." A large number of respondents
pointed out that the questions were addressing two distinct and separate issues.
Occasionally, respondents refused to answer an item and stated that neither choice was
acceptable. They often suggested a third alternative wording.
A substantial number of respondents wrote in "DK" or "don't know" for some of the
personal opinion questions. Others circled "the right amount" and added a qualifier
pointing out that "the right amount" is not synonymous with appropriate spending, or
they added some other qualifying statement.
The right amount, but misdirected.
This was hard to do. Felt that more information should be provided to obtain accurate perceptions. I really feel that the money is not well spent therefore, one dollar spent is too much!
Crime
Examples of alternative wording suggested by respondents include:
...protecting law-abiding citizens.
...getting and keeping criminals off the street.
...reducing crime.
Two respondents correctly pointed out that crime rate is not rising in many areas.
Another stated that, "You have assumed that spending has a direct effect upon crime rate.
I doubt that it does." Another respondent noted that "law enforcement is only one
method for dealing with crime, and then, after the fact."
Drugs
Examples of alternative wording suggested by respondents include:
...the drug epidemic.
...decreasing the illegal use of drugs.
...the social problems of medicine and chemical misuse.
Several respondents commented that alcohol abuse was also part of the drug problem.
Others focused on the difference between rehabilitation and prevention. "Rehabilitation
is one after-the-fact method. The choice I circled could include prevention." "We need
more education about the hazards of drug and alcohol abuse." Surprisingly, no
comments were made about the emotional connotations of the word "addiction."
Welfare
Examples of alternative wording suggested by respondents include:
...public assistance programs.
...ending poverty.
...helping the homeless.
Many respondents commented on this issue. A large number mentioned that the salient
issue is not how much money, but rather, how the money is used. Typical responses
were that we are "spending the right amount but in the wrong places", or "the right
amount if how the money is used is restructured." One respondent wrote that we "need
to revamp the whole system." Others focused on more specific solutions. "It's not
just a spending issue. What I think 'the poor' need is more quality education which may
or may not require more dollars."
Several respondents commented that "welfare" had severe negative connotations. One
respondent exhibited an almost knee-jerk reaction to the word "welfare" and wrote, "Why
spending? Perhaps pulling, pushing or kicking would be more effective." A few
respondents pointed out that "welfare is only one type of assistance to the poor. What
about inner city job programs, etc."
Cities
Examples of alternative wording suggested by respondents include:
...urban revitalization.
...urban decay.
...solving THE problems of OUR big cities.
There were fewer comments for this question compared to the other issues. One
respondent pointed out that "assistance is different than 'solving' or 'improving'", and then
went on to suggest that multiple questions were needed to explore the issue. Another
respondent, who didn't like either choice, wrote that "the emphasis should be on
partnership, research, and discovering what works". Another stated that both questions
were "assuming that the wrong medicine will cure the patient."
Blacks
Examples of alternative wording suggested by respondents include:
...helping economically and socially disadvantaged persons.
...affirmative actions for all minorities.
...programs targeted to the African American community.
...improving THE ECONOMIC conditions of AFRICAN AMERICANS.
...helping minorities.
...helping disadvantaged people improve their situation.
This question generated more comments than the other issues. About twenty-five
respondents pointed out that one should "never use the term 'Blacks' in today's race
conscious society." Respondents most often referred to this as "politically incorrect", but
others used words such as "offensive", "racist", "discriminatory", "emotional trigger",
"condescending", "paternalistic", and "patronizing." Most suggested using "African
Americans" instead of "Blacks", although a few respondents suggested "minority groups"
and "people of color."
Social Security
No respondents suggested alternative wording for this question, although there were a
few comments stating that the question wording was not sufficiently clear. One
respondent wrote, "Do you mean Social Security payments? Social Security deductions
(FICA)? Too vague. Don't understand this question." Another respondent asked, "Do
you want to measure what's spent on the program, or what's spent by special interest,
lobbyist, etc., to keep it?" One respondent pointed out that the word "protecting" implied
that Social Security is in jeopardy.
Summary of Comments
A total of 239 comments were made by 100 respondents. Table 10 shows the
approximate number of comments made for each of the issues, as well as the percent of
respondents who made a valid question wording selection. As we might expect, there
appears to be a negative relationship between the number of comments and the response
percent: as the response percent goes down, the number of comments goes up.
Table 10
Number of Comments and Response Percent Ranked by Frequency of Mention
Issue Number of Comments Response Percent
General (nonspecific) 64 Not appropriate
Blacks 42 95.3%
Welfare 38 97.5%
Crime 29 99.4%
Drugs 27 97.5%
Cities 23 98.9%
Social Security 16 99.2%
Null Hypothesis Testing
The first two null hypotheses were tested using nine different operational definitions of
"relationship." These are referred to as:
1. Liberal t-Test between Proportions
2. Conservative t-Test between Proportions
3. One-Way Classification Chi-Square Test
4. Two-Way Classification Chi-Square Test
5. t-Test between Means
6. Logistic Regression with Dummy IV's (Chi-Square Test)
7. Logistic Regression with Dummy IV's (t-Test)
8. Logistic Regression with an Interval IV (Chi-Square Test)
9. Logistic Regression with an Interval IV (t-Test)
Null Hypothesis 1: Researchers' choices for question wording are not related to their
personal opinions.
Using the "Liberal t-Test between Proportions", we observed a relationship in 71.6
percent of the question pairs. Table 11 shows that the t-statistic was significant,
t(1932)=1.834, p=.033, thus we reject the null hypothesis and conclude that there was a
significant relationship between researchers' personal opinions and their question
wording choices.
Table 11
Liberal t-Test between Proportions

Total number of comparisons = 1934     N    Percent    t     df     p
Observed relationships               1385    71.6    1.834  1932  .033
Relationships expected by chance     1289    66.7
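The dissertation does not spell out the formula behind its t-tests between proportions, so the sketch below uses a standard one-sample test of an observed proportion against a chance baseline; the resulting statistic and degrees of freedom will not necessarily match the values in Table 11.

```python
import math

def prop_test_statistic(successes, n, p0):
    """Standard one-sample test of a proportion against a chance baseline p0.
    (The dissertation's exact formulation is unstated; this form is an assumption.)"""
    p_hat = successes / n
    se = math.sqrt(p0 * (1.0 - p0) / n)
    return (p_hat - p0) / se

# Observed relationships from Table 11: 1385 of 1934 pairs, vs. 2/3 by chance
z = prop_test_statistic(1385, 1934, 2.0 / 3.0)
```

A positive statistic here means more relationships were observed than chance alone would produce.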
The "Conservative t-Test between Proportions" showed a relationship in 41.0 percent of
the question pairs. Table 12 reveals that the t-test between proportions was significant,
t(1932)=3.926, p<.001. Therefore, we reject the null hypothesis and conclude that a
significant relationship did exist between researchers' personal opinions and their
question wording choices.
Table 12
Conservative t-Test between Proportions

Total number of comparisons = 1934     N    Percent    t     df     p
Observed relationships                792    41.0    3.926  1932  .000
Relationships expected by chance      645    33.3
The "One-Way Classification Chi-Square Test" using the unbiased or neutral measure of
the relationship revealed a strong significant difference between the observed distribution
and one that would be expected by chance if there were no true relationship,
χ²(2)=52.006, p<.001. Therefore, we reject the null hypothesis and conclude that a
relationship did exist between researchers' personal opinions and their question wording
choices. Table 13 shows the counts and percents for the one-way contingency table.
Table 13
One-Way Classification Chi-Square Test
  Positive         No            Negative
Relationship   Relationship   Relationship     χ²     df     p
 792 (41.0%)    593 (30.7%)    549 (28.4%)   52.006    2    .000
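The one-way test compares the observed counts against an equal three-way split expected by chance. A minimal goodness-of-fit computation (plain Python; the dissertation's software is not identified) recovers the reported statistic to within rounding:

```python
def one_way_chi_square(observed):
    """Goodness-of-fit chi-square against an equal split across categories."""
    n = sum(observed)
    expected = n / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

# Counts from Table 13: positive, no, and negative relationships
stat = one_way_chi_square([792, 593, 549])  # ~52.0, close to the reported 52.006
```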
The "Two-Way Classification Chi-Square Test" revealed a significant relationship
between researchers' personal opinions and their question wording choices,
χ²(2)=61.178, p<.001, thus we reject the null hypothesis. Gamma indicated a low to
moderate strength relationship (γ=.287). Therefore, about eight percent of the variability
in question wording choices can be explained by knowing the personal opinions of
researchers. Table 14 shows the counts and row percents for the two-way contingency
table.
Table 14
Two-Way Classification Chi-Square Test
N=1934                                Personal Opinion
Question Wording Choice       Too Little    Right Amount    Too Much
Favors "too little spent"    522 (55.9%)    246 (26.3%)   166 (17.8%)
Favors "too much is spent"   383 (38.3%)    347 (34.7%)   270 (27.0%)
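Both statistics reported for Table 14, the Pearson chi-square and gamma, follow from the cell counts alone; the degrees of freedom for a 2x3 table are (2-1)(3-1)=2, matching χ²(2). This sketch (plain Python; a standard textbook computation, since the dissertation's software is not identified) reproduces both values to within rounding:

```python
def chi_square_table(table):
    """Pearson chi-square for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            stat += (obs - exp) ** 2 / exp
    return stat

def goodman_kruskal_gamma(table):
    """Gamma from concordant and discordant pairs in an ordered table."""
    conc = disc = 0
    rows, cols = len(table), len(table[0])
    for i in range(rows):
        for j in range(cols):
            for i2 in range(i + 1, rows):
                for j2 in range(cols):
                    if j2 > j:
                        conc += table[i][j] * table[i2][j2]
                    elif j2 < j:
                        disc += table[i][j] * table[i2][j2]
    return (conc - disc) / (conc + disc)

# Counts from Table 14 (rows: wording choice; columns: personal opinion)
table = [[522, 246, 166], [383, 347, 270]]
stat = chi_square_table(table)          # ~61.17, close to the reported 61.178
gamma = goodman_kruskal_gamma(table)    # ~0.287, matching the reported value
```

Squaring gamma (.287² ≈ .08) gives the "about eight percent of the variability" figure cited in the text.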
The "t-Test between Means" revealed a strong significant difference in personal opinions
between those who chose the question wording that favors "too little is spent" and those
who chose the wording that favors "too much is spent", t(1932)=7.501, p<.001.
Therefore, we reject the null hypothesis and conclude that there was a significant
relationship between researchers' personal opinions and their question wording choices.
Table 15 reveals that regardless of which question wording choice respondents chose,
their personal opinions tended to fall somewhere between "too little" and "the right
amount" is spent (i.e., between zero and one). A mean of zero would indicate that all
respondents felt that "too little" is spent, while a mean of one would indicate they thought
"the right amount" was being spent.
Table 15
t-Test between Means

Question wording choice          N     Mean    SD      t     df     p
Favors "too little is spent"    934    .619   .769   7.501  1932  .000
Favors "too much is spent"     1000    .887   .801
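Because Table 15 reports group sizes, means, and standard deviations, the pooled two-sample t-statistic can be recomputed from those summaries alone; the small discrepancy from the reported 7.501 reflects rounding in the published means and standard deviations.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t with pooled variance, computed from summary statistics."""
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    return (mean2 - mean1) / se, df

# Summary statistics from Table 15
t, df = pooled_t(.619, .769, 934, .887, .801, 1000)  # t ~ 7.50, df = 1932
```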
The "Logistic Regression with Dummy IV's" models produced a highly significant chi-square statistic, χ²(3)=67.732, p<.001. Thus, we conclude that a significant relationship
existed between researchers' personal opinions and their question wording choices. The
t-statistic was also significant, thereby indicating that knowing researchers' personal
opinions improves our ability to predict which question wording choice a researcher
would choose, t(2120)=2.884, p=.002. Table 16 reveals that the regression model
increases our prediction accuracy by 6.6 percent, from 52.5 percent to 59.1 percent.
Table 16
Logistic Regression with Dummy IV's (t-Test)
DV = Question wording choice
IV = Personal opinion of the researcher

Total variable pairs = 2122                  N    Percent    t     df     p
Correct with knowledge of the
personal opinion of the researcher         1254    59.1    2.884  2120  .002
Correct without knowledge of the
personal opinion of the researcher         1115    52.5
The "Logistic Regression with an Interval IV" model also revealed a significant
relationship, χ²(1)=55.279, p<.001. Table 17 shows that this model was slightly better
than logistic regression with dummy variables. Our ability to predict the dependent
variable increased 7.1 percent (from 51.7 to 58.9 percent) and the t-statistic was highly
significant, t(1932)=3.018, p=.001.
Table 17
Logistic Regression with an Interval IV (t-Test)
DV = Question wording choice
IV = Personal opinion of the researcher

Total variable pairs = 1934                  N    Percent    t     df     p
Correct with knowledge of the
personal opinion of the researcher         1139    58.9    3.018  1932  .001
Correct without knowledge of the
personal opinion of the researcher         1000    51.7
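The logic of Tables 16 and 17, comparing prediction accuracy with and without knowledge of the researcher's opinion against a modal-category baseline, can be illustrated on invented data (the counts below are hypothetical, chosen for the sketch, and are not the study's data):

```python
import math

def fit_logistic(x, y, lr=0.1, steps=2000):
    """One-predictor logistic regression fit by plain gradient descent."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += p - yi
            g1 += (p - yi) * xi
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

def accuracy(x, y, b0, b1):
    """Fraction of cases classified correctly at the 0.5 probability cutoff."""
    preds = [1 if b0 + b1 * xi > 0 else 0 for xi in x]
    return sum(p == yi for p, yi in zip(preds, y)) / len(y)

# Hypothetical data: opinion coded 0/1/2, wording choice coded 0/1.
# Opinion 0 mostly pairs with choice 0; opinion 2 mostly with choice 1.
x = [0] * 100 + [1] * 100 + [2] * 100
y = [0] * 75 + [1] * 25 + [0] * 50 + [1] * 50 + [0] * 25 + [1] * 75

b0, b1 = fit_logistic(x, y)
baseline = max(sum(y), len(y) - sum(y)) / len(y)  # guess the modal category
model_acc = accuracy(x, y, b0, b1)                # beats the 50% baseline
```

The gap between `model_acc` and `baseline` plays the same role as the 6.6 and 7.1 percentage-point improvements reported in Tables 16 and 17.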
Table 18 is a summary of the conclusions that were drawn from each of the statistical
tests. In every test, the null hypothesis was rejected and therefore we conclude that there
was a significant relationship between researchers' personal opinions and their question
wording choices.
Table 18
Summary of Conclusions for the First Null Hypothesis
Null hypothesis: Researchers' choices for question wording are not related to their personal opinions.
Operational Definition of Relationship Conclusion
Liberal t-Test between Proportions                     Reject
Conservative t-Test between Proportions                Reject
One-Way Classification Chi-Square Test                 Reject
Two-Way Classification Chi-Square Test                 Reject
t-Test between Means                                   Reject
Logistic Regression with Dummy IV's (Chi-Square)       Reject
Logistic Regression with Dummy IV's (t-Test)           Reject
Logistic Regression with an Interval IV (Chi-Square)   Reject
Logistic Regression with an Interval IV (t-Test)       Reject
The final step in testing the first hypothesis was to determine whether there were
differences between the six sets of question wording pairs. Was the effect present for
some issues, but not for others? The same nine operational definitions of "relationship"
were used to test each of the six question wording pairs. Table 19 contains a summary of
the significance levels for all 54 statistical tests. For all issues except drugs, researchers'
personal opinions seemed to be significantly related to their question wording choices.
Note that a probability of one for a logistic regression model means that there was no
improvement in our ability to predict the dependent variable as a result of the regression
equation.
Table 19
Probability Levels for Each of the Nine Tests Broken Down by Issue

                                                Crime   Drugs  Welfare  Cities  Blacks  Social Security
                                                N=330   N=321  N=325    N=326   N=308   N=324
Liberal t-Test between Proportions               .101    .216   .109     .098    .050    .012
Conservative t-Test between Proportions          .444    .212   .000     .013    .182    .055
One-Way Classification Chi-Square Test           .003    .004   .000     .000    .000    .000
Two-Way Classification Chi-Square Test           .001    .160   .000     .001    .002    .000
t-Test between Means                             .001    .227   .000     .000    .000    .000
Logistic Regression, Dummy IV's (Chi-Square)     .002    .192   .000     .001    .000    .000
Logistic Regression, Dummy IV's (t-Test)        1.000    .256   .324     .174    .055    .363
Logistic Regression, Interval IV (Chi-Square)    .001    .452   .000     .000    .000    .000
Logistic Regression, Interval IV (t-Test)       1.000   1.000   .319     .285    .151    .355
Null Hypothesis 2: Researchers' choices for question wording are not related to the
persuasion of the study sponsor.
Using the "Liberal t-Test between Proportions", we observed a relationship in 70.1
percent of the question pairs. Table 20 shows that the t-statistic was not significant,
t(2120)=1.353, p=.088, thus we fail to reject the null hypothesis and conclude that there
was no significant relationship between researchers' question wording choices and the
persuasion of the study sponsor.
Table 20
Liberal t-Test between Proportions

Total number of comparisons = 2122     N    Percent    t     df     p
Observed relationships               1488    70.1    1.353  2120  .088
Relationships expected by chance     1414    66.7
The "Conservative t-Test between Proportions" showed a relationship in 34.8 percent of
the question pairs. Table 21 reveals that the t-test between proportions was not
significant, t(2120)=.838, p=.201. Therefore, we fail to reject the null hypothesis and
conclude that there was no significant relationship between researchers' wording choices
and the persuasion of the study sponsor.
Table 21
Conservative t-Test between Proportions

Total number of comparisons = 2122     N    Percent    t     df     p
Observed relationships                739    34.8    0.838  2120  .201
Relationships expected by chance      707    33.3
The "One-Way Classification Chi-Square Test" using the unbiased or neutral measure of
the relationship revealed a strong significant difference between the observed distribution
and one that would have been expected by chance if there were no true relationship,
χ²(2)=11.475, p=.003. Therefore, we reject the null hypothesis and conclude that a
relationship did exist between researchers' question wording choices and the persuasion
of the study sponsor. Table 22 shows the counts and percents for the one-way
contingency table.
Table 22
One-Way Classification Chi-Square Test
  Positive         No            Negative
Relationship   Relationship   Relationship     χ²     df     p
 739 (34.8%)    749 (35.3%)    634 (29.9%)   11.475    2    .003
The "Two-Way Classification Chi-Square Test" revealed a significant relationship
between researchers' question wording choices and the persuasion of the study sponsor,
χ²(2)=8.660, p=.013, thus we reject the null hypothesis. Gamma indicated that the
strength of the relationship was extremely small (γ=.099). Table 23 shows the counts
and row percents from the contingency table.
Table 23
Two-Way Classification Chi-Square Test
N=2122                           Persuasion of the Study Sponsor
Question Wording Choice        Liberal        None       Conservative
Favors "too little spent"    351 (34.9%)  346 (34.4%)   310 (30.8%)
Favors "too much is spent"   324 (29.1%)  403 (36.1%)   388 (34.8%)
The "t-Test between Means" revealed a significant difference in the persuasion of the
study sponsor between those who chose the question wording that favors "too little is
spent" and those who chose the wording that favors "too much is spent", t(2120)=2.810,
p=.003. A one-tailed test of significance was used because the direction of the difference
was predicted. Therefore, we reject the null hypothesis and conclude that there was a
significant relationship between the persuasion of the study sponsor and researchers'
question wording choices. Although the difference was significant, it is relatively small.
In Table 24, a mean of less than one indicates a more liberal sponsor, while a mean
greater than one indicates a more conservative sponsor.
Table 24
t-Test between Means
Question wording choice          N     Mean    SD      t     df     p
Favors "too little is spent"   1007    .959   .810   2.810  2120  .003
Favors "too much is spent"     1115   1.057   .797
Both logistic regression models had significant chi-square statistics. The "Logistic
Regression with Dummy IV's" model, χ²(2)=8.658, p=.003, was only slightly higher
than the "Logistic Regression with an Interval IV" model, χ²(1)=7.881, p=.005. From
the chi-square statistics, we would conclude that a significant relationship existed
between the persuasion of the study sponsor and researchers' question wording choices.
Table 25 reveals that the regression models increased our prediction accuracy by only 1.3
percent (from 52.5 percent to 53.8 percent). The t-statistic was the same for both models
and it was not significant, t(2120)=0.581, p=.281. Therefore, we conclude that knowing
the persuasion of the study sponsor does not significantly improve our ability to predict
which question wording choice a researcher would choose.
Table 25
Logistic Regression with Dummy IV's (t-Test) and
Logistic Regression with an Interval IV (t-Test)

DV = Question wording choice
IV = Persuasion of the study sponsor

Total variable pairs = 2122                  N    Percent    t     df     p
Correct with knowledge of the
persuasion of the study sponsor            1142    53.8    0.581  2120  .281
Correct without knowledge of the
persuasion of the study sponsor            1115    52.5
Table 26 is a summary of the conclusions that were drawn from each of the statistical
tests. The overall results are inconclusive. Whether or not one rejects (or fails to reject)
the null hypothesis depends upon the operational definition adopted by the researcher at
the onset of the study. Therefore, we are unable to draw a summary conclusion
regarding the relationship between researchers' question wording choices and the
persuasion of the study sponsor.
Table 26
Summary of Conclusions for the Second Null Hypothesis
Null hypothesis: Researchers' choices for question wording are not related to the persuasion of the study sponsor.
Operational Definition of Relationship Conclusion
Liberal t-Test between Proportions                     Fail to reject
Conservative t-Test between Proportions                Fail to reject
One-Way Classification Chi-Square Test                 Reject
Two-Way Classification Chi-Square Test                 Reject
t-Test between Means                                   Reject
Logistic Regression with Dummy IV's (Chi-Square)       Reject
Logistic Regression with Dummy IV's (t-Test)           Fail to reject
Logistic Regression with an Interval IV (Chi-Square)   Reject
Logistic Regression with an Interval IV (t-Test)       Fail to reject
The final step in testing the second hypothesis was to determine whether there were
differences between the six sets of question wording pairs. Was the effect present for
some issues, but not for others? The same nine operational definitions of "relationship"
were used to test each of the six question wording pairs. Table 27 contains a summary of
the significance levels for all fifty-four statistical tests. The effect was clearly strongest
for the cities and welfare issues.
Table 27
Probability Levels for Each of the Nine Tests Broken Down by Issue

                                                Crime   Drugs  Welfare  Cities  Blacks  Social Security
                                                N=359   N=352  N=352    N=357   N=344   N=358
Liberal t-Test between Proportions               .346    .478   .217     .183    .267    .289
Conservative t-Test between Proportions          .439    .335   .245     .204    .363    .335
One-Way Classification Chi-Square Test           .604    .651   .142     .075    .289    .384
Two-Way Classification Chi-Square Test           .485    .932   .086     .079    .392    .139
t-Test between Means                             .245    .372   .020     .014    .101    .090
Logistic Regression, Dummy IV's (Chi-Square)     .485    .933   .089     .078    .392    .142
Logistic Regression, Dummy IV's (t-Test)        1.000    .458  1.000     .419    .280   1.000
Logistic Regression, Interval IV (Chi-Square)    .488    .744   .039     .027    .201    .179
Logistic Regression, Interval IV (t-Test)       1.000    .458  1.000     .419    .280   1.000
Null Hypothesis 3: Researchers' choices for question wording are not related to their
self-assessed experience in questionnaire design.
The "Liberal t-Test between Proportions", "Conservative t-Test between Proportions",
and the "One-Way Classification Chi-Square Test" were not used for testing the third
hypothesis because they do not provide meaningful operational definitions of positive or
negative relationships. The other tests used two-tailed probabilities for significance
testing because theory does not predict the direction of the relationship.
The "Two-Way Classification Chi-Square Test" revealed a significant relationship
between researchers' self-assessed experience and their question wording choices,
χ²(2)=8.148, p=.017, thus we reject the null hypothesis. Gamma indicated a low strength
relationship (γ=.111). Closer examination of the contingency table revealed that the
significance was primarily due to the fact that self-assessed expert researchers tended to
choose the question wording that favors "too much is spent." Table 28 shows the
contingency table and column percents.
Table 28
Two-Way Classification Chi-Square Test
N=1925                              Self-Assessed Experience
Question Wording Choice        Beginner       Average        Expert
Favors "too little spent"    191 (51.5%)   460 (47.3%)   246 (42.3%)
Favors "too much is spent"   180 (48.5%)   512 (52.7%)   336 (57.7%)
The "t-Test between Means" revealed a small significant difference in researchers'
experience levels between those who chose the question wording that favors "too little is
spent" and those who chose the wording that favors "too much is spent", t(1923)=2.852,
p=.004. Therefore, we reject the null hypothesis and conclude that there was a
significant relationship between researchers' self-assessed experience and their question
wording choices. Table 29 reveals that researchers who chose the question wording that
favors "too much is spent" tended to rate themselves as more experienced than those who
chose the question wording that favors "too little is spent."
Table 29
t-Test between Means
Question wording choice          N     Mean    SD      t     df     p
Favors "too little is spent"    897   1.061   .769   2.852  1923  .004
Favors "too much is spent"     1028   1.152   .692
The "Logistic Regression with Dummy IV's" model produced a significant chi-square
statistic, χ²(3)=14.275, p=.003, thus indicating that there was a relationship between
researchers' self-assessed experience and their question wording choices. However, our
ability to predict the dependent variable increased by less than two percent (from 52.5
percent to 54.1 percent). As shown in Table 30, the t-test revealed that the improvement
was not significant, t(2120)=0.715, p=.475. Therefore, we fail to reject the null
hypothesis and conclude that knowing a researcher's self-assessed experience does not
significantly improve our ability to predict their question wording choices.
Table 30
Logistic Regression with Dummy IV's (t-Test)
DV = Question wording choice
IV = Self-assessed experience level of the researcher

Total variable pairs = 2122                    N    Percent    t     df     p
Correct with knowledge of the self-
assessed experience of the researcher        1149    54.1    0.715  2120  .475
Correct without knowledge of the self-
assessed experience of the researcher        1115    52.5
The "Logistic Regression with an Interval IV" model also revealed a significant
relationship, χ²(1)=8.119, p=.004. However, as shown in Table 31, this model was even
worse than logistic regression with dummy variables. Our ability to predict the
dependent variable improved by less than one percent (from 53.4 to 54.0 percent). Table
31 shows that this was not a significant improvement, t(1923)=.241, p=.809, and thus we
fail to reject the null hypothesis.
Table 31
Logistic Regression with an Interval IV (t-Test)
DV = Question wording choice
IV = Self-assessed experience level of the researcher
___________________________________________________________________
Total variable pairs = 1934                    N    Percent     t     df     p
___________________________________________________________________
Correct with knowledge of the self-assessed
  experience of the researcher               1139    54.0    0.241   1923  .809
Correct without knowledge of the self-assessed
  experience of the researcher               1028    53.4
___________________________________________________________________
Table 32 is a summary of the conclusions that were drawn from each of the statistical
tests. Regardless of whether researchers' self-assessed experience is viewed as an ordinal
or interval scale (i.e., nonparametric or parametric), the results are mixed. The predictive
models failed miserably, yet the parametric t-test and the nonparametric chi-square test
showed a significant relationship.
Table 32
Summary of Conclusions for the Third Null Hypothesis
Null hypothesis: Researchers' choices for question wording are not related to their
self-assessed experience in questionnaire design.
______________________________________________________________________
Operational Definition of Relationship                      Conclusion
______________________________________________________________________
Two-Way Classification Chi-Square Test                      Reject
t-Test between Means                                        Reject
Logistic Regression with Dummy IV's (Chi-Square)            Fail to reject
Logistic Regression with Dummy IV's (t-Test)                Fail to reject
Logistic Regression with an Interval IV (Chi-Square)        Fail to reject
Logistic Regression with an Interval IV (t-Test)            Fail to reject
______________________________________________________________________
The final step in testing the third hypothesis was to determine whether there were
differences between the six sets of question wording pairs. The same six operational
definitions of "relationship" were used to test each of the six question wording pairs.
Table 33 contains a summary of the significance levels for all thirty-six statistical tests.
The relationship between the self-assessed experience and question wording choice was
strongest (and often significant) for the crime and Social Security issues.
Table 33
Probability Levels for Each of the Six Tests Broken Down by Issue
______________________________________________________________________
                                                               Social
                              Crime   Drugs  Welfare  Cities  Blacks  Security
                              N=326   N=319  N=319    N=324   N=312   N=325
______________________________________________________________________
Two-Way Classification
  Chi-Square Test              .059    .462   .911     .680    .316    .037
t-Test between Means           .020    .272   .696     .831    .174    .010
Logistic Regression with
  Dummy IV's (Chi-Square)      .040    .388   .957     .576    .115    .073
Logistic Regression with
  Dummy IV's (t-Test)         1.000    .567  1.000    1.000    .346   1.000
Logistic Regression with an
  Interval IV (Chi-Square)     .020    .271   .695     .831    .173    .010
Logistic Regression with an
  Interval IV (t-Test)        1.000    .869  1.000    1.000    .660   1.000
______________________________________________________________________
Multivariate Models to Test the First Three Null Hypotheses
As a final test of the first three hypotheses, two multivariate models were developed.
The dependent variable for both models was the question wording choice and the three
independent variables were the 1) personal opinion of the researcher, 2) persuasion of the
study sponsor, and 3) self-assessed experience level of the researcher. One logistic
regression model viewed the independent variables as ordinal data and the other interval
data. Dummy variables were created for the model where the independent variables was
viewed as ordinal. Thus, the models cover both parametric and nonparametric
interpretations of the independent variables. Both the chi-square statistic and the t-test
between proportions were examined.
The "Multivariate Logistic Regression with Dummy IV's" model produced a significant
chi-square statistic, χ²(8)=88.715, p<.001. Table 34 shows that the t-test between
proportions was also significant, t(2120)=2.563, p=.008. We conclude that knowing the
independent variables improved our ability to predict researchers' question wording
choices by about six percent and that this improvement was significant.
Table 34
Multivariate Logistic Regression with Dummy IV's
DV = Question wording choice
IV's = Researchers' personal opinions
       Persuasion of the study sponsor
       Self-assessed experience level of the researcher
___________________________________________________________________
Total variable pairs = 2122                    N    Percent     t     df     p
___________________________________________________________________
Correct with knowledge of all
  three independent variables                1239    58.4    2.563   2120  .008
Correct without knowledge of the
  independent variables                      1115    52.5
___________________________________________________________________
The "Multivariate Logistic Regression with Interval IV's" model also revealed a
significant relationship, χ²(3)=70.498, p<.001. Table 35 shows that this model
performed slightly better than logistic regression with dummy variables. Knowledge of
the independent variables significantly improved our ability to predict the dependent
variable by nearly seven percent, t(1741)=2.746, p=.003.
Table 35
Multivariate Logistic Regression with Interval IV's
DV = Question wording choice
IV's = Researchers' personal opinions
       Persuasion of the study sponsor
       Self-assessed experience level of the researcher
___________________________________________________________________
Total variable pairs = 1743                    N    Percent     t     df     p
___________________________________________________________________
Correct with knowledge of all
  three independent variables                1038    59.6    2.746   1741  .003
Correct without knowledge of the
  independent variables                       917    52.6
___________________________________________________________________
Table 36 shows the logistic regression coefficients and probabilities for each of the
independent variables. Since all three variables were scaled from zero to two, the
coefficients represent the relative importance of the variables in the prediction model.
Researchers' personal opinions were the most important, the persuasion of the study
sponsor was second, and the self-assessed experience was least important in the
prediction model. However, all three variables made significant contributions.
Table 36
Logistic Regression Coefficients for the Multivariate Model with Interval IV's
DV = Question wording choice
______________________________________________________________________
Independent Variable          Coefficient   Std. Error   T-Ratio   Prob.
______________________________________________________________________
Experience Level                 .224          .071       3.157    .002
Persuasion of the sponsor        .249          .061       4.087    .000
Personal opinion                 .402          .063       6.405    .000
______________________________________________________________________
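The T-Ratio column in Table 36 is each coefficient divided by its standard error, and the probability follows from the large-sample normal approximation. The check below recovers those values from the printed coefficients; small discrepancies reflect rounding in the published table.

```python
from scipy.stats import norm

# Coefficients and standard errors as printed in Table 36
rows = {
    "Experience Level":          (0.224, 0.071),
    "Persuasion of the sponsor": (0.249, 0.061),
    "Personal opinion":          (0.402, 0.063),
}

for name, (b, se) in rows.items():
    t_ratio = b / se                  # Wald ratio, e.g. .224/.071
    p = 2 * norm.sf(abs(t_ratio))     # two-tailed probability
    print(f"{name}: t={t_ratio:.3f}, p={p:.3f}")
```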
Null Hypothesis 4: There are no differences between beginning and advanced
researchers with respect to the degree to which they incorporate their personal opinions
and those of the sponsor into their choice of question wording.
There are two parts to this null hypothesis. The first part addresses the degree to which
researchers incorporate their personal opinions and whether there are differences
depending on the self-assessed experience levels of the researchers. The second part
looks at the degree to which researchers incorporate the persuasion of the study sponsor
and whether there are differences depending on the self-assessed experience levels of the
researchers. Seven tests were used to examine both parts of this null hypothesis. These
tests use the same three operational definitions of "relationship" (i.e., liberal,
conservative, and unbiased) that were used to test the first two hypotheses.
A two-way contingency table was prepared using the "Liberal t-Test between
Proportions" and the self-assessed experience level of the researchers. The chi-square
was significant, χ²(2)=6.045, p=.049, thus we reject the null hypothesis and conclude that
there was a significant difference between beginning and advanced researchers with
respect to the degree to which they incorporated their personal opinions into their
question wording choices. Expert researchers were less likely to incorporate their
opinions than beginning or average researchers. Gamma indicated a weak
relationship (γ=-.097). Table 37 shows the counts and column percents for the
contingency table.
Table 37
Two-Way Classification Chi-Square Test using the Liberal Definition of the Relationship
______________________________________________________________________
                                      Self-Assessed Experience
                            _________________________________________
N=1743                        Beginner       Average        Expert
______________________________________________________________________
No observed relationship     90 (26.8%)    240 (26.9%)   168 (32.7%)
Observed relationship       246 (73.2%)    653 (73.1%)   346 (67.3%)
______________________________________________________________________
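Goodman and Kruskal's gamma compares concordant and discordant pairs in an ordered contingency table. A minimal sketch of that computation, applied to the counts in Table 37, reproduces the reported value of -.097:

```python
def goodman_kruskal_gamma(table):
    """Gamma for a contingency table with ordered rows and columns."""
    nr, nc = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(nr):
        for j in range(nc):
            for k in range(i + 1, nr):
                for m in range(nc):
                    if m > j:
                        concordant += table[i][j] * table[k][m]
                    elif m < j:
                        discordant += table[i][j] * table[k][m]
    return (concordant - discordant) / (concordant + discordant)

# Counts from Table 37: rows = (no relationship, relationship),
# columns = (beginner, average, expert)
gamma = goodman_kruskal_gamma([[90, 240, 168], [246, 653, 346]])
print(round(gamma, 3))   # -0.097
```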
A two-way contingency table was prepared using the "Conservative t-Test between
Proportions" and the self-assessed experience level of the researchers. The chi-square
was not significant, χ²(2)=4.285, p=.117, thus we fail to reject the null hypothesis and
conclude that there was no significant difference between beginning and advanced
researchers with respect to the degree to which they incorporated their personal opinions
into their question wording choices. Gamma was very close to zero (γ=.005). Table 38
shows the counts and column percents for the contingency table.
Table 38
Two-Way Classification Chi-Square Test using the Conservative Definition of the
Relationship
______________________________________________________________________
                                      Self-Assessed Experience
                            _________________________________________
N=1743                        Beginner       Average        Expert
______________________________________________________________________
No observed relationship    214 (63.7%)    517 (57.9%)   318 (61.9%)
Observed relationship       122 (36.3%)    376 (42.1%)   196 (38.1%)
______________________________________________________________________
A two-way contingency table using the unbiased or neutral measure of the relationship
revealed a weak significant relationship between researchers' self-assessed experience
levels and the degree to which they incorporated their personal opinions into their
question wording choices. The chi-square was significant, χ²(4)=10.933, p=.027, thus
we reject the null hypothesis and conclude that there was a significant relationship.
Close analysis of Table 39 reveals that the relationship was due mostly to the fact that
experienced researchers more often chose question wording that favored a response
opposite their own opinions, and beginning researchers were more likely to show no
relationship between their personal opinions and their question wording choices.
Table 39
Two-Way Classification Chi-Square Test using the Unbiased Definition of the Relationship
______________________________________________________________________
                                      Self-Assessed Experience
                            _________________________________________
N=1743                        Beginner       Average        Expert
______________________________________________________________________
Negative relationship        90 (26.8%)    240 (26.9%)   168 (32.7%)
No relationship             124 (36.9%)    277 (31.0%)   150 (29.2%)
Positive relationship       122 (36.3%)    376 (42.1%)   196 (38.1%)
______________________________________________________________________
A t-test between means was performed to compare beginning and advanced researchers
on the strength of the relationship between their personal opinions and their question
wording choices. The unbiased or neutral measure of the relationship was compared
between beginning and advanced researchers. Table 40 shows that the t-statistic was not
significant, t(848)=.708, p=.240, thus, we fail to reject the null hypothesis and conclude
that there was no significant difference between beginning and advanced researchers with
respect to the degree to which they incorporated their personal opinions into their question
wording choices. If we were to adopt an ordinal interpretation of the unbiased scale, the
Mann-Whitney U statistic supports the same conclusion, U(848)=88444, p=.550.
Table 40
t-Test between Means using the Unbiased Definition of the Relationship
______________________________________________________________________
Self-Assessed Experience       N     Mean     SD      t      df      p
______________________________________________________________________
Beginner                      336    .095   .790    0.708    848   .240
Advanced                      514    .054   .707
______________________________________________________________________
The same hypothesis was tested using a one-way analysis of variance. The dependent
variable was the unbiased measure of the relationship between researchers' personal
opinions and their question wording choices. The factor variable was the self-assessed
experience level of the researchers and the three levels were beginning, average, and
advanced. Table 41 shows that the F-ratio was not significant, F(2,1740)=2.420, p=.089,
thus, we fail to reject the null hypothesis. If we were to adopt an ordinal interpretation of
the unbiased scale, the Kruskal-Wallis test statistic supports the same conclusion,
KW(2)=4.180, p=.124.
Table 41
One-Way ANOVA using the Unbiased Definition of the Relationship
DV = Unbiased definition of the relationship
______________________________________________________________________
                                     Sum of      Mean
Source of Variation          df      Squares    Squares      F       p
______________________________________________________________________
Self-Assessed Experience       2       3.245     1.623     2.420   .089
Error                       1740    1166.715     0.671
Total                       1742    1169.960
______________________________________________________________________
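The F-ratio and probability in Table 41 can be verified directly from the printed sums of squares: F is the ratio of the mean squares, and p comes from the F distribution with (2, 1740) degrees of freedom.

```python
from scipy.stats import f

# Values as printed in Table 41
ss_between, df_between = 3.245, 2
ss_error, df_error = 1166.715, 1740

ms_between = ss_between / df_between   # mean square between groups
ms_error = ss_error / df_error         # mean square error
F = ms_between / ms_error              # 2.420
p = f.sf(F, df_between, df_error)      # .089
print(round(F, 3), round(p, 3))
```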
A summary of the first five tests is shown in Table 42. The tests looked at the degree to
which researchers incorporated their personal opinions and whether there were
differences depending on their self-assessed experience levels. The results of these tests
are inconclusive. Researchers can draw different conclusions depending on which
statistical method they choose.
Table 42
Summary of Conclusions for the Fourth Null Hypothesis
Null hypothesis: There are no differences between beginning and advanced researchers
with respect to the degree to which they incorporate their personal opinions into their
question wording choices.
______________________________________________________________________
Operational Definition of Relationship                      Conclusion
______________________________________________________________________
Two-Way Classif. Chi-Square Test - Liberal Definition       Reject
Two-Way Classif. Chi-Square Test - Conservative Definition  Fail to reject
Two-Way Classif. Chi-Square Test - Unbiased Definition      Reject
t-Test between Means - Unbiased Definition                  Fail to reject
One-Way ANOVA - Unbiased Definition                         Fail to reject
______________________________________________________________________
The second set of seven tests looked at the degree to which researchers incorporated the
persuasion of the study sponsor and whether there were differences depending on the self-
assessed experience levels of the researchers. These tests use the same three operational
definitions of "relationship" (i.e., liberal, conservative, and unbiased) as the previous
five.
A two-way contingency table was prepared using the "Liberal t-Test between
Proportions" and the self-assessed experience level of the researchers. The chi-square
was not significant, χ²(2)=3.504, p=.173, thus, we fail to reject the null hypothesis and
conclude that there was no significant difference between beginning and advanced
researchers with respect to the degree to which they incorporated the persuasion of the
study sponsor into their question wording choices. Table 43 shows the counts and
column percents for the contingency table.
Table 43
Two-Way Classification Chi-Square Test using the Liberal Definition of the Relationship
______________________________________________________________________
                                      Self-Assessed Experience
                            _________________________________________
N=1925                        Beginner       Average        Expert
______________________________________________________________________
No observed relationship     98 (26.4%)    277 (28.5%)   185 (31.8%)
Observed relationship       273 (73.6%)    695 (71.5%)   397 (68.2%)
______________________________________________________________________
A two-way contingency table was prepared using the "Conservative t-Test between
Proportions" and the self-assessed experience level of the researchers. The chi-square
was significant, χ²(2)=7.944, p=.019, thus we reject the null hypothesis and conclude that
there was a significant difference between beginning and advanced researchers with
respect to the degree to which they incorporated the persuasion of the study sponsor.
Gamma indicated a very weak negative relationship (γ=-.034). Table 44 shows the
counts and column percents for the contingency table.
Table 44
Two-Way Classification Chi-Square Test using the Conservative Definition of the
Relationship
______________________________________________________________________
                                      Self-Assessed Experience
                            _________________________________________
N=1925                        Beginner       Average        Expert
______________________________________________________________________
No observed relationship    247 (66.6%)    592 (60.9%)   392 (67.4%)
Observed relationship       124 (33.4%)    380 (39.1%)   190 (32.6%)
______________________________________________________________________
A two-way contingency table using the unbiased or neutral measure of the relationship
revealed a weak significant relationship between researchers' self-assessed experience
levels and the degree to which they incorporated the persuasion of the study sponsor into
their question wording choices. The chi-square was significant, χ²(4)=12.317, p=.015,
thus we reject the null hypothesis and conclude that there was a significant relationship.
Gamma showed a weak negative relationship (γ=-.048). The counts and column
percents are displayed in contingency Table 45.
Table 45
Two-Way Classification Chi-Square Test using the Unbiased Definition of the Relationship
______________________________________________________________________
                                      Self-Assessed Experience
                            _________________________________________
N=1925                        Beginner       Average        Expert
______________________________________________________________________
Negative relationship        98 (26.4%)    277 (28.5%)   185 (31.8%)
No relationship             149 (40.2%)    315 (32.4%)   207 (35.6%)
Positive relationship       124 (33.4%)    380 (39.1%)   190 (32.6%)
______________________________________________________________________
A t-test between means was performed to compare beginning and advanced researchers
on the strength of the relationship between their question wording choices and the
persuasion of the study sponsor. The unbiased or neutral measure of the relationship was
compared between beginning and advanced researchers. Table 46 shows that the t-
statistic was not significant, t(951)=1.170, p=.121, thus, we fail to reject the null
hypothesis and conclude that there was no significant difference between beginning and
advanced researchers with respect to the degree to which they incorporated the
persuasion of the sponsor into their question wording choices. If we were to adopt an
ordinal interpretation of the unbiased scale, the Mann-Whitney U statistic supports the
same conclusion, U(951)=112439.5, p=.280.
Table 46
t-Test between Means using the Unbiased Definition of the Relationship
______________________________________________________________________
Self-Assessed Experience       N     Mean     SD      t      df      p
______________________________________________________________________
Beginner                      371    .070   .771    1.170    951   .121
Advanced                      582    .009   .803
______________________________________________________________________
The same hypothesis was tested using a one-way analysis of variance. The dependent
variable was the unbiased measure of the relationship between the persuasion of the
study sponsor and researchers' question wording choices. The factor variable was the self-assessed
experience level of the researchers and the three levels were beginning, average, and
advanced. Table 47 shows that the F-ratio was not significant, F(2,1922)=2.673, p=.069,
thus we fail to reject the null hypothesis. If we were to adopt an ordinal interpretation of
the unbiased scale, the Kruskal-Wallis test statistic supports the same conclusion,
KW(2)=4.868, p=.088.
Table 47
One-Way ANOVA using the Unbiased Definition of the Relationship
DV = Unbiased definition of the relationship
______________________________________________________________________
                                     Sum of      Mean
Source of Variation          df      Squares    Squares      F       p
______________________________________________________________________
Self-Assessed Experience       2       3.452     1.726     2.673   .069
Error                       1922    1241.220     0.646
Total                       1924    1244.672
______________________________________________________________________
A summary of the second five tests is shown in Table 48. The tests looked at the
degree to which researchers incorporated the persuasion of the sponsor and whether
there were differences depending on their self-assessed experience levels. The results
of these tests are inconclusive. Researchers can draw different conclusions depending on
which statistical method is chosen.
Table 48
Summary of Conclusions for the Fourth Null Hypothesis
Null hypothesis: There are no differences between beginning and advanced researchers
with respect to the degree to which they incorporate the persuasion of the study sponsor
into their question wording choices.
______________________________________________________________________
Operational Definition of Relationship                      Conclusion
______________________________________________________________________
Two-Way Classif. Chi-Square Test - Liberal Definition       Fail to reject
Two-Way Classif. Chi-Square Test - Conservative Definition  Reject
Two-Way Classif. Chi-Square Test - Unbiased Definition      Reject
t-Test between Means - Unbiased Definition                  Fail to reject
One-Way ANOVA - Unbiased Definition                         Fail to reject
______________________________________________________________________
Additional Findings
A final question that we might ask is whether or not respondents consistently chose one
form of question wording over the other. For example, some respondents might have
consistently selected the question wording that elicits more "too little is spent" responses,
while others selected the question wording that elicits more "too much is spent"
responses.
A scale was constructed to examine whether or not respondents consistently chose the
same form of question wording. The scale was a simple count of the number of times
that a respondent selected the question wording that favored "too little is spent." Thus,
the scale could vary from zero to six, where a zero indicated that a respondent never
selected that wording and a six indicated that they always selected the wording.
If we view each question wording pair as an independent comparison, then it becomes
easy to calculate the expected distribution if only chance is operating. A good analogy is
to view each question wording pair like the flip of a coin. For any given comparison,
there is a fifty-fifty probability that the respondent will select a particular question
wording. The binomial expansion of (a + b)^6 provides coefficients for determining the
expected proportion of respondents at each of the seven possible scale values (zero
through six). A chi-square test was used
to determine whether the distribution of the scale was significantly different than the one
predicted by the binomial distribution.
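Under the chance model, the expected proportion of respondents at scale value k is C(6,k)/2^6. Multiplying by N=361 gives the binomial predictions that appear in Table 49:

```python
from math import comb

N = 361
# Expected count at each scale value 0..6 under a fair-coin model
expected = [N * comb(6, k) / 2**6 for k in range(7)]
print([round(e) for e in expected])   # [6, 34, 85, 113, 85, 34, 6]
```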
Table 49 shows the contingency table for the comparison of the predicted binomial
distribution with the number of times that respondents selected the question wording that
favored "too little is spent." The chi-square was highly significant, χ²(6)=85.216,
p<.001, thus indicating that the distribution was different than would be expected by
chance. Close examination of the table shows that respondents tended to favor one form
of question wording or the other. Instead of approximating the binomial distribution, the
scale was relatively flat.
Table 49
Two-Way Contingency Table Comparing Responses to the Binomial Distribution
______________________________________________________________________
              Number of comparisons where respondent selected
           wording that favors a response of "too little is spent"
         _____________________________________________________________
N=361         0        1        2        3        4        5        6
______________________________________________________________________
Observed     46       59       62       61       49       59       25
          (12.7%)  (16.3%)  (17.2%)  (16.9%)  (13.6%)  (16.3%)   (6.9%)
Binomial
prediction    6       34       85      113       85       34        6
           (1.6%)   (9.4%)  (23.4%)  (31.3%)  (23.4%)   (9.4%)   (1.6%)
______________________________________________________________________
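The chi-square of 85.2 corresponds to treating the observed and predicted counts as the two rows of a 2×7 contingency table. A sketch with scipy, using the rounded predictions from Table 49 (so the statistic comes out slightly below the published 85.216):

```python
from scipy.stats import chi2_contingency

observed = [46, 59, 62, 61, 49, 59, 25]    # from Table 49
predicted = [6, 34, 85, 113, 85, 34, 6]    # rounded binomial counts

# 2 x 7 table: degrees of freedom = (2-1)(7-1) = 6
chi2, p, dof, _ = chi2_contingency([observed, predicted])
print(round(chi2, 1), dof)
```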
Descriptive statistics were also computed for the scale. The mean of the scale
was close to three (M=2.79, SD=1.83) and the skewness was near zero (SK=.08),
therefore, we conclude that the distribution was not heavily lopsided one way or the
other. However, the kurtosis indicated that the distribution was rather flat (k=1.88) and
the Kolmogorov-Smirnov statistic for normality confirmed that the shape of the
distribution was significantly different than the normal bell-shaped curve (KS=2.47,
p<.01).
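These descriptive statistics can be reconstructed from the observed counts in Table 49 by expanding them into individual scale scores. A sketch, assuming the kurtosis is the non-excess form the text reports and that the KS value of 2.47 is the D statistic multiplied by the square root of N:

```python
import numpy as np
from scipy.stats import skew, kurtosis, kstest

counts = [46, 59, 62, 61, 49, 59, 25]       # Table 49, scale values 0..6
data = np.repeat(np.arange(7), counts)      # 361 individual scale scores

m, sd = data.mean(), data.std(ddof=1)       # M=2.79, SD=1.83
sk = skew(data)                             # SK=.08
k = kurtosis(data, fisher=False)            # k=1.88 (flat distribution)
d = kstest(data, "norm", args=(m, sd)).statistic
ks = d * np.sqrt(len(data))                 # compare to the reported KS=2.47
print(round(m, 2), round(sd, 2), round(sk, 2), round(k, 2), round(ks, 2))
```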
Cronbach's alpha provides a measure of the internal consistency among a group of items.
In most testing instruments, we would be pleased to see that a group of items had high
reliability. For example, if a teacher said that a test was highly reliable, it would mean
that most of the items in the instrument discriminated well between students who knew
the material and those who didn't. In this study though, high "reliability" means that
respondents were likely to consistently favor one form of question wording. Thus, a
rather dramatic paradox becomes apparent. For this study, high reliability means high
bias and low reliability means low bias. A researcher who consistently selects a specific
form of question wording would produce a survey that contains the greatest bias. A
researcher who chooses questions at random would produce a survey with the least bias.
Cronbach's alpha was .69, a moderately high value. This provides further support that
researchers favored one form of question wording or the other.
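Cronbach's alpha for the six binary wording-choice items is computed from the item variances and the variance of the summed scale. A minimal sketch (the raw 361×6 response matrix is not reproduced here, so the demonstration data are illustrative):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a respondents-by-items matrix (e.g. 361 x 6)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the summed scale
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Perfectly consistent choices across two items yield alpha = 1.0 --
# maximum "reliability," which in this study would mean maximum bias.
print(cronbach_alpha([[0, 0], [1, 1], [0, 0], [1, 1], [1, 1]]))   # 1.0
```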
CHAPTER V
Conclusions and Recommendations
Summary
A review of the literature revealed that there can be large differences in the way that
people respond to public opinion surveys depending on the phraseology of the questions.
Seemingly minor changes in question wording can have enormous impact on people's
responses. In some cases, researchers would draw different conclusions on an issue
depending on their choice of question wording.
This study examined the degree to which researchers incorporated their own opinions
and those of the study sponsor into their question wording choices. It was hypothesized
that researchers unknowingly select phraseology that produces public response
supportive of their own opinions, or those of the study sponsor. The literature provided
many examples of studies that examined public response to question wording
alternatives, but none were found that looked at the role of the researcher in the creation
of those questions.
The purpose of this study was to determine whether or not survey researchers unknowingly
influence the results of a survey through their question wording choices. This study
tested four null hypotheses using a variety of statistical techniques. The null hypotheses
were:
1. Researchers' choices for question wording are not related to their personal
opinions.
2. Researchers' choices for question wording are not related to the persuasion of the
study sponsor.
3. Researchers' choices for question wording are not related to their self-assessed
experience in questionnaire design.
4. There is no difference between beginning and advanced researchers with respect
to the degree to which they incorporate their personal opinions and those of the
sponsor into their choice of question wording.
A survey was designed using six pairs of social issue labels that had been studied by
Rasinski over a three-year period (1984-86). These question wording pairs were selected
because they were known to evoke different responses from the public.
The survey was mailed to 953 people who had some involvement in the survey research
process. A short vignette told respondents that they had been hired to design a public
opinion survey. One third of the respondents was told that the sponsor for their study
was a conservative anti-spending group, another third was told that their sponsor was a
liberal pro-spending group, and for the last third, no sponsor was identified. The vignette
explained that the respondent might also be hired to conduct additional surveys in the
future; however, the director of the organization specifically told them, "We really need
to know the truth, so be objective."
The survey presented both forms of question wording and asked respondents which one
they would use in their own survey. It also asked how they would personally answer the
question wording that they had selected. This was repeated for each of the six social
issues.
A variety of different statistical techniques were used to test each of the hypotheses.
Each of the tests provided a different operational definition for the "relationship"
between the variables. The primary techniques were contingency table analysis using the
chi-square statistic and gamma, the student's t-test between means, the one-sample t-test
between proportions, and logistic regression analysis.
Conclusions
A total of 361 usable surveys was returned. The response rate was about 40 percent, a
respectable return for mail surveys without follow-up mailings. There were no
significant differences in response rates depending on which of the three forms the
respondent received.
This study used an untraditional approach to elicit comments. The cover letter simply
asked respondents to write comments anywhere on the questionnaire. About 28 percent
of the respondents wrote an average of 2.4 comments on the survey itself. The
comments clearly show that respondents believed that objectivity is important in survey
research. Many of the respondents suggested a third wording alternative, presumable one
that was more objective. Others directly addressed the ways in which the question
wording alternatives were biased.
Nine different statistical tests were used to determine whether researchers' question
wording choices were related to their personal opinions. A one-tailed test of significance
was used because the direction of the relationship was predicted. The results of all nine
tests were in agreement. We therefore conclude that there was a significant positive
relationship between researchers' question wording choices and their personal opinions.
In this study, researchers decidedly selected the question wording that would sway public
response to favor their own opinion and the effect was quite substantial. Knowledge of
the personal opinion of a researcher improved our ability to predict their choice of
question wording by about seven percent when compared to only knowing which
wording was the most popular choice among respondents. The effect was found to exist
for each of the question wording pairs, although it was weaker for the drugs issue.
The same nine tests were performed for the second hypothesis to determine if
researchers' choices for question wording were related to the persuasion of the study
sponsor. The results of these tests were inconclusive. Five tests told us to reject the null
hypothesis and four told us to fail to reject the null hypothesis. Researchers' choices for
question wording may or may not be related to the persuasion of the sponsor. Our
conclusions depend upon the particular mathematical construct selected to evaluate the
relationship. The effect seemed to be most prominent for the cities, welfare, and Social
Security issues. Even though the effect was often significant, it was very small.
Knowledge of the persuasion of the sponsor increased our ability to predict a
respondent's question wording by 1.3 percent when compared to only knowing which
wording was the most popular choice among respondents.
Six statistical tests were used to evaluate the third hypothesis to determine if researchers'
choices for question wording were related to their self-assessed experience in
questionnaire design. The results of these tests were also inconclusive. Two tests told us
to reject the null hypothesis and four told us to fail to reject the null hypothesis. The
effect seemed strongest for the crime and Social Security issues. Even though the effect
was significant, it was very small. Knowledge of the self-assessed experience level of a
respondent increased our ability to predict their question wording choice by 0.6 percent
when compared to only knowing which wording was the most popular choice among
respondents.
A multivariate logistic regression model was created to examine the relationship between
respondents' question wording choices and the combined effects of their personal
opinions, self-assessed experience, and the persuasion of the study sponsor. The model
significantly improved our ability to predict respondents' question wording choices by
about seven percent compared to only knowing which wording was the most popular
choice among respondents, χ²(3)=70.5, p<.001. This was about the same as the logistic
regression model that only used respondents' personal opinions as the independent
variable. While the persuasion of the study sponsor and self-assessed experience were
significant, they did not improve our prediction ability in the multivariate model.
The fourth hypothesis was tested in two parts. The first part was to determine if there
was a difference between beginning and advanced researchers with respect to the degree
to which they incorporated their personal opinions into their choice of question wording.
Five different statistical tests were used to test this hypothesis. The results were
inconclusive. Two tests told us to reject the null hypothesis and three told us to fail to
reject the null hypothesis. The second part was to determine if there was a difference
between beginning and advanced researchers with respect to the degree to which they
incorporated the persuasion of the study sponsor into their choice of question wording.
The results were also inconclusive. Two tests told us to reject the null hypothesis and
three told us to fail to reject the null hypothesis.
A final finding of this study is that respondents tended to favor one form of question
wording over the other. Some respondents consistently selected the question wording that
elicited more "too little is spent" responses, while a slightly greater number selected the
wording that elicited more "too much is spent" responses. The tendency of respondents to
consistently choose one form of question wording was significantly different from what we
would have expected from the binomial expansion, χ2(6) = 85.2, p < .001.
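The consistency test above can be sketched as follows: count, for each respondent, how many of the six items were answered with one wording form, and compare the distribution of those counts to the binomial expectation under independent choices. The observed counts below are made up for illustration; only the method mirrors the text:

```python
# Hypothetical sketch of the binomial consistency test. Each respondent made
# six wording choices; under the null hypothesis of independent 50/50 choices,
# the number of form-"a" selections per respondent follows Binomial(6, 0.5).
from math import comb

def binomial_expected(n_respondents, n_items=6, p=0.5):
    """Expected number of respondents choosing form 'a' on k of n_items."""
    return [n_respondents * comb(n_items, k) * p**k * (1 - p)**(n_items - k)
            for k in range(n_items + 1)]

def chi_square(observed, expected):
    """Pearson chi-square statistic across the k = 0..6 count categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Made-up observed counts for 320 respondents, piled up at the extremes
# (consistent choosers) relative to the binomial's central hump.
observed = [20, 30, 45, 70, 45, 50, 60]
expected = binomial_expected(sum(observed))  # [5, 30, 75, 100, 75, 30, 5]
stat = chi_square(observed, expected)
```

A large statistic, as here, indicates more respondents at the all-"a" and all-"b" extremes than chance would produce, which is the pattern the study reports.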
Discussion and Recommendations
Many studies have demonstrated that question wording differences can evoke radically
different response distributions for public opinion surveys. The effect can be large and
researchers might draw different conclusions about the same issue, depending on their
choice of question wording. This study found that researchers tended to choose question
wording that favored their personal opinions. The effect was less pronounced for
advanced researchers; however, regardless of which statistical tests were used, our
conclusions remained the same.
It is important to point out that this form of bias would not cause a correspondingly linear
effect in a public opinion poll. Once a researcher chooses a particular form of question
wording, the change in public opinion corresponds to the cognitive aspects of that
question. Thus, changes in question wording that seem small to researchers can cause
large changes in public opinion response distributions.
An important finding of this study was that the relationship between researchers' question
wording choices and personal opinions exists in a delicate balance. Some researchers
(more than expected by chance) clearly favored the question wordings that would elicit
more "too much is spent" responses, while a slightly smaller number (also more than
expected by chance) favored the question wordings that would elicit more "too little is
spent" responses. Researchers generally leaned one way or the other, but they did so in
balance with other researchers. If we were to look at a survey created by any individual
researcher, we would most likely find that it consistently favored one particular form of
question wording. However, the "sum" of all such surveys created by "all" researchers
would nearly cancel the bias.
It is the conclusion of this study that a survey created by a single researcher is most likely
to be biased in a way that corresponds with that researcher's own opinion. In the best of
all worlds, a way to overcome that bias might be to have a team of independent
researchers create the questions for a survey and then use them as multiple measures of
the phenomena being studied. Of course, in most situations, it is not practical or
financially feasible to involve a number of researchers in the survey design process, but
there may be other alternatives that are also effective.
One method suggested by a couple of respondents was to hold a focus group of potential
respondents. The idea, of course, was not to assess the overall opinion of the public,
but instead, to use the knowledge from the focus group to help shape the contents of the
public opinion survey. The idea is excellent, although not without problems. The main
problem with focus groups is that the opinions of the extroverted members of the group
tend to muffle the voices of those who are more introverted; shy members' opinions often
remain unspoken. Experienced moderators can minimize this effect, but it is still
present. Second, focus groups are usually quite small, consisting of eight to twelve
members. Minority opinions that are shared by fewer than ten or fifteen percent of the
population are likely to be missed altogether. Third, focus groups are expensive. The
cost of a facility, moderator, and remuneration to respondents is between two and three
thousand dollars per focus group, a substantial burden for many survey budgets.
A purely speculative approach might be to adapt the Delphi forecasting technique to the
design of surveys. A number of researchers could be surveyed through the mail to find
out what questions they would ask on a public opinion survey. The responses would be
collated and a summary would be prepared and returned to the researchers. They would
be asked again to design the questions, this time, with knowledge of the question
wording selected by the other researchers. The second round of the Delphi technique
usually results in a consensus or centering of opinions. There are two advantages that the
Delphi technique might have over the focus group. The first is that each person's opinion
is heard by all the other members; extroverted individuals have no advantage over
introverts. Thus, the Delphi technique is more likely to produce a wider range of
opinions. Second, the Delphi method would be less expensive than an equivalent focus
group. Some readers may argue that a focus group is more effective at capturing the
"flavor of the response"; however, this study found that researchers' comments were
abundant and uninhibited.
Another recommendation of this study is that researchers need to use multiple measures
of a phenomenon. When public opinion researchers ask a single question about an issue,
their conclusions contain the bias inherent in the question wording they use. The only
way to study the degree of bias is to compare it to public response to other questions on
the same issue. Even then, the bias can only be expressed relative to other questions.
There is no absolute measure of this form of bias.
When we are presented with the results from a survey, how do we know the nature of the
question wording bias, in the absence of other information? The answer is that we don't.
If the survey was prepared by a single researcher, then it probably reflects the personal
opinion of the researcher to some degree, but there is no way that we can ascertain the
nature of that bias without conducting additional research. We can only predict that it is
there.
There is also the issue of how to interpret multiple measures of an issue. One respondent
to this survey suggested that multiple measures be summed. This method would be
effective only when the question bias is balanced (i.e., half of the questions are biased
one way and half the other way). If a single researcher had designed a questionnaire, we
would not expect the bias to be balanced. On the contrary, the results of this study
indicate that we would expect a fairly strong bias in one direction or the other, and
without additional research, we could not even determine the direction of the bias, let
alone its strength. The idea of summing multiple measures of an issue is appealing, but
probably inappropriate in most cases. Each form of question wording taps a different
cognitive aspect of an issue. Rather than trying to combine multiple measures into a
single unified answer, it might be better to embrace the idea of complexity and focus on
the interrelationships among the various measures. This study does not provide guidance
in this area, and progress is most likely to come from the field of cognitive psychology.
We were not able to draw conclusions for any of the other hypotheses. Or more
correctly, we would have drawn different conclusions depending on the specific
statistical technique we had chosen to test the hypotheses. Furthermore, it didn't make
any difference whether we had selected a parametric or nonparametric test, or whether
we called a scale ordinal or interval. The results were mixed without any discernible
pattern. This, in itself, is a disturbing finding. At the outset of any research project, the
investigator selects a statistical methodology appropriate to the data to be collected, and
this becomes the operational definition of the phenomenon being studied. Researchers
using different definitions of a phenomenon might come to different conclusions.
When the phenomenon was relatively strong, such as the relationship between question
wording choice and personal opinion, all statistical methods were in agreement.
However, when the phenomenon was weak, some tests called it significant and others
didn't. The explanation, of course, lies in the fact that each of the tests was measuring
something slightly different from the others. The disturbing part is that regardless of
which test researchers selected, their conclusions would be technically correct. Yet, they
were all attempting to measure the same phenomenon.
To the layperson, "significance" and "importance" are synonymous, but to the scientist,
they have entirely different meanings. Significance refers to the confidence that we
have in our findings. If we say that a significant relationship exists, it means that we are
very sure that there really is a relationship. It does not tell us directly about the strength
or magnitude of the relationship. When the sample size becomes sufficiently large (as in
this study), it becomes possible to detect very weak relationships. Furthermore, these
weak relationships might be "statistically significant" because we are very sure that they
exist. The larger the sample size, the smaller the relationship that we can detect as being
significant.
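The dependence of significance on sample size can be illustrated with a short sketch (the numbers are illustrative, not the study's data): the same two-point difference in proportions is nonsignificant with 100 cases per group but highly significant with 50,000 per group.

```python
# Illustrative two-proportion z-test showing that a fixed, tiny effect
# (51% vs. 49%) becomes "statistically significant" purely by raising n.
from math import sqrt, erf

def two_prop_z(p1, p2, n1, n2):
    """z statistic and two-tailed p-value for a difference in proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

_, p_small = two_prop_z(0.51, 0.49, 100, 100)      # 100 cases per group
_, p_large = two_prop_z(0.51, 0.49, 50000, 50000)  # 50,000 cases per group
```

With 100 cases per group the p-value is far above any conventional threshold; with 50,000 per group the identical difference is significant well beyond the .001 level, even though the relationship is no stronger.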
To say that a significant relationship exists only tells half the story. We might be very
sure that a relationship exists, but is it a strong, moderate, or weak relationship? After
finding a significant relationship, it is important to ascertain its strength. This study
attempted to incorporate analytical techniques that would reveal both the significance and
strength of the relationships. For example, the probability of the chi-square statistic in
logistic regression is traditionally used to report the overall significance of the
relationship between the dependent variable and the combined effects of the independent
variables. The chi-square test reveals little about the actual strength of the relationship,
only how sure we are that it exists. Therefore, this study also used a t-test between
proportions based on the regression predictions. If the relationship was strong,
knowledge of the independent variables would significantly improve our ability to
predict the dependent variable, and if it were weak, we might see no actual improvement.
Thus, the t-test provided a significance test that focused on the magnitude of the effect
rather than merely our confidence that a relationship exists.
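A minimal sketch of this style of test, assuming a large-sample z approximation to the t-test between proportions. The hit count below is hypothetical; only the 361-respondent sample size comes from the study:

```python
# Hypothetical sketch: does a model's classification hit rate significantly
# exceed the modal-baseline proportion? One-sample z-test between proportions
# (a large-sample stand-in for the t-test described in the text).
from math import sqrt, erf

def one_sample_prop_test(hits, n, baseline):
    """z and one-tailed p for an observed hit rate vs. a baseline proportion."""
    p_hat = hits / n
    se = sqrt(baseline * (1 - baseline) / n)
    z = (p_hat - baseline) / se
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))
    return z, p_value

# Made-up example: suppose the model predicts 224 of 361 choices correctly
# (about 62 percent) against a hypothetical 55 percent modal baseline.
z, p = one_sample_prop_test(224, 361, 0.55)
```

A significant result here would mean the independent variables genuinely improve prediction, which is exactly the "magnitude-focused" question the text distinguishes from the overall chi-square significance of the regression.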
Whether or not a researcher finds significance is a function of the strength of the phenomenon
being measured, the type of statistical test, and the number of observations in the data set.
This study did not draw conclusions regarding the relationship between researchers'
question wording choices and the persuasion of the study sponsor, or the self-assessed
experience levels of the researchers. However, even if there were significant
relationships, they do not appear to be important considerations to the survey research
design process. If the relationships exist, they are minor, and therefore it is the
recommendation of this study that they are not worthy of additional study.
On the other hand, researchers' personal opinions showed a relatively strong relationship
to their question wording choices. It is our recommendation that further research be
conducted to confirm the results of this study and to more fully investigate the nature of
this phenomenon.
This study randomly assigned individuals to one of the three sponsorship conditions. In
the "real world," however, researchers often have freedom to choose the sponsors they
become involved with. Researchers might gravitate to sponsors that support their own
opinions, or sponsors might have a tendency to hire researchers who support the
institution's goals. Thus, there may be a relationship between researchers' opinions and
sponsorship goals. The covariance between a researcher's opinion and the persuasion of
the sponsor was forced to zero in this study through the random assignment procedure;
however, this may not be the case in the "real world." Future research is needed to
explore this issue.
Finally, the most surprising finding of this study was the tendency of researchers to favor
one form of question wording over the other. This appeared to be the strongest
phenomenon uncovered by this study. Researchers clearly had a "favorite" form of
question wording and they regularly chose that form. Only 17 percent of the respondents
showed no sign of bias one way or the other and over 50 percent showed a moderate to
strong bias. This bias existed in most researchers to varying degrees and it was generally
much stronger than expected. We therefore recommend that this phenomenon be further
studied to explore the scope and magnitude of the effect.
The purpose of this study was to determine whether or not survey researchers unknowingly
influence the results of a survey through their question wording choices. Our conclusion
is that they do, the effect is substantial, and that further research is needed in this area.
References
Alexander, C. S., & Becker, H. J. (1978). The use of vignettes in survey research. Public Opinion Quarterly 42 (1), 93-104.
Ayidiya, S., & McClendon, M. (1990). Response effects in mail surveys. Public Opinion Quarterly 54 (2), 229-247.
Barath, A., & Cannell, C. F. (1976). Effect of interviewer's voice intonation. Public Opinion Quarterly 40 (3), 370-373.
Bishop, G. F. (1987). Experiments with the middle response alternative in survey questions. Public Opinion Quarterly 51 (2), 220-231.
Bishop, G. F., Hippler, H. J., Schwarz, N., & Strack, F. (1988). A comparison of response effects in self administered and telephone surveys. In R. Groves (Ed.), Telephone survey methodology (pp. 321-340). New York: Wiley.
Bishop, G. F., Oldendick, R. W., & Tuchfarber, A. J. (1978). Effects of question wording and format on political attitude consistency. Public Opinion Quarterly 42 (1), 81-92.
Bishop, G. F., Oldendick, R. W., & Tuchfarber, A. J. (1982). Effects of presenting one versus two sides of an issue in survey questions. Public Opinion Quarterly 46 (1), 69-85.
Blair, E. (1977). More on the effects of interviewer's voice intonation. Public Opinion Quarterly 41 (4), 544-548.
Bradburn, N. M., & Mason, W. M. (1964). The effect of question order on response. Journal of Marketing Research 1 (4), 57-61.
Bradburn, N. M., & Miles, C. (1979). Vague quantifiers. Public Opinion Quarterly 43 (1), 92-101.
Burnkrant, R. E., & Howard, D. J. (1984). Effects of the use of introductory rhetorical question versus statements on information processing. Journal of Personality and Social Psychology 47 (6), 1218-1230.
Carp, F. M. (1974). Position effects on interview responses. Journal of Gerontology 29 (5), 581-587.
Chase, C. I. (1969). Often is where you find it. American Psychologist 24 (11), 1043.
Clancey, K. J., & Wachsler, R. A. (1971). Positional effects in shared cost surveys. Public Opinion Quarterly 35 (2), 258-265.
Cliff, N. (1959). Adverbs as multipliers. Psychological Review 66 (1), 27-44.
Collins, W. A. (1970). Interviewers' verbal idiosyncrasies as a source of bias. Public Opinion Quarterly 34 (3), 416-422.
Dohrenwend, B. S., Colombotos, J., & Dohrenwend, B. P. (1968). Social distance and interviewer effects. Public Opinion Quarterly 32 (3), 410-422.
Erdos, P. L. (1957). How to get higher returns from your mail surveys. Printer's Ink 258 (8), 30-31.
Hakel, M. D. (1968). How often is often? American Psychologist 23 (7), 533-534.
Hanson, R. H., & Marks, E. S. (1958). Influence of the interviewer on the accuracy of survey results. Journal of the American Statistical Association 53 (282), 635-655.
Hedges, B. M. (1979). Question wording effects: Presenting one or both sides of a case. The Statistician 28, 83-99.
Hippler, H. J., & Schwarz, N. (1987). Response effects in surveys. In H. J. Hippler, N. Schwarz, & S. Sudman (Eds.), Social information processing and survey methodology (pp. 321-340). New York: Springer-Verlag.
Jackman, M. R. (1973). Education and prejudice or education and response-set? American Sociological Review 38 (3), 327-339.
Kalton, G., Collins, M., & Brook, L. (1978). Experiments in wording opinion questions. Applied Statistics 27 (2), 149-161.
Kraut, A. I., Wolfson, A. D., & Rothenberg, A. (1975) Some effects of position on opinion survey items. Journal of Applied Psychology 60 (6), 774-776.
Krosnick, J. A. (1989). Question wording and reports of survey results: The case of Louis Harris and Associates and Aetna Life and Casualty. Public Opinion Quarterly 53 (1), 107-113.
Levine, S., & Gordon, G. (1958). Maximizing returns on mail questionnaires. Public Opinion Quarterly 22 (4), 568-575.
McFarland, S. G. (1981). Effects of question order on survey responses. Public Opinion Quarterly 45 (2), 208-215.
Mosier, C. I. (1941). A psychometric study of meaning. Journal of Social Psychology 13, 123-140.
Mullner, R. M., Levy, P. S., Byre, C. S., & Matthews, D. (1982, Sept.-Oct.). Effects of characteristics of the survey instrument on response rates to a mail survey of community hospitals. Public Health Reports 97 (5), 465-469.
Noelle-Neumann, E. (1970). Wanted: Rules for wording structured questionnaires. Public Opinion Quarterly 34 (2), 191-201.
Parducci, A. (1968). Often is often. American Psychologist 23 (11), 828.
Payne, S. L. (1951). The art of asking questions. Princeton, NJ: Princeton University Press.
Pepper, S., & Prytulak, L. S. (1974). Sometimes frequently means seldom: Context effects in the interpretation of quantitative expressions. Journal of Research in Personality 8, 95-101.
Petty, R. E., Cacioppo, J. T., & Heesacker, M. (1981). Effects of rhetorical questions on persuasion: A cognitive response analysis. Journal of Personality and Social Psychology 40 (3), 432-440.
Petty, R. E., Rennier, G. A., & Cacioppo, J. T. (1987). Assertion versus interrogation format in opinion surveys: Questions enhance thoughtful responding. Public Opinion Quarterly 51 (4), 481-494.
Phillips, D. L., & Clancy, K. J. (1972). Modeling effects in survey research. Public Opinion Quarterly 36 (2), 246-253.
Poe, G. S., Seeman, I., McLaughlin, J., Mehl, E., & Dietz, M. (1988). Don't know boxes in factual questions in a mail questionnaire: Effects on level and quality of response. Public Opinion Quarterly 52 (2), 212-222.
Rasinski, K. A. (1989). The effect of question wording on public support for government spending. Public Opinion Quarterly 53 (2), 388-394.
Robinson, R. A. (1952). How to boost returns from mail surveys. Printer's Ink 239 (10), 35-37.
Rugg, D., & Cantril, H. (1944). The wording of questions. In H. Cantril (Ed.), Gauging public opinion (pp. 23-50). Princeton, NJ: Princeton University Press.
Schaeffer, N. C. (1991). Hardly ever or constantly? Group comparisons using vague quantifiers. Public Opinion Quarterly 55 (3), 392-423.
Schuman, H., & Presser, S. (1977). Question wording as an independent variable in survey analysis. Sociological Methods and Research 6 (2), 151-170.
Schuman, H., & Presser, S. (1981). Questions and answers in attitude surveys. New York: Academic Press.
Schyberger, A. B. (1967). Study of interviewer behavior. Journal of Marketing Research 4 (1), 32-35.
Simpson, R. (1944). The specific meaning of certain terms indicating differing degrees of frequency. The Quarterly Journal of Speech 30, 328-330.
Skelly, F. R. (1954). Interviewer-appearance stereotypes as a possible source of bias. Journal of Marketing 19 (1), 74-75.
Sletto, R. F. (1940). Pretesting of questionnaires. American Sociological Review 5 (2), 193-200.
Smith, T. W. (1982). Conditional order effects. GSS Technical Report No. 33. Chicago: National Opinion Research Center.
Smith, T. W. (1987). That which we call welfare by any other name would smell sweeter: An analysis of the impact of question wording on response patterns. Public Opinion Quarterly 51 (1), 75-83.
Spector, P. (1981). Research design. Beverly Hills: Sage.
Sudman, S., & Bradburn, N. (1974). Response effects in surveys. Chicago: Aldine.
Swasy, J. L., & Munch, J. M. (1985). Examining the target of receiver elaborations: Rhetorical question effects on source processing and persuasion. Journal of Consumer Research 11, 877-886.
Tourangeau, R., & Rasinski, K. A. (1988). Cognitive processes underlying context effects in attitude measurement. Psychological Bulletin 103 (3), 299-314.
Tourangeau, R., Rasinski, K. A., Bradburn, N., & D'Andrade, R. (1989). Carryover effects in attitude surveys. Public Opinion Quarterly 53 (4), 495-524.
Turner, C. F., & Krauss, E. (1978). Fallible indicators of the subjective state of the nation. American Psychologist 33 (5), 456-470.
Walonick, D. S. (1993). StatPac Gold IV: Survey and marketing research edition. Minneapolis: StatPac, Inc.
Weiss, C. (1968). Validity of welfare mothers' interview responses. Public Opinion Quarterly 32 (2), 287-294.
Williams, J. A., Jr. (1968). Interviewer role performance: A further note on bias in the information interview. Public Opinion Quarterly 32 (2), 287-294.
Wilson, T. D., Dunn, D., Bybee, J., Hyman, D., & Rotondo, J. (1984). Effects of analyzing reasons on attitude-behavior consistency. Journal of Personality and Social Psychology 47 (1), 5-16.
Zillman, D. (1972). Rhetorical elicitation of agreement in persuasion. Journal of Personality and Social Psychology 21 (2), 159-165.
APPENDIX A
Cover Letter & Questionnaire That Was Sent To Researchers
January 25, 1994
Dear Researcher:
Have you ever wondered about the best wording for an item on a questionnaire?
If you're like me, you probably spend a lot of time trying to figure out just the right wording for your survey questions. Sometimes the wording is obvious. Other times, it's not clear which is the best question wording--or even if there is a "best" question wording.
We are conducting a very simple experiment to study how researchers formulate questions for opinion and attitude surveys. I think you'll find it interesting and different from your "run-of-the-mill" survey.
The enclosed questionnaire will take less than five minutes to complete. Your responses are very important, and they will tell us much about how researchers choose question wording options.
There are no right or wrong answers. Please complete the questionnaire as soon as possible, and mail it back in the enclosed pre-stamped envelope. Feel free to write comments anywhere on the questionnaire.
Thank you for your participation in this study.
Sincerely,
David S. Walonick
President
Question Wording Survey
How would you rate your own questionnaire design skills? (Circle one)
a) beginner b) average c) expert
You have been hired by [an, a conservative, a liberal] organization to find out how the public feels about government spending on six social issues. During the planning meeting, you [discover that the organization favors increased, reduced spending levels. You also] learn that the organization is considering hiring you to conduct several other surveys in the future. As you leave the meeting, the Director says to you, "We really need to know the truth, so be objective."
All six of your survey questions begin with: "Are we spending too much, too little, or the right amount on..."

For each item, circle the question wording that you would use in the survey, and circle the response you would personally give to the question you selected.

1. a) "...halting the rising crime rate?"
   b) "...law enforcement?"
      a) too much   b) too little   c) the right amount

2. a) "...drug rehabilitation?"
   b) "...dealing with drug addiction?"
      a) too much   b) too little   c) the right amount

3. a) "...assistance to the poor?"
   b) "...welfare?"
      a) too much   b) too little   c) the right amount

4. a) "...assistance to big cities?"
   b) "...solving problems of big cities?"
      a) too much   b) too little   c) the right amount

5. a) "...improving conditions of Blacks?"
   b) "...assistance to Blacks?"
      a) too much   b) too little   c) the right amount

6. a) "...Social Security?"
   b) "...protecting Social Security?"
      a) too much   b) too little   c) the right amount
Thank you. Please send your completed survey to: StatPac Inc. 4425 Thomas Ave. S.
Minneapolis, MN 55410