The Effect of Familiarity with the Response Category Labels on Item Response to Likert Scales

Bert Weijters, Maggie Geuens, Hans Baumgartner

The response category labeling effect

Motivating Example

a French researcher wants to replicate an empirical finding that was established in the U.S. using data based on consumer self-reports in France;

in the English questionnaire, a Likert scale with endpoints of ‘strongly disagree’ and ‘strongly agree’ was used;

should the French researcher use ‘fortement d’accord’ or ‘tout à fait d’accord’?


Research questions

Do the labels attached to the response scale categories influence response behavior (i.e., how many respondents endorse the extreme scale categories)?

What causes this effect?

How can the effect be mitigated?

What are the implications for multilingual and monolingual surveys?


Research questions (cont’d)

various characteristics of rating scales have been studied, but the problem of choosing appropriate labels for the response categories has been largely ignored;

this is surprising because category labels typically apply to many if not all of the items in a questionnaire;

if differences in responding to survey items as a function of the category labels have been acknowledged, the effect has generally been attributed to the perceived intensity of the labels (intensity hypothesis);

in this research we propose the familiarity hypothesis (i.e., scale categories marked by labels that are used more often in day-to-day language are more likely to be endorsed) and contrast it with the intensity hypothesis;


Literature review

certain aspects of the labels attached to the response categories on rating scales can have systematic effects on people’s responses to questionnaires:

□ the range of response alternatives provided can influence respondents’ answers to questions (Schwarz et al. 1988);

□ the response alternatives provided (e.g., feeling really irritated “several times a day” to “less than twice a week” vs. “more than once every 3 months” to “less than once a year”) may affect the interpretation of the question (Schwarz et al. 1988);

□ use of different numeric values (-5 to +5 vs. 0 to 10) can change the meaning of endpoint labels such as “not at all successful” (Schwarz et al. 1991);

can endpoint labels in Likert scales that differ in terms of the amplifier used (e.g., “strongly” vs. “completely” (dis)agree) change responses?


Inferences based on the range of response alternatives provided

Schwarz et al. (1985): question about hours spent watching TV

Response options (low range)     %      Response options (high range)     %
< ½ hr                          7.4     < 2 ½ hrs                        62.5
½ to 1 hr                      17.7     2 ½ to 3 hrs                     23.4
1 to 1 ½ hrs                   26.5     3 to 3 ½ hrs                      7.8
1 ½ to 2 hrs                   14.7     3 ½ to 4 hrs                      4.7
2 to 2 ½ hrs                   17.7     4 to 4 ½ hrs                      1.6
> 2 ½ hrs                      16.2     > 4 ½ hrs                         0.0


The intensity hypothesis

intensity is defined as the degree or extent of the attribute expressed by the label (e.g., degree of agreement or disagreement, extent of liking);

prior research shows that scale anchors in general (e.g., adjectives for evaluating products, such as “good”, “terrific”, or “superior”, as in Wildt and Mazis 1978) and amplifiers used in Likert scales (e.g., “slightly”, “somewhat” or “very much” agree, as in Spector 1976) differ in perceived intensity;

more intense labels represent more extreme positions, which should be endorsed less often (e.g., agree vs. strongly agree; superior vs. very good);


The intensity hypothesis (cont’d)

Wyatt and Meyers (1987) found that when the extremes of the response scale were anchored by narrower or less absolute labels (i.e., “agree” and “disagree”), responses were distributed more evenly across all five scale steps, whereas when the response scale was bordered by wider or more absolute labels (i.e., “strongly agree” and “strongly disagree”), responses were concentrated more on the intermediate scale steps;

even more subtle adverbial modifiers (e.g., strongly vs. completely agree) may influence response behavior;


The intensity hypothesis (cont’d)

there is prior evidence that different intensities are associated with different adverbs (e.g., Cliff 1959; Smith et al. 2009), but little evidence that different adverbs lead to differential category endorsement;

H_intensity: Endpoint response categories are endorsed less frequently if their labels are more intense.


The familiarity hypothesis

according to the open choice model in linguistics, the only constraint on the concatenation of words is that the rules of grammar be respected;

in contrast, the idiom principle states that combinations of words are used in conventional patterns, which leads to the phenomenon of collocation:

certain word combinations co-occur more often than would be expected based on their individual frequencies (e.g., strong tea vs. powerful tea);

“of the large repertoire of amplifiers available for expressing a high degree of intensity, speakers rely on a rather limited set of items, and only a few of these are used with great frequency” (Altenberg 1991, p. 133)


The familiarity hypothesis (cont’d)

formulaic sequences such as collocations are not only used more frequently by language users and are thus more familiar, but are also processed more quickly (Conklin and Schmitt 2008; Durrant 2008);

based on research on meta-cognitive experiences, this suggests that more familiar, high-frequency labels are more likely to be endorsed (Alter et al. 2007; Fang, Singh, and Ahluwalia 2007; Hawkins and Hoch 1992; Unkelbach 2007; Winkielman et al. 2003):

□ repeated exposure to a stimulus has beneficial effects on processing fluency;

□ repeated and more fluently processed statements are more likely to be rated as true;

□ stimulus repetition and fluent processing increase liking, preference and confidence judgments;


The familiarity hypothesis (cont’d)

since collocations have been shown to be processed more quickly, familiar (vs. unfamiliar) labels, because of their greater processing fluency, should be chosen more confidently as the true and preferred response option;

Arce-Ferrer (2006) showed that respondents who were less familiar with the meaning of the intermediate scale categories were more likely to engage in extreme responding and therefore less likely to endorse response options with which they were not familiar;

H_familiarity: Endpoint response categories are endorsed more frequently if their labels are more familiar.


Two alternative hypotheses to explain the effect of response category labels

Intensity hypothesis:

H_intensity: Endpoint response categories are endorsed less frequently if their labels are more intense.

Familiarity hypothesis:

H_familiarity: Endpoint response categories are endorsed more frequently if their labels are more familiar.


Study 1: Scaling intensity and familiarity

• Do different methods for scaling the intensity and familiarity of response category labels lead to similar results?

If the intensity or familiarity of scale labels is to have a reliable effect on responses to questionnaires, consistent differences in the perceived intensity and fluency of category labels should emerge across respondents.

• Can we identify endpoint labels that vary significantly in intensity and familiarity for use in subsequent studies?

We need two labels that imply contradictory responses under the intensity and familiarity hypotheses.


Study 1 (cont’d)

• Label intensity

– Direct ratings of intensity (0 = neutral; 10 = 100% agreement)

– Pairwise comparisons of intensity (“Which expression indicates the stronger sense of agreement?”)

• Label familiarity

– Direct ratings of familiarity (0 = we never use this term in day-to-day language; 10 = we use this term very often in day-to-day language)

– Pairwise comparisons of familiarity (“Which expression is more commonly used in day-to-day language?”)

– Lexical decision task (press a button labeled ‘end category label’ or ‘not an end category label’ for six endpoint labels and five non-endpoint labels)

– Word frequency counts in corpora of texts (Google hits, available for specific word combinations in particular countries and languages)


Study 1: Method

Sample 1: 83 undergraduates; pairwise comparisons of intensity and familiarity of six endpoint labels;

Sample 2: 112 respondents (mean age 32.03, 66% female) from an online panel; direct ratings of intensity and familiarity on 11-point scales;

Sample 3: 125 undergraduates (57% female); lexical decision task;


Study 1: Results

                                        Intensity                   Familiarity                                                  Summary measures
Dutch label (free translation)          Direct        Paired        Direct        Paired        Response            Google      Intensity  Familiarity
                                        rating        comparison    rating        comparison    latency             hits
Sterk eens (strongly agree)             7.89 (.15)¹   .94 (.13)¹    3.62 (.27)²   1.14 (.11)¹   1011.53 (55.01)¹    510         -.99       -.74
Zeer eens (very much agree)             7.60 (.18)¹   1.43 (.12)²   2.49 (.24)¹   1.65 (.12)²   1002.93 (54.86)¹    755         -.95       -.74
Zeker eens (certainly agree)            7.78 (.20)¹   2.11 (.11)³   5.05 (.28)³   2.40 (.10)³   989.72 (54.61)¹     n.a.ᵈ       -.56       -.22
Uitgesproken eens (distinctly agree)    8.57 (.22)²   2.98 (.18)⁴   2.81 (.28)¹   1.18 (.13)¹   1021.87 (54.73)¹    55          .25        -.83
Helemaal eens (fully agree)             9.54 (.12)³   3.72 (.13)⁵   8.62 (.16)⁴   4.24 (.08)⁴   724.88 (55.83)²     131000      1.10       1.23
Volledig eens (completely agree)        9.56 (.10)³   3.82 (.12)⁵   8.59 (.18)⁴   4.39 (.08)⁴   672.73 (55.21)²     110000      1.16       1.25


Study 1: Results (cont’d)

for intensity, the correlation of the means obtained from the paired comparison and direct rating tasks is .92;

the correlations of the means derived from the four familiarity methods range from .94 to .97;

thus, there is considerable consistency in respondents’ judgments of the perceived intensity and familiarity of different category labels;

‘sterk eens’ (strongly agree) consistently emerged as one of the least intense and least familiar labels, while ‘volledig eens’ (completely agree) surfaced as one of the most intense and most familiar labels;
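The consistency check for intensity can be retraced from the label means in the Study 1 results table. The following is a minimal sketch, assuming the correlation was a plain Pearson correlation computed over the six label means; the function and variable names are illustrative and not taken from the original study.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Intensity means per label, in table order (sterk, zeer, zeker,
# uitgesproken, helemaal, volledig eens).
direct_rating = [7.89, 7.60, 7.78, 8.57, 9.54, 9.56]
paired_comparison = [0.94, 1.43, 2.11, 2.98, 3.72, 3.82]

r_intensity = pearson(direct_rating, paired_comparison)
print(round(r_intensity, 2))  # 0.92
```

Under this assumption the sketch arrives at the .92 reported on the slide.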


Study 2

Direct test of the intensity and familiarity hypotheses:

The endorsement rate for a label that is high in both intensity and familiarity should be relatively low if the intensity hypothesis is true, and relatively high if the familiarity hypothesis is true.


Measuring response distributions

A major challenge is to measure differences in response distributions that are not item-specific and independent of substantive content;

To do this, we need to observe patterns of responses across heterogeneous items (i.e., items that do not share common content but have the same response format):

Deliberately designed scales consisting of heterogeneous items (Greenleaf 1992)

Random samples of items from scale inventories (Weijters, Geuens & Schillewaert 2010)
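The content-free measure used in the later studies is the number of endpoint responses a respondent gives across heterogeneous items. A minimal sketch of that count follows, assuming responses are coded as integers from 1 to 5; the function name and the example answers are hypothetical.

```python
def count_endpoint_responses(responses, scale_min=1, scale_max=5):
    """Count how many items were answered with either extreme category."""
    return sum(1 for r in responses if r in (scale_min, scale_max))

# Hypothetical answers of one respondent to 16 heterogeneous items.
answers = [5, 4, 1, 3, 2, 5, 4, 4, 1, 3, 5, 2, 4, 3, 1, 4]

n_extreme = count_endpoint_responses(answers)
print(n_extreme)  # 6
```

Because the items share no common content, a respondent-level count like this reflects use of the response format rather than any single construct.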


Study 2: Method

online survey with Dutch-speaking panel members of an online market agency (N = 218);

the respondents ranged in age from 20 to 65 years (M = 43.2, SD = 11.7), 47% were female, and 58% had schooling beyond secondary school;

respondents were randomly assigned to questionnaires varying the endpoint labels (5-point scale):

□ ‘completely (dis)agree’ (high intensity/familiarity)

□ ‘strongly (dis)agree’ (low intensity/familiarity)

questionnaire consisted of

□ 16 heterogeneous items (4 pages with 4 items per page) taken from unrelated scales (e.g., “Air pollution is an important worldwide problem”, “I often give compliments to others”);

□ pairwise comparisons of the two response category labels in terms of intensity and familiarity;


Study 2: Results

The manipulation of intensity/familiarity was successful;

The findings support the familiarity hypothesis:

Label               Intensity   Familiarity   Mean number of extreme responses
Strongly agree      22%         10%           3.1 (.26)
Completely agree    78%         90%           4.4 (.33)
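As a rough check on the size of this difference, one can compare the two means with a simple z-type statistic. This is only a sketch under two assumptions that the slides do not confirm: that the values in parentheses are standard errors, and that the two groups are independent. It is not presented as the test the authors ran.

```python
from math import sqrt

def z_for_difference(mean_a, se_a, mean_b, se_b):
    """z-type statistic for the difference between two independent means."""
    return (mean_b - mean_a) / sqrt(se_a ** 2 + se_b ** 2)

# Mean number of extreme responses; parenthesized values assumed to be SEs.
z = z_for_difference(3.1, 0.26, 4.4, 0.33)
print(round(z, 2))  # 3.09
```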


Study 3

the results of Study 2 are presumably due to the fact that more familiar labels are more easily processed and that this ease of processing inadvertently influences respondents’ answers to survey questions;

as long as the relevance of meta-cognitive experiences is not called into question, people consider this information as diagnostic and incorporate it into their judgments by relying on naïve theories such as, “If the information comes to my mind easily, it must be true or I must like it”;

however, when the diagnosticity or informational value of meta-cognitive experiences is called into question, people discount this information and either turn to alternative naïve theories such as “The information comes to mind easily because I have often heard it” or use the cognitive content of the stimulus;


Study 3 (cont’d)

this suggests that making respondents aware that more familiar response labels may attract more responses and that this may lead them to more readily select the category label “completely (dis)agree” should eliminate the previously observed familiarity effect;


Study 3: Method

Online survey with 122 respondents of a university panel (67.2% women, average age of 29 years);

2 x 2 between-participant design:

intensity/familiarity of the endpoint labels manipulated as in the previous study (‘completely agree’ vs. ‘strongly agree’);

awareness of the label familiarity effect (depending on whether the following instructions were given before or after the collection of the DV):

In questionnaires, there are several different ways of labeling response categories (e.g., “strongly (dis)agree” or “completely (dis)agree”). Previous research has shown that labels that are used more commonly in day-to-day language are more often selected as a response. This happens irrespective of the true opinion of the respondent on the subject of the question.

DV is the number of endpoint responses to 16 heterogeneous questions;


Study 3: Results

the findings support the familiarity hypothesis in the unaware condition, but the effect goes away when respondents are made aware of the label familiarity effect:

[Figure: number of endpoint responses (vertical axis, 1 to 5) in the ‘not aware’ and ‘aware’ conditions, with separate series for the ‘strongly’ and ‘completely’ labels]


Implications of the category labeling effect for cross-cultural research

response category labels can affect findings in a single-language context (e.g., meta-analytic comparisons), but they are particularly important in cross-cultural research, where labels have to be translated;

two types of translation: literal and idiomatic;

some authors have emphasized the need to choose scale anchors that are equal in intensity (e.g., Harzing 2006), and prior research has demonstrated that supposedly similar terms may differ in intensity across languages (e.g., ‘definitely’ vs. ‘bestimmt’; see Smith et al. 2009);

however, translated adverbial modifiers may also differ in familiarity;


Schematic representation of the translation process (based on Bassetti and Cook 2011)


Study 4: Method

approx. 200 English- or French-speaking respondents in each of five regions (nationality/language combinations) of North America and Europe;

six endpoint labels in each language;

16 heterogeneous items from Greenleaf (1992), rated on 5-point scales;

pairwise comparisons of the six labels plus “agree” or “d’accord” in terms of intensity and familiarity;


Study 4: Method (cont’d)

Sample size by country and language:

Language    France   USA   Canada   UK    Total
French      227      0     203      0     430
English     0        185   196      187   568
Total       227      185   399      187   998

Questionnaire versions (endpoint labels):

English             French
Strongly agree      Fortement d'accord
Completely agree    Complètement d'accord
Extremely agree     Extrêmement d'accord
Definitely agree    Définitivement d'accord
Fully agree         Entièrement d'accord
Very much agree     Tout à fait d'accord


Study 4: Results

Intensity and familiarity ratings by region

Note: Correlation between the familiarity ratings and the natural logarithm of the number of Google hits was at least .88.


Study 4: Results

linear regression of the number of endpoint responses on label intensity, label familiarity, and 4 dummy variables representing the five regions;

only label familiarity had a significant effect (Standardized B = .38, p < .05, R² = .14);

in other words, the number of endpoint responses increases as a function of label familiarity, regardless of country and language;
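The predictor set of this regression (intensity, familiarity and four region dummies) can be sketched as a design-matrix row. This is an illustrative sketch only: the region names, the choice of reference region and the numeric inputs are assumptions, and the resulting rows could be passed to any ordinary least squares routine.

```python
# Five nationality/language combinations (names are illustrative).
REGIONS = ["France-French", "Canada-French", "Canada-English",
           "USA-English", "UK-English"]

def design_row(intensity, familiarity, region, reference="USA-English"):
    """Intercept, two label predictors and 4 dummies for 5 regions.

    The reference region gets all-zero dummies; which region serves
    as the reference is an arbitrary assumption here.
    """
    if region not in REGIONS:
        raise ValueError(f"unknown region: {region}")
    dummies = [1.0 if region == r else 0.0 for r in REGIONS if r != reference]
    return [1.0, intensity, familiarity] + dummies

# Hypothetical respondent: label scores are made-up numbers.
row = design_row(intensity=0.4, familiarity=1.2, region="Canada-French")
print(row)  # [1.0, 0.4, 1.2, 0.0, 1.0, 0.0, 0.0]
```

Using four dummies rather than five avoids perfect collinearity with the intercept, which is why five regions need only four indicator variables.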


Study 5

demonstration that familiarity is a viable determinant of extreme responding differences between regions in a large-scale international survey;

illustration of how to construct and use relative measures of familiarity and extreme responding based on secondary data only;


Study 5: Method

13,520 respondents from 17 European regions;

16 heterogeneous items based on Greenleaf (1992);

use of fully labeled 7-point response scales;

familiarity: relative measure of familiarity as the natural logarithm of the ratio of the number of Google hits for the 1st and 7th category (strongly agree or disagree) to the number of Google hits for the 2nd and 6th category (agree or disagree);

endorsement: relative endorsement of the 1st and 7th vs. the 2nd and 6th response categories (natural logarithm);
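Both relative measures are log ratios of endpoint counts to adjacent-category counts. A minimal sketch follows, assuming the counts for the two endpoints (and for the two adjacent categories) are summed before taking the ratio; the slides do not spell out this detail, and all numbers below are hypothetical.

```python
from math import log

def relative_log_ratio(endpoint_counts, adjacent_counts):
    """ln(sum of endpoint counts / sum of adjacent-category counts)."""
    return log(sum(endpoint_counts) / sum(adjacent_counts))

# Hypothetical search-engine hits for the labels of categories 1 and 7
# versus categories 2 and 6 in one language.
rel_familiarity = relative_log_ratio([1200, 800], [15000, 5000])

# Hypothetical response frequencies for the same categories in survey data.
rel_endorsement = relative_log_ratio([300, 500], [900, 1100])

print(round(rel_familiarity, 3), round(rel_endorsement, 3))
```

Negative values mean the endpoint labels (or responses) are less frequent than those of the adjacent categories; values closer to zero indicate relatively more familiar or more frequently endorsed endpoints.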


Region, language          N       female   M age   SD age
Belgium, Dutch            644     51%      41.0    11.1
Belgium, French           371     51%      40.5    11.7
UK, English               908     56%      41.8    11.3
Germany, German           993     50%      39.3    11.0
Hungary, Hungarian        1003    51%      38.3    11.8
Slovakia, Slovakian       1063    50%      38.2    12.1
Poland, Polish            802     37%      32.2    11.0
Netherlands, Dutch        1046    50%      40.8    11.4
France, French            1000    51%      39.4    11.9
Spain, Spanish            934     50%      37.8    10.5
Romania, Romanian         970     50%      37.9    11.5
Turkey, Turkish           914     43%      32.5    9.4
Italy, Italian            939     50%      39.0    10.6
Switzerland, French       303     51%      42.5    9.7
Switzerland, German       606     48%      43.5    9.4
Switzerland, Italian      50      56%      32.9    8.7
Sweden, Swedish           974     49%      39.9    11.3
Total                     13520   49%      38.7    11.4

Sample descriptive statistics Pan-European study (Study 7 and 8)


Study 5: Results

Note: Standardized B = .68, p < .05, R² = 46%.

[Figure: relative endorsement of endpoints in survey data (vertical axis, -1.40 to 0.00) plotted against relative endpoint label familiarity based on search engine data (horizontal axis, -8.00 to 0.00)]


Study 5: Results (cont’d)

prior research has generally attributed differences in response distributions in cross-cultural comparisons to nationality and national culture;

our findings demonstrate that different labels may vary in terms of familiarity, which can lead to different response patterns across languages;

in particular, if the endpoint label used in a certain language is more familiar than the one used in another language (relative to the adjacent category label), it is likely that the endpoint will be selected more frequently in the former than in the latter language;


Discussion: Summary of findings

response category labels that are more commonly used in day-to-day language (i.e., that are more familiar) lead to higher endorsement of their associated response categories;

respondents do not simply scale response categories along an intensity dimension and then map their latent response to the best-matching category, but they are also influenced by the familiarity of the labels;

the category label familiarity effect can be eliminated by making respondents aware of the potentially biasing effect of label familiarity;

the problem may be particularly serious in cross-cultural research when different languages are used;

however, researchers can control for differences in label familiarity across languages based on secondary data;


Consequences of the response category label effect

if certain labels attract more responses, this leads to bias;

Baumgartner and Steenkamp (2001) discuss how extreme responding biases scale scores:

□ if the modal scale response is above the midpoint, average scores will be inflated;

□ if the modal scale response is below the midpoint, average scores will be deflated;

relationships between variables can also be biased;


Consequences of the response category label effect (cont’d)

imagine a situation in which the strength of a relationship is compared across two groups and labels that differ in familiarity are used to collect data in the two groups;

the DV, an attitudinal variable (ATT), is measured on an agreement rating scale, and the IV (e.g., AGE in years) is measured on an objective scale and hence not affected by differences in label familiarity;

compared to respondents in the unfamiliar label condition, respondents in the familiar label condition who have a moderately positive or negative true attitude will exhibit a more extreme positive or negative observed attitude because they are more likely to endorse the endpoints;

this can result in a steeper observed slope and thus a stronger relationship between the objective antecedent and the observed attitude in the familiar label condition;
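The mechanism described above can be illustrated with a small deterministic toy example. Everything in it is an assumption made for illustration: the linear link between age and latent attitude, the `endpoint_pull` parameter that widens the range of latent attitudes mapped onto the endpoints, and the specific numbers. It is not an analysis of the study data.

```python
def observed_response(latent, endpoint_pull):
    """Map a latent attitude (1 to 5) onto a 5-point scale.

    A larger endpoint_pull means that more moderate latent attitudes
    are already reported as an endpoint (toy model of label familiarity).
    """
    if latent >= 4.5 - endpoint_pull:
        return 5
    if latent <= 1.5 + endpoint_pull:
        return 1
    return int(latent + 0.5)  # nearest inner category (2, 3 or 4)

def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

ages = list(range(20, 71, 5))
# Hypothetical latent attitude rising linearly from 2.0 to 4.0 with age.
latent = [2.0 + 0.04 * (age - 20) for age in ages]

unfamiliar = [observed_response(t, endpoint_pull=0.0) for t in latent]
familiar = [observed_response(t, endpoint_pull=0.75) for t in latent]

slope_unfamiliar = ols_slope(ages, unfamiliar)
slope_familiar = ols_slope(ages, familiar)
print(round(slope_unfamiliar, 4), round(slope_familiar, 4))
```

In this toy setup the same latent attitudes yield a steeper observed slope in the familiar-label condition, purely because moderate attitudes are reported as endpoints more often.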

Consequences of the response category label effect (cont’d)

using data from Study 2:

□ ATT: “I try to avoid food that is high in cholesterol”

□ IV: Age in years

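[Figure: agreement with “I try to avoid food that is high in cholesterol” (vertical axis, 1 to 5) as a function of age (horizontal axis, 20 to 70), with separate series for the ‘completely’ and ‘strongly’ label conditions]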


Implications for multilingual survey research

translations usually imply a trade-off between the attempt to be literal and the attempt to be idiomatic;

optimizing equivalence: use response category labels that are equally familiar in different languages (rather than literal translations or words with equal intensity);

e.g., the German and Dutch labels “vollkommen einverstanden” and “volledig eens” are literal translations (similar to “completely agree”), but in German this expression is more familiar, resulting in more endpoint responses than in Dutch (based on Study 5);

back-translation of response category labels may not help because it may result in literal rather than idiomatic translations and the familiarity of the labels in different languages may differ;


Identifying appropriate endpoint labels in two languages
