Demand Forecasting Behavior: System Neglect and Change Detection
Mirko Kremer Smeal College of Business, Pennsylvania State University, University Park, Pennsylvania 16802,
Brent Moritz Carlson School of Management, University of Minnesota, Minneapolis, Minnesota 55455,
Enno Siemsen Carlson School of Management, University of Minnesota, Minneapolis, Minnesota 55455,
Abstract:
This research analyzes how individuals make forecasts based on time series data, and tests
an intervention designed to improve forecasting performance. Using data from a controlled
laboratory experiment, we find that forecasting behavior systematically deviates from
normative predictions: Forecasters over-react to errors in relatively stable environments, but
under-react to errors in relatively unstable environments. Surprisingly, the performance loss
due to systematic judgment biases is larger in stable than in unstable environments. In a
second study, we test an intervention designed to mitigate these biased reaction patterns. In
order to reduce the salience of recent demand signals and emphasize the environment
generating these signals, we require forecasters to prepare forecasts for other time series
before returning to their original time series. This intervention improves forecasting
performance.
Keywords: forecasting; behavioral operations; system-neglect; exponential smoothing
Working Paper Draft - Please do not Distribute
1. Introduction
Demand forecasting in time series environments is fundamental to many operational decisions. Poor
forecasts can result in inadequate capacity, excess inventory or inferior customer service. Given the
importance of good forecasts to operational success, quantitative methods of time-series forecasting are
well known and widely available (cf. Makridakis, Wheelwright and Hyndman 1998). Despite the fact that
companies frequently have access to time-series history and sophisticated quantitative methods embedded
in forecasting software, empirical evidence shows that real world forecasting frequently relies on human
judgment. In a study of 240 U.S. corporations, over 90% of companies reported having access to
some forecasting software (Sanders and Manrodt 2003a), yet only 29% of firms primarily use
quantitative forecasting methods; 30% primarily use judgmental methods, while the remaining 41% apply both
quantitative and judgmental methods (Sanders and Manrodt 2003b). Although quantitative analysis based
on a time-series may often provide the basis for a forecast, it is a common practice to alter such forecasts
based on human judgment (Fildes, Goodwin, Lawrence and Nikolopoulos 2009).
A recent trend in operations management research is to study operational decisions from a behavioral
perspective (Bendoly, Donohue and Schultz 2006). While much research in behavioral operations
management is devoted to inventory decision making, Schweitzer and Cachon (2000, p. 419) highlight
the importance of explicitly separating the forecasting task from the inventory decision task:
“While the forecasting task typically requires managerial judgment, the task of converting a
forecast into an order quantity can be automated. A firm may reduce decision bias by asking
managers to generate forecasts that are then automatically converted into order quantities.”
Thus, inventory decisions can (and frequently should) be decomposed: When choosing an order quantity,
an individual has to estimate the probability distribution of future demand; derive a service level; and
then use the demand distribution and the service level to determine an order quantity. Biased judgments of
demand distributions would result in sub-optimal inventory decisions. For example, Schweitzer and
Cachon (2000) investigate newsvendor decision making under stationary and known demand
distributions, a setting where demand forecasting is theoretically irrelevant. A key finding is that order
quantities are on average biased towards mean demand, relative to the expected profit maximizing order
quantity. This biased ordering has been attributed to unsystematic randomness in decision making (Su
2008), as well as to more systematic biases like demand chasing (Kremer et al. 2010), i.e., the tendency
to adjust orders toward previous demand. In a more complex “beergame” setting, Croson and Donohue
(2003) observe the bullwhip effect, i.e. upstream order amplification in the supply chain, with participants
who face a known and stationary demand distribution. Croson, Donohue, Katok and Sterman (2009)
observe this effect even with constant and deterministic demand. In sum, existing experimental evidence
suggests that biased judgments of demand distributions can strongly affect the quality of higher-order
decisions like purchasing, inventory or capacity planning. Therefore, the analysis of judgmental
forecasting is crucial for a better understanding of decision making in operations management.
Extensive literature on human judgment in time-series forecasting exists (Lawrence, Goodwin,
O'Connor and Onkal 2006). Central findings include the wide-spread use of heuristics such as anchor-
and-adjustment, as well as the importance of feedback and task decomposition on forecasting
performance. However, the overall findings remain somewhat inconclusive, in part because forecasting
behavior appears sensitive to different components of the time-series. Further, the judgmental forecasting
literature is typically concerned with the detection of predictable changes in a time series, such as trends
or seasonality (Harvey 2007). In contrast, our research is focused on individual reaction to unpredictable
change in time-series. We ask the following two research questions: First, how do individuals create time-
series forecasts in unstable environments? Second, what can managers do to improve forecasting
performance?
We study these questions in a laboratory setting that allows for precise normative predictions:
forecasting a time series generated by a perturbed random walk. Across a wide range of environmental
conditions, we show that time-series forecasting behavior is described by an error-response model.
However, forecasters tend to over-react to forecast errors in more stable environments and under-react to
forecast errors in less stable environments. This pattern is consistent with the system neglect hypothesis
(Massey and Wu 2005) which posits that forecasters place too much weight on recently observed forecast
errors relative to the environment that produces these signals. To explore how to improve forecasting
performance, we therefore design and test an intervention which builds directly on the system neglect
hypothesis. Instead of making forecasts for a single time-series (our base study), we require subjects to
make forecasts for multiple time-series in parallel in our second study, in an attempt to reduce the relative
salience of recent signals and re-emphasize the demand environment underlying, and common to, all time-
series. We find that this simple intervention can improve forecasting performance.
We proceed in this paper as follows. The next section outlines the academic literature that relates to
our research. In §3 we discuss our theoretical developments. In §4 we discuss the results of our first
study, which is focused on understanding human judgment in time-series analysis tasks. Section 5 is
devoted to the results of our second study, which emphasizes managerial interventions to improve human
judgment in time-series analysis tasks. We discuss our results and conclude the paper in §6.
2. Related Literature
Existing research on judgmental forecasting provides vast but somewhat inconclusive empirical evidence
regarding forecasting performance, cognitive processes, and managerial interventions. Many studies have
been devoted to comparing the performance of human forecasts to quantitative forecasting methods, but
the empirical evidence is not consistent (Lawrence et al. 1985, Carbone and Gorr 1985, Sanders 1992,
Fildes et al. 2009). The literature has also investigated a variety of cognitive processes underlying the
evolution of judgmental forecasts, such as different variations of the anchoring and adjustment heuristic
(Harvey 2007). Regarding managerial interventions, judgmental forecast accuracy can improve with
performance feedback (e.g., Stone and Opel 2000) and task properties feedback (e.g., Sanders 1997), but
the effectiveness of these levers depends on specific contextual elements of the forecasting task
(Lawrence et al. 2006). Existing research on judgmental time-series forecasting examines pattern
detection, i.e. how well human subjects can identify trends and seasonal changes in a noisy time series
(Andreassen and Kraus 1990; Lawrence and O'Connor 1992; Bolger and Harvey 1993; Lawrence and
O'Connor 1995). In contrast, our research focuses on change detection, i.e. how subjects separate random
noise from unsystematic level changes.
When observing signal variation in a time-series, a forecaster needs to identify if there is substantive
(and persistent) cause for this variation, or whether variation just represents noise with no implications for
future observations. The ability to distinguish substantive change from random variation has been studied
extensively in the literature on regime change detection (Barry and Pitz 1979). A central conclusion from
regime change research is that people under-react to change in environments that are unstable and have
precise signals, and overreact in environments that are stable with noisy signals (Griffin and Tversky
1992). This seemingly contradictory reaction pattern has been reconciled by the system-neglect
hypothesis (Massey and Wu 2005), which posits that individuals overweight signals relative to the
underlying system which generates the signals.
A related stream of research in financial economics seeks to explain the pattern of short-term under-
reaction and long-term overreaction to information, often observed in stock market investment decisions
(Poteshman 2001). Some theoretical work has been devoted to explaining this behavioral pattern, e.g. by
linking such behavior to the "gambler's fallacy" or the "hot-hand effect" (Barberis et al. 1998, Rabin
2002, Rabin and Vayanos 2009). In an asset pricing context, Brav and Heaton (2002) illustrate how an
over-/underreaction pattern arises from biased information processing of investors subject to the
representativeness heuristic (Kahneman and Tversky 1972) and conservatism (Edwards 1968), and show
how this pattern can also arise from a fully Bayesian investor lacking structural knowledge about the
possible instability of the time-series. Experimental tests of this “mixed-reaction pattern” include
Bloomfield and Hales (2002) and Asparouhova, Hertzel and Lemmon (2009).
A central difference between our research and existing research on human change detection patterns is
the complexity of the judgment environment. In Massey and Wu (2005), participants face binary signals
(red or blue balls) which can be generated from two regimes (draws from two urns with fixed proportions
of red and blue balls in each). Given a sequence of signals, the experimental task is to identify when a
regime change (i.e. a switch from one urn to the other) has occurred. Further, as subjects have perfect
knowledge of the system parameters (the proportion of blue balls in either urn), there is no ambiguity
concerning the relevant world. This environment fits a binary forecasting task where a well-known
phenomenon needs to be detected (for example, when a bull market turns into a bear market). Similarly,
in Bloomfield and Hales (2002) and Asparouhova et al. (2009), participants face a fairly simple series of
signals generated from a symmetric random walk. Brav and Heaton (2002) illustrate their theoretical
considerations in an environment where a series of independently and identically distributed assets
exhibits a single structural break, which shifts the asset distribution only once during the time series. A central
question of our research is whether the over-reaction/under-reaction patterns observed in such fairly
simple settings translate to the relatively richer environment of time-series demand forecasting under
frequent change. Further, beyond trying to understand human reaction patterns, our study designs and
tests an intervention to mitigate biases and the resulting performance losses.
3. Theory
To begin our theory development, it is important to briefly characterize the judgment task underlying a
time-series forecast. In essence, a forecaster (she, in the following) needs to decide whether observed variation in the time-
series data provides a reason to modify a previous forecast in the next period. We illustrate this judgment
task in Figure 1.
Figure 1: The Challenge of Time-Series Analysis
If she interprets variation purely as random noise, she can ignore this variation and not change her
forecast (i.e. a long-run average, the circle in Figure 1). If she believes that variation represents a change
in the underlying level of the time series, the most recent demand observations contain more information
about the future than past observations, and need to receive more weight in the forecast. Her forecast is
then close to the square in Figure 1. Finally, if she believes that this variation is indicative of a trend (an
ongoing change in the level), she would extrapolate the existing variation to re-occur in the future, and
her forecast would be close to the triangle in Figure 1.
In practice, these choices are not mutually exclusive. A forecaster may decide that variation is partially
due to noise, and partially due to a level change, and therefore create a forecast somewhere in between the
square and the circle in Figure 1. Or she may believe that variation represents both a level change and a
trend. The key challenge is differentiating level changes from noise. While our empirical analysis will
control for individuals potentially detecting illusory trends, our simulated demand environment does not
contain trends, and a comprehensive discussion of trend detection is beyond the scope of this paper.
3.1 Demand Environment
We assume that forecasters react to demand observations in time intervals indexed by t, without any
additional information on future demand realizations beyond that which is contained within the time
series. The level of our time series changes according to a random walk. If we define $\mu_t$ to be the level at
time t, the level at the next regular observation at time t + 1 is given by $\mu_{t+1} = \mu_t + V_t$, where $V_t$ is a
normally distributed random variable with mean 0 and standard deviation c. The demand observation $D_t$
in each time period is then a normally distributed random variable with mean $\mu_t$ and standard deviation n.
Roughly put, the standard deviation c captures the notion of change, i.e. permanent shocks to the time-
series, while the standard deviation n captures the noise surrounding the level, i.e. temporary shocks to the
time-series. With a change parameter c, the level of the time series in the next period has a 68% chance of
being within +/- c of the level in the current period. The noise parameter n implies a 68% chance of the
actual demand observation being less than +/- n away from the true level. Figure 2 illustrates how the
shape of a representative time series depends on these two parameters.
While allowing for randomly changing levels μt, a time series from this data-generating process has no
underlying systematic trend or seasonality. Although real time-series often contain such elements,
methods to de-trend and de-seasonalize data are available (Winters 1960). For simplicity, our research is
focused on real-world data that has gone through such modifications, or data that can be well described by
Brownian motion, such as energy demand or airline passengers (Marathe and Ryan 2005). Importantly,
this simplification allows us to study the decision task of differentiating level changes from noise, without
further confounding such judgments with the estimation of trends and seasonal elements. Further, the
demand process we consider provides a simple normative benchmark: Single exponential smoothing
(Harrison 1967, McNamara and Houston 1987).
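To make this data-generating process concrete, the following Python sketch simulates one demand path (illustrative code with our own function and parameter names, not the software used in the experiment):

    import numpy as np

    def simulate_demand(c, n, T=80, mu0=500.0, seed=0):
        # Level follows a random walk: mu_{t+1} = mu_t + V_t, V_t ~ N(0, c^2).
        # Demand is the level plus temporary noise: D_t ~ N(mu_t, n^2).
        rng = np.random.default_rng(seed)
        steps = rng.normal(0.0, c, T - 1)              # permanent (change) shocks
        mu = mu0 + np.concatenate(([0.0], np.cumsum(steps)))
        demand = rng.normal(mu, n)                     # temporary (noise) shocks
        return mu, demand

    # Example: condition 4 (c = 10, n = 40); 30 historic plus 50 forecast periods.
    level, demand = simulate_demand(c=10, n=40, T=80)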
Figure 2: Sample Demand Paths for Different c and n ($\mu_0 = 500$). [Six panels plot demand over time for Conditions 1-6: (c=0, n=10), (c=0, n=40), (c=10, n=10), (c=10, n=40), (c=40, n=10), (c=40, n=40).]
3.2 Normative Benchmark
Structurally, the single exponential smoothing forecast $F_{t+1}$ (made in period t for period t + 1) is a
weighted average of the most recent demand observation and the previous forecast, $F_{t+1} = \alpha D_t + (1 - \alpha) F_t = F_t + \alpha (D_t - F_t)$. The latter part of this equation highlights how the forecast $F_{t+1}$ is driven
by a response to the forecast error. The appropriate smoothing level $\alpha^*$ is a function of the change (c) and
noise (n) parameters governing the time series. To further characterize $\alpha^*$, it is useful to introduce the
concept of weight of evidence. We formally define this weight as the change-to-noise ratio $W = c^2/n^2$,
which increases as the degree of change in the time-series (c) rises, and decreases as the noise in the time
series (n) intensifies. Intuitively, W measures the reliability with which an observed forecast error
represents a change in the level of a time series. With low W (variations in demand are mostly noise),
forecast errors should be discarded and should not influence behavior. With a high W (variations in
demand are mostly level changes), forecast errors should strongly influence forecasts. This intuition is
formally supported by Harrison (1967) and McNamara and Houston (1987), who show that¹

$\alpha^*(W) = \frac{2}{1 + \sqrt{1 + 4/W}}$.  (1)
Note that the optimal smoothing constant in Eq. (1) depends only on the change-to-noise ratio W, while
the demand time series is driven by absolute levels of c and n. For example in Figure 2, condition 3 (c =
10 and n = 10) and condition 6 (c = 40 and n = 40) have the same W, and therefore the same associated
$\alpha^*(W)$. The optimal forecasting mechanism for our demand environment is

$F_{t+1} = F_t + \alpha^*(W)(D_t - F_t)$.  (2)
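A minimal sketch of this benchmark, implementing Eq. (1) and the error-response recursion of Eq. (2) (function names are ours):

    import math

    def optimal_alpha(c, n):
        # Eq. (1) with W = c^2 / n^2; for c = 0 (no level changes), alpha* = 0.
        if c == 0:
            return 0.0
        W = (c / n) ** 2
        return 2.0 / (1.0 + math.sqrt(1.0 + 4.0 / W))

    def ses_forecasts(demand, alpha, f0):
        # Eq. (2): F_{t+1} = F_t + alpha * (D_t - F_t); forecasts[k] is the
        # forecast made after observing the first k demand realizations.
        forecasts = [f0]
        for d in demand:
            forecasts.append(forecasts[-1] + alpha * (d - forecasts[-1]))
        return forecasts

    # Reproduces the alpha*(W) values in Table 1, e.g.:
    # optimal_alpha(10, 10) ~ .62, optimal_alpha(10, 40) ~ .22, optimal_alpha(40, 10) ~ .94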
3.3 Behavioral Forecasting and System Neglect
The previous section outlines how single exponential smoothing with α*(W) is optimal for a random walk
described by c and n. In this section we discuss forecasting behavior relative to this normative benchmark.
From a behavioral perspective, Eq. (2) poses two critical assumptions on the forecaster's degree of
rationality (Brav and Heaton 2002). The optimal forecasting mechanism implies that the forecaster has
correct beliefs about the structure of the demand process, knowledge and understanding of the structure of
the optimal forecasting mechanism, and access to an unbiased estimate of α*(W). It is optimistic to
assume that given the complexity of the context, the forecaster has “structural certainty” about the
demand environment (perturbed random walk) and optimal forecasting mechanism (single-exponential
smoothing). For example, a forecaster may perceive trends where there are none (see Figure 1). Our
empirical estimation in Section 4 will therefore allow for richer models that describe forecasting behavior
beyond simple exponential smoothing. From a behavioral perspective, the crucial question is the choice of
the smoothing constant $\alpha(W)$, relative to the unbiased estimate $\alpha^*(W)$ in Eq. (2).
There are compelling reasons to assume that forecasters follow the error-response logic of simple
exponential smoothing. Practically, exponential smoothing corresponds to the mental process of error
detection and subsequent adaptation.² Given a constant smoothing parameter α, it represents single-loop
learning where a forecaster observes an error and then adjusts her next forecast based on that error. As
such, exponential smoothing, interpreted as trial-and-error learning, is a plausible model for real behavior.
Further, exponential smoothing has two important characteristics as a boundedly rational decision
heuristic: It does not require much memory, because the most recent forecast contains all information
necessary to make the next forecast, and it is a robust heuristic in many different environments beyond
the particular one used in our study (Gardner 1985 and 2006).

¹ McNamara and Houston derive expression (1) using Bayesian principles. The same expression (though articulated slightly differently) was derived by Harrison (1967) as the argument that minimizes the variance of forecast errors.
² This is also a fundamental principle of cybernetics (Wiener 1948) and the foundation for closed-loop theories of learning (Adams 1968). There is neurological evidence that our brain supports such a process (Gehring, Goss, Coles, Meyer and Donchin 1993).
How humans evaluate and subsequently respond to signals has been tied to the concepts of
representativeness and conservatism. Representativeness means that individuals have a tendency to
overreact to a signal and account only insufficiently for the weight they should attribute to that signal. For
example, individuals neglect to acknowledge that a small sample size implies that only a low statistical
weight should be attributed to the sample (Tversky and Kahneman 1971). In our context,
representativeness implies that forecasters consistently use a higher W than the underlying time series
would entail when determining their α(W). On the other hand, conservatism implies that individuals have
a tendency to underreact to a signal, even though the statistical weight they should attribute to the signal
is strong. This phenomenon has mostly been observed in the context of Bayesian updating (Camerer
1995). In our context, conservatism implies that human forecasters consistently use a lower W than the
underlying time series would entail. Massey and Wu (2005) integrate these observations into a system-
neglect hypothesis: the strength of the signal is salient in the decision maker's perception, whereas the
system that generated the signal is latent in the background. This leads to a general neglect of the system,
such that the decision maker emphasizes the strength of a signal at the expense of the weight that should
be attached to that signal. In other words, the weight of the signal is less of a determinant of behavior than
it ought to be. For the forecasting context of our study, system-neglect leads us to believe that W is less a
determinant of behavior than Eq. (1) implies. Specifically, we would expect that our behavioral α(W) is
less responsive to W (i.e. the curve is flatter) than α*(W).
As Massey and Wu point out, the system neglect hypothesis predicts that there is relatively more over-
reaction for low values of W, and relatively more under-reaction for high values of W. In our context, for
any $W_L < W_H$, we would have $\alpha(W_L) - \alpha^*(W_L) > \alpha(W_H) - \alpha^*(W_H)$. System neglect makes no specific
predictions about absolute levels of over- and under-reaction (i.e., if $\alpha(0)$ is very high, there could be over-
reaction for all values of W, and if $\alpha(0) = 0$, there could be under-reaction for all values of W). We
illustrate the difference between the normative reaction according to Eq. (1), and the predicted behavioral
reaction according to system neglect in Figure 3.
Figure 3: Normative versus System Neglect Reaction
Note that Figure 3 shows one possible pattern of system neglect (absolute over-reaction for low values of
W, and absolute under-reaction for high values of W). We hypothesize:
HYPOTHESIS 1 (SYSTEM NEGLECT): Individuals show relatively more over-reaction for low values of
W, and relatively more under-reaction for high values of W.
4. Study 1 (Baseline)
4.1 Experimental Design
In a controlled laboratory environment, subjects make sequential forecasts based on an evolving time
series of demand realizations generated from a perturbed random walk. Subjects were told they were
managing inventory at a retail store. For 50 periods, subjects observed demand and were asked to make a
point forecast for the next time period. Throughout the experiment, a visible graph was updated to include
all demand realizations up to the current period. A table also provided historic demand information, as
well as information on previous forecasts, absolute forecast errors, and relative forecast errors.
Our theoretical developments in the previous section posit that human forecasters react to forecast
errors, and that their reaction pattern depends systematically on the forecasting environment. To test our
main research hypothesis (system neglect), we vary experimental conditions along the two parameters of
our forecasting environment, c and n. First, we vary the degree of change, by letting c equal 0, 10, or 40.
Second, we vary the degree of noise, by letting n equal 10 or 40. This results in six experimental
conditions representing different demand environments, ranging from no-change-low-noise (c = 0, n =
10) to high-change-high-noise (c = 40, n = 40), as shown in Table 1.
Table 1: Overview of Experimental Conditions
Change-to-noise ratio $W = c^2/n^2$ (with $\alpha^*(W)$ in parentheses)

          n = 10                   n = 40
c = 0     Condition 1: 0 (0)       Condition 2: 0 (0)
c = 10    Condition 3: 1 (.62)     Condition 4: 1/16 (.22)
c = 40    Condition 5: 16 (.94)    Condition 6: 1 (.62)
Environments characterized by a significant degree of change over time are likely to produce rather
distinct demand evolutions. To ensure overall consistency between demand data and the data-generating
system, we generated four demand datasets from each of the six environments. Data in each time series
represented units of demand in each period. We implement the resulting 6*4 = 24 treatments in a
between-subject design. Subjects are not informed that the data is generated by a random walk with noise,
and they are also not provided with the actual parameters c and n of their condition (see Asparouhova et
al. 2009 for a similar design). Instead, subjects receive 30 historic data points from their condition before making
their first forecast, shown throughout the experiment in both the graph and the history table. The example
time-series datasets depicted in Figure 2 are actual datasets from our experiment.
The forecasting task was implemented in the experimental software zTree (Fischbacher 2007). In
order to provide incentives for accurate forecasting, we paid each subject $10 multiplied by the subject's
accuracy across the T = 50 periods. (Forecasting accuracy was defined as $(1 - MAPE)$, where
$MAPE = \frac{1}{T} \sum_{t=1}^{T} \left|\frac{D_t - F_t}{D_t}\right|$, the Mean Absolute Percentage Error calculated based on the entire history of
forecasts $F_t$ and demand observations $D_t$.) In addition, each subject was paid a participation fee of $5.
Payoffs were rounded up to the next full dollar value, and the average payoff was $14.80.
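As a sketch, this payment rule can be written as follows (illustrative code; names are ours):

    import math

    def payoff(demand, forecasts, base=10.0, fee=5.0):
        # Accuracy-based payment: $10 * (1 - MAPE) plus the $5 participation
        # fee, rounded up to the next full dollar value.
        errors = [abs(d - f) / d for d, f in zip(demand, forecasts)]
        mape = sum(errors) / len(errors)
        return math.ceil(base * (1.0 - mape) + fee)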
4.2 Data
The baseline study (Study 1) was conducted at a behavioral lab in a large, public university in the
American Midwest. The 252 participants in the study belonged to a subject pool associated with the
business school, and registered for the study in response to an online posting. About 50% of the subjects
were current undergraduate students from various fields. The remaining 50% consisted of either graduate
students or staff at the university. 23 of the 24 treatment conditions had at least 10 subjects, while one
treatment had 8 subjects.
To correct for errors and outliers, we examined all individual forecasts $F_{it}$ with an absolute forecast
error $|D_t - F_{it}| > 300$. In a few cases, obvious typographical errors could be determined and the
forecasts were corrected accordingly. If the intended forecast could not be determined, but the response
appeared to be a typographical error (e.g., a single forecast of 20 in between a long series of forecasts between
700 and 900), that forecast was recorded as missing. In total, such corrections were rare (<0.1% of all
forecasts). Prior to completing Study 1, we also completed a pretest (261 subjects) at a different university
located in the American Northeast; more details about the pretest are given in Appendix 1.
4.3 Initial Analyses
Let $F_{it}$ denote the forecast for period t made by subject i in period t-1, after observing demand $D_{t-1}$, and let
$\bar{F}_t = \frac{1}{I} \sum_i F_{it}$ denote the corresponding average across all I individuals within a given condition. The
optimal forecast for period t is given by $F_t(D_{t-1} \,|\, \alpha^*(W))$, which we abbreviate by $F_t^*$ for notational
convenience. Through its dependence on the smoothing constant $\alpha^*(W)$ and the demand realizations
$D_{t-1}$, it is understood that $F_t^*$ is specific to each of the six conditions (which differ by W) as well as to
each of the four demand sets within a condition (which differ by the vector of demand realizations $D_t$).
Table 2 compares the observed mean absolute forecast error $MAE(D_t, F_{it}) = \frac{1}{SIT} \sum_{s,i,t} |F_{sit} - D_{st}|$,
which is the T-period average across all I subjects in all S demand seeds within a given demand
environment, over all conditions. Simple t-tests (p ≤ .01) confirm the observed mean absolute error is
significantly larger than the corresponding error measure based on optimal forecasts, $MAE(D_t, F_t^*)$.
Further, a comparison across environments is consistent with our intuition: performance deteriorates
when noise n and instability c increase.
Table 2: Observed Forecasting Performance Measured by MAE
(optimal performance in parentheses)

          n = 10           n = 40
c = 0     10.15 (7.75)     38.55 (30.74)
c = 10    16.42 (12.86)    47.36 (36.51)
c = 40    38.94 (34.34)    64.03 (53.54)

Notes. All differences between observed and optimal MAEs are significant (p ≤ .01).

Figure 4 illustrates the evolution of demand $D_t$, observed forecasts $\bar{F}_t$, and normative forecasts $F_t^*$.
Without formal analysis, we can make a number of observations. The observed forecasts (grey line)
mimic the evolution of demand (dots). This is consistent with exponential smoothing, but certainly not
optimal in the stable demand environments (conditions 1 and 2), where the correct forecasts $F_t^*$ (black
line) do not react at all to demand signals. Further, while both the observed as well as the normatively
correct forecasts represent a smoothed version of demand, especially condition 4 shows that there is more
variability in the series of observed forecasts than in the series of normatively correct forecasts.
Figure 4: Sample Evolutions of Demand, Average Observed Forecast, Normative Forecast
We next compare observed forecast adjustments to the normative exponential smoothing benchmark.
To formalize adjustments as a response to observed forecast errors, we define the adjustment score
$\alpha_{it} = \frac{F_{it} - F_{it-1}}{D_{t-1} - F_{it-1}}$, which follows immediately from rearranging the single exponential smoothing formula
in Eq. (2).³ We can use this ratio to categorize observed behavior, as shown in Figure 5. A score of
$\alpha_{it} < 0$ would indicate that subjects adjusted their forecast in the opposite direction of their forecast error
(11% of all observations). Possible explanations of such behavior would be that subjects either followed a
previously salient trend expectation, or believed in the law of small numbers, i.e. that high values of a
stable series balance out with small values in small samples. An adjustment score of $\alpha_{it} = 0$ (10% of all
observations) indicates no reaction. If the adjustment score falls between 0 and 1 (42% of all
observations), it is consistent with adjusting the current forecast toward the observed forecast error. Finally,
any adjustment score $\alpha_{it} > 1$ (37% of all observations) indicates that subjects were extrapolating
illusionary trends into the future. This initial analysis highlights that while simple error-response level
adjustment is the dominant response pattern, there is strong evidence that subjects tend to adjust their
forecasts beyond the range consistent with simple exponential smoothing. This calls for a more
comprehensive description of behavior, which we provide in the next section.

³ By construction, this score is not defined for the first period, nor whenever $D_{t-1} = F_{it-1}$. Note that this ratio has been used before as an adjustment score in newsvendor research (cf. Schweitzer and Cachon 2000).
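The computation and categorization of adjustment scores can be sketched as follows (illustrative code, not the original analysis scripts):

    def adjustment_scores(demand, forecasts):
        # alpha_it = (F_t - F_{t-1}) / (D_{t-1} - F_{t-1}); undefined for the
        # first period and whenever the previous forecast error is zero.
        scores = []
        for t in range(1, len(forecasts)):
            denom = demand[t - 1] - forecasts[t - 1]
            if denom != 0:
                scores.append((forecasts[t] - forecasts[t - 1]) / denom)
        return scores

    def categorize(score):
        if score < 0:
            return "opposite adjustment (trend reversal or gambler's fallacy)"
        if score == 0:
            return "no reaction"
        if score <= 1:
            return "error response (exponential smoothing range)"
        return "over-extrapolation (illusionary trend)"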
Figure 5: Ranges for the Adjustment Score $\alpha_{it}$
To provide a brief, aggregate analysis⁴ of forecast adjustments across conditions, we calculate
$\bar{\alpha} = \frac{1}{SIT} \sum_{s,i,t} \alpha_{it}(s)$, while noting that such average scores need to be interpreted with caution. Several
directional observations can be made (see Table 3). First, the reaction α increases in the degree of change,
and decreases in the degree of noise. This observation is in line with our normative predictions from Eq.
(1), as subject behavior corresponds directionally to change in the change-to-noise ratio as one would
expect. Second, in all conditions, the average reaction differs from the normative reaction. Condition 5
shows some evidence for underreaction, whereas all other conditions show some evidence of
overreaction.
⁴ Because excessively high and low adjustments can have a strong influence on this analysis, we remove all $\alpha_{it} < -1$ (5% of observations) and all $\alpha_{it} > 2$ (8% of observations).
Table 3: Average Forecast Adjustment Scores $\bar{\alpha}$

           n = 10                          n = 40
           ᾱ            α*(W)              ᾱ            α*(W)
c = 0      .59** (.01)  .00     (p = .09)  .56** (.01)  .00
           (p ≤ .01)                       (p ≤ .01)
c = 10     .74** (.01)  .62     (p ≤ .01)  .69** (.01)  .22
           (p ≤ .01)                       (p ≤ .01)
c = 40     .89** (.01)  .94     (p ≤ .01)  .80** (.01)  .62

Notes. Bold entries are average adjustment scores, with standard errors reported in parentheses. The **
indicates that all average adjustment scores are significantly different from their normative value.
The p-values between columns and rows are significance tests comparing average adjustment scores
between conditions.
4.4 A Generalized Model of Forecasting Behavior
The previous section has highlighted that actual behavior is not completely captured by single exponential
smoothing, and identifying a descriptively accurate forecasting model is ultimately an empirical question.
Rather than imposing single exponential smoothing as the only model, we allow our data to select a
preferred model of forecasting behavior. We include two generalizations in the empirical specification of
behavior: initial anchoring and illusionary trends. Initial anchoring refers to the well-documented
tendency of individuals to anchor their decisions on some artificial or self-generated value (Epley and
Gilovich 2001). Illusionary trends refer to the idea that people are quick to see trends where there are none
(DeBondt 1993).
We conceptualize forecasts as containing three essential structural components: a level estimate $L_t$, a
trend estimate $T_t$, and 'trembling hands' noise $\varepsilon_t$, leading to a generalized structural equation for forecast
$F_{t+1}$:

$F_{t+1} = L_t + T_t + \varepsilon_t$.  (3)
We include the noise term because human decision making is known to be inherently stochastic
(Rustichini 2008). We specify the level term in (3) as
$L_t = \theta_L F_t + \alpha(D_t - F_t) + (1 - \theta_L) C$.  (4)
The specification in Eq. (4) introduces the anchoring parameter θL and the constant C. While exponential
smoothing suggests that forecasters correctly and exclusively anchor on their previous forecasts, the
literature on anchor and adjustment heuristics often includes the initial values of a time series as an
additional anchor (Chapman and Johnson 2002). The parameter θL allows people to either anchor their
forecasts only on previous forecasts (θL = 1), or to anchor their forecasts on the initial and constant value
16
C (θL = 0) or some combination of these two extremes (0 < θL < 1). Note that in stable time series, initial
anchoring with θL = 1 is normatively correct, since forecasts should be constant.
Forecasters should never develop trend estimates in the context of our time series, but this is a
normative aspect unknown to our subjects. Our data-generating process can produce random successive
level increases or decreases that can easily be perceived as trends (see Figure 2). Because there is
considerable evidence that forecasters are quick to see trends where they do not exist (DeBondt 1993), we
specify the trend term in (3) as
$T_t = T_{t-1} + \beta(L_t - L_{t-1} - T_{t-1})$,  (5)
which corresponds to double exponential smoothing. Using Eq. (4), we can re-write Eq. (5) as
$T_t = (1 - \beta) T_{t-1} + \beta(\theta_L - \alpha)(F_t - F_{t-1}) + \alpha\beta(D_t - D_{t-1})$.  (6)
Combining Eqs. (3), (4) and (6), rearranging terms, allowing $\Delta$ to symbolize first differences and $E_t$ to
represent the forecast error $(D_t - F_t)$, we can specify our generalized forecasting model as follows:

$F_{t+1} = \theta_L F_t + \alpha E_t + \alpha\beta \Delta D_t + \beta(\theta_L - \alpha) \Delta F_t + (1 - \theta_L) C + (1 - \beta) T_{t-1} + \varepsilon_t$.  (7)
Eq. (7) serves as the basis of our empirical estimation (see Appendix 2 for additional remarks on
model and error specification). This model nests the normative benchmark for our context (i.e., single
exponential smoothing) as a special case for $\theta_L = 1$ and $\beta = 0$.
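To illustrate the recursion, the following sketch generates the forecast path implied by Eqs. (3)-(5) for given behavioral parameters, omitting the noise term; the initial trend and level values are our own assumptions:

    def generalized_forecasts(demand, alpha, beta, theta_L, C, f0):
        # L_t = theta_L * F_t + alpha * (D_t - F_t) + (1 - theta_L) * C   (Eq. 4)
        # T_t = T_{t-1} + beta * (L_t - L_{t-1} - T_{t-1})                (Eq. 5)
        # F_{t+1} = L_t + T_t                                  (Eq. 3, sans noise)
        forecasts, trend, prev_level = [f0], 0.0, f0   # assumed initial values
        for d in demand:
            f = forecasts[-1]
            level = theta_L * f + alpha * (d - f) + (1.0 - theta_L) * C
            trend = trend + beta * (level - prev_level - trend)
            forecasts.append(level + trend)
            prev_level = level
        return forecasts

    # theta_L = 1 and beta = 0 recover single exponential smoothing, Eq. (2).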
To test whether the generalizations made in Eq. (7) are necessary, we conduct a hierarchical analysis
to examine whether it is empirically justified to simplify Eq. (7) to the normative benchmark of single
exponential smoothing or not. To that purpose, we estimate four different models: A full model (Model
4), a model without ΔFt as independent variable (Model 3), a model without ΔFt and ΔDt as independent
variables (Model 2), and a model without ΔFt and ΔDt as independent variables, without a constant, and
with the constraint of $\theta_L = 1$ (Model 1). For each of the six experimental conditions, we estimate the
structural parameters of these four models, including random slopes and intercepts (see Eq. (10) in Appendix
2), using maximum likelihood (ML).⁵ From the last row in each condition in Table A1 (labeled Δχ²), we
can see that the likelihood ratio tests for the decrease in model fit by going from one less constrained to a
more constrained model indicate that any simplification of model 4 leads to a significant decrease in
model fit across all conditions. This confirms our observation from the previous section. Based on the
model fit statistics, simple exponential smoothing does not fully describe observed behavior, and the
generalized model of Eq. (7) is empirically the preferred model. One can also see that α, our main
parameter of interest (the effect of $\mu(E_t)$, shown in the first row under each condition for all models),
generally increases the more we constrain the model. In other words, the estimate for α in simpler models
tends to suffer from a positive bias, i.e., be higher than its true value, since these simpler models do not
account for additional forms of subject reaction (besides error-response) inherent in our data.
Consequently, the simple average adjustment scores reported in Table 3 inflate the true reaction α, since
they do not control for additional behavioral effects. We therefore focus our analysis on interpreting the
behavioral parameters estimated in model 4.

⁵ All analyses were conducted in Stata 11, using the 'xtmixed' procedure. In a few instances, when models would not converge using ML estimation, restricted ML (REML) was used for estimation instead.
4.5 Results
We use the estimates for model 4 in each condition (see Table A1 in the appendix) to calculate the
behavioral parameters (i.e., α, β, $\theta_L$ and C) of Eq. (7) (see Appendix 2 for details). We provide an
overview of these behavioral parameters in Table 4. Note that all parameter tests reported in this section
are Wald tests on specific parameters, or linear/nonlinear combination of parameters. We use likelihood
ratio tests only to test null-hypotheses on random effects or to compare nested models (Verbeek 2000).
Clear patterns emerge from our analysis. In all conditions, the reaction parameter α is positive and
significant (p ≤ .01), indicating that individuals react to their most recent forecast error. We further test
whether the behavioral αs are different from their normative values, as Hypothesis 1 would predict. Note
that α > α* implies overreaction, and α < α* implies underreaction. We find evidence for overreaction in
conditions 1,2 and 4, a „correct‟ reaction that is not significantly different from the normative value in
conditions 3 and 6, and evidence for underreaction in condition 5. This pattern is consistent with system-
neglect and confirms our main hypothesis.
Table 4: Overview of Behavioral Model Parameter Estimates and Hypotheses Tests

           Condition 1   Condition 2   Condition 3   Condition 4   Condition 5   Condition 6
           c=0, n=10     c=0, n=40     c=10, n=10    c=10, n=40    c=40, n=10    c=40, n=40
α*         .00           .00           .62           .22           .94           .62
α          .39* (.04)    .48* (.05)    .68* (.04)    .60* (.04)    .70* (.03)    .56* (.04)
β          .28* (.09)    -.05 (.08)    .27* (.09)    .07 (.06)     .45* (.10)    .45* (.15)
θL         .70* (.03)    .75* (.03)    .99 (.01)     .90* (.02)    1.00 (.00)    .95* (.01)
C          501* (1.2)    507* (7.2)    -             625* (15)     -             723* (31)
α = α*     p ≤ .01       p ≤ .01       p = .13       p ≤ .01       p ≤ .01       p = .11

Notes. * p ≤ .01. Significance tests for α, β and C test whether these parameters are different from 0. For θL we test θL = 1. The
parameter C is not reported if θL is not different from 1. α and β, and the hypothesis tests related to α, are based on the mean
estimates of the corresponding random slopes; see Appendix 2 for details. Standard errors are in parentheses.
We now test whether our behavioral estimates for α from Table 4 react to changes in our experimental
parameters as expected. To do so, we re-estimate model 4 across two conditions simultaneously, allowing
model parameters to change between conditions. This contrast estimation allows us to test whether model
parameters are significantly different from each other between conditions. The results from this analysis
are reported in Table 5.
Table 5: Contrasts of α Estimates Across Demand Conditions

           n = 10                 n = 40
c = 0      .39 (.04)   (p = .32)  .48 (.05)
           (p ≤ .01)              (p ≤ .10)
c = 10     .68 (.04)   (p ≤ .10)  .60 (.04)
           (p = .66)              (p = .36)
c = 40     .70 (.03)   (p ≤ .01)  .56 (.04)

Notes. Bold entries are estimated values for α from Model 4, Table A1,
with standard errors reported in parentheses. The p-values between columns and
rows are significance tests comparing α between conditions.
We observe that α increases significantly in c initially (c = 0 versus c = 10), with no further increase for c
= 40. We further observe that α decreases significantly in n, except for the stationary demand conditions
with c = 0.
The full model includes additional behavioral parameters besides α. While we view these additional
parameters primarily as statistical controls for behavioral effects that may otherwise be falsely attributed
to 𝛼, we briefly comment on three additional results. First, consider the anchoring parameter θL. Note that
in conditions 1 and 2, initial anchoring is functionally equivalent to the normative benchmark, and we see
evidence for such anchoring in the data. In the other conditions, initial anchoring is not the normative
benchmark, and we see that in conditions 3 and 5 (i.e., the low-noise conditions), no initial anchoring takes
place. We observe evidence for initial anchoring only in the high-noise conditions (4 and 6), leading to the
conclusion that this decision bias is visible in our data, but only if noise is high.
Next, consider the illusionary trend parameter β. In conditions 1, 3, 5, and 6, β is positive and
significant, indicating that in these conditions, respondents do tend to see illusionary trends. Interestingly,
there is little evidence for a significant β in conditions 2 and 4, indicating that the tendency to detect
illusionary trends tends to be less prevalent in high noise conditions. Finally, consider the effects of ΔFt in
Table A1. These effects are generally negative and significant. Since θL > α in all of our conditions, Eq.
(7) does not explain these negative effects. A possible post-hoc explanation for these negative effects may
lie in regret: Strong recent changes in a forecast are countered by later forecast adjustments in the
opposite direction. People may recognize their tendency to over-react (either due to system neglect or
illusionary trends), and actively counter this prior over-reaction in a current forecast.
4.6 Performance Implications
We now explore how the decision biases uncovered in the previous section impact forecasting
performance. Using the estimations from our generalized forecasting model, we can attribute loss in
forecasting performance to two classes of (mis)behavior: systematic decision biases such as mis-specified
error-response ($\alpha \neq \alpha^*(W)$), initial anchoring ($\theta_L \neq 1$), or illusionary trends ($\beta > 0$), and unsystematic
“trembling hands” random errors. To separate these two sources of performance loss, we calculate for
each demand seed s the forecast performance of three types of forecast evolutions: the normative, the
observed, and the “behaviorally predicted” forecasts. The normative forecast in period t of seed s is
defined by $F_{st}^* = F_{st}(D_{st-1}, F_{st-1}^* \,|\, \alpha^*(W))$, where $\alpha^*(W)$ is common to all demand seeds within a
demand environment. The observed forecast of subject i in period t of seed s is denoted $F_{sit}$. The
predicted forecasts are defined as $\hat{F}_{sit} = \hat{F}_{sit}(D_{st-1}, \hat{F}_{sit-1} \,|\, \hat{\Theta}_{sit})$, where $\hat{\Theta}_{sit}$ are the estimated parameters
of our generalized forecasting model, including fixed effects and best linear unbiased predictions of
random effects at the "dataset" level. The predicted forecasts $\hat{F}_{sit}$ are seed- and individual-specific
forecasts that, unlike the observed $F_{sit}$, were "filtered" through the structural estimation of the parameters
of our generalized forecasting model. These predictions were obtained using best linear unbiased
predictions in Stata (see Bates and Pinheiro 1998). We then measure performance for each of our six
demand conditions as the mean absolute forecast error, averaged across all I subjects i and all S seeds s
within that condition. Formally, for the observed forecasts, we define $MAE^o = \frac{1}{SIT} \sum_{s,i,t} |F_{sit} - D_{st}|$, and
equivalently for the normative ($MAE^n$) and predicted ($MAE^p$) forecasts.⁶ Using these definitions, we can
describe the total performance loss from observed forecasts as $(MAE^o - MAE^n)/MAE^n$. Importantly, we
can precisely capture the loss in forecasting performance due to systematic decision biases ("Loss 1") as
$(MAE^p - MAE^n)/MAE^n$, and the loss in forecasting performance due to unsystematic random noise in
decision making ("Loss 2") as $(MAE^o - MAE^p)/MAE^n$. Table 6 provides an overview of our analysis.
⁶ Because we cannot fit our generalized forecasting model for periods 30-32 due to insufficient forecasting history, and period 79 is the last period which results in an observed error, we use all forecasts made from periods 33-79.
Table 6: Mean Absolute Forecast Errors and Performance Loss

        MAE^n           MAE^p           MAE^o           Loss 1       Loss 2       Loss (Total)
        n=10   n=40     n=10   n=40     n=10   n=40     n=10  n=40   n=10  n=40   n=10  n=40
c=0     7.75   30.74    8.86   34.88    10.15  38.55    14%   13%    17%   12%    31%   25%
c=10    12.86  36.51    14.34  42.01    16.42  47.36    11%   15%    16%   15%    28%   30%
c=40    34.34  53.54    35.41  56.82    38.94  64.03    3%    6%     10%   13%    13%   20%

Notes. MAE^n = normative, MAE^p = predicted, MAE^o = observed. Loss 1 = (MAE^p - MAE^n)/MAE^n;
Loss 2 = (MAE^o - MAE^p)/MAE^n; Loss (Total) = (MAE^o - MAE^n)/MAE^n.
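As a concrete illustration of the decomposition, a minimal sketch applied to the condition 1 figures from Table 6:

    def loss_decomposition(mae_n, mae_p, mae_o):
        # Loss 1: systematic decision biases; Loss 2: unsystematic decision
        # noise; both expressed relative to the normative MAE.
        loss1 = (mae_p - mae_n) / mae_n
        loss2 = (mae_o - mae_p) / mae_n
        return loss1, loss2, loss1 + loss2

    # Condition 1 (c = 0, n = 10): 7.75, 8.86, 10.15 -> roughly 14%, 17%, 31%.
    print(loss_decomposition(7.75, 8.86, 10.15))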
As expected, we observe that mean absolute errors from observed forecasts exceed those from
normative forecasts. We also observe that MAEs increase in c and n, but this result has to be interpreted
with caution. Different environments produce different forecast performance due to their inherent
complexity. Forecasting performance relative to optimal performance improves in less stable
environments (Loss Total). Interestingly, we observe that in general, the loss in performance due to
random decision making (Loss 2) is as high as, or higher than, the loss of performance due to decision
biases (Loss 1). Counter to intuition, Loss 1 is lower in conditions with high change (c = 40) than in
conditions with little (c = 10) or no (c = 0) change. It seems that the decision heuristics individuals use to
make forecasts work better in unstable and changing environments, and become more biased in stable
environments.
One could make the argument that our performance comparison in Table 6 is unfair. Subjects did not
know the c and n of their experimental condition, and would have had to use their existing data to
estimate these parameters. Or, alternatively, forecasters may use an out-of-sample procedure, where they
use the existing data to find optimal smoothing parameters directly (instead of estimating c and n).
Therefore, we compared our normative MAEs to the MAEs obtained from an out-of-sample procedure: In
each time period for each dataset, we estimated an optimal α⁷ and created a forecast using that optimal α.
The MAE resulting from such out-of-sample forecasts was very close to the normative MAE reported in
Table 6.

⁷ As an optimality criterion, we used either the MAE or the MAPE, in addition to a more modern maximum likelihood approach (Hyndman, Koehler, Snyder and Grose 2002). We also used a maximum likelihood procedure that allows for simultaneous parameter estimation in the context of double exponential smoothing (Andrawis and Atiya 2009).
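One simple variant of such an out-of-sample procedure is sketched below: in each period, a grid search picks the α that minimizes the MAE over the history observed so far, and that α produces the next one-step forecast (a stand-in for the MAPE- and likelihood-based criteria in footnote 7):

    def best_alpha(history, f0, grid=None):
        # Grid-search the alpha minimizing in-sample MAE of one-step SES forecasts.
        grid = grid if grid is not None else [i / 100.0 for i in range(101)]
        def mae(alpha):
            f, total = f0, 0.0
            for d in history:
                total += abs(d - f)
                f += alpha * (d - f)
            return total / len(history)
        return min(grid, key=mae)

    def out_of_sample_forecasts(demand, f0, warmup=30):
        # Re-estimate alpha each period from all data observed so far,
        # then forecast the next period with the re-estimated alpha.
        forecasts = []
        for t in range(warmup, len(demand) + 1):
            alpha = best_alpha(demand[:t], f0)
            f = f0
            for d in demand[:t]:
                f += alpha * (d - f)
            forecasts.append(f)            # one-step forecast after t observations
        return forecasts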
5. Study 2 (Intervention)
The previous section documents how biased reaction patterns can lead to suboptimal forecasting
accuracy. In this section we ask how one could improve performance. Consider that a key observation
from our baseline study is a systematic pattern of over- and under-reaction to forecast errors. The main
idea explaining this system neglect pattern is that the sensory nature of a signal may partially override the
cognitive processes that determine how much weight should be attributed to the signal. This consistent
tendency to “neglect the system” lends itself to the design of a possible intervention to improve
forecasting performance. If one can re-emphasize the system that created a signal before a decision maker
comes up with a forecast, one might reduce this 'cognitive override' of the signal, and lower the bias
created by system-neglect. We therefore attempt to render signals less salient relative to the broader
information about the environment that produced this signal by asking subjects to sequentially prepare
forecasts for different demand time series. If subjects switch to a different time-series before making a
new forecast, they need to re-focus on the new series. We hypothesize that this process breaks the
saliency of observed forecast errors, and reduces system-neglect and the resulting over/under-reaction
pattern across different demand environments.
HYPOTHESIS 2 (MULTIPLE TASK STRUCTURE): A task design that requires repeated sequential
forecasts for multiple time-series reduces system-neglect patterns and improves performance.
5.1 Experimental Design
In each period, subjects observe demand for a product, make a forecast for the next period, and repeat this
task for all products in that period. As in our baseline study, we provide both a graph and a table that
display the history of demand realizations as well as information on forecast errors.
We use demand environments 4 (c = 10, n = 40) and 6 (c = 40, n = 40) from our baseline study, and
make the demand environment a between-subject factor. The two environments were specifically chosen
because they represent conditions with high and low Loss 1 in our baseline study (see Table 6),
while manipulating only c and leaving n constant. For 12 periods, subjects make one forecast for
each of the four demand datasets nested within the same demand environment. We again pay subjects
based on their mean absolute percentage forecasting error averaged across four products and 12 periods,
in addition to a participation fee of $5. Subjects earned, on average, $14.94. Thirty-four subjects
participated in demand environment 4, and forty subjects participated in environment 6. Subjects were
from the same subject pool and had not participated in the baseline study.
5.2 Analysis
To test Hypothesis 2, we need to establish that subjects overreact less to their forecast error in condition 4
in the 4-Product treatment when compared to the baseline 1-Product treatment, while they react similarly
in both treatments (i.e., close to the normative value) in condition 6. For this analysis, we re-estimate Eq.
(7) removing the random effects at the condition level while retaining random effects at the individual
level. We also add a control variable for decision number, since forecasts in the baseline treatment will
have been made earlier in the course of the experiment when compared to the 4-Product intervention.
Finally, we allow heteroskedasticity of regression errors between the 1-Product and 4-Product treatments.
A summary of the relevant parameter estimates and tests that examine whether these parameter estimates
differ between the 1-Product and 4-Product treatments is given in Table 7.
Table 7: Behavioral Model Parameter Estimates in Managerial Intervention

              Condition 4                        Condition 6
Coefficient   1-Product   4-Product   Δ          1-Product   4-Product   Δ
α*            .22         .22         -          .62         .62         -
α             .71         .52         p ≤ .05    .70         .70         p = .94
σ(ε)          29          24          p ≤ .01    26          29          p ≤ .01

Notes. ** p ≤ .01; * p ≤ .05; † p ≤ .10. The Δ columns provide p-values for the tests that
compare parameter estimates in the 4-Product treatment to those in the 1-Product treatment.
Tests for σ(ε) are LR tests for the decrease in model fit for a homoskedastic model.
As predicted, the parameter α in the 4-Product treatment is lower than in the 1-Product treatment within
condition 4. Decision makers overreact to their error in both cases, but less so in the 4-Product treatment.
In condition 6, as expected, α estimates are not different from each other between the two treatments. This
analysis supports Hypothesis 2.
Next we analyze the performance implications of the behavioral changes created by the intervention.
As a first test to compare the forecasting performance of subjects between the 4-Product and the 1-Product
(baseline) treatments, we calculated the mean absolute error (MAE) within each condition where both the
4-Product and 1-Product treatments were applied (conditions 4 and 6). To be more precise, in the 1-Product
treatment, the MAE was calculated over only periods 1-12, as only those periods existed in the 4-Product
treatment. A simple t-test comparing the 4-Product treatment MAE (= 45.47) to the 1-Product treatment
MAE (= 50.82) in condition 4 reveals a significantly lower MAE in the 4-Product treatment (Δ = 5.35, p
≤ .05). A similar comparison, however, in condition 6 reveals no significant difference between the two
groups (Δ = -3.73, p = .17). This finding is consistent with our expectations. Reducing system neglect
through multi-product forecasting increases forecasting performance in condition 4 and has little or no
effect in condition 6.
To test our predictions more precisely, we estimate a multi-level random effects model to predict the
absolute forecast error in each observation. To make an equivalent comparison, we add an additional
hierarchical level to our analysis, denoting x(s) to be the forecasting context x nested in dataset s. For
example, the first forecast made in dataset 1 in condition 4 is a different context than the second forecast
made in the same dataset. This random effect allows us to control for the randomness in absolute errors. We also
added a decision # variable to control for learning and fatigue effects, as the same instance is forecast at
different times in the game in the 4-Product treatment and in the 1-Product treatment. The results from
this analysis are summarized in Table 8.
Table 8: Differences in the Absolute Forecast Errors in the 4-Product vs. 1-Product Treatment

Dependent Variable: Absolute Forecast Error    Estimate    Standard Error
Condition 6                                    10.74       (8.55)
4-Product Treatment                            -7.96**     (2.61)
Condition 6 × 4-Product Treatment              8.04**      (2.24)
Decision #                                     .16         (.11)
Constant                                       49.09**     (6.10)
σ_s (random intercept, dataset)                4.34        (8.35)
σ_i (random intercept, context)                36.27**     (2.91)
N                                              4,047

Notes. ** p ≤ .01, * p ≤ .05, † p ≤ .10. Tests on σ are LR tests. Condition 6 is coded as a dummy
variable = 1 (0 if condition 4); 4-Product Treatment is a dummy variable = 1 (0 if 1-Product
treatment).
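A model of this form can be fit with standard mixed-model routines. A hedged sketch using Python's statsmodels (the original analysis was run in Stata; the column names are our placeholders):

    import pandas as pd
    import statsmodels.formula.api as smf

    def fit_error_model(df: pd.DataFrame):
        # Expects a long-format DataFrame with (placeholder) columns:
        # abs_error, cond6 (0/1), multi (0/1 for the 4-Product treatment),
        # decision_no, dataset, and context (forecasting context x nested in s).
        model = smf.mixedlm(
            "abs_error ~ cond6 * multi + decision_no",   # fixed effects
            data=df,
            groups=df["dataset"],                        # random intercept: dataset
            vc_formula={"context": "0 + C(context)"},    # nested context intercepts
        )
        return model.fit()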
These results support our prior predictions. Absolute forecast errors in the 4-Product treatment in condition 4 are almost 8 points lower than in the 1-Product treatment (p ≤ .01). The same is not true in condition 6, where the forecast errors are not statistically different (p = .97). This analysis provides further support for Hypothesis 2.
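A model of this form can be estimated with standard mixed-model software. The sketch below uses the MixedLM routine in Python's statsmodels, with a random intercept for the demand dataset and a variance component for the forecasting context nested within it; all column names (abs_error, cond6, product4, decision_num, dataset, context) are hypothetical labels, and the sketch approximates rather than reproduces our estimation procedure.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical data: one row per forecast, with the absolute error, treatment
    # dummies, a decision counter, and identifiers for the demand dataset and the
    # forecasting context nested within it (all column names are assumptions).
    df = pd.read_csv("forecasts.csv")

    # Random intercept for the demand dataset (grouping factor) plus a variance
    # component for context nested within dataset, mirroring the two random
    # intercepts reported in Table 8.
    model = smf.mixedlm(
        "abs_error ~ cond6 * product4 + decision_num",
        data=df,
        groups="dataset",
        re_formula="1",
        vc_formula={"context": "0 + C(context)"},
    )
    print(model.fit().summary())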
6. Conclusion
Our research investigates judgmental time-series forecasting in environments that can be precisely
described by their stability and noisiness. Behavior is to some degree consistent with the mechanics of
single-exponential smoothing, the normative benchmark in our context. However, subjects tend to
overreact to observed forecast errors for stable time series, and under-react to forecast error for less stable
time series. This pattern is consistent with the system-neglect hypothesis found in the regime-change
literature (cf. Massey and Wu 2005); our research provides empirical support for this hypothesis in a
“many small changes” time-series forecasting context, which is notably different from the “few big
changes” environments commonly investigated in the regime-change literature.
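To make the normative benchmark concrete, the following minimal sketch implements the single-exponential-smoothing updating rule; the demand series and the smoothing constant α = 0.3 are illustrative assumptions, not values from our experiment.

    import numpy as np

    def single_exponential_smoothing(demand, alpha, initial_forecast):
        """One-step-ahead forecasts: F_{t+1} = F_t + alpha * (D_t - F_t)."""
        forecasts = [initial_forecast]
        for d in demand:
            error = d - forecasts[-1]          # observed forecast error E_t
            forecasts.append(forecasts[-1] + alpha * error)
        return np.array(forecasts)

    # Illustrative demand series and smoothing constant (assumptions).
    demand = np.array([100, 104, 98, 110, 107, 95, 102])
    print(np.round(single_exponential_smoothing(demand, alpha=0.3,
                                                initial_forecast=100.0), 1))

In this benchmark, α governs how strongly each new forecast error is incorporated; the over-reaction we document in stable conditions corresponds to behaving as if α were larger than its optimal value α*.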
Surprisingly, our results show that decisions made in a stable environment suffer from stronger
systematic decision biases, compared to decisions made in less stable environments. Human judgment
appears to be more adapted to detecting change in volatile environments than to exploiting information in
stable environments. A human tendency to react to noise may simply be the result of an evolved decision
heuristic geared towards the detection of (and adaptation to) change. This points to managerial judgment being relatively more valuable in unstable environments; in stable environments, emphasis should be placed on automating decision making.
We also show that the decline in forecasting performance due to randomness is at least as strong as, if not stronger than, the decline in forecasting performance due to systematic biases. Since such randomness in decision making is mitigated by groups (i.e., multiple individuals preparing independent forecasts that are then averaged; Larrick and Soll 2006), this points to the large benefits that can be obtained from an effective group decision-making process in forecasting.
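As a stylized illustration of this mechanism, the short simulation below shows how averaging independent, noisy forecasts of the same demand reduces the mean absolute error; the noise level and group sizes are arbitrary assumptions chosen for the demonstration.

    import numpy as np

    rng = np.random.default_rng(42)
    true_demand = 100.0
    n_trials = 10_000

    for group_size in (1, 3, 5, 10):
        # Each forecast = true demand + independent judgment error
        # (a standard deviation of 20 is an arbitrary assumption).
        forecasts = true_demand + rng.normal(0.0, 20.0, size=(n_trials, group_size))
        group_forecast = forecasts.mean(axis=1)  # simple average across the group
        mae = np.abs(group_forecast - true_demand).mean()
        print(f"group of {group_size:2d}: MAE = {mae:.2f}")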
We test an experimental intervention designed to mitigate the systematic over-/under-reaction pattern. In particular, we required subjects to make sequential forecasts for multiple products in an effort to emphasize the environment (which is shared by all products) and de-emphasize the salience of each product's signal. This intervention is effective in our laboratory setting. From a theoretical perspective, this finding provides further evidence that the psychological process underlying the observed over- and under-reaction patterns is indeed related to the low relative salience of the system generating an observed forecast error. It also suggests that it is possible to "overspecialize" in forecasting. While specialization in forecasting may increase tacit domain knowledge about the market and product, it may also increase the influence of system neglect in decision making.
Our results relate to the growing literature on behavioral operations management. Specifically, experimental studies of simple newsvendor settings have documented a persistent tendency to chase demand in stationary environments (Schweitzer and Cachon 2000, Bolton and Katok 2008, Kremer et al. 2010). Our study suggests that this tendency may be a forecasting phenomenon and not exclusively related to inventory ordering. While subjects in newsvendor studies have perfect knowledge about the underlying demand-generating system, the system-neglect hypothesis suggests that the signals and feedback they observe during the course of the experiment will lead them to at least partially neglect that knowledge. We therefore conjecture that decomposing forecasting and ordering in newsvendor experiments may be a fruitful and important endeavor. Further, newsvendor studies assume that the use of a stationary and known demand environment makes the forecasting task simpler. Our results suggest that stable environments lead to more biased decision making. If subjects neglect their knowledge of the system and change forecasts based on signals, stable demand environments not only have little ecological validity (Brown and Steyvers 2009), they may also be environments that significantly decrease the performance of human judgment. Finally, subjects in most newsvendor and beer-game studies are confronted with demand stimuli in quick succession. Such a context makes demand signals highly salient. Our study suggests that decision makers may perform better when the relative salience of the most recent demand signal is mitigated, for example by re-emphasizing the environment before making the next decision. It would be interesting to test whether performance in newsvendor experiments can be improved by re-emphasizing the demand environment after each decision.
Our study has several limitations. Our intervention study was designed to emphasize the “demand-generating system” by asking subjects to forecast multiple time series, but the success of this intervention may stem merely from the time lag between successive forecasts of an individual time series. Future research could explore whether similar performance improvements can be achieved by occupying subjects for the same amount of time with an unrelated task. Additionally, while our analyses explicitly controlled for initial anchoring and illusionary trends, our study was not designed to explore these behaviors in detail. Future research should further explore these (or other) behavioral phenomena in demand forecasting. Finally, our forecasting context assumes that forecasters have no quantitative forecasting support available besides a graph and history table. In practice, many forecasts represent judgmental adjustments to an anchor provided by a quantitative forecasting technique (Fildes et al. 2009). Future research could more explicitly address the impact of such anchors.
Our research provides a solid theoretical and empirical framework for modeling human judgment in forecasting non-stationary time series. This rich context is relevant to many fields beyond operations management. For example, our framework may be useful for the study of overreaction and
illusionary trends in stock markets, or for examining how medical doctors interpret longitudinal data of
their patients, or perhaps as a window for understanding human reactions to climate change. We envision
these developments to be not only empirical but also theoretical in nature. Our research suggests a simple
and fairly generic way of formally capturing a persistent judgment bias and its relationship to parameters
describing a non-stationary environment. Our results could thus inform future work on how to design
information and incentive systems that are robust to the kinds of judgment biases we observe.
References
Adams, J. A. 1968. Response feedback and learning. Psychological Bulletin 70(6) 486-504.
Andrawis, R. R., A. F. Atiya. 2009. A new Bayesian formulation for Holt's exponential smoothing. J. Forecasting 28(3): 218-234.
Andreassen, P. B., S. J. Kraus. 1990. Judgmental extrapolation and the salience of change. J. Forecasting
9(4) 347-372.
Asparouhova, E., M. Hertzel, M. Lemmon. 2009. Inference from streaks in random outcomes:
Experimental evidence on beliefs in regime shifting and the law of small numbers. Management
Sci. 55(11) 1766-1782.
Barberis, N., A. Shleifer and R. Vishny. 1998. A model of investor sentiment. J. Financial Economics,
49: 307-343.
Barry, D. M., G. F. Pitz. 1979. Detection of change in nonstationary, random sequences. Organizational
Behavior and Human Performance 24 111-125.
Bates, D. M., J. C. Pinheiro. 1998. Computational methods for multilevel modeling. Technical
Memorandum BL0112140-980226-01TM. Murray Hill, NJ: Bell Labs, Lucent Technologies.
Bendoly, E., K. Donohue, K. L. Schultz. 2006. Behavior in operations management: Assessing recent
findings and revisiting old assumptions. J. Operations Management 24 737-752.
Bloomfield, R. and J. Hales. 2002. Predicting the next step of a random walk: experimental evidence of
regime-shifting beliefs. J. Financial Economics 65 397-414.
Bolger, F., N. Harvey. 1993. Context-sensitive heuristics in statistical reasoning. Quarterly J.
Experimental Psychology 46A(4) 779-811.
Bolton, G., E. Katok. 2008. Learning-by-doing in the newsvendor problem: A laboratory investigation of
the role of experience and feedback. Manufacturing & Service Operations Management, 10(3)
519-538.
Brav, A., J. B. Heaton. 2002. Competing theories of financial anomalies. Rev. Fin. Stud. 15(2) 575-606.
Brown, S. D., M. Steyvers. 2009. Detecting and predicting changes. Cognitive Psychology 58 49-67.
Camerer, C. F. 1995. Individual decision making. J. Kagel, A. Roth, eds. The Handbook of Experimental
Economics. Princeton University Press, Princeton, NJ.
Carbone, R., W. Gorr. 1985. Accuracy of judgmental forecasting of time series. Decision Sciences 16
153-160.
Chapman, G. B., E. J. Johnson. 2002. Incorporating the irrelevant: Anchors in judgments of belief and
value. T. Gilovich, D. Griffin, D. Kahneman, eds. Heuristics and Biases. Cambridge University
Press, Cambridge UK, 120-138.
Croson, R., K. Donohue. 2003. Impact of POS data sharing on supply chain management: An
experimental study. Production and Operations Management 12(1) 1-11.
Croson, R., K. Donohue, E. Katok, J. Sterman. 2009. Order stability in supply chains: Coordination risk
and the role of coordination stock. Working paper.
DeBondt, W. F. M. 1993. Betting on trends: Intuitive forecasts of financial risk and return. International
J. Forecasting 9 355-371.
Edwards, W. 1968. Conservatism in human information processing. B. Kleinmuntz, ed. Formal
Representation of Human Judgment. Wiley, NY, 17-52.
Epley, N., T. Gilovich. 2001. Putting adjustment back in the anchoring and adjustment heuristic. Psych.
Science, 12(5): 391-396.
Fildes, R., P. Goodwin, M. Lawrence, K. Nikolopoulos. 2009. Effective forecasting and judgmental adjustments: An empirical evaluation and strategies for improvement in supply-chain planning. International J. Forecasting 25 3-23.
Fischbacher, U. 2007. z-Tree: Zurich toolbox for ready-made economic experiments. Experimental
Economics 10(2) 171–178.
Gardner, E. S. 1985. Exponential smoothing: The state of the art. J. Forecasting 4(1) 1-28.
Gardner, E. S. 2006. Exponential smoothing: The state of the art– Part II. International J. Forecasting 22
637-666.
Gehring, W. J., B. Goss, M. G. H. Coles, D. E. Meyer, E. Donchin. 1993. A neural system for error
detection and compensation. Psychological Science 4(6) 385-390.
Griffin, D., A. Tversky. 1992. The weighing of evidence and the determinants of confidence. Cognitive
Psychology 24 411-435.
Harvey, N. 2007. Use of heuristics: Insights from forecasting research. Thinking & Reasoning 13(1) 5-24.
Harrison, P. J. 1967. Exponential smoothing and short-term sales forecasting. Management Sci. 13(11)
821-842.
Hyndman, R. J., A. B. Koehler, R. D. Snyder, S. Grose. 2002. A state space framework for automatic
forecasting using exponential smoothing methods. Int. J. Forecasting 18 439-454.
Kahneman, D., A. Tversky. 1972. Subjective probability: A judgment of representativeness. Cognitive
Psychology 3 430-454.
Kremer, M., S. Minner, L.N. Van Wassenhove. 2010. Do random errors explain newsvendor behavior?
Manufacturing & Service Operations Management. Forthcoming.
Larrick, R. P., J. B. Soll. 2006. Intuitions about combining opinions: Misappreciation of the averaging principle. Management Sci. 52(1) 111-127.
Lawrence, M. J., R. H. Edmundson, M. J. O'Connor. 1985. An examination of the accuracy of judgmental
extrapolation of time series. International Journal of Forecasting 1 25-35.
Lawrence, M., M. O'Connor. 1992. Exploring judgmental forecasting. International J. Forecasting 8 15-
26.
Lawrence, M., M. O'Connor. 1995. The anchor and adjustment heuristic in time-series forecasting. J.
Forecasting 14 443-451.
Lawrence, M., P. Goodwin, M. O'Connor, D. Önkal. 2006. Judgemental forecasting: A review of progress over the last 25 years. International J. Forecasting 22 493-518.
Lee, H. L., V. Padmanabhan, S. Whang. 1997. Information distortion in the supply chain: The bullwhip
effect. Management Sci. 43(4) 546-558.
Makridakis, S., S. Wheelwright, R. Hyndman. 1998. Forecasting: Methods and Applications. Wiley, New
York, NY.
Marathe, R. R., S. M. Ryan. 2005. On the validity of the geometric Brownian motion assumption. Engineering Economist 50 159-192.
Massey, C., G. Wu. 2005. Detecting regime shifts: The causes of under- and overreaction. Management Sci. 51(6) 932-947.
McNamara, J. M., A. I. Houston. 1987. Memory and the efficient use of information. J. Theoretical
Biology. 125 385-395.
Poteshman, A.M. 2001. Underreaction, overreaction, and increasing misreaction to information in the
options market. J. Finance, 56(3): 851-876.
Rabin, M. 2002. Inference by believers in the law of small numbers. Quarterly J. Economics, 117(3):
775-816.
Rabin, M. and D. Vayanos. 2009. The gambler's and hot-hand fallacies: Theory and applications. University of California working paper.
Rustichini, A. 2008. Neuroeconomics: Formal models of decision-making and cognitive neuroscience.
Glimcher, P. W., C. Camerer, R. Poldrack, E. Fehr, eds. Neuroeconomics. Elsevier, Holland, UK,
33-46.
Sanders, N. 1992. Accuracy of judgmental forecasts: A comparison. Omega 20(3) 353-364.
Sanders, N. 1997. The impact of task properties feedback on time series judgmental forecasting tasks.
Omega 25 135-144.
Sanders, N., K. B. Manrodt. 2003a. Forecasting software in practice: Use, satisfaction, and performance.
Interfaces 33(5) 90-93.
Sanders, N., K. B. Manrodt. 2003b. The efficacy of using judgmental versus quantitative forecasting
methods in practice. Omega 31 511-522.
Schweitzer, M. E., G. Cachon. 2000. Decision bias in the newsvendor problem with a known demand distribution: Experimental evidence. Management Sci. 46(3) 404-420.
Stone, E.R., R.B. Opel. 2000. Training to improve calibration and discrimination: The effects of
performance and environmental feedback. Organizational Behavior and Human Decision
Processes 83 282-309.
Su, X. 2008. Bounded rationality in newsvendor models. Manufacturing and Service Operations
Management 10(4) 566-589.
Tversky, A., D. Kahneman. 1971. Belief in the law of small numbers. Psychological Bulletin 76(2) 105-110.
Verbeek, M. 2000. A Guide to Modern Econometrics. Wiley, New York, NY.
Wiener, N. 1948. Cybernetics or Control and Communication in the Animal and the Machine. Wiley,
New York, NY.
Winters, P. R. 1960. Forecasting sales by exponentially weighted moving averages. Management Sci. 6(3) 324-342.
Appendix 1: Pre-test Information
Prior to the study presented in this paper, we completed a thorough pre-test of our experiment. The task, experimental parameters, software, and functionality were very similar to those of the baseline study reported here, with two exceptions: First, participants in the pre-test made decisions for only 40 consecutive periods, while the data presented here are based on 50 periods. Second, the students in the pre-test were given course extra credit for participating and were entered into a drawing for one cash reward per section. We conducted the same statistical tests on our pre-test data and found results that are directionally identical to those reported here. The pre-test was predominantly used to determine whether subjects should receive a graph of the time series, and whether providing qualitative information on the demand series (product with stable/unstable demand) influenced performance. The final design (subjects receive a graph but no qualitative information) corresponds to the pre-test setting in which subjects performed best.
Appendix 2: Econometric Specification and Estimation Details
Equation (7) provides a basis for the behavioral model we estimate in our analysis. An empirical problem with equation (7) is that we do not observe data on T_{t-1}. This could bias empirical results. To at least partially control for this potential bias, we propose to estimate Eq. (7) with the additional independent variables ΔD_{t-1} and ΔF_{t-1}, leading to the following empirical specification:

F_{t+1} = a_1 E_t + a_2 F_t + a_3 ΔD_t + a_4 ΔD_{t-1} + a_5 ΔF_t + a_6 ΔF_{t-1} + constant    (8)
Finally, as we will see in the analysis, the following (equivalent) specification of Eqn. (8) provides for
an easier comparison of nested models, and therefore serves as our primary empirical specification:
F_{t+1} = a_1 E_t + (a_2 + 1) F_t + a_3 ΔD_t + a_4 ΔD_{t-1} + a_5 ΔF_t + a_6 ΔF_{t-1} + constant    (9)
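As an illustration of how specification (9) could be estimated in its simplest, pooled form, the sketch below constructs the lagged regressors and fits the equation by ordinary least squares; the data file and column names are hypothetical assumptions, and our full analysis relies on the multilevel specification in Eq. (10) below rather than pooled OLS.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical long-format panel: one row per subject and period, holding
    # the submitted forecast F and the observed demand D (names are assumptions).
    df = pd.read_csv("forecast_panel.csv").sort_values(["subject", "period"])

    df["E"] = df["D"] - df["F"]                          # forecast error E_t
    df["dD"] = df.groupby("subject")["D"].diff()         # demand change ΔD_t
    df["dF"] = df.groupby("subject")["F"].diff()         # forecast change ΔF_t
    df["dD_lag"] = df.groupby("subject")["dD"].shift()   # ΔD_{t-1}
    df["dF_lag"] = df.groupby("subject")["dF"].shift()   # ΔF_{t-1}
    df["F_next"] = df.groupby("subject")["F"].shift(-1)  # next forecast F_{t+1}

    # Pooled OLS version of specification (9); the fitted coefficient on F
    # estimates (a_2 + 1), and rows lost to lagging are dropped.
    ols = smf.ols("F_next ~ E + F + dD + dD_lag + dF + dF_lag",
                  data=df.dropna()).fit()
    print(ols.params)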
In general, an observation at time t in the experiment is nested in subject i, who is nested in demand dataset s, which in turn is nested in experimental condition (i.e., demand environment) c. Since we estimate our model within each condition, this implies a three-level nested structure of error terms, such that we have random intercepts v_s and w_i. Further, we believe that the behavioral parameters of our model vary considerably, depending both on the actual dataset being observed and on the individual performing the forecast. This expectation would imply that a_1 through a_4 should be modeled as random coefficients. However, results from our pre-test show that, while there was some variance in a_1 and a_3, there was little variance in the other two coefficients. Estimating random-coefficient models in which the coefficients have little variance can lead to non-convergence and inappropriate standard errors. We therefore estimate only a_1 and a_3 as random slopes. This three-level random-effects model effectively controls for the dependence among observations in our dataset. In summary, we can write:
F_{t+1,(c,s,i)} = a_{1,si} E_t + (a_2 + 1) F_t + a_{3,si} ΔD_t + a_4 ΔD_{t-1} + a_5 ΔF_t + a_6 ΔF_{t-1} + con. + v_{s(c)} + w_{i(s,c)} + ε_t    (10)
All random coefficients are estimated as having a normal distribution. In our results, we use µ and σ to
refer to the mean and standard deviation of that distribution. For example, µ(Et) refers to the mean of the
random slope a1, whereas σi(Et) refers to the standard deviation of that slope at the individual level. The
behavioral parameters of Eq. (7) can then be calculated as follows: α = µ(E_t), θ_L = a_2 + 1, β = µ(ΔD_t)/µ(E_t), and C = −con./a_2.
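The parameter recovery implied by these formulas can be written as a small helper function; the coefficient values in the example call are hypothetical placeholders, not estimates from Table A1.

    def behavioral_parameters(mu_E, mu_dD, a2, constant):
        """Recover the behavioral parameters of Eq. (7) from estimated coefficients."""
        alpha = mu_E                # error-reaction weight: alpha = mu(E_t)
        theta_L = a2 + 1            # weight on the previous forecast level
        beta = mu_dD / mu_E         # relative weight on the demand-change signal
        C = -constant / a2          # implied long-run anchor level
        return alpha, theta_L, beta, C

    # Hypothetical coefficient values for illustration (not estimates from Table A1).
    alpha, theta_L, beta, C = behavioral_parameters(mu_E=0.60, mu_dD=0.15,
                                                    a2=-0.10, constant=62.0)
    print(f"alpha={alpha:.2f}, theta_L={theta_L:.2f}, beta={beta:.2f}, C={C:.1f}")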
Table A1: Results from Behavioral Estimation by Condition
Condition 1 Condition 2 Condition 3
Model 1 Model 2 Model 3 Model 4 Model 1 Model 2 Model 3 Model 4 Model 1 Model 2 Model 3 Model 4
μ(Et) .63** (.04) .49** (.03) .40** (.03) .39** (.04) .54** (.04) .45** (.04) .46** (.05) .48** (.05) .84** (.05) .85** (.04) .74** (.04) .68** (.04)
Ft -.39** (.02) -.37** (.03) -.30** (.03) -.33** (.02) -.31** (.02) -.25** (.03) -.01 (.01) -.01† (.01) -.01 (.01)
μ(ΔDt) .10** (.03) .11** (.03) -.01 (.04) -.03 (.04) .14** (.05) .19** (.05)
ΔDt-1 .00 (.01) .01 (.02) -.02† (.01) -.05** (.02) -.01 (.01) .02 (.02)
ΔFt -.08** (.02) -.03 (.02) -.08** (.02)
ΔFt-1 -.05** (.02) -.08** (.02) -.07** (.01)
μ(con.) 194** (11) 183** (13) 151** (15) 166** (11) 156** (12) 129** (15) 8.2† (4.4) 8.5* (4.2) 4.2 (3.9)
σs(Et) .00 (.00) .00 (.00) .00 (.00) .00 (.00) .00 (.00) .00 (.00) .05 (.06) .05 (.07) .07 (.05) .06 (.05) .00 (.00) .00 (.00)
σs(ΔDt) .00 (.00) .00 (.00) .04 (.05) .04 (.05) .06 (.05) .07 (.05)
σs(con.) .81 (.36) .76 (.33) .65 (.29) 4.0* (1.9) 3.5* (1.6) 3.2* (1.5) 1.2† (.63) 1.1† (.56) .88† (.49)
σi(Et) .22** (.03) .18** (.02) .18** (.03) .17** (.03) .26** (.03) .22** (.03) .22** (.03) .22** (.03) .20** (.03) .20** (.03) .19** (.03) .18** (.03)
σi(ΔDt) .13** (.02) .13** (.02) .14** (.03) .15** (.03) .19** (.03) .19** (.03)
σi(con.) .97 (.21) .77 (.21) .59 (.23) 6.2** (.96) 4.9** (.85) 4.7** (.86) 1.78 (.37) 1.3 (.39) 1.0 (.57)
N/Sub. 2,021 (43) 1,923 (41) 1,880 (40)
Δχ2 256.00** 63.93** 13.48** 250.61** 57.60** 26.74** 65.99** 70.28** 32.08**
Condition 4 Condition 5 Condition 6
Model 1 Model 2 Model 3 Model 4 Model 1 Model 2 Model 3 Model 4 Model 1 Model 2 Model 3 Model 4
μ(Et) .72** (.03) .67** (.03) .67** (.03) .60** (.04) .97** (.04) .98** (.05) .82** (.03) .70** (.03) .82** (.05) .80** (.05) .66** (.04) .56** (.04)
Ft -.16** (.02) -.15** (.02) -.10** (.02) .00 (.00) .00 (.00) .00 (.00) -.09** (.01) -.07** (.01) -.05** (.01)
μ(ΔDt) -.02 (.03) .04 (.03) .19** (.06) .31** (.07) .16* (.08) .25** (.08)
ΔDt-1 -.04** (.01) .01 (.02) -.03** (.01) .10** (.02) -.04** (.01) .06** (.02)
ΔFt -.14** (.02) -.14** (.02) -.14** (.02)
ΔFt-1 -.07** (.01) -.05** (.01) -.03** (.01)
μ(con.) 99** (9.5) 90 (9.9) 62** (11) 4.5† (2.6) 2.9 (2.4) 1.2 (2.2) 67** (6.4) 52** (6.1) 37** (6.3)
σs(Et) .00 (.00) .00 (.00) .02 (.10) .02 (.08) .07 (.03) .08 (.04) .00 (.00) .00 (.00) .07 (.05) .07 (.04) .00 (.00) .00 (.00)
σs(ΔDt) .03 (.05) .02 (.07) .12** (.05) .11** (.05) .14** (.06) .13** (.06)
σs(con.) 2.9* (1.4) 3.0* (1.4) 2.5* (1.2) 2.2 (1.5) 1.9 (1.3) 1.7 (1.1) 4.4† (2.1) 3.1† (1.6) 2.4† (1.3)
σi(Et) .18** (.02) .17** (.04) .15** (.03) .13** (.03) .12** (.02) .11** (.02) .11** (.02) .10** (.02) .21** (.03) .19** (.02) .20** (.03) .17** (.03)
σi(ΔDt) .10* (.03) .10* (.03) .10** (.02) .10** (.03) .18** (.03) .18** (.03)
σi(con.) 3.5** (.98) 3.6** (1.0) 2.4 (1.2) 5.3** (.81) 4.1** (.83) 3.1** (1.1) 6.9** (1.2) 4.9* (1.1) 3.4† (1.3)
N/Sub. 1,880 (40) 2,018 (43) 2,111 (45)
Δχ2 135.47** 15.37** 36.73** 95.81** 95.98** 47.93** 169.29** 160.60** 43.41**
Notes. ** p ≤ .01; * p ≤ .05; † p ≤ .10. µ(x) stands for the mean of random effect x; σi(x) stands for the standard deviation of random effect x at the individual level (s indicates the dataset level).