Demand Forecasting Behavior: System Neglect and Change Detection
Mirko Kremer Smeal College of Business, Pennsylvania State University, University Park, Pennsylvania 16802,
Brent Moritz Carlson School of Management, University of Minnesota, Minneapolis, Minnesota 55455,
Enno Siemsen Carlson School of Management, University of Minnesota, Minneapolis, Minnesota 55455,
Abstract:
This research analyzes how individuals make forecasts based on time series data, and tests
an intervention designed to improve forecasting performance. Using data from a controlled
laboratory experiment, we find that forecasting behavior systematically deviates from
normative predictions: Forecasters over-react to errors in relatively stable environments, but
under-react to errors in relatively unstable environments. Surprisingly, the performance loss
due to systematic judgment biases is larger in stable than in unstable environments. In a
second study, we test an intervention designed to mitigate these biased reaction patterns. In
order to reduce the salience of recent demand signals and emphasize the environment
generating these signals, we require forecasters to prepare forecasts for other time series
before returning to their original time series. This intervention improves forecasting
performance.
Keywords: forecasting; behavioral operations; system-neglect; exponential smoothing
Working Paper Draft - Please do not Distribute
1. Introduction
Demand forecasting in time series environments is fundamental to many operational decisions. Poor
forecasts can result in inadequate capacity, excess inventory or inferior customer service. Given the
importance of good forecasts to operational success, quantitative methods of time-series forecasting are
well known and widely available (cf. Makridakis, Wheelwright and Hyndman 1998). Despite the fact that
companies frequently have access to time-series history and sophisticated quantitative methods embedded
in forecasting software, empirical evidence shows that real world forecasting frequently relies on human
judgment. In a study of 240 U.S. corporations, over 90% of companies reported having access to
some forecasting software (Sanders and Manrodt 2003a), yet only 29% of firms primarily use
quantitative forecasting methods; 30% primarily use judgmental methods, while the remaining 41% apply both
quantitative and judgmental methods (Sanders and Manrodt 2003b). Although quantitative analysis based
on a time-series may often provide the basis for a forecast, it is a common practice to alter such forecasts
based on human judgment (Fildes, Goodwin, Lawrence and Nikolopoulos 2009).
A recent trend in operations management research is to study operational decisions from a behavioral
perspective (Bendoly, Donohue and Schultz 2006). While much research in behavioral operations
management is devoted to inventory decision making, Schweitzer and Cachon (2000, p. 419) highlight
the importance of explicitly separating the forecasting task from the inventory decision task:
“While the forecasting task typically requires managerial judgment, the task of converting a
forecast into an order quantity can be automated. A firm may reduce decision bias by asking
managers to generate forecasts that are then automatically converted into order quantities.”
Thus, inventory decisions can (and frequently should) be decomposed: When choosing an order quantity,
an individual has to estimate the probability distribution of future demand; derive a service level; and
then use the demand distribution and the service level to determine an order quantity. Biased judgments of
demand distributions would result in sub-optimal inventory decisions. For example, Schweitzer and
Cachon (2000) investigate newsvendor decision making under stationary and known demand
distributions, a setting where demand forecasting is theoretically irrelevant. A key finding is that order
quantities are on average biased towards mean demand, relative to the expected profit maximizing order
quantity. This biased ordering has been attributed to unsystematic randomness in decision making (Su
2008), as well as to more systematic biases like demand chasing (Kremer et al. 2010), i.e., the tendency
to adjust orders toward previous demand. In a more complex “beergame” setting, Croson and Donohue
(2003) observe the bullwhip effect, i.e. upstream order amplification in the supply chain, with participants
who face a known and stationary demand distribution. Croson, Donohue, Katok and Sterman (2009)
observe this effect even with constant and deterministic demand. In sum, existing experimental evidence
suggests that biased judgments of demand distributions can strongly affect the quality of higher-order
decisions like purchasing, inventory or capacity planning. Therefore, the analysis of judgmental
forecasting is crucial for a better understanding of decision making in operations management.
Extensive literature on human judgment in time-series forecasting exists (Lawrence, Goodwin,
O'Connor and Onkal 2006). Central findings include the wide-spread use of heuristics such as anchor-
and-adjustment, as well as the importance of feedback and task decomposition on forecasting
performance. However, the overall findings remain somewhat inconclusive, in part because forecasting
behavior appears sensitive to different components of the time-series. Further, the judgmental forecasting
literature is typically concerned with the detection of predictable changes in a time series, such as trends
or seasonality (Harvey 2007). In contrast, our research is focused on individual reaction to unpredictable
change in time-series. We ask the following two research questions: First, how do individuals create time-
series forecasts in unstable environments? Second, what can managers do to improve forecasting
performance?
We study these questions in a laboratory setting that allows for precise normative predictions:
forecasting a time series generated by a perturbed random walk. Across a wide range of environmental
conditions, we show that time-series forecasting behavior is described by an error-response model.
However, forecasters tend to over-react to forecast errors in more stable environments and under-react to
forecast errors in less stable environments. This pattern is consistent with the system neglect hypothesis
(Massey and Wu 2005) which posits that forecasters place too much weight on recently observed forecast
errors relative to the environment that produces these signals. To explore how to improve forecasting
performance, we therefore design and test an intervention which builds directly on the system neglect
hypothesis. Instead of making forecasts for a single time-series (our base study), we require subjects to
make forecasts for multiple time-series in parallel in our second study, in an attempt to reduce the relative
salience of recent signals and re-emphasize the demand environment underlying, and common to, all time-
series. We find that this simple intervention can improve forecasting performance.
We proceed in this paper as follows. The next section outlines the academic literature that relates to
our research. In §3 we discuss our theoretical developments. In §4 we discuss the results of our first
study, which is focused on understanding human judgment in time-series analysis tasks. Section 5 is
devoted to the results of our second study, which emphasizes managerial interventions to improve human
judgment in time-series analysis tasks. We discuss our results and conclude the paper in §6.
2. Related Literature
Existing research on judgmental forecasting provides vast but somewhat inconclusive empirical evidence
regarding forecasting performance, cognitive processes, and managerial interventions. Many studies have
been devoted to comparing the performance of human forecasts to quantitative forecasting methods, but
the empirical evidence is not consistent (Lawrence et al. 1985, Carbone and Gorr 1985, Sanders 1992,
Fildes et al. 2009). The literature has also investigated a variety of cognitive processes underlying the
evolution of judgmental forecasts, such as different variations of the anchoring and adjustment heuristic
(Harvey 2007). Regarding managerial interventions, judgmental forecast accuracy can improve with
performance feedback (e.g., Stone and Opel 2000) and task properties feedback (e.g., Sanders 1997), but
the effectiveness of these levers depends on specific contextual elements of the forecasting task
(Lawrence et al. 2006). Existing research on judgmental time-series forecasting examines pattern
detection, i.e. how well human subjects can identify trends and seasonal changes in a noisy time series
(Andreassen and Kraus 1990; Lawrence and O'Connor 1992; Bolger and Harvey 1993; Lawrence and
O'Connor 1995). In contrast, our research focuses on change detection, i.e. how subjects separate random
noise from unsystematic level changes.
When observing signal variation in a time-series, a forecaster needs to identify if there is substantive
(and persistent) cause for this variation, or whether variation just represents noise with no implications for
future observations. The ability to distinguish substantive change from random variation has been studied
extensively in the literature on regime change detection (Barry and Pitz 1979). A central conclusion from
regime change research is that people under-react to change in environments that are unstable and have
precise signals, and overreact in environments that are stable with noisy signals (Griffin and Tversky
1992). This seemingly contradictory reaction pattern has been reconciled by the system-neglect
hypothesis (Massey and Wu 2005), which posits that individuals overweight signals relative to the
underlying system which generates the signals.
A related stream of research in financial economics seeks to explain the pattern of short-term under-
reaction and long-term overreaction to information, often observed in stock market investment decisions
(Poteshman 2001). Some theoretical work has been devoted to explaining this behavioral pattern, e.g. by
linking such behavior to the "gambler's fallacy" or the "hot-hand effect" (Barberis et al. 1998, Rabin
2002, Rabin and Vayanos 2009). In an asset pricing context, Brav and Heaton (2002) illustrate how an
over-/underreaction pattern arises from biased information processing of investors subject to the
representativeness heuristic (Kahneman and Tversky 1972) and conservatism (Edwards 1968), and show
how this pattern can also arise from a fully Bayesian investor lacking structural knowledge about the
possible instability of the time-series. Experimental tests of this “mixed-reaction pattern” include
Bloomfield and Hales (2002) and Asparouhova, Hertzel and Lemmon (2009).
A central difference between our research and existing research on human change detection patterns is
the complexity of the judgment environment. In Massey and Wu (2005), participants face binary signals
(red or blue balls) which can be generated from two regimes (draws from two urns with fixed proportions
of red and blue balls in each). Given a sequence of signals, the experimental task is to identify when a
regime change (i.e. a switch from one urn to the other) has occurred. Further, as subjects have perfect
knowledge of the system parameters (the proportion of blue balls in either urn), there is no ambiguity
concerning the relevant world. This environment fits a binary forecasting task where a well-known
phenomenon needs to be detected (for example, when a bull market turns into a bear market). Similarly,
in Bloomfield and Hales (2002) and Asparouhova et al. (2009), participants face a fairly simple series of
signals generated from a symmetric random walk. Brav and Heaton (2002) illustrate their theoretical
considerations in an environment where a series of independently and identically distributed assets
exhibits a single structural break, which shifts the asset distribution only once during the time series. A central
question of our research is whether the over-reaction/under-reaction patterns observed in such fairly
simple settings translate to the relatively richer environment of time-series demand forecasting under
frequent change. Further, beyond trying to understand human reaction patterns, our study designs and
tests an intervention to mitigate biases and the resulting performance losses.
3. Theory
To begin our theory development, it is important to briefly characterize the judgment task underlying a
time-series forecast. In essence, a forecaster (she, in the following) needs to decide whether observed variation in the time-
series data provides a reason to modify a previous forecast in the next period. We illustrate this judgment
task in Figure 1.
Figure 1: The Challenge of Time-Series Analysis
If she interprets variation purely as random noise, she can ignore this variation and not change her
forecast (i.e. a long-run average, the circle in Figure 1). If she believes that variation represents a change
in the underlying level of the time series, the most recent demand observations contain more information
about the future than past observations, and need to receive more weight in the forecast. Her forecast is
then close to the square in Figure 1. Finally, if she believes that this variation is indicative of a trend (an
ongoing change in the level), she would extrapolate the existing variation to re-occur in the future, and
her forecast would be close to the triangle in Figure 1.
In practice, these choices are not mutually exclusive. A forecaster may decide that variation is partially
due to noise, and partially due to a level change, and therefore create a forecast somewhere in between the
square and the circle in Figure 1. Or she may believe that variation represents both a level change and a
trend. The key challenge is differentiating level changes from noise. While our empirical analysis will
control for individuals potentially detecting illusory trends, our simulated demand environment does not
contain trends, and a comprehensive discussion of trend detection is beyond the scope of this paper.
3.1 Demand Environment
We assume that forecasters react to demand observations in time intervals indexed by t, without any
additional information on future demand realizations beyond that which is contained within the time
series. The level of our time series changes according to a random walk. If we define $\mu_t$ to be the level at
time t, the level at the next regular observation at time t + 1 is given by $\mu_{t+1} = \mu_t + V_t$, where $V_t$ is a
normally distributed random variable with mean 0 and standard deviation c. The demand observation $D_t$
in each time period is then a normally distributed random variable with mean $\mu_t$ and standard deviation n.
Roughly put, the standard deviation c captures the notion of change, i.e. permanent shocks to the time-
series, while the standard deviation n captures the noise surrounding the level, i.e. temporary shocks to the
time-series. With a change parameter c, the level of the time series in the next period has a 68% chance of
being within +/- c of the level in the current period. The noise parameter n implies a 68% chance of the
actual demand observation being less than +/- n away from the true level. Figure 2 illustrates how the
shape of a representative time series depends on these two parameters.
While allowing for randomly changing levels μt, a time series from this data-generating process has no
underlying systematic trend or seasonality. Although real time-series often contain such elements,
methods to de-trend and de-seasonalize data are available (Winters 1960). For simplicity, our research is
focused on real-world data that has gone through such modifications, or data that can be well described by
Brownian motion, such as energy demand or airline passengers (Marathe and Ryan 2005). Importantly,
this simplification allows us to study the decision task of differentiating level changes from noise, without
further confounding such judgments with the estimation of trends and seasonal elements. Further, the
demand process we consider provides a simple normative benchmark: Single exponential smoothing
(Harrison 1967, McNamara and Houston 1987).
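To make this data-generating process concrete, the following Python sketch simulates one demand path (illustrative code with our own function and parameter names, not the software used in the experiment):

    import numpy as np

    def simulate_demand(c, n, T=80, mu0=500.0, seed=0):
        # Level follows a random walk: mu_{t+1} = mu_t + V_t, V_t ~ N(0, c^2).
        # Demand is the level plus temporary noise: D_t ~ N(mu_t, n^2).
        rng = np.random.default_rng(seed)
        steps = rng.normal(0.0, c, T - 1)              # permanent (change) shocks
        mu = mu0 + np.concatenate(([0.0], np.cumsum(steps)))
        demand = rng.normal(mu, n)                     # temporary (noise) shocks
        return mu, demand

    # Example: condition 4 (c = 10, n = 40); 30 historic plus 50 forecast periods.
    level, demand = simulate_demand(c=10, n=40, T=80)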
Figure 2: Sample Demand Paths for Different c and n ($\mu_0 = 500$). [Six panels plot demand over time for Conditions 1-6: (c=0, n=10), (c=0, n=40), (c=10, n=10), (c=10, n=40), (c=40, n=10), (c=40, n=40).]
3.2 Normative Benchmark
Structurally, the single exponential smoothing forecast $F_{t+1}$ (made in period t for period t + 1) is a
weighted average of the most recent demand observation and the previous forecast, $F_{t+1} = \alpha D_t + (1 - \alpha) F_t = F_t + \alpha (D_t - F_t)$. The latter part of this equation highlights how the forecast $F_{t+1}$ is driven
by a response to the forecast error. The appropriate smoothing level $\alpha^*$ is a function of the change (c) and
noise (n) parameters governing the time series. To further characterize $\alpha^*$, it is useful to introduce the
concept of weight of evidence. We formally define this weight as the change-to-noise ratio $W = c^2/n^2$,
which increases as the degree of change in the time-series (c) rises, and decreases as the noise in the time
series (n) intensifies. Intuitively, W measures the reliability with which an observed forecast error
represents a change in the level of a time series. With low W (variations in demand are mostly noise),
forecast errors should be discarded and should not influence behavior. With a high W (variations in
demand are mostly level changes), forecast errors should strongly influence forecasts. This intuition is
formally supported by Harrison (1967) and McNamara and Houston (1987), who show that¹

$\alpha^*(W) = \frac{2}{1 + \sqrt{1 + 4/W}}$.  (1)
Note that the optimal smoothing constant in Eq. (1) depends only on the change-to-noise ratio W, while
the demand time series is driven by absolute levels of c and n. For example in Figure 2, condition 3 (c =
10 and n = 10) and condition 6 (c = 40 and n = 40) have the same W, and therefore the same associated
$\alpha^*(W)$. The optimal forecasting mechanism for our demand environment is

$F_{t+1} = F_t + \alpha^*(W)(D_t - F_t)$.  (2)
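A minimal sketch of this benchmark, implementing Eq. (1) and the error-response recursion of Eq. (2) (function names are ours):

    import math

    def optimal_alpha(c, n):
        # Eq. (1) with W = c^2 / n^2; for c = 0 (no level changes), alpha* = 0.
        if c == 0:
            return 0.0
        W = (c / n) ** 2
        return 2.0 / (1.0 + math.sqrt(1.0 + 4.0 / W))

    def ses_forecasts(demand, alpha, f0):
        # Eq. (2): F_{t+1} = F_t + alpha * (D_t - F_t); forecasts[k] is the
        # forecast made after observing the first k demand realizations.
        forecasts = [f0]
        for d in demand:
            forecasts.append(forecasts[-1] + alpha * (d - forecasts[-1]))
        return forecasts

    # Reproduces the alpha*(W) values in Table 1, e.g.:
    # optimal_alpha(10, 10) ~ .62, optimal_alpha(10, 40) ~ .22, optimal_alpha(40, 10) ~ .94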
3.3 Behavioral Forecasting and System Neglect
The previous section outlines how single exponential smoothing with α*(W) is optimal for a random walk
described by c and n. In this section we discuss forecasting behavior relative to this normative benchmark.
From a behavioral perspective, Eq. (2) poses two critical assumptions on the forecaster's degree of
rationality (Brav and Heaton 2002). The optimal forecasting mechanism implies that the forecaster has
correct beliefs about the structure of the demand process, knowledge and understanding of the structure of
the optimal forecasting mechanism, and access to an unbiased estimate of α*(W). It is optimistic to
assume that given the complexity of the context, the forecaster has “structural certainty” about the
demand environment (perturbed random walk) and optimal forecasting mechanism (single-exponential
smoothing). For example, a forecaster may perceive trends where there are none (see Figure 1). Our
empirical estimation in Section 4 will therefore allow for richer models that describe forecasting behavior
beyond simple exponential smoothing. From a behavioral perspective, the crucial question is the choice of
the smoothing constant $\alpha(W)$, relative to the unbiased estimate $\alpha^*(W)$ in Eq. (2).
There are compelling reasons to assume that forecasters follow the error-response logic of simple
exponential smoothing. Practically, exponential smoothing corresponds to the mental process of error
detection and subsequent adaptation.² Given a constant smoothing parameter α, it represents single-loop
learning where a forecaster observes an error and then adjusts her next forecast based on that error. As
such, exponential smoothing, interpreted as trial-and-error learning, is a plausible model for real behavior.
Further, exponential smoothing has two important characteristics as a boundedly rational decision
heuristic: It does not require much memory, because the most recent forecast contains all information
necessary to make the next forecast, and it is a robust heuristic in many different environments beyond
the particular one used in our study (Gardner 1985 and 2006).

¹ McNamara and Houston derive expression (1) using Bayesian principles. The same expression (though articulated slightly differently) was derived by Harrison (1967) as the argument that minimizes the variance of forecast errors.
² This is also a fundamental principle of cybernetics (Wiener 1948) and the foundation for closed-loop theories of learning (Adams 1968). There is neurological evidence that our brain supports such a process (Gehring, Goss, Coles, Meyer and Donchin 1993).
How humans evaluate and subsequently respond to signals has been tied to the concepts of
representativeness and conservatism. Representativeness means that individuals have a tendency to
overreact to a signal and account only insufficiently for the weight they should attribute to that signal. For
example, individuals neglect to acknowledge that a small sample size implies that only a low statistical
weight should be attributed to the sample (Tversky and Kahneman 1971). In our context,
representativeness implies that forecasters consistently use a higher W than the underlying time series
would entail when determining their α(W). On the other hand, conservatism implies that individuals have
a tendency to underreact to a signal, even though the statistical weight they should attribute to the signal
is strong. This phenomenon has mostly been observed in the context of Bayesian updating (Camerer
1995). In our context, conservatism implies that human forecasters consistently use a lower W than the
underlying time series would entail. Massey and Wu (2005) integrate these observations into a system-
neglect hypothesis: the strength of the signal is salient in the decision maker's perception, whereas the
system that generated the signal is latent in the background. This leads to a general neglect of the system,
such that the decision maker emphasizes the strength of a signal at the expense of the weight that should
be attached to that signal. In other words, the weight of the signal is less of a determinant of behavior than
it ought to be. For the forecasting context of our study, system-neglect leads us to believe that W is less a
determinant of behavior than Eq. (1) implies. Specifically, we would expect that our behavioral α(W) is
less responsive to W (i.e. the curve is flatter) than α*(W).
As Massey and Wu point out, the system neglect hypothesis predicts that there is relatively more over-
reaction for low values of W, and relatively more under-reaction for high values of W. In our context, for
any $W_L < W_H$, we would have $\alpha(W_L) - \alpha^*(W_L) > \alpha(W_H) - \alpha^*(W_H)$. System neglect makes no specific
predictions about absolute levels of over- and under-reaction (i.e., if $\alpha(0)$ is very high, there could be over-
reaction for all values of W, and if $\alpha(0) = 0$, there could be under-reaction for all values of W). We
illustrate the difference between the normative reaction according to Eq. (1), and the predicted behavioral
reaction according to system neglect in Figure 3.
Figure 3: Normative versus System Neglect Reaction
Note that Figure 3 shows one possible pattern of system neglect (absolute over-reaction for low values of
W, and absolute under-reaction for high values of W). We hypothesize:
HYPOTHESIS 1 (SYSTEM NEGLECT): Individuals show relatively more over-reaction for low values of
W, and relatively more under-reaction for high values of W.
4. Study 1 (Baseline)
4.1 Experimental Design
In a controlled laboratory environment, subjects make sequential forecasts based on an evolving time
series of demand realizations generated from a perturbed random walk. Subjects were told they were
managing inventory at a retail store. For 50 periods, subjects observed demand and were asked to make a
point forecast for the next time period. Throughout the experiment, a visible graph was updated to include
all demand realizations up to the current period. A table also provided historic demand information, as
well as information on previous forecasts, absolute forecast errors, and relative forecast errors.
Our theoretical developments in the previous section posit that human forecasters react to forecast
errors, and that their reaction pattern depends systematically on the forecasting environment. To test our
main research hypothesis (system neglect), we vary experimental conditions along the two parameters of
our forecasting environment, c and n. First, we vary the degree of change, by letting c equal 0, 10, or 40.
Second, we vary the degree of noise, by letting n equal 10 or 40. This results in six experimental
conditions representing different demand environments, ranging from no-change-low-noise (c = 0, n =
10) to high-change-high-noise (c = 40, n = 40), as shown in Table 1.
Table 1: Overview of Experimental Conditions
Change-to-noise ratio $W = c^2/n^2$ (with $\alpha^*(W)$ in parentheses)

          n = 10                   n = 40
c = 0     Condition 1: 0 (0)       Condition 2: 0 (0)
c = 10    Condition 3: 1 (.62)     Condition 4: 1/16 (.22)
c = 40    Condition 5: 16 (.94)    Condition 6: 1 (.62)
Environments characterized by a significant degree of change over time are likely to produce rather
distinct demand evolutions. To ensure overall consistency between demand data and the data-generating
system, we generated four demand datasets from each of the six environments. Data in each time series
represented units of demand in each period. We implement the resulting 6*4 = 24 treatments in a
between-subject design. Subjects are not informed that the data is generated by a random walk with noise,
and they are also not provided with the actual parameters c and n of their condition (see Asparouhova et
al. 2009 for a similar design). Instead, subjects receive 30 historic data points from their condition before making
their first forecast, shown throughout the experiment in both the graph and the history table. The example
time-series datasets depicted in Figure 2 are actual datasets from our experiment.
The forecasting task was implemented in the experimental software zTree (Fischbacher 2007). In
order to provide incentives for accurate forecasting, we paid each subject $10 multiplied by the subject's
accuracy across the T = 50 periods. (Forecasting accuracy was defined as $(1 - MAPE)$, where
$MAPE = \frac{1}{T} \sum_{t=1}^{T} \left|\frac{D_t - F_t}{D_t}\right|$, the Mean Absolute Percentage Error calculated based on the entire history of
forecasts $F_t$ and demand observations $D_t$.) In addition, each subject was paid a participation fee of $5.
Payoffs were rounded up to the next full dollar value, and the average payoff was $14.80.
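As a sketch, this payment rule can be written as follows (illustrative code; names are ours):

    import math

    def payoff(demand, forecasts, base=10.0, fee=5.0):
        # Accuracy-based payment: $10 * (1 - MAPE) plus the $5 participation
        # fee, rounded up to the next full dollar value.
        errors = [abs(d - f) / d for d, f in zip(demand, forecasts)]
        mape = sum(errors) / len(errors)
        return math.ceil(base * (1.0 - mape) + fee)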
4.2 Data
The baseline study (Study 1) was conducted at a behavioral lab in a large, public university in the
American Midwest. The 252 participants in the study belonged to a subject pool associated with the
business school, and registered for the study in response to an online posting. About 50% of the subjects
were current undergraduate students from various fields. The remaining 50% consisted of either graduate
students or staff at the university. 23 of the 24 treatment conditions had at least 10 subjects, while one
treatment had 8 subjects.
To correct for errors and outliers, we examined all individual forecasts $F_{it}$ with an absolute forecast
error $|D_t - F_{it}| > 300$. In a few cases, obvious typographical errors could be determined and the
forecasts were corrected accordingly. If the intended forecast could not be determined, but the response
appeared to be a typographical error (e.g., a single forecast of 20 in between a long series of forecasts between
700 and 900), that forecast was recorded as missing. In total, such corrections were rare (<0.1% of all
forecasts). Prior to completing Study 1, we also completed a pretest (261 subjects) at a different university
located in the American Northeast; more details about the pretest are given in Appendix 1.
4.3 Initial Analyses
Let $F_{it}$ denote the forecast for period t made by subject i in period t-1, after observing demand $D_{t-1}$, and let
$\bar{F}_t = \frac{1}{I} \sum_i F_{it}$ denote the corresponding average across all I individuals within a given condition. The
optimal forecast for period t is given by $F_t(D_{t-1} \,|\, \alpha^*(W))$, which we abbreviate by $F_t^*$ for notational
convenience. Through its dependence on the smoothing constant $\alpha^*(W)$ and the demand realizations
$D_{t-1}$, it is understood that $F_t^*$ is specific to each of the six conditions (which differ by W) as well as to
each of the four demand sets within a condition (which differ by the vector of demand realizations $D_t$).
Table 2 compares the observed mean absolute forecast error $MAE(D_t, F_{it}) = \frac{1}{SIT} \sum_{s,i,t} |F_{sit} - D_{st}|$,
which is the T-period average across all I subjects in all S demand seeds within a given demand
environment, over all conditions. Simple t-tests (p ≤ .01) confirm the observed mean absolute error is
significantly larger than the corresponding error measure based on optimal forecasts, $MAE(D_t, F_t^*)$.
Further, a comparison across environments is consistent with our intuition: performance deteriorates
when noise n and instability c increase.
Table 2: Observed Forecasting Performance Measured by MAE
(optimal performance in parentheses)

          n = 10           n = 40
c = 0     10.15 (7.75)     38.55 (30.74)
c = 10    16.42 (12.86)    47.36 (36.51)
c = 40    38.94 (34.34)    64.03 (53.54)

Notes. All differences between observed and optimal MAEs are significant (p ≤ .01).

Figure 4 illustrates the evolution of demand $D_t$, observed forecasts $\bar{F}_t$, and normative forecasts $F_t^*$.
Without formal analysis, we can make a number of observations. The observed forecasts (grey line)
mimic the evolution of demand (dots). This is consistent with exponential smoothing, but certainly not
optimal in the stable demand environments (conditions 1 and 2), where the correct forecasts $F_t^*$ (black
line) do not react at all to demand signals. Further, while both the observed as well as the normatively
correct forecasts represent a smoothed version of demand, especially condition 4 shows that there is more
variability in the series of observed forecasts than in the series of normatively correct forecasts.
Figure 4: Sample Evolutions of Demand, Average Observed Forecast, Normative Forecast
We next compare observed forecast adjustments to the normative exponential smoothing benchmark.
To formalize adjustments as a response to observed forecast errors, we define the adjustment score
$\alpha_{it} = \frac{F_{it} - F_{it-1}}{D_{t-1} - F_{it-1}}$, which follows immediately from rearranging the single exponential smoothing formula
in Eq. (2).³ We can use this ratio to categorize observed behavior, as shown in Figure 5. A score of
$\alpha_{it} < 0$ would indicate that subjects adjusted their forecast in the opposite direction of their forecast error
(11% of all observations). Possible explanations of such behavior would be that subjects either followed a
previously salient trend expectation, or believed in the law of small numbers, i.e. that high values of a
stable series balance out with small values in small samples. An adjustment score of $\alpha_{it} = 0$ (10% of all
observations) indicates no reaction. If the adjustment score falls between 0 and 1 (42% of all
observations), it is consistent with adjusting the current forecast toward the observed forecast error. Finally,
any adjustment score $\alpha_{it} > 1$ (37% of all observations) indicates that subjects were extrapolating
illusionary trends into the future. This initial analysis highlights that while simple error-response level
adjustment is the dominant response pattern, there is strong evidence that subjects tend to adjust their
forecasts beyond the range consistent with simple exponential smoothing. This calls for a more
comprehensive description of behavior, which we provide in the next section.

³ By construction, this score is not defined for the first period, nor whenever $D_{t-1} = F_{it-1}$. Note that this ratio has been used before as an adjustment score in newsvendor research (cf. Schweitzer and Cachon 2000).
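The computation and categorization of adjustment scores can be sketched as follows (illustrative code, not the original analysis scripts):

    def adjustment_scores(demand, forecasts):
        # alpha_it = (F_t - F_{t-1}) / (D_{t-1} - F_{t-1}); undefined for the
        # first period and whenever the previous forecast error is zero.
        scores = []
        for t in range(1, len(forecasts)):
            denom = demand[t - 1] - forecasts[t - 1]
            if denom != 0:
                scores.append((forecasts[t] - forecasts[t - 1]) / denom)
        return scores

    def categorize(score):
        if score < 0:
            return "opposite adjustment (trend reversal or gambler's fallacy)"
        if score == 0:
            return "no reaction"
        if score <= 1:
            return "error response (exponential smoothing range)"
        return "over-extrapolation (illusionary trend)"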
Figure 5: Ranges for the Adjustment Score $\alpha_{it}$
To provide a brief, aggregate analysis⁴ of forecast adjustments across conditions, we calculate
$\bar{\alpha} = \frac{1}{SIT} \sum_{s,i,t} \alpha_{it}(s)$, while noting that such average scores need to be interpreted with caution. Several
directional observations can be made (see Table 3). First, the reaction α increases in the degree of change,
and decreases in the degree of noise. This observation is in line with our normative predictions from Eq.
(1), as subject behavior corresponds directionally to change in the change-to-noise ratio as one would
expect. Second, in all conditions, the average reaction differs from the normative reaction. Condition 5
shows some evidence for underreaction, whereas all other conditions show some evidence of
overreaction.
⁴ Because excessively high and low adjustments can have a strong influence on this analysis, we remove all $\alpha_{it} < -1$ (5% of observations) and all $\alpha_{it} > 2$ (8% of observations).
Table 3: Average Forecast Adjustment Scores $\bar{\alpha}$

           n = 10                          n = 40
           ᾱ            α*(W)              ᾱ            α*(W)
c = 0      .59** (.01)  .00     (p = .09)  .56** (.01)  .00
           (p ≤ .01)                       (p ≤ .01)
c = 10     .74** (.01)  .62     (p ≤ .01)  .69** (.01)  .22
           (p ≤ .01)                       (p ≤ .01)
c = 40     .89** (.01)  .94     (p ≤ .01)  .80** (.01)  .62

Notes. Bold entries are average adjustment scores, with standard errors reported in parentheses. The **
indicates that all average adjustment scores are significantly different from their normative value.
The p-values between columns and rows are significance tests comparing average adjustment scores
between conditions.
4.4 A Generalized Model of Forecasting Behavior
The previous section has highlighted that actual behavior is not completely captured by single exponential
smoothing, and identifying a descriptively accurate forecasting model is ultimately an empirical question.
Rather than imposing single exponential smoothing as the only model, we allow our data to select a
preferred model of forecasting behavior. We include two generalizations in the empirical specification of
behavior: initial anchoring and illusionary trends. Initial anchoring refers to the well-documented
tendency of individuals to anchor their decisions on some artificial or self-generated value (Epley and
Gilovich 2001). Illusionary trends refer to the idea that people are quick to see trends where there are none
(DeBondt 1993).
We conceptualize forecasts as containing three essential structural components: a level estimate $L_t$, a
trend estimate $T_t$, and 'trembling hands' noise $\varepsilon_t$, leading to a generalized structural equation for forecast
$F_{t+1}$:

$F_{t+1} = L_t + T_t + \varepsilon_t$.  (3)
We include the noise term because human decision making is known to be inherently stochastic
(Rustichini 2008). We specify the level term in (3) as
$L_t = \theta_L F_t + \alpha(D_t - F_t) + (1 - \theta_L) C$.  (4)
The specification in Eq. (4) introduces the anchoring parameter θL and the constant C. While exponential
smoothing suggests that forecasters correctly and exclusively anchor on their previous forecasts, the
literature on anchor and adjustment heuristics often includes the initial values of a time series as an
additional anchor (Chapman and Johnson 2002). The parameter θL allows people to either anchor their
forecasts only on previous forecasts (θL = 1), or to anchor their forecasts on the initial and constant value
16
C (θL = 0) or some combination of these two extremes (0 < θL < 1). Note that in stable time series, initial
anchoring with θL = 1 is normatively correct, since forecasts should be constant.
Forecasters should never develop trend estimates in the context of our time series, but this is a
normative aspect unknown to our subjects. Our data-generating process can produce random successive
level increases or decreases that can easily be perceived as trends (see Figure 2). Because there is
considerable evidence that forecasters are quick to see trends where they do not exist (DeBondt 1993), we
specify the trend term in (3) as
$T_t = T_{t-1} + \beta(L_t - L_{t-1} - T_{t-1})$,  (5)
which corresponds to double exponential smoothing. Using Eq. (4), we can re-write Eq. (5) as
$T_t = (1 - \beta) T_{t-1} + \beta(\theta_L - \alpha)(F_t - F_{t-1}) + \alpha\beta(D_t - D_{t-1})$.  (6)
Combining Eqs. (3), (4) and (6), rearranging terms, allowing $\Delta$ to symbolize first differences and $E_t$ to
represent the forecast error $(D_t - F_t)$, we can specify our generalized forecasting model as follows:

$F_{t+1} = \theta_L F_t + \alpha E_t + \alpha\beta \Delta D_t + \beta(\theta_L - \alpha) \Delta F_t + (1 - \theta_L) C + (1 - \beta) T_{t-1} + \varepsilon_t$.  (7)
Eq. (7) serves as the basis of our empirical estimation (see Appendix 2 for additional remarks on
model and error specification). This model nests the normative benchmark for our context (i.e., single
exponential smoothing) as a special case for $\theta_L = 1$ and $\beta = 0$.
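To illustrate the recursion, the following sketch generates the forecast path implied by Eqs. (3)-(5) for given behavioral parameters, omitting the noise term; the initial trend and level values are our own assumptions:

    def generalized_forecasts(demand, alpha, beta, theta_L, C, f0):
        # L_t = theta_L * F_t + alpha * (D_t - F_t) + (1 - theta_L) * C   (Eq. 4)
        # T_t = T_{t-1} + beta * (L_t - L_{t-1} - T_{t-1})                (Eq. 5)
        # F_{t+1} = L_t + T_t                                  (Eq. 3, sans noise)
        forecasts, trend, prev_level = [f0], 0.0, f0   # assumed initial values
        for d in demand:
            f = forecasts[-1]
            level = theta_L * f + alpha * (d - f) + (1.0 - theta_L) * C
            trend = trend + beta * (level - prev_level - trend)
            forecasts.append(level + trend)
            prev_level = level
        return forecasts

    # theta_L = 1 and beta = 0 recover single exponential smoothing, Eq. (2).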
To test whether the generalizations made in Eq. (7) are necessary, we conduct a hierarchical analysis
to examine whether it is empirically justified to simplify Eq. (7) to the normative benchmark of single
exponential smoothing or not. To that purpose, we estimate four different models: A full model (Model
4), a model without ΔFt as independent variable (Model 3), a model without ΔFt and ΔDt as independent
variables (Model 2), and a model without ΔFt and ΔDt as independent variables, without a constant, and
with the constraint of $\theta_L = 1$ (Model 1). For each of the six experimental conditions, we estimate the
structural parameters of these four models, including random slopes and intercepts (see Eq. (10) in Appendix
2), using maximum likelihood (ML).⁵ From the last row in each condition in Table A1 (labeled Δχ²), we
can see that the likelihood ratio tests for the decrease in model fit by going from one less constrained to a
more constrained model indicate that any simplification of model 4 leads to a significant decrease in
model fit across all conditions. This confirms our observation from the previous section. Based on the
model fit statistics, simple exponential smoothing does not fully describe observed behavior, and the
generalized model of Eq. (7) is empirically the preferred model. One can also see that α, our main
parameter of interest (the effect of $\mu(E_t)$, shown in the first row under each condition for all models),
generally increases the more we constrain the model. In other words, the estimate for α in simpler models
tends to suffer from a positive bias, i.e., be higher than its true value, since these simpler models do not
account for additional forms of subject reaction (besides error-response) inherent in our data.
Consequently, the simple average adjustment scores reported in Table 3 inflate the true reaction α, since
they do not control for additional behavioral effects. We therefore focus our analysis on interpreting the
behavioral parameters estimated in model 4.

⁵ All analyses were conducted in Stata 11, using the 'xtmixed' procedure. In a few instances, when models would not converge using ML estimation, restricted ML (REML) was used for estimation instead.
4.5 Results
We use the estimates for model 4 in each condition (see Table A1 in the appendix) to calculate the
behavioral parameters (i.e., α, β, $\theta_L$ and C) of Eq. (7) (see Appendix 2 for details). We provide an
overview of these behavioral parameters in Table 4. Note that all parameter tests reported in this section
are Wald tests on specific parameters, or linear/nonlinear combination of parameters. We use likelihood
ratio tests only to test null-hypotheses on random effects or to compare nested models (Verbeek 2000).
Clear patterns emerge from our analysis. In all conditions, the reaction parameter α is positive and
significant (p ≤ .01), indicating that individuals react to their most recent forecast error. We further test
whether the behavioral αs are different from their normative values, as Hypothesis 1 would predict. Note
that α > α* implies overreaction, and α < α* implies underreaction. We find evidence for overreaction in
conditions 1,2 and 4, a „correct‟ reaction that is not significantly different from the normative value in
conditions 3 and 6, and evidence for underreaction in condition 5. This pattern is consistent with system-
neglect and confirms our main hypothesis.
Table 4: Overview of Behavioral Model Parameter Estimates and Hypotheses Tests

           Condition 1   Condition 2   Condition 3   Condition 4   Condition 5   Condition 6
           c=0, n=10     c=0, n=40     c=10, n=10    c=10, n=40    c=40, n=10    c=40, n=40
α*         .00           .00           .62           .22           .94           .62
α          .39* (.04)    .48* (.05)    .68* (.04)    .60* (.04)    .70* (.03)    .56* (.04)
β          .28* (.09)    -.05 (.08)    .27* (.09)    .07 (.06)     .45* (.10)    .45* (.15)
θL         .70* (.03)    .75* (.03)    .99 (.01)     .90* (.02)    1.00 (.00)    .95* (.01)
C          501* (1.2)    507* (7.2)    -             625* (15)     -             723* (31)
α = α*     p ≤ .01       p ≤ .01       p = .13       p ≤ .01       p ≤ .01       p = .11

Notes. * p ≤ .01. Significance tests for α, β and C test whether these parameters are different from 0. For θL we test θL = 1. The
parameter C is not reported if θL is not different from 1. α and β, and the hypothesis tests related to α, are based on the mean
estimates of the corresponding random slopes; see Appendix 2 for details. Standard errors are in parentheses.
We now test whether our behavioral estimates for α from Table 4 react to changes in our experimental
parameters as expected. To do so, we re-estimate model 4 across two conditions simultaneously, allowing
model parameters to change between conditions. This contrast estimation allows us to test whether model
parameters are significantly different from each other between conditions. The results from this analysis
are reported in Table 5.
Table 5: Contrasts of α Estimates Across Demand Conditions

           n = 10                 n = 40
c = 0      .39 (.04)   (p = .32)  .48 (.05)
           (p ≤ .01)              (p ≤ .10)
c = 10     .68 (.04)   (p ≤ .10)  .60 (.04)
           (p = .66)              (p = .36)
c = 40     .70 (.03)   (p ≤ .01)  .56 (.04)

Notes. Bold entries are estimated values for α from Model 4, Table A1,
with standard errors reported in parentheses. The p-values between columns and
rows are significance tests comparing α between conditions.
We observe that α increases significantly in c initially (c = 0 versus c = 10), with no further increase for c
= 40. We further observe that α decreases significantly in n, except for the stationary demand conditions
with c = 0.
The full model includes additional behavioral parameters besides α. While we view these additional
parameters primarily as statistical controls for behavioral effects that may otherwise be falsely attributed
to 𝛼, we briefly comment on three additional results. First, consider the anchoring parameter θL. Note that
in conditions 1 and 2, initial anchoring is functionally equivalent to the normative benchmark, and we see
evidence for such anchoring in the data. In the other conditions, initial anchoring is not the normative
benchmark, and we see that in conditions 3 and 5 (i.e., the low-noise conditions), no initial anchoring takes
place. We observe evidence for initial anchoring only in the high-noise conditions (4 and 6), leading to the
conclusion that this decision bias is visible in our data, but only if noise is high.
Next, consider the illusionary trend parameter β. In conditions 1, 3, 5, and 6, β is positive and
significant, indicating that in these conditions, respondents do tend to see illusionary trends. Interestingly,
there is little evidence for a significant β in conditions 2 and 4, indicating that the tendency to detect
illusionary trends tends to be less prevalent in high noise conditions. Finally, consider the effects of ΔFt in
Table A1. These effects are generally negative and significant. Since θL > α in all of our conditions, Eq.
(7) does not explain these negative effects. A possible post-hoc explanation for these negative effects may
lie in regret: Strong recent changes in a forecast are countered by later forecast adjustments in the
opposite direction. People may recognize their tendency to over-react (either due to system neglect or
illusionary trends), and actively counter this prior over-reaction in a current forecast.
4.6 Performance Implications
We now explore how the decision biases uncovered in the previous section impact forecasting
performance. Using the estimations from our generalized forecasting model, we can attribute loss in
forecasting performance to two classes of (mis)behavior: systematic decision biases such as mis-specified
error-response ($\alpha \neq \alpha^*(W)$), initial anchoring ($\theta_L \neq 1$), or illusionary trends ($\beta > 0$), and unsystematic
“trembling hands” random errors. To separate these two sources of performance loss, we calculate for
each demand seed s the forecast performance of three types of forecast evolutions: the normative, the
observed, and the “behaviorally predicted” forecasts. The normative forecast in period t of seed s is
defined by $F_{st}^* = F_{st}(D_{st-1}, F_{st-1}^* \,|\, \alpha^*(W))$, where $\alpha^*(W)$ is common to all demand seeds within a
demand environment. The observed forecast of subject i in period t of seed s is denoted $F_{sit}$. The
predicted forecasts are defined as $\hat{F}_{sit} = \hat{F}_{sit}(D_{st-1}, \hat{F}_{sit-1} \,|\, \hat{\Theta}_{sit})$, where $\hat{\Theta}_{sit}$ are the estimated parameters
of our generalized forecasting model, including fixed effects and best linear unbiased predictions of
random effects at the "dataset" level. The predicted forecasts $\hat{F}_{sit}$ are seed- and individual-specific
forecasts that, unlike the observed $F_{sit}$, were "filtered" through the structural estimation of the parameters
of our generalized forecasting model. These predictions were obtained using best linear unbiased
predictions in Stata (see Bates and Pinheiro 1998). We then measure performance for each of our six
demand conditions as the mean absolute forecast error, averaged across all I subjects i and all S seeds s
within that condition. Formally, for the observed forecasts, we define $MAE^o = \frac{1}{SIT} \sum_{s,i,t} |F_{sit} - D_{st}|$, and
equivalently for the normative ($MAE^n$) and predicted ($MAE^p$) forecasts.⁶ Using these definitions, we can
describe the total performance loss from observed forecasts as $(MAE^o - MAE^n)/MAE^n$. Importantly, we
can precisely capture the loss in forecasting performance due to systematic decision biases ("Loss 1") as
$(MAE^p - MAE^n)/MAE^n$, and the loss in forecasting performance due to unsystematic random noise in
decision making ("Loss 2") as $(MAE^o - MAE^p)/MAE^n$. Table 6 provides an overview of our analysis.
⁶ Because we cannot fit our generalized forecasting model for periods 30-32 due to insufficient forecasting history, and period 79 is the last period which results in an observed error, we use all forecasts made from periods 33-79.
Table 6: Mean Absolute Forecast Errors and Performance Loss

        MAE^n           MAE^p           MAE^o           Loss 1       Loss 2       Loss (Total)
        n=10   n=40     n=10   n=40     n=10   n=40     n=10  n=40   n=10  n=40   n=10  n=40
c=0     7.75   30.74    8.86   34.88    10.15  38.55    14%   13%    17%   12%    31%   25%
c=10    12.86  36.51    14.34  42.01    16.42  47.36    11%   15%    16%   15%    28%   30%
c=40    34.34  53.54    35.41  56.82    38.94  64.03    3%    6%     10%   13%    13%   20%

Notes. MAE^n = normative, MAE^p = predicted, MAE^o = observed. Loss 1 = (MAE^p - MAE^n)/MAE^n;
Loss 2 = (MAE^o - MAE^p)/MAE^n; Loss (Total) = (MAE^o - MAE^n)/MAE^n.
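As a concrete illustration of the decomposition, a minimal sketch applied to the condition 1 figures from Table 6:

    def loss_decomposition(mae_n, mae_p, mae_o):
        # Loss 1: systematic decision biases; Loss 2: unsystematic decision
        # noise; both expressed relative to the normative MAE.
        loss1 = (mae_p - mae_n) / mae_n
        loss2 = (mae_o - mae_p) / mae_n
        return loss1, loss2, loss1 + loss2

    # Condition 1 (c = 0, n = 10): 7.75, 8.86, 10.15 -> roughly 14%, 17%, 31%.
    print(loss_decomposition(7.75, 8.86, 10.15))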
As expected, we observe that mean absolute errors from observed forecasts exceed those from
normative forecasts. We also observe that MAEs increase in c and n, but this result has to be interpreted
with caution. Different environments produce different forecast performance due to their inherent
complexity. Forecasting performance relative to optimal performance improves in less stable
environments (Loss Total). Interestingly, we observe that in general, the loss in performance due to
random decision making (Loss 2) is as high as, or higher than, the loss of performance due to decision
biases (Loss 1). Counter to intuition, Loss 1 is lower in conditions with high change (c = 40) than in
conditions with little (c = 10) or no (c = 0) change. It seems that the decision heuristics individuals use to
make forecasts work better in unstable and changing environments, and become more biased in stable
environments.
One could make the argument that our performance comparison in Table 6 is unfair. Subjects did not
know the c and n of their experimental condition, and would have had to use their existing data to
estimate these parameters. Or, alternatively, forecasters may use an out-of-sample procedure, where they
use the existing data to find optimal smoothing parameters directly (instead of estimating c and n).
Therefore, we compared our normative MAEs to the MAEs obtained from an out-of-sample procedure: In
each time period for each dataset, we estimated an optimal α⁷ and created a forecast using that optimal α.
The MAE resulting from such out-of-sample forecasts was very close to the normative MAE reported in
Table 6.

⁷ As an optimality criterion, we used either the MAE or the MAPE, in addition to a more modern maximum likelihood approach (Hyndman, Koehler, Snyder and Grose 2002). We also used a maximum likelihood procedure that allows for simultaneous parameter estimation in the context of double exponential smoothing (Andrawis and Atiya 2009).
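One simple variant of such an out-of-sample procedure is sketched below: in each period, a grid search picks the α that minimizes the MAE over the history observed so far, and that α produces the next one-step forecast (a stand-in for the MAPE- and likelihood-based criteria in footnote 7):

    def best_alpha(history, f0, grid=None):
        # Grid-search the alpha minimizing in-sample MAE of one-step SES forecasts.
        grid = grid if grid is not None else [i / 100.0 for i in range(101)]
        def mae(alpha):
            f, total = f0, 0.0
            for d in history:
                total += abs(d - f)
                f += alpha * (d - f)
            return total / len(history)
        return min(grid, key=mae)

    def out_of_sample_forecasts(demand, f0, warmup=30):
        # Re-estimate alpha each period from all data observed so far,
        # then forecast the next period with the re-estimated alpha.
        forecasts = []
        for t in range(warmup, len(demand) + 1):
            alpha = best_alpha(demand[:t], f0)
            f = f0
            for d in demand[:t]:
                f += alpha * (d - f)
            forecasts.append(f)            # one-step forecast after t observations
        return forecasts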
5. Study 2 (Intervention)
The previous section documents how biased reaction patterns can lead to suboptimal forecasting
accuracy. In this section we ask how one could improve performance. Consider that a key observation
from our baseline study is a systematic pattern of over- and under-reaction to forecast errors. The main
idea explaining this system neglect pattern is that the sensory nature of a signal may partially override the
cognitive processes that determine how much weight should be attributed to the signal. This consistent
tendency to “neglect the system” lends itself to the design of a possible intervention to improve
forecasting performance. If one can re-emphasize the system that created a signal before a decision maker
comes up with a forecast, one might reduce this 'cognitive override' of the signal, and lower the bias
created by system-neglect. We therefore attempt to render signals less salient relative to the broader
information about the environment that produced this signal by asking subjects to sequentially prepare
forecasts for different demand time series. If subjects switch to a different time-series before making a
new forecast, they need to re-focus on the new series. We hypothesize that this process breaks the
saliency of observed forecast errors, and reduces system-neglect and the resulting over/under-reaction
pattern across different demand environments.
HYPOTHESIS 2 (MULTIPLE TASK STRUCTURE): A task design that requires repeated sequential
forecasts for multiple time-series reduces system-neglect patterns and improves performance.
5.1 Experimental Design
In each period, subjects observe demand for a product, make a forecast for the next period, and repeat this
task for all products in that period. As in our baseline study, we provide both a graph and a table that
display the history of demand realizations as well as information on forecast errors.
We use demand environments 4 (c = 10, n = 40) and 6 (c = 40, n = 40) from our baseline study, and
make the demand environment a between-subject factor. The two environments were specifically chosen
because they represent conditions with high and low Loss 1 in our baseline study (see Table 6),
while manipulating only c and leaving n constant. For 12 periods, subjects make one forecast for
each of the four demand datasets nested within the same demand environment. We again pay subjects
based on their mean absolute percentage forecasting error averaged across four products and 12 periods,
in addition to a participation fee of $5. Subjects earned, on average, $14.94. Thirty-four subjects
participated in demand environment 4, and forty subjects participated in environment 6. Subjects were
from the same subject pool and had not participated in the baseline study.
5.2 Analysis
To test Hypothesis 2, we need to establish that subjects overreact less to their forecast error in condition 4
in the 4-Product treatment when compared to the baseline 1-Product treatment, while they react similarly
in both treatments (i.e., close to the normative value) in condition 6. For this analysis, we re-estimate Eq.
(7) removing the random effects at the condition level while retaining random effects at the individual
level. We also add a control variable for decision number, since forecasts in the baseline treatment will
have been made earlier in the course of the experiment when compared to the 4-Product intervention.
Finally, we allow heteroskedasticity of regression errors between the 1-Product and 4-Product treatments.
A summary of the relevant parameter estimates and tests that examine whether these parameter estimates
differ between the 1-Product and 4-Product treatments is given in Table 7.
Table 7: Behavioral Model Parameter Estimates in Managerial Intervention

              Condition 4                        Condition 6
Coefficient   1-Product   4-Product   Δ          1-Product   4-Product   Δ
α*            .22         .22         -          .62         .62         -
α             .71         .52         p ≤ .05    .70         .70         p = .94
σ(ε)          29          24          p ≤ .01    26          29          p ≤ .01

Notes. ** p ≤ .01; * p ≤ .05; † p ≤ .10. The Δ columns provide p-values for the tests that
compare parameter estimates in the 4-Product treatment to those in the 1-Product treatment.
Tests for σ(ε) are LR tests for the decrease in model fit for a homoskedastic model.
As predicted, the parameter α in the 4-Product treatment is lower than in the 1-Product treatment within
condition 4. Decision makers overreact to their error in both cases, but less so in the 4-Product treatment.
In condition 6, as expected, α estimates are not different from each other between the two treatments. This
analysis supports Hypothesis 2.
Next we analyze the performance implications of the behavioral changes created by the intervention.
As a first test to compare the forecasting performance of subjects between the 4-Product and the 1-Product
(baseline) treatments, we calculated the mean absolute error (MAE) within each condition where both the
4-Product and 1-Product treatments were applied (conditions 4 and 6). To be more precise, in the 1-Product
treatment, the MAE was calculated over only periods 1-12, as only those periods existed in the 4-Product
treatment. A simple t-test comparing the 4-Product treatment MAE (= 45.47) to the 1-Product treatment
MAE (= 50.82) in condition 4 reveals a significantly lower MAE in the 4-Product treatment (Δ = 5.35, p
≤ .05). A similar comparison, however, in condition 6 reveals no significant difference between the two
groups (Δ = -3.73, p = .17). This finding is consistent with our expectations. Reducing system neglect
through multi-product forecasting increases forecasting performance in condition 4 and has little or no
effect in condition 6.
To test our predictions more precisely, we estimate a multi-level random effects model to predict the
absolute forecast error in each observation. To make an equivalent comparison, we add an additional
hierarchical level to our analysis, denoting x(s) to be the forecasting context x nested in dataset s. For
example, the first forecast made in dataset 1 in condition 4 is a different context than the second forecast
made in the same dataset. This random effect allows us to control for the randomness in absolute errors. We also
added a decision # variable to control for learning and fatigue effects, as the same instance is forecast at
different times in the game in the 4-Product treatment and in the 1-Product treatment. The results from
this analysis are summarized in Table 8.
Table 8: Differences in the Absolute Forecast Errors in the 4-Product vs. 1-Product Treatment

Dependent Variable: Absolute Forecast Error    Estimate    Standard Error
Condition 6                                    10.74       (8.55)
4-Product Treatment                            -7.96**     (2.61)
Condition 6 × 4-Product Treatment              8.04**      (2.24)
Decision #                                     .16         (.11)
Constant                                       49.09**     (6.10)
σ_s (random intercept, dataset)                4.34        (8.35)
σ_i (random intercept, context)                36.27**     (2.91)
N                                              4,047

Notes. ** p ≤ .01, * p ≤ .05, † p ≤ .10. Tests on σ are LR tests. Condition 6 is coded as a dummy
variable = 1 (0 if condition 4); 4-Product Treatment is a dummy variable = 1 (0 if 1-Product
treatment).
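A model of this form can be fit with standard mixed-model routines. A hedged sketch using Python's statsmodels (the original analysis was run in Stata; the column names are our placeholders):

    import pandas as pd
    import statsmodels.formula.api as smf

    def fit_error_model(df: pd.DataFrame):
        # Expects a long-format DataFrame with (placeholder) columns:
        # abs_error, cond6 (0/1), multi (0/1 for the 4-Product treatment),
        # decision_no, dataset, and context (forecasting context x nested in s).
        model = smf.mixedlm(
            "abs_error ~ cond6 * multi + decision_no",   # fixed effects
            data=df,
            groups=df["dataset"],                        # random intercept: dataset
            vc_formula={"context": "0 + C(context)"},    # nested context intercepts
        )
        return model.fit()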
These results support our prior predictions. Absolute forecast errors in the 4-Product treatment in condition 4 are almost 8 points lower than in the 1-Product treatment (p ≤ .01). The same is not true in condition 6, where the forecast errors are not statistically different (p = .97). This analysis provides further support for Hypothesis 2.
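A model of this form can be estimated with standard mixed-model software. The sketch below uses the MixedLM routine in Python's statsmodels, with a random intercept for the demand dataset and a variance component for the forecasting context nested within it; all column names (abs_error, cond6, product4, decision_num, dataset, context) are hypothetical labels, and the sketch approximates rather than reproduces our estimation procedure.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical data: one row per forecast, with the absolute error, treatment
    # dummies, a decision counter, and identifiers for the demand dataset and the
    # forecasting context nested within it (all column names are assumptions).
    df = pd.read_csv("forecasts.csv")

    # Random intercept for the demand dataset (grouping factor) plus a variance
    # component for context nested within dataset, mirroring the two random
    # intercepts reported in Table 8.
    model = smf.mixedlm(
        "abs_error ~ cond6 * product4 + decision_num",
        data=df,
        groups="dataset",
        re_formula="1",
        vc_formula={"context": "0 + C(context)"},
    )
    print(model.fit().summary())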
6. Conclusion
Our research investigates judgmental time-series forecasting in environments that can be precisely
described by their stability and noisiness. Behavior is to some degree consistent with the mechanics of
single-exponential smoothing, the normative benchmark in our context. However, subjects tend to
overreact to observed forecast errors for stable time series, and under-react to forecast error for less stable
time series. This pattern is consistent with the system-neglect hypothesis found in the regime-change
literature (cf. Massey and Wu 2005); our research provides empirical support for this hypothesis in a
“many small changes” time-series forecasting context, which is notably different from the “few big
changes” environments commonly investigated in the regime-change literature.
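To make the normative benchmark concrete, the following minimal sketch implements the single-exponential-smoothing updating rule; the demand series and the smoothing constant α = 0.3 are illustrative assumptions, not values from our experiment.

    import numpy as np

    def single_exponential_smoothing(demand, alpha, initial_forecast):
        """One-step-ahead forecasts: F_{t+1} = F_t + alpha * (D_t - F_t)."""
        forecasts = [initial_forecast]
        for d in demand:
            error = d - forecasts[-1]          # observed forecast error E_t
            forecasts.append(forecasts[-1] + alpha * error)
        return np.array(forecasts)

    # Illustrative demand series and smoothing constant (assumptions).
    demand = np.array([100, 104, 98, 110, 107, 95, 102])
    print(np.round(single_exponential_smoothing(demand, alpha=0.3,
                                                initial_forecast=100.0), 1))

In this benchmark, α governs how strongly each new forecast error is incorporated; the over-reaction we document in stable conditions corresponds to behaving as if α were larger than its optimal value α*.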
Surprisingly, our results show that decisions made in a stable environment suffer from stronger
systematic decision biases, compared to decisions made in less stable environments. Human judgment
appears to be more adapted to detecting change in volatile environments than to exploiting information in
stable environments. A human tendency to react to noise may simply be the result of an evolved decision
heuristic geared towards the detection of (and adaptation to) change. This points to managerial judgment being relatively more valuable in unstable environments; in stable environments, emphasis should be placed on automating decision making.
We also show that the decline in forecasting performance due to randomness is at least as strong as, if not stronger than, the decline in forecasting performance due to systematic biases. Since such randomness in decision making is mitigated by groups (i.e., multiple individuals preparing independent forecasts that are then averaged; Larrick and Soll 2006), this points to the large benefits that can be obtained from an effective group decision-making process in forecasting.
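As a stylized illustration of this mechanism, the short simulation below shows how averaging independent, noisy forecasts of the same demand reduces the mean absolute error; the noise level and group sizes are arbitrary assumptions chosen for the demonstration.

    import numpy as np

    rng = np.random.default_rng(42)
    true_demand = 100.0
    n_trials = 10_000

    for group_size in (1, 3, 5, 10):
        # Each forecast = true demand + independent judgment error
        # (a standard deviation of 20 is an arbitrary assumption).
        forecasts = true_demand + rng.normal(0.0, 20.0, size=(n_trials, group_size))
        group_forecast = forecasts.mean(axis=1)  # simple average across the group
        mae = np.abs(group_forecast - true_demand).mean()
        print(f"group of {group_size:2d}: MAE = {mae:.2f}")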
We test an experimental intervention designed to mitigate the systematic over-/under-reaction pattern. In particular, we required subjects to make sequential forecasts for multiple products in an effort to emphasize the environment (which is shared by all products) and de-emphasize the salience of each product's signal. This intervention is effective in our laboratory setting. From a theoretical perspective, this finding provides further evidence that the psychological process underlying the observed over- and under-reaction patterns is indeed related to the low relative salience of the system generating an observed forecast error. It also suggests that it is possible to "overspecialize" in forecasting. While specialization in forecasting may increase tacit domain knowledge about the market and product, it may also increase the influence of system neglect in decision making.
Our results relate to the growing literature on behavioral operations management. Specifically, experimental studies of simple newsvendor settings have documented a persistent tendency to chase demand in stationary environments (Schweitzer and Cachon 2000, Bolton and Katok 2008, Kremer et al. 2010). Our study suggests that this tendency may be a forecasting phenomenon and not exclusively related to inventory ordering. While subjects in newsvendor studies have perfect knowledge about the underlying demand-generating system, the system-neglect hypothesis suggests that the signals and feedback they observe during the course of the experiment will lead them to at least partially neglect that knowledge. We therefore conjecture that decomposing forecasting and ordering in newsvendor experiments may be a fruitful and important endeavor. Further, newsvendor studies assume that the use of a stationary and known demand environment makes the forecasting task simpler. Our results suggest that stable environments lead to more biased decision making. If subjects neglect their knowledge of the system and change forecasts based on signals, stable demand environments not only have little ecological validity (Brown and Steyvers 2009), they may also be environments that significantly decrease the performance of human judgment. Finally, subjects in most newsvendor and beer-game studies are confronted with demand stimuli in quick succession. Such a context makes demand signals highly salient. Our study suggests that decision makers may perform better when the relative salience of the most recent demand signal is mitigated, for example by re-emphasizing the environment before making the next decision. It would be interesting to test whether performance in newsvendor experiments can be improved by re-emphasizing the demand environment after each decision.
Our study has several limitations. Our intervention study was designed to emphasize the “demand-generating system” by asking subjects to forecast multiple time series, but the success of this intervention may stem merely from the time lag between successive forecasts of an individual time series. Future research could explore whether similar performance improvements can be achieved by occupying subjects for the same amount of time with an unrelated task. Additionally, while our analyses explicitly controlled for initial anchoring and illusionary trends, our study was not designed to explore these behaviors in detail. Future research should further explore these (or other) behavioral phenomena in demand forecasting. Finally, our forecasting context assumes that forecasters have no quantitative forecasting support available besides a graph and history table. In practice, many forecasts represent judgmental adjustments to an anchor provided by a quantitative forecasting technique (Fildes et al. 2009). Future research could more explicitly address the impact of such anchors.
Our research provides a solid theoretical and empirical framework for modeling human judgment in forecasting non-stationary time series. This rich context is relevant to many fields beyond operations management. For example, our framework may be useful for the study of overreaction and
illusionary trends in stock markets, or for examining how medical doctors interpret longitudinal data of
their patients, or perhaps as a window for understanding human reactions to climate change. We envision
these developments to be not only empirical but also theoretical in nature. Our research suggests a simple
and fairly generic way of formally capturing a persistent judgment bias and its relationship to parameters
describing a non-stationary environment. Our results could thus inform future work on how to design
information and incentive systems that are robust to the kinds of judgment biases we observe.
References
Adams, J. A. 1968. Response feedback and learning. Psychological Bulletin 70(6) 486-504.
Andrawis, R. R., A. F. Atiya. 2009. A new Bayesian formulation for Holt's exponential smoothing. J. Forecasting 28(3): 218-234.
Andreassen, P. B., S. J. Kraus. 1990. Judgmental extrapolation and the salience of change. J. Forecasting
9(4) 347-372.
Asparouhova, E., M. Hertzel, M. Lemmon. 2009. Inference from streaks in random outcomes:
Experimental evidence on beliefs in regime shifting and the law of small numbers. Management
Sci. 55(11) 1766-1782.
Barberis, N., A. Shleifer and R. Vishny. 1998. A model of investor sentiment. J. Financial Economics,
49: 307-343.
Barry, D. M., G. F. Pitz. 1979. Detection of change in nonstationary, random sequences. Organizational
Behavior and Human Performance 24 111-125.
Bates, D. M., J. C. Pinheiro. 1998. Computational methods for multilevel modeling. Technical
Memorandum BL0112140-980226-01TM. Murray Hill, NJ: Bell Labs, Lucent Technologies.
Bendoly, E., K. Donohue, K. L. Schultz. 2006. Behavior in operations management: Assessing recent
findings and revisiting old assumptions. J. Operations Management 24 737-752.
Bloomfield, R. and J. Hales. 2002. Predicting the next step of a random walk: experimental evidence of
regime-shifting beliefs. J. Financial Economics 65 397-414.
Bolger, F., N. Harvey. 1993. Context-sensitive heuristics in statistical reasoning. Quarterly J.
Experimental Psychology 46A(4) 779-811.
Bolton, G., E. Katok. 2008. Learning-by-doing in the newsvendor problem: A laboratory investigation of
the role of experience and feedback. Manufacturing & Service Operations Management, 10(3)
519-538.
Brav, A., J. B. Heaton. 2002. Competing theories of financial anomalies. Rev. Fin. Stud. 15(2) 575-606.
Brown, S. D., M. Steyvers. 2009. Detecting and predicting changes. Cognitive Psychology 58 49-67.
Camerer, C. F. 1995. Individual decision making. J. Kagel, A. Roth, eds. The Handbook of Experimental
Economics. Princeton University Press, Princeton, NJ.
Carbone, R., W. Gorr. 1985. Accuracy of judgmental forecasting of time series. Decision Sciences 16
153-160.
Chapman, G. B., E. J. Johnson. 2002. Incorporating the irrelevant: Anchors in judgments of belief and
value. T. Gilovich, D. Griffin, D. Kahneman, eds. Heuristics and Biases. Cambridge University
Press, Cambridge UK, 120-138.
Croson, R., K. Donohue. 2003. Impact of POS data sharing on supply chain management: An
experimental study. Production and Operations Management 12(1) 1-11.
Croson, R., K. Donohue, E. Katok, J. Sterman. 2009. Order stability in supply chains: Coordination risk
and the role of coordination stock. Working paper.
DeBondt, W. F. M. 1993. Betting on trends: Intuitive forecasts of financial risk and return. International
J. Forecasting 9 355-371.
Edwards, W. 1968. Conservatism in human information processing. B. Kleinmuntz, ed. Formal
Representation of Human Judgment. Wiley, NY, 17-52.
Epley, N., T. Gilovich. 2001. Putting adjustment back in the anchoring and adjustment heuristic. Psych.
Science, 12(5): 391-396.
Fildes, R., P. Goodwin, M. Lawrence, K. Nikolopoulos. 2009. Effective forecasting and judgmental adjustments: An empirical evaluation and strategies for improvement in supply-chain planning. International J. Forecasting 25 3-23.
Fischbacher, U. 2007. z-Tree: Zurich toolbox for ready-made economic experiments. Experimental
Economics 10(2) 171–178.
Gardner, E. S. 1985. Exponential smoothing: The state of the art. J. Forecasting 4(1) 1-28.
Gardner, E. S. 2006. Exponential smoothing: The state of the art– Part II. International J. Forecasting 22
637-666.
Gehring, W. J., B. Goss, M. G. H. Coles, D. E. Meyer, E. Donchin. 1993. A neural system for error
detection and compensation. Psychological Science 4(6) 385-390.
Griffin, D., A. Tversky. 1992. The weighing of evidence and the determinants of confidence. Cognitive
Psychology 24 411-435.
Harvey, N. 2007. Use of heuristics: Insights from forecasting research. Thinking & Reasoning 13(1) 5-24.
Harrison, P. J. 1967. Exponential smoothing and short-term sales forecasting. Management Sci. 13(11)
821-842.
Hyndman, R. J., A. B. Koehler, R. D. Snyder, S. Grose. 2002. A state space framework for automatic
forecasting using exponential smoothing methods. Int. J. Forecasting 18 439-454.
Kahneman, D., A. Tversky. 1972. Subjective probability: A judgment of representativeness. Cognitive
Psychology 3 430-454.
Kremer, M., S. Minner, L.N. Van Wassenhove. 2010. Do random errors explain newsvendor behavior?
Manufacturing & Service Operations Management. Forthcoming.
Larrick, R. P., J. B. Soll. 2006. Intuitions about combining opinions: Misappreciation of the averaging principle. Management Sci. 52(1) 111-127.
Lawrence, M. J., R. H. Edmundson, M. J. O'Connor. 1985. An examination of the accuracy of judgmental
extrapolation of time series. International Journal of Forecasting 1 25-35.
Lawrence, M., M. O'Connor. 1992. Exploring judgmental forecasting. International J. Forecasting 8 15-
26.
Lawrence, M., M. O'Connor. 1995. The anchor and adjustment heuristic in time-series forecasting. J.
Forecasting 14 443-451.
Lawrence, M., P. Goodwin, M. O'Connor, D. Önkal. 2006. Judgemental forecasting: A review of progress over the last 25 years. International J. Forecasting 22 493-518.
Lee, H. L., V. Padmanabhan, S. Whang. 1997. Information distortion in the supply chain: The bullwhip
effect. Management Sci. 43(4) 546-558.
Makridakis, S., S. Wheelwright, R. Hyndman. 1998. Forecasting: Methods and Applications. Wiley, New
York, NY.
Marathe, R. R., S. M. Ryan. 2005. On the validity of the geometric Brownian motion assumption. Engineering Economist 50 159-192.
Massey, C., G. Wu. 2005. Detecting regime shifts: The causes of under- and overreaction. Management Sci. 51(6) 932-947.
McNamara, J. M., A. I. Houston. 1987. Memory and the efficient use of information. J. Theoretical
Biology. 125 385-395.
Poteshman, A.M. 2001. Underreaction, overreaction, and increasing misreaction to information in the
options market. J. Finance, 56(3): 851-876.
Rabin, M. 2002. Inference by believers in the law of small numbers. Quarterly J. Economics, 117(3):
775-816.
Rabin, M. and D. Vayanos. 2009. The gambler's and hot-hand fallacies: Theory and applications. University of California working paper.
Rustichini, A. 2008. Neuroeconomics: Formal models of decision-making and cognitive neuroscience.
Glimcher, P. W., C. Camerer, R. Poldrack, E. Fehr, eds. Neuroeconomics. Elsevier, Holland, UK,
33-46.
Sanders, N. 1992. Accuracy of judgmental forecasts: A comparison. Omega 20(3) 353-364.
Sanders, N. 1997. The impact of task properties feedback on time series judgmental forecasting tasks.
Omega 25 135-144.
Sanders, N., K. B. Manrodt. 2003a. Forecasting software in practice: Use, satisfaction, and performance.
Interfaces 33(5) 90-93.
Sanders, N., K. B. Manrodt. 2003b. The efficacy of using judgmental versus quantitative forecasting
methods in practice. Omega 31 511-522.
Schweitzer, M. E., G. Cachon. 2000. Decision bias in the newsvendor problem with a known demand distribution: Experimental evidence. Management Sci. 46(3) 404-420.
Stone, E.R., R.B. Opel. 2000. Training to improve calibration and discrimination: The effects of
performance and environmental feedback. Organizational Behavior and Human Decision
Processes 83 282-309.
Su, X. 2008. Bounded rationality in newsvendor models. Manufacturing and Service Operations
Management 10(4) 566-589.
Tversky, A., D. Kahneman. 1971. Belief in the law of small numbers. Psychological Bulletin 76(2) 105-110.
Verbeek, M. 2000. A Guide to Modern Econometrics. Wiley, New York, NY.
Wiener, N. 1948. Cybernetics or Control and Communication in the Animal and the Machine. Wiley,
New York, NY.
Winters, P. R. 1960. Forecasting sales by exponentially weighted moving averages. Management Sci. 6(3) 324-342.
Appendix 1: Pre-test Information
Prior to the study presented in this paper, we completed a thorough pre-test of our experiment. The task, experimental parameters, software, and functionality were very similar to those of the baseline study reported here, with two exceptions: First, participants in the pre-test made decisions for only 40 consecutive periods, while the data presented here are based on 50 periods. Second, the students in the pre-test were given course extra credit for participating and were entered into a drawing for one cash reward per section. We conducted the same statistical tests on our pre-test data and found results that are directionally identical to those reported here. The pre-test was predominantly used to determine whether subjects should receive a graph of the time series, and whether providing qualitative information on the demand series (product with stable/unstable demand) influenced performance. The final design (subjects receive a graph but no qualitative information) corresponds to the pre-test setting in which subjects performed best.
Appendix 2: Econometric Specification and Estimation Details
Equation (7) provides a basis for the behavioral model we estimate in our analysis. An empirical problem with equation (7) is that we do not observe data on T_{t-1}. This could bias empirical results. To at least partially control for this potential bias, we propose to estimate Eq. (7) with the additional independent variables ΔD_{t-1} and ΔF_{t-1}, leading to the following empirical specification:

F_{t+1} = a_1 E_t + a_2 F_t + a_3 ΔD_t + a_4 ΔD_{t-1} + a_5 ΔF_t + a_6 ΔF_{t-1} + constant    (8)
Finally, as we will see in the analysis, the following (equivalent) specification of Eqn. (8) provides for
an easier comparison of nested models, and therefore serves as our primary empirical specification:
F_{t+1} = a_1 E_t + (a_2 + 1) F_t + a_3 ΔD_t + a_4 ΔD_{t-1} + a_5 ΔF_t + a_6 ΔF_{t-1} + constant    (9)
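As an illustration of how specification (9) could be estimated in its simplest, pooled form, the sketch below constructs the lagged regressors and fits the equation by ordinary least squares; the data file and column names are hypothetical assumptions, and our full analysis relies on the multilevel specification in Eq. (10) below rather than pooled OLS.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical long-format panel: one row per subject and period, holding
    # the submitted forecast F and the observed demand D (names are assumptions).
    df = pd.read_csv("forecast_panel.csv").sort_values(["subject", "period"])

    df["E"] = df["D"] - df["F"]                          # forecast error E_t
    df["dD"] = df.groupby("subject")["D"].diff()         # demand change ΔD_t
    df["dF"] = df.groupby("subject")["F"].diff()         # forecast change ΔF_t
    df["dD_lag"] = df.groupby("subject")["dD"].shift()   # ΔD_{t-1}
    df["dF_lag"] = df.groupby("subject")["dF"].shift()   # ΔF_{t-1}
    df["F_next"] = df.groupby("subject")["F"].shift(-1)  # next forecast F_{t+1}

    # Pooled OLS version of specification (9); the fitted coefficient on F
    # estimates (a_2 + 1), and rows lost to lagging are dropped.
    ols = smf.ols("F_next ~ E + F + dD + dD_lag + dF + dF_lag",
                  data=df.dropna()).fit()
    print(ols.params)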
In general, an observation at time t in the experiment is nested in subject i, who is nested in demand dataset s, which in turn is nested in experimental condition (i.e., demand environment) c. Since we estimate our model within each condition, this implies a three-level nested structure of error terms, such that we have random intercepts v_s and w_i. Further, we believe that the behavioral parameters of our model vary considerably, depending both on the actual dataset being observed and on the individual performing the forecast. This expectation would imply that a_1 through a_4 should be modeled as random coefficients. However, results from our pre-test show that, while there was some variance in a_1 and a_3, there was little variance in the other two coefficients. Estimating random-coefficient models in which the coefficients have little variance can lead to non-convergence and inappropriate standard errors. We therefore estimate only a_1 and a_3 as random slopes. This three-level random-effects model effectively controls for the dependence among observations in our dataset. In summary, we can write:
F_{t+1,(c,s,i)} = a_{1,si} E_t + (a_2 + 1) F_t + a_{3,si} ΔD_t + a_4 ΔD_{t-1} + a_5 ΔF_t + a_6 ΔF_{t-1} + con. + v_{s(c)} + w_{i(s,c)} + ε_t    (10)
All random coefficients are estimated as having a normal distribution. In our results, we use µ and σ to
refer to the mean and standard deviation of that distribution. For example, µ(Et) refers to the mean of the
random slope a1, whereas σi(Et) refers to the standard deviation of that slope at the individual level. The
behavioral parameters of Eq. (7) can then be calculated as follows: α = µ(E_t), θ_L = a_2 + 1, β = µ(ΔD_t)/µ(E_t), and C = −con./a_2.
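The parameter recovery implied by these formulas can be written as a small helper function; the coefficient values in the example call are hypothetical placeholders, not estimates from Table A1.

    def behavioral_parameters(mu_E, mu_dD, a2, constant):
        """Recover the behavioral parameters of Eq. (7) from estimated coefficients."""
        alpha = mu_E                # error-reaction weight: alpha = mu(E_t)
        theta_L = a2 + 1            # weight on the previous forecast level
        beta = mu_dD / mu_E         # relative weight on the demand-change signal
        C = -constant / a2          # implied long-run anchor level
        return alpha, theta_L, beta, C

    # Hypothetical coefficient values for illustration (not estimates from Table A1).
    alpha, theta_L, beta, C = behavioral_parameters(mu_E=0.60, mu_dD=0.15,
                                                    a2=-0.10, constant=62.0)
    print(f"alpha={alpha:.2f}, theta_L={theta_L:.2f}, beta={beta:.2f}, C={C:.1f}")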
Table A1: Results from Behavioral Estimation by Condition
Condition 1 Condition 2 Condition 3
Model 1 Model 2 Model 3 Model 4 Model 1 Model 2 Model 3 Model 4 Model 1 Model 2 Model 3 Model 4
μ(Et) .63** (.04) .49** (.03) .40** (.03) .39** (.04) .54** (.04) .45** (.04) .46** (.05) .48** (.05) .84** (.05) .85** (.04) .74** (.04) .68** (.04)
Ft -.39** (.02) -.37** (.03) -.30** (.03) -.33** (.02) -.31** (.02) -.25** (.03) -.01 (.01) -.01† (.01) -.01 (.01)
μ(ΔDt) .10** (.03) .11** (.03) -.01 (.04) -.03 (.04) .14** (.05) .19** (.05)
ΔDt-1 .00 (.01) .01 (.02) -.02† (.01) -.05** (.02) -.01 (.01) .02 (.02)
ΔFt -.08** (.02) -.03 (.02) -.08** (.02)
ΔFt-1 -.05** (.02) -.08** (.02) -.07** (.01)
μ(con.) 194** (11) 183** (13) 151** (15) 166** (11) 156** (12) 129** (15) 8.2† (4.4) 8.5* (4.2) 4.2 (3.9)
σs(Et) .00 (.00) .00 (.00) .00 (.00) .00 (.00) .00 (.00) .00 (.00) .05 (.06) .05 (.07) .07 (.05) .06 (.05) .00 (.00) .00 (.00)
σs(ΔDt) .00 (.00) .00 (.00) .04 (.05) .04 (.05) .06 (.05) .07 (.05)
σs(con.) .81 (.36) .76 (.33) .65 (.29) 4.0* (1.9) 3.5* (1.6) 3.2* (1.5) 1.2† (.63) 1.1† (.56) .88† (.49)
σi(Et) .22** (.03) .18** (.02) .18** (.03) .17** (.03) .26** (.03) .22** (.03) .22** (.03) .22** (.03) .20** (.03) .20** (.03) .19** (.03) .18** (.03)
σi(ΔDt) .13** (.02) .13** (.02) .14** (.03) .15** (.03) .19** (.03) .19** (.03)
σi(con.) .97 (.21) .77 (.21) .59 (.23) 6.2** (.96) 4.9** (.85) 4.7** (.86) 1.78 (.37) 1.3 (.39) 1.0 (.57)
N/Sub. 2,021 (43) 1,923 (41) 1,880 (40)
Δχ2 256.00** 63.93** 13.48** 250.61** 57.60** 26.74** 65.99** 70.28** 32.08**
Condition 4 Condition 5 Condition 6
Model 1 Model 2 Model 3 Model 4 Model 1 Model 2 Model 3 Model 4 Model 1 Model 2 Model 3 Model 4
μ(Et) .72** (.03) .67** (.03) .67** (.03) .60** (.04) .97** (.04) .98** (.05) .82** (.03) .70** (.03) .82** (.05) .80** (.05) .66** (.04) .56** (.04)
Ft -.16** (.02) -.15** (.02) -.10** (.02) .00 (.00) .00 (.00) .00 (.00) -.09** (.01) -.07** (.01) -.05** (.01)
μ(ΔDt) -.02 (.03) .04 (.03) .19** (.06) .31** (.07) .16* (.08) .25** (.08)
ΔDt-1 -.04** (.01) .01 (.02) -.03** (.01) .10** (.02) -.04** (.01) .06** (.02)
ΔFt -.14** (.02) -.14** (.02) -.14** (.02)
ΔFt-1 -.07** (.01) -.05** (.01) -.03** (.01)
μ(con.) 99** (9.5) 90 (9.9) 62** (11) 4.5† (2.6) 2.9 (2.4) 1.2 (2.2) 67** (6.4) 52** (6.1) 37** (6.3)
σs(Et) .00 (.00) .00 (.00) .02 (.10) .02 (.08) .07 (.03) .08 (.04) .00 (.00) .00 (.00) .07 (.05) .07 (.04) .00 (.00) .00 (.00)
σs(ΔDt) .03 (.05) .02 (.07) .12** (.05) .11** (.05) .14** (.06) .13** (.06)
σs(con.) 2.9* (1.4) 3.0* (1.4) 2.5* (1.2) 2.2 (1.5) 1.9 (1.3) 1.7 (1.1) 4.4† (2.1) 3.1† (1.6) 2.4† (1.3)
σi(Et) .18** (.02) .17** (.04) .15** (.03) .13** (.03) .12** (.02) .11** (.02) .11** (.02) .10** (.02) .21** (.03) .19** (.02) .20** (.03) .17** (.03)
σi(ΔDt) .10* (.03) .10* (.03) .10** (.02) .10** (.03) .18** (.03) .18** (.03)
σi(con.) 3.5** (.98) 3.6** (1.0) 2.4 (1.2) 5.3** (.81) 4.1** (.83) 3.1** (1.1) 6.9** (1.2) 4.9* (1.1) 3.4† (1.3)
N/Sub. 1,880 (40) 2,018 (43) 2,111 (45)
Δχ2 135.47** 15.37** 36.73** 95.81** 95.98** 47.93** 169.29** 160.60** 43.41**
Notes. ** p ≤ .01; * p ≤ .05; † p ≤ .10. µ(x) stands for the mean of random effect x; σi(x) stands for the standard deviation of random effect x at the individual level (s indicates the dataset level).