
The Limit of Rationality in Choice Modeling:

Formulation, Computation, and Implications

Srikanth Jagabathula, Stern School of Business, New York University, New York, NY 10012, [email protected]

Paat Rusmevichientong, Marshall School of Business, University of Southern California, Los Angeles, CA 90089, [email protected]

Customer preferences may not be rational, so we focus on quantifying the limit of rationality (LoR) in

choice modeling applications. We define LoR as the cost of approximating the observed choice fractions from

a collection of offer sets with those from the best-fitting probability distribution over rankings. Computing

LoR is intractable in the worst case. To deal with this challenge, we introduce two new concepts, rational

separation and choice graph, through which we reduce the problem to solving a dynamic program on the

choice graph and express the computational complexity in terms of the structural properties of the graph.

By exploiting the graph structure, we provide practical methods to compute LoR efficiently for a large class

of applications. We apply our methods to real-world grocery sales data and identify product categories for

which going beyond rational choice models is necessary to obtain an acceptable performance.

Key words: limit of rationality, choice modeling, rank aggregation

1. Introduction

Firms routinely collect inventory and sales data that provide information on product availability

and customer choices, respectively. Together, these data are used by firms to obtain the necessary

demand estimates for inventory, assortment, and pricing decisions. Traditionally, firms commonly fit independent

demand models, which assume that each product receives its own demand stream that is

unaffected by the availability of other products. However, this assumption introduces

biases in demand estimates when products are close substitutes and their availability changes over

time; customers may substitute an unavailable product with an available one, making product

demand a function of the entire offer set. For this reason, moving from independent to choice-based

demand models has been a trend both in the academic literature and in industry practice.

Among choice-based demand models, random utility maximization (RUM) models have, by far,

received the most attention. These models are based on the utility maximization principle and

assume that in each choice instance, a customer draws a random sample of utilities for the offered

products and chooses the product with the highest utility. RUM models generalize the classic

notion of a rational individual maximizing a utility function to a population of rational individuals,




each maximizing a (potentially) different utility function. A variation in the utilities may arise

because of preference heterogeneity across individuals or the stochastic elements within individual

preferences (McFadden 2005). When the number of alternatives is finite (Mas-Colell et al. 1995),

only the ordering of the products induced by the sampled utilities, not the actual utility values

themselves, matters in determining customer choices. As a result, an RUM model is equivalently

described by the distribution over preference lists induced by the model, and the choice probabilities

under the RUM model are stochastically rationalizable: the probability of choosing a product from

an offer set is equal to the sum of the probabilities associated with all the preference lists in which

the product is the most preferred among all the products in the offer set.

However, stochastic rationalizability imposes regularity conditions on the choice probabilities

that may not be satisfied in practice. For instance, stochastic rationalizability requires the

probability of choosing an alternative to never increase when more products become available (see footnote 1), but

this condition may be violated because individuals may not be perfectly rational; see, for example,

Tversky et al. (1990). The observed violations of rationality raise the following empirical questions:

(a) Given a population of individuals and a collection of offer sets, are the observed choice proba-

bilities (as computed from sales transactions) consistent with the RUM principle? (b) If not, what

is the degree of inconsistency?

Motivated by these questions, this paper focuses on quantifying the limit of stochastic rationality

(LoR), which is the degree to which the RUM principle is inconsistent with the given empirical

choice probabilities observed from a collection of offer sets. More formally, suppose $N = \{1, 2, \ldots, n\}$

is the universe of $n$ products, and we have collected choice observations for a collection $\mathcal{M}$ of

subsets of $N$. For each subset $S \in \mathcal{M}$, let $f_{i,S} \in [0,1]$ denote the fraction of customers who purchased

product $i$ when $S$ was offered. We then ask, how well does the RUM model explain the observed

choice probabilities $f_{\mathcal{M}} = (f_{i,S} : i \in S,\ S \in \mathcal{M})$?

We measure the degree of inconsistency by means of a non-negative, strictly convex loss function

$x_{\mathcal{M}} \mapsto \mathrm{loss}(f_{\mathcal{M}}, x_{\mathcal{M}})$ that measures the distance between the observed choice probabilities $f_{\mathcal{M}}$

and the choice probabilities $x_{\mathcal{M}}$ that are consistent with an RUM model. We suppose that the

loss function has the property that $\mathrm{loss}(f_{\mathcal{M}}, x_{\mathcal{M}}) = 0$ if and only if $f_{\mathcal{M}} = x_{\mathcal{M}}$; examples

include log-loss and general norm-loss functions (Examples 2.1 and 2.2). We define the limit of

(stochastic) rationality, or simply the limit of rationality, $\mathrm{LoR}(\mathcal{M})$ as the minimum distance

$\min_{x_{\mathcal{M}}} \mathrm{loss}(f_{\mathcal{M}}, x_{\mathcal{M}})$ over all choice probabilities $x_{\mathcal{M}}$ that are consistent with an RUM model. Note

that the observed choice data are stochastically rationalizable if and only if $\mathrm{LoR}(\mathcal{M}) = 0$.

Footnote 1: Block and Marschak (1960) and Falmagne (1978) gave the necessary and sufficient conditions for a system of choice

probabilities to be stochastically rationalizable.



This paper focuses on two key aspects related to the above optimization problem: (a) the com-

putability of LoR(M) for a given collection of observed choice probabilities fM and (b) the appli-

cation of LoR(M) for model selection to determine if a modeler should fit a more complex RUM

model or must go beyond the RUM class to obtain an acceptable performance.

Computing the LoR requires us to find the choice probabilities $x^*_{\mathcal{M}}$ that are consistent with an

RUM model and are closest to the observed probabilities $f_{\mathcal{M}}$. We formulate this problem as a

constrained convex program whose feasible region is a polytope with n! (n factorial) extreme points.

By applying the conditional-gradient algorithm, we reduce the problem of solving the constrained

convex program to solving a sequence of rank aggregation problems. Each rank aggregation problem

requires finding the ranked list $\sigma$ that minimizes a linear cost function: $\min_{\sigma} \sum_{S \in \mathcal{M}} \sum_{i \in S} c_{i,S} \cdot \mathbb{1}[\sigma, i, S]$,

where the cost coefficients $c_{i,S}$ are given, and $\mathbb{1}[\sigma, i, S]$ takes a value of 1 if $i$ is the most

preferred product in $S$ according to the ranked list $\sigma$ and 0 otherwise. The problem is known to

be NP-hard (Dwork et al. 2001). As the key theoretical contribution of this paper, we characterize

the computational complexity of the rank aggregation problem in terms of the structure of the

corresponding choice graph, a novel concept that we introduce in Definition 3.3. A choice graph

is defined for each collection M of subsets; every subset in M is a vertex in the choice graph,

and the edges capture the relationships of the most preferred products among the subsets in M.

We show that for commonly occurring set collections in the retail and revenue management (RM)

settings, the corresponding choice graphs have rich structures, leading to efficient solutions for rank

aggregation problems.

From a practical standpoint, our methods provide researchers and practitioners with a diagnostic

tool to quantify the impact of their rationality and parametric assumptions. In practice, one fits a

parametric choice model (such as the multinomial logit model) from the RUM class to the observed

choice data to predict the demand for various offer sets. This procedure is satisfactory only if the

researcher can assess the performance loss from his/her modeling assumptions and deems the loss

acceptable. Our methods allow the practitioner to decompose the loss as follows:

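\[
\underbrace{\text{Total loss}}_{\mathrm{loss}(f_{\mathcal{M}},\, y^*_{\mathcal{M}})}
\;=\; \underbrace{\text{Rationality loss}}_{\mathrm{LoR}(\mathcal{M})}
\;+\; \underbrace{\text{Parametric loss}}_{\mathrm{loss}(f_{\mathcal{M}},\, y^*_{\mathcal{M}}) \,-\, \mathrm{LoR}(\mathcal{M})},
\]

where $y^*_{\mathcal{M}}$ denotes the choice probability vector under the fitted parametric model used by the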

researcher. This decomposition provides the following immediate insights: If the rationality loss is

acceptable but the total loss is not, then the researcher will benefit from fitting a more complex

parametric model from the RUM class. If, on the other hand, the rationality loss is itself unaccept-

able, then the researcher must relax the rationality assumption and go beyond the RUM class.



1.1 Main Contributions

Our work makes the following three main contributions:

1. Computing the LoR. Our theoretical results express the computational complexity of the rank

aggregation problem in terms of the structure of what we call the choice graph, which captures

the rational separation structure among the subsets in $\mathcal{M}$ (see Section 3 for precise definitions).

By formulating the rank aggregation problem as both a dynamic program (DP) and a linear

program (LP) on the choice graph, we show the following (see Sections 4 and 5):

Choice graph                              DP complexity                      LP (# of vars. and constrs.)
Tree                                      O(n |M|) [Thm. 4.1]                O(n |M|) [Thm. 4.2]
Bdd. tree width                           exp. in tree width [Thm. 5.4]      exp. in tree width [Thm. 5.5]
Unbd. tree width but bdd. choice depth    exp. in choice depth [Thm. 5.8]    exp. in choice depth [Thm. 5.9]

The first column specifies the structure of the choice graph and whether tree width and choice

depth are bounded (bdd.) or unbounded (unbd.), the second column reports the complexity of

solving the DP, and the last column specifies the number of variables (vars.) and constraints

(constrs.) in the LP. Where specified, the scaling of the DP and LP is exponential (exp.) in the

tree width or choice depth, but polynomial in n and |M|.

2. Efficient computation of LoR for important set collections. Based on our theoretical results,

we derive the structural properties of the choice graphs for the following four set collections

that commonly arise in the retail and RM contexts, and show that the LoR can be computed

efficiently for these collections (see Section 3):

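Collection        Definition                                                                                        Choice graph    Compl.
Nested            $\{S_1, S_2, \ldots, S_L\}$ s.t. $S_\ell \subset S_{\ell+1}$ for all $\ell$                       line            linear
Laminar           Either $A \subseteq B$, or $B \subseteq A$, or $A \cap B = \emptyset$ $\forall A, B \in \mathcal{M}$    tree            linear
Differentiated    $\{A_0\} \cup \{A_0 \cup B_\ell : 1 \le \ell \le L\}$, $B_\ell \cap B_{\ell'} = \emptyset$ $\forall \ell \neq \ell'$    star            linear
k-deletion        $\{N \setminus A : |A| \le k\}$                                                                   cd $\le k$      $O(n^k)$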

The last column, “Compl.,” is the computational complexity of solving the rank aggregation

problem, and cd denotes the choice depth of the graph. The nested collection arises in RM con-

texts when airlines adopt the nested fare policy. The laminar collection arises when customers

use the elimination-by-aspects (EBA) screening process to form a consideration set before mak-

ing the choice. The differentiated collection arises when a firm carries a common collection of

products $A_0$ in all stores, along with a unique subset $B_\ell$ to differentiate each store $\ell$ from others.

Finally, the k-deletion collection arises when a firm faces frequent replenishment and stock-outs

so that, at any point, at most k < n products are stocked out.



3. Empirical findings and practical insights. We applied our methods to the grocery sales trans-

actions data on 29 product categories from the popular IRI Academic Dataset (Bronnenberg

et al. 2008). For each category, we computed the corresponding total, rationality, and paramet-

ric losses under the multinomial logit (MNL) model. We obtained the following insights: (a)

most of the total loss is from the rationality loss (supporting existing work for going beyond the

RUM class; see Figure 2); (b) rationality loss provides a model selection tool for determining the

number of latent classes or when to go outside the RUM class (Figure 3); (c) fitting a complex

model within the RUM class, such as the latent-class MNL model, reduces the parametric loss

but not the rationality loss, whereas fitting a model outside the RUM class, such as the latent-

class generalized attraction model (Gallego et al. 2014) can reduce the rationality loss; and (d)

rationality loss is correlated with the market concentration of a category, suggesting that when

market shares are split about evenly across the offered brands, customers tend to be variety

seeking, switching their purchases from one week to the next, thereby resulting in nontransitive

preferences and high rationality loss.

1.2 Literature Review

Our work is related to the following three streams of literature: (a) rationalization in econometrics,

(b) fitting nonparametric choice models in operations, and (c) rank aggregation in machine learning.

The question of whether a collection of observed choice probabilities is rationalizable is a classic

question that has received significant attention in economics. The majority of the existing work

focuses on deriving the necessary and sufficient conditions to answer the binary yes/no question of

whether the given data are stochastically rationalizable. Falmagne (1978) shows that a system of

choice probabilities defined over all possible subsets is stochastically rationalizable if and only if all

the Block–Marschak polynomials are non-negative; see also Barbera and Pattanaik (1986). McFad-

den and Richter (1990) show that under certain regularity conditions, the axiom of the revealed

stochastic preference provides a necessary and sufficient condition for a collection of observed choice

probabilities to be stochastically rationalizable. McFadden (2005) provides the additional necessary

and sufficient conditions in the form of systems of linear inequalities, and shows how the differ-

ent conditions relate to one another. However, existing studies do not address the computability

question of how to efficiently check the conditions when the number of products is large. To the

best of our knowledge, this is the first paper to exhibit interesting collections of subsets for which

stochastic rationality can be verified efficiently.

Our work is related to the literature on estimating nonparametric choice models pioneered by

Jagabathula (2011) and Farias et al. (2013). This body of work has primarily focused on estimating



a probability mass function λ(·) from transaction data for the purpose of making assortment or

pricing decisions (Jagabathula and Rusmevichientong 2017). It reduces the estimation problem to a

sequence of rank aggregation problems and develops heuristics for obtaining good approximations

that can be used in operations management decisions. For example, van Ryzin and Vulcano (2015)

recently developed an integer programming heuristic to solve the rank aggregation problem. By

contrast, our work is the first to study the problem of identifying the source of computational com-

plexity in solving the rank aggregation problem. By introducing the concepts of rational separation

and choice graph, we characterize computational complexity in terms of what we call the choice

depth of the corresponding choice graph. This characterization provides a new understanding of

what drives the computational difficulty in solving rank aggregation problems. Using our charac-

terization, we show that broad classes of offer set collections that commonly occur in practice have

a bounded choice depth and, therefore, allow efficient solutions for the rank aggregation problem.

Our work is also related to the extensive literature within the machine learning community on

solving the rank aggregation problem. The majority of such studies focus on finding a single rank-

ing that minimizes some metric. The most common metric is the total weight associated with each

unmatched pair, and the resulting problem has been extensively studied as the Kemeny optimization

problem; see Kenyon-Mathieu and Schudy (2007) and Ali and Meilă (2012). These studies focus

on the types of observations collected on the web, which comprise choice observations from the set

collection consisting of all pairs of products. Instead, we focus on choice observations from gen-

eral set collections, which arise frequently in RM, supply chain, and operational applications, and

identify new offer set collections for which the rank aggregation problem can be solved efficiently.

2. Problem Formulation

We now formally introduce the setup for our problem. As above, let $N = \{1, 2, \ldots, n\}$ be the

universe of products, and we assume that $N$ already includes the no-purchase or outside option.

Let $\mathcal{M}$ denote a collection of subsets of $N$, corresponding to the assortments of products offered

to customers. For each subset $S \in \mathcal{M}$, we observe the fraction $f_{i,S} \in [0,1]$ of customers who chose

product $i \in S$ when offered a subset $S$, with $\sum_{i \in S} f_{i,S} = 1$. We let $f_{\mathcal{M}} = (f_{i,S} : i \in S,\ S \in \mathcal{M})$. Our

objective is to check if the observed choice probabilities are consistent with the RUM assumptions,

and if not, to determine the degree of inconsistency. To measure the degree of inconsistency, we

find the rational choice probabilities that are closest to the observed data. More formally, let $P_n$

denote the set of all permutations (or linear preference orders) of $n$ products. Each element $\sigma \in P_n$

is a ranking of $n$ products, and for all $i \in N$, $\sigma(i)$ denotes the rank of product $i$. We assume that

if $\sigma(i) < \sigma(j)$, then product $i$ is preferred over product $j$.



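For any distribution over rankings $\lambda : P_n \to [0,1]$ such that $\sum_{\sigma \in P_n} \lambda(\sigma) = 1$, let

$x^{\lambda}_{\mathcal{M}} = (x^{\lambda}_{i,S} : i \in S,\ S \in \mathcal{M})$, where $x^{\lambda}_{i,S} = \sum_{\sigma \in P_n} \mathbb{1}[\sigma, i, S] \cdot \lambda(\sigma)$

and $\mathbb{1}[\sigma, i, S]$ is the indicator variable that

takes a value of 1 if and only if product $i$ is the most preferred product in $S$ under $\sigma$; that is,

$\mathbb{1}[\sigma, i, S] = 1$ if and only if $\sigma(i) < \sigma(j)$ for all $j \in S$, $j \neq i$.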
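The mapping from a distribution over rankings to choice probabilities can be sketched in a few lines of Python. This is only an illustration: the function names, the representation of a ranking as a tuple (most preferred first), and the toy distribution `lam` are assumptions made here, not part of the paper.

```python
def top_choice(ranking, offer_set):
    """Return the most preferred product of offer_set under ranking.

    ranking is a tuple of products listed from most to least preferred.
    """
    position = {product: rank for rank, product in enumerate(ranking)}
    return min(offer_set, key=position.__getitem__)


def choice_probabilities(lam, collection):
    """x[(i, S)] = total probability of the rankings whose top choice in S is i."""
    x = {(i, S): 0.0 for S in collection for i in S}
    for ranking, prob in lam.items():
        for S in collection:
            x[(top_choice(ranking, S), S)] += prob
    return x


# Hypothetical distribution over three rankings of products 1, 2, 3.
lam = {(1, 2, 3): 0.5, (2, 3, 1): 0.3, (3, 1, 2): 0.2}
collection = [frozenset({1, 2}), frozenset({1, 2, 3})]
x = choice_probabilities(lam, collection)
```

Any vector produced this way is stochastically rationalizable by construction; for instance, product 1 is chosen from the smaller set at least as often as from the larger one.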

The best-fitting distribution $\lambda(\cdot)$ is the one whose corresponding choice probabilities $x^{\lambda}_{\mathcal{M}}$ are

closest to the observed data. We measure the distances through a loss

function $\mathrm{loss}(f_{\mathcal{M}}, x^{\lambda}_{\mathcal{M}})$ with the property that, for each $f_{\mathcal{M}}$, the function $x_{\mathcal{M}} \mapsto \mathrm{loss}(f_{\mathcal{M}}, x_{\mathcal{M}})$

is non-negative, strictly convex in $x_{\mathcal{M}}$, and takes a value of 0 if and only if $x_{\mathcal{M}} = f_{\mathcal{M}}$. Common

examples of loss functions that satisfy these properties include the following:

Example 2.1 (Log-likelihood/Kullback-Leibler (KL) divergence loss function) Here,

$\mathrm{loss}(f_{\mathcal{M}}, x_{\mathcal{M}}) = -\sum_{S \in \mathcal{M}} M_S \sum_{i \in S} f_{i,S} \log(x_{i,S}/f_{i,S})$, where $M_S > 0$ is the weight associated

with set $S \in \mathcal{M}$, representing the number of customers who were shown the assortment $S$. The

non-negativity of the loss function follows because $-\sum_{i \in S} f_{i,S} \log(x_{i,S}/f_{i,S})$ is the KL divergence

between the distributions $(f_{i,S} : i \in S)$ and $(x_{i,S} : i \in S)$. Therefore, the loss function is a weighted

sum of KL divergences, so $\mathrm{loss}(f_{\mathcal{M}}, x_{\mathcal{M}}) = 0$ if and only if $x_{i,S} = f_{i,S}$ for all $i \in S$ and $S \in \mathcal{M}$. Strict

convexity follows from the strict concavity property of the logarithm function. Under this loss

function, the best-fitting distribution $\lambda$ is indeed the maximum likelihood estimate (MLE) because

\[
\max_{\lambda} \log\Big( \prod_{S \in \mathcal{M}} \prod_{i \in S} \big(x^{\lambda}_{i,S}\big)^{M_S f_{i,S}} \Big)
\iff \max_{\lambda} \sum_{S \in \mathcal{M}} M_S \sum_{i \in S} f_{i,S} \log x^{\lambda}_{i,S}
\iff \min_{\lambda} \mathrm{loss}(f_{\mathcal{M}}, x^{\lambda}_{\mathcal{M}}).
\]

Example 2.2 (Squared Norm loss function) The squared norm loss function is defined as

$\mathrm{loss}(f_{\mathcal{M}}, x_{\mathcal{M}}) = \|f_{\mathcal{M}} - x_{\mathcal{M}}\|^2$, where the Euclidean norm $\|\cdot\|$ is defined on $\mathbb{R}^{\sum_{S \in \mathcal{M}} |S|}$. With the use

of the properties of the norm function, it may be verified that the squared norm loss function is

strictly convex in $x_{\mathcal{M}}$ for any fixed $f_{\mathcal{M}}$ and takes a value of 0 if and only if $x_{\mathcal{M}} = f_{\mathcal{M}}$.
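The two loss functions of Examples 2.1 and 2.2 can be written down directly. The sketch below assumes, purely for illustration, that choice probabilities are stored in dictionaries keyed by `(i, S)` and that the weights $M_S$ are stored in a dictionary keyed by `S`; the numbers are made up.

```python
import math


def kl_loss(f, x, weights):
    """Weighted KL loss: -sum_S M_S * sum_i f[i,S] * log(x[i,S] / f[i,S]).

    Terms with f[i,S] == 0 are skipped (they contribute zero by convention).
    math.log raises ValueError if x[i,S] == 0 while f[i,S] > 0.
    """
    total = 0.0
    for (i, S), f_is in f.items():
        if f_is > 0:
            total -= weights[S] * f_is * math.log(x[(i, S)] / f_is)
    return total


def squared_norm_loss(f, x):
    """Squared Euclidean distance between the two choice-probability vectors."""
    return sum((f[key] - x[key]) ** 2 for key in f)


# Hypothetical data for a single offer set shown to 100 customers.
S = frozenset({1, 2})
f_obs = {(1, S): 0.6, (2, S): 0.4}
x_fit = {(1, S): 0.5, (2, S): 0.5}
weights = {S: 100}

kl_value = kl_loss(f_obs, x_fit, weights)
sq_value = squared_norm_loss(f_obs, x_fit)
```

Both functions return zero when the fitted vector equals the observed one and a positive value otherwise, which is the property the formulation relies on.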

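We define the limit of rationality $\mathrm{LoR}(\mathcal{M})$ associated with the collection $\mathcal{M}$ as the minimum loss

that we incur when we restrict our attention to the RUM class:

\[
\mathrm{LoR}(\mathcal{M}) \;\equiv\; \min\Big\{ \mathrm{loss}(f_{\mathcal{M}}, x^{\lambda}_{\mathcal{M}}) \;\Big|\; \lambda : P_n \to [0,1],\ \sum_{\sigma \in P_n} \lambda(\sigma) = 1 \Big\}. \tag{1}
\]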

Indeed, if the observed choice probabilities are stochastically rationalizable, it follows that (1) has

an optimal objective value of zero, resulting in LoR(M) = 0, as desired.



Our solution approach: Equation (1) is a convex minimization problem over a convex feasible

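region because the set of rational choice probabilities can be described as a polyhedron, as follows:

\[
\Big\{ x_{\mathcal{M}} \;\Big|\; x_{i,S} = \sum_{\sigma \in P_n} \mathbb{1}[\sigma, i, S]\, \lambda(\sigma) \ \ \forall\, S \in \mathcal{M},\ i \in S, \quad \sum_{\sigma \in P_n} \lambda(\sigma) = 1, \quad \lambda(\sigma) \ge 0 \ \ \forall\, \sigma \in P_n \Big\}.
\]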

In theory, the optimization problem in Equation (1) can be solved using standard methods in convex

optimization theory. The challenge, however, is the curse of dimensionality: the feasible region is

described by n! (n factorial) variables, with one variable λ(σ) for each ranking σ ∈Pn; this is

intractable even for small values of n. We deal with the high dimensionality by applying the Frank–

Wolfe (FW) method (Frank and Wolfe 1956), also called the conditional gradient method, which

has become increasingly popular in the machine learning community as the method of choice for

solving large-scale convex optimization problems (Jaggi 2013). The FW method has many variants,

and the details of our implementation are deferred to Section 6. At its core, the FW algorithm is

an iterative method that starts with an initial feasible solution and generates a sequence of feasible

solutions that are guaranteed to converge to the optimal solution. In each iteration, the algorithm

solves the following linear program:

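\[
\min\Big\{ \sum_{S \in \mathcal{M}} \sum_{i \in S} c_{i,S}\, x_{i,S} \;\Big|\; x_{i,S} = \sum_{\sigma \in P_n} \mathbb{1}[\sigma, i, S]\, \lambda(\sigma) \ \ \forall\, i \in S,\ S \in \mathcal{M}, \quad \sum_{\sigma \in P_n} \lambda(\sigma) = 1, \quad \lambda(\sigma) \ge 0 \ \ \forall\, \sigma \in P_n \Big\}
\tag{Rank Aggregation LP}
\]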

where the coefficient $c_{i,S}$ is the partial derivative of the loss function with respect to $x_{i,S}$, evaluated at the current iterate, so that

the linear objective is the first-order Taylor series approximation of the objective function (up to a constant). The FW method

computes the new iterate by taking a convex combination of the current iterate and the solution

to the above linear program by using an appropriate convex combination weight. Thus, the FW

method converts the original convex optimization problem into a series of linear programs. The

rate at which the FW method converges depends on the particular variant of the algorithm used

and the properties of the objective function and the constraint set. Establishing the convergence

rates of the FW method is an active research area, and existing work has shown that the FW

method converges relatively quickly in a wide range of settings; see, for example, Lacoste-Julien

and Jaggi (2015) and the references therein.
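The iteration described above can be sketched for very small instances as follows. This is a minimal illustration under stated assumptions, not the implementation of Section 6: it uses the squared norm loss, the textbook step size 2/(t+2), and a linear minimization oracle that simply enumerates all n! rankings, which is only workable for a handful of products. All names and the toy data are hypothetical.

```python
from itertools import permutations


def vertex(ranking, collection):
    """Binary vector of indicators 1[sigma, i, S], keyed by (i, S)."""
    pos = {p: r for r, p in enumerate(ranking)}
    v = {(i, S): 0.0 for S in collection for i in S}
    for S in collection:
        v[(min(S, key=pos.__getitem__), S)] = 1.0
    return v


def frank_wolfe_lor(f, collection, products, iterations=2000):
    """Approximate LoR under squared norm loss; returns (loss, x)."""
    vertices = [vertex(r, collection) for r in permutations(products)]
    x = dict(vertices[0])  # any extreme point is a feasible start
    for t in range(iterations):
        # Gradient of ||f - x||^2 with respect to x gives the costs c[i, S].
        grad = {k: 2.0 * (x[k] - f[k]) for k in x}
        # Oracle: brute-force rank aggregation over all rankings.
        best = min(vertices, key=lambda v: sum(grad[k] * v[k] for k in v))
        step = 2.0 / (t + 2.0)
        x = {k: (1.0 - step) * x[k] + step * best[k] for k in x}
    return sum((f[k] - x[k]) ** 2 for k in f), x


# Hypothetical observations: product 1 is chosen MORE often from the larger
# set than from the smaller one, so no distribution over rankings fits exactly.
S12, S123 = frozenset({1, 2}), frozenset({1, 2, 3})
f = {(1, S12): 0.3, (2, S12): 0.7,
     (1, S123): 0.5, (2, S123): 0.3, (3, S123): 0.2}
lor, x_star = frank_wolfe_lor(f, [S12, S123], products=(1, 2, 3))
```

Because every iterate is a convex combination of extreme points, the returned loss is always an upper bound on LoR; on this toy input it stays strictly positive because the data violate the regularity condition.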

Given the above discussion, we introduce the following definition that characterizes the com-

plexity of computing the LoR:

Definition 2.3 (Rational Complexity) The computational complexity associated with LoR(M)

is the complexity of solving the Rank Aggregation LP. We say that LoR(M) can be computed

efficiently if the Rank Aggregation LP can be solved in time that is polynomial in n and |M|.



Nesterov (2013) and Bubeck (2015) show that if the Rank Aggregation LP can be solved effi-

ciently, then the original convex optimization problem can also be solved efficiently. However, we

note that the above Rank Aggregation LP still has n! (n factorial) variables. Lemma 2.5 shows

that the problem is equivalent to the combinatorial optimization problem of finding the ranking

that minimizes the total cost across all subsets in M, which, in turn, is equivalent to finding the

most preferred product in each set that is consistent across the entire collection and minimizes the

total cost. It also shows that the problem is indeed NP-hard. Before we can state the lemma, we

introduce the following notation, which will be used frequently throughout the paper:

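Definition 2.4 (Consistent Rankings) For any set $S$ and $z_S \in S$, let $D_S(z_S)$ be the set of rankings

that make $z_S$ the most preferred product in $S$; i.e., $D_S(z_S) = \{\sigma : \sigma(z_S) < \sigma(i)\ \forall\, i \in S,\ i \neq z_S\}$.

Similarly, for any collection of subsets $\mathcal{A}$, let $z_{\mathcal{A}} = (z_S \in S : S \in \mathcal{A})$ and $D_{\mathcal{A}}(z_{\mathcal{A}}) = \bigcap_{S \in \mathcal{A}} D_S(z_S)$.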

It follows from the above definition that if $D_{\mathcal{M}}(z_{\mathcal{M}}) \neq \emptyset$, then $z_{\mathcal{M}}$ represents consistent top-

ranked products. In other words, there exists a ranking $\sigma$ of all products such that for all $S \in \mathcal{M}$,

$z_S$ is the most preferred product in $S$ under $\sigma$. The next lemma shows that the Rank Aggregation LP

reduces to finding consistent top-ranked products for $\mathcal{M}$ that minimize the cost.
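Whether a given vector of top choices is consistent can be checked without enumerating rankings. The sketch below relies on a standard fact that is not stated in the text: the choices admit a common ranking exactly when the "is preferred to" relation they induce (each chosen product ahead of every other product in its set) has no directed cycle, in which case a topological order is a witness. It uses `graphlib` from the Python standard library (Python 3.9+); the function name and the dictionary representation are illustrative assumptions.

```python
from graphlib import CycleError, TopologicalSorter


def consistent_ranking(choices):
    """Return a ranking (most preferred first) under which every offer set S
    in `choices` has top product choices[S], or None if no such ranking exists.

    choices maps each offer set (a frozenset of products) to its chosen product.
    """
    products = set().union(*choices)
    # predecessors[j] = products that must be ranked ahead of j.
    predecessors = {p: set() for p in products}
    for S, z in choices.items():
        for j in S:
            if j != z:
                predecessors[j].add(z)
    try:
        return tuple(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return None


# Hypothetical choices: 2 is chosen from both offer sets.
ok = consistent_ranking({frozenset({1, 2, 3}): 2, frozenset({2, 3, 4}): 2})
# Hypothetical choices: 2 beats 3 in one set but 3 beats 2 in the other.
bad = consistent_ranking({frozenset({1, 2, 3}): 2, frozenset({2, 3, 4}): 3})
```

The check itself is cheap; the difficulty in the rank aggregation problem comes from searching over the exponentially many candidate choice vectors, not from testing any single one.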

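Lemma 2.5 (Three Equivalent Characterizations) Let the polytope $M$ be defined by

\[
M \;\equiv\; \text{Convex hull of the set } \Big\{ \big(\mathbb{1}[\sigma, i, S] : i \in S,\ S \in \mathcal{M}\big) \in \{0,1\}^{\sum_{S \in \mathcal{M}} |S|} \;:\; \sigma \in P_n \Big\}.
\]

The Rank Aggregation LP is equivalent to the following three optimization problems:

\[
\min_{x_{\mathcal{M}} \in M} \sum_{S \in \mathcal{M}} \sum_{i \in S} c_{i,S}\, x_{i,S}
\;=\; \min_{\sigma \in P_n} \sum_{S \in \mathcal{M}} \sum_{i \in S} c_{i,S}\, \mathbb{1}[\sigma, i, S]
\;=\; \min_{z_{\mathcal{M}} \,:\, D_{\mathcal{M}}(z_{\mathcal{M}}) \neq \emptyset} \sum_{S \in \mathcal{M}} c_{z_S, S},
\]

and the Rank Aggregation LP is NP-hard, even for the collection of all pairs $\text{Pairs} = \{\{i, j\} : i \neq j\}$.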

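Proof: By definition of $M$, it follows that the Rank Aggregation LP is equivalent to

$\min_{x_{\mathcal{M}} \in M} \sum_{S \in \mathcal{M}} \sum_{i \in S} c_{i,S}\, x_{i,S}$. By the standard result in linear programming, the optimal solution

occurs at an extreme point of $M$, corresponding to a binary vector $(\mathbb{1}[\sigma, i, S] : i \in S,\ S \in \mathcal{M})$ for

some $\sigma \in P_n$. Therefore, the problem is also equivalent to $\min_{\sigma \in P_n} \sum_{S \in \mathcal{M}} \sum_{i \in S} c_{i,S}\, \mathbb{1}[\sigma, i, S]$. The

final equivalence follows because for each $\sigma \in P_n$ and $S \in \mathcal{M}$, $\sum_{i \in S} \mathbb{1}[\sigma, i, S] \cdot c_{i,S} = c_{z_S, S}$, where $z_S$ is the most

preferred product in $S$ under $\sigma$. A vector $z_{\mathcal{M}} = (z_S : S \in \mathcal{M})$ is feasible if $D_{\mathcal{M}}(z_{\mathcal{M}}) = \bigcap_{S \in \mathcal{M}} D_S(z_S) \neq \emptyset$;

therefore, the problem $\min_{\sigma \in P_n} \sum_{S \in \mathcal{M}} \sum_{i \in S} c_{i,S}\, \mathbb{1}[\sigma, i, S]$ is equivalent to the following problem:

\[
\min\Big\{ \sum_{S \in \mathcal{M}} c_{z_S, S} \;\Big|\; z_S \in S \ \ \forall\, S \in \mathcal{M} \ \text{ and } \bigcap_{S \in \mathcal{M}} D_S(z_S) \neq \emptyset \Big\}
\;=\; \min_{z_{\mathcal{M}} \,:\, D_{\mathcal{M}}(z_{\mathcal{M}}) \neq \emptyset} \sum_{S \in \mathcal{M}} c_{z_S, S}.
\]

The NP-hardness result follows immediately from the result of Dwork et al. (2001).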



It is clear from Lemma 2.5 that constructing the optimal ranking is equivalent to determining

the choice z∗S for each subset S ∈M under the optimal ranking that minimizes the total cost.

Naturally, we cannot determine the optimal choices for the subsets separately because the resulting

choices may violate the transitivity of preferences; for instance, choosing 2 from $\{1,2,3\}$ and 3 from

$\{2,3,4\}$ would imply that product 2 is preferred over product 3 and product 3 over product 2, violating the transitivity of

the preferences. As a result, the optimal choices across subsets must be determined jointly to ensure

that the transitivity of preferences is satisfied; therefore, searching for the top-ranked product in

each set is challenging, and as noted in Lemma 2.5, the corresponding optimization problem is

NP-hard even for the collection just comprising all pairs of products.
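The gap between choosing per subset and choosing jointly can be seen on a two-set instance. The sketch below solves the rank aggregation problem by brute force over all rankings (practical only for a few products) and compares the result with the sum of per-set minima; the cost values and all names are assumptions chosen to reproduce the kind of conflict described above.

```python
from itertools import permutations


def rank_aggregation_brute_force(costs, products):
    """Minimize sum_S costs[(z_S, S)] over all rankings of `products`.

    costs is keyed by (i, S); returns (min_cost, ranking, choices).
    """
    collection = {S for (_, S) in costs}
    best = None
    for ranking in permutations(products):
        pos = {p: r for r, p in enumerate(ranking)}
        choices = {S: min(S, key=pos.__getitem__) for S in collection}
        total = sum(costs[(z, S)] for S, z in choices.items())
        if best is None or total < best[0]:
            best = (total, ranking, choices)
    return best


A, B = frozenset({1, 2, 3}), frozenset({2, 3, 4})
# Hypothetical costs: picking 2 from A is free, picking 3 from B is free.
costs = {(i, A): (0 if i == 2 else 1) for i in A}
costs.update({(i, B): (0 if i == 3 else 1) for i in B})

total, ranking, choices = rank_aggregation_brute_force(costs, (1, 2, 3, 4))
separate_bound = sum(min(costs[(i, S)] for i in S) for S in (A, B))
```

Here the per-set minima add up to less than the jointly optimal cost, because the two individually cheapest choices cannot come from a single ranking.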

However, for important collections of subsets, we show that the optimal choices of some subsets

can be determined “separately” from those of other subsets. When carefully done, this allows us to

efficiently determine the optimal choices by decoupling their search. We formalize this concept next.

3. Rational Separation and Choice Graph

In this section, we introduce the notions of rational separation and choice graph. These concepts

allow us to decouple the search for the optimal choices in certain subsets, provided that choices

from other subsets are fixed. We start with the formal definition of rational separation and then

provide an example to illustrate the concept.

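Definition 3.1 (Rational Separation) Given three subsets, $A$, $B$, and $C$, we say that $A$ and

$B$ are rationally separable given $C$, written as $A \perp\!\!\!\perp B \mid C$, if for all $z_A \in A$, $z_B \in B$, and $z_C \in C$,

whenever $D_A(z_A) \cap D_C(z_C) \neq \emptyset$ and $D_B(z_B) \cap D_C(z_C) \neq \emptyset$, we also have $D_A(z_A) \cap D_B(z_B) \cap D_C(z_C) \neq \emptyset$.

Similarly, for any three collections of subsets $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{C}$, we say that $\mathcal{A} \perp\!\!\!\perp \mathcal{B} \mid \mathcal{C}$ if, whenever

$D_{\mathcal{A}}(z_{\mathcal{A}}) \cap D_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$ and $D_{\mathcal{B}}(z_{\mathcal{B}}) \cap D_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$, we also have $D_{\mathcal{A}}(z_{\mathcal{A}}) \cap D_{\mathcal{B}}(z_{\mathcal{B}}) \cap D_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$.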

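A consequence of rational separation between $A$ and $B$ given $C$ is that

\[
D_A(z_A) \cap D_B(z_B) \cap D_C(z_C) \neq \emptyset \quad \text{if and only if} \quad D_A(z_A) \cap D_C(z_C) \neq \emptyset \ \text{ and } \ D_B(z_B) \cap D_C(z_C) \neq \emptyset.
\]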

In other words, if we ensure that the choices from subsets A and C are consistent (the choices do

not violate the transitivity of preferences) and the choices from subsets B and C are also consistent,

then the rational separation property of A and B given C immediately implies that the choices

from A, B, and C are mutually consistent. The following simple example illustrates this property.

Example 3.2 Let $A = \{1,2,3\}$, $B = \{5,6,7\}$, and $C = \{2,3,4,5,6\}$. We will show $A \perp\!\!\!\perp B \mid C$; i.e.,

if we fix the choice from $C$, then the choices from $A$ and $B$ can be made independently of each other.



Consider the case when $z_C = 2$, i.e., product 2 is the most preferred among all the products in

set $C$. Because set $C$ includes product 3, it follows that 2 is preferred over 3. Therefore, the most

preferred product in set $A$ must be either 1 or 2, implying that $D_A(z_A) \cap D_C(z_C) \neq \emptyset \iff z_A \in \{1,2\}$.

Furthermore, because $z_C = 2$ does not constrain the preference ordering among products 5,

6, and 7, we conclude that $D_B(z_B) \cap D_C(z_C) \neq \emptyset \iff z_B \in \{5,6,7\}$. The case when $z_C \in \{3,4,5,6\}$

can be argued similarly, resulting in the following table of consistent choices for $z_A$, $z_B$, for each $z_C$:

$z_C$                                                  2          3          4          5          6
$z_A$ : $D_A(z_A) \cap D_C(z_C) \neq \emptyset$        {1,2}      {1,3}      {1,2,3}    {1,2,3}    {1,2,3}
$z_B$ : $D_B(z_B) \cap D_C(z_C) \neq \emptyset$        {5,6,7}    {5,6,7}    {5,6,7}    {5,7}      {6,7}

Using the above table, we now show that $A \perp\!\!\!\perp B \mid C$. By construction, for any combination of

choices $(z_A, z_B, z_C)$ in the above table, we must have $D_A(z_A) \cap D_C(z_C) \neq \emptyset$ and $D_B(z_B) \cap D_C(z_C) \neq \emptyset$.

Therefore, it suffices to show that we also have $D_A(z_A) \cap D_B(z_B) \cap D_C(z_C) \neq \emptyset$.

Suppose $z_C = 2$. There are two cases: $z_A = 1$ and $z_A = 2$. If $z_A = 1$, then for any $z_B \in \{5,6,7\}$,

consider the ranking $\sigma$ that ranks product 1 at the top; product 2, second; and product $z_B$, third.

The ranking $\sigma$ results in the choices $z_A$, $z_B$, and $z_C$ from sets $A$, $B$, and $C$, respectively. Therefore,

$\sigma \in D_A(z_A) \cap D_B(z_B) \cap D_C(z_C)$. In a similar manner, when $z_A = 2$ and for any $z_B \in \{5,6,7\}$, the

ranking $\tau$ that ranks product 2 at the top and product $z_B$, second, is such that $\tau \in D_A(z_A) \cap D_B(z_B) \cap D_C(z_C)$.

Therefore, in both cases, we have $D_A(z_A) \cap D_B(z_B) \cap D_C(z_C) \neq \emptyset$, which is the

desired result. Using a similar argument for other values of $z_C$ shows that $A \perp\!\!\!\perp B \mid C$.
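The case analysis of Example 3.2 can also be confirmed mechanically. The sketch below is a brute-force check, written for illustration only: it enumerates every ranking of the products in the three sets, records which pairs and triples of top choices actually occur, and then tests the defining implication of rational separation. The function name is hypothetical.

```python
from itertools import permutations


def rationally_separable(A, B, C):
    """Brute-force test of 'A and B are rationally separable given C'."""
    universe = sorted(A | B | C)
    pairs_ac, pairs_bc, triples = set(), set(), set()
    for ranking in permutations(universe):
        pos = {p: r for r, p in enumerate(ranking)}
        z_a, z_b, z_c = (min(S, key=pos.__getitem__) for S in (A, B, C))
        pairs_ac.add((z_a, z_c))      # D_A(z_A) and D_C(z_C) intersect
        pairs_bc.add((z_b, z_c))      # D_B(z_B) and D_C(z_C) intersect
        triples.add((z_a, z_b, z_c))  # all three intersect
    # Pairwise consistency with C must imply joint consistency.
    return all((z_a, z_b, z_c) in triples
               for (z_a, z_c) in pairs_ac
               for (z_b, z_c2) in pairs_bc if z_c2 == z_c)


example_holds = rationally_separable({1, 2, 3}, {5, 6, 7}, {2, 3, 4, 5, 6})
# A hypothetical triple of pairs where separation fails: the choices
# 2 from {1,2}, 3 from {2,3}, and 1 from {1,3} form a preference cycle.
counterexample_holds = rationally_separable({1, 2}, {2, 3}, {1, 3})
```

Enumeration over 7! = 5040 rankings is instantaneous here, but the approach is meant only as a sanity check on small examples.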

Computational Savings from Rational Separation: An important consequence of the rational

separation property is that once we fix the best item in set C, the optimal choices in sets A and B

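can be determined separately. More precisely, if $A \perp\!\!\!\perp B \mid C$, then

\[
\min\Big\{ c_{z_A, A} + c_{z_B, B} + c_{z_C, C} \;\Big|\; z_A \in A,\ z_B \in B,\ z_C \in C,\ D_A(z_A) \cap D_B(z_B) \cap D_C(z_C) \neq \emptyset \Big\}
\]
\[
=\; \min_{z_C \in C} \Big\{ c_{z_C, C} \;+ \min_{z_A \in A \,:\, D_A(z_A) \cap D_C(z_C) \neq \emptyset} c_{z_A, A} \;+ \min_{z_B \in B \,:\, D_B(z_B) \cap D_C(z_C) \neq \emptyset} c_{z_B, B} \Big\}.
\]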

The above equation shows that instead of a brute force search over all possible triples $(z_A, z_B, z_C)$,

searching over $z_A$ and $z_B$ separately, for each $z_C$, suffices. Note that for each triple $(z_A, z_B, z_C)$,

checking whether $D_A(z_A) \cap D_B(z_B) \cap D_C(z_C) \neq \emptyset$ is an $O(1)$ operation. Therefore, a brute force

search over all possible triples is an $O(n^3)$ operation. Instead, when we exploit rational separation,

searching for the optimal $z_A$ and $z_B$ separately is an $O(n)$ operation for each value of $z_C$. Because

$z_C$ has $O(n)$ possible values, determining the optimal choices becomes an $O(n^2)$ operation. Thus,

rational separation has allowed us to shave off a factor of $n$ from the computational complexity. When

$\mathcal{M}$ contains many subsets, the savings become more substantial.
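The identity behind these savings can be checked numerically on the sets of Example 3.2. In the sketch below, the sets of consistent pairs and triples are obtained by enumerating rankings (an illustrative shortcut, not the constant-time check mentioned above), and the costs are arbitrary integers drawn with a fixed seed; every name is an assumption made for the example.

```python
import random
from itertools import permutations


def consistent_choices(A, B, C):
    """Collect the (z_A, z_C) pairs, (z_B, z_C) pairs, and triples realized by some ranking."""
    ac, bc, abc = set(), set(), set()
    for ranking in permutations(sorted(A | B | C)):
        pos = {p: r for r, p in enumerate(ranking)}
        z_a, z_b, z_c = (min(S, key=pos.__getitem__) for S in (A, B, C))
        ac.add((z_a, z_c))
        bc.add((z_b, z_c))
        abc.add((z_a, z_b, z_c))
    return ac, bc, abc


def joint_minimum(cost_a, cost_b, cost_c, abc):
    """Left-hand side: search over all consistent triples."""
    return min(cost_a[a] + cost_b[b] + cost_c[c] for a, b, c in abc)


def decoupled_minimum(cost_a, cost_b, cost_c, ac, bc):
    """Right-hand side: fix z_C, then optimize z_A and z_B separately."""
    return min(cost_c[c]
               + min(cost_a[a] for a, c2 in ac if c2 == c)
               + min(cost_b[b] for b, c2 in bc if c2 == c)
               for c in cost_c)


A, B, C = {1, 2, 3}, {5, 6, 7}, {2, 3, 4, 5, 6}
rng = random.Random(0)  # arbitrary integer costs, fixed seed for repeatability
cost_a = {i: rng.randint(-5, 5) for i in A}
cost_b = {i: rng.randint(-5, 5) for i in B}
cost_c = {i: rng.randint(-5, 5) for i in C}

ac, bc, abc = consistent_choices(A, B, C)
lhs = joint_minimum(cost_a, cost_b, cost_c, abc)
rhs = decoupled_minimum(cost_a, cost_b, cost_c, ac, bc)
```

The two values agree for any costs precisely because $A$ and $B$ are rationally separable given $C$; for a non-separable triple the decoupled value can fall below the true optimum.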

To facilitate its use, we represent the rational separation structure of a collection of subsets $\mathcal{M}$ in the form of a choice graph, which is defined as follows:



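Definition 3.3 (Choice Graph Representation of Rational Separation) Given a collection

of subsets $\mathcal{M}$, a graph $G = (\mathcal{M}, E)$ is a choice graph of $\mathcal{M}$ if $G$ is an undirected graph whose

vertices correspond to the subsets in $\mathcal{M}$, and for any three collections of subsets $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{C}$, if $\mathcal{C}$ separates $\mathcal{A}$ and $\mathcal{B}$ in graph $G$, then it must be that $\mathcal{A} \perp\!\!\!\perp \mathcal{B} \mid \mathcal{C}$. Here, we say $\mathcal{C}$ separates $\mathcal{A}$ and $\mathcal{B}$ in the graph $G$ if every path from a set in $\mathcal{A}$ to a set in $\mathcal{B}$ passes through some set in $\mathcal{C}$.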
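The graph-theoretic half of this definition, namely whether one group of vertices separates two others, is a plain reachability question. A minimal sketch, assuming the graph is given as a list of undirected edges between vertex labels (the labels `S1`, `S2`, `S3` and the function name are hypothetical):

```python
from collections import deque


def separates(edges, A, B, C):
    """True if every path from a vertex in A to a vertex in B passes through C.

    Implemented as a breadth-first search from A that never enters C.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    blocked = set(C)
    frontier = deque(a for a in A if a not in blocked)
    seen = set(frontier)
    while frontier:
        u = frontier.popleft()
        if u in B:
            return False  # reached B without touching C
        for v in adjacency.get(u, ()):
            if v not in blocked and v not in seen:
                seen.add(v)
                frontier.append(v)
    return True


# A line graph on three vertices: S1 - S2 - S3.
line_edges = [("S1", "S2"), ("S2", "S3")]
middle_separates = separates(line_edges, {"S1"}, {"S3"}, {"S2"})
nothing_separates = separates(line_edges, {"S1"}, {"S3"}, set())
```

Note that this only tests separation in a given graph; establishing that a graph is a valid choice graph additionally requires showing that every such separation implies rational separation of the corresponding subsets.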

In Sections 4 and 5, we show how the structure of the choice graph is related to the computational

complexity of computing LoR. Before that, in Sections 3.1–3.4, we identify the choice graphs for

four important classes of set collections: nested, laminar, differentiated, and k-deletion. As discussed

below, these collections occur frequently in inventory and RM applications.

3.1 Choice Graphs for Nested Collections

The collection M = {S1, S2, . . . , Sm} is nested if and only if S1 ⊂ S2 ⊂ · · · ⊂ Sm for some indexing

of the subsets; that is, smaller subsets are always contained within larger ones. Nested collections

are common in retail and RM settings. In their classic paper, van Ryzin and Mahajan (1999) show

that when the demand follows the multinomial logit (MNL) model and all products have the same

price and cost, a profit-maximizing retailer should stock the top h products with the largest MNL

weights (which measure the popularity of the corresponding products), and the optimal size (or

variety) h is determined by store-specific factors, such as the profit margin or the market size.

When there are no stockouts, the resulting collection of offer sets across different stores is nested.

Cachon et al. (2005) extend this finding to the case in which consumers may decide to search, which

results in non-purchase even if an offered product is preferred to the no-purchase option. Similarly,

Talluri and van Ryzin (2004) show that in the single-leg revenue management problem, a nested

policy in which the offer set is always chosen from a nested collection is profit maximizing. The

nesting is by the fare order under the MNL and other common demand models. Nested collections

also arise when airlines control seat inventories using serial nested booking class controls, in which

a lower-value class product is removed from the offering whenever a higher-value class in the nested

hierarchy is closed for sale.[2]

Example 1 in Talluri and van Ryzin (2004) provides a concrete illustration of serial nesting control. Suppose the airline offers three fare products, Y, M, and K, listed in the order of decreasing prices and increasing restrictions. The airline serves five segments of customers differing in their willingness to pay and the restrictions they qualify for. Talluri and van Ryzin (2004) show that to maximize revenue, it is optimal for the airline to offer only subsets in M = {{Y}, {Y,K}, {Y,K,M}}, even though there are 8 = 2^3 possible offer sets to consider. When the airline follows the optimal policy, choice observations are collected on a nested collection of subsets. By Theorem 3.4, the choice graph for this example is {Y} − {Y,K} − {Y,K,M}.

[2] Vinod (2006), a senior vice president of Sabre Corporation (the leading RM company in the travel industry), notes that airlines frequently use a serial nesting control.

Theorem 3.4 The choice graph for the nested collection M is the line graph S1−S2− · · ·−Sm.

To prove the above result, we need to show that for any three disjoint collections of subsets A, B, C ⊆ M, if A and B are separated by C in the line graph, then A ñ B | C. We provide a proof sketch here for the case when each collection, A, B, and C, consists of a single set; the detailed proof is given in Appendix A. In particular, consider the sets Si, Sj, and Sk with i < j < k, so that Si ⊂ Sj ⊂ Sk, and, therefore, Sj separates Si from Sk in the line graph. Let zi ∈ Si, zj ∈ Sj, and zk ∈ Sk be the respective choices, such that DSi(zi) ∩ DSj(zj) ≠ ∅ and DSk(zk) ∩ DSj(zj) ≠ ∅. We now show that DSi(zi) ∩ DSj(zj) ∩ DSk(zk) ≠ ∅ to establish that Si ñ Sk | Sj. For that, note that

DSi(zi) ∩ DSj(zj) ≠ ∅ if and only if zj ∈ Sj \ Si or zj = zi, and

DSk(zk) ∩ DSj(zj) ≠ ∅ if and only if zk ∈ Sk \ Sj or zj = zk.

Consider a ranking σ ∈ DSi(zi) ∩ DSj(zj). We modify σ into a new ranking σ′ as follows: If zk ∈ Sk \ Sj, we obtain σ′ by moving zk to the top of σ, so that zk becomes the most preferred product under σ′. Because zk ∉ Si ∪ Sj, this modification does not affect the choices from Si and Sj; therefore, we have σ′ ∈ DSi(zi) ∩ DSj(zj) ∩ DSk(zk). On the other hand, if zk = zj, we obtain σ′ by moving all the elements in Sk \ Sj to the bottom of σ. Again, because (Sk \ Sj) ∩ (Sj ∪ Si) = ∅, the choices from both Si and Sj are unaffected, and zj, which now ranks above every element of Sk \ Sj, is also the top-ranked product of Sk; therefore, we also have σ′ ∈ DSi(zi) ∩ DSj(zj) ∩ DSk(zk). In both cases, we have shown that DSi(zi) ∩ DSj(zj) ∩ DSk(zk) ≠ ∅, which is the desired result.

3.2 Choice Graphs for Laminar Collections

The laminar collection generalizes the nested collection. A collection M is laminar if it has the property that for any two distinct sets A, B ∈ M, either A ⊂ B, or B ⊂ A, or A ∩ B = ∅; in other words, any two sets are either disjoint or one contains the other. Let T = (M, E) be the graph representing the laminar collection M with vertices corresponding to the sets in M and an edge {S1, S2} ∈ E between the two sets S1 and S2 if and only if they are adjacent. Two subsets A, B ∈ M are adjacent if they are not disjoint and there is no other subset X ∈ M, such that A ⊂ X ⊂ B or B ⊂ X ⊂ A. The main result of this section is stated in the following theorem:

Theorem 3.5 The choice graph for the laminar collection M is the forest (a collection of disjoint trees) T = (M, E).



The proof of this theorem is given in Appendix A. Here, we present a motivating example.

In practice, the laminar collection arises naturally when customers use the EBA screening pro-

cess (Tversky and Kahneman 1974) to form consideration sets before making a purchase. In the

EBA screening process, customers essentially whittle down large product spaces by sequentially

eliminating products that do not meet their criteria. Figure 1 provides an example from a RM

application. Suppose a customer is looking for a one-way economy flight from Los Angeles, CA,

(LAX) to Toronto, ON, (YYZ) on July 25, 2015. This example has a total of eight itineraries, with

each one described by three attributes: departure time, number of connections, and airline. We

label the itineraries 1,2, . . . ,8, as shown at the bottom of Figure 1.

Figure 1 An illustration of the EBA screening process for a one-way flight itinerary from LAX to YYZ

on June 25, 2015. The data were collected on March 1, 2015 from Orbitz.com.

Orbitz.com allows customers to filter their search results by various attributes. For this

example, we assume that customers follow the EBA screening process to narrow their search

results. For instance, a customer who prefers non-stop flights that depart in the morning would filter the search results on these attributes to arrive at the consideration set {1,2}, and then make a choice from this set. In a similar way, customers who prefer afternoon flights would choose from the consideration set (or the effective offer set) {5,6,7,8}. Because Orbitz.com can observe the filtered search results, the purchase data provide choice observations from effective offer sets from the following laminar collection: M = {{1,2,3,4,5,6,7,8}, {1,2,3,4}, {5,6,7,8}, {1,2}, {3,4}, {5,6}, {7,8}, {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}}.

The choice graph for this example, obtained by joining adjacent sets, is the following tree:

{1,2,3,4,5,6,7,8}
    {1,2,3,4}
        {1,2}
            {1}
            {2}
        {3,4}
            {3}
            {4}
    {5,6,7,8}
        {5,6}
            {5}
            {6}
        {7,8}
            {7}
            {8}
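The adjacency rule used to build this tree can be sketched in a few lines of Python. In a laminar collection, the strict supersets of any set form a chain, so each set is adjacent to its smallest strict superset (if one exists). The function name and the input format below are assumptions made for illustration.

```python
def laminar_forest(collection):
    """Return the edges (parent, child) of the forest for a laminar
    collection: each set is joined to its smallest strict superset."""
    sets = [frozenset(S) for S in collection]
    edges = []
    for S in sets:
        supersets = [X for X in sets if S < X]  # strict supersets of S
        if supersets:
            parent = min(supersets, key=len)    # immediate container
            edges.append((parent, S))
    return edges

# The laminar collection from the flight itinerary example.
orbitz = [
    {1, 2, 3, 4, 5, 6, 7, 8},
    {1, 2, 3, 4}, {5, 6, 7, 8},
    {1, 2}, {3, 4}, {5, 6}, {7, 8},
    {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8},
]
forest_edges = laminar_forest(orbitz)
```

For the 15 sets in the example, the sketch produces the 14 edges of the tree shown above.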



3.3 Choice Graphs for Differentiated Collections

The differentiated collection M = {A0, A1, . . . , Aℓ} comprises sets that possess a common core but are otherwise disjoint. It has the property that Ai ∩ Aj = A0 for all 1 ≤ i < j ≤ ℓ, so that the set A0 forms the common core of products present in every set. Beyond the core products, there is no overlap in the subsets. Consider a star-shaped choice graph with A0 at the center and edges from A0 to Ai for each i; more formally, let Star = (M, E) be the graph with node set M and edge set E = {(A0, Ai) : 1 ≤ i ≤ ℓ}. Theorem 3.6 shows that Star is a choice graph of M.

A differentiated collection arises when, in addition to common products, firms carry a unique

sub-collection of products in each store to cater to local tastes. A chain of grocery stores often

carries a common set of popular products, but each store differentiates itself by carrying locally

sourced products, such as local cheese or beer. As an example, consider a retailer with stores in

Indianapolis, Detroit, and St. Louis. By analyzing the actual grocery transaction data from the

IRI academic dataset (Bronnenberg et al. 2008), we identified the following brands of soup that

were sold at each of the three stores; the number in parentheses next to each brand is the product

ID, which ranges from 1 to 7.

• Indianapolis: Healthy Choice (1), Campbell’s (2), Progresso (3), Health Valley (4), Pacific (5)

• Detroit: Healthy Choice (1), Campbell’s (2), Progresso (3), Hormel (6)

• St. Louis: Healthy Choice (1), Campbell’s (2), Progresso (3), Private Label (7)

[Figure: star-shaped choice graph with A0 = {1,2,3} at the center, joined to A1 = {1,2,3,4,5}, A2 = {1,2,3,6}, and A3 = {1,2,3,7}.]

In this example, the common core set of products is given by A0 = {1,2,3}, and we have the following collection of offer sets: A1 = {1,2,3,4,5} for the Indianapolis store, A2 = {1,2,3,6} for the Detroit store, and A3 = {1,2,3,7} for the St. Louis store. The figure above shows the choice graph for this example. Note that the choice graph of the differentiated collection is a tree, but the collection is not laminar. The main result of this section is stated in the following theorem, whose proof is given in Appendix A.

Theorem 3.6 The star-graph Star is the choice graph for the differentiated collection M.

3.4 Choice Graphs for k-deletion Collections

The k-deletion collection, denoted by Delk, consists of subsets obtained from N = {1, 2, . . . , n} by deleting at most k products. For any subset A ⊆ N, let N_A = N \ A. Then, Delk = {N_A : |A| ≤ k}, so Del0 = {N}, Del1 = {N, N_{i} : i ∈ N}, Del2 = {N, N_{i}, N_{i,j} : i ∈ N, j ∈ N, i ≠ j}, and so on. The k-deletion collection arises in retail contexts when a firm faces frequent stock-outs and replenishments. In practice, retailers often maintain high service levels, resulting in low stock-out rates, especially in the fast-moving consumer goods industries (Gruen and Corsten 2008). As a result, at most k ≪ n products are stocked out at any point (see the case study with real-world data in Section 6).

The choice graph for the k-deletion collection has a layered structure with k + 1 layers (see the accompanying figure, for which k = 2 and N = {1,2,3,4}). For any 0 ≤ ℓ ≤ k, let F_ℓ ≡ {N_A : |A| = ℓ} denote the collection of subsets obtained by deleting exactly ℓ products from N. It is clear that Delk is the disjoint union of F_0, F_1, . . . , F_k. Let Gk = (Delk, Ek) be the layered graph whose vertices are the subsets in Delk, and Gk has k + 1 layers, with the subsets in F_ℓ comprising layer ℓ for 0 ≤ ℓ ≤ k. The nodes in successive layers are connected to each other, resulting in the edge set Ek = {{X_ℓ, X_{ℓ+1}} : X_ℓ ∈ F_ℓ, X_{ℓ+1} ∈ F_{ℓ+1}, ℓ = 0, 1, . . . , k − 1}. We have the following result, the detailed proof of which is presented in Appendix A:

Theorem 3.7 For all k≥ 0, the graph Gk is a choice graph for the k-deletion collection Delk.
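To make the layered construction concrete, the following Python sketch builds the layers F_0, . . . , F_k and the edge set Ek exactly as stated in the definition above, that is, every subset in one layer is joined to every subset in the next layer. The function name and the return format are assumptions for illustration.

```python
from itertools import combinations

def k_deletion_graph(n, k):
    """Return (layers, edges) of the layered graph G_k on Del_k.

    layers[l] lists the subsets obtained by deleting exactly l products
    from N = {1, ..., n}; edges join every pair of subsets that lie in
    successive layers.
    """
    N = frozenset(range(1, n + 1))
    layers = [[N - frozenset(A) for A in combinations(sorted(N), l)]
              for l in range(k + 1)]
    edges = [(X, Y)
             for l in range(k)
             for X in layers[l]
             for Y in layers[l + 1]]
    return layers, edges

layers, edges = k_deletion_graph(4, 2)
```

For n = 4 and k = 2, the layers contain 1, 4, and 6 subsets, respectively; for k = 1, the construction reduces to a star centered at N.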

We conclude this section with a few remarks on the algorithmic verification of the rational

separation property. Given three subsets A, B, and C, we can verify whether a triple (zA, zB, zC) of choices is mutually consistent in O(n) operations by forming a directed graph with the products as nodes and an edge from zS to every product in S \ {zS} for each S ∈ {A, B, C}, and by checking for the presence of directed cycles. As a result, we can verify whether the subsets A, B, and C satisfy the rational separation property in O(n^4) operations by checking Definition 3.1 for all possible triples (zA, zB, zC). In a

similar way, we can verify the rational separation property for any three collections of subsets A,

B, and C, albeit with exponential complexity. The details are shown in Appendix B.
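The cycle-based consistency check described above can be sketched as follows. The sketch assumes that the choices are supplied as (offer set, chosen product) pairs; the function name and input format are illustrative and are not taken from Appendix B. Because only chosen products have outgoing edges, any directed cycle must run entirely through chosen products, which keeps the depth-first search small.

```python
def mutually_consistent(choices):
    """choices: list of (S, z) pairs with z in S.

    Returns True iff some ranking selects z as the top product of S for
    every pair, i.e., iff the directed graph with edges z -> x for every
    x in S other than z has no directed cycle.
    """
    succ = {}
    for S, z in choices:
        succ.setdefault(z, set()).update(x for x in S if x != z)

    WHITE, GRAY, BLACK = 0, 1, 2
    color = {z: WHITE for z in succ}  # only chosen products can lie on a cycle

    def has_cycle(u):
        color[u] = GRAY
        for v in succ[u]:
            if v in color:
                if color[v] == GRAY:
                    return True
                if color[v] == WHITE and has_cycle(v):
                    return True
        color[u] = BLACK
        return False

    return not any(color[z] == WHITE and has_cycle(z) for z in succ)
```

For instance, choosing 1 from {1,2}, 2 from {2,3}, and 3 from {1,3} creates the cycle 1 → 2 → 3 → 1, so no ranking can generate these three choices.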

4. Limits of Rationality in Trees

We now derive the complexity of computing the LoR when the underlying choice graph T= (M,E)

is a tree. We first establish that the Rank Aggregation LP can be solved in linear time by using

O(n |M|) operations. This complexity is the best possible bound (up to constant factors) because merely reading all the objective coefficients {ci,S : i ∈ S, S ∈ M} requires Ω(n |M|) operations in the worst case. We prove the result by designing an efficient DP to find the optimal solution.



We also describe a compact linear program with O(n |M|) variables and O(n |M|) constraints to

determine the optimal solution using off-the-shelf tools. The main result of this section is stated

in the following theorem:

Theorem 4.1 (Linear Time Computation of LoR(M)) If the underlying choice graph is a

tree T = (M,E), then the Rank Aggregation LP can be solved via a DP in linear time with

O(n |M|) operations.

Proof: By Lemma 2.5, the Rank Aggregation LP is equivalent to

$$
Z^* = \min\Big\{ \sum_{S \in \mathcal{M}} c_{z_S,S} \;\Big|\; z_S \in S \ \ \forall\, S \in \mathcal{M},\ \ \bigcap_{S \in \mathcal{M}} D_S(z_S) \neq \emptyset \Big\}.
$$

Pick an arbitrary vertex in T, label it as root, and treat it as the root vertex of the tree. In addition, for any S ∈ M, let T(S) denote the collection of subsets (that is, vertices) in the subtree rooted at S; for mathematical convenience, we exclude S from T(S). Note that it follows from the tree structure of the choice graph that if S1 ≠ S2, S1 ∉ T(S2), and S2 ∉ T(S1), then T(S1) ∩ T(S2) = ∅. For each S ∈ M, define the value function VS : S → R for the subtree rooted at S as follows:

$$
V_S(z_S) = c_{z_S,S} + \min\Big\{ \sum_{A \in T(S)} c_{z_A,A} \;\Big|\; z_A \in A \ \ \forall\, A \in T(S),\ \ D_S(z_S) \cap D_{T(S)}(z_{T(S)}) \neq \emptyset \Big\},
$$

where D_{T(S)}(z_{T(S)}) = ∩_{A∈T(S)} D_A(z_A). By definition, Z∗ = min_{z_root ∈ root} V_root(z_root). Note that


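$$
V_S(z_S) = c_{z_S,S} + \min\Bigg\{ \sum_{A \in \mathrm{Children}(S)} \Big[ c_{z_A,A} + \sum_{B \in T(A)} c_{z_B,B} \Big] \;\Bigg|\; D_S(z_S) \cap \Big( \bigcap_{A \in \mathrm{Children}(S)} \big[ D_A(z_A) \cap D_{T(A)}(z_{T(A)}) \big] \Big) \neq \emptyset \Bigg\}.
$$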

To obtain an efficient dynamic programming recursion, we now exploit the rational separation as

described by the choice tree. Note that

$$
\begin{aligned}
& D_S(z_S) \cap \bigcap_{A \in \mathrm{Children}(S)} \big[ D_A(z_A) \cap D_{T(A)}(z_{T(A)}) \big] \neq \emptyset \\
\Leftrightarrow\ \ & D_S(z_S) \cap D_A(z_A) \cap D_{T(A)}(z_{T(A)}) \neq \emptyset \quad \forall\, A \in \mathrm{Children}(S) \\
\Leftrightarrow\ \ & D_S(z_S) \cap D_A(z_A) \neq \emptyset \ \text{ and } \ D_A(z_A) \cap D_{T(A)}(z_{T(A)}) \neq \emptyset \quad \forall\, A \in \mathrm{Children}(S),
\end{aligned}
$$

where the first equivalence follows because S rationally separates its children and their descendants from one another; that is, for any two children A1 and A2 of S, (A1 ∪ T(A1)) ñ (A2 ∪ T(A2)) | S because each path from A1 ∪ T(A1) to A2 ∪ T(A2) must go through S. Similarly, the second equivalence follows because for each A ∈ Children(S), A rationally separates S from T(A); that is, S ñ T(A) | A because all paths from S to T(A) must go through A. It now follows that


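$$
\begin{aligned}
V_S(z_S) &= c_{z_S,S} + \min\Bigg\{ \sum_{A \in \mathrm{Children}(S)} \Big[ c_{z_A,A} + \sum_{B \in T(A)} c_{z_B,B} \Big] \;\Bigg|\; D_S(z_S) \cap D_A(z_A) \neq \emptyset \ \text{ and } \ D_A(z_A) \cap D_{T(A)}(z_{T(A)}) \neq \emptyset \ \ \forall\, A \in \mathrm{Children}(S) \Bigg\} \\
&= c_{z_S,S} + \sum_{A \in \mathrm{Children}(S)} \min\Big\{ c_{z_A,A} + \sum_{B \in T(A)} c_{z_B,B} \;\Big|\; D_S(z_S) \cap D_A(z_A) \neq \emptyset \ \text{ and } \ D_A(z_A) \cap D_{T(A)}(z_{T(A)}) \neq \emptyset \Big\} \\
&= c_{z_S,S} + \sum_{A \in \mathrm{Children}(S)} \ \min_{z_A \in A \,:\, D_S(z_S) \cap D_A(z_A) \neq \emptyset} V_A(z_A),
\end{aligned}
$$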

where the last equality follows from the definition of VA(zA). The above recursion allows us to

start at the leaf vertices of T, traverse in a breadth-first search, and recursively compute the value

functions until we reach root.

Complexity: By traversing the choice graph in a breadth-first search manner, the DP recursion

can be computed in linear time using O(n |M|) operations. The details are in Appendix C.1.
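The recursion can be sketched in Python as follows. This is a simplified illustration and not the linear-time implementation of Appendix C.1: it evaluates each inner minimum by direct enumeration, so it uses O(n^2 |M|) operations. The pairwise test relies on the observation that two choices zS ∈ S and zA ∈ A are jointly consistent unless they differ and each chosen product is also available in the other set. The function names and the dictionary-based inputs are assumptions, and the supplied rooted tree is assumed to be a valid choice graph.

```python
def pair_consistent(S, zS, A, zA):
    """True iff some ranking selects zS from S and zA from A."""
    return zS == zA or not (zS in A and zA in S)

def lor_tree_dp(root, children, cost):
    """Evaluate Z* by the tree recursion V_S(z_S).

    root:     frozenset, the offer set chosen as the root vertex
    children: dict mapping an offer set to the list of its child offer sets
    cost:     dict mapping an offer set S to {z: c_{z,S}} for z in S
    """
    def value(S):
        child_values = [(A, value(A)) for A in children.get(S, [])]
        V = {}
        for zS in S:
            total = cost[S][zS]
            for A, VA in child_values:
                total += min(VA[zA] for zA in A
                             if pair_consistent(S, zS, A, zA))
            V[zS] = total
        return V

    return min(value(root).values())

# Hypothetical nested collection S1 ⊂ S2 ⊂ S3 with illustrative costs.
S1, S2, S3 = frozenset({1}), frozenset({1, 2}), frozenset({1, 2, 3})
cost = {S1: {1: 0.0}, S2: {1: 5.0, 2: 0.0}, S3: {1: 0.0, 2: 3.0, 3: 4.0}}
z_root_s3 = lor_tree_dp(S3, {S3: [S2], S2: [S1]})  if False else lor_tree_dp(S3, {S3: [S2], S2: [S1]}, cost)
z_root_s2 = lor_tree_dp(S2, {S2: [S1, S3]}, cost)
```

The optimal value does not depend on which vertex of the line graph S1 − S2 − S3 is designated as the root, which the two calls above illustrate.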

An immediate corollary of Theorem 4.1 is that the complexity of computing the LoR for the

nested, laminar, and differentiated collections (introduced in Section 3) is O(n |M|). We emphasize that the linear running time in Theorem 4.1 is the best possible because the number of parameters |{ci,S : i ∈ S, S ∈ M}| for the problem is already Ω(n |M|) in the worst case.

An alternative computational method based on linear programming (LP): In prac-

tical applications, having algorithms that exploit off-the-shelf optimization packages is often desir-

able. For a tree choice graph, the Rank Aggregation LP can also be reformulated as a compact

linear program with just O(n |M|) variables and constraints. This result is stated in Theorem 4.2, whose proof is in Appendix C.2, which also gives specific examples of the compact LP representation for the nested, laminar, and differentiated collections introduced in Section 3.

Theorem 4.2 (Compact LP for Tree) When the choice graph T= (M,E) is a tree, the Rank

Aggregation LP can be equivalently reformulated as the following LP with O(n |M|) variables

and constraints:

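$$
\begin{aligned}
\min\ \ & \sum_{S \in \mathcal{M}} \sum_{i \in S} c_{i,S}\, y_{i,S} \\
\text{s.t.}\ \ & y_{r,A} \le z_{r,A,B} \ \text{ and } \ y_{r,B} \le z_{r,A,B} && \forall\, r \in A \cap B,\ \{A,B\} \in E, \\
& \sum_{i \in S} y_{i,S} = 1 && \forall\, S \in \mathcal{M}, \\
& \sum_{r \in A \cap B} z_{r,A,B} = 1 && \forall\, \{A,B\} \in E, \\
& y \ge 0,\ z \ge 0.
\end{aligned}
$$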
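As a small worked instance (our own illustration, not one of the examples in Appendix C.2), consider the nested collection with S1 = {1,2} ⊂ S2 = {1,2,3}, whose choice graph has the single edge {S1, S2} with S1 ∩ S2 = {1,2}. Writing z_r for z_{r,S1,S2}, the constraints of the compact LP become

$$
\begin{aligned}
& y_{1,S_1} \le z_1, \quad y_{1,S_2} \le z_1, \quad y_{2,S_1} \le z_2, \quad y_{2,S_2} \le z_2, \\
& y_{1,S_1} + y_{2,S_1} = 1, \quad y_{1,S_2} + y_{2,S_2} + y_{3,S_2} = 1, \quad z_1 + z_2 = 1, \quad y \ge 0,\ z \ge 0.
\end{aligned}
$$

Because 1 = y_{1,S_1} + y_{2,S_1} ≤ z_1 + z_2 = 1, we must have z_r = y_{r,S_1} for r ∈ {1,2}, so the remaining inequalities reduce to

$$
y_{r,S_2} \le y_{r,S_1} \quad \text{for } r \in \{1, 2\};
$$

that is, in this instance the LP requires that the probability of choosing a product cannot increase when the offer set is enlarged from S1 to S2.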

5. Limits of Rationality for General Graphs

We now extend the results of Section 4 to arbitrary choice graphs that are not necessarily trees.

Our extension uses the concept of tree decomposition of a graph and the corresponding tree width.



These concepts are well known in classic graph theory and are used to describe the computational

complexity of many algorithms on graphs. In Section 5.1, we present the concepts of tree decom-

position and tree width. Then, in Section 5.2, we consider the problem of computing the limit of

rationality in general choice graphs. We show that if the tree width is bounded, then the Rank

Aggregation LP can be solved efficiently in polynomial time by formulating the rank aggrega-

tion problem as a DP on the tree decomposition of the choice graph. We then develop methods to

deal with the graphs of unbounded tree width by using the new concept of choice depth, which is

unique to choice graphs.

5.1 Tree Decomposition and Tree Width

We first review the concepts of tree decomposition and tree width by using the definition given

by Robertson and Seymour (1986); see Halin (1976) and Diestel (2005) for equivalent definitions.

Definition 5.1 A tree decomposition of a choice graph G = (M, E) is a connected tree TG with nodes indexed by {Xb : b ∈ B}. For all b ∈ B, the node Xb ⊆ M is a subset of the vertices of G and is referred to as a bag of vertices. The collection of bags {Xb : b ∈ B} satisfies the following properties:

1. M = ∪b∈B Xb.

2. For every edge {S1, S2} ∈ E in graph G, there exists a bag Xb, such that {S1, S2} ⊆ Xb.

3. Running intersection property: If Xb and Xc contain a set S ∈M, then every bag in the unique

path between Xb and Xc must also contain S.

The width of the tree decomposition TG, denoted by width(TG), is equal to the size of the largest

bag minus one; that is, maxb∈B |Xb| − 1. The tree width of a graph G, denoted by tw(G), is the

minimum width among all tree decompositions of G.
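The three properties in Definition 5.1 can be verified mechanically. The Python sketch below is an illustrative checker under assumed input conventions (bags indexed by arbitrary identifiers, and the edges among bags assumed to already form a tree); it uses the fact that the running intersection property holds exactly when the bags containing each vertex form a connected subtree.

```python
def is_tree_decomposition(vertices, edges, bags, tree_edges):
    """vertices, edges: the graph G; bags: dict bag_id -> set of vertices;
    tree_edges: list of (bag_id, bag_id) pairs, assumed to form a tree."""
    # Property 1: the bags cover every vertex of G.
    if set().union(*bags.values()) != set(vertices):
        return False
    # Property 2: both endpoints of every edge share some bag.
    if not all(any({u, v} <= X for X in bags.values()) for u, v in edges):
        return False
    # Property 3: bags containing a vertex induce a connected subtree.
    for v in vertices:
        holding = {b for b, X in bags.items() if v in X}
        start = next(iter(holding))
        seen, stack = {start}, [start]
        while stack:
            b = stack.pop()
            for p, q in tree_edges:
                for x, y in ((p, q), (q, p)):
                    if x == b and y in holding and y not in seen:
                        seen.add(y)
                        stack.append(y)
        if seen != holding:
            return False
    return True

def width(bags):
    """Size of the largest bag minus one."""
    return max(len(X) for X in bags.values()) - 1

# Hypothetical path graph a - b - c with one bag per edge.
path_vertices = ["a", "b", "c"]
path_edges = [("a", "b"), ("b", "c")]
edge_bags = {0: {"a", "b"}, 1: {"b", "c"}}
```

For the path graph above, taking one bag per edge yields a valid tree decomposition of width 1.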

Note that G has many tree decompositions. A trivial one is a tree with a single bag consisting of all

vertices in the original graph. The problem of computing the tree width of a graph is NP-complete

(Arnborg et al. 1987), but extensive research efforts to develop efficient algorithms for finding tree

decompositions with small widths have been made (Fomin and Kratsch 2010). We do not address

this particular issue here, as it is beyond the scope of our paper.

Each bag Xb ⊆M in a tree decomposition TG corresponds to a collection of subsets. To distinguish

between the vertices of the original graph G and those of TG, we will refer to a vertex in TG as

a bag of subsets. It is important to note that if the choice graph G is already a tree, then G is not its own tree decomposition; rather, G admits a tree decomposition in which each bag Xb = {S1, S2} consists of exactly two subsets, such that {S1, S2} ∈ E corresponds to an edge in the original tree. Therefore, if the original choice graph is a tree, then its tree width is equal to 2 − 1 = 1. The following lemma shows that our concept of rational separation from Definition 3.1

is preserved under any tree decomposition.

Lemma 5.2 (Preservation of Rational Separation) Every tree decomposition TG of a choice

graph G preserves the rational separation encapsulated in G; that is, for any bag Xk that lies on the

unique path between bags Xi and Xj, we have Xi ñXj | Xk.

Proof: It suffices to show that whenever Xk separates Xi from Xj in TG, then Xk also separates

Xi from Xj in the original graph G. To arrive at a contradiction, we suppose on the contrary that

there exist A ∈ Xi and B ∈ Xj, such that for some path A = C0 − C1 − · · · − Cm − Cm+1 = B in G, we have Cℓ ∉ Xk for all 0 ≤ ℓ ≤ m + 1.

Now, consider the edges eℓ = {Cℓ, Cℓ+1} for 0 ≤ ℓ ≤ m in the original graph G. Property 2 in Definition 5.1 of the tree decomposition implies that for 0 ≤ ℓ ≤ m, there exists a bag Yℓ in the tree decomposition TG, such that eℓ = {Cℓ, Cℓ+1} ⊆ Yℓ. By our construction, A = C0 ∈ Y0 and B = Cm+1 ∈ Ym.

Claim 1: For each 0 ≤ ℓ ≤ m − 1, the bag Xk does not lie on the path from Yℓ to Yℓ+1. To prove this, we note that by construction, Cℓ+1 ∈ Yℓ and Cℓ+1 ∈ Yℓ+1. Then, by the running intersection property in Definition 5.1, every bag on the path connecting Yℓ and Yℓ+1 in TG must contain Cℓ+1. Because we are assuming that Cℓ+1 ∉ Xk, it follows that Xk cannot lie on the path from Yℓ to Yℓ+1 in TG. The claim thus follows.

Claim 2: There is a path in the tree decomposition TG from Xi to Y0 that does not contain

Xk, and there is a path in the tree decomposition TG from Ym to Xj that does not contain Xk. To

prove this, note that both bags Xi and Y0 contain C0, so by the running intersection property of

the tree decomposition, it must be that every bag in the path from Xi to Y0 also contains C0. Since

C0 /∈Xk, it must be that Xk does not lie on this path, giving the desired result. The proof for the

path from Ym to Xj is exactly the same. The claim thus follows.

It then follows from Claims 1 and 2 that there is a path in the tree decomposition TG from Xi to Xj through the bags Y0, Y1, . . . , Ym that does not contain Xk. This contradicts the fact that Xk separates Xi and Xj in TG. Therefore, we must have Xi ñ Xj | Xk.

Because every tree decomposition preserves rational separation, we can solve the Rank Aggre-

gation LP on the tree decomposition. As the first step, the following lemma shows that we can

always find a tree decomposition with at most |M| bags. The result follows immediately from the

vertex elimination algorithm of Rose et al. (1976), and we omit the details.

Lemma 5.3 (Theorem 9 in Rose et al. 1976) We can always construct a tree decomposition

of graph G with at most |M| bags by using O(|M|+ |E|) operations.



5.2 Computing LoR(M) in General Choice Graphs

We now consider the problem of computing the limit of rationality in a general choice graph.

When the tree width of a choice graph is bounded, the following theorem shows that the Rank

Aggregation LP can be solved in polynomial time:

Theorem 5.4 (Complexity in Tree Width) The Rank Aggregation LP can be solved in O(|M| n^{4+2tw(G)}) operations.

The proof of Theorem 5.4 makes use of a dynamic programming equation that is similar to the one used in the proof of Theorem 4.1; the details are in Appendix D. In addition, for a general choice graph with a bounded tree width, we can also solve the Rank Aggregation LP by using a linear program with O(|M| n^{2(tw(G)+1)}) variables and O(|M| n^{tw(G)+1}) constraints. This result is stated in the following theorem, whose proof is given in Appendix D.

Theorem 5.5 (LP for Bounded Tree Width) For any collection M of subsets and the associated choice graph G, the Rank Aggregation LP can be formulated as a linear program with O(|M| n^{2(tw(G)+1)}) variables and O(|M| n^{tw(G)+1}) constraints.

Dealing with Choice Graphs with Large Tree Widths: The results thus far encompass

a broad range of set collections with tree choice graphs, including the nested, laminar, and differ-

entiated collections. Recall from Section 3.4 that Gk denotes the choice graph for the k-deletion

collection, Delk. Note that G1 is a tree, so its tree width is one. However, for k ≥ 2, Gk is not a

tree, and as shown in Lemma 5.7, its tree width is at least n.

When the tree width of the choice graph increases with the number of products, the methods from

the previous sections become intractable because the computational complexity scales exponentially

in the tree width. Fortunately, for many collections of subsets, including the k-deletion collection,

we can still solve the Rank Aggregation LP efficiently. To describe this result, we introduce

the following new concept that is unique to the choice graph:

Definition 5.6 (Choice Depth) Given a tree decomposition TG of a choice graph G = (M, E), the choice depth under TG – denoted by cd(G, TG) – is defined by

$$
\mathrm{cd}(G, T_G) = \max_{b \in B} \Big\{ \Big| \bigcup_{S \in X_b} S \Big| - \min_{A \in X_b} |A| \Big\},
$$

and the choice depth of G, denoted by cd(G), is defined as the minimum of the choice depth under all possible tree decompositions of G; i.e., cd(G) = min{cd(G, TG) : TG is a tree decomposition of G}.
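For a given tree decomposition, the choice depth cd(G, TG) is straightforward to evaluate from the bags alone. The following Python sketch, whose function name and input format are assumptions for illustration, computes it for two small cases: the 2-deletion collection on N = {1,2,3,4} placed in a single bag (the trivial decomposition), and the edge-bag decomposition of the nested line graph {1} − {1,2} − {1,2,3}.

```python
from itertools import combinations

def choice_depth(bags):
    """cd(G, T_G) for a decomposition given as an iterable of bags,
    where each bag is a collection of offer sets (frozensets)."""
    return max(len(frozenset().union(*bag)) - min(len(S) for S in bag)
               for bag in bags)

# 2-deletion collection on N = {1, 2, 3, 4}, all sets in one bag.
N = frozenset({1, 2, 3, 4})
del2 = [N - frozenset(A) for l in range(3) for A in combinations(sorted(N), l)]
single_bag_depth = choice_depth([del2])

# Nested line graph with one bag per edge.
S1, S2, S3 = frozenset({1}), frozenset({1, 2}), frozenset({1, 2, 3})
nested_depth = choice_depth([[S1, S2], [S2, S3]])
```

In the first case, the union of the bag has n = 4 products and the smallest set has n − k = 2 products, so the choice depth under the trivial decomposition equals k = 2 regardless of n.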



The choice depth concept is unique to the choice graph and is fundamentally different from the

classic tree width concept. As shown in the following lemma (proved in Appendix D), the choice

depth of a graph can be much smaller than the tree width.

Lemma 5.7 (Unbounded Tree Width but Small Choice Depth) For all 2 ≤ k < n, the

choice graph Gk (as defined in Section 3.4) of the k-deletion collection Delk has tree width of at

least n and choice depth of at most k; that is, cd(Gk)≤ k < n≤ tw(Gk).

The main result of this section is stated in the following theorem, which characterizes the com-

plexity of the LoR in terms of choice depth. The proof uses the same DP equation, as in the proof

of Theorem 5.4, but exploits the special structure of the choice graph to show that instead of

maintaining the top-ranked product in each set in each bag, maintaining the ranking of the top

cd(G) + 1 products in each bag suffices. The proof of this result is given in Appendix D.

Theorem 5.8 (Complexity in Choice Depth) For any collection of subsets M and the associated choice graph G, the Rank Aggregation LP can be solved in O(|M| n^{4+2cd(G)}) operations.

LP for graphs with bounded choice depth: As a companion to the LP in Theorem 5.5,

we now describe a compact LP to solve the rank aggregation problem for graphs with bounded

choice depth. Suppose that cd(G) ≤ k − 1. Let TG be a tree decomposition of the choice graph G with the collection of bags {Xb : b ∈ B} and choice depth less than or equal to k. For any set S ⊂ N, let Ok(S) denote the set of all possible top k orderings of the elements in S; that is, Ok(S) = {a ∈ S^k : ai ≠ aj ∀ i ≠ j, 1 ≤ i, j ≤ k}, where S^k = S × S × · · · × S denotes the k-fold Cartesian product of the set S. For each b ∈ B, let Ub = ∪S∈Xb S denote the collection of all products comprising the bag Xb. For any b ∈ B and a ∈ Ok(Ub), let Pk(a, Ub) ⊂ Pn denote the set of rankings σ that result in a as the top k ordering among the elements in Ub; that is,

Pk(a, Ub) = {σ ∈ Pn | the top k products under σ correspond to the vector a}.

Let zXb = (zS : S ∈ Xb) denote the vector of choices from all the subsets in the bag Xb. We let Ok(Ub, zXb) denote the top k orderings a ∈ Ok(Ub) that are consistent with the choices zXb; that is,

Ok(Ub, zXb) = {a ∈ Ok(Ub) | zS = a_{j*_S}, where j*_S = min{1 ≤ ℓ ≤ k : aℓ ∈ S}, ∀ S ∈ Xb}.
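The sets Ok(S) and Ok(Ub, zXb) can be enumerated directly for small instances. In the Python sketch below, the function names and the dictionary format for the choice vector are assumptions for illustration; the choice from a set S under a top k ordering a is taken to be the first entry of a that belongs to S, matching the index j*_S above.

```python
from itertools import permutations

def top_k_orderings(S, k):
    """All ordered k-tuples of distinct elements of S, i.e., O_k(S)."""
    return list(permutations(sorted(S), k))

def choice_from_ordering(a, S):
    """First entry of a that lies in S; None if no entry of a is in S."""
    return next((x for x in a if x in S), None)

def consistent_orderings(U, k, choices):
    """O_k(U, z): orderings in O_k(U) that reproduce every choice.

    choices: dict mapping an offer set (frozenset) to its chosen product.
    """
    return [a for a in top_k_orderings(U, k)
            if all(choice_from_ordering(a, S) == z
                   for S, z in choices.items())]

# Hypothetical bag with offer sets {1,2,3} and {1,2}, and k = 2.
example = consistent_orderings(
    {1, 2, 3}, 2, {frozenset({1, 2, 3}): 3, frozenset({1, 2}): 1})
```

In the example, choosing product 3 from {1,2,3} and product 1 from {1,2} is reproduced only by the top-2 ordering (3, 1).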

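For each b ∈ B and a ∈ Ok(Ub), we introduce the variables xb,a that correspond to the probability that a is the top k ordering among the elements in Ub. For any a1 ∈ Ok(Ub1) and a2 ∈ Ok(Ub2) that are consistent with each other, that is, Pk(a1, Ub1) ∩ Pk(a2, Ub2) ≠ ∅, let wb1,a1,b2,a2 be the joint probability that a1 and a2 are the top k orderings among Ub1 and Ub2, respectively. We now define the polyhedron H as follows:

$$
H = \left\{ \big( y^{b}_{z_{X_b}} : b \in B,\ D_{X_b}(z_{X_b}) \neq \emptyset \big) \;\middle|\;
\begin{aligned}
& y^{b}_{z_{X_b}} = \sum_{a \in O_k(U_b, z_{X_b})} x_{b,a} && \forall\, b \in B,\ z_{X_b} \text{ such that } D_{X_b}(z_{X_b}) \neq \emptyset, \\
& x_{b_1,a_1} = \sum_{a_2 \in O_k(U_{b_2}) \,:\, P_k(a_2,U_{b_2}) \cap P_k(a_1,U_{b_1}) \neq \emptyset} w_{b_1,a_1,b_2,a_2} && \forall \text{ edge } \{X_{b_1}, X_{b_2}\} \in T_G \text{ and } a_1 \in O_k(U_{b_1}), \\
& x_{b_2,a_2} = \sum_{a_1 \in O_k(U_{b_1}) \,:\, P_k(a_2,U_{b_2}) \cap P_k(a_1,U_{b_1}) \neq \emptyset} w_{b_1,a_1,b_2,a_2} && \forall \text{ edge } \{X_{b_1}, X_{b_2}\} \in T_G \text{ and } a_2 \in O_k(U_{b_2}), \\
& \sum_{a \in O_k(U_b)} x_{b,a} = 1 && \forall\, b \in B, \\
& \text{all variables are non-negative}
\end{aligned}
\right\}.
$$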

The number of variables and constraints in the above LP is determined by the cardinality of the set {zXb : DXb(zXb) ≠ ∅} for each bag Xb. As noted above and as shown in the proof of Theorem 5.8, the vector of top choices zXb from all subsets S ∈ Xb can be recovered from the top (1 + cd(G))-ranked products in the bag Xb. Therefore, we must have |{zXb : DXb(zXb) ≠ ∅}| = O(n^{1+cd(G)}) = O(n^k) for each b, because cd(G) ≤ k − 1. Now, because the number of bags is at most |M| (cf. Lemma 5.3), it follows that the polyhedron H is described by O(|M| n^{2k}) variables and O(|M| n^k) constraints. The main result of this section is that M is equal to H, as stated below. The proof is in Appendix D.

Theorem 5.9 (LP for Small Choice Depth) Given a collection of subsets M and the associ-

ated choice graph G with cd(G) ≤ k− 1, we have M = H .

For the (k − 1)-deletion collection from Section 3.4, we can express the polyhedron H even more compactly. Note that cd(Delk−1) = k − 1. For the (k − 1)-deletion collection, Ub = ∪S∈Xb S = N for all b ∈ B, which simplifies the above polyhedron because we can omit the variables wb1,a1,b2,a2. Moreover, the dependence on the bag index b ∈ B can be dropped. With yi,S denoting the probability that i is chosen from subset S, the polyhedron for the (k − 1)-deletion collection – denoted by H(Delk−1) – can be compactly described as follows:

$$
H(\mathrm{Del}_{k-1}) = \Big\{ (y_{i,S} : i \in S,\ S \in \mathcal{M}) \;\Big|\; \sum_{a \in O_k(N)} x_a = 1,\ \ y_{i,S} = \sum_{a \in O_k(N, z_S)} x_a \ \ \forall\, i \in S,\ S \in \mathcal{M} \Big\}.
$$

This result is summarized in the following corollary.



Corollary 5.10 (LP for the (k− 1)-Deletion Collection) For the (k− 1)-deletion collection,

the associated Rank Aggregation LP requires only O(n^k) variables and O(n^k) constraints.

Thus far, we have shown that for any set collection whose choice graph admits a small choice

depth, the corresponding LoR can be computed efficiently. Complementing this result, we now

demonstrate a set collection whose choice depth increases with the number of products n, and the

corresponding Rank Aggregation LP is NP-hard. By Lemma 2.5, the Rank Aggregation LP

is NP-hard for the collection of all pairs of products Pairs = {{i, j} : i ≠ j}. The next proposition

shows that for this collection of subsets, both tree width and choice depth increase with the number

of products n. The proof is given in Appendix D.

Proposition 5.11 (Unbounded Choice Depth and Tree Width) For every choice graph

G= (Pairs,E) associated with the collection Pairs, we have tw(G)≥ 2(n− 2) and cd(G)≥ n− 2.

5.3 Extension to Allow an Additional Structure on Rankings

The development, thus far, has considered distributions over all possible rankings. In some con-

texts, rationality imposes an additional structure, which constrains the space of possible rankings.

For example, in several retail contexts, products are offered on price or display promotion, and

rationality dictates that the promoted copy of a product is always preferred over its corresponding

non-promoted copy. This structure can be captured through a constrained set of rankings over 2n

items, where each item is either a promoted or a non-promoted copy of a product and each ranking

is constrained to prefer the promoted copies over the corresponding non-promoted copies of all n

products (see Section 6.2). In a similar way, price promotions with multiple levels of discounts may

be captured by creating a copy for each level of discount. Our framework, including all the concepts and theoretical results developed above, extends almost immediately to distributions over a constrained set of rankings. The details are shown in Appendix D.7.

6. Case Study on the IRI Academic Dataset

In this section, we present the results from applying our methodology to compute the rationality

loss and the parametric losses from fitting commonly used choice models to the IRI Academic

Dataset (Bronnenberg et al. 2008). This dataset consists of real-world purchase transactions of

consumer packaged goods for grocery and drug store chains, collected from 47 markets across

the US. The purpose of our case study is twofold: (a) quantify the magnitude of the rationality

loss vis-a-vis the total loss incurred from using common parametric models for different grocery

product categories and (b) demonstrate a diagnostic tool based on our methodology that provides



guidance for model selection—both for choosing the model complexity within the RUM family and

for determining when going beyond the RUM family is necessary. For our analysis, we focused on

the multinomial logit (MNL) and the 15-class mixture of logits (LCMNL) models, both of which

are known to be consistent with the RUM assumption.

6.1 Data Analysis

The data consist of weekly sales transactions aggregated over all customers. Because of the large

volume of transactions, we focused on data from calendar year 2007, specifically the first two weeks

of the year. The resulting transactions span 29 product categories (cf. Table EC.1 in Appendix E.1).

We processed the data separately for each category to obtain choice data in the form of a

collection of offer sets and corresponding aggregated sales fractions. For this, we first need to define

our unit of analysis, the “product,” for each category. In the dataset, items are identified at the

most granular level in terms of their respective universal product codes (UPCs). We cannot work

directly with each UPC because many UPCs have very few sales records. To deal with data sparsity,

we aggregated the items with the same vendor code (comprising digits 4 to 8 in the UPC) into

a single “vendor”. Of the resulting vendors, we focused our analysis on the top nine purchased

vendors (across all stores during the two-week period), labeling each of them as a product. We

then combined the remaining vendors into one (the 10th) product, yielding a mapping from each

UPC to one of the 10 product IDs (n = 10). Aggregating UPCs with the same vendor code is a

common technique used in data pre-processing (Bronnenberg and Mela 2004, Ailawadi et al. 2006,

Nijs et al. 2007).
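The product definition above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the preprocessing code used for the study: the `(upc, quantity)` transaction layout, the function names `vendor_code` and `build_product_map`, the reading of "digits 4 to 8" as 1-indexed and inclusive, and the ranking of vendors by total purchased quantity are all hypothetical choices.

```python
from collections import Counter

def vendor_code(upc: str) -> str:
    # Assumption: "digits 4 to 8" is read as 1-indexed and inclusive.
    return upc[3:8]

def build_product_map(transactions, top_k=9):
    """Map each UPC to a product ID in 1..top_k+1.

    transactions: list of (upc, quantity) pairs (hypothetical layout).
    The top_k vendors by purchased quantity get IDs 1..top_k; every
    remaining vendor is pooled into the single product ID top_k + 1.
    """
    volume = Counter()
    for upc, qty in transactions:
        volume[vendor_code(upc)] += qty
    top_vendors = [v for v, _ in volume.most_common(top_k)]
    vendor_to_id = {v: i + 1 for i, v in enumerate(top_vendors)}
    other_id = top_k + 1
    return {upc: vendor_to_id.get(vendor_code(upc), other_id)
            for upc, _ in transactions}
```

With `top_k=9` the map has ten product IDs, matching n = 10 in the text.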

Converting the raw transaction data to choice observations: With the products defined

as above, we converted the sales transactions for each category into choice observations, as described

next. Different categories were processed separately, and we followed the same processing steps

for all the categories. For a given category, let $\mathcal{T}$ denote³ the collection of purchase transactions (see Table EC.1 for the numbers of transactions $|\mathcal{T}|$ across categories), where each transaction consists of the following information: the week of the purchase, the store ID where the purchase occurred, the UPC of the purchased product, the quantity purchased, the price paid, and an indicator of whether the product was on price or display promotion. For each transaction $t \in \mathcal{T}$, let $j_t \in \{1, \dots, n\}$ denote the product ID of the purchased UPC; $I_t \in \{0, 1\}$, the promotion indicator (1 if promoted and 0 otherwise); $w_t$, the week of purchase; and $s_t$, the store of purchase. We processed

these transactions in two steps: (i) identifying the offer and promotion sets and (ii) computing the

corresponding sales fractions.

3 For notational brevity, we dropped the dependence of the quantities on the category label.



Step 1: Identifying the offer and promotion sets: We associated each store-week com-

bination in our dataset with an offer and a promotion set. The offer sets of products vary across

stores because of differences in the assortments carried, and across weeks within each store because

of operational reasons, such as stock-outs. With store s and week w, we associated the offer set

$S_{s,w} = \bigcup_{t \in \mathcal{T}} \{ j_t : s_t = s,\ w_t = w \} \subseteq \{1, \dots, n\}$, defined as the union of all products in the category

that were purchased at least once during week w at store s. This construction provides only an

approximation because a product might have been stocked out during the middle of the week,

resulting in an over-estimation of the offer set, or might not have been purchased at all, resulting

in an under-estimation of the offer set. The estimation errors, however, tend to be small in this

dataset because each product is obtained by aggregating tens or hundreds of UPCs, and a stock

out or non-purchase of a product will happen only when all the UPCs corresponding to the product

are stocked out or not purchased, which is rare.

The purchase behavior is also affected by promotion activity. Therefore, we associated

with each store $s$ and week $w$ the set of promoted products $P_{s,w} \subseteq S_{s,w}$, defined as $P_{s,w} = \{ j \in S_{s,w} : \exists\, t \in \mathcal{T} \text{ s.t. } j_t = j,\ I_t = 1,\ s_t = s,\ w_t = w \}$, containing all the products that were purchased on promotion at least once during week $w$ at store $s$.⁴ As is the case for the offer sets, our

construction only yields an approximation because the promotion status of a product might have

changed during the week, and we did not observe it. Without additional data, our construction

results in reasonable approximations, particularly because when a product is promoted at least once

during a week, we observe that most of its purchases in the week occurred during the promotion.

The above procedure resulted in the collection of offer set and promotion set combinations $\mathcal{M} = \{ (S_{s,w}, P_{s,w}) : (s,w) \text{ is a store-week combination} \}$ for each category. Note that the same offer set may appear at a store in two different weeks with different promotion sets because the store may run different promotion campaigns. Therefore, we keep track of the ordered pairs $(S,P)$.
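Step 1 can be illustrated with the following sketch. It assumes each transaction is a dictionary with the hypothetical keys `store`, `week`, `product`, and `promo`, where `product` is the product ID obtained from the vendor mapping; the function name is likewise an illustration and not taken from the paper.

```python
from collections import defaultdict

def build_offer_and_promo_sets(transactions):
    """Return {(store, week): (offer_set, promo_set)}.

    transactions: iterable of dicts with keys 'store', 'week',
    'product', 'promo' (hypothetical layout). A product enters the
    offer set if purchased at least once in that store-week, and the
    promotion set if purchased on promotion at least once.
    """
    offer = defaultdict(set)
    promo = defaultdict(set)
    for t in transactions:
        key = (t["store"], t["week"])
        offer[key].add(t["product"])
        if t["promo"] == 1:
            promo[key].add(t["product"])
    # frozensets are hashable, so (S, P) pairs can later be used as keys.
    return {key: (frozenset(offer[key]), frozenset(promo[key]))
            for key in offer}
```

The collection of distinct `(S, P)` pairs is then `set(build_offer_and_promo_sets(tx).values())`.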

Step 2: Computing the sales fractions: For each offer set and promotion set combination

$(S,P) \in \mathcal{M}$, we computed the corresponding sales fractions $f_{j,S,P}$ as the fraction of times $j$ was purchased when the offer set was $S$ and the promotion set was $P$. More precisely, we set $M_{S,P} = \sum_{t \in \mathcal{T}} \mathbb{1}[S_{s_t,w_t} = S,\ P_{s_t,w_t} = P]$, the number of times $(S,P)$ was offered, and

$$f_{j,S,P} = \frac{1}{M_{S,P}} \sum_{t \in \mathcal{T}} \mathbb{1}[\, j_t = j \text{ and } S_{s_t,w_t} = S,\ P_{s_t,w_t} = P \,].$$

4 Since a product corresponds to a vendor, a product is considered to be on promotion even if only one UPC of that

vendor was purchased on promotion. This encoding does not capture the number of UPCs per vendor that are on

promotion, yielding only an approximation. An alternate encoding would create multiple promotion labels to capture

different promotion intensities, measured as the fraction of UPCs on promotion for each vendor. A similar encoding

can also be used to capture different levels of price discounts associated with price promotions.



Table EC.1 in Appendix E.1 reports the data summary; in particular, $|\mathcal{T}|$, $|\mathcal{M}|$, and $\frac{1}{|\mathcal{M}|} \sum_{(S,P) \in \mathcal{M}} M_{S,P} \cdot |S|$ for each of the 29 product categories.
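The counts $M_{S,P}$ and sales fractions $f_{j,S,P}$ of Step 2 can be computed as in the sketch below. The `(store, week, product, promo)` tuple layout and the function name are assumptions made for illustration; as in the formula, each transaction is counted once regardless of the quantity purchased.

```python
from collections import Counter, defaultdict

def sales_fractions(transactions):
    """Return (M, f) for a list of (store, week, product, promo) tuples.

    M[(S, P)] is the number of transactions observed under the offer
    set S and promotion set P; f[(j, S, P)] is the fraction of those
    transactions in which product j was purchased.
    """
    offer, promo = defaultdict(set), defaultdict(set)
    for s, w, j, i in transactions:
        offer[(s, w)].add(j)
        if i == 1:
            promo[(s, w)].add(j)
    M, purchases = Counter(), Counter()
    for s, w, j, _ in transactions:
        S, P = frozenset(offer[(s, w)]), frozenset(promo[(s, w)])
        M[(S, P)] += 1
        purchases[(j, S, P)] += 1
    f = {(j, S, P): count / M[(S, P)]
         for (j, S, P), count in purchases.items()}
    return M, f
```

Store-weeks that share the same pair `(S, P)` are pooled, so the fractions for each pair sum to one.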

Preparing promotion data for rationality loss computation: Because both the offer and

promotion sets vary in our dataset, we used the extension of our framework that can handle con-

strained rankings. Specifically, we encoded the promotion information by creating two copies of

each product—a non-promoted copy and a promoted copy—and by enlarging the product space to

2n items, consisting of the promoted and non-promoted copies of all n products. The key observa-

tion we made is that customers always prefer the promoted copy of a product over its non-promoted

copy. Therefore, we modeled the preferences of a population of customers through a distribution

over rankings of 2n items that are constrained, such that the promoted copies of products are

always preferred over their corresponding non-promoted copies. In other words, customers were

modeled through a distribution over a constrained set of rankings, and the rationality loss was

computed by finding the distribution that minimizes the loss.

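More precisely, let $(j,0)$ and $(j,1)$ denote the non-promoted and promoted copies of $j$, respectively, and let $\mathcal{P}^{\mathrm{con}}_{2n} \subseteq \mathcal{P}_{2n}$ denote the set of constrained rankings defined as $\mathcal{P}^{\mathrm{con}}_{2n} = \{ \sigma \in \mathcal{P}_{2n} : \sigma(j,1) < \sigma(j,0) \ \forall\, j \in N \}$. Correspondingly, we mapped each tuple $(S,P)$ to the set $A(S,P) = \{ (j,0) : j \in S \setminus P \} \cup \{ (j,1) : j \in P \}$, comprising the appropriate copy of each product. Let $\widetilde{\mathcal{M}}$ denote $\{ A(S,P) : (S,P) \in \mathcal{M} \}$, and let $M_{\widetilde{S}}$ and $f_{a,\widetilde{S}}$ respectively denote $M_{S,P}$ and $f_{j,S,P}$, where $\widetilde{S} = A(S,P)$ and $a = (j,1)$ if $j \in P$ and $a = (j,0)$ otherwise.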
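The encoding of promotions as item copies can be made concrete with a short sketch. The function names below are hypothetical, rankings are represented as tuples listing the most preferred item first, and the brute-force enumeration is only feasible for very small n; it serves to illustrate the constraint, not to suggest a practical algorithm.

```python
from itertools import permutations

def to_items(S, P):
    """A(S, P): promoted copy (j, 1) for j in P, non-promoted copy (j, 0) otherwise."""
    return frozenset((j, 1) if j in P else (j, 0) for j in S)

def is_constrained(ranking):
    """True if every promoted copy (j, 1) is ranked above its copy (j, 0).

    ranking: tuple of items, most preferred first; it must contain
    both copies of every product it mentions.
    """
    pos = {item: r for r, item in enumerate(ranking)}
    return all(pos[(j, 1)] < pos[(j, 0)] for (j, c) in ranking if c == 1)

def constrained_rankings(n):
    """All rankings of the 2n items that satisfy the promotion constraint."""
    items = [(j, c) for j in range(1, n + 1) for c in (0, 1)]
    return [r for r in permutations(items) if is_constrained(r)]
```

Each of the n products contributes one pairwise constraint, so exactly (2n)!/2^n of the (2n)! rankings are constrained; for n = 2 that is 6 of 24.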

6.2 Computing the Rationality and Parametric losses

For each category, we used the choice observations to compute the rationality loss and the para-

metric losses from fitting the single-class MNL (henceforth, simply MNL) and 15-class latent class

MNL (henceforth, simply LCMNL) models. The LCMNL model class subsumes the MNL model;

therefore, the parametric loss under the former is no larger than that under the latter. We

focused on computing the log-likelihood loss. For the MNL and LCMNL models, this corresponds

to solving the MLE problems.

Computing the rationality loss. We used our proposed framework to compute the rationality

loss. With the notation from the previous section, the rationality loss is given by

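$$\text{Rationality loss} = \min \Bigg\{ -\frac{1}{|\mathcal{T}|} \sum_{\widetilde{S} \in \widetilde{\mathcal{M}}} M_{\widetilde{S}} \sum_{a \in \widetilde{S}} f_{a,\widetilde{S}} \log\big( x^{\lambda}_{a,\widetilde{S}} / f_{a,\widetilde{S}} \big) \;\Bigg|\; \sum_{\sigma \in \mathcal{P}^{\mathrm{con}}_{2n}} \lambda(\sigma) = 1,\ \ \lambda(\sigma) \ge 0 \ \ \forall\, \sigma \in \mathcal{P}^{\mathrm{con}}_{2n},$$

$$\text{and } x^{\lambda}_{a,\widetilde{S}} = \sum_{\sigma \in \mathcal{P}^{\mathrm{con}}_{2n}} \lambda(\sigma) \cdot \mathbb{1}[\sigma, a, \widetilde{S}] \ \ \forall\, a \in \widetilde{S},\ \widetilde{S} \in \widetilde{\mathcal{M}} \Bigg\} \qquad (2)$$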



The above optimization problem is similar to the LoR problem in (1) but with the constraint that

the distribution λ is over the set of constrained rankings Pcon2n . As detailed in Section 5.3, our

theoretical results extend naturally for solving (2).

We solved the above constrained convex program using the Frank-Wolfe (FW) algorithm, a classic algorithm for solving constrained convex programs. We implemented the following standard variant (see Algorithms 1, 2, and 3 in Jaggi 2013): let $x^{(k)}$ be the solution of (2) at the end of iteration $k$. We update the solution in iteration $k+1$ by solving the Rank Aggregation LP with $c_{a,\widetilde{S}} = \frac{\partial}{\partial x_{a,\widetilde{S}}} \mathrm{loss}(f_{\widetilde{\mathcal{M}}}, x) \big|_{x = x^{(k)}}$ for all $a \in \widetilde{S}$, $\widetilde{S} \in \widetilde{\mathcal{M}}$, but over the constrained set of rankings $\mathcal{P}^{\mathrm{con}}_{2n}$, to obtain the optimal solution $s^{*(k)}$, and then by performing the update $x^{(k+1)} := (1 - \gamma^*)\, x^{(k)} + \gamma^* s^{*(k)}$, where $\gamma^*$ is obtained through a line search: $\gamma^* = \arg\min_{\delta \in [0,1]} \mathrm{loss}\big(f_{\widetilde{\mathcal{M}}},\ x^{(k)} + \delta \cdot (s^{*(k)} - x^{(k)})\big)$.

We started the algorithm with the solution obtained from fitting the LCMNL model (as detailed

below) as the initial solution, and determined s∗(k) by solving the constrained Rank Aggregation

LP over the k-deletion polyhedron $H(\mathrm{Del}_k)$, described in Corollary 5.10, with $k = 4$. We used

this approximation because we observed that at most four (out of ten) products were not offered

in most of the offer sets across the categories. By carrying out the iterations until an appropriate

stopping condition is met, we obtained the optimal loss.
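The Frank-Wolfe iteration described above can be sketched on a toy instance as follows. This is an illustration under several simplifying assumptions that differ from the procedure in the text: the linear subproblem is solved by enumerating a user-supplied list of rankings rather than by the Rank Aggregation LP over the k-deletion polyhedron, the starting point is the uniform mixture over those rankings rather than the LCMNL fit, the line search is a ternary search, and a fixed iteration count replaces the stopping condition. All function names are hypothetical, and every item with a positive observed fraction is assumed to be the top choice of at least one supplied ranking.

```python
import math
from itertools import permutations

def chosen(ranking, S):
    """Most preferred item of offer set S under `ranking` (best item first)."""
    return next(item for item in ranking if item in S)

def kl_loss(data, x):
    """Loss of the form in (2). data: {S: (M_S, {a: f_aS})}, x: {(a, S): prob}."""
    total = sum(M for M, _ in data.values())
    value = 0.0
    for S, (M, f) in data.items():
        for a, fa in f.items():
            if fa > 0:
                if x[(a, S)] <= 0:
                    return math.inf
                value -= M * fa * math.log(x[(a, S)] / fa)
    return value / total

def frank_wolfe(data, rankings, iters=200):
    keys = [(a, S) for S in data for a in S]
    total = sum(M for M, _ in data.values())
    # Each ranking induces a 0/1 vertex: 1 if it picks a from S, else 0.
    vertices = [{(a, S): float(chosen(r, S) == a) for (a, S) in keys}
                for r in rankings]
    # Assumption: start from the uniform mixture over the given rankings.
    x = {k: sum(v[k] for v in vertices) / len(vertices) for k in keys}

    def step(x, s, d):
        return {k: x[k] + d * (s[k] - x[k]) for k in keys}

    for _ in range(iters):
        grad = {}
        for (a, S) in keys:
            M, f = data[S]
            fa = f.get(a, 0.0)
            grad[(a, S)] = -M * fa / (total * x[(a, S)]) if fa > 0 else 0.0
        # Linear minimization by enumeration (stand-in for the LP).
        s = min(vertices, key=lambda v: sum(grad[k] * v[k] for k in keys))
        # Ternary search for the step size on the convex 1-D restriction.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
            if kl_loss(data, step(x, s, m1)) <= kl_loss(data, step(x, s, m2)):
                hi = m2
            else:
                lo = m1
        x = step(x, s, (lo + hi) / 2)
    return x, kl_loss(data, x)
```

For example, with three items and `rankings = list(permutations("abc"))`, choice fractions that violate regularity (an item chosen more often from a larger offer set than from a subset of it) yield a strictly positive loss, whereas fractions generated by some distribution over rankings drive the loss toward zero.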

Computing the parametric losses for the MNL and LCMNL models: For the MNL

and LCMNL models, we computed the total loss for each category by maximizing the log-likelihood

function. The resulting optimization problems are standard for these models, and we defer the

details to Appendix E.2. The parametric losses under the MNL and LCMNL models are then

obtained by subtracting the rationality loss from their respective total losses.
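The decomposition into rationality and parametric losses can be illustrated for the MNL model with the sketch below. It assumes the MNL utilities have already been estimated (the estimation itself, deferred to Appendix E.2 in the text, is not reproduced here), evaluates the same KL-type loss as in (2) at the MNL choice probabilities, and subtracts a given rationality loss. The function names and the data layout `{S: (M_S, {a: f_aS})}` are hypothetical.

```python
import math

def mnl_probs(utilities, S):
    """MNL choice probabilities over offer set S for given utilities."""
    z = sum(math.exp(utilities[a]) for a in S)
    return {a: math.exp(utilities[a]) / z for a in S}

def mnl_total_loss(data, utilities):
    """KL-type total loss of an MNL model. data: {S: (M_S, {a: f_aS})}."""
    total = sum(M for M, _ in data.values())
    value = 0.0
    for S, (M, f) in data.items():
        p = mnl_probs(utilities, S)
        value -= sum(M * fa * math.log(p[a] / fa)
                     for a, fa in f.items() if fa > 0)
    return value / total

def parametric_loss(total_loss, rationality_loss):
    """Parametric loss = total loss under the model minus rationality loss."""
    return total_loss - rationality_loss
```

When the observed fractions coincide with the MNL probabilities on every offer set, the total loss is zero, and hence so are both of its components.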

6.3 Results and Discussion

For each category, we report the rationality loss and the parametric losses we computed. In this section, we discuss (a) the relationship between rationality and parametric losses (Section 6.3.1), (b) the use of the rationality loss as a diagnostic tool for model selection (Section 6.3.2), (c) the benefits of going beyond the RUM family (Section 6.3.3), and (d) the relationship between rationality loss and market concentration (Section 6.3.4). Appendix E.6 also presents an additional discussion on the robustness

of our results under different loss metrics.

6.3.1 Rationality loss vis-a-vis parametric losses. Figure 2 shows a bar graph of the

different categories, arranged in descending order of their respective total losses under the MNL

model. Each bar is composed of three losses: the rationality loss, the LCMNL parametric loss,

and the difference between the parametric losses of the LCMNL and MNL models, represented



in this order. For each category, the total length of the three segments corresponds to the total loss under the MNL model. What is immediately striking from the bar graph in Figure 2 is that

the rationality loss accounts for most of the total losses across the categories, with the parametric

loss being relatively small. Fitting more complex RUM models, such as the LCMNL model, can

reduce or even eliminate (for some categories) the parametric loss; the rationality loss cannot be

reduced unless one goes beyond the RUM class. This finding strongly supports and encourages

work on relaxing the rationality assumption. In Appendix E.3, we provide a detailed breakdown

of the parametric losses under different models.

6.3.2 Diagnostic tool for model selection. The rationality losses provide guidance for

model selection, particularly in (a) choosing the number of mixture components and in (b) going

beyond rationality. Specifically, the bar graph provides a bound on the number of mixture compo-

nents for some of the categories. For instance, fitting the 15-class LCMNL model clearly eliminates

(essentially) all the parametric losses for the categories Spaghetti/Italian Sauce, Peanut Butter,

Toilet Tissue, Beer, and Blades. Consequently, 15 is an upper bound on the number of mixture

components for these categories because any further increase will increase model complexity (thereby

potentially hurting its out-of-sample predictive power) without any benefit to the in-sample fit.

The exact number of mixture components can then be determined using penalized likelihood meth-

ods (see Budanova 2016), which require a bound on the number of mixture components as an

input. The bounds for other categories can be determined by increasing the number of mixture

components until their respective parametric losses become zero.
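The search for an upper bound on the number of mixture components can be written as a simple loop. In the sketch below, `total_loss_for_k` is a hypothetical callable that fits a K-class LCMNL model and returns its total loss; the tolerance `tol`, which stands in for "the parametric loss becomes zero", and the cap `max_k` are likewise assumptions of this illustration.

```python
def component_upper_bound(total_loss_for_k, rationality_loss, max_k=50, tol=1e-3):
    """Smallest K whose parametric loss (total minus rationality) is within tol.

    total_loss_for_k: hypothetical callable K -> total loss of a fitted
    K-class model. Returns None if no K up to max_k meets the tolerance.
    """
    for k in range(1, max_k + 1):
        if total_loss_for_k(k) - rationality_loss <= tol:
            return k
    return None
```

The returned value can then serve as the bound required as input by a penalized likelihood method.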

Our method also provides guidance on when a practitioner should go beyond rationality. To aid

this, we plotted the rationality loss and the total loss under the MNL model for all the categories

on a scatter plot. Figure 3 represents each category as a point, with the vertical axis value denoting

the rationality loss and the horizontal axis denoting the total loss under the MNL model. As

expected, all the points are found below the 45° line, indicating that the rationality loss is always

less than or equal to the total loss. Model selection, in practice, often means choosing a model

that meets a certain performance threshold. For the purposes of illustration, suppose it is specified

that a log-likelihood loss of less than or equal to 0.03 is acceptable; this loss corresponds to a

mean absolute error of less than 2% between the predicted and the observed sales fractions. The

acceptability threshold naturally splits the plot into three regions, as follows:

1. Region 1: MNL acceptable. This region is found to the left of the vertical line passing through

Total loss = 0.03 and lying below the 45° line. It consists of all the product categories for which

the total loss under the MNL model is acceptable; therefore, a practitioner may stop at the

MNL model for these categories.


[Figure 2: bar chart with one bar per product category. Loss axis: KL-divergence loss (×10⁻²), from 0 to 8. Legend: Rationality Loss; LCMNL Param. Loss; MNL Param. Loss minus LCMNL Param. Loss.]

Figure 2 The rationality loss, the parametric losses under the LCMNL model, and the differences between the parametric losses of the MNL and LCMNL models. The categories are arranged in decreasing order of the total loss under the MNL model. The rationality loss accounts for the largest portion of total losses across the categories.

2. Region 2: fit more complex RUM models. This region is located below the horizontal line through

Rationality loss = 0.03 and to the right of the vertical line through Total loss = 0.03. It consists

of categories for which the total losses (under the MNL model) are not acceptable, but the

rationality losses are acceptable, so the practitioner can obtain acceptable levels of performance

by fitting more complex RUM models, such as K-class LCMNL models for some value of K > 1.

3. Region 3: RUM unacceptable. This region is found below the 45° line and above the horizontal

line through Rationality loss = 0.03, and it consists of 10 (out of 29) categories for the chosen

value of the threshold. For these categories, fitting more complex RUM models does not result in

an acceptable performance. The observed choices are significantly irrational for these categories,

and one must go beyond the RUM class to attain acceptable performance levels.
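The three regions above amount to a simple decision rule, sketched below. The function name, the returned labels, and the small numerical slack in the consistency check are choices made for this illustration; the default threshold of 0.03 is the illustrative value used in the text.

```python
def classify_category(rationality_loss, mnl_total_loss, threshold=0.03):
    """Assign a category to Region 1, 2, or 3 for a given loss threshold."""
    if rationality_loss > mnl_total_loss + 1e-12:
        # The rationality loss is a lower bound on the total loss of any RUM model.
        raise ValueError("rationality loss cannot exceed the total loss")
    if mnl_total_loss <= threshold:
        return "Region 1: MNL acceptable"
    if rationality_loss <= threshold:
        return "Region 2: fit more complex RUM models"
    return "Region 3: RUM unacceptable"
```

For instance, a category with rationality loss 0.02 and MNL total loss 0.05 falls in Region 2, since only the parametric part of the loss pushes it above the threshold.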

We also ran the same analysis using transactions data from weeks 27 and 28 (about six months

later) of the same year 2007. We found that about 79% of product categories were classified into

the same regions in both two-week time periods; see Appendix E.7 for more details. This analysis

suggests that our product classification in Figure 3 is robust over time.

6.3.3 Benefits of going beyond the RUM family. The key benefit of our approach is

that it allows us to identify the product categories in Region 3 of Figure 3. For these categories,

we now illustrate the benefits of going beyond the RUM model. Notably, there is no de facto

model outside the RUM class that can be readily fit to our data. We selected the generalized

attraction model (GAM) class, which was proposed by Gallego et al. (2014) as a generalization


[Figure 3: scatter plot with one labeled point per product category. Horizontal axis: total loss (KL-divergence, ×10⁻²), from 0 to 8. Vertical axis: rationality loss (KL-divergence, ×10⁻²), from 0 to 8. Marked regions: Region 1 (MNL acceptable), Region 2 (fit more complex RUM models), Region 3 (RUM unacceptable).]

Figure 3 Scatter plot of the rationality loss and the total loss under the MNL model for the 29 product categories. Assuming that a loss of < 0.03 is acceptable, we identified the categories (in Region 3) for which going beyond the RUM class is necessary to obtain an acceptable performance.

of the basic attraction model, of which the MNL model is a special case, to address the issue of

overestimation of demand recapture in RM applications. For particular values of the parameters

(also called “shadow” attractions), the GAM model can be shown to be outside the RUM class.

We extended the GAM model class for our purposes and fitted a latent-class GAM (LC-GAM)

model with five classes to our choice data. We verified that the resulting model instances are

indeed outside the RUM class. We found that the LC-GAM model allows us to breach the LoR

for most of the 29 product categories (see Figure 4). Indeed, the LC-GAM model attains an

acceptable performance (loss ≤ 0.03) for the Margarine/Butter (margbutr) category, which belongs

to Region 3. See Appendix E.4 for details. We note that the threshold of 0.03 was chosen

for illustration; the appropriate threshold will depend on the application, and correspondingly, the

classification of the categories into different regions may be different. At the end of Appendix E.4,

we also discuss the benefits and downside risks of going beyond the RUM family.

6.3.4 Factors that might influence rationality loss. Another observation we make from

the bar graph in Figure 2 is that a large variation exists in the rationality losses across product

categories. With the rationality loss interpreted as a measure of the degree to which customers are

irrational, this variation indicates that customer purchase behavior is highly irrational for some

categories (e.g., Coffee and Yogurt) and less so for the others (e.g., Diapers and Blades). The data

to which we had access are insufficient to provide a convincing explanation for why customers are

more irrational when purchasing from some categories than from others. Nevertheless, to obtain

a partial explanation, we computed the correlation between the rationality losses and the various

category-level features.


[Figure 4: scatter plot with one labeled point per product category and model. Horizontal axis: total loss (KL-divergence, ×10⁻²), from 0 to 8. Vertical axis: rationality loss (KL-divergence, ×10⁻²), from 0 to 8. Legend: LC-GAM; MNL. Marked regions: Region 1 (MNL acceptable), Region 2, Region 3 (RUM unacceptable).]

Figure 4 Scatter plot of the rationality loss and the total losses under the MNL and five-class LC-GAM models for the 29 product categories. The fitted LC-GAM models are outside the RUM class and are able to breach the LoR for most categories. For margbutr, LC-GAM obtains an acceptable loss (≤ 0.03).

We found that the rationality loss is strongly negatively correlated with the market concentration of the

category, defined as the KL-divergence between the uniform distribution and the market share

distribution across the different products in the category. A category in which a small number

of products capture most of the market will have a market share distribution that is far from a

uniform distribution, and thus a high market concentration. On average, product categories with

a low market concentration (e.g., Yogurt) are more irrational than those with a high market

concentration (e.g., Diapers). This observation is consistent with actual practice. A category with

low market concentration, such as Yogurt, often has frequent product introductions, and customers

tend to be variety seeking (Baltas et al. 2017), switching their purchases from one week to the next,

resulting in non-transitive preferences for each customer. On the other hand, purchases within the

Diapers category, with high market concentration, tend to be concentrated over a few products,

with customers generally remaining brand loyal across different purchase instances. This purchase

behavior results in preferences that are generally transitive for each customer and with a lower

rationality loss.
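The market concentration measure can be illustrated with the sketch below. The text does not specify the direction of the KL-divergence; this illustration assumes KL(market shares ‖ uniform), which equals log n minus the entropy of the share distribution and remains finite when some product has zero share. The function name is hypothetical.

```python
import math

def market_concentration(shares):
    """KL(shares || uniform) over n products (direction is an assumption).

    shares: nonnegative sales volumes or shares; they are normalized
    to sum to one before the divergence is computed.
    """
    n = len(shares)
    total = sum(shares)
    p = [s / total for s in shares]
    # Terms with zero share contribute zero by the usual 0 * log 0 convention.
    return sum(pi * math.log(pi * n) for pi in p if pi > 0)
```

Under this convention the measure is 0 for perfectly even shares and reaches its maximum, log n, when a single product captures the whole market.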

In addition to the market concentration variable, we also explored the correlation between the

rationality loss and a hedonic indicator⁵, taking the value 1 if a product category possesses hedonic

features (those that are strongly associated with affective sensation such as food and soft drinks;

see, e.g., Hoyer and Ridgway 1984, Kahn and Lehmann 1991) and 0 otherwise. Using the rule that

product categories corresponding to food/snacks (excl. condiments), drinks, or cigarettes are hedo-

nic, while the rest are not, we observed a strong positive correlation between rationality loss and

the hedonic indicator variable. Existing research establishes that consumers exhibit complex choice

5 We thank an anonymous referee for pointing out the connection between the rationality loss and hedonic products.



behaviors for hedonic product categories, including preference reversal, which violates traditional

utility theory (Okada 2005). These complex choice behaviors are not sufficiently well-captured by

transitive preference orderings, leading to high rationality losses. The details of our analysis are

presented in Appendix E.5.

Our findings indicate that going beyond the RUM class may be particularly fruitful for product

categories with low market concentrations (or high variety seeking) and hedonic attributes.

7. Conclusion and Extensions

Motivated by widely accepted observations that customer preferences may not be rational or con-

sistent, we considered the problem of quantifying the LoR in choice modeling. We formulated the

problem as an optimization problem to find the probability distribution over rankings that is clos-

est to the observed choice data. Our key theoretical contributions include the concepts of rational

separation and choice graph that have allowed us to characterize the source of computational com-

plexity in terms of the structural properties of a graph. We applied our methodology to real-world

grocery sales data to identify product categories for which going beyond rational choice models is necessary to obtain an acceptable performance.

Our work also lays the groundwork for a number of interesting future research directions. The

immediate extensions include developing approximation techniques to solve the rank aggregation

problem for choice graphs with unbounded choice depth, reconciling the LP formulation when the

choice graph is a tree with the general LP on the tree composition of a choice graph, and building

algorithms to automatically construct choice graphs from data.

Acknowledgments: We are grateful to Professor Vishal Gaur, the associate editor, and the three

anonymous referees for their detailed, thoughtful, and insightful comments, which helped substan-

tially improve our manuscript.

References

Ailawadi, K. L., B. A. Harlam, J. Cesar, D. Trounce. 2006. Promotion profitability for a retailer: the role of

promotion, brand, category, and store characteristics. Journal of Marketing Research 43(4) 518–535.

Ali, A., M. Melia. 2012. Experiments with Kemeny ranking: What works when? Mathematical Social Sciences

64(1) 28–40.

Arnborg, S., D. Corneil, A. Proskurowski. 1987. Complexity of finding embeddings in a k-tree. SIAM Journal

on Matrix Analysis and Applications 8(2) 277–284.

Baltas, G., F. Kokkinaki, A. Loukopoulou. 2017. Does variety seeking vary between hedonic and utilitarian

products? the role of attribute type. Journal of Consumer Behaviour .



Barbera, S., P. K. Pattanaik. 1986. Falmagne and the rationalizability of stochastic choices in terms of

random orderings. Econometrica 54(3) 707–715.

Block, H. D., J. Marschak. 1960. Random orderings and stochastic theories of responses. Olkin, Ghurye,

Hoeffding, Madow, Mann, eds., Contributions to Probability and Statistics,. Stanford University Press,

Stanford, CA, 97–132.

Bodlaender, H. L., A. M. C. A. Koster. 2011. Treewidth computations II. Lower bounds. Information and

Computation 209(7) 1103–1119.

Bodlaender, H. L., T. Wolle, A. M. C. A. Koster. 2006. Contraction and treewidth lower bounds. Journal

of Graph Algorithms and Applications 10(1) 5–49.

Bronnenberg, B. J., M. W. Kruger, C. F. Mela. 2008. Database paper: The IRI marketing data set. Marketing

Science 27(4) 745–748.

Bronnenberg, B. J., C. F. Mela. 2004. Market roll-out and retailer adoption for new brands. Marketing

Science 23(4) 500–518.

Bubeck, S. 2015. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine

Learning 8(3-4) 231–357.

Budanova, S. 2016. Penalized maximum likelihood estimation of finite mixture models. Northwestern University Working Paper.

Cachon, G. P., C. Terwiesch, Y. Xu. 2005. Retail assortment planning in the presence of consumer search.

Manufacturing & Service Operations Management 7(4) 330–346.

Chernev, A., U. Bockenholt, J. Goodman. 2015. Choice overload: A conceptual review and meta-analysis.

Journal of Consumer Psychology 25(2) 333–358.

Crowley, A. E., E. R. Spangenberg, K. R. Hughes. 1992. Measuring the hedonic and utilitarian dimensions

of attitudes toward product categories. Marketing Letters 3(3) 239–249.

Davis, J., G. Gallego, H. Topaloglu. 2011. Assortment optimization under variants of the nested logit model.

Working Paper, Cornell University.

Dhar, R., K. Wertenbroch. 2000. Consumer choice between hedonic and utilitarian goods. Journal of

marketing research 37(1) 60–71.

Diestel, R. 2005. Graph Theory . Springer, New York.

Dwork, C., R. Kumar, M. Naor, D. Sivakumar. 2001. Rank aggregation methods for the web. Proceedings

of the 10th International Conference on World Wide Web. ACM, New York, NY, 613–622.

Falmagne, J. C. 1978. A representation theorem for finite random scale systems. Journal of Mathematical

Psychology 18(1) 52–72.

Farias, V. F., S. Jagabathula, D. Shah. 2013. A non-parametric approach to modeling choice with limited

data. Management Science 59(2) 305–322.

Fomin, F. V., D. Kratsch. 2010. Exact exponential algorithms. Springer Science & Business Media, Berlin,

Heidelberg.

Frank, M., P. Wolfe. 1956. An algorithm for quadratic programming. Naval Research Logistics Quarterly

3(1) 95–110.

Gallego, G., R. Ratliff, S. Shebalov. 2014. A general attraction model and sales-based linear program for

network revenue management under customer choice. Operations Research 63(1) 212–232.

Givon, M. 1984. Variety seeking through brand switching. Marketing Science 3(1) 1–22.



Gruen, T. W., D. Corsten. 2008. A comprehensive guide to retail out-of-stock reduction in the fast-moving

consumer goods industry. Tech. rep., Grocery Manufacturers of America, Washington, DC.

Halin, R. 1976. S-functions for graphs. Journal of Geometry 8(1) 171–186.

Holbrook, M. B., E. C. Hirschman. 1982. The experiential aspects of consumption: Consumer fantasies,

feelings, and fun. Journal of consumer research 9(2) 132–140.

Hoyer, W. D., N. M. Ridgway. 1984. Variety seeking as an explanation for exploratory purchase behavior:

A theoretical model. Thomas C. Kinnear, ed., Advances in Consumer Research, vol. 11. Association

for Consumer Research, Provo, UT, 114–119.

Iyengar, S. S., M. R. Lepper. 2000. When choice is demotivating: Can one desire too much of a good thing?

Journal of personality and social psychology 79(6) 995.

Jagabathula, S. 2011. Nonparametric choice modeling: Applications to operations management. Ph.D. thesis,

MIT.

Jagabathula, Srikanth, Paat Rusmevichientong. 2017. A nonparametric joint assortment and price choice

model. Management Science 63(9) 3128–3145.

Jaggi, M. 2013. Revisiting Frank–Wolfe: Projection-free sparse convex optimization. S. Dasgupta,

D. McAllester, eds., Proceedings of the 30th International Conference on Machine Learning . Atlanta,

GA, 427–435.

Kahn, B. E., M. U. Kalwani, D. G. Morrison. 1986. Measuring variety-seeking and reinforcement behaviors

using panel data. Journal of Marketing Research 23(2) 89–100.

Kahn, B. E., D. R. Lehmann. 1991. Modeling choice among assortments. Journal of retailing 67(3) 274–299.

Kenyon-Mathieu, C., W. Schudy. 2007. How to rank with few errors. Proceedings of the Thirty-ninth Annual

ACM Symposium on Theory of Computing . STOC ’07, New York, NY, 95–103.

Kivetz, R., I. Simonson. 2002a. Earning the right to indulge: Effort as a determinant of customer preferences

toward frequency program rewards. Journal of Marketing Research 39(2) 155–170.

Kivetz, R., I. Simonson. 2002b. Self-control for the righteous: Toward a theory of precommitment to indul-

gence. Journal of Consumer Research 29(2) 199–217.

Lacoste-Julien, S., M. Jaggi. 2015. On the global linear convergence of Frank-Wolfe optimization variants.

Proceedings of the 28th International Conference on Neural Information Processing Systems. NIPS’15,

MIT Press, Cambridge, MA, USA, 496–504.

Mas-Colell, A., M. D. Whinston, J. R. Green. 1995. Microeconomic Theory , vol. 1. Oxford University Press,

New York.

McFadden, D. L. 2005. Revealed stochastic preference: A synthesis. Economic Theory 26(2) 245–264.

McFadden, D. L., M. K. Richter. 1990. Stochastic rationality and revealed stochastic preference. Preferences,

Uncertainty, and Optimality, Essays in Honor of Leo Hurwicz . Westview Press, Boulder, CO, 161–186.

Nesterov, Y. 2013. Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer Science

& Business Media.

Nijs, V. R., S. Srinivasan, K. Pauwels. 2007. Retail-price drivers and retailer profits. Marketing Science

26(4) 473–487.

O’curry, S., M. Strahilevitz. 2001. Probability and mode of acquisition effects on choices between hedonic

and utilitarian options. Marketing Letters 12(1) 37–49.

Okada, E. M. 2005. Justification effects on consumer choice of hedonic and utilitarian goods. Journal of




Robertson, N., P. D. Seymour. 1986. Graph minors. II. Algorithmic aspects of tree-width. Journal of

Algorithms 7(3) 309–322.

Rose, D., G. Lueker, R. E. Tarjan. 1976. Algorithmic aspects of vertex elimination on graphs. SIAM Journal

on Computing 5(2) 266–283.

Talluri, K., G. J. van Ryzin. 2004. Revenue management under a general discrete choice model of consumer

behavior. Management Science 50(1) 15–33.

Tarjan, R. E. 1972. Depth-first search and linear graph algorithms. SIAM Journal on Computing 1(2)

146–160.

Train, K. 2009. Discrete Choice Methods with Simulation. 2nd ed. Cambridge University Press.

Tversky, A., D. Kahneman. 1974. Judgment under uncertainty: Heuristics and biases. Science 185(4157)

1124–1131.

Tversky, A., P. Slovic, D. Kahneman. 1990. The causes of preference reversal. American Economic Review

80(1) 204–217.

van Ryzin, G. J., S. Mahajan. 1999. On the relationship between inventory costs and variety benefits in

retail assortments. Management Science 45 1496–1509.

van Ryzin, G. J., G. Vulcano. 2015. A market discovery algorithm to estimate a general class of non-

parametric choice model. Management Science 61(2) 281–300.

van Trijp, H. C. M. 1995. Variety-seeking in product choice behavior: Theory with applications in the food

domain. Ph.D. thesis, Wageningen University.

van Trijp, H. C. M., W. D. Hoyer, J. J. Inman. 1996. Why switch? Product category-level explanations for

true variety-seeking behavior. Journal of Marketing Research 33(3) 281–292.

Vinod, B. 2006. Advances in inventory control. Journal of Revenue and Pricing Management 4(4) 367.

Wertenbroch, K. 1998. Consumption self-control by rationing purchase quantities of virtue and vice.

Marketing Science 17(4) 317–337.

e-companion to The Limit of Rationality ec1


Online Appendix

The Limit of Rationality in Choice Modeling:

Formulation, Computation, and Implications

Srikanth Jagabathula
Stern School of Business, New York University, New York, NY 10012, [email protected]

Paat Rusmevichientong
Marshall School of Business, University of Southern California, Los Angeles, CA 90089, [email protected]

Appendix A: Proofs for Section 3

A.1 Proof of Theorem 3.4

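We prove this result by induction on the number of sets in $\mathcal{M}$. If $|\mathcal{M}| = 1$ or $2$, the result is trivially true. So, suppose that the result is true for $|\mathcal{M}| = m-1$. Consider the case where $|\mathcal{M}| = m$, with $\mathcal{M} = \{S_1, \ldots, S_m\}$ and $S_1 \subset S_2 \subset \cdots \subset S_m$. Let $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{C}$ be arbitrary disjoint sub-collections such that $\mathcal{C}$ separates $\mathcal{A}$ from $\mathcal{B}$ in the line graph; that is, every path from a set $A \in \mathcal{A}$ to a set $B \in \mathcal{B}$ must pass through some set in $\mathcal{C}$. We now show that $\mathcal{A} \perp \mathcal{B} \mid \mathcal{C}$. There are four cases to consider:

Case 1: $S_m \notin \mathcal{A} \,\dot\cup\, \mathcal{B} \,\dot\cup\, \mathcal{C}$. The result then follows immediately from the inductive hypothesis applied to $\mathcal{M} \setminus \{S_m\}$.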

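Case 2: $S_m \in \mathcal{C}$. Suppose that $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$ and $\mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$. We want to show that $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$.

Let $\bar{\mathcal{C}} = \mathcal{C} \setminus \{S_m\}$. Note that $\bar{\mathcal{C}}$ still separates $\mathcal{A}$ and $\mathcal{B}$ in the line graph because $S_m$ is the rightmost vertex. Also, $\mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) = \mathcal{D}_{\bar{\mathcal{C}}}(z_{\bar{\mathcal{C}}}) \cap \mathcal{D}_{S_m}(z_{S_m})$, which implies that $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \subseteq \mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\bar{\mathcal{C}}}(z_{\bar{\mathcal{C}}})$. Therefore, $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$ implies $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\bar{\mathcal{C}}}(z_{\bar{\mathcal{C}}}) \neq \emptyset$. Similarly, we must have that $\mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\bar{\mathcal{C}}}(z_{\bar{\mathcal{C}}}) \neq \emptyset$. Then, by invoking the induction hypothesis for $\mathcal{M} \setminus \{S_m\}$, we obtain that $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\bar{\mathcal{C}}}(z_{\bar{\mathcal{C}}}) \neq \emptyset$. This is almost what we want, but we need one additional argument to complete the proof.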

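Let $X$ be the rightmost vertex among $\mathcal{A} \,\dot\cup\, \mathcal{B} \,\dot\cup\, \bar{\mathcal{C}}$, which corresponds to the vertex that is closest to $S_m$; this is well-defined because $G$ is a line graph. The fact that $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$ and $\mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$ implies that either (1) $z_{S_m} \in S_m \setminus X$ or (2) $z_{S_m} = z_X$. Now consider any ranking $\sigma \in \mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\bar{\mathcal{C}}}(z_{\bar{\mathcal{C}}})$ (such a $\sigma$ exists because of our arguments above). We modify $\sigma$ as follows. If $z_{S_m} \in S_m \setminus X$, then we obtain the modified list $\bar\sigma$ by moving the product $z_{S_m}$ to the top of the list $\sigma$. Because $z_{S_m}$ is not present in any of the sets in $\mathcal{A} \,\dot\cup\, \mathcal{B} \,\dot\cup\, \bar{\mathcal{C}}$, their choices are not affected.

On the other hand, if $z_{S_m} = z_X$, we obtain $\bar\sigma$ by moving all the products in $S_m \setminus X$ to the bottom of the list $\sigma$. Again, none of the other choices are affected because $S_m \setminus X$ does not overlap with any of the sets in $\mathcal{A} \,\dot\cup\, \mathcal{B} \,\dot\cup\, \bar{\mathcal{C}}$. Therefore, in both cases, we must have $\bar\sigma \in \mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}})$, implying that $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$. This is the desired result.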

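Case 3: $S_m \in \mathcal{A}$. Suppose that $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$ and $\mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$. We want to show that $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$.

Let $\bar{\mathcal{A}} = \mathcal{A} \setminus \{S_m\}$. Note that $\mathcal{C}$ still separates $\bar{\mathcal{A}}$ and $\mathcal{B}$ in the line graph because $S_m$ is the rightmost vertex. Also, $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) = \mathcal{D}_{\bar{\mathcal{A}}}(z_{\bar{\mathcal{A}}}) \cap \mathcal{D}_{S_m}(z_{S_m})$. As above, this implies that $\mathcal{D}_{\bar{\mathcal{A}}}(z_{\bar{\mathcal{A}}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$ and $\mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$. Then, by invoking the induction hypothesis for the collection $\mathcal{M} \setminus \{S_m\}$, we have that $\mathcal{D}_{\bar{\mathcal{A}}}(z_{\bar{\mathcal{A}}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$. As before, let $X$ be the rightmost vertex among $\bar{\mathcal{A}} \,\dot\cup\, \mathcal{B} \,\dot\cup\, \mathcal{C}$, which corresponds to the vertex that is closest to $S_m$; this is well-defined because $G$ is a line graph. Note that it must be the case that $X \in \bar{\mathcal{A}} \,\dot\cup\, \mathcal{C}$ because $S_m \in \mathcal{A}$ and $\mathcal{C}$ separates $\mathcal{A}$ from $\mathcal{B}$. The fact that $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$ implies that either (1) $z_{S_m} \in S_m \setminus X$ or (2) $z_{S_m} = z_X$. Once again, using the same arguments as above, we can modify a ranking $\sigma \in \mathcal{D}_{\bar{\mathcal{A}}}(z_{\bar{\mathcal{A}}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}})$ to obtain a ranking $\bar\sigma \in \mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}})$, implying that $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$. This is the desired result.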

Case 4: Sm ∈B. This case is the same as Case 3 by symmetry.

In all four cases, we see that we have $\mathcal{A} \perp \mathcal{B} \mid \mathcal{C}$. By induction, the result holds for all nested collections.

A.2 Proof of Theorem 3.5


We first show that the graph $\mathbb{T} = (\mathcal{M}, E)$, as constructed in Section 3.2, is a forest, that is, a collection of mutually disjoint rooted trees, where the root of each tree is the vertex corresponding to the largest subset among all the subsets that are a part of that tree. For that, it suffices to show that $\mathbb{T}$ does not contain a cycle. Suppose, on the contrary, that $\mathbb{T}$ does contain a cycle, say $\langle S_0, S_1, \ldots, S_t, S_{t+1} = S_0 \rangle$, where $S_0, S_1, \ldots, S_t$ are distinct subsets. Let $S_\ell$ be a minimal set among $S_0, \ldots, S_t$, so that for all $j \neq \ell$, $S_j \not\subseteq S_\ell$. Consider the edges $\{S_{\ell-1}, S_\ell\}$ and $\{S_\ell, S_{\ell+1}\}$ in the cycle. Because $\mathcal{M}$ is laminar and $S_\ell$ is minimal, it must be that $S_\ell \subseteq S_{\ell+1}$ and $S_\ell \subseteq S_{\ell-1}$; so $S_{\ell-1} \cap S_{\ell+1} \neq \emptyset$, and therefore, either $S_{\ell-1} \subseteq S_{\ell+1}$ or $S_{\ell+1} \subseteq S_{\ell-1}$. If $S_{\ell-1} \subseteq S_{\ell+1}$, then we have $S_\ell \subseteq S_{\ell-1} \subseteq S_{\ell+1}$, so there cannot be an edge between $S_\ell$ and $S_{\ell+1}$ because they are not adjacent. This is a contradiction. Similarly, if $S_{\ell+1} \subseteq S_{\ell-1}$, then $S_\ell \subseteq S_{\ell+1} \subseteq S_{\ell-1}$, and there cannot be an edge between $S_\ell$ and $S_{\ell-1}$, which is again a contradiction. Therefore, it must be the case that $\mathbb{T}$ has no cycle and is

therefore a forest.
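
The forest structure can be illustrated with a short sketch. The construction of Section 3.2 is not reproduced in this appendix, so the code below assumes that each set is joined to its smallest strict superset in the collection (its "parent"); the function name `laminar_forest` is hypothetical. In a laminar collection, the strict supersets of a set form a chain, so the parent is unique and no cycle can arise.

```python
def laminar_forest(sets):
    """Map each set to its parent: the smallest strict superset in the
    collection, or None for a maximal set. Assumes the input is laminar
    (any two sets are either disjoint or nested)."""
    sets = [frozenset(s) for s in sets]
    parent = {}
    for s in sets:
        # Strict supersets of s form a chain in a laminar collection,
        # so the one with the fewest elements is the adjacent set.
        supersets = [t for t in sets if s < t]
        parent[s] = min(supersets, key=len) if supersets else None
    return parent


parent = laminar_forest([{1}, {2}, {1, 2}, {3, 4}, {1, 2, 3, 4}])
```

Because every vertex has at most one parent and parents are strictly larger, following parent pointers can never return to the starting set, which mirrors the no-cycle argument above.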


To show that $\mathbb{T}$ is a choice graph, we note that the unique path from the root to each leaf node consists of nested sets, and we are back to the case of the nested collection in Theorem 3.4. The formal proof requires induction on the number of subsets in $\mathcal{M}$, as shown below.

The cases of one or two sets are trivially true. So, suppose that the result is true with $|\mathcal{M}| = m-1$. Now consider the case when $|\mathcal{M}| = m$. Without loss of generality, assume that $\mathcal{M}$ has a unique maximal set $S^*$. If it has multiple maximal sets, we treat each tree separately. Let $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{C}$ be arbitrary disjoint sub-collections such that $\mathcal{C}$ separates $\mathcal{A}$ from $\mathcal{B}$ in $\mathbb{T}$; that is, every path from a set $A \in \mathcal{A}$ to a set $B \in \mathcal{B}$ passes through some set in $\mathcal{C}$. We will now show that $\mathcal{A} \perp \mathcal{B} \mid \mathcal{C}$. There are four cases to consider:

Case 1: $S^* \notin \mathcal{A} \,\dot\cup\, \mathcal{B} \,\dot\cup\, \mathcal{C}$. The result then follows immediately from the inductive hypothesis applied to $\mathcal{M} \setminus \{S^*\}$.

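Case 2: $S^* \in \mathcal{C}$. Suppose that $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$ and $\mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$. We want to show that $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$.

Let $\bar{\mathcal{C}} = \mathcal{C} \setminus \{S^*\}$, $\bar{\mathcal{M}} = \mathcal{M} \setminus \{S^*\}$, and let $\bar{\mathbb{T}}$ be the graph obtained from $\mathbb{T}$ by removing the vertex $S^*$. Note that $\bar{\mathcal{C}}$ still separates $\mathcal{A}$ and $\mathcal{B}$ in $\bar{\mathbb{T}}$ because $S^*$ is the root vertex of $\mathbb{T}$. Also, $\mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) = \mathcal{D}_{\bar{\mathcal{C}}}(z_{\bar{\mathcal{C}}}) \cap \mathcal{D}_{S^*}(z_{S^*})$, which implies that $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \subseteq \mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\bar{\mathcal{C}}}(z_{\bar{\mathcal{C}}})$. Therefore, $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$ implies $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\bar{\mathcal{C}}}(z_{\bar{\mathcal{C}}}) \neq \emptyset$. Similarly, we also have $\mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\bar{\mathcal{C}}}(z_{\bar{\mathcal{C}}}) \neq \emptyset$. It now follows by invoking the induction hypothesis for the collection $\mathcal{M} \setminus \{S^*\}$ that $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\bar{\mathcal{C}}}(z_{\bar{\mathcal{C}}}) \neq \emptyset$.

In order to complete the proof, let $\mathcal{T}$ denote the collection of maximal sets among $\mathcal{A} \,\dot\cup\, \mathcal{B} \,\dot\cup\, \bar{\mathcal{C}}$. Note that every set in $\mathcal{T}$ is a subset of $S^*$. The fact that $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$ and $\mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$ implies that either (1) $z_{S^*} \in S^* \setminus \big[\cup_{G \in \mathcal{T}} G\big]$ or (2) $z_{S^*} = z_G$ for some $G \in \mathcal{T}$. Using the arguments in the proof of Theorem 3.4, we can modify any ranking $\sigma \in \mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\bar{\mathcal{C}}}(z_{\bar{\mathcal{C}}})$ to obtain a ranking $\bar\sigma \in \mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\bar{\mathcal{C}}}(z_{\bar{\mathcal{C}}}) \cap \mathcal{D}_{S^*}(z_{S^*})$, implying that $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\bar{\mathcal{C}}}(z_{\bar{\mathcal{C}}}) \cap \mathcal{D}_{S^*}(z_{S^*}) \neq \emptyset$. This completes the proof of this case.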

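Case 3: $S^* \in \mathcal{A}$. Suppose that $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$ and $\mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$. We want to show that $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$.

Let $\bar{\mathcal{A}} = \mathcal{A} \setminus \{S^*\}$, $\bar{\mathcal{M}} = \mathcal{M} \setminus \{S^*\}$, and let $\bar{\mathbb{T}}$ be the graph obtained from $\mathbb{T}$ by removing the vertex $S^*$. Note that $\mathcal{C}$ still separates $\bar{\mathcal{A}}$ and $\mathcal{B}$ in $\bar{\mathbb{T}}$. Also, $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) = \mathcal{D}_{\bar{\mathcal{A}}}(z_{\bar{\mathcal{A}}}) \cap \mathcal{D}_{S^*}(z_{S^*})$. As above, it follows that $\mathcal{D}_{\bar{\mathcal{A}}}(z_{\bar{\mathcal{A}}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$ and $\mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$. Then, by invoking the inductive hypothesis for the collection $\mathcal{M} \setminus \{S^*\}$, we obtain that $\mathcal{D}_{\bar{\mathcal{A}}}(z_{\bar{\mathcal{A}}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$.

Let $\mathcal{T}$ denote the collection of maximal sets among $\bar{\mathcal{A}} \,\dot\cup\, \mathcal{B} \,\dot\cup\, \mathcal{C}$. Note that every set in $\mathcal{T}$ is a subset of $S^*$ and it must be in $\bar{\mathcal{A}} \,\dot\cup\, \mathcal{C}$ because $S^* \in \mathcal{A}$ is the root vertex and $\mathcal{C}$ separates $\mathcal{A}$ from $\mathcal{B}$. The fact that $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$ implies that either (1) $z_{S^*} \in S^* \setminus \big[\cup_{G \in \mathcal{T}} G\big]$ or (2) $z_{S^*} = z_G$ for some $G \in \mathcal{T}$. In both these cases, we can argue as above that $\mathcal{D}_{\bar{\mathcal{A}}}(z_{\bar{\mathcal{A}}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \cap \mathcal{D}_{S^*}(z_{S^*}) \neq \emptyset$.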

Case 4: S∗ ∈B. This case is the same as Case 3 by symmetry.

In all four cases, we see that we have $\mathcal{A} \perp \mathcal{B} \mid \mathcal{C}$. By induction, the result holds for all laminar collections.

A.3 Proof of Theorem 3.6


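Consider any three disjoint sub-collections, $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{C}$, such that $\mathcal{C}$ separates $\mathcal{A}$ and $\mathcal{B}$ in Star. It must then be the case that $A_0 \in \mathcal{C}$ because all the paths from $A_i$ to $A_j$ must go through $A_0$. Now suppose we are given choices $z_{\mathcal{A}}$, $z_{\mathcal{B}}$, and $z_{\mathcal{C}}$, such that $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$ and $\mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$. We want to show $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$.

Consider an arbitrary set $S \in (\mathcal{A} \,\dot\cup\, \mathcal{B} \,\dot\cup\, \mathcal{C}) \setminus \{A_0\}$. Let $z_S$ be the choice from set $S$, and let $z_0$ denote the choice from $A_0$. Note that $S \in \{A_1, \ldots, A_\ell\}$. Because we are given $\mathcal{D}_S(z_S) \cap \mathcal{D}_{A_0}(z_0) \neq \emptyset$ and $A_0 \subset S$, it must be that either $z_S = z_0$ or $z_S \in S \setminus A_0$. Now, consider the sub-collection $\mathcal{F} = \{S \in (\mathcal{A} \,\dot\cup\, \mathcal{B} \,\dot\cup\, \mathcal{C}) \setminus \{A_0\} : z_S \neq z_0\}$. Define the ranking $\sigma$ with $\{z_S : S \in \mathcal{F}\}$ as the top $|\mathcal{F}|$ products (the precise ordering does not matter) and with $z_0$ ranked at position $|\mathcal{F}| + 1$. It can be verified that $\sigma$ is consistent with all the given choices $z_{\mathcal{A}}$, $z_{\mathcal{B}}$, and $z_{\mathcal{C}}$. This consistency implies that $\sigma \in \mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}})$, so $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$, which is the desired result.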
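
The ranking constructed in this proof can be written out as a minimal sketch. The function names `star_ranking` and `top_choice` and the argument layout are illustrative assumptions, not part of the paper; the code only assumes what the proof states, namely that every outer set contains $A_0$, that outer sets overlap only in $A_0$, and that each choice is pairwise consistent with $z_0$.

```python
def star_ranking(products, a0, others, z0, choices):
    """Build a ranking consistent with choices on a star-shaped collection.

    products: all product labels (the universe N)
    a0:       the core set A0
    others:   outer sets A1..Al, each containing A0, meeting pairwise in A0
    z0:       the choice from A0
    choices:  choices[i] is the choice from others[i]
    """
    # Pairwise consistency with z0: each choice is z0 or lies outside A0.
    for s, z in zip(others, choices):
        assert z in s and (z == z0 or z not in a0)
    # Choices that differ from z0 go on top (their order does not matter).
    top = [z for z in choices if z != z0]
    ranking = top + [z0]
    # All remaining products follow in arbitrary order.
    ranking += [p for p in products if p not in ranking]
    return ranking


def top_choice(ranking, offer_set):
    """Most preferred product of offer_set under the ranking."""
    return next(p for p in ranking if p in offer_set)


a0 = {1, 2}
others = [{1, 2, 3}, {1, 2, 4, 5}, {1, 2, 6}]
sigma = star_ranking([1, 2, 3, 4, 5, 6], a0, others, 1, [3, 1, 6])
```

In this example the products 3 and 6 are placed on top, followed by the core choice 1, so each offer set recovers its prescribed choice.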


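We prove the following slightly stronger result by induction: for any collections of subsets $\mathcal{A}, \mathcal{B}, \mathcal{C} \subseteq \mathrm{Del}_k$ such that $\mathcal{C}$ separates $\mathcal{A}$ from $\mathcal{B}$ in the graph $G_k$, we have that $\mathcal{A} \perp \mathcal{B} \mid \mathcal{C}$ and all the choices are completely determined by a top-$(k+1)$ partial ranked list. The result is trivially true for $k = 0$ and $k = 1$.

Suppose that the result is true for $k-1$. We will now show that it is also true for $k$. Without loss of generality, suppose that $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{C}$ are disjoint, and $\mathcal{C}$ separates $\mathcal{A}$ and $\mathcal{B}$. We are given choices $z_{\mathcal{A}}$, $z_{\mathcal{B}}$, and $z_{\mathcal{C}}$ such that $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$ and $\mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$. We want to show that $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$ and that there exists a top-$(k+1)$ partial ranked list such that all the choices are consistent with it.

Note that there is an edge $(S_1, S_2)$ for any two subsets $S_1 \in \mathcal{F}_\ell$ and $S_2 \in \mathcal{F}_{\ell+1}$ for all $0 \le \ell \le k-1$. Therefore, for $\mathcal{C}$ to separate $\mathcal{A}$ from $\mathcal{B}$, it must be the case that $\mathcal{C}$ contains $\mathcal{F}_\ell$ for some $\ell$. Define $\ell^*$ to be the largest such level, i.e., $\ell^* = \max\{\ell : \mathcal{F}_\ell \subseteq \mathcal{C}\}$. There are three cases to consider.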


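Case 1: $\ell^* = k$. In this case, we have that $\mathcal{F}_k \subseteq \mathcal{C}$. Since $z_{\mathcal{C}}$ is given, we are given the choices $z_{\mathcal{F}_k}$ for all subsets of size $n-k$. These choices completely determine the top-$(k+1)$ ranked elements, say, $(d_1, \ldots, d_{k+1})$, where $d_i$ is the element ranked at position $i$. Let $\sigma$ be any ranked list having $(d_1, \ldots, d_{k+1})$ as the top-$(k+1)$ ranked products. Because the top-$(k+1)$ ranked products also uniquely determine the choices from any set $N \setminus A$ such that $|A| \le k$, and the choices $z_{\mathcal{A}}$ and $z_{\mathcal{C}}$ are consistent with each other, we must have that the choices $z_{\mathcal{A}}$ are also consistent with $\sigma$. This implies that $\sigma \in \mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}})$. Similarly, we must have that $\sigma \in \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}})$. Thus, we have shown that $\sigma \in \mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}})$, which implies that $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$, which is the desired result.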

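Case 2: $\ell^* = k-1$. Let $\bar{\mathcal{A}} = \mathcal{A} \setminus \mathcal{F}_k$, $\bar{\mathcal{B}} = \mathcal{B} \setminus \mathcal{F}_k$, and $\bar{\mathcal{C}} = \mathcal{C} \setminus \mathcal{F}_k$. Then, it must be the case that $\bar{\mathcal{C}}$ separates $\bar{\mathcal{A}}$ and $\bar{\mathcal{B}}$ in the graph $G_{k-1}$. Thus, it follows by the inductive hypothesis that $\bar{\mathcal{A}} \perp \bar{\mathcal{B}} \mid \bar{\mathcal{C}}$ and the choices $z_{\bar{\mathcal{A}}}$, $z_{\bar{\mathcal{B}}}$, and $z_{\bar{\mathcal{C}}}$ are all consistent with a top-$k$ partial ranked list $\sigma^{(k)} = (d_1, d_2, \ldots, d_k)$, where product $d_i$ is ranked at position $i$.

We now extend $\sigma^{(k)}$ to a top-$(k+1)$ partial ranked list that is consistent with all of the choices. Pick an arbitrary set $S \in \mathcal{F}_k$ such that $S \in \mathcal{A} \cup \mathcal{B} \cup \mathcal{C}$. There are two sub-cases to consider:

Subcase 2.1: Suppose that $S \cap \{d_1, d_2, \ldots, d_k\} = \emptyset$, i.e., $S = N \setminus \{d_1, d_2, \ldots, d_k\}$. Since there is only one such subset $S$, we handle it by extending $\sigma^{(k)}$ to the top-$(k+1)$ partial list $(d_1, d_2, \ldots, d_k, z_S)$, adding $z_S$ as the $(k+1)$th element.

Subcase 2.2: Suppose that $S \cap \{d_1, d_2, \ldots, d_k\} \neq \emptyset$. This means that $S = N \setminus Q$ where $|Q| = k$ and $Q \neq \{d_1, d_2, \ldots, d_k\}$. In this case, we will show that $\sigma^{(k)}$ is already consistent with the choice $z_S$; that is, $z_S = d_{i^*}$ where $i^* = \min\{i : d_i \in S\}$.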

To prove this, assume without loss of generality that $S \in \mathcal{A}$; the argument is the same if $S \in \mathcal{B}$ or $S \in \mathcal{C}$. Consider a product $j^* \in Q \setminus \{d_1, d_2, \ldots, d_k\}$. Note that $j^*$ exists because $Q$ and $\{d_1, \ldots, d_k\}$ have exactly $k$ elements each and are not equal. Because $S \in \mathcal{F}_k$ and $j^* \notin S$, we have that $S \cup \{j^*\} \in \mathcal{F}_{k-1} \subseteq \mathcal{C}$, so $z_{S \cup \{j^*\}}$ is given. It follows from our induction hypothesis that $\sigma^{(k)}$ is consistent with the top choice of the set $S \cup \{j^*\}$, which means that $z_{S \cup \{j^*\}} = d_{i^*}$ because $d_{i^*}$ is the most preferred product in $S$ under $\sigma^{(k)}$ and $j^*$ is not part of $\sigma^{(k)}$.

Now, since the choices of $\mathcal{A}$ and $\mathcal{C}$ are consistent with each other because $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$, the choices $z_{S \cup \{j^*\}}$ and $z_S$ must be consistent with each other. Furthermore, since $S \subset S \cup \{j^*\}$ and $z_{S \cup \{j^*\}} = d_{i^*} \in S$, dropping $j^*$ does not alter the choice; that is, we must have $z_S = z_{S \cup \{j^*\}} = d_{i^*}$, which is the desired result.


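Case 3: $\ell^* \le k-2$. In this case, it must be the case that $\cup_{s=\ell^*}^{k} \mathcal{F}_s$ contains either sets in $\mathcal{A}$ or sets in $\mathcal{B}$, but not both. Without loss of generality, assume that $\mathcal{B} \cap \big(\cup_{s=\ell^*}^{k} \mathcal{F}_s\big) = \emptyset$; the proof for the case where $\mathcal{A} \cap \big(\cup_{s=\ell^*}^{k} \mathcal{F}_s\big) = \emptyset$ is similar. This means that in levels $\ell^*, \ell^*+1, \ldots, k$, we will only find sets from $\mathcal{A}$ or $\mathcal{C}$.

Since $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$ by our hypothesis, pick $\tau \in \mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}})$. Let $z_{\mathcal{F}_{k-1}}$ denote the most preferred products of the sets in $\mathcal{F}_{k-1}$ under $\tau$. Let $\bar{\mathcal{B}} = \mathcal{B} \setminus \mathcal{F}_k = \mathcal{B}$, $\bar{\mathcal{C}} = \mathcal{C} \setminus \mathcal{F}_k$, and $\bar{\mathcal{A}} = (\mathcal{A} \setminus \mathcal{F}_k) \cup (\mathcal{F}_{k-1} \setminus \mathcal{C})$. Note that $\bar{\mathcal{A}}$, $\bar{\mathcal{B}}$, and $\bar{\mathcal{C}}$ are (disjoint) collections of subsets in $\cup_{s=0}^{k-1} \mathcal{F}_s$, and $\mathcal{F}_{k-1} \subset \bar{\mathcal{A}} \cup \bar{\mathcal{C}}$. Note that $z_{\bar{\mathcal{A}}}$, $z_{\bar{\mathcal{B}}}$, and $z_{\bar{\mathcal{C}}}$ are fixed because $z_{\mathcal{F}_{k-1}}$ is determined from $\tau$. Also, by our construction, we have that $\mathcal{D}_{\bar{\mathcal{A}}}(z_{\bar{\mathcal{A}}}) \cap \mathcal{D}_{\bar{\mathcal{C}}}(z_{\bar{\mathcal{C}}}) \neq \emptyset$ and $\mathcal{D}_{\bar{\mathcal{B}}}(z_{\bar{\mathcal{B}}}) \cap \mathcal{D}_{\bar{\mathcal{C}}}(z_{\bar{\mathcal{C}}}) \neq \emptyset$.

Moreover, we have that $\bar{\mathcal{C}}$ separates $\bar{\mathcal{A}}$ and $\bar{\mathcal{B}}$ in the graph $G_{k-1}$. So, by the inductive hypothesis, we have that $\bar{\mathcal{A}} \perp \bar{\mathcal{B}} \mid \bar{\mathcal{C}}$ (so $\mathcal{D}_{\bar{\mathcal{A}}}(z_{\bar{\mathcal{A}}}) \cap \mathcal{D}_{\bar{\mathcal{B}}}(z_{\bar{\mathcal{B}}}) \cap \mathcal{D}_{\bar{\mathcal{C}}}(z_{\bar{\mathcal{C}}}) \neq \emptyset$) and there exists a top-$k$ partial ranked list that is consistent with the choices $z_{\bar{\mathcal{A}}}$, $z_{\bar{\mathcal{B}}}$, and $z_{\bar{\mathcal{C}}}$. Let us denote this top-$k$ partial ranked list by $\sigma^{(k)} = (d_1, d_2, \ldots, d_k)$, so product $d_1$ is the most preferred product, $d_2$ is the second most preferred, and so on.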

As above, we now extend $\sigma^{(k)}$ to a top-$(k+1)$ partial ranked list that is consistent with all of the choices. Pick an arbitrary set $S \in \mathcal{F}_k$ such that $S \in \mathcal{A} \cup \mathcal{C}$. Note that $S \notin \mathcal{B}$ by our hypothesis. There are two sub-cases to consider:

Subcase 3.1: Suppose that $S \cap \{d_1, d_2, \ldots, d_k\} = \emptyset$, i.e., $S = N \setminus \{d_1, d_2, \ldots, d_k\}$. Again, note that there is only one such subset. As a result, we simply extend $\sigma^{(k)}$ to $(d_1, d_2, \ldots, d_k, z_S)$ by adding $z_S$ as the $(k+1)$th element.

Subcase 3.2: Suppose that $S \cap \{d_1, d_2, \ldots, d_k\} \neq \emptyset$. This means that $S = N \setminus Q$ where $|Q| = k$ and $Q \neq \{d_1, d_2, \ldots, d_k\}$. In this case, we will show that $\sigma^{(k)}$ is already consistent with the choice $z_S$; that is, $z_S = d_{i^*}$ where $i^* = \min\{i : d_i \in S\}$.

The arguments are similar to Subcase 2.2 above. In particular, assume without loss of generality that $S \in \mathcal{A}$; the argument is the same if $S \in \mathcal{B}$ or $S \in \mathcal{C}$. Consider a product $j^* \in Q \setminus \{d_1, d_2, \ldots, d_k\}$. Note that $j^*$ exists because $Q$ and $\{d_1, \ldots, d_k\}$ have exactly $k$ elements each and are not equal. Because $S \in \mathcal{F}_k$ and $j^* \notin S$, we have that $S \cup \{j^*\} \in \mathcal{F}_{k-1} \subset \bar{\mathcal{A}} \cup \bar{\mathcal{C}}$, so $z_{S \cup \{j^*\}}$ is given. It follows from our induction hypothesis that $\sigma^{(k)}$ is consistent with the top choice of the set $S \cup \{j^*\}$, which means that $z_{S \cup \{j^*\}} = d_{i^*}$ because $d_{i^*}$ is the most preferred product in $S$ under $\sigma^{(k)}$ and $j^*$ is not part of $\sigma^{(k)}$.

Now, note that by the construction of $z_{\mathcal{F}_{k-1}}$ above, the choices $z_{\mathcal{A}}$, $z_{\mathcal{C}}$, and $z_{\mathcal{F}_{k-1}}$ are consistent with each other. As a result, the choices $z_{S \cup \{j^*\}}$ (with $S \cup \{j^*\} \in \mathcal{F}_{k-1}$) and $z_S$ (with $S \in \mathcal{A} \cup \mathcal{C}$) must be consistent with each other (in particular, with the ranking $\tau$). Furthermore, since $S \subset S \cup \{j^*\}$ and $z_{S \cup \{j^*\}} = d_{i^*} \in S$, dropping $j^*$ does not alter the choice; that is, we must have $z_S = z_{S \cup \{j^*\}} = d_{i^*}$, which is the desired result.

The result now follows by induction.

Appendix B: Verification of Rational Separation

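In this section, we show how to verify the rational separation property for any three collections of subsets $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{C}$. For any $z_{\mathcal{A}} = (z_S \in S : S \in \mathcal{A})$, $z_{\mathcal{B}} = (z_S \in S : S \in \mathcal{B})$, and $z_{\mathcal{C}} = (z_S \in S : S \in \mathcal{C})$, let the directed graph $H(z_{\mathcal{A}}, z_{\mathcal{B}}, z_{\mathcal{C}})$ be defined by:

$$H(z_{\mathcal{A}}, z_{\mathcal{B}}, z_{\mathcal{C}}) = \Big( N, \; \big\{ (z_S, j) : S \in \mathcal{A} \cup \mathcal{B} \cup \mathcal{C}, \; j \in S, \; j \neq z_S \big\} \Big).$$

Thus, the set of nodes in $H(z_{\mathcal{A}}, z_{\mathcal{B}}, z_{\mathcal{C}})$ consists of all the products in $N$, and for each set $S \in \mathcal{A} \cup \mathcal{B} \cup \mathcal{C}$, we have a directed edge from $z_S$ to every other product in $S$, indicating that $z_S$ is preferred to all other products in $S$, i.e., $z_S$ is the top-ranked item in $S$. The following lemma shows that a consistent ranking exists if and only if the graph has no directed cycle.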

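Lemma B.1 (Consistency Verification) The graph $H(z_{\mathcal{A}}, z_{\mathcal{B}}, z_{\mathcal{C}})$ has no directed cycle if and only if $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$. Moreover, verifying whether or not $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$ can be done in linear time, with $O\big(n + \min\big\{n^2, \sum_{S \in \mathcal{A} \cup \mathcal{B} \cup \mathcal{C}} |S|\big\}\big)$ operations.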

Proof: By our construction, if there is a directed cycle in H(zA, zB, zC), then it is clear that there

is no consistent ranking, so DA(zA)∩DB(zB)∩DC(zC) = ∅. However, if there is no directed cycle,

then it is easy to construct a consistent ranking, by first constructing a consistent ranking among

(zS : S ∈A∪B ∪ C), and putting all other products at the bottom of the ranking. This completes

the first part of the lemma.

Note that a strongly connected component of H(zA, zB, zC) is a subgraph in which every vertex

is reachable from every other vertex. Thus, H(zA, zB, zC) has no directed cycle if and only if every

strongly connected component consists of a single vertex. The problem of finding all strongly

connected components in a directed graph is a well-studied problem in computer science, and

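it can be solved in linear time (Tarjan 1972). Note that the number of edges in $H(z_{\mathcal{A}}, z_{\mathcal{B}}, z_{\mathcal{C}})$ is $O\big(\min\big\{n^2, \sum_{S \in \mathcal{A} \cup \mathcal{B} \cup \mathcal{C}} |S|\big\}\big)$. So, verifying whether or not $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset$ can be done in linear time, using $O\big(n + \min\big\{n^2, \sum_{S \in \mathcal{A} \cup \mathcal{B} \cup \mathcal{C}} |S|\big\}\big)$ operations.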
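
The consistency check of Lemma B.1 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name `consistent_ranking` and the input format (a list of `(offer_set, chosen)` pairs) are hypothetical, and the sketch detects cycles with Kahn's topological sort instead of the strongly-connected-components algorithm of Tarjan (1972). Both run in time linear in the number of nodes and edges of $H$.

```python
from collections import defaultdict, deque


def consistent_ranking(products, choices):
    """Return a ranking (most preferred first) in which every chosen item
    is the top product of its offer set, or None if no such ranking exists.

    products: list of all products (the node set N of H)
    choices:  list of (offer_set, chosen) pairs, with chosen in offer_set
    """
    succ = defaultdict(set)            # edge z -> j: z is preferred to j
    indeg = {p: 0 for p in products}
    for offer_set, z in choices:
        for j in offer_set:
            if j != z and j not in succ[z]:
                succ[z].add(j)
                indeg[j] += 1
    # Kahn's algorithm: repeatedly output a product that nothing
    # still unplaced must be preferred to.
    queue = deque(p for p in products if indeg[p] == 0)
    order = []
    while queue:
        p = queue.popleft()
        order.append(p)
        for j in succ[p]:
            indeg[j] -= 1
            if indeg[j] == 0:
                queue.append(j)
    # Leftover products with positive in-degree lie on a directed cycle.
    return order if len(order) == len(products) else None


ok = consistent_ranking([1, 2, 3], [({1, 2}, 1), ({2, 3}, 2)])
bad = consistent_ranking([1, 2, 3], [({1, 2}, 1), ({2, 3}, 2), ({1, 3}, 3)])
```

In the second call the choices force 1 above 2, 2 above 3, and 3 above 1, so $H$ contains a directed cycle and no consistent ranking exists.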

The following corollary shows how to verify the rational separation property.


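Corollary B.2 (Rational Separation Verification) For any collections of subsets $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{C}$, we can verify whether or not $\mathcal{A} \perp \mathcal{B} \mid \mathcal{C}$ using $O\Big( \big( \prod_{S \in \mathcal{A} \cup \mathcal{B} \cup \mathcal{C}} |S| \big) \times \big( n + \min\big\{ n^2, \sum_{S \in \mathcal{A} \cup \mathcal{B} \cup \mathcal{C}} |S| \big\} \big) \Big)$ operations.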

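Proof: To verify that $\mathcal{A} \perp \mathcal{B} \mid \mathcal{C}$, for each $z_{\mathcal{A}} = (z_S \in S : S \in \mathcal{A})$, $z_{\mathcal{B}} = (z_S \in S : S \in \mathcal{B})$, and $z_{\mathcal{C}} = (z_S \in S : S \in \mathcal{C})$, we need to check whether or not the following condition holds:

$$\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset \ \text{ and } \ \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset \implies \mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \neq \emptyset.$$

By Lemma B.1, checking the above condition takes a total of

$$O\Big( \big( n + \min\big\{ n^2, \textstyle\sum_{S \in \mathcal{A} \cup \mathcal{C}} |S| \big\} \big) + \big( n + \min\big\{ n^2, \textstyle\sum_{S \in \mathcal{B} \cup \mathcal{C}} |S| \big\} \big) + \big( n + \min\big\{ n^2, \textstyle\sum_{S \in \mathcal{A} \cup \mathcal{B} \cup \mathcal{C}} |S| \big\} \big) \Big)$$

operations, which is $O\big( n + \min\big\{ n^2, \sum_{S \in \mathcal{A} \cup \mathcal{B} \cup \mathcal{C}} |S| \big\} \big)$. To complete the proof, note that the number

of possible combinations of $(z_{\mathcal{A}}, z_{\mathcal{B}}, z_{\mathcal{C}})$ is $\prod_{S \in \mathcal{A} \cup \mathcal{B} \cup \mathcal{C}} |S|$.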
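
The enumeration in this proof can be sketched as a brute-force check. The function names `has_ranking` and `rationally_separates` are hypothetical, and the cycle test uses a plain depth-first search rather than the strongly-connected-components routine cited above; the sketch is only meant for small collections, since the number of choice combinations is the product of the set sizes.

```python
from itertools import product


def has_ranking(choices):
    """True if some ranking makes each chosen item the top of its offer set,
    i.e., if the preference graph built from the choices is acyclic."""
    succ = {}
    for offer_set, z in choices:
        succ.setdefault(z, set()).update(j for j in offer_set if j != z)
    color = {}  # 1 = on the current DFS path, 2 = fully explored

    def acyclic_from(u):
        color[u] = 1
        for v in succ.get(u, ()):
            if color.get(v) == 1:          # back edge: directed cycle
                return False
            if v not in color and not acyclic_from(v):
                return False
        color[u] = 2
        return True

    return all(acyclic_from(u) for u in list(succ) if u not in color)


def rationally_separates(coll_a, coll_b, coll_c):
    """Check A ⊥ B | C by enumerating every combination of choices."""
    sets = list(coll_a) + list(coll_b) + list(coll_c)
    na, nb = len(coll_a), len(coll_b)
    for zs in product(*[sorted(s) for s in sets]):
        pairs = list(zip(sets, zs))
        a, b, c = pairs[:na], pairs[na:na + nb], pairs[na + nb:]
        # Pairwise consistency with C must imply joint consistency.
        if has_ranking(a + c) and has_ranking(b + c) and not has_ranking(a + b + c):
            return False
    return True


nested_ok = rationally_separates([{1, 2}], [{1, 2, 3, 4}], [{1, 2, 3}])
no_separator = rationally_separates([{1, 2}], [{1, 2, 3, 4}], [])
```

The first call uses a nested collection in which the middle set separates the two ends of the line graph, so the property holds. In the second call the separator is empty, and the choices 1 from {1, 2} and 2 from {1, 2, 3, 4} are each feasible alone but jointly inconsistent.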

Appendix C: Proofs for Section 4

C.1 Proof of the Computational Complexity of the Dynamic Program for Tree Choice Graphs

We will establish that the DP recursion in Theorem 4.1 can be computed in linear time using just $O(n\,|\mathcal{M}|)$ operations. We will compute the value function at each vertex in a breadth-first search manner, starting at the leaf vertices. For each leaf vertex $S$, $V_S(z_S) = c_{z_S,S}$ for all $z_S \in S$. Once we compute the value function for every leaf vertex, we then move up one level. Consider an arbitrary vertex $S$ whose set of children is denoted by $\mathrm{Children}(S)$. Computing the value function at $S$ involves two steps:

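Step 1: For each vertex $A \in \mathrm{Children}(S)$, compute two numbers: $\alpha_A \equiv \min_{z_A \in A} V_A(z_A)$ and $\beta_A \equiv \min_{z_A \in A \setminus S} V_A(z_A)$. Step 1 requires a total of $O(|\mathrm{Children}(S)|\, n)$ operations because, for each $A \in \mathrm{Children}(S)$, we enumerate through all the products $z_A \in A$ to compute $\alpha_A$ and $\beta_A$.

Step 2: For each product $z_S \in S$, we compute $V_S(z_S)$ as follows:

$$V_S(z_S) = c_{z_S,S} + \sum_{A \in \mathrm{Children}(S)} \; \min_{z_A \in A \,:\, \mathcal{D}_S(z_S) \cap \mathcal{D}_A(z_A) \neq \emptyset} V_A(z_A) = c_{z_S,S} + \sum_{A \in \mathrm{Children}(S)} \Big( \mathbf{1}[z_S \notin A]\, \alpha_A + \mathbf{1}[z_S \in A]\, \min\{\beta_A, V_A(z_S)\} \Big),$$

where the second equality follows from noting that $\{z_A \in A : \mathcal{D}_S(z_S) \cap \mathcal{D}_A(z_A) \neq \emptyset\}$ is equal to $A$ if $z_S \notin A$, and to $\{z_S\} \,\dot\cup\, (A \setminus S)$ otherwise. Therefore,

$$\min_{z_A \in A \,:\, \mathcal{D}_A(z_A) \cap \mathcal{D}_S(z_S) \neq \emptyset} V_A(z_A) = \begin{cases} \min_{z_A \in A} V_A(z_A) & \text{if } z_S \notin A, \\ \min\big\{ V_A(z_S), \ \min_{z_A \in A \setminus S} V_A(z_A) \big\} & \text{if } z_S \in A, \end{cases} \;=\; \mathbf{1}[z_S \notin A]\, \alpha_A + \mathbf{1}[z_S \in A]\, \min\{\beta_A, V_A(z_S)\}.$$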


As we have already pre-computed $\alpha_A$ and $\beta_A$ in Step 1, for each $z_S \in S$, computing $V_S(z_S)$ takes $O(|\mathrm{Children}(S)|)$ operations, so the entire Step 2 requires $O(n\,|\mathrm{Children}(S)|)$ operations.

Therefore, the total computation for the value function at vertex $S$ is $O(2n\,|\mathrm{Children}(S)|) = O(n\,|\mathrm{Children}(S)|)$. We continue this process until we reach the root vertex of the tree. Therefore, the total number of operations for computing the value function at every vertex of the tree is $O\big(n \sum_{S \in \mathcal{M}} |\mathrm{Children}(S)|\big) = O(n\,|\mathcal{M}|)$, which is the desired result.
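
The two-step recursion can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `tree_dp` and `brute_force`, the dictionary-based tree representation, and the example costs are all assumptions made for the sketch. The brute-force routine enumerates every ranking and is included only as a reference value for small instances.

```python
from itertools import permutations


def tree_dp(root, children, cost):
    """Minimum total cost over choices that are consistent along a rooted
    tree of offer sets.

    root:     the root offer set (a frozenset)
    children: dict mapping an offer set to the list of its child offer sets
    cost:     dict mapping (product, offer_set) to the cost c_{i,S}
    """
    def value(s):
        # V_S(z) for every z in S, computed from the leaves upward.
        v = {z: cost[(z, s)] for z in s}
        for a in children.get(s, []):
            va = value(a)
            alpha = min(va.values())                    # min over all of A
            beta = min((va[z] for z in a - s),          # min over A \ S
                       default=float("inf"))
            for z in s:
                v[z] += alpha if z not in a else min(beta, va[z])
        return v

    return min(value(root).values())


def brute_force(products, sets, cost):
    """Reference value: minimum total cost over all rankings."""
    best = float("inf")
    for sigma in permutations(products):
        total = sum(cost[(next(p for p in sigma if p in s), s)] for s in sets)
        best = min(best, total)
    return best


# Nested collection (line graph), rooted at the largest set.
s1, s2, s3 = frozenset({1, 2}), frozenset({1, 2, 3}), frozenset({1, 2, 3, 4})
cost = {(i, s): (7 * i + 3 * len(s)) % 5 for s in (s1, s2, s3) for i in s}
dp_value = tree_dp(s3, {s3: [s2], s2: [s1]}, cost)
```

Each child contributes only the two precomputed numbers `alpha` and `beta` plus one lookup per product, which is where the $O(n\,|\mathrm{Children}(S)|)$ cost per vertex comes from.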

C.2 Proof of Theorem 4.2

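In this section, we give the proof of Theorem 4.2, showing that for tree choice graphs, we can reformulate the Rank Aggregation LP as an equivalent LP with just $O(n\,|\mathcal{M}|)$ variables and constraints. Recall from Lemma 2.5 that the feasible region of the Rank Aggregation LP is given by a polytope $\mathscr{M}$ defined as follows:

$$\mathscr{M} = \mathrm{conv}\Big\{ \big( \mathbf{1}[\sigma, i, S] : i \in S, \ S \in \mathcal{M} \big) \in \{0,1\}^{\sum_{S \in \mathcal{M}} |S|} \; : \; \sigma \in P_n \Big\},$$

and the optimization problem is equivalent to $Z^* = \min_{x \in \mathscr{M}} \sum_{S \in \mathcal{M}} \sum_{i \in S} c_{i,S}\, x_{i,S}$.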

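Let a polytope $\mathscr{L}$ be defined by:

$$\mathscr{L} = \left\{ \big( y_{i,A} : i \in A, \ A \in \mathcal{M} \big) \;\middle|\; \begin{array}{ll} y_{r,A} \le z_{r,A,B} \ \text{and} \ y_{r,B} \le z_{r,A,B} & \forall\, r \in A \cap B, \ \{A,B\} \in E \\ \sum_{i \in S} y_{i,S} = 1 & \forall\, S \in \mathcal{M} \\ \sum_{r \in A \cap B} z_{r,A,B} = 1 & \forall\, \{A,B\} \in E \\ y_{i,A} \ge 0, \ z_{r,A,B} \ge 0 & \forall\, i \in A, \ A \in \mathcal{M}, \ r \in A \cap B, \ \{A,B\} \in E \end{array} \right\},$$

where $E$ denotes the edge set of the tree and we interpret $y_{i,A}$ as the probability of choosing $i$ from offer set $A$ and $z_{r,A,B}$ as the probability of choosing $r$ when offered $A \cap B$. Note that the polytope $\mathscr{L}$ has $O(n\,|\mathcal{M}|)$ variables and $O(n\,|E| + |\mathcal{M}|) = O(n\,|\mathcal{M}|)$ constraints. The main result of this section is stated in the following theorem; it shows that $\mathscr{M}$ is equivalent to $\mathscr{L}$.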

Theorem C.1 (Compact Polytope) When the choice graph T= (M,E) is a tree, M = L .

Note that Theorem 4.2 follows immediately from the above theorem. It is clear that M ⊆L . We

prove that L ⊆M by explicitly constructing a distribution over rankings that is consistent with

the choice probabilities yi,A for all i∈A and A∈M. The details are provided in Appendix C. We

conclude this section by describing the compact LPs for the nested, laminar, and differentiated

collections introduced in Section 3.


LP for the nested collection (Section 3.1): Because the choice graph for a nested collection is the line graph (Theorem 3.4), it follows from Theorem 4.2 that the Rank Aggregation LP for the nested collection $\mathcal{M} = \{S_1, \ldots, S_\ell\}$ is equivalent to the following compact LP:

$$\min \left\{ \sum_{h=1}^{\ell} \sum_{i \in S_h} c_{i,S_h}\, y_{i,S_h} \;\middle|\; y_{i,S_{h+1}} \le y_{i,S_h} \ \ \forall\, i \in S_h, \ 1 \le h \le \ell-1, \quad \sum_{i \in S_h} y_{i,S_h} = 1 \ \ \forall\, h, \quad y \ge 0 \right\}.$$

Because $S_1 \subset S_2 \subset \cdots \subset S_\ell$, for any edge $\{S_h, S_{h+1}\}$ in the graph, $S_h \cap S_{h+1} = S_h$; therefore, we do not need the additional variables $z_{r,A,B}$, and we can express the LP using only the variables $y_{i,S_h}$.

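LP for the laminar collection (Section 3.2): Because the laminar collection has a choice graph $\mathbb{T} = (\mathcal{M}, E)$ that is a disjoint collection of trees (Theorem 3.5), the Rank Aggregation LP reduces to the following compact LP:

$$\min \left\{ \sum_{S \in \mathcal{M}} \sum_{i \in S} c_{i,S}\, y_{i,S} \;\middle|\; y_{r,A} \le y_{r,B} \ \ \forall\, \{A,B\} \in E, \ B \subseteq A, \ r \in B, \quad \sum_{i \in S} y_{i,S} = 1 \ \ \forall\, S \in \mathcal{M}, \quad y \ge 0 \right\}.$$

As in the nested collection setting, we do not need the additional variables $z_{r,A,B}$ because if $A \cap B \neq \emptyset$, we have either $A \subset B$ or $B \subset A$.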

LP for the differentiated collection (Section 3.3): Because the choice graph for the differentiated collection $\mathcal{M} = \{A_0, A_1, \ldots, A_\ell\}$ is star-shaped (Theorem 3.6) with edges from $A_0$ to $A_i$, $1 \le i \le \ell$, where $A_0 \subset A_i$ for $i = 1, 2, \ldots, \ell$ and $A_i \cap A_j = A_0$ for all $i \neq j$, the Rank Aggregation LP has the following compact representation:

$$\min \left\{ \sum_{h=0}^{\ell} \sum_{i \in A_h} c_{i,A_h}\, y_{i,A_h} \;\middle|\; y_{r,A_h} \le y_{r,A_0} \ \ \forall\, r \in A_0, \ 1 \le h \le \ell, \quad \sum_{i \in A_h} y_{i,A_h} = 1 \ \ 0 \le h \le \ell, \quad y \ge 0 \right\}.$$

The above three examples demonstrate that the resulting LP is much more compact than the original Rank Aggregation LP.

The proof of Theorem C.1 is based on induction on the number of vertices in the tree. The non-trivial base case is when the tree has two vertices; that is, when $\mathcal{M} = \{A, B\}$. The proof here is constructive and is given in the following lemma.

Lemma C.2 (Base Case of Two Vertices) If $\mathcal{M} = \{A, B\}$, then $\mathscr{M} = \mathscr{L}$.

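Proof: Without loss of generality, we assume that $A \cap B \neq \emptyset$; otherwise, the result is trivially true. When $\mathcal{M} = \{A, B\}$, note that

$$\mathscr{L} = \left\{ \big( y_{i,S} : i \in S, \ S \in \mathcal{M} \big) \;\middle|\; \begin{array}{l} y_{r,A} \le z_{r,A,B} \ \text{and} \ y_{r,B} \le z_{r,A,B} \quad \forall\, r \in A \cap B, \\ \sum_{i \in A} y_{i,A} = 1, \quad \sum_{i \in B} y_{i,B} = 1, \quad \sum_{r \in A \cap B} z_{r,A,B} = 1, \quad y, z \ge 0 \end{array} \right\}.$$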

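We will first show that $\mathscr{M} \subseteq \mathscr{L}$. By definition, for an arbitrary $x \in \mathscr{M}$, $x_{i,S}$ is of the form

$$x_{i,S} = \sum_{\sigma} \tau(\sigma) \cdot \mathbf{1}[\sigma, i, S],$$

where $\tau : P_n \to [0,1]$ is a probability mass function over the set of rankings, with $\sum_{\sigma} \tau(\sigma) = 1$ and $\tau(\sigma) \ge 0$ for all $\sigma$. For all $i \in A$, let $y_{i,A} = \sum_{\sigma} \tau(\sigma) \cdot \mathbf{1}[\sigma, i, A]$, and for all $i \in B$, let $y_{i,B} = \sum_{\sigma} \tau(\sigma) \cdot \mathbf{1}[\sigma, i, B]$. Also, for each $r \in A \cap B$, let

$$z_{r,A,B} = \sum_{\sigma} \tau(\sigma) \cdot \mathbf{1}[\sigma, r, A \cap B].$$

It is easy to check that $y_{i,A}$, $y_{i,B}$, and $z_{r,A,B}$ satisfy all of the constraints in $\mathscr{L}$ because $\max\{\mathbf{1}[\sigma, r, A], \mathbf{1}[\sigma, r, B]\} \le \mathbf{1}[\sigma, r, A \cap B]$ for any $r \in A \cap B$. Therefore, $\mathscr{M} \subseteq \mathscr{L}$.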

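We now prove that $\mathscr{L} \subseteq \mathscr{M}$. Consider an arbitrary element $(y, z) \in \mathscr{L}$. For each $i \in A \setminus B$, $r \in A \cap B$, and $j \in B \setminus A$, define the following subsets of rankings:

$$\begin{aligned} \mathcal{S}_1(r) &= \mathcal{D}_A(r) \cap \mathcal{D}_{A \cap B}(r) \cap \mathcal{D}_B(r), \\ \mathcal{S}_2(i, r) &= \mathcal{D}_A(i) \cap \mathcal{D}_{A \cap B}(r) \cap \mathcal{D}_B(r), \\ \mathcal{S}_3(j, r) &= \mathcal{D}_A(r) \cap \mathcal{D}_{A \cap B}(r) \cap \mathcal{D}_B(j), \\ \mathcal{S}_4(i, j, r) &= \mathcal{D}_A(i) \cap \mathcal{D}_{A \cap B}(r) \cap \mathcal{D}_B(j). \end{aligned}$$

Note that $\mathcal{S}_1(r)$ is the set of all rankings for which $r \in A \cap B$ is the most preferred product in $A \cup B$; $\mathcal{S}_2(i, r)$ is the set of rankings for which $i \in A \setminus B$ is the most preferred product in $A \cup B$, followed by $r$ as the second most preferred product; $\mathcal{S}_3(j, r)$ is the set of rankings for which $j \in B \setminus A$ is the most preferred product in $A \cup B$, followed by $r$ as the second most preferred product; and finally, $\mathcal{S}_4(i, j, r)$ is the set of rankings for which $i$ and $j$ are the top two most preferred products in $A \cup B$, followed by $r$ as the third most preferred product. By our construction, these four sets are disjoint. Then, define a distribution $\lambda(\cdot)$ over rankings as follows:

$$\lambda(\sigma) = \begin{cases} \dfrac{1}{|\mathcal{S}_1(r)|} \cdot \dfrac{y_{r,A}\, y_{r,B}}{z_{r,A,B}} & \text{if } \sigma \in \mathcal{S}_1(r) \text{ for some } r, \\[2ex] \dfrac{1}{|\mathcal{S}_2(i,r)|} \cdot \dfrac{(z_{r,A,B} - y_{r,A})\, y_{r,B}\, y_{i,A} / z_{r,A,B}}{\sum_{\ell \in A \setminus B} y_{\ell,A}} & \text{if } \sigma \in \mathcal{S}_2(i, r) \text{ for some } i, r, \\[2ex] \dfrac{1}{|\mathcal{S}_3(j,r)|} \cdot \dfrac{(z_{r,A,B} - y_{r,B})\, y_{r,A}\, y_{j,B} / z_{r,A,B}}{\sum_{k \in B \setminus A} y_{k,B}} & \text{if } \sigma \in \mathcal{S}_3(j, r) \text{ for some } j, r, \\[2ex] \dfrac{1}{|\mathcal{S}_4(i,j,r)|} \cdot \dfrac{(z_{r,A,B} - y_{r,A})\,(z_{r,A,B} - y_{r,B})\, y_{i,A}\, y_{j,B} / z_{r,A,B}}{\big(\sum_{\ell \in A \setminus B} y_{\ell,A}\big)\big(\sum_{k \in B \setminus A} y_{k,B}\big)} & \text{if } \sigma \in \mathcal{S}_4(i, j, r) \text{ for some } i, j, r, \\[2ex] 0 & \text{otherwise.} \end{cases}$$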


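We will first show that $\lambda(\cdot)$ is a bona fide probability mass function. By the constraints of $\mathscr{L}$, $\lambda(\cdot)$ is non-negative, so we will now show that it sums to one. Fix $r \in A \cap B$. Then,

$$\sum_{i \in A \setminus B} \frac{(z_{r,A,B} - y_{r,A})\, y_{r,B}\, y_{i,A} / z_{r,A,B}}{\sum_{\ell \in A \setminus B} y_{\ell,A}} = \frac{(z_{r,A,B} - y_{r,A})\, y_{r,B}}{z_{r,A,B}}, \tag{3}$$

$$\sum_{j \in B \setminus A} \frac{(z_{r,A,B} - y_{r,B})\, y_{r,A}\, y_{j,B} / z_{r,A,B}}{\sum_{k \in B \setminus A} y_{k,B}} = \frac{(z_{r,A,B} - y_{r,B})\, y_{r,A}}{z_{r,A,B}}, \tag{4}$$

$$\sum_{i \in A \setminus B} \sum_{j \in B \setminus A} \frac{(z_{r,A,B} - y_{r,A})\,(z_{r,A,B} - y_{r,B})\, y_{i,A}\, y_{j,B} / z_{r,A,B}}{\big(\sum_{\ell \in A \setminus B} y_{\ell,A}\big)\big(\sum_{k \in B \setminus A} y_{k,B}\big)} = \frac{(z_{r,A,B} - y_{r,A})\,(z_{r,A,B} - y_{r,B})}{z_{r,A,B}}.$$

Therefore,

$$\sum_{\sigma} \lambda(\sigma) = \sum_{r \in A \cap B} \left[ \frac{y_{r,A}\, y_{r,B}}{z_{r,A,B}} + \frac{(z_{r,A,B} - y_{r,A})\, y_{r,B}}{z_{r,A,B}} + \frac{(z_{r,A,B} - y_{r,B})\, y_{r,A}}{z_{r,A,B}} + \frac{(z_{r,A,B} - y_{r,A})\,(z_{r,A,B} - y_{r,B})}{z_{r,A,B}} \right] = \sum_{r \in A \cap B} z_{r,A,B} = 1,$$

where the last equality follows from the constraint in $\mathscr{L}$.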

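To complete the proof, we will show that $\lambda(\cdot)$ has the desired choice probabilities. Note that for any $i \in A \setminus B$,

$$\begin{aligned} \sum_{\sigma} \lambda(\sigma) \cdot \mathbf{1}[\sigma, i, A] &= \sum_{r \in A \cap B} \frac{(z_{r,A,B} - y_{r,A})\, y_{r,B}\, y_{i,A} / z_{r,A,B}}{\sum_{\ell \in A \setminus B} y_{\ell,A}} + \sum_{j \in B \setminus A} \sum_{r \in A \cap B} \frac{(z_{r,A,B} - y_{r,A})\,(z_{r,A,B} - y_{r,B})\, y_{i,A}\, y_{j,B} / z_{r,A,B}}{\big(\sum_{\ell \in A \setminus B} y_{\ell,A}\big)\big(\sum_{k \in B \setminus A} y_{k,B}\big)} \\ &= \frac{y_{i,A}}{\sum_{\ell \in A \setminus B} y_{\ell,A}} \sum_{r \in A \cap B} \frac{(z_{r,A,B} - y_{r,A})\, y_{r,B} + (z_{r,A,B} - y_{r,A})\,(z_{r,A,B} - y_{r,B})}{z_{r,A,B}} \\ &= \frac{y_{i,A}}{\sum_{\ell \in A \setminus B} y_{\ell,A}} \sum_{r \in A \cap B} \big( z_{r,A,B} - y_{r,A} \big) = \frac{y_{i,A}}{\sum_{\ell \in A \setminus B} y_{\ell,A}} \Big( 1 - \sum_{r \in A \cap B} y_{r,A} \Big) = \frac{y_{i,A}}{\sum_{\ell \in A \setminus B} y_{\ell,A}} \sum_{r \in A \setminus B} y_{r,A} \\ &= y_{i,A}, \end{aligned}$$

where the sum over $j$ in the second term is simplified in the same way as in (4), and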
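where the last two equalities follow from the fact that $\sum_{r \in A \cap B} z_{r,A,B} = 1$ and $\sum_{r \in A \cap B} y_{r,A} + \sum_{r \in A \setminus B} y_{r,A} = 1$. Similarly, for any $j \in B \setminus A$,

$$\begin{aligned} \sum_{\sigma} \lambda(\sigma)\, \mathbf{1}[\sigma, j, B] &= \sum_{r \in A \cap B} \frac{(z_{r,A,B} - y_{r,B})\, y_{r,A}\, y_{j,B} / z_{r,A,B}}{\sum_{k \in B \setminus A} y_{k,B}} + \sum_{i \in A \setminus B} \sum_{r \in A \cap B} \frac{(z_{r,A,B} - y_{r,A})\,(z_{r,A,B} - y_{r,B})\, y_{i,A}\, y_{j,B} / z_{r,A,B}}{\big(\sum_{\ell \in A \setminus B} y_{\ell,A}\big)\big(\sum_{k \in B \setminus A} y_{k,B}\big)} \\ &= \frac{y_{j,B}}{\sum_{k \in B \setminus A} y_{k,B}} \sum_{r \in A \cap B} \frac{(z_{r,A,B} - y_{r,B})\, y_{r,A} + (z_{r,A,B} - y_{r,A})\,(z_{r,A,B} - y_{r,B})}{z_{r,A,B}} \\ &= \frac{y_{j,B}}{\sum_{k \in B \setminus A} y_{k,B}} \sum_{r \in A \cap B} \big( z_{r,A,B} - y_{r,B} \big) = \frac{y_{j,B}}{\sum_{k \in B \setminus A} y_{k,B}} \Big( 1 - \sum_{r \in A \cap B} y_{r,B} \Big) = \frac{y_{j,B}}{\sum_{k \in B \setminus A} y_{k,B}} \sum_{r \in B \setminus A} y_{r,B} \\ &= y_{j,B}, \end{aligned}$$

where the sum over $i$ in the second term is simplified in the same way as in (3),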

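which is the desired result. Finally, for any $r \in A \cap B$,

$$\sum_{\sigma} \lambda(\sigma)\, \mathbf{1}[\sigma, r, A] = \frac{y_{r,A}\, y_{r,B}}{z_{r,A,B}} + \sum_{j \in B \setminus A} \frac{(z_{r,A,B} - y_{r,B})\, y_{j,B}\, y_{r,A} / z_{r,A,B}}{\sum_{k \in B \setminus A} y_{k,B}} = y_{r,A},$$

and

$$\sum_{\sigma} \lambda(\sigma)\, \mathbf{1}[\sigma, r, B] = \frac{y_{r,A}\, y_{r,B}}{z_{r,A,B}} + \sum_{i \in A \setminus B} \frac{(z_{r,A,B} - y_{r,A})\, y_{i,A}\, y_{r,B} / z_{r,A,B}}{\sum_{\ell \in A \setminus B} y_{\ell,A}} = y_{r,B},$$

which completes the proof.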
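
The construction in Lemma C.2 can be checked numerically on a small instance. The sketch below is illustrative only: the function name `two_set_masses` and the example probabilities are assumptions, and the code tracks the aggregate mass of each class $\mathcal{S}_1, \ldots, \mathcal{S}_4$ rather than individual rankings. It also assumes $z_{r,A,B} > 0$ for every $r$ and that $A \setminus B$ and $B \setminus A$ carry positive probability, so no $0/0$ convention is needed. Exact rational arithmetic avoids rounding issues.

```python
from fractions import Fraction as F


def two_set_masses(y_a, y_b, z, a, b):
    """Aggregate mass placed on each ranking class S1..S4.

    y_a, y_b: dicts of choice probabilities for offer sets a and b
    z:        dict with z[r] for every r in a & b
    Returns a dict keyed by (choice from a, choice from b, choice from a & b).
    """
    only_a, only_b, both = a - b, b - a, a & b
    tot_a = sum(y_a[i] for i in only_a)
    tot_b = sum(y_b[j] for j in only_b)
    mass = {}
    for r in both:
        ga, gb = z[r] - y_a[r], z[r] - y_b[r]   # slacks in y <= z
        mass[(r, r, r)] = y_a[r] * y_b[r] / z[r]                      # S1(r)
        for i in only_a:
            mass[(i, r, r)] = ga * y_b[r] * y_a[i] / (z[r] * tot_a)   # S2(i, r)
        for j in only_b:
            mass[(r, j, r)] = gb * y_a[r] * y_b[j] / (z[r] * tot_b)   # S3(j, r)
        for i in only_a:
            for j in only_b:                                          # S4(i, j, r)
                mass[(i, j, r)] = ga * gb * y_a[i] * y_b[j] / (z[r] * tot_a * tot_b)
    return mass


a, b = {1, 2, 3}, {2, 3, 4}
y_a = {1: F(1, 2), 2: F(1, 4), 3: F(1, 4)}
y_b = {2: F(1, 5), 3: F(3, 10), 4: F(1, 2)}
z = {2: F(2, 5), 3: F(3, 5)}          # satisfies y <= z and sums to one
mass = two_set_masses(y_a, y_b, z, a, b)
```

Summing the masses by the first, second, or third key component should recover $y_{\cdot,A}$, $y_{\cdot,B}$, and $z_{\cdot,A,B}$ respectively, which is exactly what the derivation above establishes.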

Using the result of Lemma C.2, we now prove Theorem C.1.

Proof of Theorem C.1: We prove the result by induction on the number of vertices $|\mathcal{M}|$ in the tree. Without loss of generality, assume that $\mathbb{T}$ is connected; otherwise, we can treat each connected component separately. When $|\mathcal{M}| = 1$, the result is trivially true. The case where $|\mathcal{M}| = 2$ is established in Lemma C.2. Suppose the result is true for all trees with $1, 2, \ldots, k-1$ vertices. Consider the case when $|\mathcal{M}| = k > 2$. Pick a vertex in $\mathbb{T}$ with at least two children; such a vertex exists because $|\mathcal{M}| > 2$ and $\mathbb{T}$ is connected (see footnote 6). We label this vertex as the root vertex. Note that $\mathrm{root} \in \mathcal{M}$. Let $A_1, \ldots, A_h$ denote the children of root in the tree, with $h \ge 2$. For $j = 1, \ldots, h$, let $\mathbb{T}(A_j)$ denote the sub-tree rooted at $A_j$ excluding the vertex $A_j$; equivalently, $\mathcal{A}_j = \{A_j\} \cup \mathbb{T}(A_j)$ denotes the sub-tree rooted at $A_j$. Since we have a tree, the collections $\mathcal{A}_1, \ldots, \mathcal{A}_h$ are mutually disjoint; that is, $\mathcal{A}_{j_1} \cap \mathcal{A}_{j_2} = \emptyset$ for all $j_1 \neq j_2$.

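We want to show that $\mathscr{M} = \mathscr{L}$. It is clear that $\mathscr{M} \subseteq \mathscr{L}$. To show that $\mathscr{L} \subseteq \mathscr{M}$, we will show that for every feasible solution $\big( y_{i,S}, z_{r,S,T} : i \in S \in \mathcal{M}, \ r \in S \cap T, \ \{S, T\} \in E \big) \in \mathscr{L}$, there exists a distribution $\lambda(\cdot)$ over rankings that matches the choice probabilities $y_{i,S}$ for all $i \in S \in \mathcal{M}$. By our inductive hypothesis, since $|\{\mathrm{root}\} \,\dot\cup\, \mathcal{A}_j| \le k-1$, there exists a probability mass function $\lambda_j(\cdot)$ over rankings that matches the choice probabilities of every set in $\{\mathrm{root}\} \,\dot\cup\, \mathcal{A}_j$; that is, for all $S \in \{\mathrm{root}\} \,\dot\cup\, \mathcal{A}_j$ and $i \in S$,

$$\sum_{\sigma \in P_n} \lambda_j(\sigma) \cdot \mathbf{1}[\sigma, i, S] = y_{i,S}, \qquad j = 1, 2, \ldots, h. \tag{5}$$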

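To complete the result, we will combine the distributions $\lambda_1(\cdot), \ldots, \lambda_h(\cdot)$ to create a probability distribution $\lambda(\cdot)$ over all rankings with the desired property. For $j = 1, 2, \ldots, h$, let $\vec{a}^{\,j} = \big( a^j_B \in B : B \in \mathcal{A}_j \big) \in \times_{B \in \mathcal{A}_j} B$ denote a vector of elements, each coming from a set in $\mathcal{A}_j$. Then, for all $\gamma \in \mathrm{root}$ and $\vec{a}^{\,j} \in \times_{B \in \mathcal{A}_j} B$, let

$$w_j\big(\gamma, \vec{a}^{\,j}\big) = \sum_{\sigma} \lambda_j(\sigma) \cdot \mathbf{1}[\sigma, \gamma, \mathrm{root}] \cdot \prod_{B \in \mathcal{A}_j} \mathbf{1}[\sigma, a^j_B, B] = \sum_{\sigma \,\in\, \mathcal{D}_{\mathrm{root}}(\gamma) \cap \mathcal{D}_{\mathcal{A}_j}(\vec{a}^{\,j})} \lambda_j(\sigma),$$

where, by Definition 3.1, $\mathcal{D}_{\mathcal{A}_j}\big(\vec{a}^{\,j}\big) = \cap_{B \in \mathcal{A}_j} \mathcal{D}_B\big(a^j_B\big)$. Note that $w_j\big(\gamma, \vec{a}^{\,j}\big)$ denotes the mass under $\lambda_j(\cdot)$ of all rankings where $\gamma$ is the most preferred product in root and, for every $B \in \mathcal{A}_j$, $a^j_B$ is the most preferred product in $B$; $w_j\big(\gamma, \vec{a}^{\,j}\big)$ may be zero. Also, by our construction,

$$w_j\big(\gamma, \vec{a}^{\,j}\big) \le \sum_{\sigma} \lambda_j(\sigma)\, \mathbf{1}[\sigma, \gamma, \mathrm{root}] = y_{\gamma, \mathrm{root}}.$$

Footnote 6: For example, if $\mathbb{T}$ is a line graph with three nodes $S_1 - S_2 - S_3$, then $S_2$ has two children.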

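For all $(\gamma, \vec{a}^1, \ldots, \vec{a}^h) \in \mathrm{root} \times \big(\prod_{B \in \mathcal{A}_1} B\big) \times \cdots \times \big(\prod_{B \in \mathcal{A}_h} B\big)$, let

\[ \mathcal{P}(\gamma, \vec{a}^1, \ldots, \vec{a}^h) = \mathcal{D}_{\mathrm{root}}(\gamma) \cap \Big( \bigcap_{j=1}^{h} \mathcal{D}_{\mathcal{A}_j}(\vec{a}^j) \Big). \]

Also, let

\[ w(\gamma, \vec{a}^1, \ldots, \vec{a}^h) = \frac{\prod_{j=1}^{h} w_j(\gamma, \vec{a}^j)}{(y_{\gamma,\mathrm{root}})^{h-1}}, \]

where we define 0/0 to be 0. This is well-defined because we have shown that $w_j(\gamma, \vec{a}^j) \le y_{\gamma,\mathrm{root}}$ for all $j$, so whenever the denominator is zero, so is the numerator. We also note that whenever $\mathcal{P}(\gamma, \vec{a}^1, \ldots, \vec{a}^h) = \emptyset$, we must have that $w(\gamma, \vec{a}^1, \ldots, \vec{a}^h) = 0$. To see this, consider $(\gamma, \vec{a}^1, \ldots, \vec{a}^h)$ such that $\mathcal{P}(\gamma, \vec{a}^1, \ldots, \vec{a}^h) = \emptyset$. Then, we must have that $\mathcal{D}_{\mathrm{root}}(\gamma) \cap \mathcal{D}_{\mathcal{A}_j}(\vec{a}^j) = \emptyset$ for some $1 \le j \le h$; otherwise, it would follow from the rational separation of $\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_h$ by root that $\mathcal{D}_{\mathrm{root}}(\gamma) \cap \mathcal{D}_{\mathcal{A}_i}(\vec{a}^i) \ne \emptyset$ for all $1 \le i \le h$ implies that $\mathcal{P}(\gamma, \vec{a}^1, \ldots, \vec{a}^h) \ne \emptyset$. It now follows by definition that $\mathcal{D}_{\mathrm{root}}(\gamma) \cap \mathcal{D}_{\mathcal{A}_j}(\vec{a}^j) = \emptyset$ implies that $w_j(\gamma, \vec{a}^j) = 0$, which in turn implies that $w(\gamma, \vec{a}^1, \ldots, \vec{a}^h) = 0$. Now define the function $\lambda : \mathcal{P}_n \to [0,1]$ as follows:

\[ \lambda(\sigma) = \begin{cases} \dfrac{w(\gamma, \vec{a}^1, \ldots, \vec{a}^h)}{|\mathcal{P}(\gamma, \vec{a}^1, \ldots, \vec{a}^h)|} & \text{if } \sigma \in \mathcal{P}(\gamma, \vec{a}^1, \ldots, \vec{a}^h) \text{ for some } (\gamma, \vec{a}^1, \ldots, \vec{a}^h), \\[1ex] 0 & \text{otherwise.} \end{cases} \]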

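We will show that $\lambda(\cdot)$ is a probability mass function over the rankings with the desired property. Note that $\lambda(\sigma)$ is well-defined because it follows by our inductive hypothesis that for each $j$, there exist $\gamma$ and a vector $\vec{a}^j$ such that $\mathcal{D}_{\mathrm{root}}(\gamma) \cap \mathcal{D}_{\mathcal{A}_j}(\vec{a}^j) \ne \emptyset$. Because root separates the collections $\mathcal{A}_j$ for all $j$, whenever $\mathcal{D}_{\mathrm{root}}(\gamma) \cap \mathcal{D}_{\mathcal{A}_j}(\vec{a}^j) \ne \emptyset$ for all $j$, it must be that $\mathcal{P}(\gamma, \vec{a}^1, \ldots, \vec{a}^h) = \mathcal{D}_{\mathrm{root}}(\gamma) \cap \big( \bigcap_{j=1}^{h} \mathcal{D}_{\mathcal{A}_j}(\vec{a}^j) \big) \ne \emptyset$ by rational separation. Moreover, by definition, $\mathcal{P}(\gamma, \vec{a}^1, \ldots, \vec{a}^h) \cap \mathcal{P}(\bar{\gamma}, \bar{a}^1, \ldots, \bar{a}^h) = \emptyset$ unless $\gamma = \bar{\gamma}$ and $\vec{a}^j = \bar{a}^j$ for all $j$. So, each $\sigma$ belongs to at most a single set $\mathcal{P}(\gamma, \vec{a}^1, \ldots, \vec{a}^h)$. Therefore, $\lambda(\cdot)$ is well-defined.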

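We will now show that $\lambda(\cdot)$ is a valid probability mass function over rankings. Note that

\begin{align*}
\sum_{\sigma} \lambda(\sigma) &= \sum_{(\gamma, \vec{a}^1, \ldots, \vec{a}^h)} w(\gamma, \vec{a}^1, \ldots, \vec{a}^h) \\
&= \sum_{\gamma \in \mathrm{root}} \frac{1}{(y_{\gamma,\mathrm{root}})^{h-1}} \sum_{\vec{a}^1} \cdots \sum_{\vec{a}^h} \prod_{j=1}^{h} w_j(\gamma, \vec{a}^j) \\
&= \sum_{\gamma \in \mathrm{root}} \frac{1}{(y_{\gamma,\mathrm{root}})^{h-1}} \prod_{j=1}^{h} \sum_{\vec{a}^j} w_j(\gamma, \vec{a}^j),
\end{align*}

where the last equality follows by the interchange of sum and product. By definition, for all $j = 1, \ldots, h$, we have

\begin{align*}
\sum_{\vec{a}^j} w_j(\gamma, \vec{a}^j) &= \sum_{\vec{a}^j} \sum_{\sigma} \lambda_j(\sigma)\, \mathbb{1}[\sigma, \gamma, \mathrm{root}] \prod_{B \in \mathcal{A}_j} \mathbb{1}[\sigma, a^j_B, B] \\
&= \sum_{\sigma} \lambda_j(\sigma)\, \mathbb{1}[\sigma, \gamma, \mathrm{root}] \sum_{\vec{a}^j} \prod_{B \in \mathcal{A}_j} \mathbb{1}[\sigma, a^j_B, B] \\
&= \sum_{\sigma} \lambda_j(\sigma)\, \mathbb{1}[\sigma, \gamma, \mathrm{root}] \prod_{B \in \mathcal{A}_j} \sum_{a^j_B \in B} \mathbb{1}[\sigma, a^j_B, B] \\
&= \sum_{\sigma} \lambda_j(\sigma)\, \mathbb{1}[\sigma, \gamma, \mathrm{root}] = y_{\gamma,\mathrm{root}},
\end{align*}

and thus,

\[ \sum_{\sigma} \lambda(\sigma) = \sum_{\gamma \in \mathrm{root}} \frac{1}{(y_{\gamma,\mathrm{root}})^{h-1}} (y_{\gamma,\mathrm{root}})^{h} = \sum_{\gamma \in \mathrm{root}} y_{\gamma,\mathrm{root}} = 1, \]

where the last equality follows from the fact that $y \in \mathscr{L}$. Thus, $\lambda(\cdot)$ is a bona fide probability mass function over the rankings.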
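The normalization argument above can be checked numerically on a toy instance. The following is a minimal sketch, not part of the original proof: it only computes the glued weights $w(\gamma, \vec{a}^1, \ldots, \vec{a}^h)$ from given tables $w_j(\gamma, \vec{a}^j)$, and does not build rankings. The function name `glue` and the list-of-rows encoding (`w_list[j][g][a]`, with root choices and branch choice vectors indexed by integers) are assumptions made for illustration.

```python
from itertools import product
from math import prod, isclose

def glue(w_list):
    """Combine per-branch weights into w = prod_j w_j / y^(h-1), with 0/0 := 0.

    w_list[j][g][a] is the mass w_j(g, a) that branch j assigns to root choice g
    and branch choice vector a (both encoded as integer indices, hypothetically).
    Every branch must induce the same root marginal y[g] = sum_a w_j(g, a).
    """
    h = len(w_list)
    y = [sum(row) for row in w_list[0]]  # root marginal y_{g,root}
    glued = {}
    for g, y_g in enumerate(y):
        branch_ranges = [range(len(w[g])) for w in w_list]
        for combo in product(*branch_ranges):
            numerator = prod(w_list[j][g][combo[j]] for j in range(h))
            glued[(g, *combo)] = numerator / y_g ** (h - 1) if y_g > 0 else 0.0
    return glued

# Two branches that agree on the root marginal (0.3, 0.7).
w1 = [[0.1, 0.2], [0.3, 0.4]]
w2 = [[0.3, 0.0], [0.2, 0.5]]
w = glue([w1, w2])
print(isclose(sum(w.values()), 1.0))              # total mass is one
print(isclose(w[(0, 0, 0)] + w[(0, 0, 1)], 0.1))  # branch-1 marginal is preserved
```

Summing the glued weights over the choices of one branch recovers the other branch's table, which mirrors the marginalization step used in the proof.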

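We now show that $\lambda(\cdot)$ has the desired choice probability. For any $\gamma \in \mathrm{root}$,

\begin{align*}
\sum_{\sigma} \lambda(\sigma)\, \mathbb{1}[\sigma, \gamma, \mathrm{root}] &= \sum_{\vec{a}^1} \cdots \sum_{\vec{a}^h} w(\gamma, \vec{a}^1, \ldots, \vec{a}^h) \\
&= \frac{1}{(y_{\gamma,\mathrm{root}})^{h-1}} \sum_{\vec{a}^1} \cdots \sum_{\vec{a}^h} \prod_{j=1}^{h} w_j(\gamma, \vec{a}^j) \\
&= \frac{1}{(y_{\gamma,\mathrm{root}})^{h-1}} \prod_{j=1}^{h} \sum_{\vec{a}^j} w_j(\gamma, \vec{a}^j) = y_{\gamma,\mathrm{root}},
\end{align*}

where the last equality follows from the above argument that shows $\sum_{\vec{a}^j} w_j(\gamma, \vec{a}^j) = y_{\gamma,\mathrm{root}}$ for all $j$.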

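Similarly, consider an arbitrary set $S \in \mathcal{A}_\ell$ for some $\ell$ and an element $s \in S$. Since T is a tree, it follows that $S \notin \mathcal{A}_j$ for $j \ne \ell$. Then,

\begin{align*}
\sum_{\sigma} \lambda(\sigma)\, \mathbb{1}[\sigma, s, S] &= \sum_{\gamma \in \mathrm{root}} \sum_{\vec{a}^1} \cdots \sum_{\vec{a}^{\ell-1}} \sum_{\vec{a}^\ell : a^\ell_S = s} \sum_{\vec{a}^{\ell+1}} \cdots \sum_{\vec{a}^h} w(\gamma, \vec{a}^1, \ldots, \vec{a}^h) \\
&= \sum_{\gamma \in \mathrm{root}} \frac{1}{(y_{\gamma,\mathrm{root}})^{h-1}} \sum_{\vec{a}^1} \cdots \sum_{\vec{a}^{\ell-1}} \sum_{\vec{a}^\ell : a^\ell_S = s} \sum_{\vec{a}^{\ell+1}} \cdots \sum_{\vec{a}^h} \prod_{j=1}^{h} w_j(\gamma, \vec{a}^j) \\
&= \sum_{\gamma \in \mathrm{root}} \frac{1}{(y_{\gamma,\mathrm{root}})^{h-1}} \times \sum_{\vec{a}^\ell : a^\ell_S = s} w_\ell(\gamma, \vec{a}^\ell) \times \prod_{j = 1, \ldots, h \,:\, j \ne \ell} \sum_{\vec{a}^j} w_j(\gamma, \vec{a}^j) \\
&= \sum_{\gamma \in \mathrm{root}} \sum_{\vec{a}^\ell : a^\ell_S = s} w_\ell(\gamma, \vec{a}^\ell).
\end{align*}

By definition,

\begin{align*}
\sum_{\gamma \in \mathrm{root}} \sum_{\vec{a}^\ell : a^\ell_S = s} w_\ell(\gamma, \vec{a}^\ell) &= \sum_{\gamma \in \mathrm{root}} \sum_{\vec{a}^\ell : a^\ell_S = s} \sum_{\sigma} \lambda_\ell(\sigma)\, \mathbb{1}[\sigma, \gamma, \mathrm{root}] \times \mathbb{1}[\sigma, s, S] \times \prod_{B \in \mathcal{A}_\ell : B \ne S} \mathbb{1}[\sigma, a^\ell_B, B] \\
&= \sum_{\sigma} \lambda_\ell(\sigma) \Big( \sum_{\gamma \in \mathrm{root}} \mathbb{1}[\sigma, \gamma, \mathrm{root}] \Big) \times \mathbb{1}[\sigma, s, S] \times \prod_{B \in \mathcal{A}_\ell : B \ne S} \sum_{a^\ell_B \in B} \mathbb{1}[\sigma, a^\ell_B, B] \\
&= \sum_{\sigma} \lambda_\ell(\sigma)\, \mathbb{1}[\sigma, s, S] = y_{s,S},
\end{align*}

where the last equality follows from our inductive assumption that $\lambda_\ell(\cdot)$ matches the choice probabilities of every set in $\{\mathrm{root}\} \mathbin{\dot\cup} \mathcal{A}_\ell$.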

The result of the theorem now follows by induction.

Appendix D: Proofs for Section 5

D.1 Proof of Theorem 5.4

The proof is similar to that for the case when the choice graph is a tree in that we formulate the

Rank Aggregation LP as a DP on the tree decomposition of the choice graph.

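Let $T_G$ be the tree decomposition of the graph $G$ with a collection of bags $\{X_b : b \in B\}$, with $|B| \le |\mathcal{M}|$ by Lemma 5.3. Let $X_{\mathrm{root}}$ be a root node in $T_G$ and let $X_b^-$ denote the collection $X_b \setminus X_{\mathrm{Parent}(X_b)}$. With the root node fixed, the parent of a node is uniquely defined; we suppose that the parent of the root node is the empty set. Also, let $T(X_b)$ denote the sub-tree rooted at $X_b$ but not including $X_b$. Also, for all $z_{X_b} = (z_S : S \in X_b) \in \prod_{S \in X_b} S$, let $\mathcal{D}_{X_b}(z_{X_b}) = \bigcap_{S \in X_b} \mathcal{D}_S(z_S)$ and $c_{z_{X_b}, X_b} = \sum_{S \in X_b^-} c_{z_S, S}$.

For each bag $X_b$, we define a value function $J_{X_b} : \prod_{S \in X_b} S \to \mathbb{R}$ as follows: for all $z_{X_b} = (z_S : S \in X_b) \in \prod_{S \in X_b} S$ such that $\mathcal{D}_{X_b}(z_{X_b}) = \bigcap_{S \in X_b} \mathcal{D}_S(z_S) \ne \emptyset$,

\[ J_{X_b}(z_{X_b}) = c_{z_{X_b}, X_b} + \min \Big\{ \sum_{X_a \in T(X_b)} c_{z_{X_a}, X_a} \;:\; (z_{X_a} : X_a \in T(X_b)) \text{ and } \mathcal{D}_{X_b}(z_{X_b}) \cap \Big( \bigcap_{X_a \in T(X_b)} \mathcal{D}_{X_a}(z_{X_a}) \Big) \ne \emptyset \Big\}. \]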


We first claim that $Z^* = \min_{z_{X_{\mathrm{root}}}} J_{X_{\mathrm{root}}}(z_{X_{\mathrm{root}}})$. To see this, note that $X_{\mathrm{root}} \cup \big( \bigcup_{X \in T(X_{\mathrm{root}})} X \big) = \mathcal{M}$. Therefore, it follows from our definitions that $\mathcal{D}_{X_{\mathrm{root}}}(z_{X_{\mathrm{root}}}) \cap \big( \bigcap_{X_a \in T(X_{\mathrm{root}})} \mathcal{D}_{X_a}(z_{X_a}) \big) = \bigcap_{S \in \mathcal{M}} \mathcal{D}_S(z_S)$. It is now sufficient to show that $c_{z_{X_{\mathrm{root}}}, X_{\mathrm{root}}} + \sum_{X_a \in T(X_{\mathrm{root}})} c_{z_{X_a}, X_a} = \sum_{S \in \mathcal{M}} c_{z_S, S}$. We show this by establishing that the collections $X_b^-$, $b \in B$, form a partition of $\mathcal{M}$. We first argue that $X_b^- \cap X_a^- = \emptyset$ for $a, b \in B$ and $a \ne b$. Otherwise, there exists an $S$ that is contained in both $X_b^-$ and $X_a^-$. It follows from our definitions that $S$ is also contained in $X_b$ and $X_a$. It now follows from the running intersection property of Definition 5.1 that every $X$ on the path from $X_b$ to $X_a$ in $T_G$ also contains $S$. Because every path from $X_b$ to $X_a$ should contain both the parents $\mathrm{Parent}(X_b)$ and $\mathrm{Parent}(X_a)$, it follows that $S \in \mathrm{Parent}(X_b)$ and $S \in \mathrm{Parent}(X_a)$. We have thus established that $S$ must be contained in $X_b$, $X_a$, $\mathrm{Parent}(X_b)$, and $\mathrm{Parent}(X_a)$. This contradicts the assumption that $S \in X_b^- = X_b \setminus \mathrm{Parent}(X_b)$. It now suffices to show that for every $S \in \mathcal{M}$, there exists a $b \in B$ such that $S \in X_b^-$. But this immediately follows from noting that there is at least one $b \in B$ such that $S \in X_b$; by taking $b^* \in B$ to be such that $S \in X_{b^*}$ but $S \notin \mathrm{Parent}(X_{b^*})$, we have that $S \in X_{b^*}^-$. Note that such a $b^*$ must always exist.

We now show how we can exploit the rational separation property to solve the DP efficiently. For that, we follow the arguments in the proof of Theorem 4.1 to express $J_{X_b}(\cdot)$ as the following DP recursion:

\[ J_{X_b}(z_{X_b}) = c_{z_{X_b}, X_b} + \sum_{X_a \in \mathrm{Children}(X_b)} \min \big\{ J_{X_a}(z_{X_a}) : \mathcal{D}_{X_b}(z_{X_b}) \cap \mathcal{D}_{X_a}(z_{X_a}) \ne \emptyset \big\}. \]

The above dynamic program allows us to recursively compute the value function at each bag in the tree decomposition, starting with the bags at the leaves and moving up the tree in a breadth-first fashion. In computing the value function at $X_b$, suppose that for each value of $z_{X_b}$, the values $V_a(X_b) := \min \{ J_{X_a}(z_{X_a}) : \mathcal{D}_{X_b}(z_{X_b}) \cap \mathcal{D}_{X_a}(z_{X_a}) \ne \emptyset \}$ have been pre-computed.
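To make the recursion concrete, the following is a minimal sketch specialised to the simplest case, in which every bag holds a single offer set (the tree choice graph setting of Section 4). That specialisation, the function names, and the dictionary encodings of the tree and of the costs $c_{z,S}$ are assumptions made for illustration; they are not taken from the paper. For two offer sets $S$ and $T$ with top choices $z$ and $z'$, the sets $\mathcal{D}_S(z)$ and $\mathcal{D}_T(z')$ fail to intersect exactly when $z \ne z'$, $z \in T$, and $z' \in S$.

```python
def pair_consistent(S, z, T, zp):
    """True iff some ranking has z on top of S and zp on top of T."""
    # A conflict needs z ahead of zp (zp in S) and zp ahead of z (z in T).
    return not (z != zp and z in T and zp in S)

def solve_tree_dp(children, root, cost):
    """Bottom-up DP over a rooted tree whose nodes are single offer sets.

    children: dict mapping an offer set (frozenset) to a list of child offer sets.
    cost:     dict mapping (product, offer_set) to the loss c_{z,S} (hypothetical input).
    Returns the minimum of the root value function over the root's choices.
    """
    def value_table(S):
        child_tables = [(T, value_table(T)) for T in children.get(S, [])]
        table = {}
        for z in S:
            total = cost[(z, S)]
            for T, table_T in child_tables:
                # Inner minimum of the recursion: best consistent child choice.
                total += min(table_T[zp] for zp in T if pair_consistent(S, z, T, zp))
            table[z] = total
        return table
    return min(value_table(root).values())

S = frozenset({1, 2})
T = frozenset({1, 2, 3})
cost = {(1, S): 0.0, (2, S): 1.0, (1, T): 1.0, (2, T): 0.0, (3, T): 0.5}
print(solve_tree_dp({S: [T]}, S, cost))  # choices 1 from S and 3 from T
```

In this toy instance the unconstrained minimum would pick product 1 from $S$ and product 2 from $T$, which no single ranking can produce; the DP instead pairs product 1 from $S$ with product 3 from $T$.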

Then, to compute the value function $J_{X_b}(\cdot)$ at the bag $X_b$, we need to perform $|\mathrm{Children}(X_b)|$ additions for each state. Since there are $O(n^{\mathrm{tw}(G)+1})$ states in total, it follows that computing the value function at bag $X_b$ requires $|\mathrm{Children}(X_b)| \, n^{\mathrm{tw}(G)+1}$ operations. Now, to pre-compute $V_b(X_{\mathrm{Parent}(X_b)})$, we need to search over all possible combinations $z_{X_b}$ and $z_{\mathrm{Parent}(X_b)}$ such that $\mathcal{D}_{X_b}(z_{X_b}) \cap \mathcal{D}_{\mathrm{Parent}(X_b)}(z_{\mathrm{Parent}(X_b)}) \ne \emptyset$. For a given combination of $z_{X_b}$ and $z_{\mathrm{Parent}(X_b)}$, checking whether $\mathcal{D}_{X_b}(z_{X_b}) \cap \mathcal{D}_{\mathrm{Parent}(X_b)}(z_{\mathrm{Parent}(X_b)}) \ne \emptyset$ requires us to check if there is a directed cycle in a graph with $O(n)$ nodes, which has $O(n^2)$ complexity. Because there are $O(n^{2\mathrm{tw}(G)+2})$ possible combinations of $z_{X_b}$ and $z_{\mathrm{Parent}(X_b)}$, the pre-computation has a complexity of $O(n^{2\mathrm{tw}(G)+4})$.

Therefore, the complexity of the total computation is

\[ \sum_{b=1}^{m} |\mathrm{Children}(X_b)| \, n^{\mathrm{tw}(G)+1} + O\big(|B| \cdot n^{2\mathrm{tw}(G)+4}\big) \le n^{\mathrm{tw}(G)+1} \sum_{b=1}^{m} |\mathrm{Children}(X_b)| + O\big(|B| \cdot n^{2\mathrm{tw}(G)+4}\big) = O\big(|\mathcal{M}| \, n^{2\mathrm{tw}(G)+4}\big), \]

where the last equality follows because $\sum_{b=1}^{m} |\mathrm{Children}(X_b)|$ is at most the number of bags in the tree decomposition, which is at most $|\mathcal{M}|$ by Lemma 5.3.
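The consistency check used in the pre-computation step can be sketched as follows. Each choice "product $z$ is on top of offer set $S$" contributes directed edges from $z$ to every other product in $S$, and a ranking realising all choices exists exactly when this precedence graph has no directed cycle (any topological order can be extended to a full ranking). The function name and the input encoding are assumptions for illustration; the sketch uses Kahn's algorithm, which is one standard way to detect a cycle and is not necessarily the procedure the authors had in mind.

```python
def choices_consistent(choices):
    """Return True iff some ranking makes z the top product of S for every (S, z).

    choices: iterable of (offer_set, top_product) pairs, with top_product in offer_set.
    """
    successors = {}
    for offer_set, top in choices:
        successors.setdefault(top, set())
        for item in offer_set:
            successors.setdefault(item, set())
            if item != top:
                successors[top].add(item)  # top must be ranked ahead of item

    # Kahn's algorithm: repeatedly remove vertices with no incoming edge.
    indegree = {v: 0 for v in successors}
    for v in successors:
        for u in successors[v]:
            indegree[u] += 1
    stack = [v for v, d in indegree.items() if d == 0]
    removed = 0
    while stack:
        v = stack.pop()
        removed += 1
        for u in successors[v]:
            indegree[u] -= 1
            if indegree[u] == 0:
                stack.append(u)
    return removed == len(successors)  # all vertices removed means no cycle

print(choices_consistent([({1, 2}, 1), ({2, 3}, 2), ({1, 3}, 1)]))  # True
print(choices_consistent([({1, 2}, 1), ({2, 3}, 2), ({1, 3}, 3)]))  # False
```

The second call fails because the three choices would require product 1 ahead of 2, 2 ahead of 3, and 3 ahead of 1.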


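D.2 Proof of Theorem 5.5

We now describe an LP with $O\big(|\mathcal{M}| \, n^{2(\mathrm{tw}(G)+1)}\big)$ variables and $O\big(|\mathcal{M}| \, n^{\mathrm{tw}(G)+1}\big)$ constraints to solve the Rank Aggregation LP under a general choice graph with bounded tree width. Let $T_G$ be the minimal tree decomposition of $G$ with a collection of bags $\{X_b : b \in B\}$. To facilitate our description of the LP feasible region, recall that we write $z_{X_b} = (z_S : S \in X_b)$ to denote the vector of choices from the subsets in the bag $X_b$ and $\mathcal{D}_{X_b}(z_{X_b}) = \bigcap_{S \in X_b} \mathcal{D}_S(z_S)$. Consider the polytope $\mathscr{K}$, defined as follows:

\[
\mathscr{K} = \left\{ \big( y^b_{z_{X_b}} : b \in B,\ \mathcal{D}_{X_b}(z_{X_b}) \ne \emptyset \big) \;\middle|\;
\begin{aligned}
& \sum_{z_{X_b} : \mathcal{D}_{X_b}(z_{X_b}) \ne \emptyset} y^b_{z_{X_b}} = 1 \quad \forall\, b \in B, \\
& y^b_{z_{X_b}} = \sum_{z_{X_a} : \mathcal{D}_{X_a}(z_{X_a}) \cap \mathcal{D}_{X_b}(z_{X_b}) \ne \emptyset} w^{a,b}_{z_{X_a}, z_{X_b}} \quad \forall \text{ edge } \{X_a, X_b\} \text{ in } T_G \text{ and } z_{X_b}, \\
& y^a_{z_{X_a}} = \sum_{z_{X_b} : \mathcal{D}_{X_a}(z_{X_a}) \cap \mathcal{D}_{X_b}(z_{X_b}) \ne \emptyset} w^{a,b}_{z_{X_a}, z_{X_b}} \quad \forall \text{ edge } \{X_a, X_b\} \text{ in } T_G \text{ and } z_{X_a}, \\
& \text{all variables are non-negative}
\end{aligned}
\right\}.
\]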

In the above polytope, for any $z_{X_b} = (z_S : S \in X_b)$, we interpret the variable $y^b_{z_{X_b}}$ as the probability that for all $S \in X_b$, the product $z_S$ is the highest-ranked product in set $S$. The terms $w^{a,b}_{z_{X_a}, z_{X_b}}$ correspond to the joint probabilities on the edge $\{X_a, X_b\}$ in $T_G$. Note that the above polytope has $O\big(|\mathcal{M}| \, n^{2(\mathrm{tw}(G)+1)}\big)$ variables and $O\big(|\mathcal{M}| \, n^{\mathrm{tw}(G)+1}\big)$ constraints. Recall from Lemma 2.5 that the feasible region of the Rank Aggregation LP is given by the following polytope:


(1l[σ, i,S] : i∈ S, S ∈M)∈ 0,1∑


.

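To complete the proof of Theorem 5.5, it suffices to show that $\mathscr{M} = \mathscr{K}$. It is clear that $\mathscr{M} \subseteq \mathscr{K}$, so it remains to show that $\mathscr{K} \subseteq \mathscr{M}$. We will prove the result by induction on the number of bags in the tree decomposition $T_G$ of $G$. Consider the case where $T_G$ has a single bag, corresponding to $\mathcal{M}$. In that case, the variables in $\mathscr{K}$ consist of $y_{z_{\mathcal{M}}}$ for all $z_{\mathcal{M}}$ such that $\mathcal{D}_{\mathcal{M}}(z_{\mathcal{M}}) \ne \emptyset$. Define a probability mass function $\lambda : \mathcal{P}_n \to [0,1]$ as follows:

\[ \lambda(\sigma) = \begin{cases} \dfrac{y_{z_{\mathcal{M}}}}{|\mathcal{D}_{\mathcal{M}}(z_{\mathcal{M}})|} & \text{if } \sigma \in \mathcal{D}_{\mathcal{M}}(z_{\mathcal{M}}) \text{ for some } z_{\mathcal{M}}, \\[1ex] 0 & \text{otherwise.} \end{cases} \]

It is easy to check that $\lambda(\cdot)$ is a well-defined probability mass function over all rankings. Moreover, it has the desired choice probabilities.

Now, consider the case where the tree decomposition has exactly two bags, say $X_1$ and $X_2$, with an edge between $X_1$ and $X_2$. We define a distribution over rankings as follows:

\[ \lambda(\sigma) = \begin{cases} \dfrac{w_{z_{X_1}, z_{X_2}}}{|\mathcal{D}_{X_1}(z_{X_1}) \cap \mathcal{D}_{X_2}(z_{X_2})|} & \text{if } \sigma \in \mathcal{D}_{X_1}(z_{X_1}) \cap \mathcal{D}_{X_2}(z_{X_2}) \text{ for some } z_{X_1}, z_{X_2}, \\[1ex] 0 & \text{otherwise.} \end{cases} \]

Once again, it is easy to check that $\lambda(\cdot)$ is a well-defined probability mass function with the desired choice probabilities.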

Consider the case $|B| = k > 2$. Pick a bag in $T_G$ with at least two children; such a bag exists because $|B| > 2$ and $T_G$ is connected. We label this bag as the root bag. Let $C_1, \ldots, C_h$ denote the children of the root bag in the tree, with $h \ge 2$. For $j = 1, \ldots, h$, let $\mathcal{F}^j = \{C_j\} \cup T(C_j)$ denote the collection of bags in the sub-tree rooted at $C_j$.

To show that $\mathscr{K} \subseteq \mathscr{M}$, we will show that for every feasible solution of $\mathscr{K}$, there exists a distribution $\lambda(\cdot)$ over rankings that matches the desired choice probabilities. By our inductive hypothesis, since $|\{\mathrm{root}\} \cup \mathcal{F}^j| \le k-1$, there exists a probability mass function $\lambda_j(\cdot)$ over rankings that matches the choice probabilities of every bag in $\{\mathrm{root}\} \cup \mathcal{F}^j$; that is, for all $X_b \in \{\mathrm{root}\} \cup \mathcal{F}^j$,

\[ \sum_{\sigma \in \mathcal{P}_n} \lambda_j(\sigma) \prod_{S \in X_b} \mathbb{1}[\sigma, z_S, S] = y^b_{z_{X_b}}, \qquad j = 1, 2, \ldots, h. \]

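To complete the result, we will combine the distributions $\lambda_1(\cdot), \ldots, \lambda_h(\cdot)$ to create a probability distribution $\lambda(\cdot)$ over all rankings with the desired property. The construction is similar to that given in the proof of Theorem 4.2. Specifically, for each $j = 1, 2, \ldots, h$, let $\mathcal{A}_j$ denote the collection of subsets $\bigcup_{X_b \in \mathcal{F}^j} X_b$. Let $\vec{a}^j = (a^j_B \in B : B \in \mathcal{A}_j) \in \prod_{B \in \mathcal{A}_j} B$ denote the vector of elements, each coming from a set in the collection $\mathcal{A}_j$. Further, let $\gamma = (\gamma_B \in B : B \in \mathrm{root})$ be the vector of elements from sets in the root bag. Let

\[ w_j(\gamma, \vec{a}^j) = \sum_{\sigma} \lambda_j(\sigma) \cdot \prod_{B \in \mathrm{root}} \mathbb{1}[\sigma, \gamma_B, B] \cdot \prod_{B \in \mathcal{A}_j} \mathbb{1}[\sigma, a^j_B, B] = \sum_{\sigma \in \mathcal{D}_{\mathrm{root}}(\gamma) \cap \mathcal{D}_{\mathcal{A}_j}(\vec{a}^j)} \lambda_j(\sigma). \]

It follows from our construction that

\[ w_j(\gamma, \vec{a}^j) \le \sum_{\sigma} \lambda_j(\sigma) \cdot \prod_{B \in \mathrm{root}} \mathbb{1}[\sigma, \gamma_B, B] = y^{\mathrm{root}}_{\gamma}. \]

Now, for all $(\gamma, \vec{a}^1, \ldots, \vec{a}^h) \in \big(\prod_{B \in \mathrm{root}} B\big) \times \big(\prod_{B \in \mathcal{A}_1} B\big) \times \cdots \times \big(\prod_{B \in \mathcal{A}_h} B\big)$, let

\[ \mathcal{P}(\gamma, \vec{a}^1, \ldots, \vec{a}^h) = \mathcal{D}_{\mathrm{root}}(\gamma) \cap \Big( \bigcap_{j=1}^{h} \mathcal{D}_{\mathcal{A}_j}(\vec{a}^j) \Big) \]

and

\[ w(\gamma, \vec{a}^1, \ldots, \vec{a}^h) = \frac{\prod_{j=1}^{h} w_j(\gamma, \vec{a}^j)}{\big( y^{\mathrm{root}}_{\gamma} \big)^{h-1}}, \]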

where we define 0/0 to be 0. This is well-defined because we have shown that $w_j(\gamma, \vec{a}^j) \le y^{\mathrm{root}}_{\gamma}$ for all $j$, so whenever the denominator is zero, so is the numerator. One difference from the proof of Theorem 4.2 is that the collections $\mathcal{A}_1, \ldots, \mathcal{A}_h$ may be overlapping. Specifically, suppose $B \in \mathcal{A}_{j_1} \cap \mathcal{A}_{j_2}$ for some $j_1 \ne j_2$. Then, it follows from the running intersection property that $B \in \mathrm{root}$. Now, consider vectors $\vec{a}^{j_1}$ and $\vec{a}^{j_2}$ such that $a^{j_1}_B, a^{j_2}_B \in B$ but $a^{j_1}_B \ne a^{j_2}_B$. It then follows that for any $\gamma \in \prod_{C \in \mathrm{root}} C$, we have that $\gamma_B \ne a^{j_1}_B$ or $\gamma_B \ne a^{j_2}_B$. Without loss of generality, suppose that $\gamma_B \ne a^{j_1}_B$. Then, the vector of choices $(\gamma, \vec{a}^{j_1})$ is inconsistent, resulting in $w_{j_1}(\gamma, \vec{a}^{j_1}) = 0$, which in turn implies that $w(\gamma, \vec{a}^1, \ldots, \vec{a}^h) = 0$. We have thus argued that even though the same subset may appear in different collections, our definition of $w(\cdot, \ldots, \cdot)$ ensures that inconsistent choices receive zero weight. Finally, following the arguments of the proof of Theorem 4.2, we can also show that whenever $\mathcal{P}(\gamma, \vec{a}^1, \ldots, \vec{a}^h) = \emptyset$, we have that $w(\gamma, \vec{a}^1, \ldots, \vec{a}^h) = 0$.

Now define the distribution $\lambda : \mathcal{P}_n \to [0,1]$:

\[ \lambda(\sigma) = \begin{cases} \dfrac{w(\gamma, \vec{a}^1, \ldots, \vec{a}^h)}{|\mathcal{P}(\gamma, \vec{a}^1, \ldots, \vec{a}^h)|} & \text{if } \sigma \in \mathcal{P}(\gamma, \vec{a}^1, \ldots, \vec{a}^h) \text{ for some } (\gamma, \vec{a}^1, \ldots, \vec{a}^h), \\[1ex] 0 & \text{otherwise.} \end{cases} \]

The rest of the proof follows the arguments of the proof of Theorem 4.2 with the summation $\sum_{\gamma \in \mathrm{root}}$ replaced by $\sum_{\gamma \in \prod_{B \in \mathrm{root}} B}$ because root is now a bag of sets as opposed to simply being a set. We omit the details.

D.3 Proof of Lemma 5.7

We first show that tw(G2)≥ n, where G2 is the choice graph of Del2, the collection of offer sets with

at most two products stocked out. We prove this result by showing that a complete graph Kn+1

with n+ 1 vertices is a minor of G2; that is, Kn+1 can be obtained from G2 by deleting edges and

vertices and by contracting edges. It then follows from a standard result (Robertson and Seymour

1986) that tw(G2)≥ tw(Kn+1) = n, where the equality follows from the standard fact that the tree

width of the complete graph Kn+1 is equal to n.

We show $K_{n+1}$ is a minor of $G_2$ by establishing that $K_{n+1}$ can be obtained from $G_2$ through edge contraction. Recall that the graph $G_2$ has $1 + n + \binom{n}{2}$ vertices consisting of the sets $N$, $N_1, N_2, \ldots, N_n$, $N_{1,2}, \ldots, N_{1,n}, N_{2,3}, \ldots, N_{n-1,n}$. There is an edge between $N$ and $N_i$, and each $N_{i,j}$ (with $i \ne j$) is connected to $N_\ell$ for all $\ell$. Consider the edge between $N_\ell$ and $N_{i,j}$. If we contract this edge, it is equivalent to removing the node $N_{i,j}$ and adding an edge between $N_\ell$ and $N_s$ for all $s \ne \ell$ because $N_s$ is also connected to $N_{i,j}$. Thus, by repeatedly contracting the edge between $N_\ell$ and $N_{i,j}$, the vertices $N_1, \ldots, N_n$ will all be connected to each other, so we obtain a complete graph among the vertices $N, N_1, \ldots, N_n$, which is the desired result. Thus, we have that $\mathrm{tw}(G_2) \ge n$. Since $G_k$ is a subgraph of $G_{k+1}$, it follows from a standard result in graph theory (Bodlaender et al. 2006) that $\mathrm{tw}(G_{k+1}) \ge \mathrm{tw}(G_k)$, which implies that $\mathrm{tw}(G_k) \ge \mathrm{tw}(G_2) \ge n$, which is the desired result.

We will now show that the choice depth of the k-deletion collection is at most $k$. Consider the following tree decomposition $T_{G_k}$ of $G_k$. The tree $T_{G_k}$ consists of the $k$ bags $X_1, \ldots, X_k$ defined as

\[ X_\ell = \{ N_A : |A| = \ell \text{ or } \ell - 1 \}. \]

In other words, $X_1 = \{N, N_1, \ldots, N_n\}$, $X_2$ consists of all subsets that are obtained by deleting one or two products, and so on. We connect $X_\ell$ to $X_{\ell+1}$ for all $\ell < k$, resulting in the following line graph:

X_1 - X_2 - ... - X_{k-1} - X_k

The above line graph is a tree decomposition of $G_k$ because: 1) $\bigcup_{\ell=1}^{k} X_\ell = \mathcal{M}$, 2) every edge in $\mathrm{Del}_k$ is contained in some bag $X_\ell$ by our construction, and 3) the running intersection property holds because the only bags that contain common sets are adjacent nodes $X_\ell$ and $X_{\ell-1}$, and there is no bag in between them.

Now note that for any $\ell \le k$, $\big| \bigcup_{S \in X_\ell} S \big| - \min_{S \in X_\ell} |S| = n - (n - \ell) = \ell$, because $\bigcup_{S \in X_\ell} S = N$ for all $\ell$. Therefore, for every bag $X_\ell$,

\[ \min \Big\{ |X_\ell| - 1,\ \Big| \bigcup_{S \in X_\ell} S \Big| - \min_{S \in X_\ell} |S| \Big\} \le \ell. \]

This implies that $\mathrm{cd}(G_k, T_{G_k}) \le k$, which gives the desired result.
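The depth computation for the line decomposition can be reproduced for small instances. The sketch below assumes, consistently with the description above of $X_2$ as the subsets "obtained by deleting one or two products", that $N_A$ denotes the offer set $N$ with the products in $A$ removed. The function names are hypothetical, and only the quantity $|\bigcup_{S \in X_\ell} S| - \min_{S \in X_\ell} |S|$ is computed.

```python
from itertools import combinations

def deletion_bags(n, k):
    """Bags X_1, ..., X_k of the line decomposition for the k-deletion collection.

    Bag X_l holds every offer set obtained from N = {1, ..., n} by deleting
    exactly l - 1 or exactly l products.
    """
    N = frozenset(range(1, n + 1))
    bags = []
    for l in range(1, k + 1):
        bag = {N - frozenset(A) for size in (l - 1, l) for A in combinations(N, size)}
        bags.append(bag)
    return bags

def bag_depth(bag):
    """Size of the union of the sets in the bag minus the size of its smallest set."""
    universe = frozenset().union(*bag)
    return len(universe) - min(len(S) for S in bag)

print([bag_depth(bag) for bag in deletion_bags(n=5, k=3)])  # depths of X_1, X_2, X_3
```

For $n = 5$ and $k = 3$, the three bags have depths 1, 2, and 3, matching the identity $n - (n - \ell) = \ell$ used in the proof.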


To establish the computational complexity, we revisit the dynamic programming equation in Theorem 5.4. Following the arguments in the proof of Theorem 5.4, we obtain the following DP recursion:

\[ J_{X_b}(z_{X_b}) = c_{z_{X_b}, X_b} + \sum_{X_a \in \mathrm{Children}(X_b)} \min \big\{ J_{X_a}(z_{X_a}) : \mathcal{D}_{X_b}(z_{X_b}) \cap \mathcal{D}_{X_a}(z_{X_a}) \ne \emptyset \big\}. \]

All the terms are as defined in the proof of Theorem 5.4. Now, the state $(z_S : S \in X_b)$ of the value function at bag $X_b$ corresponds to the set of top-ranked products for each set in the bag $X_b$. When the number of sets in each bag is "small," the state can be maintained directly in an efficient manner. However, in situations where the number of sets in each bag is large, maintaining the state directly can be inefficient. Instead, we make the observation that the top-ranked product for many of the sets within the same bag will be the same, especially when the graph has a small choice depth. We exploit this property to maintain the state more efficiently.

In order to obtain an efficient state representation, consider an arbitrary bag $X_b$. Let $U = \bigcup_{S \in X_b} S$ and $d_{X_b} = \big| \bigcup_{S \in X_b} S \big| - \min_{S \in X_b} |S|$. We claim that maintaining the top-$(1 + d_{X_b})$ ranked products within the universe $U$ is sufficient to recover the top choices $z_S$ from all subsets $S \in X_b$. To see this, suppose $A = \{a_1, a_2, \ldots, a_{1+d_{X_b}}\}$ are the top-$(1 + d_{X_b})$ ranked products in the set $U$ such that $a_r$ is the $r$th-ranked product. Consider an arbitrary subset $S \in X_b$. We claim that $S \cap A \ne \emptyset$. Otherwise, we must have that $|A| + |S| \ge 1 + d_{X_b} + \min_{B \in X_b} |B| = 1 + d_{X_b} + \big| \bigcup_{S \in X_b} S \big| - d_{X_b} = 1 + |U|$, which is a contradiction because $A$ and $S$ would be disjoint subsets of $U$.

Because $A$ consists of the top-ranked products, it must be that the products in $S \setminus A$ are never chosen. In fact, the choice $z_S$ from $S$ must be the top-ranked product in $S \cap A$, i.e.,

\[ z_S = a_{i^*} \quad \text{where} \quad i^* = \min \{ 1 \le i \le d_{X_b} + 1 : a_i \in S \}. \]

Therefore, we maintain the vector of the top $d_{X_b} + 1$ ranked products in $\bigcup_{S \in X_b} S$. With this representation, the number of states is $|U|^{d_{X_b}+1} = O(n^{\mathrm{cd}(G)+1})$, where we assume that we have a tree decomposition whose choice depth is the minimum.

By following the rest of the arguments from the proof of Theorem 4.1 with the tree width replaced by the choice depth, we obtain the computational complexity $O\big(|\mathcal{M}| \, n^{4 + 2\mathrm{cd}(G)}\big)$.

The result of the theorem now follows.
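The compressed state representation can be illustrated with a short sketch. Given a bag of offer sets, it computes the depth $d_{X_b}$, recovers every top choice $z_S$ from the top-$(1 + d_{X_b})$ prefix of a ranking, and exhaustively compares the result with the choices read off the full ranking. The function names and the encoding of a ranking as a tuple ordered from most to least preferred are assumptions made for illustration.

```python
from itertools import permutations

def bag_depth(bag):
    """d = |union of the sets in the bag| - (size of the smallest set in the bag)."""
    universe = set().union(*bag)
    return len(universe) - min(len(S) for S in bag)

def choices_from_prefix(prefix, bag):
    """Top choice from each offer set, read off the ordered prefix of top products."""
    return [next(a for a in prefix if a in S) for S in bag]

def prefix_suffices(bag):
    """Check over all rankings of the universe that the top-(1+d) prefix
    yields the same choices as the full ranking."""
    universe = sorted(set().union(*bag))
    d = bag_depth(bag)
    for ranking in permutations(universe):
        full_choices = [min(S, key=ranking.index) for S in bag]
        if choices_from_prefix(ranking[:d + 1], bag) != full_choices:
            return False
    return True

bag = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3, 4}]
print(bag_depth(bag))                     # depth of this bag
print(choices_from_prefix((4, 1), bag))   # choices when 4 is ranked first and 1 second
print(prefix_suffices(bag))               # exhaustive check over all 24 rankings
```

Storing the ordered prefix rather than one choice per offer set is what replaces the exponent $\mathrm{tw}(G) + 1$ by $\mathrm{cd}(G) + 1$ in the state count.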


The proof is similar to that of Theorem 5.5. First, it is clear that $\mathscr{M} \subseteq \mathscr{H}$. So, we focus on establishing that $\mathscr{H} \subseteq \mathscr{M}$. For that, suppose we are given variables $y$, $x$, and $w$ belonging to $\mathscr{H}$. We will construct variables $\tilde{w}$ such that $(y, \tilde{w}) \in \mathscr{K}$, the polyhedron defined in Theorem 5.5. Since $\mathscr{K} = \mathscr{M}$ by Theorem 5.5, it follows that there exists a distribution over rankings that is consistent with the choice probabilities $y$, establishing that $\mathscr{H} \subseteq \mathscr{M}$.

To construct $\tilde{w}$ from $y$, $x$, and $w$, we proceed as follows. For any $z_{X_b} = (z_S : S \in X_b)$, recall that $O_k(U_b, z_{X_b}) \subset O_k(U_b)$ denotes the top-$k$ orderings of elements in $U_b$ that are consistent with the choices $z_{X_b}$. In other words,

\[ O_k(U_b, z_{X_b}) = \big\{ a \in O_k(U_b) : z_S = a_{j^*},\ j^* = \min \{ 1 \le j \le k : a_j \in S \},\ \forall\, S \in X_b \big\}. \]

Now for any $b, b' \in B$ such that $\{X_b, X_{b'}\}$ is an edge in $T_G$, define

\[ \tilde{w}^{b,b'}_{z_{X_b}, z_{X_{b'}}} = \sum_{a \in O_k(U_b, z_{X_b}),\ a' \in O_k(U_{b'}, z_{X_{b'}}) \,:\, \mathcal{P}_k(a', U_{b'}) \cap \mathcal{P}_k(a, U_b) \ne \emptyset} w_{b,a,b',a'}. \]


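To show that $\tilde{w}$ and $y$ belong to $\mathscr{K}$, it is sufficient to show that

\[ y^{b'}_{z_{X_{b'}}} = \sum_{z_{X_b} : \mathcal{D}_{X_b}(z_{X_b}) \cap \mathcal{D}_{X_{b'}}(z_{X_{b'}}) \ne \emptyset} \tilde{w}^{b,b'}_{z_{X_b}, z_{X_{b'}}} \quad \text{and} \quad y^{b}_{z_{X_b}} = \sum_{z_{X_{b'}} : \mathcal{D}_{X_b}(z_{X_b}) \cap \mathcal{D}_{X_{b'}}(z_{X_{b'}}) \ne \emptyset} \tilde{w}^{b,b'}_{z_{X_b}, z_{X_{b'}}}. \tag{6} \]

We focus on establishing the first equality. For that, we proceed as follows. Fix an ordering $a' \in O_k(U_{b'}, z_{X_{b'}})$. We first note that $O_k(U_b, z_{X_b}) \cap O_k(U_b, \bar{z}_{X_b}) = \emptyset$ whenever $z_{X_b} \ne \bar{z}_{X_b}$ because the same top-$k$ ordering cannot result in distinct choices from the same subset. We now claim that

\[ \big\{ a \in O_k(U_b) : \mathcal{P}_k(a, U_b) \cap \mathcal{P}_k(a', U_{b'}) \ne \emptyset \big\} = \mathop{\dot\bigcup}_{z_{X_b} \,:\, \mathcal{D}_{X_{b'}}(z_{X_{b'}}) \cap \mathcal{D}_{X_b}(z_{X_b}) \ne \emptyset} \big\{ a \in O_k(U_b, z_{X_b}) : \mathcal{P}_k(a, U_b) \cap \mathcal{P}_k(a', U_{b'}) \ne \emptyset \big\}. \tag{7} \]

It is clear that RHS $\subseteq$ LHS, so we show that LHS $\subseteq$ RHS. For that, consider an arbitrary top-$k$ ordering $a \in$ LHS. Let $z_{X_b}$ be the choices from the offer sets in $X_b$ according to the ordering $a$. Now, because $\mathcal{P}_k(a, U_b) \cap \mathcal{P}_k(a', U_{b'}) \ne \emptyset$, we can pick a ranking $\sigma \in \mathcal{P}_k(a, U_b) \cap \mathcal{P}_k(a', U_{b'})$. Because the ranking $\sigma$ is consistent with the top-$k$ ordering $a'$ and $a'$ is consistent with the choices $z_{X_{b'}}$, it must be that $\sigma$ is consistent with the choices $z_{X_{b'}}$; that is, $\sigma \in \mathcal{D}_{X_{b'}}(z_{X_{b'}})$. Moreover, by construction, $\sigma \in \mathcal{D}_{X_b}(z_{X_b})$. We have thus shown that $\mathcal{D}_{X_{b'}}(z_{X_{b'}}) \cap \mathcal{D}_{X_b}(z_{X_b}) \ne \emptyset$. By construction, we also have that $a \in \{ a \in O_k(U_b, z_{X_b}) : \mathcal{P}_k(a, U_b) \cap \mathcal{P}_k(a', U_{b'}) \ne \emptyset \}$, so $a \in$ RHS. Since $a$ is arbitrary, we have that LHS $\subseteq$ RHS, as desired.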

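It now follows that

\begin{align*}
\sum_{z_{X_b} : \mathcal{D}_{X_b}(z_{X_b}) \cap \mathcal{D}_{X_{b'}}(z_{X_{b'}}) \ne \emptyset} \tilde{w}^{b,b'}_{z_{X_b}, z_{X_{b'}}}
&= \sum_{z_{X_b} : \mathcal{D}_{X_b}(z_{X_b}) \cap \mathcal{D}_{X_{b'}}(z_{X_{b'}}) \ne \emptyset} \ \sum_{a \in O_k(U_b, z_{X_b}),\ a' \in O_k(U_{b'}, z_{X_{b'}}) \,:\, \mathcal{P}_k(a', U_{b'}) \cap \mathcal{P}_k(a, U_b) \ne \emptyset} w_{b,a,b',a'} \\
&= \sum_{a' \in O_k(U_{b'}, z_{X_{b'}})} \ \sum_{z_{X_b} : \mathcal{D}_{X_b}(z_{X_b}) \cap \mathcal{D}_{X_{b'}}(z_{X_{b'}}) \ne \emptyset} \ \sum_{a \in O_k(U_b, z_{X_b}) \,:\, \mathcal{P}_k(a', U_{b'}) \cap \mathcal{P}_k(a, U_b) \ne \emptyset} w_{b,a,b',a'} \\
&= \sum_{a' \in O_k(U_{b'}, z_{X_{b'}})} \ \sum_{a \in O_k(U_b) \,:\, \mathcal{P}_k(a', U_{b'}) \cap \mathcal{P}_k(a, U_b) \ne \emptyset} w_{b,a,b',a'} \qquad \text{(from (7))} \\
&= \sum_{a' \in O_k(U_{b'}, z_{X_{b'}})} x_{b',a'} \\
&= y^{b'}_{z_{X_{b'}}},
\end{align*}

where the last two equalities follow from the fact that $x$, $w$, and $y$ belong to $\mathscr{H}$. The second equality in (6) can be established in a similar fashion.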

The result of the theorem now follows.


D.6 Proof of Proposition 5.11

The proof of Proposition 5.11 makes use of the following two lemmas. The first lemma establishes that the minimum degree of each vertex in any choice graph associated with the collection $\mathrm{Pairs} = \{\{i, j\} : i \ne j\}$ is at least $2(n-2)$. The second lemma shows that, for the purposes of computing the choice depth and the tree width, it suffices to consider tree decompositions whose adjacent bags do not contain each other.

Lemma D.1 Every vertex $\{i, j\}$ of any choice graph associated with the collection Pairs is connected by an edge to $\{i, \ell\}$ and $\{\ell, j\}$ for every $\ell \notin \{i, j\}$. Therefore, every vertex has a degree of at least $2(n-2)$.

Proof: Let $G = (\mathrm{Pairs}, E)$ be an arbitrary choice graph associated with the collection $\mathrm{Pairs} = \{\{i, j\} : i \ne j\}$. Without loss of generality, we prove the result for the vertex $\{1, 2\}$: there are edges from $\{1, 2\}$ to $\{1, \ell\}$ and $\{2, \ell\}$, for all $\ell = 3, 4, \ldots, n$.

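The proof is by contradiction, so suppose that $\{1, 2\}$ is not connected to $\{1, 3\}$ by an edge. Let $\mathcal{A} = \{\{1, 2\}\}$, $\mathcal{B} = \{\{1, 3\}\}$, and $\mathcal{C} = \mathrm{Pairs} \setminus \{\{1, 2\}, \{1, 3\}\}$. Because $\{1, 2\}$ is not connected to $\{1, 3\}$ by an edge, it follows that $\mathcal{C}$ separates $\mathcal{A}$ from $\mathcal{B}$ in the graph $G$. Since $G$ is a choice graph of Pairs, we must then have $\mathcal{A} \perp\!\!\!\perp \mathcal{B} \mid \mathcal{C}$; i.e., if $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \ne \emptyset$ and $\mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \ne \emptyset$, then $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \ne \emptyset$.

We arrive at a contradiction by constructing choices $z_{\mathcal{A}}$, $z_{\mathcal{B}}$, and $z_{\mathcal{C}}$ that violate the above implication. For that, let $\pi$ be an arbitrary ordering among products $4, 5, 6, \ldots, n$. Consider an ordering $\sigma$ that has product 1 as the most preferred product, followed by product 2 as the second ranked product, followed by product 3 as the third ranked, and then follows the ordering of $\pi$. So, under $\sigma$, the top three products are 1, 2, 3 in that order. Let $z_{\mathcal{A}} = 1$ and let $z_{\mathcal{C}} = (\arg\min_{i \in C} \sigma(i) : C \in \mathcal{C})$ be the vector consisting of the most preferred product for each set $C \in \mathcal{C}$ under $\sigma$. By our construction, $\sigma \in \mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}})$.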

Now, consider the ordering $\tau$ that has product 2 as the most preferred product, followed by product 3 as the second ranked product, followed by product 1 as the third ranked, and then follows the ordering of $\pi$. Under the ordering $\tau$, the top three products are 2, 3, 1 in that order. Let $z_{\mathcal{B}} = 3$. Note that for every set $C \in \mathcal{C}$, the most preferred product in $C$ under $\tau$ is exactly the same as the most preferred product in $C$ under $\sigma$; that is, $\arg\min_{i \in C} \sigma(i) = \arg\min_{i \in C} \tau(i)$. This follows because each set $C$ in $\mathcal{C}$ belongs to one of three types: (1) $C = \{a, b\}$ where $a \notin \{1, 2, 3\}$ and $b \notin \{1, 2, 3\}$, (2) $C = \{1, \ell\}$, $\{2, \ell\}$, or $\{3, \ell\}$ for some $\ell \notin \{1, 2, 3\}$, or (3) $C = \{2, 3\}$. If $C$ is the first type, then both $\sigma$ and $\tau$ yield the same top ranked product by our construction. Similarly, if $C$ is the second type, then we still get the same top ranked product because under both $\sigma$ and $\tau$, products 1, 2, 3 are always preferred over $\ell$ for $\ell \notin \{1, 2, 3\}$. Finally, if $C = \{2, 3\}$, then the top ranked product under both $\sigma$ and $\tau$ is product 2 by our construction. Therefore, we have that $\tau \in \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}})$.

If $G$ is indeed a choice graph, it must be that $\mathcal{D}_{\mathcal{A}}(z_{\mathcal{A}}) \cap \mathcal{D}_{\mathcal{B}}(z_{\mathcal{B}}) \cap \mathcal{D}_{\mathcal{C}}(z_{\mathcal{C}}) \ne \emptyset$. Note that $\{2, 3\} \in \mathcal{C}$. This means that there exists an ordering such that $z_{\mathcal{A}} = z_{\{1,2\}} = 1$, $z_{\{2,3\}} = 2$, and $z_{\mathcal{B}} = z_{\{1,3\}} = 3$ simultaneously. However, this violates the transitivity of preference, which is a contradiction. Therefore, it must be the case that there is an edge between $\{1, 2\}$ and $\{1, 3\}$.

Similar arguments show that there are edges from $\{1, 2\}$ to $\{1, \ell\}$ and $\{2, \ell\}$, for all $\ell \notin \{1, 2\}$. Therefore, the degree of the vertex $\{1, 2\}$ is at least $2(n-2)$. This completes the proof.
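The counterexample in the proof of Lemma D.1 can be verified by brute force for small $n$. The sketch below, with hypothetical function names and rankings encoded as tuples from most to least preferred, checks that $\sigma$ and $\tau$ induce the same choices on every pair in $\mathcal{C}$, and that no ranking simultaneously chooses 1 from $\{1, 2\}$, chooses 3 from $\{1, 3\}$, and agrees with $z_{\mathcal{C}}$.

```python
from itertools import combinations, permutations

def top(ranking, offer_set):
    """Most preferred product of offer_set under ranking (earlier is better)."""
    return min(offer_set, key=ranking.index)

def check_counterexample(n=4):
    """Return (sigma and tau agree on C, list of rankings realising all three choice vectors)."""
    products = tuple(range(1, n + 1))
    A, B = frozenset({1, 2}), frozenset({1, 3})
    C = [frozenset(p) for p in combinations(products, 2) if frozenset(p) not in (A, B)]
    sigma = (1, 2, 3) + products[3:]   # top three products: 1, 2, 3
    tau = (2, 3, 1) + products[3:]     # top three products: 2, 3, 1
    agree = all(top(sigma, S) == top(tau, S) for S in C)
    z_C = {S: top(sigma, S) for S in C}
    witnesses = [
        r for r in permutations(products)
        if top(r, A) == 1 and top(r, B) == 3 and all(top(r, S) == z_C[S] for S in C)
    ]
    return agree, witnesses

print(check_counterexample(4))  # agreement holds and no witness ranking exists
```

An empty witness list confirms that the three choice vectors are pairwise compatible with $z_{\mathcal{C}}$ but not jointly realisable, which is exactly the violation of rational separation used in the proof.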

Lemma D.2 For every choice graph $G$ and every tree decomposition $T_G$, there is a tree decomposition $T'_G$ with a collection of bags $\{X_b : b \in B\}$ such that:

• $\mathrm{cd}(G, T'_G) = \mathrm{cd}(G, T_G)$

• For every pair of adjacent bags $X_a, X_b$ in $T'_G$, $X_a \not\subseteq X_b$ and $X_b \not\subseteq X_a$.

Proof: The proof follows the arguments from the proof of Lemma 2 in Bodlaender and Koster (2011). Specifically, suppose $T_G$ is a tree decomposition of $G$ such that for some pair of adjacent bags $X_a$ and $X_b$, we have that $X_a \subseteq X_b$. Then, we contract the edge between the bags $X_a$ and $X_b$: remove the bag $X_a$ and connect all its neighbors to the bag $X_b$. Let $\tilde{T}_G$ denote the resulting tree. It can be seen that $\tilde{T}_G$ is a tree decomposition of $G$ and is defined over the collection of bags $\{X_{b'} : b' \in \tilde{B}\}$, where $\tilde{B} = B \setminus \{a\}$.

We claim that this process does not change the choice depth of the tree, i.e., $\mathrm{cd}(G, T_G) = \mathrm{cd}(G, \tilde{T}_G)$. To see this, first note that $\big| \bigcup_{S \in X_a} S \big| \le \big| \bigcup_{S \in X_b} S \big|$ and $\min_{A \in X_a} |A| \ge \min_{A \in X_b} |A|$ because $X_a \subseteq X_b$. Therefore,

\[ \Big| \bigcup_{S \in X_a} S \Big| - \min_{A \in X_a} |A| \le \Big| \bigcup_{S \in X_b} S \Big| - \min_{A \in X_b} |A|. \]

It now follows that

\[ \mathrm{cd}(G, T_G) = \max_{b' \in B} \Big( \Big| \bigcup_{S \in X_{b'}} S \Big| - \min_{A \in X_{b'}} |A| \Big) = \max_{b' \in B \setminus \{a\}} \Big( \Big| \bigcup_{S \in X_{b'}} S \Big| - \min_{A \in X_{b'}} |A| \Big) = \mathrm{cd}(G, \tilde{T}_G), \]

where the last equality follows because $\tilde{T}_G$ is defined over the collection of bags $\{X_{b'} : b' \in B \setminus \{a\}\}$. This establishes the claim.

Repeating the above process iteratively, we obtain the tree decomposition $T'_G$ with the desired properties.

We now provide the proof of Proposition 5.11.


Proof of Proposition 5.11: Let $G = (\mathrm{Pairs}, E)$ be an arbitrary choice graph associated with the collection of all pairs of products $\mathrm{Pairs} = \{\{i, j\} : i \ne j\}$. It is a standard result that the tree width of any graph $G$ is lower bounded by the minimum degree of its vertices (Lemma 4 in Bodlaender and Koster 2011). Then, it follows from Lemma D.1 that $\mathrm{tw}(G) \ge 2(n-2)$, which is the desired result.

We will now establish a lower bound on the choice depth. Consider an arbitrary tree decomposition $T_G$ with a collection of bags $\{X_b : b \in B\}$. By Lemma D.2, we can assume without loss of generality that under $T_G$, no two adjacent bags contain each other. There are two cases to consider: $|B| = 1$ and $|B| \ge 2$.

If $T_G$ has exactly one bag, then the bag must contain every set in the collection Pairs, and by definition

\[ \mathrm{cd}(G, T_G) = \max_{b \in B} \Big( \Big| \bigcup_{S \in X_b} S \Big| - \min_{A \in X_b} |A| \Big) = \Big| \bigcup_{S \in \mathrm{Pairs}} S \Big| - \min_{A \in \mathrm{Pairs}} |A| = n - 2, \]

where the last equality follows because $\bigcup_{S \in \mathrm{Pairs}} S = N$ and every set in the collection Pairs has exactly two elements.

Suppose $T_G$ has at least two bags. Following the argument in Lemma 4 in Bodlaender and Koster (2011), take a bag $X_a$ that is a leaf of $T_G$, and let $X_b$ be its (unique) neighbor. Take a set $\{i, j\} \in X_a \setminus X_b$. By Lemma D.2, such a set $\{i, j\}$ exists. It must be the case that $\{i, j\}$ does not belong to any other bag except $X_a$, because otherwise it would belong to $X_b$ by the running intersection property of the tree decomposition. Therefore, all neighbors of $\{i, j\}$ must belong to $X_a$. By Lemma D.1, this implies that

\[ \mathrm{cd}(G, T_G) = \max_{b' \in B} \Big( \Big| \bigcup_{S \in X_{b'}} S \Big| - \min_{A \in X_{b'}} |A| \Big) \ge \Big| \bigcup_{S \in X_a} S \Big| - \min_{A \in X_a} |A| = n - 2, \]

where the equality follows because there is an edge between $\{i, j\}$ and $\{i, \ell\}$, and an edge between $\{i, j\}$ and $\{\ell, j\}$, for all $\ell \notin \{i, j\}$, so $\bigcup_{S \in X_a} S = N$. Therefore, $\mathrm{cd}(G) \ge n - 2$.

D.7 Extension to Allow for Incorporation of Additional Structure on Rankings

In this section, we extend our theoretical framework of rational separation to the case when our

focus is on distributions over a constrained set of rankings, denoted by $\mathcal{P}^{\mathrm{con}}_n \subseteq \mathcal{P}_n$. As mentioned

in the main text in Section 5.3, a constrained set of rankings may be used to capture product

price/display promotions by constraining each ranking to prefer the promoted copy of the product

over its non-promoted copy. In fact, they may be used to capture even multi-level price promotions,

as detailed in the following example.


Example D.3 (Modeling Promotions) Suppose each product $i$ has $L$ different levels of promotions, indexed by $1, \ldots, L$. These levels may correspond to different levels of price discounts in the context of price promotions or different levels of display prominence in the context of display promotions. We assume that the levels are indexed such that for each product, level-$L$ promotion is preferred to level $L-1$, which is preferred to level $L-2$, and so on. To capture this setting, we enlarge the universe to $n(L+1)$ items, where each item is now represented as a tuple $(i, \ell) \in N \times \{0, 1, \ldots, L\}$, where $i$ denotes the product index and $\ell$ denotes the promotion level, with $\ell = 0$ denoting the non-promoted copy. In this case, the constrained set of rankings is defined over $n(L+1)$ items, with

\[ \mathcal{P}^{\mathrm{con}}_{n(L+1)} = \big\{ \sigma \in \mathcal{P}_{n(L+1)} : \forall\, i \in N,\ \sigma(i, L) < \sigma(i, L-1) < \cdots < \sigma(i, 1) < \sigma(i, 0) \big\}, \]

where the constraint $\sigma(i, L) < \sigma(i, L-1) < \cdots < \sigma(i, 1) < \sigma(i, 0)$ captures the requirement that level-$L$ promotion is the most preferred, followed by level $L-1$, and the non-promoted version is the least preferred. The above constraint only applies to each individual product, and we do not impose any constraints on the preference orderings of different products.
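For small instances, the constrained set of rankings in Example D.3 can be enumerated directly. The sketch below is illustrative only: the function names are hypothetical, items are encoded as tuples `(i, l)` with products indexed from 0, and a ranking is a tuple of items from most to least preferred. Because each product's $L+1$ copies must appear in one fixed relative order, the number of constrained rankings equals $(n(L+1))! / ((L+1)!)^n$, which the enumeration can be compared against.

```python
from itertools import permutations
from math import factorial

def constrained_rankings(n, L):
    """Enumerate rankings of the n*(L+1) items (i, l) in which, for every product i,
    a higher promotion level is always ranked ahead of a lower one."""
    items = [(i, l) for i in range(n) for l in range(L + 1)]
    valid = []
    for perm in permutations(items):
        position = {item: rank for rank, item in enumerate(perm)}
        if all(position[(i, l)] < position[(i, l - 1)]
               for i in range(n) for l in range(1, L + 1)):
            valid.append(perm)
    return valid

def count_formula(n, L):
    """Closed-form count: interleavings of n chains, each of length L + 1."""
    return factorial(n * (L + 1)) // factorial(L + 1) ** n

print(len(constrained_rankings(2, 1)), count_formula(2, 1))  # enumerated vs. formula
```

Enumeration is only practical for very small $n$ and $L$; it is meant as a sanity check on the definition rather than as a computational method.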

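All the definitions in Sections 2 and 3 now readily extend to the constrained set of rankings $\mathcal{P}^{\mathrm{con}}_n$. Specifically, for any set $S$ and $z_S \in S$, let $\mathcal{D}^{\mathrm{con}}_S(z_S)$ denote the set of constrained rankings in $\mathcal{P}^{\mathrm{con}}_n$ for which $z_S$ is the most preferred product in $S$; that is,

\[ \mathcal{D}^{\mathrm{con}}_S(z_S) = \big\{ \sigma \in \mathcal{P}^{\mathrm{con}}_n : \sigma(z_S) < \sigma(i) \ \ \forall\, i \in S,\ i \ne z_S \big\}. \]

Similarly, for any collection of subsets $\mathcal{A}$, let $z_{\mathcal{A}} = (z_S \in S : S \in \mathcal{A})$ and $\mathcal{D}^{\mathrm{con}}_{\mathcal{A}}(z_{\mathcal{A}}) = \bigcap_{S \in \mathcal{A}} \mathcal{D}^{\mathrm{con}}_S(z_S)$. Then, we can extend our notion of rational separation in a straightforward way. Given three subsets $A$, $B$, and $C$, we say that $A$ and $B$ are rationally separable given $C$, written as $A \perp\!\!\!\perp B \mid C$, if for all $z_A \in A$, $z_B \in B$, and $z_C \in C$, whenever $\mathcal{D}^{\mathrm{con}}_A(z_A) \cap \mathcal{D}^{\mathrm{con}}_C(z_C) \ne \emptyset$ and $\mathcal{D}^{\mathrm{con}}_B(z_B) \cap \mathcal{D}^{\mathrm{con}}_C(z_C) \ne \emptyset$, we also have that

\[ \mathcal{D}^{\mathrm{con}}_A(z_A) \cap \mathcal{D}^{\mathrm{con}}_B(z_B) \cap \mathcal{D}^{\mathrm{con}}_C(z_C) \ne \emptyset. \]

The same extension applies to any collections of subsets $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{C}$, and correspondingly, a choice graph is defined just as before.

In order to describe the results on computational complexity, we assume that for every $z_A \in A$ and $z_B \in B$, we can determine whether or not $\mathcal{D}^{\mathrm{con}}_A(z_A) \cap \mathcal{D}^{\mathrm{con}}_B(z_B) \ne \emptyset$ by querying an oracle that takes $O(1)$ operations. Then, it can be verified that all the results and proofs of Sections 4 and 5 continue to hold for the case of constrained rankings by replacing $\mathcal{D}_A(z_A)$ with $\mathcal{D}^{\mathrm{con}}_A(z_A)$ for all sets $A$ and products $z_A \in A$.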


No.  Category Shorthand  Expanded Name             # trans.  # of offer sets  Avg. offer set size
 1   beer                Beer                       380,932         69              5.28
 2   blades              Blades                      92,404         75              4.96
 3   carbbev             Carbonated Beverages       721,506         31              6.61
 4   cigets              Cigarettes                 249,668        112              6.14
 5   coffee              Coffee                     372,536         55              6.96
 6   coldcer             Cold Cereal                577,236         23              7.35
 7   deod                Deodorant                  271,286         63              6.48
 8   diapers             Diapers                    143,055         19              3.32
 9   factiss             Facial Tissue               73,806         64              4.77
10   fzdinent            Frozen Dinners/Entrees     979,936         30              7.37
11   fzpizza             Frozen Pizza               292,878         79              6.30
12   hhclean             Household Cleaners         282,981         19              7.89
13   hotdog              Hotdogs                    101,624        102              6.41
14   laundet             Laundry Detergent          238,163         84              6.79
15   margbutr            Margarine/Butter           140,969         29              7.52
16   mayo                Mayonnaise                  97,282         81              5.62
17   milk                Milk                       240,691         57              6.28
18   mustketc            Mustard                    134,800         48              7.12
19   paptowl             Paper Towels                82,636         65              6.25
20   peanbutr            Peanut Butter              108,770         82              6.01
21   saltsnck            Salt Snacks                736,148         39              7.08
22   shamp               Shampoo                    290,429         73              6.40
23   soup                Soup                       905,541         24              7.83
24   spagsauc            Spaghetti/Italian Sauce    276,860         52              6.81
25   sugarsub            Sugar Substitutes           53,834        104              5.92
26   toitisu             Toilet Tissue              112,788         37              6.32
27   toothbr             Toothbrushes               197,676        135              6.16
28   toothpa             Toothpaste                 238,271         68              6.40
29   yogurt              Yogurt                     499,203         66              5.92

Table EC.1 Summary statistics of the data used in the case study. For each category, we report the total number of purchase transactions, the number of offer sets, and the average of the offer set sizes (weighted by the corresponding number of sales), observed in the first two weeks of the year 2007 across all the stores in the dataset.

Appendix E: Additional Implementation Details and Discussion of Numerical Results

In this section, we present the omitted details for the case study in Section 6.

E.1 Summary statistics of the data

Table EC.1 reports the number of transactions, number of offer sets, and the average offer set size,

for each product category.

E.2 Details for computing the total losses for the MNL and LCMNL models

For the MNL model, we obtained the total loss for a category by solving the following optimization problem:

$$\text{Total loss under MNL} \;=\; \min_{\beta,\gamma}\; -\frac{1}{|T|} \sum_{(S,P)\in\mathcal{M}} M_{S,P} \sum_{j\in S} f_{j,S,P}\, \log\!\left( \frac{1}{f_{j,S,P}} \cdot \frac{e^{\beta_j+\gamma\cdot \mathbb{1}[j\in P]}}{\sum_{i\in S} e^{\beta_i+\gamma\cdot \mathbb{1}[i\in P]}} \right),$$

where we used the following utility specification: for each product $j \in \{1,\dots,n\}$, $\text{Utility}_j = \beta_j + \gamma\cdot\mathbb{1}[j \text{ is on promotion}] + \varepsilon_j$, where $(\beta_1,\dots,\beta_n)$ are the alternative-specific utility parameters, $\gamma$ is the additional utility from the product being on promotion, and the $\varepsilon_j$ are i.i.d. standard Gumbel random variables. The above optimization problem can be shown to be a convex program (Train 2009), and hence we solved it by using off-the-shelf convex optimization techniques.
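As an illustration of the objective being minimized, here is a minimal sketch that evaluates the MNL total loss for given parameters. It assumes that $|T|$ equals the total number of transactions $\sum_{(S,P)} M_{S,P}$, that products are indexed from 0, and that the data are supplied as hypothetical tuples `(S, P, M, f)`; the function names and the toy data are illustrative and not from the paper.

```python
import numpy as np

def mnl_probs(beta, gamma, S, P):
    """MNL choice probabilities over offer set S when the products in P are promoted."""
    u = np.array([beta[j] + gamma * float(j in P) for j in S])
    w = np.exp(u - u.max())          # subtract the max for numerical stability
    return w / w.sum()

def mnl_total_loss(beta, gamma, offers):
    """Transaction-weighted KL-divergence between observed and MNL choice fractions.

    offers: list of (S, P, M, f), with S a list of products, P the set of promoted
    products, M the number of transactions, and f the observed fractions aligned with S.
    """
    total_transactions = sum(M for _, _, M, _ in offers)   # assumed to equal |T|
    loss = 0.0
    for S, P, M, f in offers:
        f = np.asarray(f, dtype=float)
        p = mnl_probs(beta, gamma, S, P)
        chosen = f > 0                                      # 0 * log 0 is taken as 0
        loss += M * np.sum(f[chosen] * np.log(f[chosen] / p[chosen]))
    return loss / total_transactions

# Toy data generated from a known MNL model, so the loss at the true parameters is zero.
beta_true, gamma_true = np.array([0.0, 0.5, -0.3]), 0.8
offers = [(S, P, M, mnl_probs(beta_true, gamma_true, S, P))
          for S, P, M in [([0, 1, 2], {1}, 100), ([0, 2], set(), 50), ([1, 2], {2}, 70)]]
print(mnl_total_loss(beta_true, gamma_true, offers))   # essentially zero
print(mnl_total_loss(np.zeros(3), 0.0, offers))        # strictly positive
```

A function of this form could be handed to a general-purpose convex solver; which solver the authors used is not stated in the text.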

In a similar manner, we computed the total loss for the LCMNL model by solving the following optimization problem for each category:

$$\text{Total loss under LCMNL} \;=\; \min_{\alpha,\beta,\gamma}\; -\frac{1}{|T|} \sum_{(S,P)\in\mathcal{M}} M_{S,P} \sum_{j\in S} f_{j,S,P}\, \log\!\left( \frac{1}{f_{j,S,P}} \cdot \sum_{k=1}^{15} \alpha_k\, \frac{e^{\beta^k_j+\gamma^k\cdot \mathbb{1}[j\in P]}}{\sum_{i\in S} e^{\beta^k_i+\gamma^k\cdot \mathbb{1}[i\in P]}} \right)$$

$$\text{s.t.} \quad \sum_{k=1}^{15} \alpha_k = 1, \qquad \alpha_k \geq 0 \ \ \forall\, k,$$

where we assumed that the population is described by a mixture of MNL models consisting of 15 classes, with $\alpha_k$ denoting the fraction of customers in class $k$; $\beta^k = (\beta^k_1,\dots,\beta^k_n)$, the parameters of the corresponding MNL model of customers in class $k$; and $\gamma^k$, the additional utility to class $k$ customers from the product being on promotion. The above optimization problem is non-convex in the model parameters $\alpha$, $\beta$, and $\gamma$. However, we obtained an approximate solution by using the popular expectation-maximization (EM) technique; see Train (2009) for details.
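The text does not spell out the EM iterations, so the following is only a sketch of two generic ingredients: the mixture choice probability inside the logarithm above, and the E-step responsibilities that a standard EM for finite mixtures would compute. It uses two classes instead of 15 for brevity, and all function names and parameter values are hypothetical.

```python
import numpy as np

def class_probs(betas, gammas, S, P):
    """Row k holds the MNL choice probabilities of class k over offer set S.

    betas has shape (K, n); gammas has shape (K,); S is a list of product indices.
    """
    promo = np.array([float(j in P) for j in S])
    u = betas[:, S] + gammas[:, None] * promo[None, :]
    w = np.exp(u - u.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)

def lcmnl_probs(alpha, betas, gammas, S, P):
    """Mixture choice probabilities: sum over classes of alpha_k times class-k MNL."""
    return alpha @ class_probs(betas, gammas, S, P)

def e_step(alpha, betas, gammas, S, P):
    """Posterior class probabilities given each possible choice in S.

    Column c gives, for the c-th product of S, the probability that a customer
    who chose it belongs to each class (generic mixture-model E-step).
    """
    joint = alpha[:, None] * class_probs(betas, gammas, S, P)
    return joint / joint.sum(axis=0, keepdims=True)

alpha = np.array([0.3, 0.7])
betas = np.array([[0.0, 1.0, -1.0], [0.5, 0.0, 0.2]])
gammas = np.array([0.4, 1.0])
print(lcmnl_probs(alpha, betas, gammas, [0, 2], {2}))
print(e_step(alpha, betas, gammas, [0, 2], {2}))
```

The M-step, which re-estimates the class weights and the per-class MNL parameters from these responsibilities, is omitted here.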

The parametric losses under the MNL and LCMNL models were then obtained by subtracting

the rationality loss from their respective total losses.

E.3 Breakdown of parametric losses

Figure EC.1 presents the breakdown of the parametric losses by the number K of latent classes.

The length of the first (leftmost) bar represents the parametric loss under the 15-class LCMNL

model, which is the most complex parametric model fitted in our experiment. The subsequent

bars show incremental losses incurred from fitting progressively “simpler” parametric models. So,

for instance, the length of the second bar represents the difference in the parametric losses under

the 10-class and 15-class LCMNL models; consequently, the sum of the lengths of the first two

bars represents the parametric loss under the 10-class LCMNL model. The product categories are

arranged in the descending order of their respective parametric losses under the MNL model.
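The bar construction described above is plain arithmetic, sketched below under the assumption that the total loss of each fitted model and the rationality loss of the category are already available. The numbers are hypothetical and are not taken from the dataset.

```python
import math

def parametric_loss_breakdown(rationality_loss, total_loss_by_k):
    """Bar lengths for one category, from the richest to the simplest model.

    total_loss_by_k maps the number of latent classes K to the fitted total loss
    (K = 1 is the MNL model). The first bar is the parametric loss of the richest
    model; each later bar is the extra loss from moving to the next simpler model.
    """
    ks = sorted(total_loss_by_k, reverse=True)            # e.g. [15, 10, 5, 1]
    bars = [total_loss_by_k[ks[0]] - rationality_loss]
    for richer, simpler in zip(ks, ks[1:]):
        bars.append(total_loss_by_k[simpler] - total_loss_by_k[richer])
    return bars

# Hypothetical losses for a single category.
rationality = 0.020
totals = {1: 0.034, 5: 0.026, 10: 0.024, 15: 0.023}
bars = parametric_loss_breakdown(rationality, totals)
print(bars)

# The bars add up to the parametric loss under the MNL model.
print(math.isclose(sum(bars), totals[1] - rationality))
```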

We make the following observations from Figure EC.1. First, it is immediately apparent that

conditioned on restricting oneself to the RUM class, the loss incurred from assuming that the

utilities have independent Gumbel distributions (the “MNL assumption”) varies widely across

product categories, with the highest loss incurred for milk and the lowest loss incurred for sugar
substitutes. In other words, the “value” of making more complex distributional assumptions varies
across categories. Second, we note that the rightmost bar is generally longer than the second and third


bars. This observation implies that as we add more complexity in the form of additional latent

classes, we observe the greatest reductions in parametric losses when we go from K = 1 (the MNL

model) to K = 5, with the reductions tapering off for K = 10 and K = 15. This finding is intuitive,

indicating that the marginal benefit from additional latent classes goes down. Finally, for yogurt

and cigarette categories, the parametric losses under the 15-class LCMNL model (represented by

the length of the first bar) are significantly higher than those for the other categories, indicating

a greater degree of heterogeneity, as measured by the number of latent classes, in the purchase

behavior for these categories.

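[Figure EC.1: horizontal stacked bar chart with one bar per product category. Horizontal axis: KL-divergence loss ($10^{-3}$), from 0 to 16. Category labels, in the order they appear on the category axis: Sugar Substitute, Toothpaste, Deodorant, Diaper, Blade, Shampoo, Household Cleaner, Toilet Tissue, Facial Tissue, Frozen Dinner/Entrée, Cold Cereal, Beer, Soup, Toothbrush, Margarine/Butter, Carbonated Beverage, Mayonnaise, Salt Snack, Spaghetti/Italian Sauce, Paper Towel, Mustard, Peanut Butter, Laundry Detergent, Cigarette, Frozen Pizza, Hot Dog, Coffee, Yogurt, Milk. Legend: LC15 param. loss; LC10 param. loss minus LC15 param. loss; LC5 param. loss minus LC10 param. loss; MNL param. loss minus LC5 param. loss.]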

Figure EC.1 Breakdown of parametric losses: For each product category, the first (leftmost) bar shows the

parametric loss under the 15-class LCMNL model (dotted blue bars), followed by the difference

between the parametric losses under the 10-class and 15-class LCMNL models (solid red bars). The

difference represents the incremental loss from using a 10-class model instead of a 15-class model.

The figure also shows the difference between the losses under the 5-class and 10-class LCMNL

models (patterned green bars), and the rightmost bar shows the difference between the losses under

the 5-class LCMNL and simple MNL models (checkered orange bars).


E.4 Going beyond the RUM class

As noted in Section 6.3.3, our approach allows us to identify product categories for which it is

necessary to go beyond the RUM class to obtain acceptable performance. To obtain a model class

outside the RUM class, we extended the generalized attraction model (GAM) class. This model

class was proposed by Gallego et al. (2014) as a generalization of the basic attraction model

(BAM), of which the MNL model is a special case, to address the issue of overestimation of demand

recapture in revenue management (RM) applications. With each product j, it associates a “shadow”

attraction wj ≥ 0, in addition to a basic attraction vj ≥ 0, such that the probability P(j|S) that j

will be chosen from offer set S is given by

$$\mathbb{P}(j \mid S) = \frac{v_j}{V_0(S) + \sum_{i \in S} v_i}, \quad \text{where } V_0(S) = v_0 + \sum_{i \notin S} w_i,$$

v0 is the basic attraction of the no-purchase option, indexed by 0, and S is implicitly assumed to

include the no-purchase option. The key difference from a BAM model is the presence of shadow

weights for products that continue to influence the choice behavior of a customer through V0(S)

even when not offered. Setting wj = 0 for all j results in the BAM model and wj = vj for all j results

in the independent demand model (IDM), in which the demand for a product is independent of

the offered set.

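We note that the GAM model is outside the RUM class whenever $w_j > v_j$ for some product $j$. To see this, note that for two products $i, j \in S$,

$$\frac{v_i}{\mathbb{P}(i \mid S)} - \frac{v_i}{\mathbb{P}(i \mid S \setminus \{j\})} = V_0(S) + v_j - V_0(S \setminus \{j\}) = v_j - w_j < 0 \;\Longrightarrow\; \mathbb{P}(i \mid S) > \mathbb{P}(i \mid S \setminus \{j\}), \qquad (8)$$

which violates rationality because rationality stipulates that the choice probability of a product that remains in the offering cannot decrease when another product is dropped from the offering. We make use of this property below.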
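The identity in (8) can be checked numerically on a toy instance. The sketch below assumes two products with hypothetical attraction values and a no-purchase attraction of 1; the function name is illustrative.

```python
def gam_prob(j, S, v, w, v0=1.0):
    """P(j|S) under the GAM: v_j / (V0(S) + sum_{i in S} v_i),
    where V0(S) = v0 + sum of shadow attractions w_i over products not in S."""
    V0 = v0 + sum(w[i] for i in v if i not in S)
    return v[j] / (V0 + sum(v[i] for i in S))

# Product 2 has a shadow attraction larger than its basic attraction (w_2 > v_2).
v = {1: 1.0, 2: 0.5}
w = {1: 0.0, 2: 2.0}

p_with = gam_prob(1, {1, 2}, v, w)      # 1 / (1 + 1.5) = 0.4
p_without = gam_prob(1, {1}, v, w)      # 1 / (1 + 2 + 1) = 0.25
print(p_with, p_without)

# Left-hand side of (8) equals v_2 - w_2 = -1.5, so dropping product 2
# lowers the choice probability of product 1.
print(v[1] / p_with - v[1] / p_without, v[2] - w[2])

# With w_2 = 0 (the BAM case) the probability of product 1 rises when 2 is dropped.
w_bam = {1: 0.0, 2: 0.0}
print(gam_prob(1, {1, 2}, v, w_bam), gam_prob(1, {1}, v, w_bam))
```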

For our purposes, we extend the GAM model into a latent-class GAM (LC-GAM) model as follows. We assume that the population is composed of $K$ market segments with the choice behavior of each segment described by a different GAM model. With segment $k$ and product $j$, we associate the following set of parameters: basic and shadow attractions $v_{jk} = e^{\beta_{jk}}$ and $w_{jk} = e^{\gamma_{jk}}$ when $j$ is not promoted and $v'_{jk} = e^{\beta'_{jk}}$ and $w'_{jk} = e^{\gamma'_{jk}}$ when $j$ is promoted, respectively. Because our dataset does not contain a no-purchase option, we designate one of the $n$ products as the ‘default’ product $d_k$ for each segment $k$. We let $\alpha_k \geq 0$, such that $\sum_{k=1}^{K} \alpha_k = 1$, denote the weight of segment $k$. Then, the probability that product $j$ is chosen when the offering is $(S,P)$ is given by

$$\mathbb{P}(j \mid S,P) = \sum_{k=1}^{K} \alpha_k \cdot \frac{e^{\beta_{jk} + (\beta'_{jk} - \beta_{jk})\cdot \mathbb{1}[j \in P]}}{V_k(S,P) + \sum_{i \in S \setminus P} e^{\beta_{ik}} + \sum_{i \in P \setminus S} e^{\beta'_{ik}}}, \qquad (9)$$

where

$$V_k(S,P) = \begin{cases} 1 + \sum_{i \notin S \setminus P} e^{\gamma_{ik}} + \sum_{i \notin P} e^{\gamma'_{ik}}, & \text{if } d_k \in S, \\ 0, & \text{otherwise,} \end{cases}$$

denotes the weight of the default product, which includes the shadow weights of the products not offered.

Under the above LC-GAM model, we computed the total loss by solving the following optimization problem:

$$\text{Total loss under LC-GAM} \;=\; \min_{\alpha,\beta,\beta',\gamma,\gamma'}\; -\frac{1}{|T|} \sum_{(S,P)\in\mathcal{M}} M_{S,P} \sum_{j\in S} f_{j,S,P}\, \log\!\left( \frac{1}{f_{j,S,P}} \cdot \mathbb{P}(j \mid S,P,\beta,\beta',\gamma,\gamma') \right) \qquad (10)$$

$$\text{s.t.} \quad \sum_{k=1}^{K} \alpha_k = 1, \qquad \alpha_k \geq 0 \ \ \forall\, k,$$

with $K = 5$, where we abuse notation to let $\mathbb{P}(j \mid S,P,\beta,\beta',\gamma,\gamma')$ denote the choice probability expression in (9) with the dependence on the model parameters made explicit. As for the LC-MNL model, the above optimization problem is non-convex in the parameters; therefore, we used the EM algorithm to solve it.

For each category, we verified that the LC-GAM model obtained from solving (10) is indeed

outside the RUM class by ensuring that the fitted model parameters satisfy at least one of the

following three conditions for some mixture component k and some product j: (a) wjk > vjk, (b)

w′jk > v′jk, or (c) vjk > v′jk. Each of the first two conditions implies that the resulting GAM for

segment k is outside the RUM class because of the observation in (8) above. The last condition

also implies that the resulting GAM is outside the RUM class because rationality stipulates that

promotion never decreases the attractiveness of a product.
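The three conditions can be checked directly on the fitted exponents, because $e^x$ is increasing. The following is a minimal sketch, assuming the fitted parameters are stored as hypothetical arrays of shape (products, segments); the function name and the example values are illustrative.

```python
import numpy as np

def non_rum_witnesses(beta, gamma, beta_promo, gamma_promo):
    """Return (product j, segment k, condition) triples that certify a non-RUM fit.

    Since v = exp(beta), w = exp(gamma), v' = exp(beta_promo), w' = exp(gamma_promo),
    comparing attractions is the same as comparing exponents.
    """
    checks = {
        "a": gamma > beta,              # w_jk  > v_jk
        "b": gamma_promo > beta_promo,  # w'_jk > v'_jk
        "c": beta > beta_promo,         # v_jk  > v'_jk
    }
    return [(int(j), int(k), label)
            for label, mask in checks.items()
            for j, k in zip(*np.nonzero(mask))]

# Hypothetical fitted exponents for 2 products and 1 segment.
beta = np.array([[0.0], [1.0]])
gamma = np.array([[0.5], [0.0]])
beta_promo = np.array([[0.2], [0.8]])
gamma_promo = np.array([[0.1], [0.3]])

witnesses = non_rum_witnesses(beta, gamma, beta_promo, gamma_promo)
print(witnesses)   # [(0, 0, 'a'), (1, 0, 'c')]
```

A non-empty list means that at least one condition holds for some segment and product, which is the check described in the text.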

Figure 4 (in Section 6.3.3 on p. 32) overlays the loss from fitting the 5-class LC-GAM model onto Figure 3 (in Section 6.3.2 on p. 31), which plots the rationality loss against the total loss under the single-class MNL model. The plot shows how fitting the LC-GAM model allows us to breach the LoR for most of the 29 product categories. In fact, we observe that the 5-class LC-GAM obtains acceptable performance (loss < 0.03) for the Margarine/Butter category, which belongs to ‘Region 3’ and for which it is therefore necessary to go beyond the RUM class to obtain acceptable performance.

Behaviors that are inconsistent with RUM models: Our analysis above shows that relax-

ing the RUM assumption can yield better fitting models. This finding is consistent with existing

literature that has identified consumer choice behaviors that violate the RUM assumption:


• Variety-seeking: van Trijp (1995) defined variety seeking as “the biased behavioral response by

some decision making unit to a specific item relative to previous responses within the same

behavioral category, due to the utility inherent in variation per se, independent of the instrumen-

tal or functional value of the alternatives or items”. This phenomenon is well documented in the

marketing literature; see, for example, Hoyer and Ridgway (1984), Kahn and Lehmann (1991),

Holbrook and Hirschman (1982), van Trijp et al. (1996). The frequent switching of brands from

one week to the next due to variety seeking often results in non-transitive preferences for each

customer and a violation of the RUM assumption in the aggregate.

• Product confusion due to choice overload: In a landmark study, Iyengar and Lepper (2000) showed

that people are more likely to purchase gourmet jams or chocolates when offered a limited array

of 6 choices rather than a more extensive array of 24 or 30 choices. This means that as we have

more choices, the probability of no-purchase can increase due to product confusion; see Chernev

et al. (2015) for the most recent comprehensive meta-analysis and references. This increase in

the choice probability of an existing item when new items are added is inconsistent with the

RUM principle.

• Synergistic products: Davis et al. (2011) discussed synergistic product categories in the context

of nested logit models, where adding a product to a nest increases the probability of purchasing

an existing product in the nest. A real-world example is the inclusion of “loss leaders” to drive

traffic to a particular category. Another example is that of “halo” products (such as iPhones)

that increase the sales of other offered products (such as Mac computers). Synergistic products

can be captured using a nested logit model with the nest dissimilarity parameters taking values

larger than one. Of course, when the nest dissimilarity parameters are larger than one, then the

model instances cease to be a part of the RUM family; see Davis et al. (2011) for details.

The LC-GAM models can capture the choice behaviors specified in the second and third bullets.

Cost of going beyond the RUM family: Despite the benefits described above, given the

current state of the literature, leaving the “well traversed” RUM class requires caution due to the

following potential downside risks:

1. Lack of a de facto non-RUM model class: The existing literature in marketing, econometrics,

and operations management does not offer a “go-to” model outside of the RUM model class.

In fact, an arbitrary non-RUM model (such as the 5-class LC-GAM) can provide a worse fit

than the RUM class. Therefore, careful consideration is required to assess whether a candidate

non-RUM model will indeed provide a better fit than the RUM class before adopting it.

2. Lack of efficient computation schemes to fit the models: Techniques for estimating the param-

eters of non-RUM choice models have received little attention from researchers so far. Conse-

quently, data processing techniques and model specifications are not well understood yet.


3. Lack of efficient computation schemes for decision-making: General non-RUM models also lack

efficient techniques for solving the assortment, pricing, and inventory optimization problems.

Our work has shown the boundaries of the RUM class, and the potential benefits of going beyond

the RUM models. Thus, it lays the groundwork for future work to address the above challenges.

E.5 Factors that might influence rationality loss

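For each product category, we define its market share concentration as follows:

$$\log|N| + \sum_{j\in N} m_j \log m_j, \quad \text{where } m_j = \frac{\sum_{(S,P)\in\mathcal{M}} f_{j,S,P}\cdot M_{S,P}}{\sum_{(S,P)\in\mathcal{M}} M_{S,P}},$$

where $m_j$ denotes the fraction of the market captured by product $j$ and the market concentration is the KL-divergence between the market share distribution and the uniform distribution over the $|N|$ products, so that it equals zero when all products have equal shares and $\log|N|$ when a single product captures the entire market. In addition,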

for each category, we define a “hedonic” indicator variable that is equal to 1 if the purchases from

that category are driven by its hedonic (or sensory) attributes (Baltas et al. 2017). We manually

classified each product category to be hedonic or not, by applying the following rule: a category is

hedonic if it is associated with food/snack (excl. condiments), drinks, or cigarettes; otherwise, it is

not. Of the 29 categories, our rule classified the following 14 product categories as hedonic: beer,

carbonated beverages, cigarettes, coffee, cold cereal, frozen dinner, frozen pizza, hot dog, milk,

peanut butter, salt snacks, soup, spaghetti sauce, and yogurt. The remaining 15 categories were

classified as non-hedonic. The classification we obtained is supported by the existing literature for

the following categories: carbonated beverages, paper towel, peanut butter (Crowley et al. 1992),

toothpaste, and yogurt (Givon 1984, Kahn et al. 1986, Baltas et al. 2017).
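The market concentration variable defined at the start of this subsection can be computed as in the sketch below. It takes the KL-divergence description in the text as the definition, that is, $\sum_j m_j \log(m_j |N|)$, and assumes the data are given as hypothetical pairs of a transaction count and a dictionary of observed choice fractions; the function name is illustrative.

```python
import math

def market_concentration(offers, products):
    """KL-divergence of the market share distribution from the uniform distribution.

    offers: list of (M, f), where M is the number of transactions for an offer and
    f maps each offered product to its observed choice fraction under that offer.
    """
    total = sum(M for M, _ in offers)
    shares = {j: sum(M * f.get(j, 0.0) for M, f in offers) / total for j in products}
    # Terms with m_j = 0 contribute 0 (the limit of m log m as m -> 0).
    return math.log(len(products)) + sum(
        m * math.log(m) for m in shares.values() if m > 0)

products = [1, 2]
print(market_concentration([(10, {1: 0.5, 2: 0.5})], products))   # equal shares
print(market_concentration([(10, {1: 1.0, 2: 0.0})], products))   # one product takes all
print(market_concentration([(30, {1: 1.0}), (10, {2: 1.0})], products))
```

Equal shares give a concentration of zero, and a single dominant product gives $\log|N|$, which matches the reading that brand-loyal categories have high concentration.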

We regressed the rationality loss (loss) against the market concentration (mconc) and the hedonic indicator (hedi) variables. Table EC.2 reports the results from fitting three different models:

$\text{loss}_j = \beta_{01} + \beta_{11}\cdot\text{mconc}_j + \varepsilon_j$ (Model 1),

$\text{loss}_j = \beta_{02} + \beta_{22}\cdot\text{hedi}_j + \varepsilon_j$ (Model 2), and

$\text{loss}_j = \beta_{03} + \beta_{13}\cdot\text{mconc}_j + \beta_{23}\cdot\text{hedi}_j + \varepsilon_j$ (Model 3).

We observe that all the estimated coefficients are statistically significant at the 5% level.

We note that rationality loss is negatively correlated with the market concentration variable

(Model 1). As noted in Section 6.3.4, categories with lower market concentration, such as yogurt,

have frequent product introductions, and customers often exhibit variety seeking behavior, resulting

in higher rationality loss because of the non-transitivity of preferences caused by frequent brand

switching. Similarly, for categories with higher market concentration, such as diapers, customers

tend to be brand loyal, generally resulting in transitive preferences and lower rationality loss.

Model 2 shows that rationality loss is positively correlated with the hedonic indicator variable.

Existing work in marketing has shown that consumers exhibit complex choice behaviors when


purchasing hedonic products. For example, consumer self-control issues may lead to a purchase of

smaller quantities of hedonic products at a higher per unit price (e.g., chocolate); loss aversion and

anticipation utility of hedonic products are higher; consumption of hedonic products can evoke

a sense of guilt, and hedonic products are more likely to exhibit preference reversal, which is a

phenomenon that violates the standard utility theory; see, for example, Wertenbroch (1998), Dhar

and Wertenbroch (2000), O’curry and Strahilevitz (2001), Kivetz and Simonson (2002a,b), Okada

(2005) and the references therein. These complex choice behaviors are not sufficiently well-captured

by transitive preference orderings, leading to high rationality losses.

Finally, Model 3 shows that using the two variables together explains the variations in the rationality loss better, with a substantially higher $R^2$ value (0.360, compared with 0.181 for Model 1 and 0.240 for Model 2). The coefficients also remain statistically significant, indicating that the rationality loss of a category is driven by both the market concentration and hedonic attributes of the product category.

Our findings indicate that going beyond the RUM class may be particularly fruitful for product

categories with low market concentrations (or high variety seeking) and hedonic attributes.

Table EC.2 Regression Results

Dependent variable: Rationality loss (KL-divergence ($10^{-2}$))

| | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| Market concentration | −2.439∗∗ (0.999) | | −2.011∗∗ (0.914) |
| Hedonic indicator | | 1.455∗∗∗ (0.498) | 1.274∗∗ (0.473) |
| Constant | 3.959∗∗∗ (0.559) | 2.046∗∗∗ (0.346) | 3.131∗∗∗ (0.590) |
| Observations | 29 | 29 | 29 |
| $R^2$ | 0.181 | 0.240 | 0.360 |
| Adjusted $R^2$ | 0.151 | 0.212 | 0.310 |
| Residual Std. Error | 1.391 (df = 27) | 1.340 (df = 27) | 1.254 (df = 26) |
| F Statistic | 5.964∗∗ (df = 1; 27) | 8.539∗∗∗ (df = 1; 27) | 7.297∗∗∗ (df = 2; 26) |

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01. The values in the parentheses next to the coefficients are the standard errors of the parameter estimates.
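As a consistency check on the summary rows of Table EC.2, the adjusted $R^2$ and the F statistic can be recomputed from the reported $R^2$, the 29 observations, and the number of regressors, using the standard ordinary least squares identities. This is only a sketch; small discrepancies are expected because the reported $R^2$ values are rounded to three decimals.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a regression with n observations and p regressors plus a constant."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def f_statistic(r2, n, p):
    """Overall F statistic with (p, n - p - 1) degrees of freedom."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))

n = 29
# (R^2, number of regressors) as reported for Models 1, 2, and 3.
for name, r2, p in [("Model 1", 0.181, 1), ("Model 2", 0.240, 1), ("Model 3", 0.360, 2)]:
    print(name, round(adjusted_r2(r2, n, p), 3), round(f_statistic(r2, n, p), 2))
```

The recomputed values agree with the table up to rounding, for example an adjusted $R^2$ of about 0.151 and an F statistic of about 5.97 for Model 1.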


E.6 Robustness of our results under different loss metrics

[Figure EC.2: scatter plot with one point per product category, labeled by category shorthand. Horizontal axis: Total loss (% MAPE), from 0 to 80. Vertical axis: Rationality loss (% MAPE), from 0 to 80. Region labels: Region 1: MNL acceptable; Region 2; Region 3: RUM unacceptable.]

Figure EC.2 Robustness across different loss metrics: The figure plots the rationality loss against the

total losses under the MNL model, both reported in terms of the mean absolute percentage error

(MAPE). The plot region is partitioned assuming a MAPE of < 45% is acceptable. For categories

in Region 1, the MNL model provides acceptable performance. Categories in Region 2 can benefit

from fitting more complex RUM models. For categories in Region 3, one must go beyond RUM

class to obtain acceptable performance.

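Number of product categories, by classification under KL-divergence (rows) and under MAPE (columns):

| Classification under KL-divergence | MAPE: Region 1 | MAPE: Region 2 | MAPE: Region 3 | Total |
|---|---|---|---|---|
| Region 1 | 12 | 1 | 1 | 14 |
| Region 2 | 3 | 1 | 1 | 5 |
| Region 3 | 3 | 1 | 6 | 10 |
| Total | 18 | 3 | 8 | |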

Table EC.3 A contingency table showing the classification of the 29 product categories under the KL-divergence

and MAPE loss metric. The table is based on the classification of each product category into three

different regions in Figure 3 and Figure EC.2.

To test the robustness of our results, we use another loss metric based on the mean absolute percentage error (MAPE). Figure EC.2 shows a scatter plot of the rationality loss against the total loss under the MNL model, both measured by MAPE. We use 45% as the cutoff threshold, which, as in Figure 3, partitions the plot into three regions. It turns out that the classification of product categories into the three regions is quite similar under the KL-divergence and MAPE loss metrics. Table EC.3 shows the number of product categories that fall into each region under the two metrics. As seen from Table EC.3, there are 19 = 12 + 1 + 6 product categories whose classifications under the KL-divergence and MAPE loss metrics are exactly the same, representing 19/29 ≈ 66% of the categories that we considered in our dataset. This shows that our method is fairly robust to the choice of loss metric.
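The agreement rate quoted above is the share of categories on the diagonal of the contingency table. A minimal sketch of that arithmetic, using the counts of Table EC.3 (the function name is illustrative):

```python
def agreement(table):
    """Count and share of categories placed in the same region by both classifications.

    table[r][c] is the number of categories in region r under the first
    classification and region c under the second.
    """
    total = sum(sum(row) for row in table)
    same = sum(table[i][i] for i in range(len(table)))
    return same, total, same / total

# Rows: regions under KL-divergence; columns: regions under MAPE (Table EC.3).
table_ec3 = [[12, 1, 1],
             [3, 1, 1],
             [3, 1, 6]]
same, total, share = agreement(table_ec3)
print(same, total, round(100 * share))   # 19 29 66
```

The remaining 10 categories lie off the diagonal; the largest off-diagonal cells are the three categories that move from Region 2 to Region 1 and the three that move from Region 3 to Region 1 under MAPE.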

E.7 Robustness of our results across time

This section shows that the classification of product categories presented in Section 6.3.2 is robust

across time. In Section 6.3.2, we classify the product categories into three different regions for

the purpose of model selection. For a given threshold of acceptable performance loss, Region 1

comprises categories for which the MNL model is acceptable, Region 2 comprises the categories

for which fitting a more complex RUM model is required to obtain acceptable performance, and

Region 3 comprises categories for which the RUM model is unacceptable. Figure 3 illustrates this

classification. The rationality and total losses reported in the figure were computed using sales

transactions data from weeks 1 and 2 of the calendar year 2007.

In order to test the robustness of this classification over time, we ran the same analysis using

data from weeks 27 and 28 (about six months later) of the same year 2007. Figure EC.3 shows the

corresponding classification of product categories. It turns out that the classification of products

into the three regions is quite similar for both time periods. Table EC.4 shows the number of

product categories that fall into each region for the two time periods. As seen from Table EC.4,

there are 23 = 13 + 2 + 8 product categories that are classified into the same regions for both

time periods, representing 79% = 23/29 of the categories considered in our dataset. This shows

that our classification remains fairly robust over time.

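Number of product categories, by classification during weeks 27-28 (rows) and during weeks 1-2 (columns):

| Classification during weeks 27-28 | Weeks 1-2: Region 1 | Weeks 1-2: Region 2 | Weeks 1-2: Region 3 | Total |
|---|---|---|---|---|
| Region 1 | 13 | 1 | 0 | 14 |
| Region 2 | 1 | 2 | 2 | 5 |
| Region 3 | 1 | 1 | 8 | 10 |
| Total | 15 | 4 | 10 | |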

Table EC.4 A contingency table showing the classification of the 29 product categories during the time periods

weeks 1-2 and weeks 27-28. The table is based on the classification of the product categories into

three different regions in Figures 3 and EC.3.


[Figure EC.3: scatter plot with one point per product category, labeled by category shorthand. Horizontal axis: Total loss (KL-divergence ($10^{-2}$)), from 0 to 8. Vertical axis: Rationality loss (KL-divergence ($10^{-2}$)), from 0 to 8. Region labels: Region 1: MNL acceptable; Region 2; Region 3: RUM unacceptable.]

Figure EC.3 Robustness across time: Rationality and total losses computed using transactions data from

weeks 27-28 of the calendar year 2007. The classification of the product categories into the three

regions is similar to the one computed using transactions data from weeks 1-2 of the same year, as

shown in Figure 3.

References

Baltas, G., F. Kokkinaki, A. Loukopoulou. 2017. Does variety seeking vary between hedonic and utilitarian products? The role of attribute type. Journal of Consumer Behaviour.

Bodlaender, H. L., A. M. C. A. Koster. 2011. Treewidth computations II. Lower bounds. Information and

Computation 209(7) 1103–1119.

Bodlaender, H. L., T. Wolle, A. M. C. A. Koster. 2006. Contraction and treewidth lower bounds. Journal

of Graph Algorithms and Applications 10(1) 5–49.

Chernev, A., U. Bockenholt, J. Goodman. 2015. Choice overload: A conceptual review and meta-analysis.

Journal of Consumer Psychology 25(2) 333–358.

Crowley, A. E., E. R. Spangenberg, K. R. Hughes. 1992. Measuring the hedonic and utilitarian dimensions

of attitudes toward product categories. Marketing Letters 3(3) 239–249.

Davis, J., G. Gallego, H. Topaloglu. 2011. Assortment optimization under variants of the nested logit model.

Working Paper, Cornell University.

Dhar, R., K. Wertenbroch. 2000. Consumer choice between hedonic and utilitarian goods. Journal of


Gallego, G., R. Ratliff, S. Shebalov. 2014. A general attraction model and sales-based linear program for

network revenue management under customer choice. Operations Research 63(1) 212–232.

Givon, M. 1984. Variety seeking through brand switching. Marketing Science 3(1) 1–22.


Holbrook, M. B., E. C. Hirschman. 1982. The experiential aspects of consumption: Consumer fantasies, feelings, and fun. Journal of Consumer Research 9(2) 132–140.

Hoyer, W. D., N. M. Ridgway. 1984. Variety seeking as an explanation for exploratory purchase behavior:

A theoretical model. Thomas C. Kinnear, ed., Advances in Consumer Research, vol. 11. Association

for Consumer Research, Provo, UT, 114–119.

Iyengar, S. S., M. R. Lepper. 2000. When choice is demotivating: Can one desire too much of a good thing? Journal of Personality and Social Psychology 79(6) 995.

Kahn, B. E., M. U. Kalwani, D. G. Morrison. 1986. Measuring variety-seeking and reinforcement behaviors

using panel data. Journal of Marketing Research 23(2) 89–100.

Kahn, B. E., D. R. Lehmann. 1991. Modeling choice among assortments. Journal of Retailing 67(3) 274–299.

Kivetz, R., I. Simonson. 2002a. Earning the right to indulge: Effort as a determinant of customer preferences

toward frequency program rewards. Journal of Marketing Research 39(2) 155–170.

Kivetz, R., I. Simonson. 2002b. Self-control for the righteous: Toward a theory of precommitment to indul-

gence. Journal of Consumer Research 29(2) 199–217.

O’curry, S., M. Strahilevitz. 2001. Probability and mode of acquisition effects on choices between hedonic

and utilitarian options. Marketing Letters 12(1) 37–49.

Okada, E. M. 2005. Justification effects on consumer choice of hedonic and utilitarian goods. Journal of


Robertson, N., P. D. Seymour. 1986. Graph minors. II. Algorithmic aspects of tree-width. Journal of

Algorithms 7(3) 309–322.

Tarjan, R. E. 1972. Depth-first search and linear graph algorithms. SIAM Journal on Computing 1(2) 146–160.

Train, K. 2009. Discrete Choice Methods with Simulation. 2nd ed. Cambridge University Press.

van Trijp, H. C. M. 1995. Variety-seeking in product choice behavior. Theory with applications in the food domain. Ph.D. thesis, Wageningen University.

van Trijp, H. C. M., W. D. Hoyer, J. J. Inman. 1996. Why switch? Product category-level explanations for true variety-seeking behavior. Journal of Marketing Research 33(3) 281–292.

Wertenbroch, K. 1998. Consumption self-control by rationing purchase quantities of virtue and vice. Marketing Science 17(4) 317–337.
