
An Empirical Analysis of the Modeling Benefits of Complete Information for eCRM Problems

Abstract

Due to the vast amount of user data tracked online, the use of data-based analytical methods is becoming increasingly common for e-businesses. Recently, the term analytical eCRM has been used to refer to the use of such methods in the online world. A characteristic of most of the current approaches in eCRM is that they use data collected about users' activities at a single site only. As we argue in this paper, this can present an incomplete picture of user activities since it does not capture activities at other sites. A more complete picture can be obtained by acquiring across-site data on users. Such data is expensive, but can be obtained by firms directly from their users or from market data vendors. A critical question for e-businesses is whether such data is worth obtaining, an issue that little prior research has addressed. In this paper, using a data mining approach, we present an empirical analysis of the modeling benefits that can be obtained by having complete information. Our results suggest that the magnitudes of the gains that can be obtained from complete data range from 10% to 50%, depending on the problem the data is used for and the performance metric considered. Qualitatively, we find that variables related to customer loyalty and browsing intensity are particularly important, and these variables are difficult to derive from data collected at a single site. More importantly, we find that a firm has to collect a reasonably large amount of complete data before any benefits can be reaped, and we caution against acquiring too little data.

Keywords: Data mining, incomplete data, online personalization, information sharing, knowledge management.

ISRL Categories: HA, UF, CB13, AI06


1. INTRODUCTION

Effective Customer Relationship Management (CRM) is an important and enduring issue for any business (Verhoef 2003, Pan and Lee 2003, Varian 2001). It has been shown that customer data can play a critical role in designing effective CRM strategies (Padmanabhan and Tuzhilin 2003). This is particularly the case in the online world since web servers automatically collect vast amounts of behavioral data on customers.

In the recent past the term analytical eCRM (Swift 2002) has been used to refer to the use of data-based analytical methods for customer analysis, customer interactions and profitability management. A characteristic of most of the existing approaches to eCRM is that these methods build profiles and models based on data collected by a single web site from users' interactions with the site. In this paper we refer to such data as site-centric data, which we define to be clickstream data collected at a site augmented with user demographics and cookies (Sen et al. 1998). In this sense, traditional approaches are myopic – they are based on firms building models from data collected at their site only. However, the myopic nature of most current eCRM methods is not due to the fact that site-centric data has been proved adequate for understanding customer behavior; rather, it is due to the nature of data ownership constraints – most sites only have access to their own log files.

For example, consider two users who browse the web for air tickets. Assume that the first user's session is as follows: Cheaptickets1, Cheaptickets2, Travelocity1, Travelocity2, Expedia1, Expedia2, Travelocity3, Travelocity4, Expedia3, Cheaptickets3, where Xi represents some page i at website X, and assume that in this session the user purchases a ticket at Cheaptickets. Assume that the second user's session is Expedia1, Expedia2, Expedia3, Expedia4 and that this user purchases a ticket at Expedia (in the booking page Expedia4, in particular). Expedia's (site-centric) data would include the following:

User1: Expedia1, Expedia2, Expedia3
User2: Expedia1, Expedia2, Expedia3, Expedia4

We define user-centric data to be site-centric data plus data on where else the user went in the current session. In this sense, user-centric data is the "complete" version of site-centric data. In the above example, the user-centric data for Expedia is:

User1: Cheaptickets1, Cheaptickets2, Travelocity1, Travelocity2, Expedia1, Expedia2, Travelocity3, Travelocity4, Expedia3, Cheaptickets3
User2: Expedia1, Expedia2, Expedia3, Expedia4

Reconsider the site-centric data for Expedia. In one case (user 2) the first three pages result in the user booking a ticket at the next page. In the other case (user 1), the first three pages result in no booking. Expedia sees the "same" initial browsing behavior, but with opposite results – one which resulted in a booking and one which did not. However sophisticated the analytical methods used, it is difficult to differentiate these two sessions based on site-centric data alone. Nevertheless, most conventional techniques implicitly try to do so, coerced by incomplete information.

The example above suggests that user-centric data may be more useful than site-centric data for problems in analytical eCRM. This observation is indeed actionable since technologies exist to collect user-centric data. Firms can obtain user-centric data in one of two ways:

1. Firms can get customers to download client-side tracking software that captures all the user's online activities and periodically uploads this data to a server. (Firms have to provide appropriate incentives.)

2. Market data vendors, such as Netratings and comScore Networks, collect user-centric data by signing up panels and using client-side tracking software to capture all the browsing activities of their panelists. Firms can purchase user-centric data directly from such vendors.

Obtaining user-centric data is certainly possible. A firm's decision to do so should be based on whether the benefits exceed the costs of obtaining the data. In this paper, we focus on quantifying the gains that may be obtained from user-centric data and on quantifying how much user-centric data is necessary to obtain results that outperform site-centric data. By doing so, this paper provides answers that will help firms in understanding the benefits that may be obtained from user-centric data. Actual decisions, of course, will also depend on the particulars of the individual businesses.

Specifically, we consider three important and common problems in eCRM, and for each problem we evaluate how user-centric models compare to site-centric models. For each of these three problems we answer the following questions:

1. What is the magnitude of the gain that can be expected from acquiring complete user-centric data, and what factors drive the gains that are obtained? This is an important question since the answers can help in understanding the benefits of user-centric data. From a practical perspective, however, although there may be gains to be had from user-centric data, cost considerations alone may prevent firms from being able to acquire user-centric data from all their customers. This motivates the second question that we study.

2. If it is only possible to acquire user-centric data for a sample of customers, how much data should be acquired in order to do better than with site-centric data alone? We use the term critical mass to refer to this number. In addition to the practical reason for studying this question, this question is interesting from a research perspective since it addresses a tradeoff between having more (less informative) site-centric data or less (more informative) user-centric data.

The main results of this paper are that the magnitudes of the gains are significant, but the actual values depend on the performance metric. User-centric data can increase the predictive accuracy of the desired target variable by 10% to 20% for the three problems considered. The gains are more striking for the lift metric, where the magnitude of the gains is between 30% and 50%. Analyses of the models built reveal several important predictors for the three problems considered. In particular, the key predictors from the user-centric data capture effects that relate to customer loyalty and browsing intensity – both of which are hard, if not impossible, to determine accurately from site-centric data. Further, the critical mass results reveal that the critical mass numbers are fairly high – ranging from 25% to 44% depending on the problem and metric considered. This is an important result since it suggests that if firms acquire partial user-centric data, then acquiring too little can be counter-productive, in that user-centric models built from such data can actually be worse than site-centric models.

The rest of this paper is organized as follows. In the next section we explain our choice of the eCRM problems considered, the modeling technique used, and the performance metrics of the models (for which the gains are studied). Based on this discussion we state the goals of the paper more specifically in terms of determining two 3x3 tables. Section 3 presents the methodology used and in particular elaborates on the data preparation algorithms. The results are presented in Section 4, followed by a discussion in Section 5 and conclusions in Section 6.

2. SPECIFIC OBJECTIVES

In order to compare models built on site-centric data to models built on user-centric data, three dimensions have to be clearly specified:

1. The choice of the problems for which the models are built.
2. The choice of the modeling technique used.
3. How the comparison of site-centric and user-centric models is done.

In this section we develop reasonable choices for these three dimensions, and based on this discussion we identify the specific values that need to be estimated.

2.1 Choice of the Problems

In order to compare models built on site-centric and user-centric data, we consider three important and common problems in eCRM. For each of the problems we evaluate how user-centric models compare to site-centric models. The three problems are:

1. Predicting repeat visits of users. This is important in the context of understanding the loyalty and switching propensities of online users. In particular, we build models that, at the end of each user session, predict if this user is likely to visit the site again at any time in the future. In prior work, Lee (2000) modeled the repeat visit behavior of online users using an NBD (negative binomial distribution) model. Chen and Hitt (2002) studied consumers' switching behavior between online brokers based on repeat visits using a logit model.

2 & 3. Predicting likelihood of purchase at a site. Within this category we identify two sub-problems. The first is a real-time prediction task: within a current user's session, at every successive click, predicting if this user is likely to purchase within the remainder of the session. We term this the within-session prediction problem. The second is a prediction problem with a longer time horizon: at the end of each user session, predicting if this user is likely to purchase during any subsequent session in the future – the future-session prediction problem. In prior work, VanderMeer et al. (2000) developed a personalization system to predict a user's next access in the session and issue signals when the user is going to make a purchase. Theusinger and Huber (2000) modeled online users' within-session purchases of software. Moe and Fader (2002 A) studied users' future-session purchase behavior for Amazon.com. Fader and Hardie (2001) investigated customers' future-session purchase behavior for CDNow and discovered that customers' purchase propensity diminishes over time.

While all of these problems have been studied before in prior work as mentioned above, there has been no work comparing the effects of doing so under incomplete and (comparatively) complete information, as we do in this paper.

2.2 Choice of the Models

Prior work on these problems built a wide range of different models, ranging from linear models (Mena 1999) and logit models (Chen and Hitt 2002) to other probabilistic models (Lee 2000, Moe and Fader 2002 A, B) and non-linear models such as classification trees and neural networks (Theusinger and Huber 2000, VanderMeer et al. 2000). In our experiments we build classification tree models for the following reasons. (1) All three tasks considered are binary, and classification trees have been shown to be highly accurate for such tasks (Padmanabhan et al. 2001). (2) Classification trees are non-parametric and distribution-free in the sense that they make no distributional assumptions on the data, and thus trees can accommodate a wide range of data types. (3) In our experiments we build site-centric and user-centric models for 95 sites (see Section 2.4) and three problems. For a single model choice, this amounts to 2*95*3 = 570 models. A typical single neural network that we built took approximately 2 hours on a Linux workstation with 1GB of main memory and a 1GHz processor. Classification trees can be built quickly and therefore enable comparison on a larger scale. (4) Finally, classification trees are good at explaining the results – the trees can be visualized easily and variable importances can be derived from the tree.
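The paper does not specify which tree implementation was used; purely as an illustration, a minimal sketch of one model-building and evaluation step, assuming scikit-learn's CART-style decision tree and a prepared tabular dataset (the file and column names here are hypothetical), could look as follows:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

def build_tree(dataset_csv, target="booking"):
    # One prepared site-centric or user-centric dataset with numeric summary variables.
    data = pd.read_csv(dataset_csv)
    X, y = data.drop(columns=[target]), data[target]
    # Random 40% in-sample for building, 60% out-of-sample for evaluation (Section 2.3).
    X_in, X_out, y_in, y_out = train_test_split(X, y, train_size=0.4, random_state=1)
    tree = DecisionTreeClassifier()
    tree.fit(X_in, y_in)
    return tree, tree.score(X_out, y_out)   # out-of-sample overall accuracy

Repeating such a call over the 570 site-centric and user-centric datasets would yield the model pairs compared in Section 4.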

2.3 Model Comparison Methods

In our experiments, when models on site-centric and user-centric data are compared, we focus on the out-of-sample performance of these models. In particular, for each of the 570 datasets considered (see Section 2.2), the data is split so that a random 40% is used as the in-sample data to build the model and the remaining 60% is used as the out-of-sample data on which the model is evaluated. In the out-of-sample data we compute three metrics for each model:

1. Overall predictive accuracy,
2. Target variable predictive accuracy, and
3. A measure of how much "lift" the model provides.

The most common method for comparing approaches is overall predictive accuracy. However, when the dependent variable has unequal priors this method has limitations (Fawcett and Provost 1996, Johnson and Wichern 1998). For example, assume 5% of the sessions result in bookings. A trivial rule saying that no sessions result in a booking will have an accuracy of 95%. However, this is not a useful model. Part of the problem is that in such cases it is more expensive to label a booking session as a non-booking one – failure to recognize a potential purchase could result in losing the customer. One method addressing this limitation is to specify prior misclassification costs and build models to minimize total costs (Johnson and Wichern 1998). The problem with this method is that it is sensitive to the actual misclassification costs used, and in many applications (a) it may not be easy to determine these costs and therefore (b) the choice of these costs may be ad hoc.

In light of this, we consider two additional metrics. First, besides overall prediction accuracy, we consider the target variable's predictive accuracy (i.e., what percentage of the actual purchases were predicted correctly). For example, the trivial model stated above that had 95% overall accuracy would get 0% target variable accuracy. Second, as is standard in the database marketing literature (Ling and Li 1998, Hughes 1996), we use a lift index (see below) to compare models.

Figure 1 Constructing Lift Curves (y-axis: percentage of actual bookings captured; x-axis: top x% of the sorted out-of-sample data; the diagonal shows the random baseline)

Assume that a classification model predicts that the ith record in the out-of-sample data is a "booking" session with confidence pi. The out-of-sample data is then sorted in descending order of pi and a lift curve is constructed as follows. Any point (x, y) belongs on the lift curve if the top x% of this sorted data captures y% of the actual booking sessions. If the model is random, then the top x% of the data would be expected to contain x% of the bookings. The difference (y - x) is hence the lift obtained as a result of the model. Figure 1 presents an example of using a lift curve to determine the performance of a model. For example, the model in Figure 1 picks out 50% of the true booking sessions from just the top 20% of the sorted data. The area between the 45-degree line and the lift curve of the model is a measure of how much "lift" this model provides, and we use the value of this area as the lift index. Clearly this area is within the range [-0.5, 0.5], and the greater the value, the better the model.

Hence in this paper our goal is to compute two 3x3 tables in which the rows are the specific problems and the columns are the performance metrics. The first table represents the magnitude of the gains that can be expected from having complete user-centric data. The other table represents the critical mass numbers – i.e., the percentage of user-centric data that is needed to build models that are as good as the models built from the entire site-centric data. In the next section we describe the methodology used to determine these values.
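As an illustration of how the lift index can be computed from a fitted model's predictions, a brief sketch (ours, not the authors'; it assumes arrays of predicted confidences and actual booking labels for the out-of-sample records) is:

import numpy as np

def lift_index(p, y):
    # p: predicted booking confidences; y: 1 if the session actually booked, else 0.
    # Returns the area between the lift curve and the 45-degree line, in [-0.5, 0.5].
    order = np.argsort(-np.asarray(p))                 # sort records by descending confidence
    hits = np.cumsum(np.asarray(y)[order]) / max(1, np.sum(y))
    frac = np.arange(1, len(y) + 1) / len(y)           # top x% of the sorted data
    return float(np.trapz(hits - frac, frac))          # integrate (y - x) over x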

3. METHODOLOGY

The gains and critical mass numbers may be obtained as follows.

1. Select N sites and obtain site-centric and user-centric data for each site.
2. Prepare each site's user-centric and site-centric datasets (construct relevant variables and apply feature selection) as applicable for each problem, and thereby obtain N pairs of datasets for each problem.
3. For each problem, build N pairs of models and for each performance metric determine the overall magnitude gains and critical mass averaged over the N observations.

Since step 3 is straightforward, in this section we describe how the first two steps are done.

3.1 Generating Site-Centric and User-Centric Data

Ideally we would like to select N websites and obtain site-centric and user-centric data directly from them. This is impossible since no site on the web has complete user-centric data on its customers and site-centric data is usually proprietary. However, we could reproduce various sites' site-centric and user-centric data if we had a random panel of users and observed their behavior across several sites. Firms such as Jupiter Media Metrix and Netratings collect such data at the user level based on a large randomly selected panel of users. In our research we use raw data from Jupiter Media Metrix that consisted of records of 20,000 users' web surfing behavior over a period of 6 months. The data included user demographics and transaction history over the entire period. The total size of the raw data amounted to 30GB and represented approximately 4 million total user sessions. The data captured in each user session are the set of visited pages and the times of visit. In addition, the vendor categorized the various sites accessed by the users into categories such as books, search engines, news, CDs, travel, etc. In particular, we chose 95 sites spanning five categories (books, music, travel, auctions and general shopping malls) that represent sites that sell products, such that each site had at least 500 user sessions. Below we present more formally the algorithms to create site-centric and user-centric data from such raw data.

First, we present the algorithm used for constructing each site's site-centric data from raw user-level browsing data. Figure 2 presents CalcSiteData, an algorithm for constructing site-centric data from user-level web browsing data.

Let S1, S2, …, SN be N user sessions and usr be a function such that usr(Si) represents the user in session Si. We define each session Si to be a tuple <pi,1, pi,2, …, pi,ki>, where pi,j is the jth page accessed in session Si and ki is the length of session Si. We say a page p is in session Si if p is any element of <pi,1, pi,2, …, pi,ki>.

We also define an operation add_page such that add_page(<p1, p2, …, pk>, q) returns <p1, p2, …, pk, q>. Finally, let the function rootdomain(p) return the website's root domain (e.g., www.expedia.com) of a given page p (e.g., www.expedia.com/xyz.html).

CalcSiteData works by taking each user session and constructing snapshots for each unique site in the session such that each snapshot consists of the pages belonging to that particular site. For example, given a single session <Cheaptickets1, Cheaptickets2, Travelocity1, Travelocity2, Expedia1, Expedia2, Travelocity3, Travelocity4, Expedia3, Cheaptickets3>, CalcSiteData extracts 3 tuples:

1. <Cheaptickets1, Cheaptickets2, Cheaptickets3> for site Cheaptickets,
2. <Travelocity1, Travelocity2, Travelocity3, Travelocity4> for site Travelocity, and
3. <Expedia1, Expedia2, Expedia3> for site Expedia.

Inputs: (a) User sessions S1, S2, …, SN; (b) functions usr and rootdomain
Outputs: Site-centric data for each website site.

for (i = 1 to N) {
    Let UniqSites = {X | ∃ page p in Si such that X = rootdomain(p)}
    forall (site ∈ UniqSites) {
        site_centric_session = <>
        forall pages pi in session Si {
            if (rootdomain(pi) = site) {
                site_centric_session = add_page(site_centric_session, pi)
            }
        }
        output 'site, demographics of usr(Si), site_centric_session'
    }
}

Figure 2 Algorithm CalcSiteData

From the tuples extracted from all the user sessions, grouping the tuples for each individual site results in the site-centric data for that site. In addition to the tuples representing the individual pages, site-centric data also contains all the user demographics contained in the original raw data. CalcSiteData is formally presented in Figure 2.

Observe that, in effect, what CalcSiteData tries to do is "recreate" each site's web logfile data based on the sample of 20,000 users' web accesses. Given that it is impossible to obtain the complete logfile data for every commercial site, the strength of CalcSiteData is that it can simulate (and adduce) individual logfiles from publicly available user-level browsing data.

The creation of user-centric data for each site is straightforward. For each site s, the subset of all user sessions that contain s represents the user-centric data for the site. For instance, the user session <Cheaptickets1, Cheaptickets2, Travelocity1, Travelocity2, Expedia1, Expedia2, Travelocity3, Travelocity4, Expedia3, Cheaptickets3> will belong to the user-centric data of Expedia, Travelocity and Cheaptickets. Algorithm CalcUserData is formally presented in Figure 3.

Inputs: (a) User sessions S1, S2, …, SN; (b) functions usr and rootdomain
Outputs: User-centric data for each website site.

for (i = 1 to N) {
    Let UniqSites = {X | ∃ page p in Si such that X = rootdomain(p)}
    forall (site ∈ UniqSites) {
        output 'site, demographics of usr(Si), Si'
    }
}

Figure 3 Algorithm CalcUserData
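For readers who prefer executable code, a rough Python sketch of the two algorithms (an illustration, not the authors' implementation; sessions are assumed to be (user, list-of-page-URLs) pairs, and demographics are omitted for brevity) is:

from collections import defaultdict
from urllib.parse import urlparse

def rootdomain(page_url):
    # e.g. "http://www.expedia.com/xyz.html" -> "www.expedia.com"
    return urlparse(page_url).netloc

def calc_site_data(sessions):
    # CalcSiteData: keep only the pages of each site that appears in a session.
    site_data = defaultdict(list)
    for user, pages in sessions:
        for site in {rootdomain(p) for p in pages}:
            site_data[site].append((user, [p for p in pages if rootdomain(p) == site]))
    return site_data

def calc_user_data(sessions):
    # CalcUserData: attach the complete session to every site it touches.
    user_data = defaultdict(list)
    for user, pages in sessions:
        for site in {rootdomain(p) for p in pages}:
            user_data[site].append((user, list(pages)))
    return user_data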

3.2 Data Preparation for the Three Problems

We first summarize, in Section 3.2.1, the data preparation method for the within-session prediction problem. The other two problems (future-session prediction and repeat visit prediction) share a common structure – at the end of a session, predicting a binary event in the future. The data preparation for these two problems is hence the same, and we discuss it in Section 3.2.2.

3.2.1 Data Preparation for the Within-Session Problem

The goal of the model being built is to predict at any point within a session if a purchase will occur in the remainder of the session. Consider a user's session <p1, p2, p3, p4, p5> at a site where a purchase occurs on page p4. This single session is not a single data record for modeling. Rather, it provides 5 data records. Three fragments that began with {p1}, {p1 p2} and {p1 p2 p3} respectively resulted in a subsequent purchase. Two that began with {p1 p2 p3 p4} and {p1 p2 p3 p4 p5} did not.

This distinction is important. It indicates that sessions create data records in proportion to their length. In general, a session of length k provides k data records for modeling. In total, the number of records is therefore the sum of all session lengths. Given the explosion in the number of points this creates, we sample each session probabilistically based on its length. For example, if the sampling rate is 0.2, a session of length k on average provides 0.2*k data records for modeling. Associated with each sample, a random 'clipping point' is chosen within the session. Explanatory variables constructed from information before the clipping point are used to predict whether a purchase [1] occurs in the part of the session after the clipping point.

A key strength of probabilistic sampling is that the size of the preprocessed data can be chosen based on the time and space constraints available. Choosing a maximum desired data size, dnum, is equivalent to choosing a sampling rate per session of dnum/numtotal, where numtotal is the sum of the lengths of all the sessions in the data. The actual algorithm, ProbabilisticClipping, is presented in Appendix A.
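Appendix A is not reproduced here; the following sketch of probabilistic clipping is therefore an assumption based on the description above (not the authors' code), with purchase detection left out:

import random

def probabilistic_clipping(sessions, dnum, seed=1):
    # sessions: list of page lists. Produces roughly dnum clipped fragments in total;
    # a session of length k contributes about (dnum / numtotal) * k fragments.
    random.seed(seed)
    numtotal = sum(len(s) for s in sessions)
    rate = dnum / numtotal
    records = []
    for pages in sessions:
        n_samples = sum(random.random() < rate for _ in range(len(pages)))
        for _ in range(n_samples):
            clip = random.randint(1, len(pages))   # random clipping point
            # Explanatory variables come from pages[:clip]; the label is whether a
            # purchase occurs somewhere in pages[clip:].
            records.append((pages[:clip], pages[clip:]))
    return records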

Preprocessing user-centric data for the within-session prediction problem is similar to the method described above. This involves using probabilistic clipping only partially – for creating fragments of all sessions in the site-centric data – and then augmenting these fragments with user-centric information before creating the summary variables. For example, consider a single user's session. Let the records in the site-centric and user-centric data for Expedia contain the following records:

Site-centric record: Expedia1, Expedia2, Expedia3
User-centric record: Cheaptickets1, Cheaptickets2, Travelocity1, Travelocity2, Expedia1, Expedia2, Travelocity3, Travelocity4, Expedia3

If the sampling rate is 0.7, assume that as part of the probabilistic clipping procedure for the site-centric data the following two fragments of the session are created: {Expedia1} and {Expedia1, Expedia2, Expedia3}. Preprocessing the user-centric data for Expedia would then create the fragments {Cheaptickets1, Cheaptickets2, Travelocity1, Travelocity2, Expedia1} and {Cheaptickets1, Cheaptickets2, Travelocity1, Travelocity2, Expedia1, Expedia2, Travelocity3, Travelocity4, Expedia3}. Based on these two fragments, summary variables are created. Clearly the summary variables in the user-centric data will contain additional metrics such as the percentage of total hits to Expedia's site in the current session (20% for the first fragment and 33% for the second fragment in the example). The detailed set of user-centric metrics and site-centric metrics is presented in Appendix B.

[1] We use Jupiter Media Metrix's heuristic that considers properties of secure-mode transactions to infer bookings. In their experiments using ten commercial sites that provided them exact URLs of purchases, this heuristic identified 87% of the total purchases and had a false positive rate under 25%.
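As an illustration, a few such user-centric summary variables for a clipped fragment could be computed as follows (the variable names here are illustrative, not the paper's; rootdomain is the helper sketched with CalcSiteData above):

def user_centric_summary(fragment_pages, site):
    # Summary variables for one clipped user-centric fragment.
    hits_to_site = sum(1 for p in fragment_pages if rootdomain(p) == site)
    return {
        "site_hit_share": hits_to_site / len(fragment_pages),    # e.g. 1/5 = 20%, 3/9 = 33%
        "sites_in_fragment": len({rootdomain(p) for p in fragment_pages}),
        "is_entry_site": rootdomain(fragment_pages[0]) == site,  # did the session start here?
    }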

Note that as a result of probabilistic clipping a session might result in several data records. This occurs since the prediction problem addressed here is to predict at every click whether a purchase is likely to occur. The data generated from probabilistic sampling is, therefore, the correct sample since we are sampling from the space of all session fragments. Since longer sessions have more session fragments, they will translate into more records. This is a natural result of what we want to sample from, and is not over-representation.

3.2.2 Data Preparation for the Future-Session and Repeat Visit Problems

The future-session (or repeat visit) prediction problem is predicting, at the end of a given session, whether a user at a site will make a purchase (or visit the site again) at some point in the future. To illustrate the difference between using site-centric data and user-centric data, consider a user at a given point in time with three web sessions involving visits to Expedia. The prediction task involving site-centric data is to predict if a given user with n prior sessions at Expedia will book in some future session based on the historical sessions s1, s2, …, sn, where each si is of the form Expedia1, Expedia2, Expedia3, for example. The prediction task involving user-centric data is to predict if a given user with n prior sessions at Expedia will book in some future session based on the historical sessions u1, u2, …, un, where each ui is of the form Cheaptickets1, Cheaptickets2, Travelocity1, Travelocity2, Expedia1, Expedia2, Travelocity3, Travelocity4, Expedia3, for example. Hence for a user with N total sessions at Expedia, the site-centric and user-centric data would each contain N records, each with increasing historical content. The individual summary variables are listed in Appendix C.

4. RESULTS

We started with user-level browsing data consisting of 20,000 users' browsing behavior over a period of 6 months. These data were provided to us by Jupiter Media Metrix. We constructed site-centric and user-centric datasets for 95 sites using the process described in Section 3.1. The number of records for each site ranged from 500 to 12,500 raw user sessions. Based on the generated site-centric and user-centric data, we then applied feature selection to filter out possibly irrelevant variables for each site. The feature selection problem is the problem of selecting a "best" subset of features (variables) from a finite space of features (Olafsson and Yang 2002). There are several techniques recognized as valid. For our experiments, we chose the wrapper algorithm [2] implemented in Weka – a popular data mining package implemented in Java (Witten and Frank 1999). Table 1 reports the average number of variables selected by the wrapper algorithm. Note that we apply feature selection to each dataset separately – this means that two site-centric datasets may actually differ in the set of final variables selected from the entire set of available variables. For example, for the future-session problem, the selected features of Amazon for the site-centric data and for the user-centric data are (age, edu, hhsize, booklh, sesslh, mpselllh) and (gender, booklh, bookgh, basket, booksh, peakrate) respectively (see Appendix C for detailed explanations of these variables). This ensures that the best model is built for each possible site-centric or user-centric dataset. The values in parentheses in Table 1 are the numbers of variables in each dataset before feature selection. Note that the feature selection algorithm greatly reduced the number of variables (e.g., for the site-centric data of the within-session problem, the average number of variables selected across the 95 sites is 7.5 vs. 14 original variables).

Table 1 Average number of variables chosen by feature selection

                   Site-centric   User-centric
  Within-Session   7.5 (14)       15.8 (38)
  Future-Session   7.0 (11)       15.7 (28)
  Repeat-Visit     6.2 (11)       12.6 (28)

Using only those variables selected by the feature selection algorithm, we then built 570 models (2x3x95: 2 model types (user-centric and site-centric) and 3 predictive problems for 95 sites).

[2] The specific wrapper package in Weka we use is weka.attributeSelection.WrapperSubsetEval -S "weka.attributeSelection.BestFirst -D 0".
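For reference, a rough Python analogue of this wrapper-style selection (an assumption on our part, not the setup used in the paper, which relied on Weka) is forward sequential selection around the same tree learner:

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

def wrapper_select(X_in, y_in, n_features=8):
    # Wrapper search: candidate subsets are scored by the cross-validated
    # accuracy of the tree model itself.
    selector = SequentialFeatureSelector(
        DecisionTreeClassifier(),
        n_features_to_select=n_features,
        direction="forward",
        scoring="accuracy",
        cv=5,
    )
    selector.fit(X_in, y_in)
    return selector.get_support(indices=True)   # indices of the selected variables

Unlike Weka's BestFirst search, this sketch fixes the number of selected variables in advance.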

For lack of space, we do not report the individual tree models here. However, all the results (including all 570 trees and the actual variables selected by feature selection) are available in the online appendix (2003). Below we report only the summary results.

4.1 Computing the Magnitude Gains

For the within-session problem, we apply ProbabilisticClipping with a sampling rate of 0.2 to obtain site-centric and user-centric datasets for each of the 95 sites' raw session datasets. The resulting 95 pairs of datasets contain between 1,000 and 40,000 records each. As discussed in Section 3.2.2, the datasets for the future-session problem and the repeat-visit problem have the same structure. The datasets for these two problems contained between 500 and 12,500 records.

Table 2 Summary Results

                                  Overall Accuracy      Target Variable Accuracy   Lift
                                  Mean      Variance    Mean      Variance         Mean      Variance
  Within-Session   Site-centric   0.852     0.004       0.419     0.058            0.115     0.013
                   User-centric   0.865     0.004       0.496     0.056            0.163     0.013
                   t              5.231                 6.220                      7.067
                   p              <0.001                <0.001                     <0.001
  Future-Session   Site-centric   0.878     0.006       0.295     0.050            0.111     0.013
                   User-centric   0.886     0.006       0.336     0.043            0.138     0.014
                   t              4.087                 3.976                      6.333
                   p              <0.001                <0.001                     <0.001
  Repeat-Visits    Site-centric   0.658     0.004       0.487     0.046            0.056     0.002
                   User-centric   0.671     0.004       0.536     0.036            0.075     0.001
                   t              2.970                 5.647                      5.827
                   p              0.002                 <0.001                     <0.001

Each of these 570 datasets (95 × 3 × 2) was divided into a random 40% for model building and a remaining 60% as out-of-sample data. Although decision trees are accurate classifiers, they belong to a family of techniques that are unstable, in the sense that a perturbation in the dataset can result in a different tree being generated. Hence for each dataset we generate five decision trees based on different random 40-60 data partitionings. All the results are thus based on the 5-run average. The actual performance of all the models is summarized in Table 2, which reports the mean and variance for site-centric models and user-centric models for all three problems under the three performance measures. In addition, we report the t-values and p-values from the paired t-tests between site-centric models and user-centric models. Based on the out-of-sample model comparisons for the 95 sites, the user-centric models significantly outperform the site-centric models in all cases (all 9 p-values are less than 0.05).
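Each such comparison is a standard paired t-test across the 95 sites; a minimal sketch (illustrative, assuming two aligned arrays of per-site out-of-sample scores for the same metric) is:

from scipy import stats

def compare_models(site_scores, user_scores):
    # Position i in both arrays refers to the same site.
    t, p = stats.ttest_rel(user_scores, site_scores)
    return t, p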

Table 3 The magnitude of gains (in percentage)

                   Overall Accuracy   Target Variable Accuracy   Lift Gains
  Within-Session   1.5%               18.4%                      41.7%
  Future-Session   0.9%               13.9%                      24.3%
  Repeat-Visits    2.0%               10.1%                      33.9%

Table 3 lists the magnitude of the gains (in percentage) derived from Table 2. The results show that user-centric data can increase the predictive accuracy of the desired target variable by 10% to 20% for the three problems considered, while maintaining the overall accuracy. The gains are more pronounced for the lift metric, where the magnitude of the gains is between 24% and 42%. However, as mentioned in the introduction, in this paper we only empirically study the modeling gains that are obtained. Translating these gains into dollar terms is important yet domain-dependent. It is beyond the scope of this paper to investigate the effect of these gains on managerial decisions. We plan to address this in future research.

4.2 Identifying the Drivers of the Gains

What drives these results? Figure 4 indicates that in our data 85% of user sessions involve the user visiting more than one site, and 27% actually involve visits to more than 10 web sites. Given these numbers, if the additional user-centric data is relevant, then adding this information can help build better predictive models.

Figure 4 Distribution of Number of Sites Visited Across User Sessions (x-axis: number of sites visited in a session, 1 to 10+; y-axis: percent of total sessions)

Also, the models built from user-centric data clearly indicate that, in addition to the effects captured by site-centric models, there are several other factors not captured in site-centric data that have highly significant effects. Due to space considerations, we present only one pair of sample trees with the top 3 levels.

Figure 5 Site-Centric Tree (Left) vs. User-Centric Tree (Right) of the Dell Dataset

Figure 5 presents the sample trees for Dell built from site-centric data and user-centric data for the future-session problem. The site-centric tree is relatively simple, with only two levels that represent three rules. In contrast, all the variables in the top 3 levels of the user-centric tree are user-centric variables. The average number of sessions per site (sespsite), bookings in the past at any site (bookgh), the market share of a user's time spent at this site (minutesh), and the percentage of sessions that start with this site (entrate) are all very significant for the future-session purchase prediction problem.


In order to determine which specific variables in the user-centric data drive the results, we sorted the variables based on the following heuristic. For each tree we consider the variables that appear in the splits of the top five levels of the tree. We weigh each level linearly, with variables in the first level given a weight of 5, variables in the second level a weight of 4, and so on. We then sum these importance weights across all 95 sites for each of the three problems. The ten most important variables are reported in Table 4. Overall, 19 out of the 30 most important variables are user-centric ones. Detailed analyses below further bring out several important effects captured only in the user-centric case.

Table 4 Top 10 variables based on importance

         Within-Session          Future-Session          Repeat-Visit
  Rank   Variable    Importance  Variable    Importance  Variable    Importance
  1      peakrate*   141         booklh      147         sesslh      205
  2      minutelc    108         bookgh*     132         sespsite*   110
  3      path*       71          sesslh      92          bookgh*     103
  4      mpsessgh*   49          peakrate*   74          hpsesslh    94
  5      mpsesslh    46          age         60          sessgh*     60
  6      bookgh*     44          hpsesslh    54          awareset*   55
  7      booklh      43          exitrate*   45          minutelh    54
  8      sessgh*     40          booksh*     44          basket*     50
  9      exitrate*   39          mpsesslh    43          exitrate*   50
  10     hpsessgh*   38          mpsessgh*   40          entrate*    48

  (Variables marked with * are user-centric; see Appendices B and C for detailed descriptions of these variables.)
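A sketch of the level-weighting heuristic described above (illustrative; it assumes each tree has been reduced to a list of (variable, level) pairs for its top five levels, with level 1 being the root) is:

from collections import Counter

def importance_scores(trees):
    # A variable splitting at level d (d <= 5) contributes a weight of 6 - d.
    scores = Counter()
    for splits in trees:
        for variable, level in splits:
            if level <= 5:
                scores[variable] += 6 - level
    return scores.most_common(10)   # the ten most important variables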

For the within-session prediction problem, seven of the 10 most important variables in the sorted list were user-centric variables, which cannot be generated from site-centric data. The three most important ones, with strong effects, were (i) the percentage of historical sessions in which the current site was the one where the user spent the most time (peakrate), (ii) whether in the current session this site is the entry site or the one in which the user spent the most time (path), and (iii) minutes per session in the past (mpsessgh). Note that these variables all capture some measure of user loyalty to a specific site. Inherently, loyalty is extremely difficult to capture from site-centric data alone – such data can never give information on a user's activity at competitors' sites. These findings indicate that loyalty is a significant driver of purchase behavior and one that can be better captured in user-centric data.

For the future-session prediction problem, 5 of the 10 most significant variables in the sorted list were user-centric measures. The three most significant user-centric variables in the list were (i) past bookings at all sites (bookgh), (ii) the percentage of historical sessions in which the current site was the most visited (peakrate), and (iii) the percentage of sessions at this site in which the user stopped immediately after this site (exitrate). The first is a measure of browsing/buying intensity, and not surprisingly heavy buyers are more likely to book in the future. The other two again capture loyalty effects. Interestingly, exitrate is a strong indicator of future purchase and one that can only be determined from user-centric data.

For the repeat visit prediction problem, the most significant variable is a site-centric one, sesslh – the number of sessions of the user to this site in the past. Not surprisingly, this is a highly positive, though trivial, indicator of repeat visits. The next two are user-centric variables that capture browsing/buying intensity (total purchases in the past across all sites – bookgh – and number of sessions per site – sespsite). The next most significant user-centric variable was basket size – the number of sites a user was visiting in the current session. This is an indicator of comparison shopping, and it captures this user's loyalty to the site.

In summary, variables that capture two important drivers – loyalty and browsing/buying intensity – tend to be highly important for all three problems considered in this paper. Of these two, loyalty simply cannot be measured from the more common site-centric data. The other effect, browsing/buying intensity, though partially captured in site-centric data, is important enough that the actual values captured in user-centric data turn out to be much better.

4.3 Critical Mass Results

Thus far we have computed the magnitude of the gains that can be obtained from having user-centric data for all customers. We have also studied the factors that contribute to the better performance of user-centric models. As mentioned in the introduction, cost considerations alone may preclude a firm from acquiring all the user-centric data. In order to determine how much user-centric data is needed, in this section we compute the critical mass numbers for all three problems and all three performance metrics. We define critical mass as the percentage of user-centric data that is required to build a model that is as good as the site-centric model. This metric is computed as Kmin/N, where Kmin is the minimum number of the N user-centric records needed to build a decision tree that can "outperform" (in terms of a single performance metric) the site-centric model built from the N site-centric records. In computing the metric, we are in fact comparing the two scenarios shown in Figure 6. Scenario 1 represents model building with site-centric data, where we have all N customers' records yet fewer variables (X1, …, XM). In contrast, Scenario 2 represents the user-centric model, where we have fewer customers' records (K, as shown in Figure 6) yet more variables (the additional user-centric variables XM+1, …, XP in Figure 6). The critical mass K thus represents the tradeoff between knowing more customers (Scenario 1) and knowing more information per customer (Scenario 2).

Figure 6 Two scenarios of model building. Scenario 1 (site-centric model, fewer variables): all N customers' records, but only variables X1, …, XM. Scenario 2 (user-centric model, fewer records): only K customers' records, but the full set of variables X1, …, XM, XM+1, …, XP.

We determine the critical mass of a site for each problem and an associated performance metric as follows. For each site we first build a site-centric model using all N in-sample records available. We then randomly select K records from the corresponding N user-centric records and build a tree on that data. In particular, we vary the value of K from 2% to 100% incrementally, with a step size of 2%. The critical mass is determined when the user-centric model with K records starts to outperform the site-centric model. Figure 7 graphically depicts the determination of the critical mass for an average site, ubid.com [3], for the within-session problem and the "target variable accuracy" performance metric. In Figure 7, the performance of site-centric data is represented by the straight line (0.401 in terms of booking accuracy) since we use all N available records. As K increases, note that the performance of the user-centric model (shown as the dotted curve in Figure 7) increases, as would be expected. The point beyond which the user-centric model outperforms the site-centric model is 44%. Therefore the critical mass for ubid is 44%.

Figure 7 Critical mass determination for Ubid.com (x-axis: percentage of user-centric data used; y-axis: booking accuracy; the flat line is the site-centric model built from all N records and the curve is the user-centric model built from K records)
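A sketch of this search procedure (illustrative; fit_and_score is a hypothetical callback that builds a tree on the given user-centric records and returns its out-of-sample score, and user_X, user_y are assumed to be NumPy arrays) is:

import numpy as np

def critical_mass(user_X, user_y, site_score, fit_and_score, runs=5, seed=1):
    # Smallest K (in percent, 2..100 in steps of 2) at which the user-centric
    # model built from K% of the records matches or beats the site-centric score.
    rng = np.random.default_rng(seed)
    n = len(user_y)
    for k_pct in range(2, 101, 2):
        scores = []
        for _ in range(runs):                  # average over random subsamples
            idx = rng.choice(n, size=max(1, n * k_pct // 100), replace=False)
            scores.append(fit_and_score(user_X[idx], user_y[idx]))
        if np.mean(scores) >= site_score:
            return k_pct
    return None   # the user-centric model never catches up with the available data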

Table 5 summarizes the critical mass results [4] (mean and variance) for the three problems under the three chosen performance measures. All the critical masses are in the range of 25%-44%. This implies that for these problems and performance metrics, it would be necessary to acquire a reasonably large percentage of user-centric data in order to do better than the existing site-centric data alone. This is an important result, since it suggests that too small a sample may actually be counter-productive, in the sense that the user-centric models can indeed be worse than the site-centric models. Note as well the specific shape of the user-centric curve. Knowing this will facilitate estimating the amount of user-centric data required for a given level of improvement in the measure of performance, here booking accuracy, as part of a cost-benefit analysis.

[3] Its critical mass of 44% is about the same as the average critical mass (43.2%) of the 95 sites.
[4] All the results are averaged over five runs with different random seeds when choosing the K random user-centric data records.


Table 5 Summary critical mass results for the three problems

                   Overall Accuracy     Target Variable Accuracy   Lift Gains
                   Mean     Variance    Mean     Variance          Mean     Variance
  Within-Session   33.5%    0.217       43.2%    0.245             25.1%    0.218
  Future-Session   39.3%    0.245       41.9%    0.247             34.5%    0.224
  Repeat-Visit     37.2%    0.245       41.8%    0.251             34.8%    0.227

5. DISCUSSION

The results presented in this paper are significant for the following reasons:

1. In the context of predictive modeling, even though the nature of the result is not surprising, there has been no prior work that made this comparison explicitly to understand the magnitude of the gains that can be achieved from user-centric data. As expected, although the gains are significant, the magnitude of the difference between site-centric and user-centric approaches varies by problem type and performance metric. An interesting observation from Table 3 is that for the within-session prediction problem the gains that can be achieved are higher than for the other problems. It is worth noting that this is the most difficult of the three problems and the one that most online retailers care about, since being able to predict purchase (or the lack of it) within a session is clearly actionable. In terms of the performance metric, the gains are larger for the lift metric. However, the gains across performance metrics are not directly comparable – it is not clear that a 42% gain in lift is better than an 18% gain in target variable accuracy. To address this in future research, we plan to consider the dollar value of the gains that can be obtained for each problem. Studying this will involve an understanding of the decision contexts and how the results of acting on the models will create value.

2. For the specific problems considered we determined the significant variables that impact the models. In particular, we find that variables capturing customer loyalty and browsing/buying intensity are significant. These variables can be computed better from user-centric data and to some extent help explain why user-centric data can be useful.

3. As mentioned in the introduction, it may be appealing for firms to collect user-centric data on only some customers. For the specific problems and models considered in this paper we show that too little user-centric data can actually be worse than just using the site-centric models. Our results here actually make a case for the use of site-centric data when it may not be possible to acquire the reasonably large amount of user-centric data needed. While this result is certainly not contradictory to the claim that complete user-centric data is better, it provides some degree of support for the common use of site-centric models in eCRM problems, since for many firms the acquisition of a large amount of user-centric data may be prohibitively expensive.

4. There are interesting implications. These results suggest that there may be value in business models that collect user-level data and provide these data to individual sites to build models. Indeed, some potential opportunities include customer opt-in models for licensing user-level data in real time to electronic commerce sites for building more effective models.

More generally, the methodology used in this paper may be useful in valuing additional user data and may also be useful in formally studying the value of privacy. For example, approximate values of a user's unknown data may be computed based on the magnitude of the gains that can be expected from using this data. These are exciting possibilities for future research.

Several limitations that might limit the scope of the paper are important to detail and are listed below. We hope that other independent studies and future experimentation will examine these in greater depth.

(1) The models do not consider the content within the pages, since most pages were dynamically generated (e.g., a profile of an air ticket search) and cannot be recreated. Given that user-centric data is a superset of site-centric data, there is no reason to believe that this will help site-centric models more than user-centric ones. The data provided to us did not have measures of content, though some ISPs and other data vendors, such as comScore, do track such data.

(2) Reproducing site-centric datasets from the Media Metrix user-level panels is only an approximation. Given the impracticality of obtaining site-centric datasets directly from hundreds of individual sites, it is not possible to determine how good the approximation really is.

(3) The results are specific to the metrics and the model (decision trees) considered in this paper, and that might affect generalizability. Indeed, it is possible that there are other variables not considered here that may also make a difference to the comparison. We believe that our choice of the variables considered here is reasonable, but we acknowledge the difficulty of ever having a 'complete set' of variables.

6. FUTURE RESEARCH AND CONCLUSIONS

There are several exciting opportunities for future research that extend this work. Below we describe three important extensions.

Figure 8: Optimal selection of # of customers to acquire (the concave profit function P(K) and the linear cost function C(K) plotted against the number of customers K acquired, with the critical mass CM, the optimum K* and the total N marked on the horizontal axis)

First, as mentioned in the introduction, future research is needed to determine the cost-23

benefit tradeoffs from a firm’s perspective. While there are different methodologies that can 24

potentially be used to study this question, below we outline an analytical approach that can 25

determine how much data to acquire. Let K be the number of customers to acquire. Assume a 26

NK*CM

P(K)

C(K)

Page 26: An Empirical Analysis of the Modeling Benefits of Complete

26

Let K be the number of customers to acquire, and assume a linear cost function of data acquisition, C(K) = αK (see footnote 5). As Figure 7 shows, model accuracy is roughly a concave function of K; suppose, therefore, that there is a concave profit function P(K). The manager's decision is then an optimization problem: determine K* such that the overall profit P(K) − αK is maximized, subject to N ≥ K > CM, where CM is the critical mass. From the first-order condition, we know that K* satisfies P′(K*) = α. If K* ≤ CM, then the site is better off just using site-centric data to build the model. If CM < K* < N, then the site should acquire user-centric data on K* customers. If K* ≥ N, then it should acquire data on all customers. Figure 8 illustrates this simple decision-making scenario. This paper supports such an approach in that it can be used to determine (a) the critical mass and (b) the accuracies of the site-centric and user-centric models as a function of K (these functions may then be used to develop an appropriate profit function).

Footnote 5: For example, Media Metrix used to pay each panelist $100 per year; in this case α = 100.
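
To make the decision rule concrete, the following Python sketch works through the optimization numerically. It is an illustration only: the concave profit function P and the values of α, N, and CM are assumptions chosen for the example, with P standing in for a profit curve fitted to accuracy-versus-K results such as those in Figure 7.

import numpy as np

# Illustrative (assumed) inputs: a concave profit curve, a linear acquisition
# cost C(K) = alpha * K, the panel size N, and the empirically estimated
# critical mass CM. None of these values come from the paper.
alpha, N, CM = 100.0, 10000, 1500
P = lambda K: 450000.0 * np.log1p(K / 500.0)   # assumed concave profit P(K)

K_grid = np.arange(1, N + 1)
net = P(K_grid) - alpha * K_grid               # overall profit P(K) - alpha*K
K_star = int(K_grid[np.argmax(net)])           # grid search for the maximizer

# Decision rule based on the first-order condition P'(K*) = alpha.
if K_star <= CM:
    decision = "use site-centric data only"
elif K_star < N:
    decision = f"acquire user-centric data on K* = {K_star} customers"
else:
    decision = "acquire user-centric data on all N customers"
print(K_star, decision)

The grid search recovers the K* implied by the first-order condition, and the final branch applies the acquire/do-not-acquire rule described above.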

Second, the observation that user-centric data may be better suggests that information integration strategies are worth studying, but future research needs to carefully weigh the benefits of information sharing against the costs incurred by all parties in doing so (including a potential reduction in benefits if all parties have full information). Information integration in the online world can be pursued at three very different levels, and we describe these possibilities below. Associated with each possibility are important research questions that need to be addressed in future work in order to build effective information integration strategies.

1. Firm-Firm: At this level, information integration is achieved through strategic alliances between firms to share data. Similar in spirit to our work, recent research in finance (Kallberg and Udell 2002) and economics (Jappelli and Pagano 2000) demonstrates the value of information sharing at the firm level. Kallberg and Udell (2002) studied the effect of lenders sharing credit information about borrowers and show that information sharing adds value in solving the problem of borrower information opacity. Jappelli and Pagano (2000) find that bank lending is higher and default rates are lower in countries where lenders share information, regardless of the private or public nature of the information sharing mechanism. Using a data mining based approach, our research complements



the recent work on information sharing and is the first to demonstrate its value in the online world. However, an obstacle to inter-firm information sharing has been the need to maintain the privacy of user data. While current work in various disciplines argues for information sharing between firms, much research needs to be done to study market mechanisms and incentives that make this happen while remaining sensitive to privacy concerns in the online world.

2. Firm-Customer: At this level, information integration can be achieved by firms directly acquiring additional information from customers. Indeed, this is much easier to do in the online world and can be enabled by automated Web Services, technologies that are receiving much attention for their seamless application integration capabilities. In recent work, Zheng and Padmanabhan (2002) proposed an approach based on firms selectively choosing the customers from whom to acquire additional data. An advantage of this approach is that privacy issues are automatically handled, since customers choose to provide the data themselves. Hagel and Rayport (1997) predict that customers will put aside concerns over privacy if they receive sufficient value in return for providing personal information. One approach is to pay consumers for the use of their personal information, either through royalty payments or by providing discounts, as some supermarket loyalty programs do for their members. An open research issue is how to structure the right incentives for customers to reveal their private data in the online world.

3. Firm-Market: At this level, information integration can be achieved by obtaining information from a market that provides integrated customer information. This has interesting implications for user privacy, since an information market can be an efficient way to deal with customers' private data. Laudon (1996) proposes an innovative market-based solution to user privacy in which individuals would have a common-law property right to their personal information, these rights could be sold, and national information markets would emerge. Such information markets could readily be created for the online world and could be another source from which firms acquire customer information.


Third, as a by-product of our methodology, a number of significant predictor variables have been identified from the user-centric data that are not available from the site-centric data (see, e.g., Table 10 as well as Appendices B and C). This presents the prospect of using data on these variables, along with other data, broadly for marketing purposes, much as demographic data (ZIP code, occupation, etc.) are used. Systematic exploration of this possibility must, however, await future research.

In summary, in this paper we compared models built from site-centric data with models built from user-centric data for three different prediction tasks common in eCRM applications. Based on these comparisons, we empirically determined the magnitudes of the gains that can be achieved using user-centric data and also determined the minimum amount of user-centric data that needs to be acquired if a firm cannot acquire all of it. We discussed the implications of these results and outlined interesting questions for future research. The main contributions of the paper are the empirical determination of the gains, the specific user-centric variables identified as relevant, and the critical mass needed for each of the key problems considered in the paper. We also wish to note that the methodology used in the paper to answer the questions posed is novel and is an important contribution in itself. It can be applied to any user-centric dataset that may be collected, and it can be easily generalized to different problem contexts, evaluation metrics, and modeling methods.



APPENDIX A. ALGORITHM PROBABILISTICCLIPPING

In this appendix we present ProbabilisticClipping, an algorithm for preprocessing site-centric data for the within-session prediction task. Before presenting the algorithm, we first introduce some preliminaries.

Let S1, S2, …, SN be N user sessions in a site's site-centric data. Assume that in this data the number of unique users is M and users are identified by a userid ∈ {1, 2, …, M}. We define each session Si to be a tuple of the form <ui, Ci>, where ui is the userid corresponding to the user in session Si and Ci is a set of tuples of the form <page, accessdetails>, each of which represents data that the site captures on one user click. Corresponding to a click, page is the page accessed and accessdetails is a set of attribute-value pairs representing any other information that the site can capture from the click. This includes standard information from HTTP headers, such as the time of access, IP address, and referrer field, as well as other information such as whether the user made a purchase on this page. In particular, we assume that accessdetails necessarily contains the time at which the page is accessed. For example, based on the above representation scheme, a user session at Expedia is represented as follows:

S1 = <1, { <home.html,    { (time, 02/01/2001 23:43:15), (IP, 128.122.195.3) } >,
           <flights.html, { (time, 02/01/2001 23:45:15), (IP, 128.122.195.3) } >,
           <hotels.html,  { (time, 02/01/2001 23:45:45), (IP, 128.122.195.3) } > } >

Given a session Si = <ui, Ci>, define the function kth-click(Si, j) that returns the tuple <page, accessdetails> ∈ Ci such that page is the jth page accessed in the session, as determined from the time each page is accessed. For instance, in the above example, kth-click(S1, 2) is <flights.html, { (time, 02/01/2001 23:45:15), (IP, 128.122.195.3) } >.

We also define the function fragment(Si, j, fraglength), which represents data captured from a set of consecutive clicks in the session. In particular, fragment(Si, j, fraglength) is the set of all tuples


kth-click(Si, m) such that j ≤ m ≤ minimum( (j + fraglength − 1), |Ci| ), where Si = <ui, Ci>. For example, fragment(S1, 2, 2) = { <flights.html, { (time, 02/01/2001 23:45:15), (IP, 128.122.195.3) } >, <hotels.html, { (time, 02/01/2001 23:45:45), (IP, 128.122.195.3) } > }. For any given set f of <page, accessdetails> pairs and any given session Si, we say f is a fragment of Si if there exist j, k such that f = fragment(Si, j, k).
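
To illustrate these definitions, the following Python sketch (not code from the paper) represents a session in the <ui, Ci> form above and implements kth-click and fragment directly; the dictionary-based accessdetails layout is an assumption chosen for readability.

from datetime import datetime

# A session is (userid, clicks), where each click is (page, accessdetails)
# and accessdetails is a dict that always contains the access time.
S1 = (1, [
    ("home.html",    {"time": datetime(2001, 2, 1, 23, 43, 15), "IP": "128.122.195.3"}),
    ("flights.html", {"time": datetime(2001, 2, 1, 23, 45, 15), "IP": "128.122.195.3"}),
    ("hotels.html",  {"time": datetime(2001, 2, 1, 23, 45, 45), "IP": "128.122.195.3"}),
])

def kth_click(session, j):
    """Return the j-th click (1-indexed) in time order, as in kth-click(Si, j)."""
    _, clicks = session
    ordered = sorted(clicks, key=lambda c: c[1]["time"])
    return ordered[j - 1]

def fragment(session, j, fraglength):
    """Clicks j .. min(j + fraglength - 1, |Ci|), as in fragment(Si, j, fraglength)."""
    _, clicks = session
    last = min(j + fraglength - 1, len(clicks))
    return [kth_click(session, m) for m in range(j, last + 1)]

print(kth_click(S1, 2))      # the flights.html click
print(fragment(S1, 2, 2))    # the flights.html and hotels.html clicks

Running the two print statements reproduces the kth-click(S1, 2) and fragment(S1, 2, 2) examples given above.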

Finally, prior work on building online customer interaction models (VanderMeer et al. 2000, Mobasher and Dai 2000, Mena 1999) assumes that three sets of variables are particularly relevant: (1) current visit summaries (e.g., time spent in the current session), (2) historical summaries of the user (e.g., average time spent per session in the past), and (3) user demographics.

Corresponding to this, we assume three user-defined functions as inputs to the algorithm:

1. summarize_current(f, Si), defined when f is a fragment of Si. This function is assumed to return user-defined summary variables for the current fragment and session. For the running example used in this section, summarize_current(fragment(S1, 1, 2), S1) may return numpages = 2, tot_time = 150 seconds, and booked = 1, assuming the user made a booking in one of the three pages accessed in the session.

2. summarize_historical(f, S, i), where f is a fragment of session Si and S = {S1, S2, …, SN}. This function is assumed to return summary variables based on all previous sessions. Note that the historical summaries are usually about the specific user in session Si.

3. demographics(ui), which returns the demographic information available about user ui.

Prior work in the IS and marketing literature has proposed different sets of relevant usage metrics (Novak and Hoffman 1997, Korgaonkar and Wolin 1999, Cutler 2000, Johnson et al. 2002, Kimbrough et al. 2000). Drawing on these, for the experiments presented in this paper we define summarize_historical and summarize_current to return 9 metrics for site-centric data and 32 for user-centric data. These metrics are presented in Appendix B.

Algorithm ProbabilisticClipping is presented in Figure A. Based on the desired data size, the sample rate is first computed (as described previously) in steps 1-7. Steps 9-25 iterate over all the


sessions repeatedly until the desired number of records is sampled. Each time, a session is sampled probabilistically based on the expected number of records that should be derived from it.

Inputs:  (a) User sessions S1, S2, …, SN
         (b) Desired number of data records, dnum
         (c) Functions summarize_current, summarize_historical, demographics
Outputs: Data records D1, D2, …, DP

1   S = S1 ∪ S2 ∪ … ∪ SN
2   numtotal = 0
3   for (i = 1 to N) {
4       <u, C> = Si
5       numtotal = numtotal + |C|
6   }
7   samplerate = dnum / numtotal
8   p = 0; i = 1
9   while (p < dnum) {
10      <u, C> = Si
11      session_len = |C|
12      rand = random_real(0, 1)
13      if (rand < session_len * samplerate) {     /* whether to sample */
14          clip = random_int(1, session_len)      /* where to clip */
15          f = fragment(Si, 1, clip)
16          current = summarize_current(f, Si)
17          history = summarize_historical(f, S, i)
18          demog = demographics(u)
19          Dp = current ∪ history ∪ demog
20          p = p + 1
21          output Dp
22      }
23      i = i + 1
24      if (i > N) { i = 1 }
25  }


Figure A: Algorithm ProbabilisticClipping
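
For readers who prefer executable code, the following is a near-literal Python transcription of Figure A, offered as an illustrative sketch only: the simplified click representation and the placeholder summarize_current, summarize_historical, and demographics functions are assumptions standing in for the Appendix B metrics, not the implementation used in the paper.

import random

def probabilistic_clipping(sessions, dnum, summarize_current,
                           summarize_historical, demographics):
    """Sample dnum clipped-session records from site-centric sessions (Figure A)."""
    # Steps 1-7: compute the sampling rate from the total number of clicks.
    numtotal = sum(len(clicks) for _, clicks in sessions)
    samplerate = dnum / numtotal

    records, i = [], 0
    # Steps 9-25: cycle over the sessions until dnum records have been sampled.
    while len(records) < dnum:
        userid, clicks = sessions[i]
        session_len = len(clicks)
        if random.random() < session_len * samplerate:   # step 13: whether to sample
            clip = random.randint(1, session_len)         # step 14: where to clip
            f = clicks[:clip]                             # fragment(Si, 1, clip), clicks in time order
            record = {}
            record.update(summarize_current(f, sessions[i]))
            record.update(summarize_historical(f, sessions, i))
            record.update(demographics(userid))
            records.append(record)                        # steps 19-21
        i = (i + 1) % len(sessions)                       # steps 23-24: wrap around
    return records

# Toy usage with a one-session panel and placeholder (assumed) summary functions.
sessions = [(1, ["home.html", "flights.html", "hotels.html"])]
records = probabilistic_clipping(
    sessions, dnum=2,
    summarize_current=lambda f, s: {"numpages": len(f)},
    summarize_historical=lambda f, S, i: {"past_sessions": i},
    demographics=lambda u: {"userid": u},
)
print(records)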


APPENDIX B. METRICS FOR WITHIN-SESSION PREDICTION

No.  Variable   Description
1    gender     "1" – male, "0" – female
2    age        Age of the user
3    income     Income of the user
4    edu        "0" – high school or less, "1" – college, "2" – post college
5    hhsize     Size of household
6    child      "1" – has children, "0" – no children
7    booklh     No. of bookings the user made at this site in the past
8    sesslh     No. of sessions to this site so far
9    minutelh   Time spent at this site so far, in minutes
10   hpsesslh   Average hits per session at this site
11   mpsesslh   Average time spent per session at this site
12   hitlc      No. of hits to this site up to this point in this session
13   minutelc   Time spent up to this point in this session
14   weekend    Indicates whether this session occurs on a weekend
15   bookgh     No. of past bookings across all sites so far
16   sespsite   Average sessions per site so far
17   sessgh     Total no. of sessions across all sites so far
18   minutegh   Total minutes across all sites
19   hpsessgh   Average hits per session (all sites)
20   mpsessgh   Average minutes per session (all sites)
21   awareset   Total no. of unique shopping sites visited
22   basket     Average no. of shopping sites visited per session
23   single     Percentage of single-site sessions
24   booksh     Percentage of total bookings that are at this site
25   hitsh      Percentage of total hits that are to this site
26   sessh      Percentage of total sessions that are to this site
27   minutesh   Percentage of total minutes that are at this site
28   entrate    No. of sessions starting with this site / total sessions of this site
29   peakrate   No. of sessions in which the user spends the most time at this site / total sessions
30   exitrate   No. of sessions ending with this site / total sessions of this site
31   SErate     No. of sessions starting with this site / total sessions of this site
32   hitgc      Total hits across all sites in the current session
33   basketgc   No. of shopping sites in this session
34   minutegc   Time spent across all sites in this session
35   SEgc       Indicates whether this session uses search engines
36   path       Indicates whether this site is an entry/peak site
37   hitshc     Hits to this site / hits to all sites in this session
38   minutshc   Minutes at this site / total minutes in this session
39   bookfut    Binary dependent variable, indicating whether the user is going to book in the remainder of the session (after the clipping point)

Note: variables 1-14 are site-centric variables; variables 15-38 are additional user-centric variables; and variable 39 is the dependent variable.
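
As an illustration of how the share-of-site variables above (booksh, hitsh, sessh, minutesh) can be computed from user-centric panel data, the following hypothetical Python sketch derives them for one user and one site. The per-session record layout is assumed for the example, and each record is treated as a single session at a single site.

def share_of_site_metrics(user_sessions, site):
    """Compute booksh, hitsh, sessh, and minutesh for one user and one site.

    user_sessions: list of dicts with assumed keys 'site', 'bookings',
    'hits', and 'minutes' -- one dict per session of this user.
    """
    def share(field):
        total = sum(s[field] for s in user_sessions)
        at_site = sum(s[field] for s in user_sessions if s["site"] == site)
        return at_site / total if total else 0.0

    return {
        "booksh":   share("bookings"),                     # % of bookings at this site
        "hitsh":    share("hits"),                         # % of hits to this site
        "sessh":    sum(1 for s in user_sessions if s["site"] == site)
                    / len(user_sessions),                  # % of sessions to this site
        "minutesh": share("minutes"),                      # % of minutes at this site
    }

example = [
    {"site": "expedia.com",     "bookings": 1, "hits": 12, "minutes": 9.5},
    {"site": "travelocity.com", "bookings": 0, "hits": 7,  "minutes": 4.0},
]
print(share_of_site_metrics(example, "expedia.com"))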



APPENDIX C. METRICS FOR FUTURE-SESSION AND REPEAT VISIT PROBLEMS

No.  Variable   Description
1    gender     "1" – male, "0" – female
2    age        Age of the user
3    income     Income of the user
4    edu        "0" – high school or less, "1" – college, "2" – post college
5    hhsize     Size of household
6    child      "1" – has children, "0" – no children
7    booklh     No. of past bookings at this site so far
8    sesslh     No. of sessions to this site so far
9    minutelh   Time spent at this site so far, in minutes
10   hpsesslh   Average hits per session at this site
11   mpsesslh   Average time spent per session at this site
12   bookgh     No. of past bookings across all sites so far
13   sespsite   Average sessions per site so far
14   sessgh     Total no. of sessions across all sites so far
15   minutegh   Total minutes across all sites
16   hpsessgh   Average hits per session (all sites)
17   mpsessgh   Average minutes per session (all sites)
18   awareset   Total no. of unique shopping sites visited
19   basket     Average no. of shopping sites visited per session
20   single     Percentage of single-site sessions
21   booksh     Percentage of total bookings that are at this site
22   hitsh      Percentage of total hits that are to this site
23   sessh      Percentage of total sessions that are to this site
24   minutesh   Percentage of total minutes that are at this site
25   entrate    No. of sessions starting with this site / total sessions of this site
26   peakrate   No. of sessions in which the user spends the most time at this site / total sessions of this site
27   exitrate   No. of sessions ending with this site / total sessions of this site
28   SErate     No. of sessions containing search engines
29   dependent  Binary dependent variable: for the future-session (user-level) prediction task, indicates whether the user is going to book in the future (after this session); for the repeat-visit prediction task, indicates whether the user is going to visit this site again in the future (after this session)


Note: variables 1-11 are site-centric variables; variables 12-28 are additional user-centric variables; and variable 29 is the dependent variable.


REFERENCES

Adomavicius, G. and Tuzhilin, A. "User Profiling in Personalization Applications through Rule Discovery and Validation," Proceedings of the Fourth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999, pp. 377-381.

Aggarwal, C. C., Sun, P. Z. and Yu, S. "Online Generation of Profiling Association Rules," Proceedings of the Third ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1998, pp. 129-133.

Chen, P-Y. and Hitt, L. "Measuring Switching Costs and the Determinants of Customer Retention in Internet-Enabled Businesses: A Study of the Online Brokerage Industry," Information Systems Research (13:3), 2002, pp. 255-274.

Cutler, M. "E-Metrics: Tomorrow's Business Metrics Today," Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000.

Fader, P. and Hardie, B. "Forecasting Repeat Sales at CDNow: A Case Study," Interfaces (31:3), 2001, pp. 94-107.

Fawcett, T. and Provost, F. "Combining Data Mining and Machine Learning for Effective User Profiling," Proceedings of the First ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1996, pp. 8-13.

Forrester Research Inc. "Getting the Retail Technology Advantage," Forrester Research Inc., Cambridge, MA, 2002.

Hagel, J. and Rayport, J. "The Coming Battle for Customer Information," Harvard Business Review, January-February 1997, pp. 53-65.

Hagel, J. and Singer, M. "Net Worth: Shaping Markets when Customers Make the Rules," Harvard Business School Press, Cambridge, MA, 1999.

Hughes, A. M. "The Complete Database Marketing: Second-Generation Strategies and Techniques for Tapping the Power of your Customer Database," Irwin Professional, Chicago, Ill., 1996.


Jappelli, T. and Pagano, M. "The European Experience with Credit Information Sharing," Working Paper, Center for Studies in Economics and Finance (CSEF), University of Salerno, Italy, 2000.

Johnson, E., Moe, W., Fader, P., Bellman, S. and Lohse, J. "On the Depth and Dynamics of Online Search Behavior," Working Paper #00-014, The Wharton School, University of Pennsylvania, 2002.

Johnson, R. and Wichern, D. "Applied Multivariate Statistical Analysis," Fourth Edition, Prentice Hall, New Jersey, 1998.

Kallberg, J. and Udell, G. "The Value of Private Sector Credit Information Sharing: The U.S. Case," Journal of Banking and Finance, forthcoming.

Kimbrough, S., Padmanabhan, B. and Zheng, Z. "On Usage Metrics for Determining Authoritative Sites," Proceedings of the Workshop on Information Technologies and Systems (WITS 2000), Brisbane, Australia, 2000.

Korgaonkar, P. and Wolin, L. D. "A Multivariate Analysis of Web Usage," Journal of Advertising Research (39), 1999, pp. 53-68.

Laudon, K. "Markets and Privacy," Communications of the ACM (39:9), 1996, pp. 92-104.

Lee, S. "An Investigation of Repeat Visit Behavior on the Internet: Model and Applications," Unpublished dissertation, Marketing Department, University of Southern California, 2000.

Ling, C. and Li, C. "Data Mining for Direct Marketing: Problems and Solutions," Proceedings of the Third ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1998, pp. 73-79.

Mena, J. "Data Mining your Website," Digital Press of Butterworth-Heinemann, 1999.

Mobasher, B. and Dai, H. "Discovery of Aggregate Usage Profiles for Web Personalization," Proceedings of the ACM WebKDD Workshop, 2000.

Moe, W. and Fader, P. "Dynamic Conversion Behavior at E-commerce Sites," Working Paper #00-023, Marketing Department, The Wharton School, 2002a.


Moe, W. and Fader, P. "Capturing Evolving Visit Behavior in Clickstream Data," Working Paper #00-003, Marketing Department, The Wharton School, 2002b.

Novak, T. and Hoffman, D. "New Metrics for New Media: Toward the Development of Web Measurement Standards," World Wide Web Journal (2:1), 1997, pp. 213-246.

Nasraoui, O., Frigui, H., Joshi, A. and Krishnapuram, R. "Mining Web Access Logs Using Relational Competitive Fuzzy Clustering," Proceedings of the Eighth International Fuzzy Systems Association World Congress, Taipei, 1999.

Olafsson, S. and Yang, J. "Scalable Optimization-Based Feature Selection," Proceedings of the SIAM Workshop on Discrete Mathematical Data Mining, Arlington, VA, 2002, pp. 53-64.

OnlineAppendix. All 570 individual tree models are available online. For the blind review process, the models are attached in a zipped file and may be available from the SE/AE. 2003.

Padmanabhan, B. and Tuzhilin, A. "On the Use of Optimization for Data Mining: Theoretical Interactions and eCRM Opportunities," Management Science (49:10), 2003, pp. 1327-1342.

Padmanabhan, B., Zheng, Z. and Kimbrough, S. "Personalization from Incomplete Data: What You Don't Know Can Hurt," Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, 2001.

Pan, S. and Lee, J. "Using e-CRM for a Unified View of the Customer," Communications of the ACM (46:4), 2003, pp. 95-105.

Park, Y. and Fader, P. "Modeling Browsing Behavior at Multiple Sites," Marketing Science, forthcoming, 2004.

Rafaely, O. "Characterize Visitors on Order Size," Proceedings of KDD-CUP, Boston, 2000.

Schechter, S., Krishnan, M. and Smith, M. "Using Path Profiles to Predict HTTP Requests," Proceedings of the 7th International World Wide Web Conference, Brisbane, Australia, 1998.

Sen, S., Padmanabhan, B., Tuzhilin, A., White, N. and Stein, R. "The Identification and Satisfaction of Consumer Analysis-Driven Information Needs of Marketers on the WWW," European Journal of Marketing (32:7), 1998, pp. 688-702.


Shaw, S. "Personal Communication with the CEO of Autometrics," Autometrics Inc., London, 2002.

Swift, R. S. "Accelerating Customer Relationships," Prentice Hall, New York, 2001.

Theusinger, C. and Huber, K. "Analyzing the Footsteps of Your Customers," WebKDD 2000.

VanderMeer, D., Dutta, K. and Datta, A. "Enabling Scalable Online Personalization on the Web," Proceedings of the ACM Conference on Electronic Commerce (EC'00), Minneapolis, 2000.

Varian, H. "Economics of Information Technology," Mattioli Lecture Notes, Bocconi University, 2001. Available at http://www.sims.berkeley.edu/~hal/Papers/mattioli.pdf.

Verhoef, P. "Understanding the Effect of Customer Relationship Management Efforts on Customer Retention and Customer Share Development," Journal of Marketing (67:4), 2003, pp. 30-45.

Witten, I. and Frank, E. "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations," Morgan Kaufmann, 1999.

Zheng, Z. and Padmanabhan, B. "On Active Learning for Data Acquisition," Proceedings of the 2nd IEEE International Conference on Data Mining, 2002.

XYZ 2001. Conference presentation by the authors (anonymized for blind review).