
Tracking of consumer behaviour in e-commerce

Maria Rosario Mestre∗†, Pedro Vitoria‡

∗Signal Processing Laboratory, Department of Engineering, University of Cambridge, UK
†Featurespace, Broers Building, Hauser Forum, Cambridge, UK
‡Mathematical Institute, Oxford-Man Institute for Quantitative Finance, University of Oxford, UK

E-mail: ∗[email protected], †[email protected]

Abstract: With the increasing amount of transactional data available on e-commerce websites, it is possible to gain a deeper insight into the dynamics of a given business. Based on the profiles of the different types of customer and the changes they undergo over time, new marketing strategies can be developed to target specific groups of users. The aim in this work is to estimate the future state of a customer and decide whether to target that customer or not. In the first part of the paper, we describe our proposed algorithm, which uses hierarchical clustering and a hidden Markov model (HMM). The clustering can have one level (non-augmented) or two levels (augmented). We compare the augmented and non-augmented methods to a benchmark on synthetic and real data to show that our model outperforms the others in predicting future customer behaviour. In the second part of the paper, we use a decision-theory tool to estimate whether it is financially beneficial for the business to adopt our proposed model, as opposed to a less complex one. We conclude that there might not be any benefit at all, even though the model is more accurate in its predictions. This will depend on the utility functions at stake.

I. INTRODUCTION

Marketing has been undergoing some important changes in recent times. The amount of stored customer data on the web allows for the adoption of new models of interaction between companies and customers. Traditionally, market segmentation was done according to established segments, such as gender, age and location. Websites now have at their disposal large amounts of customer transactional data, which can be used to infer customers’ preferences and habits. New techniques can be developed for customer relationship management (CRM). Depending on what the e-commerce business is trying to optimise, different marketing strategies can be devised that will target different sets of customers. For example, the company might be interested in decreasing the proportion of customers churning or might want to increase the average customer value.

The model that can typically be found in the literature assumes a static set of customer segments with customers changing states or segments over time [1]. The segments should be ‘human-readable’. In other words, a marketer should be able to give them a label and, later on, design marketing campaigns tailored for particular segments. Segments can be found using clustering algorithms [2]. After doing the segmentation, a cluster might emerge with ‘high value’ customers, that is, customers that are actively using the e-commerce website and purchasing products on it. It has long been argued that a company should target those customers with marketing campaigns. However, recent works in the marketing literature claim that this type of strategy might not ultimately optimise the customer value to the company ([1], [3]). The reason behind this is that customers tend to spend brief periods of time in a ‘high activity - high spending’ state, making this state very unstable and sporadic. Other states associated with lower immediate customer values might however represent a more loyal and solid customer base yielding a higher customer value in the long term.

Using models for customer behaviour over time can thus provide new insights to the company and help them design an optimal marketing strategy [4]. In this paper, we develop an algorithm to track customer behaviour using transactional data. We model customer transactions as multi-dimensional objects indexed by time. We first segment the market by finding static clusters representing different object states using a Gaussian mixture model (GMM). Following this, we model how objects move between clusters by a hidden Markov model (HMM). An HMM can be used to model the dynamics of objects and predict their future state within a probabilistic framework. We introduce several novelties in our approach. Firstly, we use a market segmentation based on a variant of hierarchical clustering. A first level of clustering is done on the ‘human-readable’ transactional features. These features represent the positions of customers in the transactional space. A second level of clustering is done using a different set of features: the velocities of the customers in this space. Secondly, we use this augmentation of the state-space to improve the prediction ability of the HMM.

In section II, we discuss our proposed algorithm in detail. We then show results on synthetic data and real transactional data in section III. We compare the models in terms of their predictive accuracy. However, this is not necessarily enough information for the business to decide whether to adopt our predictive model or not. Therefore, in section IV we introduce a framework based on graphical models to quantify the improvement brought by our proposed model given specific utility functions.

A. Related work

The use of transactional data for the purpose of understanding and predicting customer behaviour is a relatively recent topic in the literature ([5], [6], [7], [8]). In [6] and [7], the authors decouple ‘transactional data’ from ‘context data’ in order to improve the prediction of customer behaviour in an e-commerce setting. They build a predictive model with current features and try to predict the value of another set of features (dependent variables). The difference with our approach is that they do not incorporate the time dynamics in their model. They also use different clustering algorithms. The work presented in [5] compares different types of granularity in the segmentation for predictive accuracy. One of their conclusions is that for sparse datasets with few transactions the 1-to-1 model can be improved upon by clustering groups of customers.

The use of Markov models is not new in the marketing literature ([1], [3], [4]). The works by [3] and [4] use an HMM to model the interactions between customers and companies in order to estimate the effect of marketing expenditure on the dynamics of the model. Both works use time-varying covariates to model the heterogeneity of customers. The work in [3] treats the problem of finding the optimal marketing resource allocation as a dynamic optimization problem. In their model, the long-term effects of marketing expenditure depend on the transition matrix. The work in [1] predicts future customer state using a Markov model and infers future customer-lifetime-value for different time horizons. Similar to this work, they compare predictions with observed values to measure predictive accuracy. They conclude that a Markov model outperforms all other models that do not account for time dynamics.

The work in [9] gives a series of examples where cash flows are assigned to the states in the Markov chain. The problem becomes a Markov decision problem (MDP) where the optimal marketing policy needs to be found. The models differ from the ones presented here in that customers are described by few behavioural features and only some state transitions are allowed. The state of a customer is known at any given time, whereas we model it as a random variable. They introduce the idea of augmenting the state-space like we propose in this paper. MDPs are also used in recommendation systems [10]. We could extend this work in the future to incorporate an MDP.

In terms of the clustering algorithm, we used a variant of hierarchical clustering where the levels are built on different sets of features. In [11] the authors also use hierarchical clustering with mixture models combined with an HMM. However, they perform the levels of clustering on the same set of features and allow their clusters to have several states. They need to estimate the number of clusters, as well as the number of states. In our case, each cluster can only be in one state because we assume our clusters are static in time. They propose using Bayesian model selection to find the optimal number of clusters, which is something that could also be applied to our setting, but we leave this for future work.

II. ALGORITHM

The main novelty in our approach is that we use a variant of hierarchical clustering to augment the state-space of an HMM. This algorithm has not been used in an e-commerce context before. The use of hierarchical clustering comes as a natural choice when doing market segmentation. At a first level of clustering, the ‘traditional’ or expected segments can be defined by the marketer. This means that the first level might not contain all the transactional features that are available, since this would make the segments more difficult to interpret. Once the first level has been defined, it is possible to add more levels to the clustering by introducing new features. The sole purpose of the new clustering level is to achieve a better prediction accuracy on the original clusters. In this paper, we use the velocities of our data points as the additional features. However, this technique could be easily extended to use other types of features, such as context-based ones. The clusters thus defined become the hidden states of the HMM.

In this paper, we use a mixture of Gaussians to model the clusters in the data at the different levels. At the first level, there are $K_1$ clusters indexed by $k \in \{1, \ldots, K_1\}$. At the second level, there are $K_2$ clusters indexed by $k' \in \{1, \ldots, K_2\}$. The numbers of clusters $K_1$ and $K_2$ are known a priori; the automatic identification of the number of clusters is beyond the scope of this paper. We end up with $K = K_1 \times K_2$ augmented clusters indexed by $\tilde{k} \in \{1, \ldots, K\}$. We assume that the clusters are static over time. We denote by $z^i_t \in \{1, \ldots, K_1\}$ the variable that indicates the cluster membership of $y^i_t$, which is object $i$ observed at time $t$. There are $N$ objects, each with a varying number of observations $T_i$. The density of the Gaussian mixture model is given by

$$g(y^i_t) = \sum_{k=1}^{K_1} w_k \, p(y^i_t \mid \mu_k, \Sigma_k),$$

where $w_k$ are the mixture weights, $\{\mu_k, \Sigma_k\}_{k=1}^{K_1}$ are the mixture component parameters and

$$p(y^i_t \mid z^i_t = k) = (2\pi)^{-\frac{d}{2}} \, |\Sigma_k|^{-\frac{1}{2}} \exp\left( -\frac{(y^i_t - \mu_k)^{\top} \Sigma_k^{-1} (y^i_t - \mu_k)}{2} \right),$$

with $d$ the dimension of $y^i_t$.

The Expectation-Maximisation (EM) algorithm is a well-known method to find the maximum-likelihood parameters of each cluster, as well as the cluster weights.
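As an illustration of the mixture density and of the quantity the E-step of EM computes, the following is a minimal sketch rather than the implementation used in the paper. It assumes diagonal covariance matrices to stay within the Python standard library, and the weights, means and variances are made-up values.

```python
import math

def gaussian_pdf_diag(y, mu, var):
    """Gaussian density p(y | mu, Sigma) for a diagonal Sigma (simplifying assumption)."""
    d = len(y)
    log_det = sum(math.log(v) for v in var)
    quad = sum((yi - mi) ** 2 / vi for yi, mi, vi in zip(y, mu, var))
    return math.exp(-0.5 * (d * math.log(2 * math.pi) + log_det + quad))

def mixture_density(y, weights, means, variances):
    """g(y) = sum_k w_k p(y | mu_k, Sigma_k)."""
    return sum(w * gaussian_pdf_diag(y, mu, var)
               for w, mu, var in zip(weights, means, variances))

def responsibilities(y, weights, means, variances):
    """E-step quantity: posterior probability of each cluster given y."""
    joint = [w * gaussian_pdf_diag(y, mu, var)
             for w, mu, var in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical 2-D mixture with two clusters (values chosen for illustration only).
weights = [0.3, 0.7]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
y = [0.2, -0.1]
g = mixture_density(y, weights, means, variances)
r = responsibilities(y, weights, means, variances)
```

The M-step would then re-estimate the weights, means and covariances from these responsibilities; a full-covariance version needs a matrix inverse and determinant in place of the per-dimension sums.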

We will omit the superscript $i$ for notational simplicity from now on. Once we have the clusters at the first level, we proceed to find the $K_2$ subclusters by introducing a new set of features. In this paper, we have used the velocities of the objects, $y'_t = y_t - y_{t-1}$. For each level-1 cluster $k$, we aggregate all the observations assigned to that cluster, $\{y_t : z_t = k\}$, and compute their new set of features $y'_t$. These are the velocities of the objects in cluster $k$. We then perform the second level of clustering by estimating the subclusters over this set of data points with the EM algorithm. From this, we find the parameters of the velocity subclusters $\{\mu'_{k,k'}, \Sigma'_{k,k'}\}$. The membership to the second-level states is denoted by $z'_t \in \{1, \ldots, K_2\}$.
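The aggregation step that feeds the second-level clustering can be sketched as follows. This is a hedged illustration: the function name, the toy trajectory and its labels are hypothetical, and skipping the first observation (which has no predecessor and hence no velocity) is an assumption, since the text does not say how it is handled.

```python
def velocities_by_cluster(trajectory, labels):
    """Group velocities y'_t = y_t - y_{t-1} by the level-1 cluster z_t of y_t."""
    groups = {}
    for t in range(1, len(trajectory)):
        velocity = [a - b for a, b in zip(trajectory[t], trajectory[t - 1])]
        groups.setdefault(labels[t], []).append(velocity)
    return groups

# Hypothetical trajectory of one object in a 2-D feature space,
# with its level-1 cluster labels z_t.
trajectory = [[0.0, 0.0], [0.5, 0.0], [1.5, 1.0], [1.5, 1.0]]
labels = [1, 1, 2, 2]
groups = velocities_by_cluster(trajectory, labels)
```

Each entry of `groups` is the data set on which EM would be run to estimate the velocity subclusters of the corresponding level-1 cluster.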

The augmented data space is $\tilde{y}_t = [y_t, y'_t]^T$. The new augmented cluster centre is $\tilde{\mu}_{\tilde{k}} = [\mu_k, \mu'_{k,k'}]^T$ and its augmented covariance matrix is $\tilde{\Sigma}_{\tilde{k}} = \mathrm{diag}(\Sigma_k, \Sigma'_{k,k'})$. We denote the membership to the augmented state-space by $\tilde{z}_t \in \{1, \ldots, K\}$. The state $\{\tilde{z}_t = K_2(k - 1) + k'\}$ corresponds to the level-1 $k$-th cluster and the level-2 $k'$-th cluster, or $\{z_t = k, z'_t = k'\}$.
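The index mapping and the block-diagonal covariance can be sketched in a few lines. The helper names below are hypothetical; only the formula $K_2(k - 1) + k'$ comes from the text.

```python
def augmented_index(k, k_prime, K2):
    """Map the pair (k, k') with 1-based indices to the augmented state K2 (k - 1) + k'."""
    return K2 * (k - 1) + k_prime

def split_index(z_tilde, K2):
    """Inverse mapping: recover (k, k') from the augmented state index."""
    k, k_prime = divmod(z_tilde - 1, K2)
    return k + 1, k_prime + 1

def block_diag(a, b):
    """diag(a, b) for two square matrices given as lists of rows."""
    n, m = len(a), len(b)
    top = [list(row) + [0.0] * m for row in a]
    bottom = [[0.0] * n + list(row) for row in b]
    return top + bottom

# With K1 = 3 and K2 = 2 the six augmented states are numbered 1..6.
K1, K2 = 3, 2
all_states = sorted(augmented_index(k, kp, K2)
                    for k in range(1, K1 + 1) for kp in range(1, K2 + 1))
```

For example, `augmented_index(2, 1, 2)` gives 3, and `split_index(3, 2)` returns `(2, 1)`.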

Once we have estimated the parameters of all the clusters, we can proceed to estimating the parameters of the HMM. The HMM is defined by a Markov chain over the hidden variables $z_t$ (or $\tilde{z}_t$) with transition probability matrix $P$, as well as by the emission probabilities for each state $p(y_t \mid z_t)$, which are Gaussians with parameters $\{\mu_k, \Sigma_k\}$ for the non-augmented state-space and $\{\tilde{\mu}_{\tilde{k}}, \tilde{\Sigma}_{\tilde{k}}\}$ for the augmented one. Given the transition probability matrix for the HMM, it is possible to compute the predicted state of an object given its past observations:

$$\begin{aligned}
p(z_{t+\Delta} \mid y_1, \ldots, y_t) &= \sum_{z_t} p(z_{t+\Delta}, z_t \mid y_1, \ldots, y_t) \\
&= \sum_{z_t} p(z_{t+\Delta} \mid z_t, y_1, \ldots, y_t) \, p(z_t \mid y_1, \ldots, y_t).
\end{aligned} \tag{1}$$

Computing the second term in (1) is a filtering task that can easily be done using the forward algorithm. The first term is just

$$p(z_{t+\Delta} \mid z_t, y_1, \ldots, y_t) = p(z_{t+\Delta} \mid z_t), \tag{2}$$

by the Markov property. This quantity can easily be computed using the Chapman-Kolmogorov equation for the transition matrix $P$: $P(z_{t+\Delta} = j \mid z_t = i) = (P^{\Delta})_{ij}$.
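The two ingredients of equation (1), filtering and $\Delta$-step propagation, can be sketched as below. This is an illustrative sketch, not the authors' code: the transition matrix, prior and emission likelihoods are hypothetical numbers, and the prior is taken to be the distribution of the first hidden state.

```python
def vec_mat(v, P):
    """Row vector times matrix: (v P)_j = sum_i v_i P_ij."""
    return [sum(v[i] * P[i][j] for i in range(len(v))) for j in range(len(P[0]))]

def forward_filter(prior, P, likelihoods):
    """Forward algorithm returning p(z_t | y_1, ..., y_t).

    likelihoods[t][k] holds the emission density p(y_t | z_t = k).
    """
    belief = list(prior)
    for t, lik in enumerate(likelihoods):
        if t > 0:
            belief = vec_mat(belief, P)              # one-step prediction
        unnorm = [b * l for b, l in zip(belief, lik)]  # measurement update
        total = sum(unnorm)
        belief = [u / total for u in unnorm]
    return belief

def predict(filtered, P, delta):
    """p(z_{t+delta} | y_1, ..., y_t): filtered distribution times P^delta."""
    belief = list(filtered)
    for _ in range(delta):
        belief = vec_mat(belief, P)
    return belief

# Hypothetical two-state chain in which state 2 is absorbing.
P = [[0.9, 0.1], [0.0, 1.0]]
prior = [0.5, 0.5]
likelihoods = [[0.8, 0.2], [0.7, 0.3]]
filtered = forward_filter(prior, P, likelihoods)
predicted = predict(filtered, P, 2)
```

Applying `vec_mat` repeatedly is equivalent to multiplying by $P^{\Delta}$ and avoids forming the matrix power explicitly.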

The aim of our method is to get an estimate $\hat{z}_{t+\Delta}$ for the future state of an object. We start by computing the distribution over future states $p(z_{t+\Delta} \mid y_1, \ldots, y_t)$ as described above. However, for the augmented model we need an additional step. Since we are only interested in predicting the first level of clustering, we need to sum over the second level,

$$p(z_{t+\Delta} = k \mid \tilde{y}_1, \ldots, \tilde{y}_t) = \sum_{k'=1}^{K_2} p(\tilde{z}_{t+\Delta} = (k, k') \mid \tilde{y}_1, \ldots, \tilde{y}_t).$$

We then choose the maximum-a-posteriori (MAP) estimator for the future state of each observation:

$$\hat{z}_{t+\Delta} = \arg\max_k \, p(z_{t+\Delta} = k \mid y_1, \ldots, y_t). \tag{3}$$
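The marginalisation over the second level followed by the MAP choice can be sketched as follows. The predictive distribution below is made up for illustration, and the ordering of the augmented states is assumed to follow $K_2(k - 1) + k'$.

```python
def marginalise_level2(p_aug, K1, K2):
    """Sum the augmented distribution over k' to obtain p(z = k) for k = 1..K1.

    p_aug is ordered by augmented index K2 (k - 1) + k', stored 0-based.
    """
    return [sum(p_aug[K2 * k + kp] for kp in range(K2)) for k in range(K1)]

def map_state(p):
    """1-based index of the most probable state."""
    return max(range(len(p)), key=lambda k: p[k]) + 1

# Hypothetical predictive distribution over K1 * K2 = 6 augmented states.
p_aug = [0.30, 0.05, 0.20, 0.25, 0.10, 0.10]
p_level1 = marginalise_level2(p_aug, K1=3, K2=2)
z_hat = map_state(p_level1)
```

In this toy case the single most probable augmented state belongs to level-1 cluster 1, yet the MAP level-1 cluster after marginalisation is cluster 2, which is why the summation step matters.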

To get the equations for the augmented state-space, we only need to replace $y_t$ by $\tilde{y}_t$ in (1), (2) and (3), and replace $z_t$ by $\tilde{z}_t$ in (1) and (2). Algorithm 1 and Figure 1 summarise all the steps for the prediction in the augmented state-space model.

For notational simplicity, we will use $\hat{z}(t + \Delta)$ to describe the state estimate of an object at time $t + \Delta$. We compute the MAP estimator using both the non-augmented and augmented state-space, and we denote them as $\hat{z}(t + \Delta)$ and $\hat{z}^{+}(t + \Delta)$ respectively. As a benchmark for the real data, we built an estimator $\hat{z}_B(t + \Delta)$ that assumes the object is in the same cluster it was in at time $t$. We use this to understand whether the prediction error is due to a lack of general movement in the data. Finally, we need to choose a ground truth so that we can compare all predictions to it and compute an error. As ground truth, we use the most likely cluster membership at time $t + \Delta$, $z_{ML}(t + \Delta) = \arg\max_z p(y_{t+\Delta} \mid z)$. With this notation, the benchmark estimator is such that $\hat{z}_B(t + \Delta) = z_{ML}(t)$. As an error metric, we then compute the root-mean-square error (RMSE) for each estimator:

$$\sqrt{E\{(\mu_{\hat{z}(t+\Delta)} - \mu_{z_{ML}(t+\Delta)})^2\}}$$

for the non-augmented case,

$$\sqrt{E\{(\mu_{\hat{z}^{+}(t+\Delta)} - \mu_{z_{ML}(t+\Delta)})^2\}}$$

for the augmented case and

$$\sqrt{E\{(\mu_{\hat{z}_B(t+\Delta)} - \mu_{z_{ML}(t+\Delta)})^2\}}$$

for the benchmark.

The aim of this paper is to predict the future segment, and thus an error metric based on the centres of the clusters is appropriate. We also tried other error metrics, such as the percentage of correct estimates. The results show similar trends to the ones presented here; we could not include them because of space constraints.
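A centre-based RMSE of this kind can be sketched as below. The cluster centres and label sequences are hypothetical, and reading the squared difference of two centres as a squared Euclidean distance is an assumption about the multi-dimensional case.

```python
import math

def rmse_on_centres(predicted, truth, centres):
    """RMSE between the centres of predicted and ground-truth clusters.

    The squared error of one prediction is the squared Euclidean distance
    between the two cluster centres (an assumption for the multi-dimensional case).
    """
    total = 0.0
    for z_hat, z_ml in zip(predicted, truth):
        total += sum((a - b) ** 2 for a, b in zip(centres[z_hat], centres[z_ml]))
    return math.sqrt(total / len(predicted))

# Hypothetical centres of two level-1 clusters and four predictions.
centres = {1: [0.0, 0.0], 2: [3.0, 4.0]}
z_ml_future = [1, 2, 2, 1]   # ground truth z_ML(t + delta)
z_map = [1, 2, 1, 1]         # model prediction
z_ml_now = [1, 1, 2, 1]      # benchmark: the cluster occupied at time t

rmse_model = rmse_on_centres(z_map, z_ml_future, centres)
rmse_benchmark = rmse_on_centres(z_ml_now, z_ml_future, centres)
```

A wrong prediction between nearby clusters is penalised less than one between distant clusters, which a plain misclassification rate would not capture.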

Algorithm 1 Proposed augmented predictive model

1: $Y_1$: position features, $Y_2$: velocity features
2: $\{K_1, K_2\} \leftarrow \{3, 2\}$
3: Level-1 clustering
4:   Aggregate all observations in $Y_1$
5:   Run EM algorithm to estimate $\{\mu_k, \Sigma_k\}_{k=1}^{K_1}$
6: Level-2 clustering
7:   for all clusters $k \in [1, K_1]$ do
8:     Find $Y'_k = \{y'_t \in Y_2 : z_t = k\}$
9:     Run EM algorithm on $Y'_k$ to estimate $\{\mu'_{(k,k')}, \Sigma'_{(k,k')}\}_{k'=1}^{K_2}$
10:   end for
11: Time dynamics
12:   Augment data: $\tilde{Y} \leftarrow [Y_1, Y_2]$
13:   Augment state parameters to obtain $\{\tilde{\mu}_{\tilde{k}}, \tilde{\Sigma}_{\tilde{k}}\}_{\tilde{k}=1}^{K}$
14:   Run HMM on $\tilde{Y}$ to estimate transition matrix $P$
15:   Get prediction $\hat{z}^i_{t+\Delta}$ for $i \in [1, N]$

[Figure 1 diagram. Labels: Position, Level 1 clustering (clusters $k = 1, \ldots, K_1$), Velocity (change in features), Level 2 clustering (subclusters $k' = 1, \ldots, K_2$), Augmentation (augmented clusters $\tilde{k} = 1, \ldots, K$).]

Fig. 1: Hierarchical clustering.

III. RESULTS ON PREDICTIVE ACCURACY

A. Synthetic data

We first tested the concept on synthetic data. We simulated a simple scenario similar to the one in [12], where the data comes from 3 different 2-dimensional clusters. The clusters follow Gaussian distributions with different means and identical variances, as shown in Figure 2. We work with $N = 1000$ points and $T = 30$ time steps. At the beginning of the simulation, there are only 2 clusters, $c_1$ and $c_2$. At time $t = 10$, cluster $c_1$ splits in 3: 40% of the points stay in the same cluster, 30% go into cluster $c_2$ and 30% of the points create a new cluster $c_3$. The transition is done smoothly over 10 time points. This simulation is meant to represent some of the properties typically seen in real data, such as a low number of transitions [5]. Most customers will not use the e-commerce website more than a couple of times, and therefore the number of transactions will be small, and so will the number of transitions between states. Once a customer is in a state, they will most probably stay there unless they use the website again.

We then applied Algorithm 1 to this synthetic data. After normalising all data points, we computed their velocities from the normalised data and used them to build the augmented data set. We computed predictions on the data points in cluster $c_1$ only and their respective RMSE. We show the results in Figure 3.
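A scenario of this kind could be simulated along the following lines. This is a sketch under several assumptions that the text does not specify: the cluster centres, the standard deviation, the half-and-half initial split between $c_1$ and $c_2$, the linear interpolation of the moving cluster mean, and the seed are all hypothetical.

```python
import random

def simulate(n_points=1000, n_steps=30, t_split=10, t_move=10, sigma=0.3, seed=0):
    """Generate 2-D trajectories in which cluster c1 splits three ways at t_split."""
    rng = random.Random(seed)
    centres = {"c1": (0.0, 0.0), "c2": (4.0, 0.0), "c3": (2.0, 4.0)}  # hypothetical
    data, targets = [], []
    for i in range(n_points):
        start = "c1" if i < n_points // 2 else "c2"  # assumed initial split
        target = start
        if start == "c1":
            u = rng.random()
            # 40% stay in c1, 30% move to c2, 30% form the new cluster c3.
            target = "c1" if u < 0.4 else ("c2" if u < 0.7 else "c3")
        path = []
        for t in range(n_steps):
            # Fraction of the move completed: 0 up to t_split, 1 from t_split + t_move.
            frac = min(max((t - t_split) / t_move, 0.0), 1.0)
            mean = [(1 - frac) * a + frac * b
                    for a, b in zip(centres[start], centres[target])]
            path.append([rng.gauss(m, sigma) for m in mean])
        data.append(path)
        targets.append(target)
    return data, targets

data, targets = simulate()
```

Velocities for the augmented data set would then be obtained by differencing consecutive points of each normalised trajectory.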

[Figure 2 diagram: the cluster split, with branches labelled 30%, 30% and 40%.]

Fig. 2: Synthetic data scenario.

The non-augmented state-space model performs marginally worse than the benchmark model. This might be due to the fact that 40% of all data points stay in cluster $c_1$. The non-augmented transition matrix is almost diagonal since there is only one transition happening for 60% of all data points once in 30 time steps. The benchmark model is like an HMM predictor with a diagonal transition matrix. This explains why both models give near-identical errors. However, it becomes apparent how, by adding the velocity of the data points, we can improve the RMSE.

[Figure 3 plot: ‘Prediction accuracy for the synthetic data’. RMSE against step $\Delta$ for the non-augmented MAP, augmented MAP and benchmark estimators.]

Fig. 3: Prediction error curves for the synthetic data.

B. Real data

The real data is transactional data from an e-commerce website recorded over a 3-month period in 2012. The features are fairly generic, so the experimental results can be understood in the context of other applications. We pre-processed the data by sampling it every 2.4 hours for a total of 1,000 sampling points. The number of customers who have joined the e-commerce website increases over time up to approximately 18,000. However, in order to gather enough data per customer, we only used data from customers who joined the website by the 300th sampling point. At each sampling point, we checked whether the state of the customers already in the system had changed compared to the current state. In other words, we created datasets of transactions made by each customer.

The first-level clustering is done using features that are meaningful for the business. As in the synthetic data example, we use velocities for the second-level clustering. All features have been normalised beforehand. As is standard practice, we split the data into training and test sets. The training set is the set of customers that were active from the time $t_s = 100$ onwards, and it amounts to 2,208 customers. The test set consists of a different set of 4,425 customers who started their activity in the interval $[101, 300]$, and thus excludes the customers from the training set.

1) Only 2 features: As an example, we chose two features describing a transaction: ‘Event Number’ and ‘Average Product Value’, which we abbreviate as EN and APV. The number of events of a customer is not the number of transactions, as customers can log into the website yet not purchase or change anything on their profiles. It is a cumulative feature. The average product value is the average amount of money spent on the e-commerce website, which is a meaningful feature for the business.

[Figure 4 plots. (a) ‘Position clusters with K1=3’: APV against EN, with cluster 1 (3%), cluster 2 (25%) and cluster 3 (72%). (b) ‘Position clusters with K1=4’: APV against EN, with cluster 1 (3%), cluster 2 (17%), cluster 3 (66%) and cluster 4 (14%). (c) ‘Out-of-sample prediction accuracy for K1=3’: RMSE against step $\Delta$ for the non-augmented MAP, augmented MAP and benchmark estimators. (d) RMSE against step $\Delta$ for the same three estimators.]

Fig. 4: Results of level-1 clustering and prediction error for 3 and 4 clusters.

We ran the algorithm using $K_1 = 3$ and $K_1 = 4$, keeping $K_2 = 2$. The results are displayed in Figure 4. The normalised aggregated features are shown in Figures 4a and 4b. Some of the trajectories formed by customers are apparent on visual inspection. We have also shown in colour the assignment of each point to a cluster, found as the maximum-likelihood estimate $z_{ML}$ for that point. We observe that there is a small distinct group of customers representing 3% of all transactions with a high average product value (cluster 1). The optimal partition for the remaining transactions is not obvious from visual inspection. Most transactions are densely distributed in the negative range: this is due to the fact that features are normalised and that the number of transactions per customer is very low on average. We do not propose an optimal clustering in this paper, and it is also beyond the scope of this work to find the optimal number of clusters.

The prediction accuracy for $K_1 = 3$ is best for the benchmark predictor $\hat{z}_B$. This could mean that the clusters are not fine enough to capture the different dynamics in the data, and thus there is a very high probability of staying in the same cluster over time. The results change when we introduce an additional cluster. By introducing the new cluster, we manage to enclose in a single group the transactions with high EN (cluster 4 in Figure 4b). This yields an important improvement in the prediction accuracy relative to the benchmark for a step size $\Delta > 12$ in the augmented case and $\Delta > 20$ in the non-augmented case. We can also see how augmenting the state-space improves the prediction accuracy. This gives substance to our intuition that the velocity of a transaction bears meaningful information about its value in the future.

We now show results for the case of $K_1 = 4$ clusters. In order to understand the business dynamics, it is useful to look at the centres of the clusters and give them a meaning:

1) Customer has a medium EN and a very high APV
2) Customer has a medium EN and a medium APV
3) Customer has a very low EN and a low APV
4) Customer has a very high EN and a low APV

[Figure 5 plot: ‘Cluster proportions over time’. Percentage of customers against number of transactions for clusters 1 to 4.]

Fig. 5: Cluster proportions over time.

From these descriptions, one should expect that a customer will most likely start from cluster 3, a cluster that captures transactions with low Event Number. The customer could then end up in clusters 1, 2 or 4. Figure 5 shows the proportion of customers in each cluster, stratified by number of transactions. As expected, the proportion of customers in cluster 3 is the highest at the beginning but decreases over time. The proportion of customers in cluster 4 has the largest increase, while the one in cluster 1 increases but remains very small over time. If we now look at the transition matrix $P$ raised to a power of $\Delta = 20$,

$$P^{20} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0.5511 & 0.0176 & 0.4313 \\ 0 & 0.1994 & 0.7365 & 0.0641 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \tag{4}$$

it becomes obvious that states 1 and 4 are both absorbing. In both states, the number of events is medium to high, which is not surprising since customers that have done more than 20 transactions are considered very active. However, the two states differ in the average purchased product value. While state 1 is highly desirable, state 4 is not. The transition probabilities from states 2 and 3 to state 1 are 0 after 20 steps, which means that the initial distribution will determine the proportion of customers in this high-value state.
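These observations can be checked directly on the rounded matrix in (4). The short sketch below uses only the printed values; squaring the matrix to look 40 steps ahead relies on $P^{40} = (P^{20})^2$, and the helper name `matmul` is ours.

```python
# Rounded entries of P^20 as printed in equation (4).
P20 = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.5511, 0.0176, 0.4313],
    [0.0, 0.1994, 0.7365, 0.0641],
    [0.0, 0.0, 0.0, 1.0],
]

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# Each row is a probability distribution over the state after 20 steps.
row_sums = [sum(row) for row in P20]

# A state whose row puts all mass on itself cannot be left.
absorbing = [i + 1 for i, row in enumerate(P20) if row[i] == 1.0]

# Forty steps ahead: states 2 and 3 still never reach state 1,
# while the mass absorbed by state 4 keeps growing.
P40 = matmul(P20, P20)
```

The first column of `P40` stays at zero for states 2 and 3, which is the numerical counterpart of the statement that only the initial distribution populates the high-value state 1.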

2) All features: We also used all the features available to us: ‘Event Number’, ‘Average reward value’, ‘Number products’, ‘Average product value’ and ‘Average reward discount’. Customers on this e-commerce website buy products from the website which eventually return rewards. Sometimes the rewards have discounts, which means that a customer gets the same reward out of a smaller expense. We ran our algorithm using $K_1 = 3$ and $K_2 = 2$ and report the results here. The business meaning of each state is shown in Table I. The most desirable state is state 3. State 2 represents a negative state for the company, as it is a group of customers that spends little and gets high rewards. The transition matrix is nearly diagonal, showing the sparsity in cluster changes in the data. In this case, the underlying Markov chain admits an equilibrium distribution with proportions 94%, 6% and 0% for clusters 1, 2 and 3 respectively. In the long run most customers will be in the cluster with medium amount of expenditure and low average reward value and discount. The high-value state, which represents 3% of all data transactions, will not have any customers in the long run. This observation shows that the business in question needs to take action and change the customer dynamics in order to preserve the high-value state 3. Figure 6 shows the prediction error on the set of all features. Once again, the augmented state-space outperforms all other techniques for step sizes $\Delta > 16$.

[Figure 6 plot: RMSE against step $\Delta$ for the non-augmented MAP, augmented MAP and benchmark estimators.]

Fig. 6: RMSE for 3 clusters, all features.

We can also quantify the improvement achieved by augmenting the state-space by using information theory. In this case we compare the entropy of the future position with and without knowing the current velocity. More precisely, we compare $H(z_{t+1} \mid z_t)$ and $H(z_{t+1} \mid z_t, z'_t)$. For the non-augmented matrix, the entropy is given by

$$H(z_{t+1} \mid z_t) = \sum_j p(z_t = j) \sum_k H(z_{t+1} = k \mid z_t = j) = 0.0393. \tag{5}$$

Similarly, for the augmented matrix, the entropy is given by

$$H(z_{t+1} \mid z_t, z'_t) = \sum_{j,j'} p(z_t = j, z'_t = j') \sum_k H(z_{t+1} = k \mid z_t = j, z'_t = j') = 0.0354. \tag{6}$$

| Cluster business meaning | Proportion of all transactions | ‘Event Number’ | ‘Average reward value’ | ‘Number products’ | ‘Average product value’ | ‘Average reward discount’ |
|---|---|---|---|---|---|---|
| ‘Medium spender with low discount’ | 56% | M | L | M | L | L |
| ‘Low spender with high discount’ | 41% | L | L | L | L | H |
| ‘Very active, high spender with medium discount’ | 3% | H | HH | HH | HH | M |

TABLE I: Qualitative cluster meaning based on the relative values of the features. HH = very high, H = high, M = medium and L = low.

For equation (6), we need to sum over the columns of the transition matrix to get the transition probabilities $p(z_{t+1} = k \mid z_t = j, z'_t = j') = \sum_{k'} p(z_{t+1} = k, z'_{t+1} = k' \mid z_t = j, z'_t = j')$. For both equations (5) and (6), we need to know the previously mentioned stationary distribution of the underlying Markov chain. Both entropy values are very low, and this is due to the sparsity in the transactional data: most objects make few transactions and stay in the same cluster. However, by augmenting the state-space, we have been able to decrease the amount of ‘randomness’ by 10%, from 0.0393 to 0.0354. By doing so, there is also an increase in the predictability of the process.
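The conditional entropy in (5) can be sketched as below, reading each term $H(z_{t+1} = k \mid z_t = j)$ as $-p \log p$ for the corresponding transition probability. The two-state chain is hypothetical (the paper's transition matrices are not reproduced in the text), and the natural logarithm is an assumption since the base is not stated. The last line recomputes the relative reduction from the two published values.

```python
import math

def conditional_entropy(pi, P):
    """H(z_{t+1} | z_t) = -sum_j pi_j sum_k P_jk log P_jk, with 0 log 0 taken as 0."""
    return -sum(pi[j] * sum(p * math.log(p) for p in P[j] if p > 0)
                for j in range(len(P)))

# Hypothetical nearly diagonal two-state chain (not the paper's matrix).
P = [[0.99, 0.01], [0.02, 0.98]]
pi = [2 / 3, 1 / 3]  # stationary distribution of P: pi P = pi
H = conditional_entropy(pi, P)

# Relative reduction in entropy from the values reported in (5) and (6).
reduction = (0.0393 - 0.0354) / 0.0393
```

The ratio evaluates to roughly 0.099, which matches the 10% quoted in the text. For the augmented case the same function applies once the columns belonging to the same level-1 cluster have been summed.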

IV. RESULTS ON THE DECISION-MAKING PROCESS

In this section, we use tools developed in decision theory to find the value of information (VOI) of our augmented model. We would like to measure the value of augmenting our model by observing the velocities. In other words, how much is a marketer willing to pay to know the velocities of the transactions? We can model our decision problem as a probabilistic graphical model [13]. In these models, there are some random variables (also called chance variables) that describe a specific part of the business, like the state of the market. The business can take actions. Each action has a cost associated with it. In terms of marketing, an action could be to offer a customer a voucher or a price discount. The value of each action depends on the future state of the consumer $z_{t+\Delta}$, as well as the action taken at time $t$. In our specific example, we discussed how state 1 was a medium-value state, state 2 was a low-value state and state 3 was a high-value state. Ideally, we would like customers to end up in state 1 or 3, but avoid state 2. In that case, we need to perform an action if the predicted state of a person is state 2.

Figure 7 represents our decision problem as a graphical model, where random variables are represented by circles, action variables by rectangles and the value of the utility function by a diamond shape. In this model, the distribution over the future state $p(z_{t+\Delta})$ is computed using the augmented transition matrix and depends on both the position cluster $z_t$ and the velocity cluster $z'_t$ of an object at time $t$. If the dashed line is present, the action will depend on both the position and the velocity of the object. However, if we remove this dashed line, it will only depend on the position. Our aim is to compute the gain in maximum expected utility of our augmented model (with dashed line) when compared to a model where actions are independent of velocities (without dashed line). This will be the VOI of knowing the velocities.

We only present a simple toy example with 3 utility functions, shown in Table II. We model the action by a variable $D$ that takes a value of 0 if no action is performed, and 1 if an action is performed. The value to the business for different pairs $\{D, z_{t+\Delta}\}$ is given by the utility function $U_i$. We have assumed that taking an action has a cost of 5 for states 1 and 3, since performing an action for those states is not necessary. However, it becomes important for the business to perform an action whenever we would like to prevent a customer from joining state 2, which is considered a ‘bad’ state. We introduce 3 utility functions to show how they can impact the final value. They only differ in the value taken for state 1, as this will be the critical state since it is absorbing.

[Figure 7 diagram: nodes $z'_t$, $z_t$, $z_{t+\Delta}$, Action and Value.]

Fig. 7: Graphical model for the real data example.

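| State $z_{t+\Delta}$ | Action taken $D$ | $U_1$ | $U_2$ | $U_3$ |
|---|---|---|---|---|
| 1 | 0 | -5 | 0 | 5 |
| 1 | 1 | -10 | -5 | 0 |
| 2 | 0 | -10 | -10 | -10 |
| 2 | 1 | 10 | 10 | 10 |
| 3 | 0 | 10 | 10 | 10 |
| 3 | 1 | 5 | 5 | 5 |

TABLE II: Value for each pair of {Future state, Action}.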

The first step is to compute the expected utility $EU$ given the action $D$ and the pair of variables $\{z_t, z'_t\}$, which is given by

$$EU(D(z_t, z'_t) \mid z_t, z'_t) = \sum_k p(z_{t+\Delta} = k \mid z_t, z'_t) \, U(k, D(z_t, z'_t)). \tag{7}$$

In the case of the graphical model in Figure 7 without the dashed line, the action cannot depend on the velocities and therefore

$$EU(D(z_t) \mid z_t, z'_t) = \sum_k p(z_{t+\Delta} = k \mid z_t, z'_t) \, U(k, D(z_t)).$$

The optimal set of actions for each $\{z_t, z'_t\}$ is thus

$$D_{opt}(z_t, z'_t) = \arg\max EU(D(z_t, z'_t) \mid z_t, z'_t), \quad \text{and}$$

$$D_{opt}(z_t) = \arg\max EU(D(z_t) \mid z_t, z'_t).$$

Once we know the optimal actions, we can then compute the maximum expected utility (MEU) by summing over the initial distribution of the variables $\{z_t, z'_t\}$ in the HMM:

$$MEU = \sum_{k,k'} p(z_t = k, z'_t = k') \, EU(D_{opt}(z_t, z'_t) \mid k, k'), \quad \text{and}$$

$$MEU = \sum_{k,k'} p(z_t = k, z'_t = k') \, EU(D_{opt}(z_t) \mid k, k').$$
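The comparison of the two MEU values can be sketched as follows, using utility $U_1$ from Table II. Everything else is hypothetical: the prior over $(z_t, z'_t)$ and the predictive distributions are invented numbers, and choosing the position-only action by maximising the prior-weighted expected utility within each position cluster is our reading of the second $D_{opt}$ definition.

```python
# Utility U1 from Table II, keyed by (future state, action).
U1 = {(1, 0): -5, (1, 1): -10, (2, 0): -10, (2, 1): 10, (3, 0): 10, (3, 1): 5}

def expected_utility(pred, action, U):
    """EU = sum_k p(z_{t+delta} = k | z_t, z'_t) U(k, action)."""
    return sum(p * U[(k + 1, action)] for k, p in enumerate(pred))

def meu_with_velocity(prior, pred, U):
    """The action may depend on both the position and the velocity cluster."""
    return sum(prior[s] * max(expected_utility(pred[s], d, U) for d in (0, 1))
               for s in prior)

def meu_without_velocity(prior, pred, U):
    """One action per position cluster, shared by all its velocity subclusters."""
    total = 0.0
    for k in {s[0] for s in prior}:
        states = [s for s in prior if s[0] == k]
        total += max(sum(prior[s] * expected_utility(pred[s], d, U) for s in states)
                     for d in (0, 1))
    return total

# Hypothetical case: position cluster 2 with two velocity subclusters that
# lead to different futures (mostly state 2 versus mostly state 3).
prior = {(2, 1): 0.5, (2, 2): 0.5}
pred = {(2, 1): [0.1, 0.8, 0.1], (2, 2): [0.1, 0.1, 0.8]}

voi = meu_with_velocity(prior, pred, U1) - meu_without_velocity(prior, pred, U1)
```

With these invented numbers the VOI is positive because the two velocity subclusters call for different actions. When the best action is the same for every subcluster of a position cluster, the two MEU values coincide and the VOI is zero.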

[Figure 8 plot: ‘MEU for different utility functions’. MEU against step $\Delta$ for $U_1$, $U_2$ and $U_3$.]

Fig. 8: Maximum expected utility (MEU) for the different utility functions in Table II.

Figure 8 shows the MEU for the different utility functions $U_i$. The choice of $U_i$ affects the final maximum expected utility: $U_3$ is the only utility to achieve positive gains over the entire range of prediction steps. When we compared the models with and without the dashed line, we did not see any gain in utility. This means that increasing the complexity of the model by augmenting the state-space does not increase the business value of the model. From this section, we can conclude that even though our augmented model achieved lower error and lower entropy, there is no benefit to the business of using this model instead of the non-augmented one. This result is not very surprising, since both the non-augmented and augmented transition matrices are nearly diagonal and block-diagonal respectively. However, we do not exclude the possibility that for more complex utilities, and/or more complex augmentations, the value of the augmentation is non-zero.

V. CONCLUSION

We have built a model for customer behaviour prediction that relies on performing a variant of hierarchical clustering on the data followed by modelling the dynamics with an HMM. In this paper, we use the velocities of the transactions in the second-level clustering. After applying the method to the real data, we learnt about some of the business dynamics at play. For example, we saw how sometimes there can be different long-run absorbing states. From a marketer's point of view, it might be desirable to increase the long-term proportions in certain clusters by taking action and therefore altering some of the transition probabilities of the model. We demonstrated that the velocity of an object can improve the prediction accuracy, but the clustering stage can affect the results dramatically, as we saw with the real data example.

We then used a graphical model with an associated utility function to quantify the improvement to a business of increasing the complexity of the predictive model. We concluded that there was no benefit, at least for some utility functions. In this work, we did not attempt to find the optimal clustering of the data for prediction and utility maximisation purposes. A better clustering of the data might have achieved a larger utility improvement. We leave this for future work. Moreover, we worked with a Gaussian state-space model to keep the model tractable, which might not be an optimal choice for this type of data, but this could be extended to a non-Gaussian case.

ACKNOWLEDGMENTS

We would like to thank FeatureSpace (who own all rights) for their collaboration and financial support for this work. The second author gratefully acknowledges the support of FCT doctoral grant SFRH/BD/68331/2010.

REFERENCES

[1] C. Homburg, V. Steiner, and D. Totzek, “Managing dynamics in acustomer portfolio,” Journal of Marketing, vol. 73, no. 5, pp. 70–89,2009.

[2] G. Punj and D. W. Stewart, “Cluster analysis in marketing research:review and suggestions for application,” Journal of marketing research,pp. 134–148, 1983.

[3] V. Kumar, S. Sriram, A. Luo, and P. Chintagunta, “Assessing the effectof marketing investments in a business marketing context,” MarketingScience, vol. 30, no. 5, pp. 924–940, 2011.

[4] O. Netzer, J. Lattin, and V. Srinivasan, “A hidden markov model ofcustomer relationship dynamics,” Marketing Science, vol. 27, no. 2, pp.185–204, 2008.

[5] T. Jiang and A. Tuzhilin, “Segmenting customers from population toindividuals: Does 1-to-1 keep your customers forever?” Knowledge andData Engineering, IEEE Transactions on, vol. 18, no. 10, pp. 1297–1311, 2006.

[6] M. Faraone, M. Gorgoglione, and C. Palmisano, “Contextual seg-mentation: using context to improve behavior predictive models in e-commerce,” in Data Mining Workshops (ICDMW), 2010 IEEE Interna-tional Conference on. IEEE, 2010, pp. 1053–1060.

[7] C. Palmisano, A. Tuzhilin, and M. Gorgoglione, “Using context toimprove predictive modeling of customers in personalization applica-tions,” Knowledge and Data Engineering, IEEE Transactions on, vol. 20,no. 11, pp. 1535–1549, 2008.

[8] I. V. Cadez, P. Smyth, E. Ip, and H. Mannila, “Predictive profiles fortransaction data using finite mixture models,” Tech. Rep., 2001.

[9] P. E. Pfeifer and R. L. Carraway, “Modeling customer relationships asmarkov chains,” Journal of interactive marketing, vol. 14, no. 2, pp.43–55, 2000.

[10] G. Shani, D. Heckerman, R. I. Brafman, and C. Boutilier, “An mdp-based recommender system,” in Journal of Machine Learning Research.Morgan Kaufmann, 2002, pp. 453–460.

[11] C. Li and G. Biswas, “Clustering sequence data using hidden markovmodel representation,” in Proceedings of SPIE- The International Societyfor Optical Engineering, vol. 3695. Citeseer, 1999, pp. 14–21.

[12] K. Xu, M. Kliger, and A. Hero, “Evolutionary spectral clustering withadaptive forgetting factor,” in Acoustics Speech and Signal Processing(ICASSP), 2010 IEEE International Conference on. IEEE, 2010, pp.2174–2177.

[13] D. Koller and N. Friedman, Probabilistic graphical models: principlesand techniques. MIT press, 2009.
