www.elsevier.com/locate/dsw
Decision Support Systems
Incorporating sequential information into traditional classification
models by using an element/position-sensitive SAM
Anita Prinzie *, Dirk Van den Poel
Department of Marketing, Faculty of Economics and Business Administration, Ghent University, Hoveniersberg 24, Ghent, Belgium
Available online 22 April 2005
Abstract
The inability to capture sequential patterns is a typical drawback of predictive classification methods. This caveat might be
overcome by modeling sequential independent variables by sequence-analysis methods. Combining classification methods with
sequence-analysis methods enables classification models to incorporate non-time varying as well as sequential independent
variables. In this paper, we precede a classification model by an element/position-sensitive Sequence-Alignment Method (SAM)
followed by the asymmetric, disjoint Taylor–Butina clustering algorithm with the aim of distinguishing clusters with respect to the
sequential dimension. We illustrate this procedure on a customer-attrition model as a decision-support system for customer
retention of an International Financial-Services Provider (IFSP). The binary customer-churn classification model following the
new approach significantly outperforms an attrition model which incorporates the sequential information directly into the
classification method.
© 2005 Elsevier B.V. All rights reserved.
Keywords: Sequence analysis; Binary classification methods; Sequence-alignment method; Asymmetric clustering; Customer-relationship
management; Churn analysis
1. Introduction
In the past, traditional classification models like
logistic regression have been applied successfully to
the prediction of a dependent variable by a series of
non-time varying independent variables [5]. In case
there are time-varying independent variables, these are
0167-9236/$ - see front matter © 2005 Elsevier B.V. All rights reserved.
doi:10.1016/j.dss.2005.02.004
* Corresponding author. Tel.: +32 9 264 35 20; fax: +32 9 264 42 79.
E-mail addresses: [email protected] (A. Prinzie), [email protected] (D. Van den Poel).
typically included in the model by transforming them
into non-time varying variables [3]. Unfortunately, this
practice results in information loss as the sequential
patterns of the data are neglected. Hence, although
traditional classification models are highly valid and
robust for modeling non-time varying data, they are
unable to capture sequential patterns in data. This
caveat might be overcome by modeling time-varying
independent variables by sequence-analysis methods.
Unlike traditional classification methods, sequence-
analysis methods were designed for modeling sequen-
tial information. These methods take sequences of data,
A. Prinzie, D. Van den Poel / Decision Support Systems 42 (2006) 508–526
i.e., ordered arrays, as their input rather than individual
data points. Although rarely applied in marketing, sequence analysis is commonly used in disciplines like
archeology [31], biology [37], computer sciences
[38], economics [18], history [1], linguistics [23],
psychology [7] and sociology [2]. Sequence-analysis
methods can be categorized depending on whether the
sequences are treated as a whole or step by step [1].
Step-by-step methods examine relationships among
elements or states in the sequences. Time-series
methods are used to study the dependence of an
interval-measured sequence on its own past. When
the variable of interest is categorical, Markov methods
are appropriate. The latter methods calculate transition
probabilities based on the transition between two
events [36]. Transitions from one prior category can
be modeled by using event-history methods, also
known as duration methods, hazard methods, failure
analysis, and reliability analysis. The central research
question studied is time until transition. Whole-
sequence methods use the entire sequence as unit of
analysis to discover similarities between sequences
resulting in typologies. The central issue addressed is
whether there are patterns in the sequences, either over
the whole sequences or within parts of them. There are
two approaches to this pattern question. In the algebraic
approach, each sequence is reduced to some simplest
form and sequences with similar 'simplest forms' are gathered under one heading. In the metric approach, a
similarity measure between the sequences is calculated
which is then subsequently processed by clustering,
scaling and other categorization methods to extract
typical sequential patterns. Methods like optimal
matching or optimal alignment are commonly applied
within this metric approach. In an intermediate
situation, local similarities are analyzed to find out
the role of key subsequences embedded in longer
sequences [48].
Given that traditional classification models are
designed for modeling non-time varying independent
variables and that sequence-analysis methods are
well-suited to model dynamic information, it follows
that a combination of both methods unites the best of
both worlds and allows for building predictive
classification models incorporating non-time varying
as well as time-varying independent variables. One
possible approach, amongst others, is to precede
the traditional classification method by a sequence-
analysis method to model the dynamic exogenous
variables (cf. serial instead of parallel combination of
classifiers, [25]). In this paper, we precede a logistic
regression, as a traditional classification method, by
a Sequence-Alignment Method (i.e., SAM), as a
whole sequence method using the metric approach.
The SAM analysis is used to model a time-varying
independent variable. We identify how similar the
customers are on the dynamic independent variable
by calculating a similarity measure between each
pair of customers, the SAM distances. These
distances are further processed by a clustering
algorithm to produce groups of customers which
are relatively homogeneous with respect to the
dynamic independent variable. As we cluster on a
dimension influencing the dependent variable, the
clusters are not only homogeneous in terms of the
time-varying independent variable, but should also
be homogeneous with respect to the dependent
variable. This way, we make the implicit link
between clustering and classification explicit. After
all, clustering is in theory a special problem of
classification associated with an equivalence relation
defined over a set [35]. Including the cluster-
membership information as dummies in the classi-
fication model not only allows for modeling the
dynamic independent variable in an appropriate way,
it should even improve the predictive performance.
In this paper, we illustrate the new procedure,
which combines a sequence-analysis method with a
traditional classification method, by estimating a
customer-attrition model for a large International
Financial-Services Provider (from now on referred
to as IFSP). This attrition model feeds the managerial
decision process and helps refine the retention
strategy by elucidating the profile of customers with
a high defection risk. A traditional logistic regression
is applied to predict whether a customer will churn or
not. This logistic regression is preceded by an
element- and position-sensitive Sequence-Alignment
Method to incorporate a time-varying covariate. We
will calculate the distance between each customer on a
sequential dimension, i.e., the evolution in relative
account-balance total of the customer at the IFSP, and
use these distances as input for a subsequent cluster
analysis. The cluster-membership information is
incorporated in the logistic regression by dummies.
We hypothesize that the logistic-regression model
with the time-varying independent variable included
as cluster dummy variables will outperform the
traditional logistic regression where the same sequen-
tial dimension is incorporated by creating as many
non-time varying independent variables as there are
time points on which the dimension is measured.
The remainder of this paper is structured as
follows. In Section 2 we describe the different
methods used. We discuss the basic principles of
SAM and underline how the cost allocation influences
the mathematical features of the resulting SAM
distance measures determining whether a symmetric
or asymmetric clustering algorithm is appropriate. We
outline how a modification of Taylor’s cluster-
sampling algorithm [41] and Butina’s cluster algo-
rithm based on exclusion spheres [6] allows clustering
on asymmetric SAM distances. In Section 3 we
outline how the new procedure proposed in this paper
is applied within a financial-services context to
improve prediction of churn behavior. Section 4
investigates whether the results confirm our hypoth-
esis on improved predictive performance. We con-
clude with a discussion of the main findings and
introduce some avenues for further research.
2. Methodology
2.1. Sequence-alignment method (SAM)
The Sequence-Alignment Method (SAM) was
developed in computer sciences (text editing and
voice recognition) and molecular biology (protein and
nucleic acid analysis). A common application in
computer sciences is string correction or string editing
[46]. The main use of sequence comparison in
molecular biology is to detect the homology between
macromolecules. If the distance between two macro-
molecules is small enough, one may conclude that
they have a common evolutionary ancestor. Applica-
tions of sequence alignment in molecular biology use
comparatively simple alphabets (the four nucleotide
molecules or the twenty amino acids) but tend to have
very long sequences [48]. Conversely, in marketing
applications, sequences will mostly be shorter but
with a very large alphabet. Besides SAM applications
in computer sciences and molecular biology, there are
applications in social science [2], transportation
research [21] and speech processing [33]. Recently,
SAM has been applied in marketing to discover
visiting patterns of websites [17].
Sankoff and Kruskall [39], Waterman [47] and
Gribskov and Devereux [13] are good references on
Sequence-Alignment Method. SAM handles variable-
length sequences and incorporates sequential infor-
mation, i.e., the order in which the elements appear in
a sequence, into its distance measure (unlike conven-
tional position-based distance measures, like Euclidean, Minkowski, city block and Hamming distances). The original sequence-alignment method
can be summarized as follows. Suppose we compare
sequence a, called the source, having i elements, a = [a1, . . ., ai], with sequence b, i.e., the target, having j elements, b = [b1, . . ., bj]. In general, the distance or
similarity between sequence a and b is expressed by
the number of operations (i.e., total amount of effort)
necessary to convert sequence a into b. The SAM
distance is represented by a score. The higher the
score, the more effort it takes to equalize the
sequences and the less similar they are. The elemen-
tary operations are insertions, deletions and substitu-
tions or replacements. Deletion and insertion
operations, often referred to as indel, are applied to
elements of the source (first) sequence in order to
change the source into the target (second) sequence.
Substitution operations are equivalent to a deletion plus an insertion.
Some advanced research involves other operations
like swaps or transpositions (i.e., the interchange of
adjacent elements in the sequence), compression (of
two or more elements into one element) and expan-
sion (of one element into two or more elements).
Every elementary operation is given a weight (i.e.,
cost) greater than or equal to zero. It is common
practice to make assumptions on the weights in order
to achieve the metric axioms (nonnegative property,
zero property, triangle inequality and symmetry) of
mathematical distance (e.g., equal weights for dele-
tions and insertions to preserve the symmetry axiom)
[39]. Weights may be tailored to reflect the impor-
tance of operations, the similarity of particular
elements (cf. element sensitive), the position of
elements in the sequence (cf. position sensitive), or
the number/type of neighboring elements or gaps [48].
A different weight for insertion and deletion as well as
position-sensitive weights result in SAM distances
which are no longer symmetric: cf. |ab| ≠ |ba|. The
latter has its implications on the clustering algorithm
that could be used (cf. infra). Different meanings can
be given to the word 'distance' in sequence compar-
ison. In this paper, we express the relatedness
(similarity or distance) between customers on their
evolution in relative account-balance total at the IFSP
by calculating the weighted-Levenshtein [28] dis-
tance between each possible pair of customers (i.e.,
pairwise-sequence analysis). The weighted-Leven-
shtein distance defines dissimilarity as the smallest
sum of operation-weighting values required to change
sequence a into b. This way a distance matrix is
constructed and consecutively, used as input for a
cluster analysis.
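To make the weighted-Levenshtein computation concrete, the sketch below implements it with dynamic programming, assuming element-specific insertion and deletion costs and charging a substitution as a deletion plus an insertion. The cost tables are illustrative, not the operational costs of the application, and with unequal insertion and deletion weights the resulting distance is asymmetric, as discussed above.

```python
def weighted_levenshtein(a, b, ins_cost, del_cost):
    """Smallest total operation cost to change sequence a (source) into
    sequence b (target). ins_cost / del_cost map each element to its
    insertion / deletion weight; a substitution is charged as a deletion
    of the source element plus an insertion of the target element."""
    n, m = len(a), len(b)
    # d[i][j] = minimal cost of converting a[:i] into b[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + del_cost[a[i - 1]]
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins_cost[b[j - 1]]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] \
                else del_cost[a[i - 1]] + ins_cost[b[j - 1]]
            d[i][j] = min(
                d[i - 1][j] + del_cost[a[i - 1]],   # delete from source
                d[i][j - 1] + ins_cost[b[j - 1]],   # insert target element
                d[i - 1][j - 1] + sub,              # match or substitute
            )
    return d[n][m]
```

For instance, with insertions cheaper than deletions, converting '01' into '0' costs more than converting '0' into '01', so the distance matrix fed to the clustering step need not be symmetric.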
2.2. Cluster analysis of weighted-Levenshtein SAM
distances
We cluster the customers on the weighted-Lev-
enshtein SAM distances, expressing how dissimilar
they are on the sequential dimension. The cluster-
membership information resulting from this cluster
analysis is translated into cluster dummies, which
represent the sequential dimension in a subsequent
classification model. We hypothesize that a classi-
fication model including cluster indicators (operation-
alized as dummy variables) based on SAM distances
will outperform a similar model where the same
sequential dimension is incorporated by as many non-
time varying independent variables as time points on
which the dimension is measured. After all, these
dummies are good indicators of what type of
behavior the customer exhibits towards the sequential
dimension (i.e., time-varying independent variable), as
well as towards the dependent variable (cf. explicit
typology of customers on time-varying covariate
results in implicit typology on the dependent variable).
A distance matrix holding the pairwise weighted-
Levenshtein distances between customer sequences, is
used as a distance measure for clustering. As
discussed earlier, depending on how the weights
(i.e., costs) for SAM are set, the distances in the
matrix are symmetric or asymmetric. Most common
clustering methods employ symmetric, hierarchical
algorithms such as Ward's, single-, complete-, average-, or centroid linkage [14,19,24], non-hierarchical
algorithms such as Jarvis–Patrick [20], or partitional
algorithms such as k-means or hill-climbing. Such
methods require symmetric measures, e.g. Tanimoto,
Euclidean, Hamann or Ochiai, as their inputs. One
drawback of these methods is that they cannot capture
important asymmetric relationships. Nevertheless,
there exist many practical scenarios where the under-
lying relation is asymmetric. Asymmetric relation-
ships are common in transportation research (cf.
different distance between two cities A and B
(|AB| ≠ |BA|) due to different routes (e.g. the Vehicle
Routing Problem [42])), in text mining (cf. word
associations, e.g. most people will relate 'data' to
'mining' more strongly than conversely [43]), in
sociometric ratings (cf. a person i could express a
higher like or dislike rating to person j than vice
versa), in chemoinformatics (cf. compound A may
fit into compound B while the reverse is not
necessarily true) and to a lesser extent in marketing
research (cf. brand-switching counts [9], 'first choice'–'second choice' connections [44] and the
asymmetric price effects between competing brands
[40]). A good overview of models for asymmetric
proximities is given by Zielman and Heiser [49].
Although there are a lot of research settings in-
volving asymmetric proximities, only a few clus-
tering algorithms can handle asymmetric data. Most
of these are based on a nearest-neighbor table
(NNT). Krishna and Krishnapuram [26] provide a
clustering algorithm for asymmetric data (i.e.,
CAARD algorithm which closely resembles the
Leader Clustering Algorithm (LCA) [15]) with
applications to text mining. Ozawa [35] defines a
hierarchical asymmetric clustering algorithm called
Classic, and applies it on the detection of gestalt
clusters. His algorithm is based on an iteratively
defined nested sequence of NNRs (i.e., Nearest
Neighbors Relations). MacCuish et al. [30] con-
verted the Taylor–Butina exclusion region grouping-
algorithms [6,41] into a real clustering algorithm,
which can be used for both disjoint or non-disjoint
(overlapping), either symmetric or asymmetric clus-
tering. Although this algorithm is designed for
clustering compounds (i.e., the chemoinformatics
field with applications like compound acquisition
and lead optimization in high-throughput screening),
in this paper it is employed to cluster customers on
marketing-related information. More specifically, we
apply the asymmetric, disjoint version of the
algorithm to the asymmetric SAM distances obtained
earlier. The asymmetric disjoint Taylor–Butina algo-
rithm is a five-step procedure (Fig. 1) [29]:
1. Create the threshold nearest-neighbor table using
similarities in both directions.
2. Find true singletons, i.e., data points (in our case
customers) with an empty nearest-neighbor list.
Those elements do not fall into any cluster.
3. Find the data point with the largest nearest-
neighbor list. This point tends to be in the center
of the k-th (cf. k clusters) most densely occupied
region of the data space. The data point together
with all its neighbors within its exclusion region,
constitute a cluster. The data point itself becomes
the representative data point for the cluster.
Remove all elements in the cluster from all
nearest-neighbor lists. This process can be seen
as putting an 'exclusion sphere' around the newly
formed cluster [6].
4. Repeat step 3 until no data points exist with a non-
empty nearest-neighbor list.
5. Assign remaining data points, i.e., false singletons, to the group that contains their most similar nearest neighbor, but identify them as 'false singletons'. These elements have neighbors at the given similarity-threshold criterion (e.g. all elements with a dissimilarity measure smaller than 0.3 are deemed similar), but a 'stronger' cluster representative, i.e., one with more neighbors in the list, excludes those neighbors (cf. cluster criterion).

Fig. 1. Asymmetric Taylor–Butina schematic (threshold = 0.15; exclusion-region diameter set by the threshold value; legend: representative compound, false singleton, true singleton, dissimilarity in both directions) [29].
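The five steps can be sketched directly on a precomputed (possibly asymmetric) distance matrix. The function below is our simplified reading of the procedure, not the Mesa Suite implementation; all names and data structures are ours.

```python
def taylor_butina(dist, threshold):
    """Asymmetric, disjoint Taylor-Butina clustering sketch.
    dist[i][j] is the (possibly asymmetric) distance from point i to j."""
    n = len(dist)
    # Step 1: threshold nearest-neighbour table, using both directions.
    nn = {i: {j for j in range(n) if j != i
              and (dist[i][j] <= threshold or dist[j][i] <= threshold)}
          for i in range(n)}
    clusters, assigned = [], set()
    # Step 2: true singletons have an empty neighbour list; no cluster.
    true_singletons = [i for i in range(n) if not nn[i]]
    assigned.update(true_singletons)
    # Steps 3-4: repeatedly take the point with the largest remaining
    # neighbour list as representative; it and its neighbours form a
    # cluster, and the 'exclusion sphere' removes them from all lists.
    while True:
        candidates = [i for i in range(n) if i not in assigned and nn[i]]
        if not candidates:
            break
        rep = max(candidates, key=lambda i: len(nn[i]))
        members = {rep} | nn[rep]
        clusters.append((rep, members))
        assigned |= members
        for i in range(n):
            nn[i] -= members
    # Step 5: remaining points are false singletons; assign each to the
    # cluster whose representative is most similar.
    false_singletons = [i for i in range(n) if i not in assigned]
    for i in false_singletons:
        rep, members = min(clusters, key=lambda c: dist[i][c[0]])
        members.add(i)
    return clusters, true_singletons, false_singletons
```

Note how the exclusion sphere makes the clustering disjoint: once a point joins a cluster, it is struck from every remaining neighbour list.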
2.3. Incorporating cluster membership information in
the classification model
After having applied SAM and cluster analysis
using the asymmetric Taylor–Butina algorithm, we
build a classification model to predict a binary target
variable, in our application 'churn'. As a classification method, we use binary logistic regression. We build
two churn models using the logistic-regression method.
One model includes the sequential dimension as cluster
dummies resulting from clustering the SAM distances
(from now on referred to as LogSeq). The second
model incorporates the sequential dimension in a
traditional way by as many non-time varying regressors
as there are time points, on which the dimension is
measured (from now on referred to as LogNonseq).
Both models are estimated on a training sample and
subsequently validated on a hold-out sample, contain-
ing customers not belonging to the training sample. We
compare the predictive performance of the LogSeq
model with that of the LogNonseq model.
In order to test the performance of the LogSeq
model on the hold-out sample, we need to define a
procedure to assign the hold-out customers to the
clusters identified on the training sample. We define
five sequences per cluster in the training sample as
representatives. By default the grouping module of the
Mesa Suite software package, which implements the
Taylor–Butina clustering algorithm [29], returns only
one representative for each identified cluster. We
prefer to have more than one representative customer
for each cluster in order to improve the quality of
allocation of customers in the hold-out sample to the
clusters identified on the training sample. Therefore,
once we have found a good k-cluster solution on
the training sample, we apply the Taylor–Butina
algorithm to the cluster-specific SAM distances in
order to obtain a five-cluster solution delivering five
representatives for the given cluster. This way, each
cluster has five representatives. Next, we calculate the
SAM distances of the hold-out sequences towards
these groups of five cluster representatives and vice
versa. Each hold-out sequence is assigned to the
cluster to which it has the smallest average distance
(i.e., smallest average distance towards five cluster
representatives). This cluster membership information
is transformed into cluster dummy variables.
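A minimal sketch of this assignment rule, assuming the pairwise SAM distance is available as a function and that the distances in the two directions are simply averaged over a cluster's five representatives (the exact averaging scheme is our assumption):

```python
def assign_to_cluster(seq, cluster_reps, distance):
    """Assign a hold-out sequence to the cluster with the smallest average
    distance over that cluster's representatives. Distances are taken in
    both directions since SAM distances may be asymmetric.
    cluster_reps: dict mapping cluster id -> list of representative
    sequences; distance: a pairwise (SAM) distance function."""
    def avg_dist(reps):
        return sum(distance(seq, r) + distance(r, seq)
                   for r in reps) / (2 * len(reps))
    return min(cluster_reps, key=lambda cid: avg_dist(cluster_reps[cid]))
```

The returned cluster id is then recoded into the dummy variables that enter the LogSeq model.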
The predictive performance of the classification
models (in this case: logistic regression) is assessed by
the Area Under the Receiver Operating Characteristic curve (AUC).
Unlike the Percentage Correctly Classified (i.e., PCC),
this performance measure is independent of the
chosen cut-off. The Receiver Operating Characteristic curve plots the hit percentage (events predicted
to be events) on the vertical axis versus the percentage
false alarms (non-events predicted to be events) on the
horizontal axis for all possible cut-off values [12]. The
predictive accuracy of the logistic-regression models
is expressed by the area under the ROC curve (AUC).
The AUC statistic ranges from a lower limit of 0.5 for
chance (null-model) performance to an upper limit of
1.0 for perfect performance [12]. We compare the
predictive performance of the LogSeq model with the
predictive accuracy of the LogNonseq model. We
hypothesize that the LogSeq model will outperform
the LogNonseq model.
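The AUC can be computed directly from its probabilistic interpretation: it is the probability that a randomly chosen event (churner) receives a higher score than a randomly chosen non-event, with ties counted as one half. A sketch of this naive pairwise-comparison version, adequate for illustration though quadratic in the sample sizes:

```python
def auc(labels, scores):
    """Area under the ROC curve via the probabilistic interpretation:
    the fraction of (event, non-event) pairs in which the event is
    scored higher, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A model scoring at chance level yields 0.5 under this definition, and a model that ranks every churner above every non-churner yields 1.0, matching the limits quoted above.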
3. A financial-services application
We illustrate our new procedure, which combines
sequence analysis with a traditional classification
method, on a churn-prediction case to support the
customer-retention decision system of a major Finan-
cial-Services Provider (i.e., IFSP). Over the past two
decades, the financial markets have become more
competitive due to the mature nature of the sector on
the one hand and deregulation on the other, resulting in
diminishing profit margins and blurring distinctions
between banks, insurers and brokerage firms (i.e.,
universal banking). Hence, nowadays a small number
of large institutions offering a wider set of services
dominate the financial-services industry. These devel-
opments stimulated bank assurance companies to
implement Customer Relationship Management
(CRM). Under this intensive competitive pressure,
companies realize the importance of retaining their
current customers. The substantive relevance of
attrition modeling comes from the fact that an increase
in retention rate of just one percentage point may result
in substantial profit increases [45]. Successful cus-
tomer retention allows organizations to focus more on
the needs of their existing customers, thereby increas-
ing the managerial insights into these customers’ needs
and hence decreasing the servicing costs. Moreover,
long-term customers buy more [11] and if satisfied,
might provide new referrals through positive word-of-
mouth for the company. These customers tend to be
less sensitive to competitive marketing actions.
Finally, losing customers leads to opportunity costs
due to lost sales and because attracting new customers
is five to six times more expensive than customer
retention [4,8]. For an overview on the literature in
attrition analysis we refer to Van den Poel and
Lariviere [45]. Combining several techniques (just
like in this paper) to achieve improved attrition models
has already been shown to be highly effective [27].
3.1. Customer selection
In this paper, we define a 'churned' customer as
someone who closed all his accounts at the IFSP. We
predict whether customers who are still customers on
December 31st, 2002, will churn on all their accounts
in the next year (i.e., 2003) or not. Several selection
criteria are used to decide which customers to include
into our analysis. Firstly, we only select customers
who became customers from January 1st, 1992
onwards because the information in the data ware-
house before this date is less detailed. Secondly, we
only select customers having at least three distinct
purchase moments before January 2003. This con-
straint is imposed because we wish to focus the
attrition analysis on the more valuable customers.
Given the fact that most customers at the IFSP only
possess one financial service, the selected customers
clearly belong to the more precious clients of the IFSP.
Thirdly, we only keep customers still being customers
on December 31st, 2002 (cf. prediction of churn event
in 2003). This eventually leaves 16,254 customers,
among which 399 (2.45%) closed all
their accounts in 2003. We randomly created a training
and hold-out sample of 8127 customers each, among
which 200 (2.46%) and 199 (2.45%) are churners
respectively. There is no overlap between the training
and hold-out sample.
3.2. Construction of the sequential dimension
As discussed earlier, we want to include a sequential
covariate in a traditional classification model. One such
Table 1
Four elements of the relative account-balance total dimension

Dimension relbalance   Definition
relbalanceJanMar       (account-balance total March 2002 − account-balance total January 2002) / account-balance total January 2002
relbalanceMarJul       (account-balance total July 2002 − account-balance total March 2002) / account-balance total March 2002
relbalanceJulOct       (account-balance total October 2002 − account-balance total July 2002) / account-balance total July 2002
relbalanceOctDec       (account-balance total December 2002 − account-balance total October 2002) / account-balance total October 2002
sequential dimension likely to influence the churn
probability is the customers’ evolution in account-
balance total at the IFSP. We define the latter variable as
a sum of the customers’ total assets (i.e., total
outstanding balance on short- and long-term credit
accounts + total debit on current account) and total
liabilities (i.e., total amount on savings and investment
products + credit on current account + sum of monthly
insurance fees). Although this account-balance total is
a continuous dimension, it is registered in the data
warehouse at discrete moments in time; at the end of the
month for bank accounts and on a yearly basis for
insurance products. We have reliable data for account-
balance total from January 1st, 2002 onwards. We build
sequences of relative difference in account-balance
total (i.e., relbalance) rather than sequences of absolute
account-balance total with the aim to facilitate the
capturing of overall trends in account-balance total.
Each sequence contains four elements (see Table 1):
relbalanceJanMar, relbalanceMarJul, relbalanceJu-
lOct, relbalanceOctDec.
Table 2
Values for the categorical relative account-balance total dimension and element-based costs

Element   Values of relbalance         Deletion/insertion cost of element
0         relbalance = 0               0.2
1         −0.5 < relbalance < 0        0.4
2         −2.5 < relbalance ≤ −0.5     0.6
3         −10 < relbalance ≤ −2.5      0.8
4         relbalance ≤ −10             1
5         0 < relbalance < 0.05        0.4
6         0.05 ≤ relbalance < 0.5      0.6
7         0.5 ≤ relbalance < 2.5       0.8
8         relbalance ≥ 2.5             1
Besides observing the account-balance total at discrete
moments in time, we converted the ratio-scaled relative
account-balance total sequence into a categorical
dimension. The latter is crucial to ensure that the
SAM analysis will find any similarities between the
customers’ sequences. Based on an investigation of the
distribution of the relative account-balance total, nine
categories are distinguished representing approxi-
mately an equal number of customers (cf. to enhance
discovery of similarities between customers). See
Table 2.
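As a sketch, the construction of the categorical sequence from the five balance observations might look as follows; the function names are ours, and the category boundaries follow Table 2.

```python
def relbalance_sequence(balances):
    """balances: account-balance totals at the five observation points
    (January, March, July, October, December 2002). Returns the four
    relative changes of Table 1."""
    return [(b1 - b0) / b0 for b0, b1 in zip(balances, balances[1:])]

def categorize(r):
    """Map a relative change onto the nine elements of Table 2."""
    if r == 0:
        return 0
    if r < 0:
        if r > -0.5:
            return 1
        if r > -2.5:
            return 2
        if r > -10:
            return 3
        return 4
    if r < 0.05:
        return 5
    if r < 0.5:
        return 6
    if r < 2.5:
        return 7
    return 8
```

The categorical four-element sequences produced this way are the input to the element- and position-sensitive SAM described in Section 2.1.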
For the LogSeq model, customers are clustered
using the SAM and Taylor–Butina algorithm on
their evolution in relative account-balance total as
expressed by a sequence of four categorical relative
account-balance total variables. In the LogNonseq
model, the sequential dimension is included by the
four relative account-balance total variables (i.e.,
relbalanceJanMar, relbalanceMarJul, relbalanceJu-
lOct and relbalanceOctDec) measured at the ratio
scale level.
3.3. Non-time varying independent variables
Besides the sequential dimension, several non-time
varying covariates are created (see Table 3). Two
blocks of independent variables can be distinguished.
A first block captures behavioral/transactional-related
information. Some of these variables are related to the
number of accounts open(ed)/closed, while others
consider the account-balance total of the customer over
a certain period of time. We also include some
exogenous variables expressing when and in what
service category the next expiration will occur. Finally,
we tried to incorporate some regressors expressing how
active the customer is: cf. his recency or number of
Table 3
Non-time varying independent variables to predict churn behavior
Name of Variable Description
st_days_until_next_exp Number of days until the first next expiration date (from January 1st, 2003 on) for a
service still in possession at December 31st, 2002. Standardized.
st_days_since_last_exp Number of days between December 31st, 2002 and last expiration date of a service before
January 1st, 2003. Standardized.
st_days_since_last_intentend Number of days between December 31st, 2002 and date (before January 1st, 2003) on
which the customer intentionally closed a service. Remark: Mostly the expiration date is
equal to the closing date.
dummy_cat1_next_exp. . .dummy_cat14_next_exp
Dummies indicating in what service category the next coming expiration date in 2003 is.
nbr_purchevent_bef2003 Number of distinct purchase moments the customer has before January 1st, 2003. The
minimum value is 3 due to customer selection.
st_nbr_serv_opened_bef2003 Number of accounts a customer ever opened before January 1st, 2003. Standardized.
nbr_serv_still_open_bef2003 Number of services the customer still possesses on December 31st, 2002.
nbr_serv_open_cat1_bef2003. . .
nbr_serv_open_cat14_bef2003
Number of accounts still open in service category 1 (respectively 2, . . ., 14) on December
31st, 2002.
dummy_lastcat1_opened_bef2003. . .
dummy_lastcat14_opened_bef2003
Dummies indicating in what service category the customer last opened an account before
January 1st, 2003.
nbr_serv_closed_bef2003 Number of accounts the customer has closed (intentionally closed or due to expiration)
before January 1st, 2003.
nbr_serv_intentend_bef2003 Number of services the customer intentionally closed before expiration date, before
January 1st, 2003.
nbr_serv_closed_cat1_bef2003. . .
nbr_serv_closed_cat14_bef2003
Number of accounts expired or intentionally closed in service category 1 (respectively 2,
. . ., 14) before January 1st, 2003.
dummy_lastcat1_closed_bef2003. . .
dummy_lastcat14_closed_bef2003
Dummies indicating in what service category the customer last closed (intentionally or due
to expiration) a service before January 1st, 2003.
dummy_last_cat1_intentend_bef2003. . .
dummy_last_cat14_intentend_bef2003
Dummies indicating in what service category the customer last intentionally closed an
account before January 1st, 2003.
ratio_closed_open_bef2003 (Number of accounts closed before 2003/number of accounts opened before 2003)*100.
ratio_stillo_open_bef2003 (Number of services still open on December 31st, 2002/number of services ever opened
before January 1st, 2003)*100.
st_recency Time between last purchase moment and December 31st, 2002. Standardized.
st_avg_balance_total3 The average account-balance total of the customer over the last three months of 2002.
(Account-balance total October 2002+account-balance total November 2002+account-
balance total December 2002) /3. Standardized.
st_avg_balance_total6 The average account-balance total of the customer over the last six months of 2002.
Standardized.
st_ratio_tot_3_6 Ratio of the total account-balance total of the customer over the last three months of 2002
and the total account-balance total of the customer over the last six months before 2003.
Standardized.
st_avg_diff_balance_total2 ((account-balance total December 2002�account-balance total November 2002)+
(account-balance total November 2002�account-balance total October 2002)) /2.
Standardized.
st_avg_diff_balance_total3 ((account-balance total December 2002�account-balance total November 2002)+
(account-balance total November 2002�account-balance total October 2002)+
(account-balance total October 2002�account-balance total September 2002)) /3.
Standardized.
st_ratio_curr_avgtotal3 Account-balance total December 2002/average account-balance total calculated over
October, November and December 2002.
st_ratio_curr_avgtotal6 Account-balance total December 2002/average account-balance total calculated over last
six months of 2002.
st_avg_balance_min4to2 (account-balance total November 2002+account-balance total October 2003+
account-balance total September 2002) /3. Standardized.
A. Prinzie, D. Van den Poel / Decision Support Systems 42 (2006) 508–526 515
st_ratio_curr_avgtotalmin4to2 Turnover in December 2002/avg_account-balance total_min4to2. Standardized.
last_avg_reinvest_time Latest average reinvest time before 2003. The average reinvest time indicates how long
the customer waits before investing money that is suddenly available again.
st_last_use_homebanking Number of days since last use of home banking. Deduced from the last logon date or if
missing from the last transaction date or if missing from the first logon date or if missing
from the home banking start date. Standardized.
months_last_titu_nozero Number of months since the customer was last titular of at least one account where the
balance is non-zero.
dummy_contentieux Dummy indicating whether the customer is in contentieux (bad debt) on at least one account
on December 31st, 2002.
lor Length of relationship expressed in years.
age_31_Dec_2002 Age of the customer on December 31st, 2002.
age_becoming_customer Age of the customer when becoming a customer at the IFSP.
gender Gender of the customer.
dummy_cohort_G1. . .dummy_cohort_G5 Dummies indicating whether the customer belongs to cohort group 1, 2, 3, 4, 5 or 6.
Cohort 1: 1900 ≤ birth date ≤ 1924 (i.e., early baby boomers). Cohort 2: 1925 ≤ birth date ≤ 1945 (i.e., GI generation).
Cohort 3: 1946 ≤ birth date ≤ 1955 (i.e., Silent Generation). Cohort 4: 1956 ≤ birth date ≤ 1964 (i.e., late baby boomers).
Cohort 5: 1965 ≤ birth date ≤ 1980 (i.e., X generation). Cohort 6: birth date > 1980 (i.e., Y generation).
Table 3 (continued)
months since being titular of at least one account. The
second block of variables involves non-transactional
data, i.e., socio-demographical information like gen-
der, age and cohort.
4. Results
4.1. An element and position-sensitive SAM analysis
4.1.1. Operation costs
We calculated the distance between each pair of customers, in both directions, on the categorical relative account-balance
total dimension, mainly using the weighted-Levenshtein distance. All sequences have length four, i.e., there are
four elements in each sequence (see Section 3.2). Typically, the operation weights for deletion and insertion are set
to 1. In this paper, we do not follow this approach. In order to ensure that different trajectories result in different
SAM distances, the operation weights for deletion and insertion are not set equal. We define wdel=1 and wins just
below 1; wins=0.9. Similarly, to favor different SAM distances, the weight of reordering is not set like in many
research studies to the sum of the cost of one deletion and insertion. We arbitrarily set the reordering weight to 2.3.
We intentionally define the reordering weight to be larger than the sum of the deletion and insertion weights to
simplify and speed up the search for the optimal alignment (cf. calculating an optimal alignment is equivalent to
finding the longest common subsequence of the two sequences compared when the substitution weight is at least
equal to the sum of the deletion and insertion weights).
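The operation-weight setting above can be illustrated with a minimal sketch (our notation, not the authors' implementation) of the weighted-Levenshtein backbone; the element- and position-sensitive costs of Sections 4.1.2 and 4.1.3 would be charged on top of this basic dynamic program:

```python
# Minimal sketch of a weighted-Levenshtein distance with the operation
# weights of Section 4.1.1 (assumed implementation, not the authors' code).
W_DEL, W_INS, W_SUB = 1.0, 0.9, 2.3   # deletion, insertion, reordering/substitution

def weighted_levenshtein(source, target):
    """Cost of transforming source into target with asymmetric operation weights."""
    m, n = len(source), len(target)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * W_DEL                   # delete leftover source elements
    for j in range(1, n + 1):
        d[0][j] = j * W_INS                   # insert leftover target elements
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if source[i - 1] == target[j - 1] else W_SUB
            d[i][j] = min(d[i - 1][j] + W_DEL,    # deletion
                          d[i][j - 1] + W_INS,    # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[m][n]

# Length-four sequences from the worked example later in this section
# (source 0 1 5 0, target 5 1 2 3): three deletions and three insertions
# around the common element 1 give 3 * 1.0 + 3 * 0.9 = 5.7.
print(round(weighted_levenshtein([0, 1, 5, 0], [5, 1, 2, 3]), 2))  # 5.7
# Direction matters once deletions and insertions are imbalanced:
print(round(weighted_levenshtein([0, 1, 5, 0], [1, 5]), 2))  # 2.0 (two deletions)
print(round(weighted_levenshtein([1, 5], [0, 1, 5, 0]), 2))  # 1.8 (two insertions)
```

Because W_SUB exceeds W_DEL + W_INS, the dynamic program never substitutes and the optimal alignment reduces to a longest-common-subsequence search, which is exactly the simplification the reordering weight of 2.3 is chosen for.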
4.1.2. Element-sensitive costs
We adapted this rather conventional SAM-distance calculation to an element-sensitive SAM analysis to better
reflect the research context. Besides the operation weights we charge an element-based cost depending on the
element in the source being deleted or inserted. As can be seen from the values assigned to the categorical relative
account-balance total variable, (cf. Table 2), some values are more related to each other. For instance, values 4 and
8 are more divergent than values 0 and 1. Therefore, a different extra cost is added depending on the element being
inserted in or deleted from the source. The latter increases the variance in the final SAM distances. We set the
element-based costs for deletion and insertion equal. The element-cost setting does not distinguish between the
categories of relbalance reflecting positive evolutions in relbalance and categories expressing a negative evolution
on relbalance (e.g. equal element costs for element value 1 or 5). See Table 2 (cf. supra).
4.1.3. Absolute position-sensitive costs
Besides incorporating an element-based cost in the calculation of the conventional SAM distances, we convert
the SAM-distance measure into a position-sensitive measure. Normally, SAM describes differences between
sequences only in terms of the difference in sequential order of the elements, by changing the order of the common
elements of the source sequence if it differs from that of the common elements in the target (i.e., reordering), and in
terms of the difference in element composition by deleting the unique elements from the source and inserting the
unique elements of the target sequence in the source. Hence, conventional SAM is not sensitive to the positions at
which elements are inserted, deleted or reordered. After all, in bio-informatics studies, this position-sensitivity is
not useful, as the elements in DNA strings are relatively independent from each other. Consecutive DNA elements
are not likely to affect one another. However, one can think of many other sequences where the elements are
influencing each other. Whenever the elements in the sequence are measured at consecutive time points, we can
assume that previous values influence subsequent values in the sequence. For instance, in activity-sequence
analysis, sequential relationships between activities are a primary concern. Likewise, we assume in our application
that the elements in the sequence consisting of the evolution in relative account-balance total of the customer over
2002, are correlated. Although there are many applications where the elements in the sequence are influencing
each other, there is, to our knowledge, only one research study that incorporates the positional component into
the original SAM distance concept.
Joh et al. [22] developed a position-sensitive sequence-alignment analysis for activity analysis. Position-
sensitivity is taken into account by considering the distance by which the sequential order of the source
element is changed. The reordering distance h is measured as h = |i − j|, where i and j are the positions of the
reordered elements in the source and target sequences. The position-sensitive SAM distance is defined as
follows:
d(a, b) = min[ w_d·D + w_i·I + g·Σ_{r=1}^{R} h_r ]   (1)
where w_d, weight for deletion; D, number of deletions; w_i, weight for insertion; I, number of insertions; g, weight for reordering; R, number of reorderings; h_r, distance of reordering the rth common element.
The authors show that for larger values of the reordering weight, there is a significant difference between
the clustering solution found using the traditional SAM measure and the one resulting from the position-
sensitive SAM analysis. Whereas Joh et al. [22] developed a relative position-sensitive SAM analysis, i.e., a
SAM analysis that is sensitive to the difference in positions of common elements that need to be reordered, we
wish to develop an absolute position-sensitive SAM analysis which does not only consider the positions of
elements reordered, but also the positions at which elements are deleted from the source as well as the
positions in the source at which elements are inserted. We prefer an absolute position-sensitive measure to a
relative measure because we wish to distinguish between operations applied in the beginning of the sequence
and operations performed at the end of the sequence. The rationale for this comes from the fact that we assume
recent evolutions in relative account-balance total to influence the customers’ churn probability more
intensively than the customers’ relative account-balance total in the beginning of the sequence, e.g. the relative
account-balance total of the customer in the period October–December 2002 probably influences the churn
probability more than the relative account-balance total of the customer between January–March 2002.
Consider examples 1 and 2. Let sequence a be the source, sequence b the target. In both examples, we need to
change the order of elements 6 and 1. Whereas in the target element 6 precedes element 1, in the source this is
reverse. Using a relative position-sensitive distance measure like Joh et al. [22], the SAM cost from reordering
for example 1 and 2 would be the same. Supposing we reorder element 6 and the reordering weight is 1, we
obtain a reordering cost for example 1 of: |2 − 1| = 1 and for example 2 of: |4 − 3| = 1. From these examples it
follows that a relative position-sensitive SAM measure does not distinguish between sequences where the
reordering is applied over the same number of positions, but at distinct positions in the source. Whereas a
relative position-sensitive SAM measure keeps the distances symmetric (cf. reordering cost defined using the
difference in position of the element to reorder in source and target), an absolute position-sensitive SAM method
results in asymmetric distances (cf. reordering cost defined using only the position of the reordered element in the
source).

Example 1               Example 2
position  1 2 3 4       position  1 2 3 4
b         6 1 7 7       b         7 7 6 1
a         1 6 7 7       a         7 7 1 6

The absolute position-sensitive reordering cost multiplies the position of the reordered element in the source
with the reordering weight. As mentioned earlier, we also convert the deletion and insertion costs into position-
sensitive costs. The absolute position-sensitive deletion cost considers the position in the source where the element
is deleted. Likewise, the insertion cost is made position-sensitive by incorporating the position of the element in the
target to be inserted in the source. We use the position of the element in the target as a proxy for the position in the
source where the target element is inserted. Next we describe how the final element and absolute position-sensitive
SAM distances are calculated.

4.1.4. Hay's pairwise-sequence alignment algorithm
A major concern in sequence comparison and SAM analysis is the algorithm used to calculate the distances
within a reasonable time window. To address this computational complexity problem [39] we apply an algorithm
by Hay [16] that structures the equalizing process in a fast and easy way. It has not yet been proven to always lead
to an optimal alignment, i.e., the trajectory (the sequence of operations necessary to equalize the source with the
target) resulting in the smallest distance possible. Yet, the algorithm mostly does.

Step 1: Identify the longest common substrings respecting the sequential order of elements. It is well
known that if the substitution/reordering weight is at least the sum of the deletion and insertion weights,
calculating an optimal alignment is equivalent to finding the longest common subsequences of the two
sequences compared. These longest common substrings represent the structural integrity of the two sequences
or the structural skeleton [32]. In case there is more than one possible longest common substring, we opt for
the longest substring of which the absolute sum of the differences in positions between all common elements
in source and target is smallest, as the latter prefers matches between source and target at less remote
positions above matches at more distant source-target positions. In example 3 we compare a customer having
a rather small evolution in relative account-balance total (e.g. 0 1 5 0) with another customer starting with
rather small evolutions in relative customer account-balance total but ending with a decreasing account-
balance total (e.g. 5 1 2 3).
Example of sequence pairs    Longest common substring
5 1 2 3                      5 or 1. We opt for identification of the common element 1.
0 1 5 0
Step 2: Identify elements, which are not included in the substring and appear in the source and the target. Count
one reordering for each such identified element.
Example of sequence pairs    Common elements not appearing in the longest common substring
5 1 2 3                      5
0 1 5 0
At the end of this step, the order of the substituted elements has been changed. In the above example, the order
of element 5 is changed to precede 1 rather than to succeed 1. The total reordering cost is the sum of the product of
the reordering weight with the position of the reordered element in the source.
Cost_reordering = Σ_{r=1}^{R} g·pos_{r_reorel}   (2)
R, number of reorderings; g, reordering weight; pos_{r_reorel}, absolute position of the rth reordered element in the source.
Step 3: Identify elements not included in the substring and which appear in either one of the compared
sequences. Count one deletion operation for each unique element in the source, one insertion operation for each
unique element in the target.
The two zero elements are deleted from the source, elements 2 and 3 are inserted in the source.
Example of sequence pairs Elements unique to the source or target
5 1 2 3
0 1 5 0
The costs for deletions and insertions are, besides position-sensitive, also element-sensitive.
Cost_deletion = Σ_{d=1}^{D} w_d·(c_{d_e} + pos_{d_del})   (3)
w_d, weight for deletion; c_{d_e}, cost for deletion of the dth element with a certain value (i.e., element cost); pos_{d_del}, cost for deletion of the dth element at a given position in the source (i.e., position cost).
Cost_insertion = Σ_{i=1}^{I} w_i·(c_{i_e} + pos_{i_ins})   (4)
w_i, weight for insertion; c_{i_e}, cost for insertion of the ith element with a certain value (i.e., element cost); pos_{i_ins}, cost for insertion of the ith element from a given position in the target (i.e., position cost). Applying this algorithm, using the operation weights, the element-based costs and the position costs, the total SAM distance is calculated as follows:
SAMdist = min[ Σ_{r=1}^{R} g·pos_{r_reorel} + Σ_{d=1}^{D} w_d·(c_{d_e} + pos_{d_del}) + Σ_{i=1}^{I} w_i·(c_{i_e} + pos_{i_ins}) ]   (5)
The total SAM distance for our example is:
SAMdist = (2.3·3) + [(1·(0.2+1)) + (1·(0.2+4))] + [(0.9·(0.6+3)) + (0.9·(0.8+4))] = 19.86   (6)
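The worked example can be reproduced directly from Eq. (5). In this sketch the alignment itself (which elements are reordered, deleted and inserted, and at which positions) is taken as given from Steps 1 to 3 for source 0 1 5 0 and target 5 1 2 3; only the cost aggregation is computed, and the function name is ours:

```python
# Cost aggregation of Eq. (5) for a given set of operations. The element and
# position costs below are the ones appearing in the paper's Eq. (6).
G, W_DEL, W_INS = 2.3, 1.0, 0.9  # reordering, deletion, insertion weights

def sam_distance(reorder_positions, deletions, insertions):
    """deletions/insertions are lists of (element_cost, position_cost) pairs."""
    cost_reorder = sum(G * pos for pos in reorder_positions)        # Eq. (2)
    cost_delete = sum(W_DEL * (c + pos) for c, pos in deletions)    # Eq. (3)
    cost_insert = sum(W_INS * (c + pos) for c, pos in insertions)   # Eq. (4)
    return cost_reorder + cost_delete + cost_insert                 # Eq. (5)

# Element 5 is reordered at source position 3; the two zeros (element cost
# 0.2) are deleted at source positions 1 and 4; elements 2 and 3 (element
# costs 0.6 and 0.8) are inserted from target positions 3 and 4.
d = sam_distance([3], [(0.2, 1), (0.2, 4)], [(0.6, 3), (0.8, 4)])
print(round(d, 2))  # 19.86, matching Eq. (6)
```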
In this paper, we calculate the element- and position-sensitive SAM distance between each customer in the training
sample, in both directions, on the sequential dimension relative account-balance total. These distances are inserted
into a distance matrix used as input for the asymmetric Taylor–Butina clustering method.

4.2. Asymmetric clustering using Taylor–Butina algorithm

The asymmetric SAM distances calculated between the training sequences on evolution in relative account-
balance total are used as input for the asymmetric, disjoint Taylor–Butina algorithm with the aim to distinguish
clusters with respect to the sequential dimension and indirectly also with respect to the dependent variable, i.e.,
churn. Depending on the threshold (in range [0,1]) used, the clusters obtained by the algorithm are more or less
homogeneous. Lower thresholds result in smaller, but more homogeneous clusters, whereas higher threshold
values result in larger, but less homogeneous clusters. In our application, however, our primary objective is not
finding the optimal clustering solution in terms of homogeneity but to keep the number of clusters limited, as the
cluster membership is incorporated by means of dummies into the final classification model. For each cluster, a
certain minimum number of customers is needed to enhance the possibility that the cluster dummies would
significantly influence the dependent variable. As the Taylor–Butina algorithm iteratively identifies the sequence
with the highest number of neighbors, it follows that the first cluster defined is the largest one and that
subsequent clusters have fewer and fewer members. Therefore, it might be that small thresholds are not optimal
with respect to our predictive goal. Our ambition is to distinguish a reasonable number of clusters with 1) a
certain minimum number of customers and 2) the highest possible homogeneity. We experimented with
several levels of the similarity threshold. It seems that we need a rather high threshold to keep the number of
clusters limited. For instance, for a threshold of 0.80, we obtain 130 clusters of which the biggest cluster only
contains 758 customers (i.e., 9.32%) and of which clusters 89 to 130 hold less than five customers. We
investigated for which threshold the first cluster keeps a high enough number of customers. Employing a
threshold of 0.9999 resulted in a 23-cluster solution of which the first cluster counts 3281 customers (i.e.,
40.37%). As 23 clusters is still quite a high number, which would result in 22 cluster dummies in the
classification model, and as this cluster solution still creates some rather sparsely populated clusters (e.g., from
cluster 10 on the clusters have less than 100 members), we decided to group some clusters together. Therefore,
we performed a 'second-order clustering' by using the 23 representative sequences (i.e., centrotypes) as input to
a subsequent clustering exercise, and investigated which cluster representatives are taken together into new
clusters. This resulted in a final five-cluster solution (see Table 4).

Table 4
Final five-cluster solution on training sample
Cluster  Old clusters  Frequency N  Percent N  Frequency churners  Percent churners
1        1             3281         40.37      82                  2.50
2        2             1629         20.04      42                  2.58
3        3             1040         12.79      26                  2.50
4        4–5           1101         13.55      20                  1.82
5        6–23          1076         13.24      30                  2.79

For each of these five clusters, we defined five representatives. We prefer to have more than one
representative per cluster to enhance the quality of cluster allocation of the hold-out sequences. By default, the
grouping module of the Mesa Suite software package version 2.1 returns only one representative per cluster. Hence,
for clusters 1 to 3, we already have one representative for each cluster, for cluster 4 we have two centrotypes and
for cluster 5 we have 18 centrotypes. For clusters 1 to 4 we need to find additional representatives, whereas for
cluster 5 we need to limit the number of representatives to five. Therefore, for each cluster defined, we cluster on
the cluster-specific SAM distances until we get a five-cluster solution providing exactly five representatives for
that cluster.
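The exclusion-sphere mechanism of the Taylor–Butina algorithm can be sketched as follows. This is a minimal assumed implementation, not the Mesa Suite grouping module: here a neighbor is any sequence within a distance threshold, and the sequence with the most unassigned neighbors becomes the next cluster centrotype.

```python
# Minimal sketch (assumed; not the Mesa Suite grouping module) of
# Taylor-Butina exclusion-sphere clustering on a distance matrix.
def taylor_butina(dist, threshold):
    """Disjoint clustering: j is a neighbor of i when dist[i][j] <= threshold."""
    unassigned = set(range(len(dist)))
    clusters = []  # list of (centrotype index, set of member indices)
    while unassigned:
        # Neighbor sets restricted to still-unassigned sequences.
        nbrs = {i: {j for j in unassigned if j != i and dist[i][j] <= threshold}
                for i in unassigned}
        # The sequence with the most neighbors becomes the next centrotype,
        # so the first cluster found is the largest.
        centro = max(unassigned, key=lambda i: len(nbrs[i]))
        members = {centro} | nbrs[centro]
        clusters.append((centro, members))
        unassigned -= members
    return clusters

# Toy 4-sequence distance matrix with two tight pairs under threshold 0.3.
d = [[0.0, 0.2, 0.9, 0.8],
     [0.2, 0.0, 0.7, 0.9],
     [0.9, 0.7, 0.0, 0.1],
     [0.8, 0.9, 0.1, 0.0]]
print(taylor_butina(d, 0.3))
```

A lower threshold shrinks the exclusion spheres and yields more, smaller, more homogeneous clusters, matching the behavior described above.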
Table 5
Allocation of hold-out sequences to five clusters identified on training sample
Cluster Frequency N Percent N Frequency churners Percent churners
1 4514 55.60 127 2.81
2 601 7.40 32 5.32
3 916 11.28 4 0.44
4 842 10.36 14 1.66
5 1254 15.45 22 1.75
In a next step, the hold-out sequences are assigned to the five clusters identified on the training sample. We
calculated the distances between the hold-out sequences and the groups of five representative cluster sequences
(5 × 5) in both directions. As a proxy for the asymmetric distance between a hold-out sequence and a
representative, we sum the distance from the hold-out sequence to the centrotype and the distance from the
centrotype to the hold-out sequence, and divide this sum by two. The latter approach is common
practice in studies performing calculations on asymmetric proximities. After all, it has been proven [34] that each
asymmetric distance matrix decomposes into a symmetric matrix S of averages s_ij = (q_ij + q_ji)/2 and a
skew-symmetric matrix A with elements a_ij = (q_ij − q_ji)/2. Using these proxies, each hold-out sequence is assigned to
the cluster to which it has the smallest average distance (i.e., smallest average distance towards five cluster
representatives). Table 5 gives an overview of the cluster distribution in the hold-out sample.
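The allocation rule can be sketched as follows; dist_to, dist_from and cluster_reps are hypothetical names for the precomputed directed SAM distances and the representative lists.

```python
# Sketch of the hold-out allocation rule: average the two directed SAM
# distances between a hold-out sequence h and each representative r,
# s(h, r) = (q(h, r) + q(r, h)) / 2, then assign h to the cluster with the
# smallest mean symmetrized distance over its representatives.
def assign_cluster(dist_to, dist_from, cluster_reps):
    """dist_to[r]: h -> r distance; dist_from[r]: r -> h distance."""
    def avg_distance(cluster):
        reps = cluster_reps[cluster]
        return sum((dist_to[r] + dist_from[r]) / 2 for r in reps) / len(reps)
    return min(cluster_reps, key=avg_distance)

# Toy example: two clusters with two representatives each (indices 0-3).
dist_to = [0.4, 0.6, 1.2, 1.0]    # hold-out sequence -> representative
dist_from = [0.2, 0.4, 1.4, 1.6]  # representative -> hold-out sequence
print(assign_cluster(dist_to, dist_from, {1: [0, 1], 2: [2, 3]}))  # 1
```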
4.3. Defining the best subset of non-time varying independent variables
Before we compare the predictive performance of the LogSeq model with that of the LogNonseq model, we
first define a best subset of non-time varying independent variables to include in the logistic-regression models
besides the sequential dimension relbalance. Employing the leaps-and-bounds algorithm [10] on the non-time
varying independent variables in Table 2, we compared the best subsets of sizes 1 to 20 on their sums of
squares. As expected, the marginal gain in the performance criterion diminishes with each additional
independent variable. From Fig. 2, we decide that a subset containing the best five variables represents a
good balance between the number of independents included and the variance explained by the model.
The independent variables in the best subset of size five are listed in Table 6.
[Fig. 2: variable-selection curve plotting the score value (y-axis, approximately 500 to 1200) against the number of variables in the best subsets (x-axis, 1 to 20).]
Fig. 2. Number of variables in best subsets.
Table 6
Best subset selection of size 5
Best subset of size 5 Variable description
st_days_until_next_exp Number of days until the first next expiration date (from January 1st, 2003 on) for a service still in
possession on December, 31st 2002. Standardized.
st_ratio_curr_avgtotal3 Account-balance total December 2002/average account-balance total calculated over October,
November and December 2002.
st_ratio_stillo_open_bef2003 (Number of services still open before January 1st, 2003/number of services ever opened before
January 1st, 2003)*100. Standardized.
st_months_last_titu_nozero Number of months since the customer was last titular of at least one account where the balance is non-zero.
st_days_since_last_exp Number of days between December 31st, 2002 and last expiration date of a service before January 1st,
2003. Standardized.
4.4. Comparing churn predictive performance of LogSeq and LogNonseq models
We compare the predictive performance measured by AUC on the hold-out sample of the LogNonseq model
with that of the LogSeq model. Both logistic-regression models include the best five non-time varying variables
from Table 6 as well as the sequential dimension relbalance. However, in the LogNonseq model, the sequential
dimension is incorporated by means of four non-time varying independent variables (i.e., st_relbalanceJanMar,
st_relbalanceMarJul, st_relbalanceJulOct and st_relbalanceOctDec), whereas in the LogSeq model the sequential
dimension is operationalized by four cluster dummies. We hypothesize that the churn predictive performance of
the LogSeq model will be significantly higher than that of the LogNonseq model because operationalizing a
sequential dimension by non-time varying independent variables neglects the sequential information of the
dimension.
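The two operationalizations of the sequential dimension can be sketched as follows. The reference category for the dummies is our assumption (the paper reports four dummies cl1 to cl4 for five clusters, leaving one cluster as baseline), and the relbalance values are illustrative only:

```python
# Sketch of how the sequential dimension enters each model. LogNonseq takes
# the four standardized period values directly; LogSeq encodes five-cluster
# membership as four 0/1 dummies (cluster 5 assumed as reference category).
def cluster_dummies(cluster, n_clusters=5):
    """Return n_clusters - 1 indicators for a 1-based cluster label."""
    return [1 if cluster == k else 0 for k in range(1, n_clusters)]

relbalance = [0.3, -0.1, 0.8, 1.2]     # illustrative st_relbalance Jan-Mar ... Oct-Dec
x_lognonseq = relbalance               # sequential dimension as 4 covariates
x_logseq = cluster_dummies(cluster=2)  # sequential dimension as 4 dummies
print(x_logseq)  # [0, 1, 0, 0]
```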
Table 7 shows that our hypothesis is confirmed. There is a significant difference (χ2 = 17.69, p = 0.0000259) in predictive performance between the binary logistic-regression model including the best subset of five independent
variables and the sequential dimension operationalized by non-time varying variables and the logistic-regression
model with the same set of independent variables but the sequential dimension expressed by cluster dummies
deduced from sequence-alignment analysis. Table 8 shows the parameter estimates and significance levels of the
regressors for both models. Where possible the standardized estimates are given.
Although not all cluster dummies are significant at the α = 0.05 level, the LogSeq model significantly
outperforms the LogNonseq model. The insignificance of cluster dummy 3 ( p=0.1744) might stem from a
serious drop in percentage churners from training (2.58%) to hold-out sample (0.44%) to even below 1%
churners. Looking at the estimates for the relative account-balance total dimension in the LogNonSeq model,
we find that only the relative evolution in account-balance total for the last six months seems to have a
significant effect on the churn probability in 2003 (cf. st_relbalanceJulOct and st_relbalanceOctDec). The
bigger the positive difference in account-balance total the less likely the customer will churn. Considering the
non-sequential regressors, it appears that all five regressors are significant for the LogNonseq model, while all
but one, i.e., st_months_last_titu_nozero, are significant for the LogSeq model. All effects have the expected sign. The
smaller the number of days until the next expiration date from January 1st, 2003 onwards, the higher the churn
probability (cf. st_days_until_next_exp). The effect of st_ratio_curr_avgtotal3 seems to be rather small, so we
should not worry too much about the difference in sign of effect between the LogNonseq and the LogSeq
model.
Table 7
Churn predictive performance of LogSeq and LogNonseq model on hold-out sample
Performance measure LogNonseq model LogSeq model
AUC 0.906 0.964
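The AUC values in Table 7 can be computed as the Mann-Whitney probability that a randomly drawn churner is scored above a randomly drawn non-churner. A naive O(|churners| × |non-churners|) sketch with illustrative scores:

```python
# Sketch of the AUC criterion: the probability that a random churner (y = 1)
# receives a higher predicted churn score than a random non-churner (y = 0),
# with ties counted as one half.
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.4, 0.3, 0.2]  # illustrative predicted churn probabilities
labels = [1, 1, 0, 1, 0]            # 1 = churner, 0 = non-churner
print(round(auc(scores, labels), 3))  # 0.833
```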
Table 8
Parameter estimates for LogNonseq and LogSeq models
Variable                                    Estimate             Pr > Chi-Square
                                            LogNonseq  LogSeq    LogNonseq  LogSeq
Intercept                                   −16.80     −8.23     <0.0001    <0.0001
Best subset
  st_days_until_next_exp                    −1.25      −1.26     <0.0001    <0.0001
  st_ratio_curr_avgtotal3                   0.28       −0.09     0.0187     0.0109
  st_ratio_stillo_open_bef2003              −1.25      −1.28     <0.0001    <0.0001
  st_months_last_titu_nozero                0.12       0.02      0.0048     0.4392
  st_days_since_last_exp                    0.50       0.49      <0.0001    <0.0001
Sequential dimension (LogNonseq variable / LogSeq dummy)
  st_relbalanceJanMar / cl1                 −21.79     0.29      0.2169     0.0401
  st_relbalanceMarJul / cl2                 −8.86      0.25      0.9299     0.1034
  st_relbalanceJulOct / cl3                 −38.25     0.23      0.0890     0.1744
  st_relbalanceOctDec / cl4                 −240.45    0.35      0.0011     0.0447
We may conclude, that our new procedure, which combines sequence analysis with a traditional
classification model, is a possible strategy to follow in an attempt to overcome the caveat of traditional
classification models designed for modeling non-time varying independent variables. In this paper, we have
modeled a time-varying independent variable by preceding a traditional binary logistic-regression model by an
element/position-sensitive sequence-alignment on the sequential dimension. This resulted in cluster-member-
ship information in terms of the sequential dimension as well as implicitly with respect to the dependent
variable. We decided to include this cluster-membership information by means of dummies in the final
classification model. Instead of the latter, we could alternatively use the cluster information to build cluster-
specific classification models. However, in our application, this approach is unrealistic because one of the five
clusters identified on the test sample has fewer than 1% churners (cf. cluster 3). So in case the predicted event
is rather rare, building cluster-specific classification models might be impossible due to too few people
experiencing the event in some of the identified clusters. Moreover, another drawback of building cluster-
specific models, lies in the practical problems to include several sequential dimensions in the classification
model. Whereas it is easy to include another set of cluster dummies in the classification model for each
sequential dimension, building cluster-specific classification models on more than one sequential dimension
implies simultaneously clustering on several sequential dimensions employing multidimensional SAM analysis.
However, as computational complexity is already an issue of concern in case of unidimensional SAM, the
latter becomes even worse within multidimensional SAM analysis.
5. Conclusion
In this paper, we provide a new procedure that
overcomes the inability of traditional classification
models to incorporate sequential exogenous variables.
Instead of transforming the sequential dimension into
non-time varying variables, thereby ignoring the
sequential-information part, a better practice is to
employ a sequence-analysis method for modeling the
time-varying independent variable and to, subse-
quently, incorporate this information in the traditional
classification method, which is designed for modeling
non-time varying covariates. This way the best of both
methods is combined. One possible strategy hereby is
to cluster the customers on the sequential dimension
using SAM (in this paper an element/position
sensitive SAM) and to incorporate this cluster
information in the classification model by dummy
variables. The latter approach is promising as the
results from the attrition models at the IFSP confirm
our hypothesis of improved predictive performance
when modeling the sequential dimension by se-
quence-analysis methods instead of operationalizing
them as non-time varying variables. Besides this
approach of preceding a traditional classification
model, like binary logistic regression, by a sequence-
analysis method to model the sequential dimension,
other approaches might exist. It might be
worthwhile to elaborate on other procedures to
combine sequence-analysis methods designed to
model sequential information with traditional classi-
fication methods suited to model non-time varying
independent variables. Another avenue for further
research exists in exploring how other sequence-
analysis methods than sequence alignment could
enhance the modeling of sequential covariates in
classification models. In this paper, we only included
one sequential dimension. Further research studies
should incorporate several sequential covariates.
Finally, we wish to bring the parameter setting issue
to the attention of researchers considering applying
or elaborating on our new procedure. The highly tuned
edit distances used for our churn application might not
be valid in other applications. The researcher should
adapt the operation weights, element costs and
position costs of the sequence-alignment to fit the
application at hand. Similarly, the threshold used for
the asymmetric clustering will need fine-tuning in
order to obtain a good clustering solution for other
applications.
Acknowledgements
The authors would like to thank the anonymous
financial-services company for providing the data.
Next, we extend our thanks to John D. MacCuish
and Norah E. MacCuish for providing a free
academic license of the software package Mesa
Suite Version 1.2, Grouping Module (www.mesaac.
com) and for their kind assistance. Moreover, we
would like to thank Bart Lariviere, PhD candidate at
Ghent University, for sharing his knowledge of the
data warehouse of the company. Finally, we express
our thanks to 1) Ghent University for funding the PhD project of Anita Prinzie (BOF Grant no. B00141), and 2) the Flemish Research Fund (FWO Vlaanderen) for providing the funding for the computing equipment to complete this project (Grant no. G0055.01).
References
[1] A. Abbott, Sequence analysis: new methods for old ideas,
Annual Review of Sociology 21 (1995) 93–113.
[2] A. Abbott, A. Hrycak, Measuring resemblance in sequence
data: an optimal matching analysis of musicians’ careers,
American Journal of Sociology 96 (1) (1990) 144–185.
[3] B. Baesens, G. Verstraeten, D. Van den Poel, Bayesian
network classifiers for identifying the slope of the customer-
lifecycle of long-life customers, European Journal of Opera-
tional Research 156 (2) (2004) 508–523.
[4] C.B. Bhattacharya, When customers are members: customer
retention in paid membership contexts, Journal of the
Academy of Marketing Science 26 (1) (1998) 31–44.
[5] W. Buckinx, E. Moons, D. Van den Poel, G. Wets, Customer-
adapted coupon targeting using feature selection, Expert
Systems with Applications 26 (4) (2004) 509–518.
[6] D. Butina, Unsupervised data base clustering based on Daylight's fingerprint and Tanimoto similarity: a fast and automated
way to cluster small and large data sets, Journal of Chemical
Information and Computer Sciences 39 (4) (1999) 747–750.
[7] A. Cohen, R.I. Ivry, S.W. Keele, Attention and structure in
sequence learning, Journal of Experimental Psychology.
Learning, Memory and Cognition 16 (1) (1990) 17–30.
[8] M.R. Colgate, P.J. Danaher, Implementing a customer relation-
ship strategy: the asymmetric impact of poor versus excellent
execution, Journal of the Academy of Marketing Science 28
(3) (2000) 375–387.
[9] W.S. DeSarbo, G. De Soete, On the use of hierarchical-
clustering for the analysis of nonsymmetric proximities,
Journal of Consumer Research 11 (1) (1984) 601–610.
[10] G.M. Furnival, R.W. Wilson, Regressions by leaps and
bounds, Technometrics 16 (4) (1974) 499–511.
[11] J. Ganesh, M.J. Arnold, K.E. Reynolds, Understanding the
customer base of service providers: an examination of the
differences between switchers and stayers, Journal of Market-
ing 64 (3) (2000) 65–87.
[12] D. Green, J.A. Swets, Signal detection theory and psycho-
physics, John Wiley & Sons, New York, USA, 1966.
[13] M. Gribskov, J. Devereux (Eds.), Sequence Analysis Primer,
Oxford University Press, New York, USA, 1992.
[14] J. Hair, R. Andersen, R. Tatham, W. Black, Multivariate Data
Analysis, Prentice Hall, 1998.
[15] J. Hartigan, Clustering Algorithms, Wiley, New York, USA,
1975.
[16] B. Hay, Sequence Alignment Methods in Web Usage Mining,
Doctoral Dissertation, LUC, Belgium, 2003.
[17] B. Hay, G. Wets, K. Vanhoof, Web usage mining by means of
multidimensional sequence alignment methods. WEBKDD
2002—mining web data for discovering usage patterns and
profiles, Lecture Notes in Artificial Intelligence 2703 (2003)
50–65.
[18] W.J. Hopp, A sequential model of R&D investment over an
unbounded time horizon, Management Science 33 (4) (1987)
500–508.
[19] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice Hall Advanced Reference Series, Englewood Cliffs, NJ, 1988.
[20] R.A. Jarvis, E.A. Patrick, Clustering using a similarity
measure based on shared nearest neighbors, IEEE Trans-
actions on Computers 22 (11) (1973) 1025–1034.
A. Prinzie, D. Van den Poel / Decision Support Systems 42 (2006) 508–526
[21] C.H. Joh, T.A. Arentze, H.J.P. Timmermans, Multidimen-
sional sequence alignment methods for activity–travel
pattern analysis: a comparison of dynamic programming
and genetic algorithms, Geographical Analysis 33 (3) (2001)
247–270.
[22] C.H. Joh, T.A. Arentze, H.J.P. Timmermans, A position-
sensitive sequence alignment method illustrated for space–
time activity-diary data, Environment and Planning A 33 (2)
(2001) 313–338.
[23] J. Jonz, Textual sequence and 2nd language comprehension,
Language Learning 39 (2) (1989) 207–249.
[24] L. Kaufman, P.J. Rousseeuw, Finding groups in data: an
introduction to cluster analysis, John Wiley & Sons, 1990.
[25] E. Kim, W. Kim, Y. Lee, Combination of multiple classifiers
for the customer’s purchase behavior prediction, Decision
Support Systems 34 (2) (2003) 167–175.
[26] K. Krishna, R. Krishnapuram, A clustering algorithm for
asymmetrical related data with applications to text mining,
Proceedings of the Tenth International Conference on Infor-
mation and Knowledge Management, 2001, pp. 571–573.
[27] B. Lariviere, D. Van den Poel, Investigating the role of product
features in preventing customer churn by using survival
analysis and choice modeling: the case of financial services,
Expert Systems with Applications 27 (2) (2004) 277–285.
[28] V.I. Levenshtein, Binary codes capable of correcting deletions,
insertions, and reversals, Cybernetics and Control Theory 10
(8) (1965) 707–710.
[29] J. MacCuish, N.E. MacCuish, Mesa Suite Version 1.2
Grouping Module, Mesa Analytics and Computing, LLC,
www.mesaac.com, 2003.
[30] J. MacCuish, C. Nicolaou, N.E. MacCuish, Ties in proximity
and clustering compounds, Journal of Chemical Information
and Computer Sciences 41 (1) (2001) 134–146.
[31] S. McBrearty, The Sangoan-Lupemban and Middle Stone Age sequence at the Muguruk site, World Archaeology 19 (3) (1988) 388–420.
[32] M.A. McClure, T.K. Vasi, W.M. Fitch, Comparative analysis
of multiple protein-sequence alignment methods, Molecular
Biology and Evolution 11 (4) (1994) 571–592.
[33] C.S. Myers, L.R. Rabiner, A level building dynamic time
warping algorithm for connected word recognition, IEEE
Transactions on Acoustics, Speech, and Signal Processing
ASSP 29 (2) (1981) 284–297.
[34] B. Noble, J.W. Daniel, Applied Linear Algebra, Prentice-Hall, New Jersey, 1988, p. 20.
[35] K. Ozawa, CLASSIC: a hierarchical clustering algorithm
based on asymmetric similarities, Pattern Recognition 16 (2)
(1983) 201–211.
[36] A. Prinzie, D. Van den Poel, Investigating purchasing-
sequence patterns for financial services using Markov, MTD
and MTDg models, European Journal of Operational Research
(2006) (in press).
[37] A.E. Raftery, S. Tavare, Estimation and modelling repeated
patterns in high order Markov chains with the mixture
transition distribution model, Applied Statistics 43 (1) (1994)
179–199.
[38] R. Sabherwal, D. Robey, Reconciling variance and process
strategies for studying information system development,
Information Systems Research 6 (1995) 303–327.
[39] D. Sankoff, J. Kruskal, Time Warps, String Edits, and
Macromolecules. The Theory and Practice of Sequence
Comparison, Addison-Wesley Pub., Advanced Book Program,
Mass., 1983.
[40] R. Sethuraman, V. Srinivasan, K. Doyle, Asymmetric and
neighborhood cross-price effects: some empirical general-
izations, Marketing Science 18 (1) (1999) 23–41.
[41] R. Taylor, Simulation analysis of experimental design strat-
egies for screening random compounds as potential new drugs
and agrochemicals, Journal of Chemical Information and
Computer Sciences 35 (1) (1995) 59–67.
[42] P. Toth, D. Vigo, A heuristic algorithm for the symmetric
and asymmetric vehicle routing problems with backhauls,
European Journal of Operational Research 113 (3) (1999)
528–543.
[43] A. Tversky, J.W. Hutchinson, Nearest neighbor analysis of
psychological spaces, Psychological Review 93 (1) (1986)
3–22.
[44] G.L. Urban, P.L. Johnson, J.R. Hauser, Testing competitive
market structures, Marketing Science 3 (1984) 83–112.
[45] D. Van den Poel, B. Lariviere, Customer attrition analysis
for financial services using proportional hazard models,
European Journal of Operational Research 157 (1) (2004)
196–217.
[46] R.A. Wagner, M.J. Fischer, The string-to-string correction
problem, Journal of the Association for Computing Machinery
21 (1974) 168–173.
[47] M.S. Waterman, Introduction to Computational Biology.
Maps, Sequences and Genomes, Chapman and Hall, USA,
1995.
[48] W.C. Wilson, Activity pattern analysis by means of sequence-
alignment methods, Environment and Planning A 30 (6)
(1998) 1017–1038.
[49] B. Zielman, W.J. Heiser, Models for asymmetric proximities,
British Journal of Mathematical & Statistical Psychology 49
(1996) 127–146.
Anita Prinzie is a PhD candidate in
Economics and Business Administration
at Ghent University, Belgium. She
received her master's degree in Marketing Analysis and Planning at Ghent University, Belgium. Her PhD thesis investigates
the use of sequence-analysis methods for
CRM purposes (churn and cross-sell
analysis).
Dirk Van den Poel is associate professor of
marketing at the Faculty of Economics and
Business Administration of Ghent Univer-
sity in Belgium. He heads a competence
center on analytical customer relationship
management (aCRM). He received his
degree of management/business engineer
as well as his PhD from K.U.Leuven
(Belgium). His main fields of interest are
studying consumer behavior from a quan-
titative perspective (CRM), data mining
(genetic algorithms, neural networks, random forests), and oper-
ations research.