
Decision Support Systems 42 (2006) 508–526

Incorporating sequential information into traditional classification

models by using an element/position-sensitive SAM

Anita Prinzie, Dirk Van den Poel

Department of Marketing, Faculty of Economics and Business Administration, Ghent University, Hoveniersberg 24, Ghent, Belgium

Available online 22 April 2005

Abstract

The inability to capture sequential patterns is a typical drawback of predictive classification methods. This caveat might be

overcome by modeling sequential independent variables by sequence-analysis methods. Combining classification methods with

sequence-analysis methods enables classification models to incorporate non-time varying as well as sequential independent

variables. In this paper, we precede a classification model by an element/position-sensitive Sequence-Alignment Method (SAM)

followed by the asymmetric, disjoint Taylor–Butina clustering algorithm with the aim to distinguish clusters with respect to the

sequential dimension. We illustrate this procedure on a customer-attrition model as a decision-support system for customer

retention of an International Financial-Services Provider (IFSP). The binary customer-churn classification model following the

new approach significantly outperforms an attrition model which incorporates the sequential information directly into the

classification method.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Sequence analysis; Binary classification methods; Sequence-alignment method; Asymmetric clustering; Customer-relationship

management; Churn analysis

1. Introduction

In the past, traditional classification models like

logistic regression have been applied successfully to

the prediction of a dependent variable by a series of

non-time varying independent variables [5]. In case

there are time-varying independent variables, these are


typically included in the model by transforming them

into non-time varying variables [3]. Unfortunately, this

practice results in information loss as the sequential

patterns of the data are neglected. Hence, although

traditional classification models are highly valid and

robust for modeling non-time varying data, they are

unable to capture sequential patterns in data. This

caveat might be overcome by modeling time-varying

independent variables by sequence-analysis methods.

Unlike traditional classification methods, sequence-

analysis methods were designed for modeling sequen-

tial information. These methods take sequences of data,



i.e., ordered arrays, as their input rather than individual

data points. With the exception of marketing, sequence

analysis is commonly applied in disciplines like

archeology [31], biology [37], computer sciences

[38], economics [18], history [1], linguistics [23],

psychology [7] and sociology [2]. Sequence-analysis

methods can be categorized depending on whether the

sequences are treated as a whole or step by step [1].

Step-by-step methods examine relationships among

elements or states in the sequences. Time-series

methods are used to study the dependence of an

interval-measured sequence on its own past. When

the variable of interest is categorical, Markov methods

are appropriate. The latter methods calculate transition

probabilities based on the transition between two

events [36]. Transitions from one prior category can

be modeled by using event-history methods, also

known as duration methods, hazard methods, failure

analysis, and reliability analysis. The central research

question studied is time until transition. Whole-

sequence methods use the entire sequence as unit of

analysis to discover similarities between sequences

resulting in typologies. The central issue addressed is

whether there are patterns in the sequences, either over

the whole sequences or within parts of them. There are

two approaches to this pattern question. In the algebraic

approach, each sequence is reduced to some simplest

form and sequences with similar 'simplest forms' are gathered under one heading. In the metric approach, a

similarity measure between the sequences is calculated

which is then subsequently processed by clustering,

scaling and other categorization methods to extract

typical sequential patterns. Methods like optimal

matching or optimal alignment are commonly applied

within this metric approach. In an intermediate

situation, local similarities are analyzed to find out

the role of key subsequences embedded in longer

sequences [48].
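As a concrete illustration of the step-by-step (Markov) approach mentioned above, the following minimal Python sketch, our illustration rather than anything from the cited works, estimates first-order transition probabilities from integer-coded categorical sequences; the function name and state coding are assumptions for the example.

```python
import numpy as np

def transition_matrix(sequences, n_states):
    """Estimate first-order Markov transition probabilities from
    integer-coded categorical sequences."""
    counts = np.zeros((n_states, n_states))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):   # consecutive state pairs
            counts[a, b] += 1
    rows = counts.sum(axis=1, keepdims=True)
    # states with no outgoing transitions keep an all-zero row
    return np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)

# e.g. two short state sequences over a three-state alphabet {0, 1, 2}
P = transition_matrix([[0, 1, 1, 2], [2, 1, 0]], n_states=3)
```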

Given that traditional classification models are

designed for modeling non-time varying independent

variables and that sequence-analysis methods are

well-suited to model dynamic information, it follows

that a combination of both methods unites the best of

both worlds and allows for building predictive

classification models incorporating non-time varying

as well as time-varying independent variables. One

possible approach amongst others exists in preceding

the traditional classification method by a sequence-

analysis method to model the dynamic exogenous

variables (cf. serial instead of parallel combination of

classifiers, [25]). In this paper, we precede a logistic

regression, as a traditional classification method, by

a Sequence-Alignment Method (i.e., SAM), as a

whole sequence method using the metric approach.

The SAM analysis is used to model a time-varying

independent variable. We identify how similar the

customers are on the dynamic independent variable

by calculating a similarity measure between each

pair of customers, the SAM distances. These

distances are further processed by a clustering

algorithm to produce groups of customers which

are relatively homogeneous with respect to the

dynamic independent variable. As we cluster on a

dimension influencing the dependent variable, the

clusters are not only homogeneous in terms of the

time-varying independent variable, but should also

be homogeneous with respect to the dependent

variable. This way, we make the implicit link

between clustering and classification explicit. After

all, clustering is in theory a special problem of

classification associated with an equivalence relation

defined over a set [35]. Including the cluster-

membership information as dummies in the classi-

fication model not only allows for modeling the

dynamic independent variable in an appropriate way,

it should even improve the predictive performance.

In this paper, we illustrate the new procedure,

which combines a sequence-analysis method with a

traditional classification method, by estimating a

customer-attrition model for a large International

Financial-Services Provider (from now on referred

to as IFSP). This attrition model feeds the managerial

decision process and helps refining the retention

strategy by elucidating the profile of customers with

a high defection risk. A traditional logistic regression

is applied to predict whether a customer will churn or

not. This logistic regression is preceded by an

element- and position-sensitive Sequence-Alignment

Method to incorporate a time-varying covariate. We

will calculate the distance between each customer on a

sequential dimension, i.e., the evolution in relative

account-balance total of the customer at the IFSP, and

use these distances as input for a subsequent cluster

analysis. The cluster-membership information is

incorporated in the logistic regression by dummies.

We hypothesize that the logistic-regression model


with the time-varying independent variable included

as cluster dummy variables will outperform the

traditional logistic regression where the same sequen-

tial dimension is incorporated by creating as many

non-time varying independent variables as there are

time points on which the dimension is measured.

The remainder of this paper is structured as

follows. In Section 2 we describe the different

methods used. We discuss the basic principles of

SAM and underline how the cost allocation influences

the mathematical features of the resulting SAM

distance measures determining whether a symmetric

or asymmetric clustering algorithm is appropriate. We

outline how a modification of Taylor’s cluster-

sampling algorithm [41] and Butina’s cluster algo-

rithm based on exclusion spheres [6] allows clustering

on asymmetric SAM distances. In Section 3 we

outline how the new procedure proposed in this paper

is applied within a financial-services context to

improve prediction of churn behavior. Section 4

investigates whether the results confirm our hypoth-

esis on improved predictive performance. We con-

clude with a discussion of the main findings and

introduce some avenues for further research.

2. Methodology

2.1. Sequence-alignment method (SAM)

The Sequence-Alignment Method (SAM) was

developed in computer sciences (text editing and

voice recognition) and molecular biology (protein and

nucleic acid analysis). A common application in

computer sciences is string correction or string editing

[46]. The main use of sequence comparison in

molecular biology is to detect the homology between

macromolecules. If the distance between two macro-

molecules is small enough, one may conclude that

they have a common evolutionary ancestor. Applica-

tions of sequence alignment in molecular biology use

comparatively simple alphabets (the four nucleotide

molecules or the twenty amino acids) but tend to have

very long sequences [48]. Conversely, in marketing

applications, sequences will mostly be shorter but

with a very large alphabet. Besides SAM applications

in computer sciences and molecular biology, there are

applications in social science [2], transportation

research [21] and speech processing [33]. Recently,

SAM has been applied in marketing to discover

visiting patterns of websites [17].

Sankoff and Kruskall [39], Waterman [47] and

Gribskov and Devereux [13] are good references on

Sequence-Alignment Method. SAM handles variable-

length sequences and incorporates sequential infor-

mation, i.e., the order in which the elements appear in

a sequence, into its distance measure (unlike conven-

tional position-based distance measures, like Eucli-

dean, Minkowski, city block and Hamming dis-

tances). The original sequence-alignment method

can be summarized as follows. Suppose we compare

sequence a, called the source, having i elements a =a

[a1, . . ., ai] with sequence b, i.e., the target, having j

elements b =b [b1, . . ., bj]. In general, the distance or

similarity between sequence a and b is expressed by

the number of operations (i.e., total amount of effort)

necessary to convert sequence a into b. The SAM

distance is represented by a score. The higher the

score, the more effort it takes to equalize the

sequences and the less similar they are. The elemen-

tary operations are insertions, deletions and substitu-

tions or replacements. Deletion and insertion

operations, often referred to as indel, are applied to

elements of the source (first) sequence in order to

change the source into the target (second) sequence.

Substitution operations indicate deletion + insertion.

Some advanced research involves other operations

like swaps or transpositions (i.e., the interchange of

adjacent elements in the sequence), compression (of

two or more elements into one element) and expan-

sion (of one element into two or more elements).

Every elementary operation is given a weight (i.e.,

cost) greater than or equal to zero. It is common

practice to make assumptions on the weights in order

to achieve the metric axioms (nonnegative property,

zero property, triangle inequality and symmetry) of

mathematical distance (e.g., equal weights for dele-

tions and insertions to preserve the symmetry axiom)

[39]. Weights may be tailored to reflect the impor-

tance of operations, the similarity of particular

elements (cf. element sensitive), the position of

elements in the sequence (cf. position sensitive), or

the number/type of neighboring elements or gaps [48].

A different weight for insertion and deletion as well as

position-sensitive weights result in SAM distances

which are no longer symmetric: cf. |ab| ≠ |ba|. The


latter has its implications on the clustering algorithm

that could be used (cf. infra). Different meanings can

be given to the word 'distance' in sequence compar-

ison. In this paper, we express the relatedness

(similarity or distance) between customers on their

evolution in relative account-balance total at the IFSP

by calculating the weighted-Levenshtein [28] dis-

tance between each possible pair of customers (i.e.,

pairwise-sequence analysis). The weighted-Leven-

shtein distance defines dissimilarity as the smallest

sum of operation-weighting values required to change

sequence a into b. This way a distance matrix is

constructed and subsequently used as input for a

cluster analysis.
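For concreteness, the sketch below implements a plain weighted-Levenshtein distance by dynamic programming; the function and parameter names are our own, and the element- and position-sensitive cost terms the paper adds later (cf. Section 4.1) are deliberately left out.

```python
def weighted_levenshtein(a, b, w_del=1.0, w_ins=1.0, w_sub=1.0):
    """Smallest sum of operation weights needed to turn sequence a into b,
    computed by standard dynamic programming."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * w_del                # delete all of a's prefix
    for j in range(1, n + 1):
        d[0][j] = j * w_ins                # insert all of b's prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_del,    # deletion
                          d[i][j - 1] + w_ins,    # insertion
                          d[i - 1][j - 1] + sub)  # substitution (or match)
    return d[m][n]

# unequal indel weights make the measure asymmetric: d(a, b) != d(b, a)
weighted_levenshtein([5, 1, 2, 3], [0, 1, 5, 0], w_ins=0.9)
```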

2.2. Cluster analysis of weighted-Levenshtein SAM

distances

We cluster the customers on the weighted-Lev-

enshtein SAM distances, expressing how dissimilar

they are on the sequential dimension. The cluster-

membership information resulting from this cluster

analysis is translated into cluster dummies, which

represent the sequential dimension in a subsequent

classification model. We hypothesize that a classi-

fication model including cluster indicators (operation-

alized as dummy variables) based on SAM distances

will outperform a similar model where the same

sequential dimension is incorporated by as many non-

time varying independent variables as time points on

which the dimension is measured. After all, these

dummies are good indicators of what type of

behavior the customer exhibits towards the sequential

dimension (i.e., time-varying independent variable), as

well as towards the dependent variable (cf. explicit

typology of customers on time-varying covariate

results in implicit typology on the dependent variable).

A distance matrix holding the pairwise weighted-

Levenshtein distances between customer sequences is

used as a distance measure for clustering. As

discussed earlier, depending on how the weights

(i.e., costs) for SAM are set, the distances in the

matrix are symmetric or asymmetric. Most common

clustering methods employ symmetric, hierarchical

algorithms such as Ward's, Single-, Complete-, Average-, or Centroid linkage [14,19,24], non-hierarchical

algorithms such as Jarvis–Patrick [20], or partitional

algorithms such as k-means or hill-climbing. Such

methods require symmetric measures, e.g. Tanimoto, Euclidean, Hamann or Ochiai, as their inputs. One

drawback of these methods is that they cannot capture

important asymmetric relationships. Nevertheless,

there exist many practical scenarios where the under-

lying relation is asymmetric. Asymmetric relation-

ships are common in transportation research (cf.

different distance between two cities A and B

(|AB| ≠ |BA|) due to other routes (e.g. the Vehicle

Routing Problem [42])), in text mining (cf. word

associations, e.g. most people will relate 'data' to 'mining' more strongly than conversely [43]), in

sociometric ratings (cf. a person i could express a

higher like or dislike rating to person j than vice

versa), in chemoinformatics (cf. compound A may

fit into compound B while the reverse is not

necessarily true) and to a lesser extent in marketing

research (cf. brand-switching counts [9], 'first choice'–'second choice' connections [44] and the

asymmetric price effects between competing brands

[40]). A good overview of models for asymmetric

proximities is given by Zielman and Heiser [49].

Although there are a lot of research settings in-

volving asymmetric proximities, only a few clus-

tering algorithms can handle asymmetric data. Most

of these are based on a nearest-neighbor table

(NNT). Krishna and Krishnapuram [26] provide a

clustering algorithm for asymmetric data (i.e.,

CAARD algorithm which closely resembles the

Leader Clustering Algorithm (LCA) [15]) with

applications to text mining. Ozawa [35] defines a

hierarchical asymmetric clustering algorithm called

Classic, and applies it on the detection of gestalt

clusters. His algorithm is based on an iteratively

defined nested sequence of NNRs (i.e., Nearest

Neighbors Relations). MacCuish et al. [30] con-

verted the Taylor–Butina exclusion region grouping-

algorithms [6,41] into a real clustering algorithm,

which can be used for both disjoint or non-disjoint

(overlapping), either symmetric or asymmetric clus-

tering. Although this algorithm is designed for

clustering compounds (i.e., the chemoinformatics

field with applications like compound acquisition

and lead optimization in high-throughput screening),

in this paper it is employed to cluster customers on

marketing-related information. More specifically, we

apply the asymmetric, disjoint version of the

algorithm to the asymmetric SAM distances obtained


earlier. The asymmetric disjoint Taylor–Butina algo-

rithm is a five-step procedure (Fig. 1) [29]:

1. Create the threshold nearest-neighbor table using

similarities in both directions.

2. Find true singletons, i.e., data points (in our case

customers) with an empty nearest-neighbor list.

Those elements do not fall into any cluster.

3. Find the data point with the largest nearest-

neighbor list. This point tends to be in the center

of the k-th (cf. k clusters) most densely occupied

region of the data space. The data point together

with all its neighbors within its exclusion region,

constitute a cluster. The data point itself becomes

the representative data point for the cluster.

Remove all elements in the cluster from all

nearest-neighbor lists. This process can be seen

as putting an 'exclusion sphere' around the newly

formed cluster [6].

4. Repeat step 3 until no data points exist with a non-

empty nearest-neighbor list.

5. Assign remaining data points, i.e., false singletons, to the group that contains their most similar nearest neighbor, but identify them as 'false singletons'. These elements have neighbors at the given similarity threshold criterion (e.g. all elements with a dissimilarity measure smaller than 0.3 are deemed similar), but a 'stronger' cluster representative, i.e., one with more neighbors in the list, excludes those neighbors (cf. cluster criterion).

[Fig. 1. Asymmetric Taylor–Butina schematic [29]: exclusion regions whose diameter is set by the threshold value (threshold = .15), showing representative compounds, true singletons and false singletons; dissimilarity is taken in both directions.]
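The five steps can be sketched in Python as follows. This is a minimal illustration of the procedure on a generic distance matrix, with our own function name and tie-breaking choices; it is not the Mesa Suite implementation used later in the paper.

```python
import numpy as np

def taylor_butina(dist, threshold):
    """Asymmetric, disjoint Taylor-Butina clustering on an n x n distance
    matrix; returns integer labels (-1 marks true singletons)."""
    n = dist.shape[0]
    near = lambda i, j: dist[i, j] < threshold or dist[j, i] < threshold
    # 1. threshold nearest-neighbor table, using distances in both directions
    nn = [{j for j in range(n) if j != i and near(i, j)} for i in range(n)]
    labels = np.full(n, -1)
    true_singletons = {i for i in range(n) if not nn[i]}   # 2.
    k = 0
    while True:
        # 3. the point with the largest remaining neighbor list seeds cluster k;
        #    it and its neighbors form an 'exclusion sphere'
        cand = [i for i in range(n) if labels[i] == -1 and nn[i]]
        if not cand:
            break                                          # 4. no lists left
        seed = max(cand, key=lambda i: len(nn[i]))
        members = {seed} | nn[seed]
        for m in members:
            labels[m] = k
        for lst in nn:                                     # purge clustered points
            lst -= members
        k += 1
    # 5. false singletons: unassigned points whose neighbors were all claimed
    #    by stronger representatives; attach each to its most similar neighbor
    for i in range(n):
        if labels[i] == -1 and i not in true_singletons:
            neigh = [j for j in range(n) if j != i and near(i, j) and labels[j] >= 0]
            if neigh:
                labels[i] = labels[min(neigh, key=lambda j: min(dist[i, j], dist[j, i]))]
    return labels
```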

2.3. Incorporating cluster membership information in

the classification model

After having applied SAM and cluster analysis

using the asymmetric Taylor–Butina algorithm, we

build a classification model to predict a binary target

variable, in our application 'churn'. As a classification method, we use binary logistic regression. We build two churn models using the logistic-regression method. One model includes the sequential dimension as cluster dummies resulting from clustering the SAM distances (from now on referred to as LogSeq). The second model incorporates the sequential dimension in a traditional way by as many non-time varying regressors as there are time points on which the dimension is

measured (from now on referred to as LogNonseq).

Both models are estimated on a training sample and

subsequently validated on a hold-out sample, contain-

ing customers not belonging to the training sample. We

compare the predictive performance of the LogSeq

model with that of the LogNonseq model.
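In outline, the two competing formulations can be sketched as follows; this is a simplified illustration using scikit-learn, with variable names and the choice of LogisticRegression being our assumptions rather than the authors' actual estimation code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def build_models(X_static, relbalance, clusters, y):
    """Fit LogSeq and LogNonseq side by side."""
    # LogSeq: the sequential dimension enters as cluster-membership dummies
    dummies = pd.get_dummies(pd.Series(clusters), prefix="cluster",
                             drop_first=True).to_numpy()
    log_seq = LogisticRegression(max_iter=1000).fit(
        np.hstack([X_static, dummies]), y)
    # LogNonseq: one non-time varying regressor per measurement time point
    log_nonseq = LogisticRegression(max_iter=1000).fit(
        np.hstack([X_static, relbalance]), y)
    return log_seq, log_nonseq
```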

In order to test the performance of the LogSeq

model on the hold-out sample, we need to define a

procedure to assign the hold-out customers to the

clusters identified on the training sample. We define

five sequences per cluster in the training sample as

representatives. By default the grouping module of the

Mesa Suite software package, which implements the

Taylor–Butina clustering algorithm [29], returns only

one representative for each identified cluster. We

prefer to have more than one representative customer

for each cluster in order to improve the quality of

allocation of customers in the hold-out sample to the

clusters identified on the training sample. Therefore,

once we have found a good k-cluster solution on

the training sample, we apply the Taylor–Butina

algorithm to the cluster-specific SAM distances in

order to obtain a five-cluster solution delivering five

representatives for the given cluster. This way, each

cluster has five representatives. Next, we calculate the

SAM distances of the hold-out sequences towards

these groups of five cluster representatives and vice

versa. Each hold-out sequence is assigned to the

cluster to which it has the smallest average distance

(i.e., smallest average distance towards five cluster


representatives). This cluster membership information

is transformed into cluster dummy variables.

The predictive performance of the classification

models (in this case: logistic regression) is assessed by

the Area Under the receiver operating Curve (AUC).

Unlike the Percentage Correctly Classified (i.e., PCC),

this performance measure is independent of the

chosen cut-off. The Receiver Operating Character-

istics curve plots the hit percentage (events predicted

to be events) on the vertical axis versus the percentage

false alarms (non-events predicted to be events) on the

horizontal axis for all possible cut-off values [12]. The

predictive accuracy of the logistic-regression models

is expressed by the area under the ROC curve (AUC).

The AUC statistic ranges from a lower limit of 0.5 for

chance (null-model) performance to an upper limit of

1.0 for perfect performance [12]. We compare the

predictive performance of the LogSeq model with the

predictive accuracy of the LogNonseq model. We

hypothesize that the LogSeq model will outperform

the LogNonseq model.
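Equivalently, the AUC can be computed directly from the model scores via the rank-sum identity: it is the probability that a randomly chosen event (churner) receives a higher score than a randomly chosen non-event. A minimal sketch:

```python
import numpy as np

def auc(y_true, y_score):
    """AUC as P(score of random event > score of random non-event);
    ties count half."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true == 1], y_score[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75; 0.5 = chance, 1.0 = perfect
```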

3. A financial-services application

We illustrate our new procedure, which combines

sequence analysis with a traditional classification

method, on a churn-prediction case to support the

customer-retention decision system of a major Finan-

cial-Services Provider (i.e., IFSP). Over the past two

decades, the financial markets have become more

competitive due to the mature nature of the sector on

the one hand and deregulation on the other, resulting in

diminishing profit margins and blurring distinctions

between banks, insurers and brokerage firms (i.e.,

universal banking). Hence, nowadays a small number

of large institutions offering a wider set of services

dominate the financial-services industry. These devel-

opments stimulated bank assurance companies to

implement Customer Relationship Management

(CRM). Under this intensive competitive pressure,

companies realize the importance of retaining their

current customers. The substantive relevance of

attrition modeling comes from the fact that an increase

in retention rate of just one percentage point may result

in substantial profit increases [45]. Successful cus-

tomer retention allows organizations to focus more on

the needs of their existing customers, thereby increas-

ing the managerial insights into these customers’ needs

and hence decreasing the servicing costs. Moreover,

long-term customers buy more [11] and if satisfied,

might provide new referrals through positive word-of-

mouth for the company. These customers tend to be

less sensitive to competitive marketing actions.

Finally, losing customers leads to opportunity costs

due to lost sales and because attracting new customers

is five to six times more expensive than customer

retention [4,8]. For an overview on the literature in

attrition analysis we refer to Van den Poel and

Lariviere [45]. Combining several techniques (just

like in this paper) to achieve improved attrition models

has already been shown to be highly effective [27].

3.1. Customer selection

In this paper, we define a 'churned' customer as

someone who closed all his accounts at the IFSP. We

predict whether customers still being customer at

December 31st, 2002, will churn on all their accounts

in the next year (i.e., 2003) or not. Several selection

criteria are used to decide which customers to include

into our analysis. Firstly, we only selected customers

who became customers from January 1st, 1992

onwards because the information in the data ware-

house before this date is less detailed. Secondly, we

only select customers having at least three distinct

purchase moments before January 2003. This con-

straint is imposed because we wish to focus the

attrition analysis on the more valuable customers.

Given the fact that most customers at the IFSP only

possess one financial service, the selected customers

clearly belong to the more precious clients of the IFSP.

Thirdly, we only keep customers still being customers

on December 31st, 2002 (cf. prediction of churn event

in 2003). This eventually results in 16,254 customers

left among which 399 customers (2.45%) closed all

their accounts in 2003. We randomly created a training

and hold-out sample of 8127 customers each, among

which 200 (2.46%) and 199 (2.45%) are churners

respectively. There is no overlap between the training

and hold-out sample.

3.2. Construction of the sequential dimension

As discussed earlier, we want to include a sequential covariate in a traditional classification model. One such sequential dimension likely to influence the churn probability is the customers' evolution in account-balance total at the IFSP. We define the latter variable as a sum of the customers' total assets (i.e., total outstanding balance on short- and long-term credit accounts + total debit on current account) and total liabilities (i.e., total amount on savings and investment products + credit on current account + sum of monthly insurance fees). Although this account-balance total is a continuous dimension, it is registered in the data warehouse at discrete moments in time; at the end of the month for bank accounts and on a yearly basis for insurance products. We have reliable data for account-balance total from January 1st, 2002 onwards. We build sequences of relative difference in account-balance total (i.e., relbalance) rather than sequences of absolute account-balance total with the aim to facilitate the capturing of overall trends in account-balance total. Each sequence contains four elements (see Table 1): relbalanceJanMar, relbalanceMarJul, relbalanceJulOct, relbalanceOctDec.

Table 1
Four elements of the relative account-balance total dimension

Dimension relbalance | Definition
relbalanceJanMar | (account-balance total March 2002 − account-balance total January 2002) / account-balance total January 2002
relbalanceMarJul | (account-balance total July 2002 − account-balance total March 2002) / account-balance total March 2002
relbalanceJulOct | (account-balance total October 2002 − account-balance total July 2002) / account-balance total July 2002
relbalanceOctDec | (account-balance total December 2002 − account-balance total October 2002) / account-balance total October 2002

Table 2
Values for the categorical relative account-balance total dimension and element-based costs

Element | Values of relbalance | Deletion/Insertion cost of element
0 | relbalance = 0 | 0.2
1 | −0.5 < relbalance < 0 | 0.4
2 | −2.5 < relbalance ≤ −0.5 | 0.6
3 | −10 < relbalance ≤ −2.5 | 0.8
4 | relbalance ≤ −10 | 1
5 | 0 < relbalance < 0.05 | 0.4
6 | 0.05 ≤ relbalance < 0.5 | 0.6
7 | 0.5 ≤ relbalance < 2.5 | 0.8
8 | relbalance ≥ 2.5 | 1

Besides observing the account-balance total at discrete

moments in time, we converted the ratio-scaled relative

account-balance total sequence into a categorical

dimension. The latter is crucial to ensure that the

SAM analysis will find any similarities between the

customers’ sequences. Based on an investigation of the

distribution of the relative account-balance total, nine

categories are distinguished representing approxi-

mately an equal number of customers (cf. to enhance

discovery of similarities between customers). See

Table 2.
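A sketch of this categorization step, with the bin edges taken from Table 2; the function name and the example values are our own:

```python
def relbalance_category(x):
    """Map a ratio-scaled relbalance value onto the nine categorical
    elements of Table 2."""
    if x == 0:            return 0
    if -0.5 < x < 0:      return 1
    if -2.5 < x <= -0.5:  return 2
    if -10 < x <= -2.5:   return 3
    if x <= -10:          return 4
    if 0 < x < 0.05:      return 5
    if 0.05 <= x < 0.5:   return 6
    if 0.5 <= x < 2.5:    return 7
    return 8              # x >= 2.5

# a customer's sequence is the four categorized quarterly changes, e.g.:
sequence = [relbalance_category(x) for x in (0.02, -0.7, 0.0, 3.1)]  # [5, 2, 0, 8]
```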

For the LogSeq model customers are clustered

using the SAM and Taylor–Butina algorithm on

their evolution in relative account-balance total as

expressed by a sequence of four categorical relative

account-balance total variables. In the LogNonseq

model, the sequential dimension is included by the

four relative account-balance total variables (i.e.,

relbalanceJanMar, relbalanceMarJul, relbalanceJu-

lOct and relbalanceOctDec) measured at the ratio

scale level.

3.3. Non-time varying independent variables

Besides the sequential dimension, several non-time

varying covariates are created (see Table 3). Two

blocks of independent variables can be distinguished.

A first block captures behavioral/transactional-related

information. Some of these variables are related to the

number of accounts open(ed)/closed, while others

consider the account-balance total of the customer over

a certain period of time. We also include some

exogenous variables expressing when and in what

service category the next expiration will occur. Finally,

we tried to incorporate some regressors expressing how active the customer is: cf. his recency or number of months since being titular of at least one account. The second block of variables involves non-transactional data, i.e., socio-demographical information like gender, age and cohort.

Table 3
Non-time varying independent variables to predict churn behavior

st_days_until_next_exp: Number of days until the first next expiration date (from January 1st, 2003 on) for a service still in possession at December 31st, 2002. Standardized.
st_days_since_last_exp: Number of days between December 31st, 2002 and last expiration date of a service before January 1st, 2003. Standardized.
st_days_since_last_intentend: Number of days between December 31st, 2002 and date (before January 1st, 2003) on which the customer intentionally closed a service. Remark: mostly the expiration date is equal to the closing date.
dummy_cat1_next_exp ... dummy_cat14_next_exp: Dummies indicating in what service category the next coming expiration date in 2003 is.
nbr_purchevent_bef2003: Number of distinct purchase moments the customer has before January 1st, 2003. The minimum value is 3 due to customer selection.
st_nbr_serv_opened_bef2003: Number of accounts a customer ever opened before January 1st, 2003. Standardized.
nbr_serv_still_open_bef2003: Number of services the customer still possesses on December 31st, 2002.
nbr_serv_open_cat1_bef2003 ... nbr_serv_open_cat14_bef2003: Number of accounts still open in service category 1 (respectively 2, ..., 14) on December 31st, 2002.
dummy_lastcat1_opened_bef2003 ... dummy_lastcat14_opened_bef2003: Dummies indicating in what service category the customer last opened an account before January 1st, 2003.
nbr_serv_closed_bef2003: Number of accounts the customer has closed (intentionally closed or due to expiration) before January 1st, 2003.
nbr_serv_intentend_bef2003: Number of services the customer intentionally closed before expiration date, before January 1st, 2003.
nbr_serv_closed_cat1_bef2003 ... nbr_serv_closed_cat14_bef2003: Number of accounts expired or intentionally closed in service category 1 (respectively 2, ..., 14) before January 1st, 2003.
dummy_lastcat1_closed_bef2003 ... dummy_lastcat14_closed_bef2003: Dummies indicating in what service category the customer last closed (intentionally or due to expiration) a service before January 1st, 2003.
dummy_last_cat1_intentend_bef2003 ... dummy_last_cat14_intentend_bef2003: Dummies indicating in what service category the customer last intentionally closed an account before January 1st, 2003.
ratio_closed_open_bef2003: (Number of accounts closed before 2003 / number of accounts opened before 2003) * 100.
ratio_stillo_open_bef2003: (Number of services still open on December 31st, 2002 / number of services ever opened before January 1st, 2003) * 100.
st_recency: Time between last purchase moment and December 31st, 2002. Standardized.
st_avg_balance_total3: The average account-balance total of the customer over the last three months of 2002: (account-balance total October 2002 + account-balance total November 2002 + account-balance total December 2002) / 3. Standardized.
st_avg_balance_total6: The average account-balance total of the customer over the last six months of 2002. Standardized.
st_ratio_tot_3_6: Ratio of the total account-balance total of the customer over the last three months of 2002 and the total account-balance total of the customer over the last six months before 2003. Standardized.
st_avg_diff_balance_total2: ((account-balance total December 2002 − account-balance total November 2002) + (account-balance total November 2002 − account-balance total October 2002)) / 2. Standardized.
st_avg_diff_balance_total3: ((account-balance total December 2002 − account-balance total November 2002) + (account-balance total November 2002 − account-balance total October 2002) + (account-balance total October 2002 − account-balance total September 2002)) / 3. Standardized.
st_ratio_curr_avgtotal3: Account-balance total December 2002 / average account-balance total calculated over October, November and December 2002.
st_ratio_curr_avgtotal6: Account-balance total December 2002 / average account-balance total calculated over last six months of 2002.
st_avg_balance_min4to2: (account-balance total November 2002 + account-balance total October 2002 + account-balance total September 2002) / 3. Standardized.
st_ratio_curr_avgtotalmin4to2: Turnover in December 2002 / avg_account-balance total_min4to2. Standardized.
last_avg_reinvest_time: Latest average reinvest time before 2003. The average reinvest time indicates how long the customer waits before investing money that is suddenly available again.
st_last_use_homebanking: Number of days since last use of home banking. Deduced from the last logon date or, if missing, from the last transaction date or, if missing, from the first logon date or, if missing, from the home banking start date. Standardized.
months_last_titu_nozero: Number of months since the customer was titular of at least one account where the balance is non-zero.
dummy_contentieux: Dummy indicating whether the customer is at least at one account contentieux (bad debt) on December 31st, 2002.
lor: Length of relationship expressed in years.
age_31_Dec_2002: Age of the customer on December 31st, 2002.
age_becoming_customer: Age of the customer when becoming customer at the IFSP.
gender: Gender of the customer.
dummy_cohort_G1 ... dummy_cohort_G5: Dummies indicating whether the customer belongs to cohort group 1, 2, 3, 4, 5 or 6. Cohort 1: 1900 ≤ birth date ≤ 1924 (i.e., early baby boomers). Cohort 2: 1925 ≤ birth date ≤ 1945 (i.e., GI generation). Cohort 3: 1946 ≤ birth date ≤ 1955 (i.e., Silent Generation). Cohort 4: 1956 ≤ birth date ≤ 1964 (i.e., late baby boomers). Cohort 5: 1965 ≤ birth date ≤ 1980 (i.e., X generation). Cohort 6: birth date > 1980 (i.e., Y generation).

4. Results

4.1. An element and position-sensitive SAM analysis

4.1.1. Operation costs

We calculated the distance between each customer in both directions on the relative categorical account-balance

total dimension mainly using the weighted-Levenshtein distance. All sequences have length four, i.e., there are

four elements in each sequence (see Section 3.2). Typically, the operation weights for deletion and insertion are set

to 1. In this paper, we do not follow this approach. In order to ensure that different trajectories result in different

SAM distances, the operation weights for deletion and insertion are not set equal. We define wdel=1 and wins just

below 1; wins=0.9. Similarly, to favor different SAM distances, the weight of reordering is not set like in many

research studies to the sum of the cost of one deletion and insertion. We arbitrarily set the reordering weight to 2.3.

We intentionally define the reordering weight to be larger than the sum of the deletion and insertion weights to

simplify and speed up the search for the optimal alignment (cf. calculating an optimal alignment is equivalent to

finding the longest common subsequence of the two sequences compared when the substitution weight is at least

equal to the sum of the deletion and insertion weights).

4.1.2. Element-sensitive costs

We adapted this rather conventional SAM-distance calculation to an element-sensitive SAM analysis to better

reflect the research context. Besides the operation weights we charge an element-based cost depending on the

element in the source being deleted or inserted. As can be seen from the values assigned to the categorical relative

account-balance total variable, (cf. Table 2), some values are more related to each other. For instance, values 4 and

8 are more divergent than values 0 and 1. Therefore, a different extra cost is added depending on the element being

inserted in or deleted from the source. The latter increases the variance in the final SAM distances. We set the


element-based costs for deletion and insertion equal. The element-cost setting does not distinguish between the

categories of relbalance reflecting positive evolutions in relbalance and categories expressing a negative evolution

on relbalance (e.g. equal element costs for element value 1 or 5). See Table 2 (cf. supra).

4.1.3. Absolute position-sensitive costs

Besides incorporating an element-based cost in the calculation of the conventional SAM distances, we convert

the SAM-distance measure into a position-sensitive measure. Normally, SAM describes differences between

sequences only in terms of the difference in sequential order of the elements, by changing the order of the common

elements of the source sequence if it differs from that of the common elements in the target (i.e., reordering), and in

terms of the difference in element composition by deleting the unique elements from the source and inserting the

unique elements of the target sequence in the source. Hence, conventional SAM is not sensitive to the positions at

which elements are inserted, deleted or reordered. After all, in bio-informatics studies, this position-sensitivity is

not useful, as the elements in DNA strings are relatively independent from each other. Consecutive DNA elements

are not likely to affect one another. However, one can think of many other sequences where the elements are

influencing each other. Whenever the elements in the sequence are measured at consecutive time points, we can

assume that previous values influence subsequent values in the sequence. For instance, in activity-sequence

analysis sequential relationships between activities is a primary concern. Likewise, we assume in our application

that the elements in the sequence consisting of the evolution in relative account-balance total of the customer over

2002, are correlated. Although there are many applications where the elements in the sequence are influencing

each other, there is, to our knowledge, only one research study, which incorporates the positional component into

the original SAM distance concept.

Joh et al. [22] developed a position-sensitive sequence-alignment analysis for activity analysis. Position-

sensitivity is taken into account by considering the distance by which the sequential order of the source

element is changed. The reordering distance h is measured as h = |i − j|, where i and j are the positions of the

reordered elements in the source and target sequences. The position-sensitive SAM distance is defined as

follows:

$$d(a, b) = \min\left[ w_d D + w_i I + g \sum_{r=1}^{R} h_r \right] \qquad (1)$$

where w_d is the weight for deletion; D the number of deletions; w_i the weight for insertion; I the number of insertions; g the weight for reordering; R the number of reorderings; and h_r the distance of reordering the r-th common element.

The authors show that for larger values of the reordering weight, there is a significant difference between

the clustering solution found using the traditional SAM measure and the one resulting from the position-

sensitive SAM analysis. Whereas Joh et al. [22] developed a relative position-sensitive SAM analysis, i.e., a

SAM analysis that is sensitive to the difference in positions of common elements that need to be reordered, we

wish to develop an absolute position-sensitive SAM analysis which does not only consider the positions of

elements reordered, but also the positions at which elements are deleted from the source as well as the

positions in the source at which elements are inserted. We prefer an absolute position-sensitive measure to a

relative measure because we wish to distinguish between operations applied in the beginning of the sequence

and operations performed at the end of the sequence. The rationale for this comes from the fact that we assume

recent evolutions in relative account-balance total to influence the customers’ churn probability more

intensively than the customers’ relative account-balance total in the beginning of the sequence, e.g. the relative

account-balance total of the customer in the period October–December 2002 probably influences the churn

probability more than the relative account-balance total of the customer between January–March 2002.

Consider example 1 and 2. Let sequence a be the source, sequence b the target. In both examples, we need to

change the order of elements 6 and 1. Whereas in the target element 6 precedes element 1, in the source this is

reverse. Using a relative position-sensitive distance measure like Joh et al. [22], the SAM cost from reordering

for example 1 and 2 would be the same. Supposing we reorder element 6 and the reordering weight is 1, we obtain a reordering cost for example 1 of: |2 − 1| = 1 and for example 2 of: |4 − 3| = 1. From these examples it follows that a relative position-sensitive SAM measure does not distinguish between sequences where the reordering is applied over the same number of positions, but at distinct positions in the source. Whereas a relative position-sensitive SAM measure keeps the distances symmetric (cf. reordering cost defined using difference in position of element to reorder in source and target), an absolute position-sensitive SAM method results in asymmetric distances (cf. reordering cost defined using only position of reordered element in the source).

Example 1: position 1 2 3 4; target b: 6 1 7 7; source a: 1 6 7 7.
Example 2: position 1 2 3 4; target b: 7 7 6 1; source a: 7 7 1 6.

The absolute position-sensitive reordering cost multiplies the position of the reordered element in the source with the reordering weight. As mentioned earlier, we also convert the deletion and insertion costs into position-sensitive costs. The absolute position-sensitive deletion cost considers the position in the source where the element is deleted. Likewise, the insertion cost is made position-sensitive by incorporating the position of the element in the target to be inserted in the source. We use the position of the element in the target as a proxy for the position in the source where the target element is inserted. Next we describe how the final element and absolute-position sensitive SAM distances are calculated.

4.1.4. Hay's pairwise-sequence alignment algorithm

A major concern in sequence comparison and SAM analysis is the algorithm used to calculate the distances within a reasonable time window. To address this computational complexity problem [39] we apply an algorithm by Hay [16] that structures the equalizing process in a fast and easy way. It has not yet been proven to always lead to an optimal alignment, i.e., the trajectory (the sequence of operations necessary to equalize the source with the target) resulting in the smallest distance possible. Yet, the algorithm mostly does.

Step 1: Identify the longest common substrings respecting the sequential order of elements. It is well known that if the substitution/reordering weight is at least the sum of the deletion and insertion weights, calculating an optimal alignment is equivalent to finding the longest-common subsequences of the two sequences compared. These longest common substrings represent the structural integrity of the two sequences or the structural skeleton [32]. In case there is more than one possible longest common substring, we opt for the longest substring of which the absolute sum of the differences in positions between all common elements in source and target is smallest, as the latter prefers matches between source and target at less remote positions above matches at more distant source-target positions. In example 3 we compare a customer having a rather small evolution in relative account-balance total (e.g. 0 1 5 0) with another customer starting with rather small evolutions in relative customer account-balance total but ending with a decreasing account-balance total (e.g. 5 1 2 3).

Example of sequence pairs: 5 1 2 3 versus 0 1 5 0. Longest common substring: 5 or 1. We opt for identification of the common element 1.

0 1 5 0 0 1 5 0


Step 2: Identify elements, which are not included in the substring and appear in the source and the target. Count

one reordering for each such identified element.

Example of sequence pairs: 5 1 2 3 versus 0 1 5 0. Common element not appearing in the longest common substring: 5.

At the end of this step, the order of the substituted elements has been changed. In the above example, the order

of element 5 is changed to precede 1 rather than to succeed 1. The total reordering cost is the sum of the product of

the reordering weight with the position of the reordered element in the source.

$$\mathrm{Cost}_{\mathrm{reordering}} = \sum_{r=1}^{R} g \cdot \mathit{pos}_{r\_reorel} \qquad (2)$$

where R is the number of reorderings; g the reordering weight; and pos_{r_reorel} the absolute position of the r-th reordered element in the source.

Step 3: Identify elements not included in the substring and which appear in either one of the compared

sequences. Count one deletion operation for each unique element in the source, one insertion operation for each

unique element in the target.

Example of sequence pairs: 5 1 2 3 versus 0 1 5 0. Elements unique to the source or target: the two zero elements are deleted from the source, elements 2 and 3 are inserted in the source.

The costs for deletions and insertions are, besides being position-sensitive, also element-sensitive.

$$\mathrm{Cost}_{\mathrm{deletion}} = \sum_{d=1}^{D} w_d \cdot (c_{d\_e} + \mathit{pos}_{d\_del}) \qquad (3)$$

where w_d is the weight for deletion; c_{d_e} the cost for deletion of the d-th element with a certain value (i.e., element cost); and pos_{d_del} the cost for deletion of the d-th element at a given position in the source (i.e., position cost).

$$\mathrm{Cost}_{\mathrm{insertion}} = \sum_{i=1}^{I} w_i \cdot (c_{i\_e} + \mathit{pos}_{i\_ins}) \qquad (4)$$

where w_i is the weight for insertion; c_{i_e} the cost for insertion of the i-th element with a certain value (i.e., element cost); and pos_{i_ins} the cost for insertion of the i-th element from a given position in the target (i.e., position cost). Applying this algorithm, using

the operation weights, the element-based costs and the position costs, the total SAM distance is calculated as

follows:

$$\mathrm{SAMdist} = \min\left[ \sum_{r=1}^{R} g \cdot \mathit{pos}_{r\_reorel} + \sum_{d=1}^{D} w_d \cdot (c_{d\_e} + \mathit{pos}_{d\_del}) + \sum_{i=1}^{I} w_i \cdot (c_{i\_e} + \mathit{pos}_{i\_ins}) \right] \qquad (5)$$

The total SAM distance for our example is:

$$\mathrm{SAMdist} = (2.3 \times 3) + [(1 \times (0.2 + 1)) + (1 \times (0.2 + 4))] + [(0.9 \times (0.6 + 3)) + (0.9 \times (0.8 + 4))] = 19.86 \qquad (6)$$
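The worked example of Eq. (6) can be reproduced term by term. The sketch below hard-wires the operations identified by the three steps (one reordering of element 5, two deletions of element 0, insertions of elements 2 and 3), with the weights from Section 4.1.1 and the element costs from Table 2; the variable names are ours.

```python
# weights from Section 4.1.1; element costs from Table 2
G, W_DEL, W_INS = 2.3, 1.0, 0.9
ELEMENT_COST = {0: 0.2, 1: 0.4, 2: 0.6, 3: 0.8, 4: 1.0,
                5: 0.4, 6: 0.6, 7: 0.8, 8: 1.0}

# source 0 1 5 0, target 5 1 2 3 (example 3)
cost_reordering = G * 3                                   # element 5 at source position 3
cost_deletion = sum(W_DEL * (ELEMENT_COST[0] + pos)       # the two zeros,
                    for pos in (1, 4))                    # source positions 1 and 4
cost_insertion = sum(W_INS * (ELEMENT_COST[e] + pos)      # elements 2 and 3,
                     for e, pos in ((2, 3), (3, 4)))      # target positions 3 and 4

sam_dist = cost_reordering + cost_deletion + cost_insertion
print(round(sam_dist, 2))  # 19.86, matching Eq. (6)
```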

Page 13: Incorporating sequential information into traditional classification models by using an element/position-sensitive SAM

able 4

inal five-cluster solution on training sample

luster Old clusters Frequency N Percent N Frequency churners Percent churners

1 3281 40.37 82 2.50

2 1629 20.04 42 2.58

3 1040 12.79 26 2.50

4–5 1101 13.55 20 1.82

6–23 1076 13.24 30 2.79

A. Prinzie, D. Van den Poel / Decision Support Systems 42 (2006) 508–526520

I

s

i

4

b

c

c

h

v

f

c

c

s

w

s

w

c

s

c

c

i

t

4

c

c

w

a

c

r

g

f

f

c

t

t

T

F

C

1

2

3

4

5

n this paper, we calculate the element- and position-sensitive SAM distance between each customer in the training

ample, in both directions, on the sequential dimension relative account-balance total. These distances are inserted

nto a distance matrix used as input for the asymmetric Taylor–Butina clustering method.

.2. Asymmetric clustering using Taylor–Butina algorithm

The asymmetric SAM distances calculated between the training sequences on evolution in relative account-

alance total are used as input for the asymmetric, disjoint Taylor–Butina algorithm with the aim to distinguish

lusters with respect to the sequential dimension and indirectly also with respect to the dependent variable, i.e.,

hurn. Depending on the threshold (in range [0,1]) used, the clusters obtained by the algorithm are more or less

omogeneous. Lower thresholds result in smaller, but more homogeneous clusters, whereas higher threshold

alues result in larger, but less homogeneous clusters. In our application, however, our primary objective is not

inding the optimal clustering solution in terms of homogeneity but to keep the number of clusters limited as the

luster membership is incorporated by means of dummies into the final classification model. For each cluster, a

ertain minimum number of customers is needed, to enhance the possibility that the cluster dummies would

ignificantly influence the dependent variable. As the Taylor–Butina algorithm iteratively identifies the sequence

ith the highest number of neighbors, it follows that the first cluster defined is the largest one and that

ubsequent clusters have fewer and fewer members. Therefore, it might be that small thresholds are not optimal

ith respect to our predictive goal. Our ambition is to distinguish a reasonable number of clusters with 1) a

ertain minimum number of customers and 2) with the highest possible homogeneity. We experimented with

everal levels of the similarity threshold. It seems that we need a rather high threshold to keep the number of

lusters limited. For instance, for a threshold of 0.80, we obtain 130 clusters of which the biggest cluster only

ontains 758 customers (i.e., 9.32%) and of which clusters 89 to 130 hold less than five customers. We

nvestigated for which threshold the first cluster keeps a high enough number of customers. Employing a

hreshold of 0.9999 resulted in a 23-cluster solution of which the first cluster counts 3281 customers (i.e.,

0.37%). As 23 clusters is still quite high a number, which would result in 22 cluster dummies in the

lassification model, and as this cluster solution still creates some rather small populated clusters (e.g., from

luster 10 on the clusters have less than 100 members), we decided to group some clusters together. Therefore,

e performed a dsecond-order clustering’ by using the 23 representative sequences (i.e., centrotypes) as input to

subsequent clustering exercise, and investigated which cluster representatives are taken together into new

lusters. This resulted in a final five-cluster solution (see Table 4).

For each of these five clusters, we defined five representatives. We prefer to have more than one representative per cluster to enhance the quality of cluster allocation of the hold-out sequences. By default, the grouping module of the Mesa Suite software package version 2.1 returns only one representative per cluster. Hence, for clusters 1 to 3 we already have one representative each, for cluster 4 we have two centrotypes and for cluster 5 we have 18 centrotypes. For clusters 1 to 4 we need to find additional representatives, whereas for cluster 5 we need to limit the number of representatives to five. Therefore, for each cluster defined, we cluster on the cluster-specific SAM distances until we get a five-cluster solution, providing exactly five representatives for that cluster.


Table 5
Allocation of hold-out sequences to five clusters identified on training sample

Cluster   Frequency N   Percent N   Frequency churners   Percent churners
1         4514          55.60       127                  2.81
2         601           7.40        32                   5.32
3         916           11.28       4                    0.44
4         842           10.36       14                   1.66
5         1254          15.45       22                   1.75


In a next step, the hold-out sequences are assigned to the five clusters identified on the training sample. We calculated the distances between the hold-out sequences and the groups of five representative cluster sequences (5 x 5), in both directions. As a proxy for the asymmetric distance between a hold-out sequence and a representative, we average the distance from the hold-out sequence to the centrotype and the distance from the centrotype to the hold-out sequence. The latter approach is common practice in studies performing calculations on asymmetric proximities. After all, it has been proven [34] that each asymmetric distance matrix decomposes into a symmetric matrix S of averages s_ij = (q_ij + q_ji)/2 and a skew-symmetric matrix A with elements a_ij = (q_ij - q_ji)/2. Using these proxies, each hold-out sequence is assigned to the cluster to which it has the smallest average distance (i.e., the smallest average distance towards the five cluster representatives). Table 5 gives an overview of the cluster distribution in the hold-out sample.
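A sketch of this allocation rule, assuming the distances between a hold-out sequence and the five representatives of each of the five clusters have already been computed in both directions:

    import numpy as np

    def assign_cluster(d_to, d_from):
        """d_to, d_from: (5, 5) arrays of (clusters x representatives)
        distances from and to the hold-out sequence."""
        proxy = (d_to + d_from) / 2.0           # symmetric part of each pair
        avg_per_cluster = proxy.mean(axis=1)    # average over 5 representatives
        return int(np.argmin(avg_per_cluster))  # nearest cluster wins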

4.3. Defining the best subset of non-time varying independent variables

Before we compare the predictive performance of the LogSeq model with that of the LogNonseq model, we first define a best subset of non-time varying independent variables to include in the logistic-regression models besides the sequential dimension relbalance. Employing the leaps-and-bounds algorithm [10] on the non-time varying independent variables in Table 2, we compared the best subsets of size 1 to 20 on their regression sums of squares. As expected, the marginal gain in the performance criterion decreases as more independent variables are added. From Fig. 2, we decide that a subset containing the best five variables represents a good balance between the number of independents included and the variance explained by the model. The independent variables in the best subset of size five are listed in Table 6.

[Fig. 2 shows the variable-selection curve: the score value (roughly 500 to 1200) plotted against the number of variables in the best subsets (1 to 20).]

Fig. 2. Number of variables in best subsets.
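As an illustration of the selection criterion only, the sketch below scores every candidate subset of a given size by its regression sum of squares and retains the best subset per size; the actual leaps-and-bounds algorithm [10] prunes this exhaustive search by branch-and-bound and is not reproduced here.

    from itertools import combinations
    import numpy as np

    def best_subsets(X, y, max_size):
        """X: (n, p) matrix of candidate variables, y: (n,) target.
        Returns, per subset size, the columns and the regression sum of
        squares of the best-scoring subset."""
        n, p = X.shape
        total_ss = (y - y.mean()) @ (y - y.mean())
        best = {}
        for s in range(1, max_size + 1):
            for cols in combinations(range(p), s):
                Xs = np.column_stack([np.ones(n), X[:, cols]])
                beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
                resid = y - Xs @ beta
                model_ss = total_ss - resid @ resid  # explained sum of squares
                if s not in best or model_ss > best[s][1]:
                    best[s] = (cols, model_ss)
        return best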


Table 6
Best subset selection of size 5

Best subset of size 5          Variable description
st_days_until_next_exp         Number of days until the first next expiration date (from January 1st, 2003 on) for a service still in possession on December 31st, 2002. Standardized.
st_ratio_curr_avgtotal3        Account-balance total December 2002 / average account-balance total calculated over October, November and December 2002.
st_ratio_stillo_open_bef2003   (Number of services still open before January 1st, 2003 / number of services ever opened before January 1st, 2003) * 100. Standardized.
st_months_last_titu_nozero     Number of months since the customer was titular of at least one account with a non-zero balance.
st_days_since_last_exp         Number of days between December 31st, 2002 and the last expiration date of a service before January 1st, 2003. Standardized.


4.4. Comparing churn predictive performance of LogSeq and LogNonseq models

We compare the predictive performance, measured by AUC on the hold-out sample, of the LogNonseq model with that of the LogSeq model. Both logistic-regression models include the best five non-time varying variables from Table 6 as well as the sequential dimension relbalance. However, in the LogNonseq model the sequential dimension is incorporated by means of four non-time varying independent variables (i.e., st_relbalanceJanMar, st_relbalanceMarJul, st_relbalanceJulOct and st_relbalanceOctDec), whereas in the LogSeq model the sequential dimension is operationalized by four cluster dummies. We hypothesize that the churn predictive performance of the LogSeq model will be significantly higher than that of the LogNonseq model, because operationalizing a sequential dimension by non-time varying independent variables neglects the sequential information of the dimension.
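A sketch of the comparison itself, assuming the design matrices for both specifications have been assembled (the variable names below are hypothetical); the significance test on the AUC difference is not reproduced, only the hold-out AUCs:

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    def holdout_auc(X_train, y_train, X_hold, y_hold):
        """Fit a binary logistic regression and score it by hold-out AUC."""
        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        churn_prob = model.predict_proba(X_hold)[:, 1]
        return roc_auc_score(y_hold, churn_prob)

    # auc_nonseq = holdout_auc(X_nonseq_train, y_train, X_nonseq_hold, y_hold)
    # auc_seq    = holdout_auc(X_seq_train,    y_train, X_seq_hold,    y_hold)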

Table 7 shows that our hypothesis is confirmed. There is a significant difference (chi-square = 17.69, p = 0.0000259) in predictive performance between the binary logistic-regression model including the best subset of five independent variables and the sequential dimension operationalized by non-time varying variables, and the logistic-regression model with the same set of independent variables but with the sequential dimension expressed by cluster dummies deduced from the sequence-alignment analysis. Table 8 shows the parameter estimates and significance levels of the regressors for both models. Where possible, the standardized estimates are given.

Although not all cluster dummies are significant at the alpha = 0.05 level, the LogSeq model significantly outperforms the LogNonseq model. The insignificance of cluster dummy 3 (p = 0.1744) might stem from a serious drop in the percentage of churners from the training (2.58%) to the hold-out sample (0.44%), i.e., to below 1% churners. Looking at the estimates for the relative account-balance total dimension in the LogNonseq model, we find that only the relative evolution in account-balance total over the last six months seems to have a significant effect on the churn probability in 2003 (cf. st_relbalanceJulOct and st_relbalanceOctDec). The bigger the positive difference in account-balance total, the less likely the customer will churn. Considering the non-sequential regressors, all five are significant for the LogNonseq model, and all but one (st_months_last_titu_nozero) for the LogSeq model. All effects have the expected sign. The smaller the number of days until the next expiration date from January 1st, 2003 onwards, the higher the churn probability (cf. st_days_until_next_exp). The effect of st_ratio_curr_avgtotal3 is rather small, so the difference in its sign between the LogNonseq and LogSeq models should not be a major concern.

Table 7
Churn predictive performance of LogSeq and LogNonseq models on hold-out sample

Performance measure   LogNonseq model   LogSeq model
AUC                   0.906             0.964


Table 8
Parameter estimates for LogNonseq and LogSeq models

Variable (LogNonseq / LogSeq)           Estimate              Pr > Chi-Square
                                        LogNonseq   LogSeq    LogNonseq   LogSeq
Intercept                               -16.80      -8.23     <0.0001     <0.0001
Best subset
  st_days_until_next_exp                -1.25       -1.26     <0.0001     <0.0001
  st_ratio_curr_avgtotal3               0.28        -0.09     0.0187      0.0109
  st_ratio_stillo_open_bef2003          -1.25       -1.28     <0.0001     <0.0001
  st_months_last_titu_nozero            0.12        0.02      0.0048      0.4392
  st_days_since_last_exp                0.50        0.49      <0.0001     <0.0001
Sequential dimension
  st_relbalanceJanMar / cl1             -21.79      0.29      0.2169      0.0401
  st_relbalanceMarJul / cl2             -8.86       0.25      0.9299      0.1034
  st_relbalanceJulOct / cl3             -38.25      0.23      0.0890      0.1744
  st_relbalanceOctDec / cl4             -240.45     0.35      0.0011      0.0447


We may conclude that our new procedure, which combines sequence analysis with a traditional classification model, is a possible strategy for overcoming the caveat of traditional classification models designed for modeling non-time varying independent variables. In this paper, we have modeled a time-varying independent variable by preceding a traditional binary logistic-regression model by an element/position-sensitive sequence alignment on the sequential dimension. This resulted in cluster-membership information in terms of the sequential dimension as well as, implicitly, with respect to the dependent variable. We decided to include this cluster-membership information by means of dummies in the final classification model. Alternatively, we could use the cluster information to build cluster-specific classification models. In our application, however, this approach is unrealistic because one of the five clusters identified on the hold-out sample has less than 1% churners (cf. cluster 3). So when the predicted event is rather rare, building cluster-specific classification models might be impossible because too few people experience the event in some of the identified clusters. Another drawback of building cluster-specific models lies in the practical difficulty of including several sequential dimensions in the classification model. Whereas it is easy to include another set of cluster dummies in the classification model for each sequential dimension, building cluster-specific classification models on more than one sequential dimension implies simultaneously clustering on several sequential dimensions employing multidimensional SAM analysis. As computational complexity is already a concern for unidimensional SAM, it becomes even more of an issue for multidimensional SAM analysis.

5. Conclusion

In this paper, we provide a new procedure that overcomes the inability of traditional classification models to incorporate sequential exogenous variables. Instead of transforming the sequential dimension into non-time varying variables, thereby ignoring the sequential information, a better practice is to employ a sequence-analysis method for modeling the time-varying independent variable and, subsequently, to incorporate this information into the traditional classification method, which is designed for modeling non-time varying covariates. This way, the best of both methods is combined. One possible strategy is to cluster the customers on the sequential dimension using SAM (in this paper, an element/position-sensitive SAM) and to incorporate this cluster information in the classification model by dummy variables. The latter approach is promising, as the results from the attrition models at the IFSP confirm our hypothesis of improved predictive performance when modeling the sequential dimension by sequence-analysis methods instead of operationalizing it as non-time varying variables. Besides this approach of preceding a traditional classification model, like binary logistic regression, by a sequence-analysis method to model the sequential dimension, other approaches might exist. It might be worthwhile to elaborate on other procedures to combine sequence-analysis methods, designed to model sequential information, with traditional classification methods, suited to model non-time varying independent variables. Another avenue for further research is exploring how sequence-analysis methods other than sequence alignment could enhance the modeling of sequential covariates in classification models. In this paper, we included only one sequential dimension; further research should incorporate several sequential covariates. Finally, we wish to bring the parameter-setting issue to the attention of researchers considering applying or elaborating on our new procedure. The highly tuned edit distances used for our churn application might not be valid in other applications. The researcher should adapt the operational weights, element costs and position costs of the sequence alignment to fit the application at hand. Similarly, the threshold used for the asymmetric clustering will need fine-tuning in order to obtain a good clustering solution for other applications.

Acknowledgements

The authors would like to thank the anonymous financial-services company for providing the data. Next, we extend our thanks to John D. MacCuish and Norah E. MacCuish for providing a free academic license of the software package Mesa Suite Version 1.2, Grouping Module (www.mesaac.com) and for their kind assistance. Moreover, we would like to thank Bart Lariviere, PhD candidate at Ghent University, for sharing his knowledge of the data warehouse of the company. Finally, we express our thanks to 1) Ghent University for funding the PhD project of Anita Prinzie (BOF Grant no. B00141), and 2) the Flemish Research Fund (FWO Vlaanderen) for providing the funding for the computing equipment to complete this project (Grant no. G0055.01).

References

[1] A. Abbott, Sequence analysis: new methods for old ideas, Annual Review of Sociology 21 (1995) 93–113.
[2] A. Abbott, A. Hrycak, Measuring resemblance in sequence data: an optimal matching analysis of musicians' careers, American Journal of Sociology 96 (1) (1990) 144–185.
[3] B. Baesens, G. Verstraeten, D. Van den Poel, Bayesian network classifiers for identifying the slope of the customer-lifecycle of long-life customers, European Journal of Operational Research 156 (2) (2004) 508–523.
[4] C.B. Bhattacharya, When customers are members: customer retention in paid membership contexts, Journal of the Academy of Marketing Science 26 (1) (1998) 31–44.
[5] W. Buckinx, E. Moons, D. Van den Poel, G. Wets, Customer-adapted coupon targeting using feature selection, Expert Systems with Applications 26 (4) (2004) 509–518.
[6] D. Butina, Unsupervised data base clustering based on Daylight's fingerprint and Tanimoto similarity: a fast and automated way to cluster small and large data sets, Journal of Chemical Information and Computer Sciences 39 (4) (1999) 747–750.
[7] A. Cohen, R.I. Ivry, S.W. Keele, Attention and structure in sequence learning, Journal of Experimental Psychology: Learning, Memory and Cognition 16 (1) (1990) 17–30.
[8] M.R. Colgate, P.J. Danaher, Implementing a customer relationship strategy: the asymmetric impact of poor versus excellent execution, Journal of the Academy of Marketing Science 28 (3) (2000) 375–387.
[9] W.S. DeSarbo, G. De Soete, On the use of hierarchical clustering for the analysis of nonsymmetric proximities, Journal of Consumer Research 11 (1) (1984) 601–610.
[10] G.M. Furnival, R.W. Wilson, Regressions by leaps and bounds, Technometrics 16 (4) (1974) 499–511.
[11] J. Ganesh, M.J. Arnold, K.E. Reynolds, Understanding the customer base of service providers: an examination of the differences between switchers and stayers, Journal of Marketing 64 (3) (2000) 65–87.
[12] D. Green, J.A. Swets, Signal Detection Theory and Psychophysics, John Wiley & Sons, New York, USA, 1966.
[13] M. Gribskov, J. Devereux (Eds.), Sequence Analysis Primer, Oxford University Press, New York, USA, 1992.
[14] J. Hair, R. Andersen, R. Tatham, W. Black, Multivariate Data Analysis, Prentice Hall, 1998.
[15] J. Hartigan, Clustering Algorithms, Wiley, New York, USA, 1975.
[16] B. Hay, Sequence Alignment Methods in Web Usage Mining, Doctoral Dissertation, LUC, Belgium, 2003.
[17] B. Hay, G. Wets, K. Vanhoof, Web usage mining by means of multidimensional sequence alignment methods. WEBKDD 2002 — mining web data for discovering usage patterns and profiles, Lecture Notes in Artificial Intelligence 2703 (2003) 50–65.
[18] W.J. Hopp, A sequential model of R&D investment over an unbounded time horizon, Management Science 33 (4) (1987) 500–508.
[19] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice Hall Advanced Reference Series, Englewood Cliffs, NJ, 1998.
[20] R.A. Jarvis, E.A. Patrick, Clustering using a similarity measure based on shared nearest neighbors, IEEE Transactions on Computers 22 (11) (1973) 1025–1034.
[21] C.H. Joh, T.A. Arentze, H.J.P. Timmermans, Multidimensional sequence alignment methods for activity–travel pattern analysis: a comparison of dynamic programming and genetic algorithms, Geographical Analysis 33 (3) (2001) 247–270.
[22] C.H. Joh, T.A. Arentze, H.J.P. Timmermans, A position-sensitive sequence alignment method illustrated for space–time activity-diary data, Environment and Planning A 33 (2) (2001) 313–338.
[23] J. Jonz, Textual sequence and 2nd language comprehension, Language Learning 39 (2) (1989) 207–249.
[24] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.
[25] E. Kim, W. Kim, Y. Lee, Combination of multiple classifiers for the customer's purchase behavior prediction, Decision Support Systems 34 (2) (2003) 167–175.
[26] K. Krishna, R. Krishnapuram, A clustering algorithm for asymmetrically related data with applications to text mining, Proceedings of the Tenth International Conference on Information and Knowledge Management, 2001, pp. 571–573.
[27] B. Lariviere, D. Van den Poel, Investigating the role of product features in preventing customer churn by using survival analysis and choice modeling: the case of financial services, Expert Systems with Applications 27 (2) (2004) 277–285.
[28] V.I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals, Cybernetics and Control Theory 10 (8) (1965) 707–710.
[29] J. MacCuish, N.E. MacCuish, Mesa Suite Version 1.2 Grouping Module, Mesa Analytics and Computing, LLC, www.mesaac.com, 2003.
[30] J. MacCuish, C. Nicolaou, N.E. MacCuish, Ties in proximity and clustering compounds, Journal of Chemical Information and Computer Sciences 41 (1) (2001) 134–146.
[31] S. McBrearty, The Sangoan–Lupemban and middle stone-age sequence at the Muguruk site, World Archaeology 19 (3) (1988) 388–420.
[32] M.A. McClure, T.K. Vasi, W.M. Fitch, Comparative analysis of multiple protein-sequence alignment methods, Molecular Biology and Evolution 11 (4) (1994) 571–592.
[33] C.S. Myers, L.R. Rabiner, A level building dynamic time warping algorithm for connected word recognition, IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-29 (2) (1981) 284–297.
[34] B. Noble, W. Daniel, Applied Linear Algebra, Prentice-Hall, New Jersey, 1988, p. 20.
[35] K. Ozawa, CLASSIC: a hierarchical clustering algorithm based on asymmetric similarities, Pattern Recognition 16 (2) (1983) 201–211.
[36] A. Prinzie, D. Van den Poel, Investigating purchasing-sequence patterns for financial services using Markov, MTD and MTDg models, European Journal of Operational Research (2006) (in press).
[37] A.E. Raftery, S. Tavare, Estimation and modelling repeated patterns in high order Markov chains with the mixture transition distribution model, Applied Statistics 43 (1) (1994) 179–199.
[38] R. Sabherwal, D. Robey, Reconciling variance and process strategies for studying information system development, Information Systems Research 6 (1995) 303–327.
[39] D. Sankoff, J. Kruskal, Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley, Advanced Book Program, Mass., 1983.
[40] R. Sethuraman, V. Srinivasan, K. Doyle, Asymmetric and neighborhood cross-price effects: some empirical generalizations, Marketing Science 18 (1) (1999) 23–41.
[41] R. Taylor, Simulation analysis of experimental design strategies for screening random compounds as potential new drugs and agrochemicals, Journal of Chemical Information and Computer Sciences 35 (1) (1995) 59–67.
[42] P. Toth, D. Vigo, A heuristic algorithm for the symmetric and asymmetric vehicle routing problems with backhauls, European Journal of Operational Research 113 (3) (1999) 528–543.
[43] A. Tversky, J.W. Hutchinson, Nearest neighbor analysis of psychological spaces, Psychological Review 93 (1) (1986) 3–22.
[44] G.L. Urban, P.L. Johnson, J.R. Hauser, Testing competitive market structures, Marketing Science 3 (1984) 83–112.
[45] D. Van den Poel, B. Lariviere, Customer attrition analysis for financial services using proportional hazard models, European Journal of Operational Research 157 (1) (2004) 196–217.
[46] R.A. Wagner, M.J. Fischer, The string-to-string correction problem, Journal of the Association for Computing Machinery 21 (1974) 168–173.
[47] M.S. Waterman, Introduction to Computational Biology: Maps, Sequences and Genomes, Chapman and Hall, USA, 1995.
[48] W.C. Wilson, Activity pattern analysis by means of sequence-alignment methods, Environment and Planning A 30 (6) (1998) 1017–1038.
[49] B. Zielman, W.J. Heiser, Models for asymmetric proximities, British Journal of Mathematical & Statistical Psychology 49 (1996) 127–146.

Anita Prinzie is a PhD candidate in Economics and Business Administration at Ghent University, Belgium. She received her master's degree in Marketing Analysis and Planning at Ghent University, Belgium. Her PhD thesis investigates the use of sequence-analysis methods for CRM purposes (churn and cross-sell analysis).


Dirk Van den Poel is associate professor of marketing at the Faculty of Economics and Business Administration of Ghent University in Belgium. He heads a competence center on analytical customer relationship management (aCRM). He received his degree of management/business engineer as well as his PhD from K.U.Leuven (Belgium). His main fields of interest are studying consumer behavior from a quantitative perspective (CRM), data mining (genetic algorithms, neural networks, random forests), and operations research.