
Customer Relationship Management

using Ensemble Methods

by

Yue Cui

A thesis submitted in conformity with the requirements

for the degree of Master of Applied Science

Department of Chemical Engineering & Applied Chemistry

University of Toronto

© Copyright by Yue Cui 2018


Customer Relationship Management using Ensemble Methods

Yue Cui

Master of Applied Science

Department of Chemical Engineering & Applied Chemistry

University of Toronto

2018

Abstract

This thesis aims to provide a method for customer relationship management prediction to beat

the in-house classification method implemented by the data provider company. By reviewing

the common machine learning algorithms, we recommend ensemble methods to predict the

targets. Three different ensemble methods are implemented in the thesis: random forest, gradient

boosting decision trees and ensemble selection. With the results, we conclude that all ensemble

methods outperform the benchmark, reduce the positive predictions and increase the true

positive rates. Ensemble selection performs the best, followed by the gradient boosting decision

trees. In addition, the results indicate that the ensemble methods' running times increase significantly compared to the benchmark. The results also indicate that careful feature

selection can significantly simplify the training and prediction process. We further discuss the

potential applications in marketing using the prediction results and the trade-off between

accuracy and computational complexity when applying ensemble methods.


Acknowledgements

Immeasurable appreciation and deepest gratitude for the academic, educational and personal support, and for the belief in me, of the following persons, who have contributed to making this thesis possible:

Professor Joseph C. Paradi, my respected thesis supervisor, for his sincere support, guidance,

valuable suggestions and comments that benefited me in the completion of this work.

My parents, whose love and support made it possible for me to complete this project.

My friends and my fellow CMTE candidates, for their companionship and encouragement.

Lastly, I humbly extend my thanks to all concerned persons who helped and encouraged me

through my graduate studies.


Table of Contents

Acknowledgements....................................................................................................................... iii

List of Tables ............................................................................................................................... vii

List of Figures ............................................................................................................................. viii

Introduction..................................................................................................................................... 1

1.1 Motivation ................................................................................................................ 1

1.2 Objectives ................................................................................................................. 2

1.3 Scope ........................................................................................................................ 2

Literature Review ........................................................................................................................... 4

2.1 Supervised Learning ................................................................................................. 4

2.1.1 Basic Naïve Bayes Classifier .................................................................................... 4

2.1.2 Logistic Regression .................................................................................................. 5

2.1.3 Decision Tree ............................................................................................................ 6

2.1.4 Support Vector Machine ........................................................................................... 7

2.1.5 Artificial Neural Network ......................................................................................... 8

2.1.6 K-Nearest Neighbors Algorithm .............................................................................. 9

2.2 Ensemble Learning ................................................................................................. 10

2.2.1 Bagging ................................................................................................................... 11

2.2.2 Random Forest ........................................................................................................ 11

2.2.3 Boosting .................................................................................................................. 12

2.2.4 Gradient Boosting ................................................................................................... 13

2.2.5 Ensemble Selection ................................................................................................. 14

Data ............................................................................................................................................... 15


Technology ................................................................................................................................... 17

4.1 Python .................................................................................................................... 17

4.2 Libraries ................................................................................................................. 17

4.2.1 Scikit-learn .............................................................................................................. 17

4.2.2 TensorFlow ............................................................................................................. 18

4.2.3 TFlearn .................................................................................................................... 18

4.2.4 XGBoost ................................................................................................................. 18

4.3 Cloud Computing ................................................................................................... 19

Methodology ................................................................................................................................. 20

5.1 Evaluation Score: Area Under the Curve (AUC) ................................................... 20

5.2 Data Processing ...................................................................................................... 21

5.2.1 Data Preprocessing ................................................................................................. 21

5.2.2 Encoding Categorical Data ..................................................................................... 22

5.2.3 Feature Selection .................................................................................................... 24

5.3 Classification Methods ........................................................................................... 24

5.3.1 Naïve Bayes Classification ..................................................................................... 24

5.3.2 Random Forest ........................................................................................................ 25

5.3.3 Gradient Boosting Decision Tree ........................................................................... 25

5.3.4 Ensemble Selection ................................................................................................. 26

5.4 Parameter Optimization ......................................................................................... 28

5.5 Feature Importance and Model Performance ......................................................... 29

Results and Analysis ..................................................................................................................... 30

6.1 Classification Results ............................................................................................. 30


6.1.1 Naïve Bayes Classification ..................................................................................... 30

6.1.2 Random Forest ........................................................................................................ 31

6.1.3 Gradient Boosting Decision Trees .......................................................................... 34

6.1.4 Ensemble Selection ................................................................................................. 36

6.1.5 Overall Comparison ................................................................................................ 37

6.2 Feature selection ..................................................................................................... 38

6.3 Target Audience ..................................................................................................... 41

Discussion and Conclusions ......................................................................................................... 44

Future Works ................................................................................................................................ 48

References..................................................................................................................................... 50


List of Tables

Table 1 Frequency Distribution of Target Variables .................................................................... 16

Table 2 Confusion matrix of a binary classification ..................................................................... 20

Table 3 Confusion Matrix for Churn from Naïve Bayes Classifier ............................................. 30

Table 4 Confusion Matrix for Appetency from Naïve Bayes Classifier ...................................... 30

Table 5 Confusion Matrix for Upselling from Naïve Bayes Classifier ........................................ 31

Table 6 Confusion Matrix for Churn from Random Forest .......................................................... 31

Table 7 Confusion Matrix for Appetency from Random Forest .................................................. 32

Table 8 Confusion Matrix for Upselling from Random Forest .................................................... 32

Table 9 Optimal Hyperparameters for Targets ............................................................................. 32

Table 10 Confusion Matrix for Churn from Gradient Boosted Decision Trees ........................... 34

Table 11 Confusion Matrix for Appetency from Gradient Boosted Decision Trees .................... 34

Table 12 Confusion Matrix for Upselling from Gradient Boosted Decision Trees ..................... 34

Table 13 Optimal Hyperparameters for Targets ........................................................................... 35

Table 13 Confusion Matrix for Churn from Ensemble Selection................................................. 36

Table 14 Confusion Matrix for Appetency from Ensemble Selection ......................................... 36

Table 15 Confusion Matrix for Upselling from Ensemble Selection ........................................... 37

Table 16 Feature Importance of the Top 20 Variables for Each Target ....................................... 39


List of Figures

Figure 1 Sigmoid Function ............................................................................................................. 5

Figure 2 A decision tree to determine the species of a citrus fruit. ................................................ 6

Figure 3 How kernel trick transforms a nonlinear classification (Jain, 2017)................................ 7

Figure 4 Maximum-margin hyperplane and margins for an SVM trained with samples from two

classes (Meyer, 2001). .................................................................................................................... 8

Figure 5 Examples of (a) a neuron, and (b) a neural network. ....................................................... 9

Figure 6 An Example of k-NN Classification .............................................................................. 10

Figure 7 The General Structure of a Random Forest (He, Chaney, Sheffield, & Schleiss, 2016)

...................................................................................................................................................... 12

Figure 8 Bagging vs Boosting (ETS Asset Management Factory, 2016) .................................... 13

Figure 9 Residual Fitting Process in Gradient Boosted Decision Trees (Grover, 2017) .............. 14

Figure 10 Frequency Distribution of Variable 2382..................................................................... 15

Figure 11 Frequency Distribution of Variable 1933..................................................................... 16

Figure 12 Area Under Curve and Balanced Accuracy for Sensitivity vs Specificity ................... 21

Figure 13 A Numeric/Ordinal Encoding Example (Laurae, 2017) .............................................. 22

Figure 14 A One Hot Encoding Example (Laurae, 2017) ............................................................ 23

Figure 15 A Binary Encoding Example (Laurae, 2017) ............................................................... 24

Figure 16 Overview of All the Classification Model ................................................................... 38

Figure 17 AUC Score for Random Forest with Different Number of Features ........................... 40

Figure 18 AUC Score for Gradient Boosting Decision Tree with Different Number of Features

...................................................................................................................................................... 40

Figure 19 Relationship between Prediction and Truth and Connection with Marketing ............. 41

Figure 20 Positive Prediction and True Positive Rate for Each Classifier ................................... 42

Figure 21 Trade-off between Accuracy and Time ........................................................................ 45


Chapter 1

Introduction

1.1 Motivation

In current times, our social reality is in a state of flux, developing from an industrial society via

an information society towards a knowledge-based society. Knowledge is the basis for making decisions and taking action, and it relies on data and information (Hohenegger, Bufardi, & Xirouchakis, 2008). Making good decisions requires gathering the relevant information quickly from well-organized and thoroughly analyzed data.

Knowledge management looks into the possibility of actively influencing the knowledge resources within a company (Wilde, 2011). Customers are the basis of a company's

economic success. Therefore, it is important for us to appropriately manage customers’ data and

information. Furthermore, understanding the connections among the knowledge and actions or

strategies can lead to the improvements in customer relationships and increase the company’s

profit.

The KDD Cup 2009 offered a large marketing database (15,000 × 50,000, >1GB) provided by

the French Telecom company Orange (the database or the Orange database). The competition

required the participants to produce predictions for the propensity of customers to switch

providers (churn), buy new products or services (appetency), or buy upgrades or add-ons

proposed to them to make the sale more profitable (up-selling) (Guyon, Lemaire, Boullé, Dror,

& Vogel, 2009). The customers who are more loyal to the company and more likely to purchase new products and upgrades can be considered the "optimal target audiences" for marketing campaigns.

This well-cleaned and tested database provides an extraordinary opportunity to apply machine learning algorithms to a large-scale industrial application in customer relationship

management.


This thesis focuses on constructing a feasible prediction implementation of the target variables,

churn, appetency and upselling, for the Orange database based on the comprehension and

evaluation of current popular machine learning methods.

1.2 Objectives

As the data source is obtained from a Knowledge Discovery and Data Mining competition,

some of the objectives follow the requirements of the challenge.

The objectives of this thesis are as follows:

➢ Assess the academic literature, industry practices and successful competition entries in machine learning applications so as to determine the fastest and most efficient methods that can be applied to develop predictions based on this database.

➢ Propose a reasonable methodology or framework that predicts target variables based on

large-scale inputs.

➢ Construct an algorithm that outperforms the baseline prediction produced by a basic Naïve Bayes classifier, and improve the algorithm to surpass the results from the in-house system developed by Orange Labs.

➢ Further analyze the results generated from the above application to identify the optimal

target audience for future marketing campaigns.

1.3 Scope

The goal of the project is to compare some of the common ensemble methods, and provide

suggestions based on the results.

The thesis focuses exclusively on the structure of the database provided by Orange, which is a

flat file database.

This thesis provides predictions for the target variables churn, appetency and up-selling, and will not predict other information related to other fields within the database.


This work is constructed to operate solely on the Orange database but can be modified to fit most classification tasks.


Chapter 2

Literature Review

For machine learning tasks, the training data consist of a set of examples. In supervised learning,

each example is a pair consisting of an input vector and a desired output vector, also called the

supervisory signal. A supervised learning algorithm analyzes the training data and produces an

inferred function, which can be used for mapping new examples (Russell & Norvig, 2010). This

thesis project falls into the supervised learning category. The chapter first reviews the widely

used supervised learning algorithms and states the strengths and weaknesses for each one. In

addition, this chapter studies ensemble learning, a class of algorithms that combine simple, mediocre learners into a much stronger classifier without requiring fundamentally new algorithms, and provides background and foundation for Chapter 5.

2.1 Supervised Learning

In supervised learning, a label from the output vector is the explanation of its respective input

example from the input vector. The output vector consists of labels for each training example

present in the training data. These labels for the output vector are provided by the supervisor,

which is why this type of learning is called supervised learning (Mohammed, Khan, & Bashier,

2016). Two general groups of algorithms fall under the umbrella of supervised learning:

classification and regression. This project, predicting customer decisions, is a classification

problem.

2.1.1 Basic Naïve Bayes Classifier

The benchmark of the KDD Cup 2009 was provided based on the basic Naïve Bayes classifier.

Naïve Bayes classifiers are a group of simple probabilistic classifiers based on the application of

Bayes’ theorem with the assumption of strong (naïve) independence among the features (Good,

1965); classification proceeds by combining votes among the features, with each feature's voting score capturing its correlation with the target. This type of classifier has been widely studied since the 1950s and became a popular


method for classification in the early 1960s (Russell & Norvig, 2010). The Naïve Bayes model is a good representative of generative models, as it models the joint distribution of both the variables and the target.

Compared to other classification algorithms, Naïve Bayes is easy to implement and only requires a small amount of training data to estimate the parameters necessary for classification. However, the assumption of class-conditional independence causes a loss of accuracy: in practice, dependencies exist among variables, and these cannot be modelled by a Naïve Bayes classifier (Rish, 2001).
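To make the independence assumption concrete, the classifier can be written in the standard textbook form below (added here for illustration, not quoted from the sources cited above): given features x = (x1, ..., xn), the predicted class is

    \hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)

where the product over individual feature likelihoods is exactly the naïve independence assumption.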

2.1.2 Logistic Regression

Logistic regression was developed by statistician David Cox in 1958 (Cox, 1958). Logistic regression generalizes the linear model by passing the linear combination of the inputs through the sigmoid (logistic) function to obtain a binary outcome. It follows the general form

𝑦(𝐱) = 𝜎(𝐰ᵀ𝐱 + 𝑤₀)

where x is the input vector (the independent variables), y is the output (the dependent variable), w is a parameter vector, and w₀ is a constant offset term. The sigmoid is defined as

𝜎(𝑧) = 1 / (1 + 𝑒⁻ᶻ)

and is plotted in Figure 1.

Figure 1 Sigmoid Function

The binary logistic model is used to estimate the probability of a binary response based on one

or more predictor variables. This model is a good example of discriminative models, which directly model the dependence of the target variables on the observed variables.
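A minimal sketch in Python with scikit-learn (the library used later in the thesis) on hypothetical one-dimensional data; the manually computed sigmoid of wᵀx + w₀ matches the probabilities returned by the fitted model:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def sigmoid(z):
        # logistic function sigma(z) = 1 / (1 + exp(-z))
        return 1.0 / (1.0 + np.exp(-z))

    X = np.array([[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]])   # hypothetical inputs
    y = np.array([0, 0, 0, 1, 1, 1])                              # binary outcomes

    model = LogisticRegression().fit(X, y)
    w, w0 = model.coef_[0], model.intercept_[0]

    print(sigmoid(X.dot(w) + w0))         # probability of class 1, computed by hand
    print(model.predict_proba(X)[:, 1])   # the same probability from the fitted model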


Logistic regression is incredibly easy to apply and very efficient to train. In addition, the

conditional probabilities are determined through the training process, which can be very

valuable in certain applications. Since it is a generalized linear model, it cannot solve non-linear

problems with its linear decision surface.

2.1.3 Decision Tree

The first decision tree algorithm, Automatic Interaction Detection (AID), was published in 1963

by Morgan and Sonquist to produce piecewise constant prediction of a regression function. In

1972, Messenger and Mandell introduced THeta Automatic Interaction Detection (THAID),

which extended the application to classification. Despite the novelty, AID and THAID did not

draw much interest within the statistics community. Later in 1984, Breiman et al. proposed the

Classification And Regression Trees (CART), which regenerated interest in this subject (Loh,

2014).

A decision tree (DT) classifies data in a dataset by passing it through a flowchart-like tree structure of queries, from the root through internal nodes (non-leaf nodes) until it reaches a leaf node

(terminal node), where each internal node denotes a test on an attribute, each branch represents

an outcome of the test, and each leaf holds a class label (Han, Kamber, & Pei, 2012). A typical

decision tree is shown in Figure 2. It predicts whether a citrus fruit is a lemon or an orange.

Figure 2 A decision tree to determine the species of a citrus fruit.
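The flowchart structure can be expressed directly as nested tests; the sketch below uses hypothetical features (diameter and colour) and thresholds, since the actual splits shown in Figure 2 are not reproduced here:

    def classify_citrus(diameter_cm, colour):
        # each "if" is an internal node testing an attribute; each return is a leaf with a class label
        if colour == "yellow":            # root node: test on colour (hypothetical split)
            if diameter_cm < 7:           # internal node: test on size (hypothetical threshold)
                return "lemon"
            return "orange"
        return "orange"

    print(classify_citrus(5, "yellow"))   # -> lemon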

In general, decision trees are easy to fit, easy to use, and easy to interpret as a fixed sequence of

simple tests. They are non-linear, so they work much better than linear models for highly non-

linear functions. On the other hand, as the decision tree classifies by rectangular partitioning, it does not handle nonnumeric data, and when dealing with data sets with a large number of features the tree can grow rather large and may suffer from over-fitting.


2.1.4 Support Vector Machine

The fundamental algorithm of support vector machine was initially presented by Boser, Guyon

and Vapnik as a training algorithm for optimal margin classifiers in 1992 (Boser, Guyon, &

Vapnik, 1992), and was later published as Support Vector Networks by Cortes and Vapnik in

1995 for binary classification (Cortes & Vapnik, 1995).

In a nutshell, a support vector machine (SVM) first uses the kernel trick, essentially a mapping

function, to transform the original training data (input space) into a high-dimensional feature

space as shown in Figure 3. Within this feature space, it searches for the linear optimal

separating hyperplane, a “decision boundary” separating the tuples of one class from another.

As in Figure 4, the SVM finds this hyperplane using support vectors (“essential” training tuples)

and margins (defined by the support vectors).

Figure 3 How kernel trick transforms a nonlinear classification (Jain, 2017).


Figure 4 Maximum-margin hyperplane and margins for an SVM trained with samples from two

classes (Meyer, 2001).

SVMs provide a good out-of-sample generalization, if the hyperparameters are appropriately

chosen. By introducing the kernel, SVMs gain the flexibility of including expert knowledge via

engineering the kernel. An SVM is defined by a convex optimization problem, which has efficient solution methods. However, as the SVM is a non-parametric technique, one major disadvantage is the lack of transparency of its results (Auria & Moro, 2008). Also, the choice of the kernel and the determination of the hyperparameters are important to avoid over-fitting.
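As a minimal illustration (not the configuration used later in the thesis), scikit-learn's SVC with an RBF kernel applies the kernel trick implicitly and exposes the support vectors that define the margin:

    import numpy as np
    from sklearn.svm import SVC

    # two small hypothetical classes in 2-D
    X = np.array([[0, 0], [1, 1], [1, 0], [0, 1],
                  [3, 3], [4, 4], [3, 4], [4, 3]])
    y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

    clf = SVC(kernel="rbf", C=1.0)   # the kernel choice and C help control over-fitting
    clf.fit(X, y)

    print(clf.support_vectors_)      # the "essential" training tuples
    print(clf.predict([[2, 2]]))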

2.1.5 Artificial Neural Network

The computational model for neural network was created by McCulloch and Pitts in 1943

(McCulloch & Pitts, 1943). Over the past 70 years, many algorithms, techniques and

hardware were designed to improve the simulation and accelerate the training of neural

networks, such as back propagation (Werbos, 1974), max-pooling (Weng, Ahuja, & Huang,

1992) and GPU implementations (Steinkraus, Simard, & Buck, 2005).


Figure 5 shows examples of (a) a neuron, and (b) a neural network.

An artificial neural network (ANN) is a system based on the operational paradigm of biological

neural networks. Generally, an artificial neural network is a system of “neurons”, each of which

represents a transfer function, and the commonly used transfer functions are the sigmoid and

logistic functions. This system has a structure that receives an input, processes the data, and

provides an output. Commonly, the input is a data array, which can hold any kind of data that can be represented in array form. Once an input is presented to the neural network and a corresponding desired or target response is set at the output, an error is computed from the

difference between the desired response and the real system output. The error information is fed

back to the system, which makes adjustments to all of its parameters in a systematic fashion

(commonly known as the learning rule). This process is repeated until the desired output is

acceptable (Priddy & Keller, 2005). This model is widely adopted to estimate patterns from a

large set of inputs with a large portion of unknowns, such as face identification, object

recognition, etc.
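A single artificial neuron can be sketched in a few lines of numpy: a weighted sum of the inputs is passed through a transfer function, and the error between the desired and actual output is what the learning rule feeds back to adjust the parameters (the numbers below are hypothetical):

    import numpy as np

    def sigmoid(z):
        # a commonly used transfer function
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, -1.2, 3.0])    # input array
    w = np.array([0.4, 0.1, -0.6])    # weights adjusted during learning
    b = 0.2                           # bias term

    output = sigmoid(np.dot(w, x) + b)
    target = 1.0                      # desired response
    error = target - output           # fed back to update the parameters
    print(output, error)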

2.1.6 K-Nearest Neighbors Algorithm

The k-nearest neighbors (k-NN) algorithm is a nonparametric method used for classification and

regression. The algorithm is based primarily on the nearest neighbor decision rule, which was first formally introduced under that name by Cover and Hart in 1967 (Cover & Hart, 1967). Before that, similar rules were mentioned by Nilsson (1965) as the “minimum distance

classifier” (Nilsson, 1965) and by Sebestyen (1962) as “proximity algorithm” (Sebestyen, 1962).

The first formulation of a nearest-neighbor rule and the analysis of its properties appear to have been made


by Fix and Hodges in their very early discussion on non-parametric discrimination in 1951 (Fix

& Hodges, 1951).

Figure 6 An Example of k-NN Classification

In k-NN classification, the output is a class membership. An object is classified by a majority

vote of its neighbors, with the object being assigned to the class most common among its k

nearest neighbors. Taking the green square in Figure 6 as an example, when k = 5 it would be classified as a blue triangle; however, when k = 10, it would be classified as a red star.

The k-NN algorithm is the most basic of all Instance-Based Learning (IBL) methods, where the

function is only approximated locally and all computation is deferred until classification.

Though the k-NN algorithm is among the simplest of all machine learning algorithms, its

computational complexity makes it relatively expensive (in terms of both memory and time) to

work on large datasets (Mohri, Rostamizadeh, & Talwalkar, 2012).
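The dependence of the decision on k can be reproduced with a small scikit-learn sketch on hypothetical points (not the Orange data); with a small k the query point is labelled by its few closest neighbours, while a larger k lets the more numerous, slightly farther class win the vote:

    from sklearn.neighbors import KNeighborsClassifier

    # three "triangle" points close to the query and six "star" points a bit farther away
    X = [[2, 3], [3, 2], [2, 2],
         [5, 3], [3, 5], [1, 3], [3, 1], [5, 5], [1, 1]]
    y = ["triangle"] * 3 + ["star"] * 6

    for k in (3, 9):
        knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
        print(k, knn.predict([[3, 3]]))   # -> triangle for k=3, star for k=9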

2.2 Ensemble Learning

Ensemble methods are learning algorithms that construct a set of classifiers and then classify

new data points by taking a weighted vote of their predictions. The aim is to improve the

predictive performance of a given statistical learning or model fitting technique. Research on ensembles was initiated at the end of the 1980s (Zhou, Ensemble Learning, 2009).

Statistically, by constructing an ensemble out of accurate classifiers, the algorithm can

“average” their votes and reduce the risk of choosing the wrong classifier. Computationally, an

ensemble constructed by running from different starting points provides a better approximation


to the true unknown function than any of the individual classifiers. Representationally, as the

true function for the variables may not be representable by any single classifier, applying an ensemble method makes it possible to expand the space of representable functions (Dietterich, 2000).

Thus, ensemble methods are considered a suitable technique for obtaining better predictive performance.

The variance-bias decomposition is an important general tool for analyzing the performance of

learning algorithms, where the bias measures the error from erroneous assumptions in the

learning algorithm, and the variance measures the error from sensitivity to small variations in

the training set. Ensemble methods aim to reduce the generalization error of learning algorithms by focusing on these two aspects.
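For squared error this decomposition takes the standard textbook form below (stated here for reference, not quoted from the thesis's sources), where σ² is the irreducible noise:

    E[(y - \hat{f}(x))^2] = \mathrm{Bias}[\hat{f}(x)]^2 + \mathrm{Var}[\hat{f}(x)] + \sigma^2

Bagging (Section 2.2.1) mainly attacks the variance term, while boosting (Section 2.2.3) mainly attacks the bias term.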

2.2.1 Bagging

Bagging (or Bootstrap aggregating) was proposed by Leo Breiman in 1994 to improve the

classification by combining classifications of randomly generated training sets.

Given a set, D, of d tuples, bagging works as follows: repeatedly draw n samples with

replacement from D; for each set of samples, estimate a statistic; the bootstrap estimate is the

mean (or the majority vote) of the individual estimates.

The bagged classifier often has significantly greater accuracy than a single classifier derived

from D. The increased accuracy occurs because the composite model reduces the variance of the

individual classifiers without affecting bias, which means it reduces the sensitivity to individual

data points. For prediction, it was theoretically proven that a bagged predictor will always have

improved accuracy over a single predictor derived from D (Breiman, 1996).
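A minimal sketch of the procedure, with hypothetical data and decision trees as the base classifier, draws bootstrap samples with replacement and takes the majority vote of the individual estimates:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.RandomState(0)
    X = rng.randn(200, 5)                         # hypothetical set D of d tuples
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    estimators = []
    for _ in range(25):
        idx = rng.randint(0, len(X), len(X))      # n samples drawn with replacement from D
        estimators.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

    # the bagged estimate is the majority vote of the individual classifiers
    votes = np.mean([clf.predict(X[:5]) for clf in estimators], axis=0)
    print((votes > 0.5).astype(int))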

2.2.2 Random Forest

Random forests (RF) are ensembles of decision trees; the general method was first

proposed by Ho in 1995 (Ho, 1995). The algorithm follows similar steps as bagging: divide

training examples into multiple training sets (bagging), then train a decision tree on each set


(optionally selecting a random subset of variables to consider at each split), and finally aggregate the predictions of each tree to make the classification decision.

Figure 7 The General Structure of a Random Forest (He, Chaney, Sheffield, & Schleiss, 2016)

As each decision tree can be trained in parallel, random forests are fairly efficient on large data

sets, and they can handle a large number of variables without variable deletion. In addition, they

give an estimate of which variables are more important in classification. For categorical variables with different numbers of levels, random forests are biased in favour of those attributes with

more levels. Therefore, the variable importance scores from random forests are not reliable for

this type of data.

2.2.3 Boosting

Boosting is a type of algorithm that converts the weaker learners to stronger ones. In 1989,

Kearns and Valiant raised an open question: what is the relationship between weakly learnable and strongly learnable problems (Kearns & Valiant, 1989). In 1990, Schapire proved that strong

and weak learnability are equivalent notions (Schapire, 1990). In other words, a weak learning

algorithm that performs just slightly better than random guessing can be "boosted" into an arbitrarily

accurate strong learning algorithm.

In boosting, weak classifiers are trained sequentially (Zhou, Ensemble Methods: Foundations

and Algorithms, 2012). Every time a weak classifier is trained, it is given knowledge of the


performance of previously trained classifiers: misclassified examples gain weight and correctly

classified examples lose weight. Therefore, future classifiers focus more on the mistakes of the previous learners, and the final classifier is a weighted sum of the component weak

classifiers. Figure 8 shows the comparison between a single learner, bagging and boosting.

Figure 8 Bagging vs Boosting (ETS Asset Management Factory, 2016)

For simple models, an average of models has much greater capacity than a single model.

Boosting can reduce bias substantially by increasing capacity, and control variance by fitting

one component at a time.

There are many different approaches for boosting, some widely known ones are: AdaBoost,

LogitBoost and Gradient Boosting.

2.2.4 Gradient Boosting

Gradient boosting was generalized from adaptive boosting (AdaBoost) by Friedman in 1999

(Friedman, 2001). Gradient boosting transforms a set of weak learners to a strong learner with

the help of gradient descent optimization. At each stage a weak learner is fitted to the remaining

errors (also known as pseudo-residuals) of the current strong learner. Figure 9 shows the changes in

the residuals through a training process. At the first several iterations, the residuals are relatively

large and vary significantly between data groups. As the iteration increases to 18, the residuals

decrease to around zero and are around the same size. The residuals keep decreasing and

become stable around zero as the iteration comes to 50. Then, the contribution of the weak


learner to the strong one is computed using gradient descent to minimize the overall error of the

strong learner. The well-known AdaBoost is a special case of gradient boosting where the

sample distribution is modified to emphasize the hard cases and the contribution of the weak

learners is determined by their performance. One of the most widely used models is the gradient boosting decision tree (GBDT), which can handle a mixture of feature types and does not require feature scaling.

Figure 9 Residual Fitting Process in Gradient Boosted Decision Trees (Grover, 2017)
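The residual-fitting loop can be sketched with shallow regression trees on hypothetical one-dimensional data (squared-error loss, where the pseudo-residuals are simply the ordinary residuals); the mean squared residual shrinks toward zero over the iterations, as in Figure 9:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.RandomState(0)
    X = rng.uniform(-3, 3, size=(300, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.randn(300)       # hypothetical regression target

    prediction = np.zeros_like(y)                    # the current "strong learner"
    for stage in range(50):
        residuals = y - prediction                   # pseudo-residuals under squared error
        stump = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
        prediction += 0.1 * stump.predict(X)         # small learning rate per stage

    print(np.mean((y - prediction) ** 2))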

2.2.5 Ensemble Selection

Ensemble selection is a method to construct ensembles from a library of various models

proposed by Caruana, Niculescu, Crew and Ksikes in 2004. It adds models to the ensemble in a forward stepwise manner so as to maximize performance.

Selection starts with a library of desirable models and an empty ensemble. For each iteration, a

model in the library that maximizes the ensemble's performance with respect to the error metric on the

validation set is added to the ensemble. The selection is repeated until it reaches a fixed number

of iterations or all models have been used (Caruana, Niculescu, Crew, & Ksikes, 2004).

Compared to bagging and boosting, where the weight of each model needs to be determined

manually, ensemble selection automatically weights the selected models based on their ability to

improve the performance of the ensemble, as determined by the error metric.
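A minimal sketch of the greedy forward selection, with a hypothetical library of validation-set predictions and AUC as the error metric, looks as follows (models may be picked repeatedly, which is what implicitly weights them):

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def ensemble_selection(library_preds, y_val, n_iter=10):
        # library_preds: dict of model name -> prediction scores on the validation set
        selected = []
        for _ in range(n_iter):
            best_name, best_score = None, -np.inf
            for name, preds in library_preds.items():
                candidate = np.mean([library_preds[m] for m in selected] + [preds], axis=0)
                score = roc_auc_score(y_val, candidate)   # performance on the validation set
                if score > best_score:
                    best_name, best_score = name, score
            selected.append(best_name)                    # add the model that helps the most
        return selected, best_score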


Chapter 3

Data

The Orange dataset was provided by a French multinational telecommunication corporation

Orange S.A. for the 2009 Knowledge Discovery and Data Mining competition (KDD Cup 2009). To

protect the privacy of the customers whose records were used, the data were anonymized by

replacing actual text or labels by meaningless codes and not revealing the meaning of the

variables.

The dataset used throughout the thesis is the training set of the competition, which contains

50,000 instances, each with 15,000 variables (a 50,000 × 15,000 matrix), the first 14,740 of which

are numerical and the last 260 are categorical, and three binary target variables corresponding to

churn, appetency and up-selling.

Figure 10 Frequency Distribution of Variable 2382

All 15,000 variables of the dataset are analyzed through frequency distributions to determine the

characteristics among the variables. Most numerical variables have skewed distributions as

shown in Figures 10 and 11. In addition, many numerical variables have a common factor, which

is an indication that these variables are artificially encoded. Take variable 2382 as an example,


all the values are multiples of three, as shown in Figure 10. Similarly, the values for variable

1933 increment by two.

Figure 11 Frequency Distribution of Variable 1933

Another observation is that many variables only have a few discrete values. For example, about

50% of all numerical variables have fewer than 3 distinct values, approximately 80% of all

categorical variables have fewer than 10 categories and 12% of numerical variables and 28% of

categorical variables are constant. Furthermore, numerical values are heavily populated by 0s. It

was discovered that 80% of the numerical variables have more than 98% 0s. These results

suggest that a large number of variables can be removed since they are constant or close to

constant.

Table 1 Frequency Distribution of Target Variables

Churn   Appetency   Up-selling   Frequency   Percentage
 -1        -1           -1          41756       83.51%
 -1        -1            1           3682        7.36%
 -1         1           -1            890        1.78%
  1        -1           -1           3672        7.34%

The frequency distributions of the three target variables are shown in Table 1. All three targets are highly imbalanced, with only 1-7% positive cases. There is also no overlap between any pair of labels: every instance in the dataset has at most one positive target variable.


Chapter 4

Technology

This chapter discusses the programming language and platform used in the thesis and justifies

the advantages of choosing them. This chapter also includes the packages, libraries and APIs

that are used in the thesis.

4.1 Python

Python is a very popular open source programming language, created by Guido van Rossum and

first released in 1991. While many languages are used in data science, for instance, C++, Java,

R, and MATLAB, Python is dominant; codeeval.com rated Python “the most popular language”

for the fifth year in a row (CodeEval, 2016). Python, an interpreted language, has a design

philosophy which emphasizes code readability, and a syntax which allows programmers to

express concepts in fewer lines of code than possible in languages such as C++ or Java

(Summerfield, 2007).

Furthermore, Python has packages for almost any conceivable math function for machine

learning. And most of the popular machine learning libraries are either written in Python (scikit-

learn, TensorFlow) or have Python bindings (Caffe, OpenCV). Python 2.7 is used for all

programming in this thesis.

4.2 Libraries

4.2.1 Scikit-learn

Scikit-learn is a free machine learning library for the Python programming language initiated by

David Cournapeau in 2007. It features various classification, regression and clustering algorithms, including Naïve Bayes, support vector machines, decision trees and ensemble methods. It is designed to interoperate with the Python numerical and scientific libraries


NumPy and SciPy. This package focuses on bringing machine learning to non-specialists using

a general-purpose high-level language (Pedregosa, et al., 2011).

4.2.2 TensorFlow

In November 2015, Google released TensorFlow, an open source deep learning software library

for defining, training and deploying machine learning models. It provides support for both the

research and the engineering sides in Google, as it can advance the state of the art on existing

problems and bring understanding to new problems as well as take the insight from the research

community to enable innovative products and product features (Google, 2015). Aside from

supporting the internal product in Google, TensorFlow provides a platform for collaboration and

communication among researchers. There are numerous libraries and source projects built on

top of TensorFlow and allow clearer understanding and more accessible applications of deep

learning.

4.2.3 TFlearn

TFlearn is a modular and transparent deep learning library built on top of TensorFlow. It was

designed to provide a higher-level API to TensorFlow in order to facilitate and speed-up

experimentations, while remaining fully transparent and compatible with it.

Compared to TensorFlow, TFlearn allows fast prototyping through highly modular built-in

neural network layers, regularizers, optimizers, and metrics. The high-level API currently

supports most recent deep learning models, such as convolutional neural networks (CNN), bidirectional recurrent neural networks (BRNN), batch normalization, PReLU, residual networks and generative adversarial networks (GANs).

4.2.4 XGBoost

XGBoost is an open-source software library which provides the gradient boosting framework

for C++, Java, Python, R, and Julia. XGBoost initially started as a research project by Tianqi

Chen as part of the Distributed (Deep) Machine Learning Community (DMLC) group and was

first released on March 27, 2014. XGBoost is designed to provide an efficient, flexible and

portable gradient boosting library that works on major distributed environments (Hadoop, SGE,


MPI) and solves data science problems in a fast and accurate manner (Distributed (Deep)

Machine Learning Community, 2015).

4.3 Cloud Computing

Cloud computing was defined by the National Institute of Standards and Technology in 2011 as “a

model for enabling convenient, on-demand network access to a shared pool of configurable

computing resources (e.g., networks, servers, storage, applications, and services) that can be

rapidly provisioned and released with minimal management effort or service provider

interaction” (Mell & Grance, 2011).

Cloud computing in general provides a more cost-efficient system for data centralization with

better software integration and various access options. The main reason the work in this thesis is run on the cloud is that the models can run there without interference from other tasks on a local computer, which helps guarantee memory allocation and increases computation speed.

The cloud used in the thesis is the IBM Data Scientist Workbench, a virtual lab environment for

people to practice data science and cognitive computing. The free access was provided through

the registration of Cognitive Class (https://cognitiveclass.ai), formerly known as the Big Data

University initiated by IBM in 2010. The workbench provides elastic compute environments

with the best possible capacity of 16 vCPU and 64 GB RAM.


Chapter 5

Methodology

This chapter explains the steps employed to achieve the objectives of the thesis. It is organized

as follows: Section 5.1 introduces how the models are evaluated, Section 5.2 describes how the data are processed, Section 5.3 explains the classification models used in the thesis and how they are constructed, and Sections 5.4 and 5.5 cover parameter optimization and the use of feature importance.

5.1 Evaluation Score: Area Under the Curve (AUC)

One of the main objectives of the thesis is to make good predictions of the target variables. The

prediction of each target variable is thought of as a separate classification problem. The results

of classification, obtained by thresholding the prediction score, may be represented in a

confusion matrix, where tp (true positive), fn (false negative), tn (true negative) and fp (false

positive) represent the number of examples falling into each possible outcome, as shown in

Table 2.

Table 2 Confusion matrix of a binary classification

                        Predictions
                   Class +1    Class –1
Truth   Class +1      tp          fn
        Class –1      fp          tn

The results will be evaluated with the Area Under the Curve (AUC), which corresponds to the area under

the curve obtained by plotting sensitivity against specificity by varying a threshold on the

prediction values to determine the classification result. Another common curve used in machine

learning to determine the diagnostic ability of a binary classifier system is the receiver operating

characteristic (ROC) curve, created by plotting the true positive rate against the false positive

rate. The ROC curve evaluates a classifier based on its performance only on the predictions of

the true target, but this method is problematic especially when the data is highly skewed

(Swamidass, Azencott, Daily, & Baldi, 2012). The AUC used in the thesis represents the trade-off


between sensitivity and specificity and is therefore more suitable for evaluating models on the Orange dataset.

The AUC is calculated using the trapezoid method. In the case when binary scores are supplied

for the classification instead of discriminant values, the curve is given by {(0,1), (tn/(tn+fp),

tp/(tp+fn)), (1,0)} and the AUC is simply the balanced accuracy (BAC). The widely used average accuracy may give a misleading idea of generalization performance when a classifier is tested on an imbalanced dataset; this shortcoming can be overcome by replacing the average accuracy with the balanced accuracy (Brodersen, Ong, Stephan, & Buhmann, 2010).

Figure 12 Area Under Curve and Balanced Accuracy for Sensitivity vs Specificity
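A small sketch with hypothetical ±1 labels shows the quantities involved; when only binary predictions are available, the trapezoid AUC reduces to the balanced accuracy computed from the confusion matrix:

    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = np.array([1, 1, 1, -1, -1, -1, -1, -1])   # hypothetical truth
    y_pred = np.array([1, 1, -1, -1, -1, -1, 1, -1])   # hypothetical binary predictions

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[-1, 1]).ravel()
    sensitivity = tp / float(tp + fn)                   # true positive rate
    specificity = tn / float(tn + fp)                   # true negative rate

    bac = 0.5 * (sensitivity + specificity)             # AUC of the three-point curve
    print(sensitivity, specificity, bac)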

5.2 Data Processing

5.2.1 Data Preprocessing

A significant portion of the variables are highly skewed; in other words, some variables contain a considerable number of missing values. Many techniques are available to address the

missing value problem, such as deletion, mean/mode substitution, maximum likelihood

estimation, Bayesian estimation, multiple imputation, etc. (Enders, 2010). In this thesis, the

missing values are either substituted by mean or mode, depending on the type of variables, or

considered as a standalone entry ‘missing’.


After handling the missing values, the dataset was cleaned by removing the 1531 constant

variables and 5874 quasi-constant variables (where a single value occupies more than 99.98% of the population).

For categorical data, although most of the features have fewer than 10 categories, about 5% of them have more than 100 categories; in such cases, the categorical variables with more than 100 distinct values were grouped into 20 categories.
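A sketch of these preprocessing steps in pandas, assuming the training set has been loaded into a DataFrame (the file name below is hypothetical and the thesis's exact implementation may differ):

    import pandas as pd

    df = pd.read_csv("orange_train.data", sep="\t")          # hypothetical file name

    # missing values: mean for numerical columns, a standalone "missing" level otherwise
    for col in df.columns:
        if df[col].dtype == object:
            df[col] = df[col].fillna("missing")
        else:
            df[col] = df[col].fillna(df[col].mean())

    # drop constant and quasi-constant columns (one value covering > 99.98% of rows)
    keep = [c for c in df.columns
            if df[c].value_counts(normalize=True).iloc[0] <= 0.9998]
    df = df[keep]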

5.2.2 Encoding Categorical Data

Some models in section 5.3 do not handle categorical data, such as Naïve Bayes Classifier and k

Nearest Neighbors. Thus, in addition to the binning process above, an encoding process of

converting categorical data into numerical data was performed for each of the categorical

variables. The encoding methods used are: ordinal/numeric encoding, one hot encoding and

binary encoding, which are visualized in Figures 13, 14 and 15 respectively.

Ordinal encoding simply converts each value in the column to a unique number. This is the simplest encoding method. However, for categorical data the category assigned a larger number is not necessarily more "important" or "heavier-weighted" than the category with a smaller number, and this would most likely lead to misinterpretation by some of the algorithms. This method was used when there were only a small number of categories, for its ease of implementation.

Figure 13 A Numeric/Ordinal Encoding Example (Laurae, 2017)


A common alternative approach is one hot encoding, which converts each category value into a

new binary column by assigning 1 or 0 to the corresponding column. This avoids improper

weighting of categories but has the downside of including more columns in the dataset, thus

increasing the potential computational complexity for future analysis. This method is used when

the number of categories is relatively small.

Figure 14 A One Hot Encoding Example (Laurae, 2017)

Binary encoding first assigns a unique integer to each category and then converts that integer into its binary representation. By exploiting the positional notation of binary numbers, a feature with N cardinalities (categories) can be stored using ceil(log(N+1)/log(2)) binary features. This will significantly reduce the memory

increase from one hot encoding while ensuring the representation of each category. However,

the encoding process is the most complicated among the three methods mentioned and takes the

longest to encode. This method is used for data with a large number of categories.


Figure 15 A Binary Encoding Example (Laurae, 2017)
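The three encodings can be sketched in pandas on a hypothetical categorical variable; the binary encoding below writes the ordinal code of each category across ceil(log(N+1)/log(2)) bit columns:

    import math
    import pandas as pd

    s = pd.Series(["red", "green", "blue", "green"])          # hypothetical categories

    codes = s.astype("category").cat.codes                    # ordinal/numeric encoding
    one_hot = pd.get_dummies(s, prefix="colour")              # one binary column per category

    n_bits = int(math.ceil(math.log(s.nunique() + 1) / math.log(2)))
    binary = pd.DataFrame({"bit_%d" % b: (codes // 2 ** b) % 2
                           for b in range(n_bits)})           # binary encoding

    print(codes.tolist(), list(one_hot.columns), n_bits)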

5.2.3 Feature Selection

No feature selection was performed before obtaining the results from the classification models. As some of the models used later are based on the construction of decision trees, where the trees are split according to attribute significances determined by information gain or relative entropy, these attribute significances were subsequently used as a form of feature selection to examine how the performance of the different models changes with the number of variables.

5.3 Classification Methods

5.3.1 Naïve Bayes Classification

Naive Bayes Classifier is a classification technique based on Bayes’ Theorem with an

assumption of strong (naive) independence between the features. In simple terms, a Naïve Bayes

classifier assumes that the presence of a particular feature in a class is unrelated to the presence

of any other feature. Gaussian Naïve Bayes additionally assumes that the features follow a normal distribution.

The advantage is that Naive Bayes Classifier is easy to implement and fast to predict. It also

performs well in multiclass classification. One major limitation of Naive Bayes is the

assumption of independent features. In real life, it is almost impossible to get a dataset whose

fields are completely independent. Thus, the Gaussian Naïve Bayes classifier is chosen as the benchmark for the later models.


The Naïve Bayes classifier is constructed using the GaussianNB class from the naive_bayes module of the scikit-learn library.
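A minimal sketch of how such a benchmark can be set up is shown below; the synthetic data generated with make_classification is only a hypothetical stand-in for the preprocessed Orange data.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    # Hypothetical, highly imbalanced stand-in for one preprocessed target
    X, y = make_classification(n_samples=5000, n_features=50,
                               weights=[0.93], random_state=0)

    clf = GaussianNB()
    # 10-fold cross-validated AUC, mirroring the evaluation scheme used in Chapter 6
    print(cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean())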

5.3.2 Random Forest

Random forest is an ensemble learning method that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes of the individual trees. Each tree is trained independently on a random sample of the data. This randomness helps to make the model more robust than a single decision tree and less likely to overfit the training data.

Random decision forests correct for decision trees' habit of overfitting to their training set and decrease test error by lowering prediction variance. They also handle missing and categorical values naturally. Like most ensemble methods, random forest is a black box algorithm, so it is not easy to interpret visually. Another main limitation is that a large number of trees may make the algorithm slow for real-time prediction. Random forest is chosen because it is an ensemble method with reasonably good accuracy.

The random forest classifier is constructed using the RandomForestClassifier class from the ensemble module of scikit-learn (Scikit-learn, 2017). The model is tuned using grid search to determine the optimal number of trees in the forest, the number of features to consider when looking for the best split, the maximum depth of the trees, etc.
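A condensed sketch of this tuning step is given below; the parameter grid is a hypothetical, much smaller grid than the one actually searched, and the synthetic X, y again stand in for a preprocessed target.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Hypothetical stand-in for one preprocessed target
    X, y = make_classification(n_samples=5000, n_features=50,
                               weights=[0.93], random_state=0)

    # Illustrative grid only; the actual search covered a wider range of values
    param_grid = {
        "n_estimators": [300, 500],
        "max_depth": [50, 75],
        "max_features": ["sqrt"],
        "min_samples_leaf": [1, 3],
        "class_weight": ["balanced"],
    }
    search = GridSearchCV(RandomForestClassifier(n_jobs=-1, random_state=0),
                          param_grid, scoring="roc_auc", cv=5)
    search.fit(X, y)
    print(search.best_params_, round(search.best_score_, 4))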

5.3.3 Gradient Boosting Decision Tree

Gradient boosting is a machine learning technique that produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. Tianqi Chen, the creator of XGBoost, explained the detailed mathematical formulation and algorithm construction in the 2016 KDD paper (Chen & Guestrin, 2016). In general, a gradient boosting decision tree builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function. A gradient boosting decision tree (GBDT) builds trees one at a time, where each new tree helps to correct errors made by the previously trained trees. With each tree added, the model becomes more expressive.


Gradient boosting decision trees have similar advantages and disadvantages to random forest, as both are ensemble methods based on decision trees. Gradient boosting decision trees usually have higher accuracy, do not require feature scaling and can handle a mixture of feature types. However, training times are usually longer because they require significant computation, the final model can be hard to understand, and the method is prone to overfitting and requires careful tuning. The gradient boosting decision tree is selected because it is generally a stronger learner than random forest. The gradient boosting decision tree classifier is trained using the XGBoost library.
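A minimal training sketch with the XGBoost library is shown below; the synthetic data is a hypothetical stand-in, and the parameter values are only loosely based on the tuned settings reported later in Table 13.

    import xgboost as xgb
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # Hypothetical, imbalanced stand-in for one target
    X, y = make_classification(n_samples=5000, n_features=50,
                               weights=[0.93], random_state=0)
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2,
                                              stratify=y, random_state=0)

    params = {
        "objective": "binary:logistic",
        "eta": 0.1,                 # learning rate
        "max_depth": 2,
        "min_child_weight": 5,
        "gamma": 2,
        "subsample": 0.7,
        "scale_pos_weight": 12,     # roughly sum(negatives) / sum(positives)
        "eval_metric": "auc",
    }
    dtrain = xgb.DMatrix(X_tr, label=y_tr)
    dvalid = xgb.DMatrix(X_va, label=y_va)
    booster = xgb.train(params, dtrain, num_boost_round=100,
                        evals=[(dvalid, "valid")], verbose_eval=False)
    print(booster.eval(dvalid))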

5.3.4 Ensemble Selection

Ensemble selection is essentially a bagging process in which models are added to the ensemble only when they improve the results, rather than a simple average of the outcomes from a set of predetermined model types (Caruana, Niculescu, Crew, & Ksikes, 2004). Ensemble selection can include other ensemble methods as base models in its library and further improve the result through the selection process. On the other hand, this means that training this model takes considerably more time than all of the base models, as each model included in the model library is trained separately and each requires specific tuning to achieve the best result.

5.3.4.1 Pseudocode

The pseudocode is created based on the understanding of Caruana et al.'s papers on the construction and optimization of ensemble selection (Caruana, Niculescu, Crew, & Ksikes, 2004) (Caruana, Munson, & Niculescu-Mizil, Getting the Most Out of Ensemble Selection, 2006).

Input: a library of classification models Ω, a training dataset T, a validation dataset V, a maximum iteration number n, and predetermined performance metrics

Output: an ensemble E of models from the library Ω

Procedure:

1. Initialize an empty ensemble E

2. Train all the models in Ω on the training set T

3. for i ← 1 to the predefined iteration number n do

4.     Perform predictions on the validation set V with all the models

5.     Evaluate, on the performance metrics, the effect of adding each model Ωi to the ensemble

6.     Add the model that maximizes the performance to the ensemble E

7. end

8. Output the updated ensemble E
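The following is a simplified Python sketch of this greedy selection loop (with replacement), written against scikit-learn-style classifiers. The function name, the models dictionary and the use of AUC as the single performance metric are assumptions made for illustration, not the exact implementation used in the thesis.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def ensemble_selection(models, X_train, y_train, X_valid, y_valid, n_iter=20):
        """Greedy forward selection with replacement (Caruana et al., 2004), simplified."""
        # Train every model in the library and cache its validation predictions
        preds = {}
        for name, model in models.items():
            model.fit(X_train, y_train)
            preds[name] = model.predict_proba(X_valid)[:, 1]

        ensemble, ensemble_pred = [], np.zeros(len(y_valid))
        for _ in range(n_iter):
            best_name, best_score = None, -np.inf
            for name, p in preds.items():
                # Validation performance if this model were added to the ensemble
                candidate = (ensemble_pred * len(ensemble) + p) / (len(ensemble) + 1)
                score = roc_auc_score(y_valid, candidate)
                if score > best_score:
                    best_name, best_score = name, score
            ensemble.append(best_name)
            ensemble_pred = (ensemble_pred * (len(ensemble) - 1)
                             + preds[best_name]) / len(ensemble)
        return ensemble, ensemble_pred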

5.3.4.2 Model Library

Naïve Bayes: The same Naïve Bayes classifier used as benchmark was included in the library.

Logistic regression is included as a simple machine learning model under the discriminative

class. Ng and Jordan compared logistic regressions and Naïve Bayes classifiers and concluded

that when the training size reaches infinity the discriminative model performs better than the

generative model. On the other hand, the generative model reaches its asymptote faster than the

discriminative model (Ng & Jordan, 2001). Two types (L1 and L2) of regularized logistic

regression were included using the function LogisticRegression under the Generalized Linear

Models in sklearn.

Support vector machines and L2-regularized logistic regression are closely related; it is possible to derive the SVM by asking a logistic regression to make the right decisions. They differ in that logistic regression seeks to maximize the likelihood, while support vector machines seek to minimize the constraint violations (Schmidt, 2009). When using the kernel trick, SVMs have better scalability and can produce sparse solutions. The LinearSVC, NuSVC and SVC classes under Support Vector Machines in sklearn were used to train different types of SVMs: linear SVMs and kernel SVMs with different kernel types.

The K Nearest Neighbors algorithm is an instance-based learner and does not produce a model. It is generally the easiest to understand, but it is computationally intensive when classifying large data sets. The KNeighborsClassifier class under Nearest Neighbors in sklearn was used to implement several different types of kNN models, including different weight functions (uniform and distance) and different algorithms for computing the nearest neighbours.

Traditionally, clustering is regarded as unsupervised learning, as opposed to supervised learning. The goal of clustering is to determine the internal grouping in a set of data. It is nevertheless possible to solve supervised classification problems using clustering (Eick, Zeidat, & Zhao, 2004). Several clustering functions under the Clustering models in sklearn were used, including k-means clustering, co-clustering and bi-clustering.

Random Forest Classifier: The same Random Forest Classifier in section 5.3.2 was included in

the library.

Adaboost is a special case of the gradient boosting models, where the boosting is performed by re-weighting the problematic (misclassified) cases. The model was programmed using the AdaBoostClassifier class under the Ensemble Methods in sklearn.

Gradient Boosting Decision Tree: The same Gradient Boosting Decision Tree classifier described in section 5.3.3 was included in the library.

Deep neural networks have been very popular over the last several years, especially for their outstanding power in computer vision and natural language processing. Even so, Dreiseitl and Ohno-Machado's review showed that, for general data classification scenarios, neural networks and logistic regression perform at about the same level more often than not, with the more flexible neural networks generally outperforming logistic regression in the remaining cases (Dreiseitl & Ohno-Machado, 2002). Raschka also suggested that the strengths of neural networks lie in image classification, natural language processing, and speech recognition (Raschka, 2016). Neural network models were built using the TFLearn library; models with different structures are included in the model library.

5.4 Parameter Optimization

The same kind of machine learning model can require different constraints, weights or learning rates when applied to different datasets. These measures are called hyperparameters, and they have to be tuned so that the model can optimally solve the machine learning problem. Parameter optimization, or tuning, aims to find a tuple of hyperparameters that optimizes the performance of the model, which is usually evaluated through loss functions. Hyperparameters can significantly impact performance, and tuning the parameters of the models is as important as choosing the right models.


In this thesis, all models mentioned in section 5.3 are carefully tuned using grid search, where the optimal hyperparameters are determined through an exhaustive search and evaluated with the help of cross-validation.

5.5 Feature Importance and Model Performance

After training all the models and obtaining the results for the first round, a further feature-selection analysis, based on the feature importance determined by the GBDT, was applied to the GBDT and random forest models for the three target variables. This analysis aims to determine how the prediction changes as the number of features increases and whether the prediction accuracy converges once the number of features reaches a certain range.


Chapter 6

Results and Analysis

This chapter is organized according to the models introduced in section 4.2: it presents the classification results, discusses the optimal target audience based on the results from all the models, and further explores how feature selection (feature importance) can help with the prediction process.

6.1 Classification Results

The classification results are determined as the average over a 10-fold cross-validation. Each target variable is trained separately in this section, and the following results were achieved using the models discussed in section 5.3.

6.1.1 Naïve Bayes Classification

For the Naïve Bayes classifier, the confusion matrices shown in Tables 3 to 5 were obtained for each of the target variables.

Table 3 Confusion Matrix for Churn from Naïve Bayes Classifier

                      Predicted Class +1    Predicted Class –1
Truth Class +1                     157.4                 209.8
Truth Class –1                     304.8                  4328

For churn the true positive rate is 34.05%, true negative rate is 95.38%, and the corresponding

AUC is 0.6472.

Table 4 Confusion Matrix for Appetency from Naïve Bayes Classifier

                      Predicted Class +1    Predicted Class –1
Truth Class +1                      40.2                  48.8
Truth Class –1                      92.6                4818.4

For appetency the true positive rate is 30.27%, true negative rate is 98.99%, and the

corresponding AUC is 0.6463.


Table 5 Confusion Matrix for Upselling from Naïve Bayes Classifier

                      Predicted Class +1    Predicted Class –1
Truth Class +1                     117.7                 121.2
Truth Class –1                     250.5                4510.6

For upselling the true positive rate is 49.27%, true negative rate is 94.74%, and the

corresponding AUC is 0.7200.

As mentioned in section 5.3.1, the Naïve Bayes classifier was taken as the benchmark for all the following ensemble methods, and the results achieved by applying the Naïve Bayes method match the results provided by the Orange company, which suggests the data preprocessing is valid.

In addition, as the data is highly skewed, all of the target variables have much higher true negative rates than true positive rates. In the worst-case scenario for data as imbalanced as this, a trivial strategy would be to predict all cases as negative, which still gives a very good average accuracy (>90%) but completely defeats the purpose of the prediction. Similarly, since the Naïve Bayes classifier is one of the simpler classifiers, it is more accurate at predicting the category with the larger amount of training data.

Furthermore, as the AUC is calculated using the balanced accuracy, the scores for all the models give equal influence to the true positive and true negative cases, which is more meaningful for finding the target customers the thesis is focusing on.
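For reference, the score reported throughout this chapter can be computed directly from a confusion matrix as the average of sensitivity and specificity; the counts used below are hypothetical and serve only to illustrate the calculation.

    def balanced_accuracy(tp, fn, fp, tn):
        """AUC as reported in this chapter: the mean of sensitivity and specificity."""
        tpr = tp / (tp + fn)   # true positive rate (sensitivity)
        tnr = tn / (tn + fp)   # true negative rate (specificity)
        return (tpr + tnr) / 2

    # Hypothetical counts for an imbalanced binary target
    print(balanced_accuracy(tp=50.0, fn=50.0, fp=100.0, tn=4800.0))   # ~0.74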

The running time for Naïve Bayes Classifier was about 20 min per fold.

6.1.2 Random Forest

For the Random Forest Classifier, the following confusion matrices shown in Table 6 to 8 were

obtained for each of the target variables.

Table 6 Confusion Matrix for Churn from Random Forest

                      Predicted Class +1    Predicted Class –1
Truth Class +1                     151.9                 174.5
Truth Class –1                     215.3                4458.3


For churn the true positive rate is 46.54%, true negative rate is 95.39%, and the corresponding

AUC is 0.7097.

Table 7 Confusion Matrix for Appetency from Random Forest

                      Predicted Class +1    Predicted Class –1
Truth Class +1                      56.9                  26.6
Truth Class –1                      32.1                4884.4

For appetency the true positive rate is 68.14%, true negative rate is 99.35%, and the

corresponding AUC is 0.8374.

Table 8 Confusion Matrix for Upselling from Random Forest

                      Predicted Class +1    Predicted Class –1
Truth Class +1                     130.8                  47.3
Truth Class –1                     237.4                4584.5

For upselling the true positive rate is 73.44%, true negative rate is 95.08%, and the

corresponding AUC is 0.8426.

The optimal hyperparameters for each of the targets are shown in Table 9.

Table 9 Optimal Hyperparameters for Targets

Hyperparameters Churn Appetency Upselling

bootstrap True True True

class_weight balanced balanced balanced

criterion gini gini gini

max_depth 50 75 60

max_features sqrt sqrt sqrt

min_samples_leaf 3 1 3

min_samples_split 10 5 10

n_estimators 500 500 600

n_jobs -1 -1 -1

bootstrap=True means bootstrap samples are used when building the trees.

criterion='gini' is one of the functions used to measure the quality of a split; another common choice is 'entropy'.

n_jobs represents the number of jobs to run in parallel for both fit and predict; when it is equal to -1, the number of jobs is set to the number of cores.

n_estimators represents the number of trees in the forest. When the number is relatively small, adding trees improves the performance gradually, but eventually the result converges to an optimal value where further increasing the number of trees does not affect the result significantly.

max_depth represents the maximum depth of each tree. The deeper/bigger a tree is, the more features it considers when it is constructed, so it is often desirable to grow large individual trees for high-dimensional problems. However, growing large trees does not always give the best performance, especially when dealing with noisy data (Lin & Jeon, 2006). Hence all the models included an upper bound for the tree depth.

min_samples_leaf represents the minimum number of samples required to be at a leaf node, and min_samples_split represents the minimum number of samples required to split an internal node. These two values are highly related, as together they determine the general structure of a leaf. As analyzed in Chapter 3, appetency is relatively more imbalanced than churn and upselling, with about 70% fewer positive cases, which makes it reasonable for appetency to have a lower requirement to split a node or create a leaf.

max_features represents the number of features to consider when looking for the best split, and the best result was obtained with the square root of the number of features. This choice also significantly reduces the time needed to train the model, especially when the data set and the number of trees in the forest are large.

When class_weight is set to 'balanced', the model uses the values of the target to automatically adjust weights inversely proportional to the class frequencies in the input data. As the dataset is highly skewed, it is sensible to 'balance' the weights through this setting.

Compared to the Naïve Bayes classifier, there were obvious improvements in the true positive rate for all three targets, with the most significant rise in appetency. The increase in sensitivity led to the increase in the AUC score, as the change in specificity is relatively small.

For both appetency and upselling the number of true positive cases increases, whereas the number of true positive cases for churn decreases slightly. This is because the Naïve Bayes classifier has the lowest true positive rate for churn: an excessive number of samples were classified as +1 when they were actually -1. The decrease in positive predictions for churn indicates that the tree structure is comparatively better at classifying such cases.

The running time for Random Forest was approximately 4 hours per fold.

6.1.3 Gradient Boosting Decision Trees

For the Gradient Boosting Decision Tree classifier, the confusion matrices shown in Tables 10 to 12 were obtained for each of the target variables.

Table 10 Confusion Matrix for Churn from Gradient Boosted Decision Trees

                      Predicted Class +1    Predicted Class –1
Truth Class +1                     155.1                 138.3
Truth Class –1                     212.1                4494.5

For churn the true positive rate is 52.86%, true negative rate is 95.49%, and the corresponding

AUC is 0.7418.

Table 11 Confusion Matrix for Appetency from Gradient Boosted Decision Trees

                      Predicted Class +1    Predicted Class –1
Truth Class +1                      56.6                  28.2
Truth Class –1                      32.4                4915.2

For appetency the true positive rate is 66.75%, true negative rate is 99.34%, and the

corresponding AUC is 0.8304.

Table 12 Confusion Matrix for Upselling from Gradient Boosted Decision Trees

                      Predicted Class +1    Predicted Class –1
Truth Class +1                     132.3                  26.8
Truth Class –1                     235.9                  4605


For upselling the true positive rate is 83.15%, true negative rate is 95.13%, and the

corresponding AUC is 0.8914.

The optimal hyperparameters for each of the targets are shown in Table 13.

Table 13 Optimal Hyperparameters for Targets

Hyperparameters Churn Appetency Upselling

eta 0.1 0.05 0.1

gamma 2 1 2

max_depth 2 3 2

min_child_weight 5 3 5

num_round 100 150 120

scale_pos_weight 12 20 12

subsample 0.7 0.6 0.7

In XGBoost, the hyperparameter eta represents the learning rate of the boosting.

A node is split only when the resulting split gives a positive reduction in the loss function. gamma specifies the minimum loss reduction required to make a split; the larger the value, the more conservative the algorithm will be.

max_depth means the same as in random forest. In a gradient boosting decision tree it is not essential to grow very deep trees, as the later trees improve on the results of the previous ones; that is why the tree depth is much smaller in the gradient boosting decision tree than in the random forest.

min_child_weight plays essentially the same role as min_samples_leaf in random forest. Appetency again has the smallest value among the targets.

num_round stands for the number of boosting iterations. As gradient boosted decision trees are prone to overfitting, the results are not guaranteed to improve as the number of rounds increases.

scale_pos_weight controls the balance of positive and negative weights, which is useful for unbalanced classes. XGBoost suggests using sum(negative cases) / sum(positive cases) as the weight. The optimal weights for churn and upselling are approximately that value (~12.5), but for appetency the tuned weight (20) is much smaller than the suggested value (~55).

subsample is the fraction of the training instances that XGBoost randomly samples from the training dataset to grow each tree, and values below one help to prevent overfitting.

Compared to random forest, there were improvements for churn and upselling and a slight decrease for appetency. The number of positive predictions continues to shrink for both churn and upselling while the number of true positive cases increases, which matches the rising trend in true positive rates. For appetency, both the positive predictions and the true positive cases are relatively close to the results from random forest, which indicates good prediction ability from both ensemble methods.

The running time for Gradient Boosted Decision Trees was in the range of 4-5 hours per fold.

6.1.4 Ensemble Selection

For the Ensemble Selection classifier, the confusion matrices shown in Tables 14 to 16 were obtained for each of the target variables.

Table 14 Confusion Matrix for Churn from Ensemble Selection

                      Predicted Class +1    Predicted Class –1
Truth Class +1                     156.9                 118.5
Truth Class –1                     210.3                4514.3

For churn the true positive rate is 56.97%, true negative rate is 95.55%, and the corresponding

AUC is 0.76. The top 3 models included in the ensemble are: gradient boosting decision trees,

random forest, followed by L2-regularized logistic regression.

Table 15 Confusion Matrix for Appetency from Ensemble Selection

                      Predicted Class +1    Predicted Class –1
Truth Class +1                      60.3                    20
Truth Class –1                      28.7                  4891


For appetency the true positive rate is 75.09%, the true negative rate is 99.42%, and the corresponding AUC is 0.8725. The top 3 models included in the ensemble are: random forest, gradient boosted decision trees and logistic regression.

Table 16 Confusion Matrix for Upselling from Ensemble Selection

                      Predicted Class +1    Predicted Class –1
Truth Class +1                     133.4                  21.5
Truth Class –1                     234.8                4610.3

For upselling the true positive rate is 86.12%, true negative rate is 95.15%, and the

corresponding AUC is 0.9064. The top three models in the ensemble are: gradient boosted

decision tree, random forest and logistic regression.

The idea of ensemble selection is to improve the performance of the ensemble through the selection iterations. As expected, the best-performing base models are also the most heavily weighted models in the ensemble. The results for all three targets improve compared to both GBDT and random forest, with a further shrinkage in positive predictions and an increase in true positive cases.

The running time for Ensemble Selection varied from 20-22 hours per fold.

6.1.5 Overall Comparison

Figure 16 shows the overall comparison of all the classification models used in the thesis. It is evident that the ensemble methods (RF, GBDT and ensemble selection) achieved a significant improvement over the benchmark Naïve Bayes classifier. As introduced earlier, the purpose of ensemble methods is to improve the performance of simple classifiers by decreasing the variance (bagging) or the bias (boosting). However, not only did the models improve the predictions, they also very significantly increased the time needed to produce them. Random forest and the gradient boosting decision tree can be seen as ensembles of decision trees, and compared to a single classifier their running time increased by roughly 10 times. Ensemble selection can be considered an ensemble of ensemble methods, as both random forest and gradient boosting decision trees are included in the library along with many other models; this further pushes the running time up by another 4 times. Even though the ensemble selection model provided the best result among all the models tested, the question is how to achieve a good balance between prediction outcomes on the one hand and time and computational complexity on the other. From a practical perspective, time and computational complexity make the ensemble selection model much more expensive than both random forest and the gradient boosting decision tree.

Figure 16 Overview of All the Classification Models

6.2 Feature selection

As mentioned in section 5.2.3, there was no initial feature selection for any of the results shown in section 6.1. Meanwhile, it is also apparent that the running time for the ensemble methods is much longer than for any of the base learners. In this section, feature selection based on the feature importance from XGBoost is implemented to discover how random forest and the gradient boosting decision tree perform as the number of features varies.

Using the final gradient boosting decision tree model from section 6.1.3, the top 20 variables based on its feature importance for each of the targets are listed in Table 17. It is apparent that these variables are far more related to the results than the ones not listed in the table.


Figure 17 and Figure 18 show the change in the AUC score when using different numbers of the most important features for random forest and the gradient boosted decision tree, respectively.

Table 17 Feature Importance of the Top 20 Variables for Each Target

Rank Churn Appetency Upselling

Variable Importance Variable Importance Variable Importance

1 Var8981 0.20131 Var9045 0.23783 Var9045 0.25517

2 Var14990 0.10253 Var8032 0.13564 Var14990 0.07866

3 Var10533 0.04649 Var14995 0.10791 Var8981 0.05324

4 Var14970 0.04602 Var14990 0.06068 Var12507 0.04962

5 Var5331 0.02358 Var5826 0.03720 Var6808 0.04653

6 Var14995 0.02190 Var8981 0.03233 Var1194 0.02581

7 Var14822 0.02111 Var10256 0.03030 Var14970 0.02157

8 Var9045 0.02004 Var12641 0.02721 Var14871 0.01331

9 Var2570 0.02001 Var14772 0.01718 Var1782 0.01149

10 Var14923 0.01883 Var14939 0.01689 Var10256 0.01052

11 Var14765 0.01194 Var14867 0.01623 Var5026 0.00963

12 Var14904 0.01138 Var14970 0.01423 Var8032 0.00913

13 Var5702 0.01133 Var11781 0.01145 Var14786 0.00807

14 Var11047 0.01121 Var14871 0.00891 Var7476 0.00622

15 Var14778 0.00972 Var14788 0.00860 Var11781 0.00591

16 Var14795 0.00903 Var13379 0.00812 Var14795 0.00574

17 Var990 0.00898 Var5216 0.00711 Var6255 0.00572

18 Var12580 0.00863 Var14795 0.00703 Var5216 0.00503

19 Var9075 0.00860 Var11315 0.00664 Var2591 0.00497

20 Var647 0.00847 Var12702 0.00627 Var12641 0.00465

For random forest, the performance improves as the number of included features increases and eventually converges once the number reaches about 300. The gradient boosting decision tree follows a similar trend and converges at around 200 features. In both models, the performances for the three targets fluctuate differently but ultimately converge, as expected. In addition, as the number of features decreases, the running time of both models decreases proportionally.
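A compact sketch of this experiment is shown below; the synthetic data, the chosen feature counts and the reduced cross-validation are hypothetical simplifications of the procedure described above.

    import numpy as np
    import xgboost as xgb
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Hypothetical wide, imbalanced stand-in for the Orange data
    X, y = make_classification(n_samples=5000, n_features=500, n_informative=30,
                               weights=[0.93], random_state=0)

    # Rank features by importance from a fitted gradient boosting model
    gbdt = xgb.XGBClassifier(n_estimators=100, max_depth=2)
    gbdt.fit(X, y)
    order = np.argsort(gbdt.feature_importances_)[::-1]

    # Retrain a random forest on only the top-k features and track the AUC
    for k in (20, 100, 300):
        rf = RandomForestClassifier(n_estimators=300, class_weight="balanced", n_jobs=-1)
        auc = cross_val_score(rf, X[:, order[:k]], y, cv=5, scoring="roc_auc").mean()
        print(k, round(auc, 3))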


Figure 17 AUC Score for Random Forest with Different Number of Features

Figure 18 AUC Score for Gradient Boosting Decision Tree with Different Number of Features


6.3 Target Audience

The above results can help the marketing and sales departments of the company with targeting, preparing and improving marketing campaigns. Figure 19 visualizes the relationship between prediction and truth. For marketing purposes, the positive predictions can be treated as the actual target customers, and the true positive rate represents the response rate from those customers. The objectives for the company would be to minimize the cost of the campaign while maximizing the number of responses.

Figure 19 Relationship between Prediction and Truth and Connection with Marketing

As shown in Figure 20, there are obvious trends for all three targets. The ensemble methods are able to narrow the range of positive predictions by 30% while doubling the true positive rate. For example, the prediction for upselling can be applied in a marketing campaign to promote upgrades and add-ons. With the help of the ensemble learning methods, the target customer sizes are between 150 and 180, compared to over 240 when using the Naïve Bayes method and 5,000 without any machine learning analysis. Meanwhile, the response rates are steady between 70% and 90%, compared to around 50% when applying the benchmark classifier and 7.4% from the total customer population. In other words, ensemble methods can very likely minimize the campaign cost by decreasing the size of the target customer group while improving the feedback from customers.

Figure 20 Positive Prediction and True Positive Rate for Each Classifier

Furthermore, with the identification of churn among existing customers, a survey can be sent to the positively predicted individuals to understand what aspects of the service or product might lead them to consider switching providers and what improvements they would like to see from the company. At the same time, surveys can also be directed at potential customers with positive classifications to study which promotions they appreciate the most and how they rank companies in the same industry.

Similarly, with the predictions of appetency and upselling, studies can be made of customers' behaviours after advertisements: what level of upgrade they chose, what kind of new product they purchased, what kind of promotion they responded to, etc., to support future marketing plans.


As shown in section 6.2, these targets are influenced by different features. Unfortunately, further analysis based on a single variable would be hard to interpret, as all the values in the dataset are encrypted. In practice, the company can presumably further analyze the top-ranked features from the ensemble methods, especially the demographics-related ones. Understanding these features may enable more personalized marketing plans that improve the response from customers and strengthen the relationship between the customers and the company.


Chapter 7

Discussion and Conclusions

The dataset used in the thesis was obtained from the 2009 KDD Cup, in which IBM Research achieved the highest score. The winning entry from IBM Research consisted of an ensemble selection model over various classifiers. The other winning entries all included ensemble methods to some extent, especially random forest and boosted decision trees. The models used in the thesis were selected based on the study and understanding of these ensemble methods and of the results from the competition entries (Guyon, Lemaire, Boullé, Dror, & Vogel, 2009).

Based on the results from Chapter 6, the presented ensemble models (random forest, gradient boosting decision tree and ensemble selection) are able to outperform the benchmark obtained from the Naïve Bayes classifier, with the ensemble selection algorithm producing the best outcome of all. The overall score from the ensemble selection algorithm (0.8492) also measures up to the best score produced by IBM Research in the competition (0.8521). Minor deviations are acceptable given the differences in data preprocessing and library construction and the stochasticity in ensemble methods.

For companies that plan on implementing machine learning for data analysis, ensemble methods are very likely to produce better results than a single classifier.

On the whole, ensemble methods are constructed on top of weak learners to improve performance. A weak learner is defined as a classifier that is only slightly correlated with the true classification, or in simple terms, one that produces results only a little better than random guessing. Ensemble methods, whether bagging or boosting, increase the computational complexity in both space (memory) and time compared to a single weak learner. For bagging, the training process is relatively simple: the estimate is produced by the average or majority vote of all the weak learners, and the number of weak learners trained is the main cause of the increase in complexity. For instance, if it takes O(t) time and m memory to train a single weak learner, then it takes O(nt) time to train n similar weak learners and n×m memory to store them for training and prediction. For boosting, each weak learner is trained in sequence, and the later learners in the ensemble are 'boosted' using knowledge of the mistakes of the previous learners in order to minimize a predetermined loss function; the computational complexity mainly depends on the number of boosting iterations and the complexity of the boosting function (Zhou, Ensemble Learning, 2009). From the Naïve Bayes classifier, to random forest and gradient boosting (ensembles of simple learners), to ensemble selection (an ensemble of ensemble models), the prediction result improves gradually. For a model like ensemble selection, which is essentially an ensemble of ensemble models, it is understandable that the running time grows dramatically relative to a single base learner.

When implementing ensemble methods in real life, the trade-off is between accuracy and computational complexity. In general, as a model becomes more complex, the training process requires a more powerful machine and extra memory for storage and computation, or it risks freezing the whole system and losing all the work done up to that point. Likewise, the more complex the algorithm, the longer it takes to construct, train and predict.

Figure 21 Trade-off between Accuracy and Time

Figure 21 illustrates one possible approach to evaluating the ensemble methods. As all of the models were run in the same environment, the only trade-off is between accuracy


(AUC score) and time. Time efficiency is calculated as the multiplicative inverse of the running time, so that a model with a shorter running time (more time efficient) has a higher value. The optimal model is the one with a balanced AUC score and time efficiency, in this case random forest or the gradient boosting decision tree. Although ensemble selection has the highest score, it is very inefficient and may not be the best model for every problem.

If the dataset is not as large, complex and noisy as the Orange dataset, the base learners are relatively easier to train. Under this circumstance, the number of learners needed for the random forest and the number of iterations needed for gradient boosting to converge would decrease. At the same time, the weak learner is not likely to produce very bad results, which leaves little room for improvement by the ensemble methods. Hence, there is unlikely to be such a large difference between ensemble selection and random forest or the gradient boosting decision tree in either running time or prediction results. The most costly aspect would be the additional programming required for the model library in ensemble selection.

Also, if the data analysis department of the company is well financed, it most likely has the computational power and human resources to produce the analysis using ensemble selection. Similarly, if the primary purpose of the company is to predict the outcomes, that is, if the most important objective is accuracy, then ensemble selection would be the best option. On the other hand, if the company does not have strong computational power, if the analysis result is time sensitive, or if the dataset has a relatively small number of features, general ensemble methods like random forest and gradient boosting machines can still produce trustworthy results with much less effort. For example, a multinational technology company like Apple Inc. certainly has the computational strength as well as the financial support for the most complicated models; this type of company will tend to choose the model with the highest accuracy. On the contrary, a retail startup is unlikely to have the resources required to set up a complicated ensemble method, so it is wise for such a company to choose a basic ensemble method.


Furthermore, the feature selection process can significantly reduce the running time, and a well-constructed selection can maintain the same level of prediction with far fewer features. Here the features were selected based on the feature importance produced by the gradient boosting decision tree. With feature selection, the models were able to produce comparable predictions with only 1.5% - 2% of the features, which also leads to a dramatic decrease in running time and memory. Feature selection is helpful not only because it simplifies the computational process, but also because of its potential to decrease future costs in information gathering and processing: the company can ask for only 2% of the original 15,000 fields for future predictions.

Moreover, as the predictions from this dataset are most likely to be used for marketing and sales purposes, the results from the ensemble methods have the potential to adjust the commercials to a more personalized level, reducing the marketing campaign size and cost while maintaining or improving the number of targeted customers. At the same time, ensemble methods are also good at identifying customers with a propensity to switch to another firm, which gives the company the opportunity to keep those customers, or to attract customers from other companies, with targeted advertisements or promotions, and also to study why they are considering changing suppliers.

Ensemble methods are widely used in many industries. In the biomedical industry, ensemble methods can be used for disease diagnosis and prediction, drug discovery, gene expression analysis, etc. In high energy physics, the predictions and analyses for the Large Hadron Collider (LHC) are mostly done through machine learning.

To sum up, the ensemble methods tested in the thesis were able to produce better predictions at the cost of additional time and computational power. The trade-off between accuracy, time and money should be evaluated in light of the details of the dataset, the resources of the company and the main objective of the analysis. Feature selection can typically reduce the number of features required to make good predictions and thus reduce the computational complexity of the ensemble models. The results from the ensemble models can be applied for marketing and planning purposes. In addition to the predictions, the models can also provide insights about feature importance that can be used for personalized/targeted promotion and advertising.


Chapter 8

Future Works

The ensemble methods were only tested on the Orange dataset. A more generalized result could be obtained by applying these ensemble methods to different datasets, preferably datasets with different sizes, structures and objectives. In addition to generalizing the trends discovered in this work, the results could also be used to identify which ensemble method is most suitable for which kind of dataset or which specific objective. Besides the models tested in the thesis, other ensemble methods could also be tested under the same constraints to further discover the "best" of the ensemble methods.

The Orange dataset used in the thesis is encrypted, so it is impossible to understand the actual meaning of each field/feature. It would be helpful to apply the ensemble methods to non-encoded datasets and uncover some insights through the mining process. With the help of feature importance, it is possible to reveal which demographic information affects the prediction most, and to further study how the most significant features vary from target to target and why they have the most influence on that target. These further studies can be measurably helpful for management and marketing decisions, especially if the features obtained are gender, age or otherwise demographically oriented. As an illustration, if an analysis of customers indicates that females in the 18-24 age group living in the Yonge-Eglinton neighbourhood are more accepting and welcoming of new fashion items, a fast fashion retailer Y may consider opening a pop-up store within that area with exclusive or limited-edition products. This type of research can lead to a further reduction of the target customer group and result in a better designed marketing campaign.

Additionally, the current dataset does not contain chronological data. More often than not, realistic data will carry a timestamp, and the time of the recording presumably contains certain information, such as seasonal trends and market directions; this is where reinforcement learning is beneficial. Reinforcement learning, also known as approximate dynamic programming, is an area of machine learning that maximizes cumulative rewards through the actions taken in a certain environment. In simple terms, a reinforcement learning model is trained on a series of actions over time steps with their corresponding rewards, and it is updated as this number grows, so the model is constantly updating as it predicts new data samples and receives feedback. If a chronological dataset of customer information can be obtained, ensemble learning and reinforcement learning can be combined to produce predictions with trends and sequential actions that could yield the highest outcomes based on the definition of the project. One beneficial type of chronological data to collect to supplement the Orange dataset would be the responses from customers together with the details of the marketing plans. Reinforcement learning can be trained on this feedback to predict which kinds of promotions or commercials get the most responses from different customers (the highest rewards). Two consumers both identified as likely to purchase a new product could then be approached through different media and receive different promotions based on their historical data.


References

Auria, L., & Moro, R. A. (2008). Support Vector Machines (SVM) as a Technique for Solvency Analysis (DIW Berlin Discussion Paper No. 811). Berlin: DIW Berlin.

Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory (pp. 144-152). Pittsburgh: ACM.

Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123-140.

Brodersen, K. H., Ong, C. S., Stephany, K. E., & Buhmann, J. M. (2010). The Balanced Accuracy and Its Posterior Distribution. 2010 20th International Conference on Pattern Recognition (pp. 3121-3124). Istanbul: Institute of Electrical and Electronics Engineers.

Caruana, R., Munson, A., & Niculescu-Mizil, A. (2006). Getting the Most Out of Ensemble Selection. Proceedings of the 6th IEEE International Conference on Data Mining (ICDM 2006) (pp. 828-833). Hong Kong: IEEE.

Caruana, R., Niculescu, A., Crew, G., & Ksikes, A. (2004). Ensemble Selection from Libraries of Models. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML '04) (pp. 18-26). Banff: ACM.

Chen, T., & Guestrin, C. (2016, June 3). XGBoost: A Scalable Tree Boosting System. KDD '16 Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794). San Francisco.

CodeEval. (2016, February 2). Most Popular Coding Languages of 2016. Retrieved from CodeEval: http://blog.codeeval.com/codeevalblog/2016/2/2/most-popular-coding-languages-of-2016

Cortes, C., & Vapnik, V. (1995). Support-Vector Networks. Machine Learning, 20(3), 273-297.

Cover, T., & Hart, P. (1967, January). Nearest Neighbor Pattern Classification. IEEE Transactions on Information Theory, 13(1), 21-27.

Cox, D. R. (1958, January). The Regression Analysis of Binary Sequences: Discussion on the Paper. Journal of the Royal Statistical Society. Series B (Methodological), 20(2), 232-242.

Dietterich, T. G. (2000). Ensemble Methods in Machine Learning. Proceedings of the First International Workshop on Multiple Classifier Systems (pp. 1-15). London: Springer-Verlag.

Distributed (Deep) Machine Learning Community. (2015). Introduction to Boosted Trees. Retrieved from XGBoost Documents: http://xgboost.readthedocs.io/en/latest/get_started/

Dreiseitl, S., & Ohno-Machado, L. (2002, October). Logistic regression and artificial neural network classification models: a methodology review. Journal of Biomedical Informatics, 35(5-6), 352-359.

Eick, C. F., Zeidat, N., & Zhao, Z. (2004). Supervised clustering - algorithms and benefits. 16th IEEE International Conference on Tools with Artificial Intelligence (pp. 774-776). Boca Raton: IEEE Computer Society.

Enders, C. K. (2010). Applied Missing Data Analysis. New York: Guilford Press.

ETS Asset Management Factory. (2016, April 20). What is the difference between Bagging and Boosting? Retrieved from QuantDare: https://quantdare.com/what-is-the-difference-between-bagging-and-boosting/

Fix, E., & Hodges, J. L. (1951). Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties. Randolph Field: USAF School of Aviation Medicine.

Friedman, J. H. (2001, October). Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, 29(5), 1189-1232.

Good, I. J. (1965). The Estimation of Probabilities: An Essay on Modern Bayesian Methods. Cambridge: M.I.T. Press.

Google. (2015, November 9). TensorFlow: Open source machine learning. Retrieved June 11, 2017, from YouTube: https://www.youtube.com/watch?v=oZikw5k_2FM

Grover, P. (2017, December 9). Gradient Boosting from scratch. Retrieved from Medium: https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d

Guyon, I., Lemaire, V., Boullé, M., Dror, G., & Vogel, D. (2009). Analysis of the KDD Cup 2009: Fast Scoring on a Large Orange Customer Database. Proceedings of 2009 International Conference on KDD Cup (pp. 1-22). Paris: JMLR.org.

Han, J., Kamber, M., & Pei, J. (2012). Data Mining: Concepts and Techniques (3rd ed.). Amsterdam: Elsevier/Morgan Kaufmann.

He, X., Chaney, N. W., Sheffield, J., & Schleiss, M. (2016, October). Spatial Downscaling of Precipitation Using Adaptable Random Forests. Water Resources Research, 52(10), 8217-8237.

Hebb, D. O. (1949). The Organization of Behavior. New York: Wiley & Sons.

Ho, T. K. (1995). Random Decision Forests. Proceedings of the 3rd International Conference on Document Analysis and Recognition (pp. 278-282). Montreal: IEEE Computer Society.

Hohenegger, J., Bufardi, A., & Xirouchakis, P. (2008). Compatibility Knowledge in Fuzzy Front End. In A. Bernard, & S. Tichkiewitch, Methods and Tools for Effective Knowledge Life-Cycle-Management (pp. 243-258). Berlin: Springer.

Jain, R. (2017, February 21). Simple Tutorial on SVM and Parameter Tuning in Python and R. Retrieved from Hackerearth blog: http://blog.hackerearth.com/simple-tutorial-svm-parameter-tuning-python-r

Kearns, M., & Valiant, L. (1989). Cryptographic Limitations on Learning Boolean Formulae and Finite Automata. Proceedings of the 21st ACM Symposium on the Theory of Computing (pp. 434-444). Seattle: ACM.

Laurae. (2017, April 23). Categorical Features and Encoding in Decision Trees. Retrieved from Medium: https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931

Lin, Y., & Jeon, Y. (2006). Random Forests and Adaptive Nearest Neighbors. Journal of the American Statistical Association, 101(474), 578-590.

Loh, W. Y. (2014). Fifty Years of Classification and Regression Trees. International Statistical Review, 82(3), 329-348.

McCulloch, W., & Pitts, W. (1943). A Logical Calculus of Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics, 5(3), 115-133.

Mell, P., & Grance, T. (2011). The NIST Definition of Cloud Computing. Gaithersburg, MD: Computer Security Division, Information Technology Laboratory, National Institute of Standards and Technology. Retrieved from http://purl.fdlp.gov/GPO/gpo17628

Meyer, D. (2001). Support Vector Machines: The Interface to libsvm in Package e1071. R News, 1/3, 23-26.

Mohammed, M., Khan, M. B., & Bashier, E. B. (2016). Machine Learning: Algorithms and Applications. Boca Raton: CRC Press.

Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2012). Foundations of Machine Learning. Cambridge, MA: MIT Press.

Ng, A. Y., & Jordan, M. I. (2001). On Discriminative vs. Generative Classifiers: A comparison of logistic regression and Naive Bayes. Advances in Neural Information Processing Systems 14 (NIPS 2001) (pp. 841-848). Vancouver: MIT Press.

Nilsson, N. J. (1965). Learning Machines: Foundations of Trainable Pattern-Classifying Systems. New York: McGraw-Hill.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., . . . Brucher, M. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830.

Priddy, K., & Keller, P. (2005). Artificial Neural Networks: An Introduction. Bellingham: SPIE Press.

Raschka, S. (2016, April). When Does Deep Learning Work Better Than SVMs or Random Forests? Retrieved from KDnuggets: https://www.kdnuggets.com/2016/04/deep-learning-vs-svm-random-forest.html

Rish, I. (2001). An empirical study of the naive Bayes classifier. Proceedings of IJCAI-2001 Workshop on Empirical Methods in AI (pp. 41-46). Seattle: IBM New York.

Russel, S., & Norvig, P. (2010). Artificial Intelligence: A Modern Approach (3rd ed.). Upper Saddle River, NJ: Prentice Hall.

Schapire, R. E. (1990). The Strength of Weak Learnability. Machine Learning, 5(2), 197-227.

Schmidt, M. (2009, March 29). A Note on Structural Extensions of SVMs. Retrieved from http://www.cs.ubc.ca/~schmidtm/Documents/2009_Notes_StructuredSVMs.pdf

Scikit-learn. (2017). 1.11. Ensemble methods. Retrieved from Scikit-learn: http://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting

Scikit-learn. (2017). 3.2.4.3.1. sklearn.ensemble.RandomForestClassifier. Retrieved from Scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Sebestyen, G. S. (1962). Decision Making Processes in Pattern Recognition. New York: Macmillan.

Steinkraus, D., Simard, P., & Buck, I. (2005). Using GPUs for Machine Learning Algorithms. Eighth International Conference on Document Analysis and Recognition (ICDAR'05), 2, pp. 1115-1120. Seoul: IEEE Computer Society.

Summerfield, M. (2007). Rapid GUI Programming with Python and Qt: The Definitive Guide to PyQt Programming. Upper Saddle River: Prentice Hall Press.

Swamidass, S., Azencott, C., Daily, K., & Baldi, P. (2012, April). A CROC stronger than ROC: measuring, visualizing and optimizing early retrieval. Bioinformatics, 1348-1356.

Weng, J., Ahuja, N., & Huang, T. S. (1992). Cresceptron: A Self-organizing Neural Network. IJCNN International Joint Conference on Neural Networks, 1, pp. 576-581. Baltimore: IEEE.

Werbos, P. J. (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences (Doctoral dissertation). Harvard University.

Wilde, S. (2011). Customer Knowledge Management: Improving Customer Relationship through Knowledge Application. Berlin: Springer.

Zhou, Z.-H. (2009). Ensemble Learning. In S. Z. Li, & A. Jain (Eds.), Encyclopedia of Biometrics (pp. 270-273). New York: Springer US.

Zhou, Z.-H. (2012). Ensemble Methods: Foundations and Algorithms. Boca Raton: Chapman & Hall/CRC.