Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees
MOHAMMED ALHAMADI - PROJECT 2

Page 1: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees
MOHAMMED ALHAMADI - PROJECT 2

Page 2: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Acknowledgement

This project was done as a partial requirement for the course Introduction to Machine Learning, offered online in Fall 2016 through Tandon Online, Tandon School of Engineering, NYU.

Page 3: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Outline

1. Adaboost Algorithm

2. The Dataset

3. Data Exploration and Visualization

4. Factorizing the Dependent Variable

5. Prediction Using a Single Decision Tree

6. Prediction Using Adaboost

7. Can we improve this?

8. Can we improve even further?

9. Comparison

10. References

Page 4: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Adaboost Algorithm
• Adaboost is short for “Adaptive Boosting”

• Created by Yoav Freund and Robert Schapire in 1997

• Adaboost is a meta-algorithm: it runs on top of other learning algorithms rather than learning on its own

• Can be used with other learning algorithms to improve their performance
• Combines many weak classifiers into a single strong one
• The weak classifiers are typically decision trees (which we’ll use in this project)

Page 5: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Adaboost Algorithm (cont.)
• Weak models are added sequentially, each trained on a reweighted version of the training data

• The process continues until:
  ◦ No further improvement can be made on the training data set, or
  ◦ A threshold number of weak classifiers has been created

• Predictions are then made by taking the weighted average of the weak classifiers’ outputs
• Each weak classifier has a weight, and its classification result is multiplied by that weight (a from-scratch sketch follows)
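To make the weighted combination concrete, here is a minimal from-scratch sketch of the AdaBoost.M1 training loop, using rpart stumps as the weak classifiers. This is illustrative only (the fastAdaboost package used later implements the real algorithm in C++); adaboost_sketch, predict_sketch, and the -1/+1 label coding are our own assumptions, not part of the project code.

library(rpart)

# Train n_rounds depth-1 trees ("stumps") on repeatedly reweighted data.
adaboost_sketch <- function(x, y, n_rounds = 15) {
  n <- length(y)                            # y must be coded as -1/+1
  w <- rep(1 / n, n)                        # start with uniform example weights
  stumps <- vector("list", n_rounds)
  alpha  <- numeric(n_rounds)
  d <- data.frame(x, y = factor(y))
  for (t in 1:n_rounds) {
    stumps[[t]] <- rpart(y ~ ., data = d, weights = w,
                         control = rpart.control(maxdepth = 1))
    h <- ifelse(predict(stumps[[t]], d, type = "class") == "1", 1, -1)
    err <- sum(w[h != y])                   # weighted training error
    err <- min(max(err, 1e-10), 1 - 1e-10)  # guard the log below
    alpha[t] <- 0.5 * log((1 - err) / err)  # this classifier's weight
    w <- w * exp(-alpha[t] * y * h)         # up-weight misclassified points
    w <- w / sum(w)                         # renormalize
  }
  list(stumps = stumps, alpha = alpha)
}

# Predict by a weighted vote of all stumps.
predict_sketch <- function(model, x) {
  d <- data.frame(x)
  scores <- vapply(seq_along(model$stumps), function(t) {
    h <- ifelse(predict(model$stumps[[t]], d, type = "class") == "1", 1, -1)
    model$alpha[t] * h
  }, numeric(nrow(d)))
  sign(rowSums(scores))                     # weighted majority vote
}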

Page 6: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Adaboost Algorithm (cont.)
• The data should ideally have these characteristics before it is processed with Adaboost:
  ◦ Quality data: the ensemble method attempts to correct misclassifications in the training data, so make sure the training data is of high quality
  ◦ Outliers: outliers force the ensemble to work very hard to correct unrealistic cases, so it is best to remove them from the dataset
  ◦ Noisy data: noise in the output variable can be problematic for the ensemble, so isolate and clean noisy data from your training dataset

Page 7: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

The Data Set
• The data set used in this project contains 4898 observations of white wine varieties, with quality ranked by wine tasters
• It contains 11 independent variables and 1 dependent variable
• The independent variables are: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol
• The dependent variable is the quality of the wine, ranked from 3 (lowest quality) to 9 (highest quality)

Page 8: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Data Exploration and Visualization

wine_data <- read.csv("C:/Users/Mohammed/Google Drive/R_code/Project1/winequality-white.csv",
                      header=TRUE, sep=";")
dim(wine_data)

[1] 4898   12

names(wine_data)

 [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"          "residual.sugar"
 [5] "chlorides"            "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"
 [9] "pH"                   "sulphates"            "alcohol"              "quality"

Page 9: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Data Exploration and Visualization (cont.)

str(wine_data)

'data.frame':   4898 obs. of  12 variables:
 $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
 $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
 $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
 $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
 $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
 $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
 $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
 $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
 $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
 $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
 $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
 $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

Page 10: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Data Exploration and Visualization (cont.)

range(wine_data$fixed.acidity)
[1] 3.8 14.2
range(wine_data$volatile.acidity)
[1] 0.08 1.10
range(wine_data$citric.acid)
[1] 0.00 1.66
range(wine_data$residual.sugar)
[1] 0.6 65.8
range(wine_data$chlorides)
[1] 0.009 0.346
range(wine_data$free.sulfur.dioxide)
[1] 2 289
range(wine_data$total.sulfur.dioxide)
[1] 9 440
range(wine_data$density)
[1] 0.98711 1.03898
range(wine_data$pH)
[1] 2.72 3.82
range(wine_data$sulphates)
[1] 0.22 1.08
range(wine_data$alcohol)
[1] 8.0 14.2
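The eleven range() calls above can also be collapsed into one line; a small sketch that prints the min and max of every column at once:

sapply(wine_data, range)   # 2 rows (min, max), one column per variable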

Page 11: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Data Exploration and Visualization (cont.)

cor(wine_data)
(values rounded to two decimals)

             Fixed acid  Vol. acid  Citric acid  Res.Sugar  Chlorides  FS dioxide  TS dioxide  Density     pH  Sulphates  Alcohol  Quality
Fixed acid         1.00      -0.02         0.29       0.09       0.02       -0.05        0.10     0.27  -0.43      -0.02    -0.12    -0.11
Vol. acid         -0.02       1.00        -0.15       0.06       0.07       -0.10        0.09     0.03  -0.03      -0.04     0.07    -0.19
Citric acid        0.29      -0.15         1.00       0.09       0.11        0.09        0.12     0.15  -0.16       0.06    -0.08    -0.01
Res.Sugar          0.09       0.06         0.09       1.00       0.09        0.30        0.40     0.83  -0.20      -0.02    -0.05    -0.10
Chlorides          0.02       0.07         0.11       0.09       1.00        0.10        0.20     0.26  -0.10       0.02    -0.36    -0.21
FS dioxide        -0.05      -0.10         0.09       0.30       0.10        1.00        0.62     0.30  -0.01       0.01    -0.25     0.01
TS dioxide         0.10       0.09         0.12       0.40       0.20        0.62        1.00     0.53   0.01       0.13    -0.45    -0.17
Density            0.27       0.03         0.15       0.83       0.26        0.30        0.53     1.00  -0.10       0.07    -0.80    -0.31
pH                -0.43      -0.03        -0.16      -0.20      -0.10       -0.01        0.01    -0.10   1.00       0.16     0.12     0.10
Sulphates         -0.02      -0.04         0.06      -0.02       0.02        0.01        0.13     0.07   0.16       1.00    -0.02     0.05
Alcohol           -0.12       0.07        -0.08      -0.05      -0.36       -0.25       -0.45    -0.80   0.12      -0.02     1.00     0.44
Quality           -0.11      -0.19        -0.01      -0.10      -0.21        0.01       -0.17    -0.31   0.10       0.05     0.44     1.00
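Since the last row is what we ultimately care about, a one-line sketch pulls the quality column out of the correlation matrix and sorts it (alcohol, at 0.44, and density, at -0.31, stand out):

round(sort(cor(wine_data)[, "quality"]), 2)   # correlations with quality, ascending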

Page 12: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Data Exploration and Visualization (cont.)

hist(wine_data$alcohol, col="#EE3B3B", main="Histogram of Alcohol Percent in Wine",
     xlab="Alcohol Percent", ylab="Number of samples", las=1)
hist(wine_data$density, col="#BCEE6B", main="Histogram of Wine Density",
     xlab="Density", ylab="Number of samples", las=1)
hist(wine_data$chlorides, col="#CDB79E", main="Histogram of Chlorides in Wine",
     xlab="Chlorides", ylab="Number of samples", las=1)

Page 13: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Data Exploration and Visualization (cont.)

hist(wine_data$quality, col="#458B74", main="Wine Quality Histogram",
     xlab="Quality", ylab="Number of samples")
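The histogram's counts can also be read off in text form; summing the buckets on either side of 6 gives the high/low split used on the next slide:

table(wine_data$quality)          # observations per quality rank, 3 through 9
table(wine_data$quality >= 6)     # TRUE = "high" (3258), FALSE = "low" (1640)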

Page 14: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Factorizing the Dependent Variable

quality_fac <- ifelse(wine_data$quality >= 6, "high", "low")

wine_data <- data.frame(wine_data, quality_fac)

table(wine_data$quality_fac)

high  low
3258 1640

• We can now remove the old integer dependent variable (quality)

wine_data <- wine_data[,-12]
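One caveat worth hedging: C5.0 and adaboost expect a factor response, and since R 4.0 data.frame() no longer converts character columns to factors by default, so on a recent R version it is safer to coerce explicitly:

wine_data$quality_fac <- factor(wine_data$quality_fac)   # ensure "high"/"low" is a factor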

Page 15: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Prediction Using a Single Decision Tree
• To have a baseline for comparing Adaboost’s performance, we will first use a single decision tree
• We use the C50 package in R
• The misclassification error rate is almost 22% with a single tree (output below)

set.seed(797)
training_size <- round(0.7 * dim(wine_data)[1])
training_sample <- sample(dim(wine_data)[1], training_size, replace=FALSE)
testing_sample <- -training_sample
training_data <- wine_data[training_sample,]
testing_data <- wine_data[-training_sample,]
testing <- quality_fac[testing_sample]

library(C50)
C50_model <- C5.0(quality_fac~., data=training_data)
predict_C50 <- predict(C50_model, testing_data[,-12])
mean(predict_C50 != testing)

  [1] 0.2178353
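A single error rate hides which class the tree gets wrong; a confusion matrix (a quick sketch using the objects above) breaks the 22% down by class:

table(Predicted = predict_C50, Actual = testing)   # rows: predicted, columns: actual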

Page 16: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Prediction Using Adaboost
• We will use the fastAdaboost library

• Initially, we will choose to use 15 classifiers and combine them

• We can see a quick improvement when using Adaboost with 15 classifiers
• We improved from a 22% error rate to 18%

library(fastAdaboost)
adaboost_model <- adaboost(quality_fac~., training_data, 15)
pred <- predict(adaboost_model, newdata=testing_data)
error <- pred$error
print(paste("Adaboost Error:", error))

"Adaboost Error: 0.184479237576583"

Page 17: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Can we improve this?
• Choosing 15 classifiers in the previous step was an arbitrary choice
• We can try many different numbers of classifiers and check performance
• Try a range of 1 to 50 classifiers

trials <- data.frame(Trees=numeric(0), Error=numeric(0))

for (num_of_trees in 1:50) {
  pred <- predict(adaboost(quality_fac~., training_data, num_of_trees),
                  newdata=testing_data)
  error <- pred$error
  print(paste("Adaboost Error at", num_of_trees, ":", error))
  trials[num_of_trees, ] <- c(num_of_trees, error)
}

Page 18: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Can we improve this? (cont.)

[1] "Adaboost Error at 1 : 0.243703199455412"
[1] "Adaboost Error at 2 : 0.277739959155888"
[1] "Adaboost Error at 3 : 0.219877467665078"
[1] "Adaboost Error at 4 : 0.228727025187202"
[1] "Adaboost Error at 5 : 0.208304969366916"
[1] "Adaboost Error at 6 : 0.222600408441116"
[1] "Adaboost Error at 7 : 0.194009530292716"
[1] "Adaboost Error at 8 : 0.200816882232811"
[1] "Adaboost Error at 9 : 0.191967324710688"
[1] "Adaboost Error at 10 : 0.195371000680735"
[1] "Adaboost Error at 11 : 0.193328795098707"
[1] "Adaboost Error at 12 : 0.198093941456773"
[1] "Adaboost Error at 13 : 0.185840707964602"
[1] "Adaboost Error at 14 : 0.183117767188564"
[1] "Adaboost Error at 15 : 0.184479237576583"
[1] "Adaboost Error at 16 : 0.189244383934649"
[1] "Adaboost Error at 17 : 0.186521443158611"
[1] "Adaboost Error at 18 : 0.189925119128659"
[1] "Adaboost Error at 19 : 0.185159972770592"
[1] "Adaboost Error at 20 : 0.184479237576583"
[1] "Adaboost Error at 21 : 0.182437031994554"
[1] "Adaboost Error at 22 : 0.187202178352621"
[1] "Adaboost Error at 23 : 0.181075561606535"
[1] "Adaboost Error at 24 : 0.184479237576583"
[1] "Adaboost Error at 25 : 0.174948944860449"
[1] "Adaboost Error at 26 : 0.183798502382573"
[1] "Adaboost Error at 27 : 0.181075561606535"
[1] "Adaboost Error at 28 : 0.176310415248468"
[1] "Adaboost Error at 29 : 0.179714091218516"
[1] "Adaboost Error at 30 : 0.180394826412526"
[1] "Adaboost Error at 31 : 0.179714091218516"
[1] "Adaboost Error at 32 : 0.177671885636487"
[1] "Adaboost Error at 33 : 0.171545268890402"
[1] "Adaboost Error at 34 : 0.171545268890402"
[1] "Adaboost Error at 35 : 0.170864533696392"
[1] "Adaboost Error at 36 : 0.172906739278421"
[1] "Adaboost Error at 37 : 0.170864533696392"
[1] "Adaboost Error at 38 : 0.175629680054459"
[1] "Adaboost Error at 39 : 0.171545268890402"
[1] "Adaboost Error at 40 : 0.17358747447243"
[1] "Adaboost Error at 41 : 0.17358747447243"
[1] "Adaboost Error at 42 : 0.175629680054459"
[1] "Adaboost Error at 43 : 0.17358747447243"
[1] "Adaboost Error at 44 : 0.172906739278421"
[1] "Adaboost Error at 45 : 0.174948944860449"
[1] "Adaboost Error at 46 : 0.174948944860449"
[1] "Adaboost Error at 47 : 0.174948944860449"
[1] "Adaboost Error at 48 : 0.171545268890402"
[1] "Adaboost Error at 49 : 0.172226004084411"
[1] "Adaboost Error at 50 : 0.172226004084411"

print(paste("The minimum is", min(trials$Error)))
[1] "The minimum is 0.170864533696392"

We improved from an 18% error rate with 15 classifiers to 17% with 35 classifiers.
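Rather than reading the minimum off the printout, which.min recovers the best row of trials directly:

trials[which.min(trials$Error), ]   # should show Trees = 35, Error ≈ 0.1709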

Page 19: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Can we improve this? (cont.)

Notice that, generally, the error decreases as the number of classifiers increases.

library(ggplot2)
qplot(Trees, Error, data=trials, color=Trees, xlab="Number of Classifiers",
      ylab="Error", geom=c("point", "smooth"))

Page 20: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Can we improve even further?
• One of the recommendations for data used with Adaboost is that it should not contain outliers, and we did not check for outliers
• We can check whether the data contains outliers and remove them to improve performance
• Below, we remove the outliers from fixed.acidity and volatile.acidity, and we repeat the same steps for all other variables in the dataset (see the loop sketch after the code)

wine_data2 <- wine_data

outlier <- boxplot.stats(wine_data2$fixed.acidity)$out
wine_data2$fixed.acidity <- ifelse(wine_data2$fixed.acidity %in% outlier,
                                   NA, wine_data2$fixed.acidity)
boxplot.stats(wine_data2$fixed.acidity)

wine_data2 <- wine_data2[complete.cases(wine_data2),]

outlier <- boxplot.stats(wine_data2$volatile.acidity)$out
wine_data2$volatile.acidity <- ifelse(wine_data2$volatile.acidity %in% outlier,
                                      NA, wine_data2$volatile.acidity)
boxplot.stats(wine_data2$volatile.acidity)

# … continue for all other variables
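Repeating that block eleven times invites copy-paste mistakes; here is a sketch of an equivalent loop over all the predictor columns. Note one assumption: it computes every outlier fence on the original data in a single pass, while the per-variable code above can re-derive fences after earlier removals, so results may differ slightly.

# Mark boxplot outliers as NA in each of the 11 predictors, then drop those rows
for (col in names(wine_data2)[1:11]) {
  out <- boxplot.stats(wine_data2[[col]])$out
  wine_data2[[col]][wine_data2[[col]] %in% out] <- NA
}
wine_data2 <- wine_data2[complete.cases(wine_data2), ]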

Page 21: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Can we improve even further? (cont.)
• After removing all the NA values, we are left with 3971 data instances
• We originally had 4898 data instances
• We lost 927 data instances to outliers across all the variables
• Is that worth it?

dim(wine_data)
[1] 4898   12

dim(wine_data2)
[1] 3971   12

Page 22: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Can we improve even further? (cont.)
• We will try multiple numbers of classifiers again, now that the outliers are removed

set.seed(797)
training_size <- round(0.7 * dim(wine_data2)[1])
training_sample <- sample(dim(wine_data2)[1], training_size, replace=FALSE)
testing_sample <- -training_sample

training_data <- wine_data2[training_sample,]
testing_data <- wine_data2[-training_sample,]
testing <- wine_data2$quality_fac[testing_sample]

trials2 <- data.frame(Trees=numeric(0), Error=numeric(0))

for (num_of_trees in 1:50) {
  pred <- predict(adaboost(quality_fac~., training_data, num_of_trees),
                  newdata=testing_data)
  error <- pred$error
  print(paste("Adaboost Error at", num_of_trees, ":", error))
  trials2[num_of_trees, ] <- c(num_of_trees, error)
}

Page 23: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Can we improve even further? (cont.)

[1] "Adaboost Error at 1 : 0.250209907640638"
[1] "Adaboost Error at 2 : 0.300587741393787"
[1] "Adaboost Error at 3 : 0.237615449202351"
[1] "Adaboost Error at 4 : 0.26448362720403"
[1] "Adaboost Error at 5 : 0.216624685138539"
[1] "Adaboost Error at 6 : 0.219983207388749"
[1] "Adaboost Error at 7 : 0.203190596137699"
[1] "Adaboost Error at 8 : 0.222502099076406"
[1] "Adaboost Error at 9 : 0.194794290512175"
[1] "Adaboost Error at 10 : 0.198152812762385"
[1] "Adaboost Error at 11 : 0.19311502938707"
[1] "Adaboost Error at 12 : 0.199832073887489"
[1] "Adaboost Error at 13 : 0.19647355163728"
[1] "Adaboost Error at 14 : 0.194794290512175"
[1] "Adaboost Error at 15 : 0.184718723761545"
[1] "Adaboost Error at 16 : 0.185558354324097"
[1] "Adaboost Error at 17 : 0.174643157010915"
[1] "Adaboost Error at 18 : 0.18975650713686"
[1] "Adaboost Error at 19 : 0.174643157010915"
[1] "Adaboost Error at 20 : 0.185558354324097"
[1] "Adaboost Error at 21 : 0.17632241813602"
[1] "Adaboost Error at 22 : 0.182199832073887"
[1] "Adaboost Error at 23 : 0.174643157010915"
[1] "Adaboost Error at 24 : 0.182199832073887"
[1] "Adaboost Error at 25 : 0.174643157010915"
[1] "Adaboost Error at 26 : 0.17296389588581"
[1] "Adaboost Error at 27 : 0.180520570948783"
[1] "Adaboost Error at 28 : 0.174643157010915"
[1] "Adaboost Error at 29 : 0.17296389588581"
[1] "Adaboost Error at 30 : 0.183879093198992"
[1] "Adaboost Error at 31 : 0.174643157010915"
[1] "Adaboost Error at 32 : 0.178001679261125"
[1] "Adaboost Error at 33 : 0.170445004198153"
[1] "Adaboost Error at 34 : 0.168765743073048"
[1] "Adaboost Error at 35 : 0.164567590260285"
[1] "Adaboost Error at 36 : 0.1696053736356"
[1] "Adaboost Error at 37 : 0.162048698572628"
[1] "Adaboost Error at 38 : 0.168765743073048"
[1] "Adaboost Error at 39 : 0.167086481947943"
[1] "Adaboost Error at 40 : 0.161209068010076"
[1] "Adaboost Error at 41 : 0.1696053736356"
[1] "Adaboost Error at 42 : 0.165407220822838"
[1] "Adaboost Error at 43 : 0.171284634760705"
[1] "Adaboost Error at 44 : 0.174643157010915"
[1] "Adaboost Error at 45 : 0.167926112510495"
[1] "Adaboost Error at 46 : 0.171284634760705"
[1] "Adaboost Error at 47 : 0.1696053736356"
[1] "Adaboost Error at 48 : 0.170445004198153"
[1] "Adaboost Error at 49 : 0.16624685138539"
[1] "Adaboost Error at 50 : 0.168765743073048"

print(paste("The minimum is", min(trials2$Error)))
[1] "The minimum is 0.161209068010076"

We improved from a 17% error rate with outliers in the data to 16% after removing the outliers (not a substantial improvement, but good enough).

Page 24: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Can we improve even further? (cont.)

Here, too, we notice that, generally, the error decreases as the number of classifiers increases.

qplot(Trees, Error, data=trials2, color=Trees, xlab="Number of Classifiers",
      ylab="Error", geom=c("point", "smooth"))

Page 25: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Comparison

Model                                 Misclassification error
Single Tree                           22%
Adaboost (random # of classifiers)    18%
Adaboost (with outliers)              17%
Adaboost (no outliers)                16%