Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees
MOHAMMED ALHAMADI - PROJECT 2

Page 1: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees
MOHAMMED ALHAMADI - PROJECT 2

Page 2: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Acknowledgement

This project was done as a partial requirement for the course Introduction to Machine Learning, offered online in Fall 2016 through Tandon Online, Tandon School of Engineering, NYU.

Page 3: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Outline

1. Adaboost Algorithm

2. The Dataset

3. Data Exploration and Visualization

4. Factorizing the Dependent Variable

5. Prediction Using a Single Decision Tree

6. Prediction Using Adaboost

7. Can we improve this?

8. Can we improve even further?

9. Comparison

10. References

Page 4: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Adaboost Algorithm
• Adaboost is short for “Adaptive Boosting”

• Created by Yoav Freund and Robert Schapire in 1997

• Adaboost is a meta-algorithm: it runs on top of other learning algorithms rather than learning on its own

• Can be used with other learning algorithms to improve their performance
• Combines many weak classifiers into a single strong one
• The weak classifiers are typically decision trees (which we’ll use in this project)

Page 5: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Adaboost Algorithm (cont.)
• Weak models are added sequentially, each trained on a reweighted version of the training data

• The process continues until:
  ◦ No further improvement can be made on the training data set, or
  ◦ A threshold number of weak classifiers has been created

• Predictions are then made by taking the weighted average of the weak classifiers’ outputs
• Each weak classifier has a weight, and its classification result is multiplied by that weight (a from-scratch sketch follows)
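To make the weighted combination concrete, here is a minimal from-scratch sketch of the AdaBoost.M1 training loop, using rpart stumps as the weak classifiers. This is illustrative only (the fastAdaboost package used later implements the real algorithm in C++); adaboost_sketch, predict_sketch, and the -1/+1 label coding are our own assumptions, not part of the project code.

library(rpart)

# Train n_rounds depth-1 trees ("stumps") on repeatedly reweighted data.
adaboost_sketch <- function(x, y, n_rounds = 15) {
  n <- length(y)                            # y must be coded as -1/+1
  w <- rep(1 / n, n)                        # start with uniform example weights
  stumps <- vector("list", n_rounds)
  alpha  <- numeric(n_rounds)
  d <- data.frame(x, y = factor(y))
  for (t in 1:n_rounds) {
    stumps[[t]] <- rpart(y ~ ., data = d, weights = w,
                         control = rpart.control(maxdepth = 1))
    h <- ifelse(predict(stumps[[t]], d, type = "class") == "1", 1, -1)
    err <- sum(w[h != y])                   # weighted training error
    err <- min(max(err, 1e-10), 1 - 1e-10)  # guard the log below
    alpha[t] <- 0.5 * log((1 - err) / err)  # this classifier's weight
    w <- w * exp(-alpha[t] * y * h)         # up-weight misclassified points
    w <- w / sum(w)                         # renormalize
  }
  list(stumps = stumps, alpha = alpha)
}

# Predict by a weighted vote of all stumps.
predict_sketch <- function(model, x) {
  d <- data.frame(x)
  scores <- vapply(seq_along(model$stumps), function(t) {
    h <- ifelse(predict(model$stumps[[t]], d, type = "class") == "1", 1, -1)
    model$alpha[t] * h
  }, numeric(nrow(d)))
  sign(rowSums(scores))                     # weighted majority vote
}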

Page 6: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Adaboost Algorithm (cont.)
• The data should ideally have these characteristics before it is processed with Adaboost:
  ◦ Quality data: the ensemble method attempts to correct misclassifications in the training data, so make sure the training data is of high quality
  ◦ Outliers: outliers force the ensemble to work very hard to correct unrealistic cases, so it is best to remove them from the dataset
  ◦ Noisy data: noise in the output variable can be problematic for the ensemble, so isolate and clean noisy data from your training dataset

Page 7: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

The Data Set
• The data set used in this project contains 4898 observations of white wine varieties, with quality ranked by wine tasters
• It contains 11 independent variables and 1 dependent variable
• The independent variables are: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol
• The dependent variable is the quality of the wine, ranked from 3 (lowest quality) to 9 (highest quality)

Page 8: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Data Exploration and Visualization

wine_data <- read.csv("C:/Users/Mohammed/Google Drive/R_code/Project1/winequality-white.csv",
                      header=TRUE, sep=";")
dim(wine_data)

[1] 4898   12

names(wine_data)

 [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"          "residual.sugar"
 [5] "chlorides"            "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"
 [9] "pH"                   "sulphates"            "alcohol"              "quality"

Page 9: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Data Exploration and Visualization (cont.)

str(wine_data)

'data.frame':   4898 obs. of  12 variables:
 $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
 $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
 $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
 $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
 $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
 $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
 $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
 $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
 $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
 $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
 $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
 $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

Page 10: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Data Exploration and Visualization (cont.)

range(wine_data$fixed.acidity)
[1] 3.8 14.2
range(wine_data$volatile.acidity)
[1] 0.08 1.10
range(wine_data$citric.acid)
[1] 0.00 1.66
range(wine_data$residual.sugar)
[1] 0.6 65.8
range(wine_data$chlorides)
[1] 0.009 0.346
range(wine_data$free.sulfur.dioxide)
[1] 2 289
range(wine_data$total.sulfur.dioxide)
[1] 9 440
range(wine_data$density)
[1] 0.98711 1.03898
range(wine_data$pH)
[1] 2.72 3.82
range(wine_data$sulphates)
[1] 0.22 1.08
range(wine_data$alcohol)
[1] 8.0 14.2
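The eleven range() calls above can also be collapsed into one line; a small sketch that prints the min and max of every column at once:

sapply(wine_data, range)   # 2 rows (min, max), one column per variable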

Page 11: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Data Exploration and Visualization (cont.)

cor(wine_data)
(values rounded to two decimals)

             Fixed acid  Vol. acid  Citric acid  Res.Sugar  Chlorides  FS dioxide  TS dioxide  Density     pH  Sulphates  Alcohol  Quality
Fixed acid         1.00      -0.02         0.29       0.09       0.02       -0.05        0.10     0.27  -0.43      -0.02    -0.12    -0.11
Vol. acid         -0.02       1.00        -0.15       0.06       0.07       -0.10        0.09     0.03  -0.03      -0.04     0.07    -0.19
Citric acid        0.29      -0.15         1.00       0.09       0.11        0.09        0.12     0.15  -0.16       0.06    -0.08    -0.01
Res.Sugar          0.09       0.06         0.09       1.00       0.09        0.30        0.40     0.83  -0.20      -0.02    -0.05    -0.10
Chlorides          0.02       0.07         0.11       0.09       1.00        0.10        0.20     0.26  -0.10       0.02    -0.36    -0.21
FS dioxide        -0.05      -0.10         0.09       0.30       0.10        1.00        0.62     0.30  -0.01       0.01    -0.25     0.01
TS dioxide         0.10       0.09         0.12       0.40       0.20        0.62        1.00     0.53   0.01       0.13    -0.45    -0.17
Density            0.27       0.03         0.15       0.83       0.26        0.30        0.53     1.00  -0.10       0.07    -0.80    -0.31
pH                -0.43      -0.03        -0.16      -0.20      -0.10       -0.01        0.01    -0.10   1.00       0.16     0.12     0.10
Sulphates         -0.02      -0.04         0.06      -0.02       0.02        0.01        0.13     0.07   0.16       1.00    -0.02     0.05
Alcohol           -0.12       0.07        -0.08      -0.05      -0.36       -0.25       -0.45    -0.80   0.12      -0.02     1.00     0.44
Quality           -0.11      -0.19        -0.01      -0.10      -0.21        0.01       -0.17    -0.31   0.10       0.05     0.44     1.00
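Since the last row is what we ultimately care about, a one-line sketch pulls the quality column out of the correlation matrix and sorts it (alcohol, at 0.44, and density, at -0.31, stand out):

round(sort(cor(wine_data)[, "quality"]), 2)   # correlations with quality, ascending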

Page 12: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Data Exploration and Visualization (cont.)

hist(wine_data$alcohol, col="#EE3B3B", main="Histogram of Alcohol Percent in Wine",
     xlab="Alcohol Percent", ylab="Number of samples", las=1)
hist(wine_data$density, col="#BCEE6B", main="Histogram of Wine Density",
     xlab="Density", ylab="Number of samples", las=1)
hist(wine_data$chlorides, col="#CDB79E", main="Histogram of Chlorides in Wine",
     xlab="Chlorides", ylab="Number of samples", las=1)

Page 13: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Data Exploration and Visualization (cont.)

hist(wine_data$quality, col="#458B74", main="Wine Quality Histogram",
     xlab="Quality", ylab="Number of samples")
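The histogram's counts can also be read off in text form; summing the buckets on either side of 6 gives the high/low split used on the next slide:

table(wine_data$quality)          # observations per quality rank, 3 through 9
table(wine_data$quality >= 6)     # TRUE = "high" (3258), FALSE = "low" (1640)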

Page 14: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Factorizing the Dependent Variable

quality_fac <- ifelse(wine_data$quality >= 6, "high", "low")

wine_data <- data.frame(wine_data, quality_fac)

table(wine_data$quality_fac)

high  low
3258 1640

• We can now remove the old integer dependent variable (quality)

wine_data <- wine_data[,-12]
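One caveat worth hedging: C5.0 and adaboost expect a factor response, and since R 4.0 data.frame() no longer converts character columns to factors by default, so on a recent R version it is safer to coerce explicitly:

wine_data$quality_fac <- factor(wine_data$quality_fac)   # ensure "high"/"low" is a factor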

Page 15: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Prediction Using a Single Decision Tree
• To have a baseline for comparing Adaboost’s performance, we will first use a single decision tree
• We use the C50 package in R
• The misclassification error rate is almost 22% with a single tree (output below)

set.seed(797)
training_size <- round(0.7 * dim(wine_data)[1])
training_sample <- sample(dim(wine_data)[1], training_size, replace=FALSE)
testing_sample <- -training_sample
training_data <- wine_data[training_sample,]
testing_data <- wine_data[-training_sample,]
testing <- quality_fac[testing_sample]

library(C50)
C50_model <- C5.0(quality_fac~., data=training_data)
predict_C50 <- predict(C50_model, testing_data[,-12])
mean(predict_C50 != testing)

  [1] 0.2178353
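A single error rate hides which class the tree gets wrong; a confusion matrix (a quick sketch using the objects above) breaks the 22% down by class:

table(Predicted = predict_C50, Actual = testing)   # rows: predicted, columns: actual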

Page 16: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Prediction Using Adaboost
• We will use the fastAdaboost library

• Initially, we will choose to use 15 classifiers and combine them

• We can see a quick improvement when using Adaboost with 15 classifiers
• We improved from a 22% error rate to 18%

library(fastAdaboost)
adaboost_model <- adaboost(quality_fac~., training_data, 15)
pred <- predict(adaboost_model, newdata=testing_data)
error <- pred$error
print(paste("Adaboost Error:", error))

"Adaboost Error: 0.184479237576583"

Page 17: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Can we improve this?
• Choosing 15 classifiers in the previous step was an arbitrary choice
• We can try many different numbers of classifiers and check performance
• Try a range of 1 to 50 classifiers

trials <- data.frame(Trees=numeric(0), Error=numeric(0))

for (num_of_trees in 1:50) {
  pred <- predict(adaboost(quality_fac~., training_data, num_of_trees),
                  newdata=testing_data)
  error <- pred$error
  print(paste("Adaboost Error at", num_of_trees, ":", error))
  trials[num_of_trees, ] <- c(num_of_trees, error)
}

Page 18: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Can we improve this? (cont.)

[1] "Adaboost Error at 1 : 0.243703199455412"
[1] "Adaboost Error at 2 : 0.277739959155888"
[1] "Adaboost Error at 3 : 0.219877467665078"
[1] "Adaboost Error at 4 : 0.228727025187202"
[1] "Adaboost Error at 5 : 0.208304969366916"
[1] "Adaboost Error at 6 : 0.222600408441116"
[1] "Adaboost Error at 7 : 0.194009530292716"
[1] "Adaboost Error at 8 : 0.200816882232811"
[1] "Adaboost Error at 9 : 0.191967324710688"
[1] "Adaboost Error at 10 : 0.195371000680735"
[1] "Adaboost Error at 11 : 0.193328795098707"
[1] "Adaboost Error at 12 : 0.198093941456773"
[1] "Adaboost Error at 13 : 0.185840707964602"
[1] "Adaboost Error at 14 : 0.183117767188564"
[1] "Adaboost Error at 15 : 0.184479237576583"
[1] "Adaboost Error at 16 : 0.189244383934649"
[1] "Adaboost Error at 17 : 0.186521443158611"
[1] "Adaboost Error at 18 : 0.189925119128659"
[1] "Adaboost Error at 19 : 0.185159972770592"
[1] "Adaboost Error at 20 : 0.184479237576583"
[1] "Adaboost Error at 21 : 0.182437031994554"
[1] "Adaboost Error at 22 : 0.187202178352621"
[1] "Adaboost Error at 23 : 0.181075561606535"
[1] "Adaboost Error at 24 : 0.184479237576583"
[1] "Adaboost Error at 25 : 0.174948944860449"
[1] "Adaboost Error at 26 : 0.183798502382573"
[1] "Adaboost Error at 27 : 0.181075561606535"
[1] "Adaboost Error at 28 : 0.176310415248468"
[1] "Adaboost Error at 29 : 0.179714091218516"
[1] "Adaboost Error at 30 : 0.180394826412526"
[1] "Adaboost Error at 31 : 0.179714091218516"
[1] "Adaboost Error at 32 : 0.177671885636487"
[1] "Adaboost Error at 33 : 0.171545268890402"
[1] "Adaboost Error at 34 : 0.171545268890402"
[1] "Adaboost Error at 35 : 0.170864533696392"
[1] "Adaboost Error at 36 : 0.172906739278421"
[1] "Adaboost Error at 37 : 0.170864533696392"
[1] "Adaboost Error at 38 : 0.175629680054459"
[1] "Adaboost Error at 39 : 0.171545268890402"
[1] "Adaboost Error at 40 : 0.17358747447243"
[1] "Adaboost Error at 41 : 0.17358747447243"
[1] "Adaboost Error at 42 : 0.175629680054459"
[1] "Adaboost Error at 43 : 0.17358747447243"
[1] "Adaboost Error at 44 : 0.172906739278421"
[1] "Adaboost Error at 45 : 0.174948944860449"
[1] "Adaboost Error at 46 : 0.174948944860449"
[1] "Adaboost Error at 47 : 0.174948944860449"
[1] "Adaboost Error at 48 : 0.171545268890402"
[1] "Adaboost Error at 49 : 0.172226004084411"
[1] "Adaboost Error at 50 : 0.172226004084411"

print(paste("The minimum is", min(trials$Error)))
[1] "The minimum is 0.170864533696392"

We improved from an 18% error rate with 15 classifiers to 17% with 35 classifiers.
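Rather than reading the minimum off the printout, which.min recovers the best row of trials directly:

trials[which.min(trials$Error), ]   # should show Trees = 35, Error ≈ 0.1709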

Page 19: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Can we improve this? (cont.)

Notice that, generally, the error decreases as the number of classifiers increases.

library(ggplot2)
qplot(Trees, Error, data=trials, color=Trees, xlab="Number of Classifiers",
      ylab="Error", geom=c("point", "smooth"))

Page 20: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Can we improve even further?
• One of the recommendations for data used with Adaboost is that it should not contain outliers, and we did not check for outliers
• We can check whether the data contains outliers and remove them to improve performance
• Below, we remove the outliers from fixed.acidity and volatile.acidity, and we repeat the same steps for all other variables in the dataset (see the loop sketch after the code)

wine_data2 <- wine_data

outlier <- boxplot.stats(wine_data2$fixed.acidity)$out
wine_data2$fixed.acidity <- ifelse(wine_data2$fixed.acidity %in% outlier,
                                   NA, wine_data2$fixed.acidity)
boxplot.stats(wine_data2$fixed.acidity)

wine_data2 <- wine_data2[complete.cases(wine_data2),]

outlier <- boxplot.stats(wine_data2$volatile.acidity)$out
wine_data2$volatile.acidity <- ifelse(wine_data2$volatile.acidity %in% outlier,
                                      NA, wine_data2$volatile.acidity)
boxplot.stats(wine_data2$volatile.acidity)

# … continue for all other variables
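Repeating that block eleven times invites copy-paste mistakes; here is a sketch of an equivalent loop over all the predictor columns. Note one assumption: it computes every outlier fence on the original data in a single pass, while the per-variable code above can re-derive fences after earlier removals, so results may differ slightly.

# Mark boxplot outliers as NA in each of the 11 predictors, then drop those rows
for (col in names(wine_data2)[1:11]) {
  out <- boxplot.stats(wine_data2[[col]])$out
  wine_data2[[col]][wine_data2[[col]] %in% out] <- NA
}
wine_data2 <- wine_data2[complete.cases(wine_data2), ]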

Page 21: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Can we improve even further? (cont.)
• After removing all the NA values, we are left with 3971 data instances
• We originally had 4898 data instances
• We lost 927 data instances to outliers across all the variables
• Is that worth it?

dim(wine_data)
[1] 4898   12

dim(wine_data2)
[1] 3971   12

Page 22: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Can we improve even further? (cont.)
• We will try multiple numbers of classifiers again, now that the outliers are removed

set.seed(797)
training_size <- round(0.7 * dim(wine_data2)[1])
training_sample <- sample(dim(wine_data2)[1], training_size, replace=FALSE)
testing_sample <- -training_sample

training_data <- wine_data2[training_sample,]
testing_data <- wine_data2[-training_sample,]
testing <- wine_data2$quality_fac[testing_sample]

trials2 <- data.frame(Trees=numeric(0), Error=numeric(0))

for (num_of_trees in 1:50) {
  pred <- predict(adaboost(quality_fac~., training_data, num_of_trees),
                  newdata=testing_data)
  error <- pred$error
  print(paste("Adaboost Error at", num_of_trees, ":", error))
  trials2[num_of_trees, ] <- c(num_of_trees, error)
}

Page 23: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Can we improve even further? (cont.)

[1] "Adaboost Error at 1 : 0.250209907640638"
[1] "Adaboost Error at 2 : 0.300587741393787"
[1] "Adaboost Error at 3 : 0.237615449202351"
[1] "Adaboost Error at 4 : 0.26448362720403"
[1] "Adaboost Error at 5 : 0.216624685138539"
[1] "Adaboost Error at 6 : 0.219983207388749"
[1] "Adaboost Error at 7 : 0.203190596137699"
[1] "Adaboost Error at 8 : 0.222502099076406"
[1] "Adaboost Error at 9 : 0.194794290512175"
[1] "Adaboost Error at 10 : 0.198152812762385"
[1] "Adaboost Error at 11 : 0.19311502938707"
[1] "Adaboost Error at 12 : 0.199832073887489"
[1] "Adaboost Error at 13 : 0.19647355163728"
[1] "Adaboost Error at 14 : 0.194794290512175"
[1] "Adaboost Error at 15 : 0.184718723761545"
[1] "Adaboost Error at 16 : 0.185558354324097"
[1] "Adaboost Error at 17 : 0.174643157010915"
[1] "Adaboost Error at 18 : 0.18975650713686"
[1] "Adaboost Error at 19 : 0.174643157010915"
[1] "Adaboost Error at 20 : 0.185558354324097"
[1] "Adaboost Error at 21 : 0.17632241813602"
[1] "Adaboost Error at 22 : 0.182199832073887"
[1] "Adaboost Error at 23 : 0.174643157010915"
[1] "Adaboost Error at 24 : 0.182199832073887"
[1] "Adaboost Error at 25 : 0.174643157010915"
[1] "Adaboost Error at 26 : 0.17296389588581"
[1] "Adaboost Error at 27 : 0.180520570948783"
[1] "Adaboost Error at 28 : 0.174643157010915"
[1] "Adaboost Error at 29 : 0.17296389588581"
[1] "Adaboost Error at 30 : 0.183879093198992"
[1] "Adaboost Error at 31 : 0.174643157010915"
[1] "Adaboost Error at 32 : 0.178001679261125"
[1] "Adaboost Error at 33 : 0.170445004198153"
[1] "Adaboost Error at 34 : 0.168765743073048"
[1] "Adaboost Error at 35 : 0.164567590260285"
[1] "Adaboost Error at 36 : 0.1696053736356"
[1] "Adaboost Error at 37 : 0.162048698572628"
[1] "Adaboost Error at 38 : 0.168765743073048"
[1] "Adaboost Error at 39 : 0.167086481947943"
[1] "Adaboost Error at 40 : 0.161209068010076"
[1] "Adaboost Error at 41 : 0.1696053736356"
[1] "Adaboost Error at 42 : 0.165407220822838"
[1] "Adaboost Error at 43 : 0.171284634760705"
[1] "Adaboost Error at 44 : 0.174643157010915"
[1] "Adaboost Error at 45 : 0.167926112510495"
[1] "Adaboost Error at 46 : 0.171284634760705"
[1] "Adaboost Error at 47 : 0.1696053736356"
[1] "Adaboost Error at 48 : 0.170445004198153"
[1] "Adaboost Error at 49 : 0.16624685138539"
[1] "Adaboost Error at 50 : 0.168765743073048"

print(paste("The minimum is", min(trials2$Error)))
[1] "The minimum is 0.161209068010076"

We improved from a 17% error rate with outliers in the data to 16% after removing the outliers (not a substantial improvement, but good enough).

Page 24: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Can we improve even further? (cont.)

Here, too, we notice that, generally, the error decreases as the number of classifiers increases.

qplot(Trees, Error, data=trials2, color=Trees, xlab="Number of Classifiers",
      ylab="Error", geom=c("point", "smooth"))

Page 25: Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees

Comparison

Model                                 Misclassification error
Single Tree                           22%
Adaboost (random # of classifiers)    18%
Adaboost (with outliers)              17%
Adaboost (no outliers)                16%