Data Analysis on Bank Marketing Data Set
Anish Bhanushali

Page 1: Data analysis on bank data

Data Analysis on Bank Marketing Data Set
Anish Bhanushali

Page 2: Data analysis on bank data

Information about the dataset
• UCI Machine Learning Repository link: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
• This dataset has 20 input attributes.
• Attributes 2–15 have categorical values.
• The 21st attribute, named 'y', is the class attribute we want to predict.
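Before the code on the following slides will run, the CSV from the UCI page has to be loaded into a data frame named bank_data. A minimal sketch is shown below; the file name is an assumption based on the 'bank-additional-full.csv' file distributed on the repository page, which uses ';' as the field separator.

# Assumed loading step (not part of the original slides):
# the UCI file 'bank-additional-full.csv' is semicolon-separated.
bank_data <- read.csv("bank-additional-full.csv", sep = ";", stringsAsFactors = TRUE)
str(bank_data)   # categorical columns should appear as factors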

Page 3: Data analysis on bank data

Using logistic regression for classification
• Assign numerical values to the categorical input data and normalize the numeric attributes.
• To convert categorical data into numeric form we use one-hot style (treatment contrast) encoding.
• In this type of encoding, if a categorical attribute has n distinct values, the system creates a table of n x (n-1) numerical values associated with that attribute.
• In each row of that table at most one entry is 1 and the remaining entries are 0.

Page 4: Data analysis on bank data

Example of categorical-to-numeric conversion
• The 2nd attribute, job, has 12 levels (i.e. 12 distinct values).
• After conversion, a contrasts table is attached to the attribute.
• Here you can see that each value is coded as an 11-bit binary vector.
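As a minimal illustration of the coding scheme (using a hypothetical 3-level factor rather than the actual 12-level job column), contr.treatment produces an n x (n-1) table whose first level is the all-zeros baseline:

f <- factor(c("admin.", "blue-collar", "technician"))   # hypothetical 3-level factor
contr.treatment(nlevels(f))
#   2 3
# 1 0 0   <- baseline level, all zeros
# 2 1 0
# 3 0 1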

Page 5: Data analysis on bank data

R code that converts all categorical inputs to numeric values

colum_list = c(2,3,4,5,6,7,8,9,10,15)
for(i in colum_list){
  # number of distinct levels in this categorical column
  n = length(levels(bank_data[[i]]))
  # attach an n x (n-1) treatment-contrast table to the factor
  contrasts(bank_data[[i]]) = contr.treatment(n)
}

Page 6: Data analysis on bank data

Normalizing attributes
The following R code normalizes the attributes that have numerical values (other than the attributes whose values are only 0 or 1).

normal = function(x){
  # min-max normalization: rescale values to the range [0, 1]
  return ((x - min(x))/(max(x) - min(x)))
}
colum_list = c(11,12,13,14,16,17,18,19,20)
for (i in colum_list){
  bank_data[[i]] = normal(bank_data[[i]])
  bank_data <<- bank_data
  print(bank_data[[i]])
}
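For example, applying this function to a small numeric vector (a hypothetical example, not a column of the dataset) shows the rescaling:

normal(c(2, 4, 6))
# [1] 0.0 0.5 1.0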

Page 7: Data analysis on bank data

Preparing test and train data
• We take approximately 9% of the data as the test set and the rest as the training set.
• While dividing the data into test and train sets we should take care of the proportion of "yes" and "no" class values.
• In the whole dataset, about 11% of the rows in the 21st column have the value "yes" and 89% have "no".
• We have to maintain the same proportion in the test data as well.

Page 8: Data analysis on bank data

R code for making the test/train set

bank_data_yes = bank_data[bank_data$y=="yes" , ]
bank_data_no  = bank_data[bank_data$y=="no" , ]
# 3000 TRUE flags mark the rows that will go into the negative test set
true = vector('logical', length = 3000)
true = !true
false = vector('logical', length = (length(bank_data_no[[1]]) - 3000))
total_index_no = c(false, true)
# shuffle the flags by ordering them with random numbers
x_no = runif(length(bank_data_no[[1]]))
total_index_no = total_index_no[order(x_no)]
test_no = bank_data_no[total_index_no ,]

This gives the whole negative test set in test_no.

Page 9: Data analysis on bank data

R code for making the test/train set

true_yes = vector('logical', length = 400)
true_yes = !true_yes
false_yes = vector('logical', length = (length(bank_data_yes[[1]]) - 400))
total_index_yes = c(false_yes, true_yes)
# shuffle the 400 TRUE flags among the "yes" rows
x_yes = runif(length(bank_data_yes[[1]]))
length(x_yes)
length(total_index_yes)
total_index_yes = total_index_yes[order(x_yes)]
test_yes = bank_data_yes[total_index_yes ,]
total_test = as.data.frame(rbind(test_yes, test_no))

This gives the whole positive test set in test_yes; we then combine the two into one dataset using rbind() and name it total_test.

Page 10: Data analysis on bank data

R code for making the test/train set

train_yes = bank_data_yes[!total_index_yes ,]
train_no  = bank_data_no[!total_index_no , ]
total_train = as.data.frame(rbind(train_yes, train_no))

• These commands build the train dataset by excluding the test rows from the main dataset.
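A quick check (a sketch, not from the original slides) confirms that the class proportions in the test set stay close to the roughly 11%/89% split of the full dataset:

prop.table(table(total_test$y))    # share of "no" / "yes" in the test set
prop.table(table(bank_data$y))     # share in the whole dataset, for comparison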

Page 11: Data analysis on bank data

Using glm for logistic regression

model <- glm(total_train$y ~ ., family = binomial(link='logit'), data = total_train[,-11])

• Here we have not included the 11th column (duration) in the train dataset because it is clearly mentioned on the UCI repository page that "this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model."

Page 12: Data analysis on bank data

Summary of the model
• The summary(model) command gives the output shown below, and *** indicates the most relevant attributes.

Page 13: Data analysis on bank data

Predicting the test data with the logistic regression model
The code below gives us the predicted output for the test set; notice that here we have again excluded the 11th column.

fitted.results <- predict(model,total_test[,-11],type='response')

fitted.results_yes_no <- ifelse(fitted.results > 0.5,"yes","no")

table(total_test$y , fitted.results_yes_no)

Here we have used the threshold value 0.5, which gives good overall accuracy but cannot avoid a large error in the 'true positive' predictions.

Page 14: Data analysis on bank data

Accuracy
• Confusion matrix with 0.5 as the threshold
• Here we get an overall accuracy of 89.5%, but if you look only at the true ("yes") rows, they are predicted with an accuracy of only 20.5%.
• To avoid such a loss we will analyze the ROC curve.
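The two numbers above can be read directly off the confusion matrix; a small sketch of the computation (assuming the table produced on the previous slide) is:

cm <- table(total_test$y, fitted.results_yes_no)        # rows: actual, columns: predicted
overall_accuracy   <- sum(diag(cm)) / sum(cm)           # correct predictions over all rows
true_positive_rate <- cm["yes", "yes"] / sum(cm["yes", ])  # accuracy on the actual "yes" rows
overall_accuracy
true_positive_rate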

Page 15: Data analysis on bank data

R code to plot the ROC curve
You'll need the "ROCR" package.

require(ROCR)
pr <- prediction(fitted.results, total_test$y)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

Page 16: Data analysis on bank data

ROC curve
• This is the ROC curve, and here we can clearly see that the maximum 'true positive' rate we can reach is only about 62%.
• The area under this curve is given by the following code:

auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]

• The value of auc is 0.7618.

Page 17: Data analysis on bank data

Increasing the true positive rate
• To increase the true positive rate we have to change the threshold.
• It was observed that decreasing the threshold from 0.5 increases the number of true positives.
• But from the ROC curve we can say that the optimum true positive rate we can achieve is between 0.60 and 0.62.
• For this process we slowly decrease the threshold and observe the true positive rate at each step, as sketched below.
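A minimal sketch of that sweep (the grid of candidate thresholds is an assumption, not taken from the slides):

for (t in seq(0.5, 0.05, by = -0.05)){
  # re-threshold the fitted probabilities and measure the true positive rate
  pred_t <- ifelse(fitted.results > t, "yes", "no")
  tpr <- sum(pred_t == "yes" & total_test$y == "yes") / sum(total_test$y == "yes")
  cat("threshold =", t, " true positive rate =", round(tpr, 3), "\n")
}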

Page 18: Data analysis on bank data

The optimal threshold is 0.12
• Here if we run this code:

fitted.results <- predict(model,total_test[,-11],type='response')

fitted.results_yes_no <- ifelse(fitted.results > 0.12,"yes","no")

table(total_test$y , fitted.results_yes_no)

we will get this confusion matrix

Overall accuracy = 82.5% and true positive rate = 60% (0.6)

Page 19: Data analysis on bank data

Using naïve Bayes
• R code:

require(e1071)

naive_model <- naiveBayes(total_train$y ~. , data = total_train[,-11] , laplace = 0)

result = predict(naive_model , total_test[,-11])

table(total_test$y , result)

• Here we get this confusion matrix.
• Accuracy = 83.76%, true positive rate = 53.5% (0.535)
• NOTE: it was observed that if we use Laplace smoothing the true positive rate of the result decreases.

Page 20: Data analysis on bank data

Using SVM with no kernel specified (the default)
• R code:

require(e1071)

svmmod <- svm(total_train$y ~.,data = total_train[,-11] )

pred <- predict(svmmod, total_test[,c(-11,-21)], decision.values = TRUE)

table(total_test$y , pred)

This dataset has about 40k rows, and svm will take a long time to build a predictive model from it, but you can load an already saved SVM model and test your data on that.
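Saving a fitted model so it can be reloaded later (a sketch; the file name matches the 'svm_model.rda' file mentioned on the next slide) looks like this:

# after svmmod has been fitted once, store it in the working directory
save(svmmod, file = "svm_model.rda")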

Page 21: Data analysis on bank data

SVM: steps to load the existing model and predict
Store the 'svm_model.rda' file in your working directory and run this code:

load("svm_model.rda")

ls() #to check if svmmod is loaded or not

pred <- predict(svmmod, total_test[,c(-11,-21)], decision.values = TRUE)

Make sure that you include all necessary libraries before running the ‘predict’ method

Page 22: Data analysis on bank data

SVM accuracy
• This is the confusion matrix we got using SVM.
• Overall accuracy is 89.41%, but if we look at the true positive rate it is 17.75% (0.177), which is very low compared to all the previous methods we saw.

Page 23: Data analysis on bank data

Final verdict
• This dataset shows good results with the logistic regression and naïve Bayes methods.
• SVM gives good overall accuracy but fails in terms of true positive rate.
• This dataset has lots of categorical attributes, which suggests it could be classified well by decision trees.