Data Analysis on Bank Marketing Data Set
Anish Bhanushali

Page 1: Data analysis on bank data

Data Analysis on Bank Marketing Data Set
Anish Bhanushali

Page 2: Data analysis on bank data

Information about the dataset
• UCI Machine Learning Repository link: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
• This dataset has 20 input attributes.
• Attributes 2–15 have categorical values.
• The 21st attribute, named 'y', is the class attribute we want to predict.
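Before the code on the following slides will run, the CSV from the UCI page has to be loaded into a data frame named bank_data. A minimal sketch is shown below; the file name is an assumption based on the 'bank-additional-full.csv' file distributed on the repository page, which uses ';' as the field separator.

# Assumed loading step (not part of the original slides):
# the UCI file 'bank-additional-full.csv' is semicolon-separated.
bank_data <- read.csv("bank-additional-full.csv", sep = ";", stringsAsFactors = TRUE)
str(bank_data)   # categorical columns should appear as factors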

Page 3: Data analysis on bank data

Using logistic regression for classification
• Assign numerical values to the categorical input data and normalize the numeric attributes.
• To convert categorical data into numeric form we use one-hot style (treatment contrast) encoding.
• In this type of encoding, if a categorical attribute has n distinct values, the system creates a table of n x (n-1) numerical values associated with that attribute.
• In each row of that table at most one entry is 1 and the remaining entries are 0.

Page 4: Data analysis on bank data

Example of categorical-to-numeric conversion
• The 2nd attribute, job, has 12 levels (i.e. 12 distinct values).
• After conversion, a contrasts table is attached to the attribute.
• Here you can see that each value is coded as an 11-bit binary vector.
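As a minimal illustration of the coding scheme (using a hypothetical 3-level factor rather than the actual 12-level job column), contr.treatment produces an n x (n-1) table whose first level is the all-zeros baseline:

f <- factor(c("admin.", "blue-collar", "technician"))   # hypothetical 3-level factor
contr.treatment(nlevels(f))
#   2 3
# 1 0 0   <- baseline level, all zeros
# 2 1 0
# 3 0 1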

Page 5: Data analysis on bank data

R code that converts all categorical inputs to numeric values

colum_list = c(2,3,4,5,6,7,8,9,10,15)
for(i in colum_list){
  # number of distinct levels in this categorical column
  n = length(levels(bank_data[[i]]))
  # attach an n x (n-1) treatment-contrast table to the factor
  contrasts(bank_data[[i]]) = contr.treatment(n)
}

Page 6: Data analysis on bank data

Normalizing attributes
The following R code normalizes the attributes that have numerical values (other than the attributes whose values are only 0 or 1).

normal = function(x){
  # min-max normalization: rescale values to the range [0, 1]
  return ((x - min(x))/(max(x) - min(x)))
}
colum_list = c(11,12,13,14,16,17,18,19,20)
for (i in colum_list){
  bank_data[[i]] = normal(bank_data[[i]])
  bank_data <<- bank_data
  print(bank_data[[i]])
}
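For example, applying this function to a small numeric vector (a hypothetical example, not a column of the dataset) shows the rescaling:

normal(c(2, 4, 6))
# [1] 0.0 0.5 1.0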

Page 7: Data analysis on bank data

Preparing test and train data
• We take approximately 9% of the data as the test set and the rest as the training set.
• While dividing the data into test and train sets we should take care of the proportion of "yes" and "no" class values.
• In the whole dataset, about 11% of the rows in the 21st column have the value "yes" and 89% have "no".
• We have to maintain the same proportion in the test data as well.

Page 8: Data analysis on bank data

R code for making the test/train set

bank_data_yes = bank_data[bank_data$y=="yes" , ]
bank_data_no  = bank_data[bank_data$y=="no" , ]
# 3000 TRUE flags mark the rows that will go into the negative test set
true = vector('logical', length = 3000)
true = !true
false = vector('logical', length = (length(bank_data_no[[1]]) - 3000))
total_index_no = c(false, true)
# shuffle the flags by ordering them with random numbers
x_no = runif(length(bank_data_no[[1]]))
total_index_no = total_index_no[order(x_no)]
test_no = bank_data_no[total_index_no ,]

This gives the whole negative test set in test_no.

Page 9: Data analysis on bank data

R code for making the test/train set

true_yes = vector('logical', length = 400)
true_yes = !true_yes
false_yes = vector('logical', length = (length(bank_data_yes[[1]]) - 400))
total_index_yes = c(false_yes, true_yes)
# shuffle the 400 TRUE flags among the "yes" rows
x_yes = runif(length(bank_data_yes[[1]]))
length(x_yes)
length(total_index_yes)
total_index_yes = total_index_yes[order(x_yes)]
test_yes = bank_data_yes[total_index_yes ,]
total_test = as.data.frame(rbind(test_yes, test_no))

This gives the whole positive test set in test_yes; we then combine the two into one dataset using rbind() and name it total_test.

Page 10: Data analysis on bank data

R code for making the test/train set

train_yes = bank_data_yes[!total_index_yes ,]
train_no  = bank_data_no[!total_index_no , ]
total_train = as.data.frame(rbind(train_yes, train_no))

• These commands build the train dataset by excluding the test rows from the main dataset.
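A quick check (a sketch, not from the original slides) confirms that the class proportions in the test set stay close to the roughly 11%/89% split of the full dataset:

prop.table(table(total_test$y))    # share of "no" / "yes" in the test set
prop.table(table(bank_data$y))     # share in the whole dataset, for comparison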

Page 11: Data analysis on bank data

Using glm for logistic regression

model <- glm(total_train$y ~ ., family = binomial(link='logit'), data = total_train[,-11])

• Here we have not included the 11th column (duration) in the train dataset because it is clearly mentioned on the UCI repository page that "this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model."

Page 12: Data analysis on bank data

Summary of the model
• The summary(model) command gives the output shown below, and *** indicates the most relevant attributes.

Page 13: Data analysis on bank data

Predicting the test data with the logistic regression model
The code below gives us the predicted output for the test set; notice that here we have again excluded the 11th column.

fitted.results <- predict(model,total_test[,-11],type='response')

fitted.results_yes_no <- ifelse(fitted.results > 0.5,"yes","no")

table(total_test$y , fitted.results_yes_no)

Here we have used the threshold value 0.5, which gives good overall accuracy but cannot avoid a large error in the 'true positive' predictions.

Page 14: Data analysis on bank data

Accuracy
• Confusion matrix with 0.5 as the threshold
• Here we get an overall accuracy of 89.5%, but if you look only at the true ("yes") rows, they are predicted with an accuracy of only 20.5%.
• To avoid such a loss we will analyze the ROC curve.
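The two numbers above can be read directly off the confusion matrix; a small sketch of the computation (assuming the table produced on the previous slide) is:

cm <- table(total_test$y, fitted.results_yes_no)        # rows: actual, columns: predicted
overall_accuracy   <- sum(diag(cm)) / sum(cm)           # correct predictions over all rows
true_positive_rate <- cm["yes", "yes"] / sum(cm["yes", ])  # accuracy on the actual "yes" rows
overall_accuracy
true_positive_rate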

Page 15: Data analysis on bank data

R code to plot the ROC curve
You'll need the "ROCR" package.

require(ROCR)
pr <- prediction(fitted.results, total_test$y)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

Page 16: Data analysis on bank data

ROC curve
• This is the ROC curve, and here we can clearly see that the maximum 'true positive' rate we can reach is only about 62%.
• The area under this curve is given by the following code:

auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]

• The value of auc is 0.7618.

Page 17: Data analysis on bank data

Increasing the true positive rate
• To increase the true positive rate we have to change the threshold.
• It was observed that decreasing the threshold from 0.5 increases the number of true positives.
• But from the ROC curve we can say that the optimum true positive rate we can achieve is between 0.60 and 0.62.
• For this process we slowly decrease the threshold and observe the true positive rate at each step, as sketched below.
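A minimal sketch of that sweep (the grid of candidate thresholds is an assumption, not taken from the slides):

for (t in seq(0.5, 0.05, by = -0.05)){
  # re-threshold the fitted probabilities and measure the true positive rate
  pred_t <- ifelse(fitted.results > t, "yes", "no")
  tpr <- sum(pred_t == "yes" & total_test$y == "yes") / sum(total_test$y == "yes")
  cat("threshold =", t, " true positive rate =", round(tpr, 3), "\n")
}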

Page 18: Data analysis on bank data

The optimal threshold is 0.12
• Here if we run this code:

fitted.results <- predict(model,total_test[,-11],type='response')

fitted.results_yes_no <- ifelse(fitted.results > 0.12,"yes","no")

table(total_test$y , fitted.results_yes_no)

we will get this confusion matrix

Overall accuracy = 82.5% and true positive rate = 60% (0.6)

Page 19: Data analysis on bank data

Using naïve Bayes
• R code:

require(e1071)

naive_model <- naiveBayes(total_train$y ~. , data = total_train[,-11] , laplace = 0)

result = predict(naive_model , total_test[,-11])

table(total_test$y , result)

• Here we get this confusion matrix.
• Accuracy = 83.76%, true positive rate = 53.5% (0.535)
• NOTE: it was observed that if we use Laplace smoothing the true positive rate of the result decreases.

Page 20: Data analysis on bank data

Using SVM with no kernel specified (the default)
• R code:

require(e1071)

svmmod <- svm(total_train$y ~.,data = total_train[,-11] )

pred <- predict(svmmod, total_test[,c(-11,-21)], decision.values = TRUE)

table(total_test$y , pred)

This dataset has about 40k rows, and svm will take a long time to build a predictive model from it, but you can load an already saved SVM model and test your data on that.
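Saving a fitted model so it can be reloaded later (a sketch; the file name matches the 'svm_model.rda' file mentioned on the next slide) looks like this:

# after svmmod has been fitted once, store it in the working directory
save(svmmod, file = "svm_model.rda")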

Page 21: Data analysis on bank data

SVM: steps to load the existing model and predict
Store the 'svm_model.rda' file in your working directory and run this code:

load("svm_model.rda")

ls() #to check if svmmod is loaded or not

pred <- predict(svmmod, total_test[,c(-11,-21)], decision.values = TRUE)

Make sure that you include all necessary libraries before running the ‘predict’ method

Page 22: Data analysis on bank data

SVM accuracy
• This is the confusion matrix we got using SVM.
• Overall accuracy is 89.41%, but if we look at the true positive rate it is 17.75% (0.177), which is very low compared to all the previous methods we saw.

Page 23: Data analysis on bank data

Final verdict
• This dataset shows good results with the logistic regression and naïve Bayes methods.
• SVM gives good overall accuracy but fails in terms of true positive rate.
• This dataset has lots of categorical attributes, which suggests it could be classified well by decision trees.