Clustering & Classifying Houses in King County, WA MOHAMMED ALHAMADI - PROJECT 1


Posted on 13-Apr-2017


Page 1: (Machine Learning) Clustering & Classifying Houses in King County, WA

Clustering & Classifying Houses in King County, WA
MOHAMMED ALHAMADI - PROJECT 1

Page 2:

Acknowledgement

This project was done as a partial requirement for the course Introduction to Machine Learning, offered online in Fall 2016 through Tandon Online, NYU Tandon School of Engineering.

Page 3:

Outline

o The Dataset
o Loading & Exploring the Dataset
o Clustering Zip Codes and Prices
o Predicting House Prices Using the Support Vector Machine Algorithm
o Decreasing Correlation Between Independent Variables
o Scaling Data for SVM
o References

Page 4:

The Dataset

• House sales in King County, Washington State

• From May 2014 to May 2015

• 21,613 observations and 21 features

ID, Date, Price, Bedrooms, Bathrooms, SQFT Living (living area in square feet), SQFT Lot (lot area in square feet), Floors, Waterfront, View, Grade (house grade ranging from 1 to 13), Condition (house condition ranging from 1 to 5), SQFT Above (living area excluding the basement), SQFT Basement (basement area), Yr Built (the year in which the house was built), Yr Renovated (the year in which the house was renovated), Zipcode, Lat (latitude), Long (longitude), SQFT Living15 (living area in square feet for the nearest 15 neighbors), and SQFT LOT15 (lot area in square feet for the nearest 15 neighbors).

Page 5:

Loading and exploring the data

houses2 <- read.csv("/Users/mohammedalhamadi/GoogleDrive/R_code/data/kc_house_data.csv", header=TRUE)
dim(houses2)
[1] 21613    21

names(houses2)
 [1] "id"            "date"          "price"         "bedrooms"      "bathrooms"
 [6] "sqft_living"   "sqft_lot"      "floors"        "waterfront"    "view"
[11] "condition"     "grade"         "sqft_above"    "sqft_basement" "yr_built"
[16] "yr_renovated"  "zipcode"       "lat"           "long"          "sqft_living15"
[21] "sqft_lot15"

Page 6:

Loading and exploring the data (cont.)

str(houses2)
'data.frame': 21613 obs. of 21 variables:
 $ id           : num 7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
 $ date         : Factor w/ 372 levels "20140502T000000",..: 165 221 291 221 284 11 57 252 340 306 ...
 $ price        : num 221900 538000 180000 604000 510000 ...
 $ bedrooms     : int 3 3 2 4 3 4 3 3 3 3 ...
 $ bathrooms    : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
 $ sqft_living  : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
 $ sqft_lot     : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
 $ floors       : num 1 2 1 1 1 1 2 1 1 2 ...
 $ waterfront   : int 0 0 0 0 0 0 0 0 0 0 ...
 $ view         : int 0 0 0 0 0 0 0 0 0 0 ...
 $ condition    : int 3 3 3 5 3 3 3 3 3 3 ...
 $ grade        : int 7 7 6 7 8 11 7 7 7 7 ...
 $ sqft_above   : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
 $ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
 $ yr_built     : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
 $ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
 $ zipcode      : int 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
 $ lat          : num 47.5 47.7 47.7 47.5 47.6 ...
 $ long         : num -122 -122 -122 -122 -122 ...
 $ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
 $ sqft_lot15   : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...

Page 7:

Discovering correlations between features

cor(houses2[, 3:length(houses2)])  # first 2 features (id, date) are excluded

# Most correlations are insignificant. The most significant correlations, in order:
Grade and sqft_living    = 0.763
Grade and sqft_above     = 0.756
Price and sqft_living    = 0.702
Bathrooms and sqft_above = 0.685
Price and grade          = 0.667
Bathrooms and grade      = 0.665
Price and sqft_above     = 0.606
Price and sqft_living15  = 0.585
Price and bathrooms      = 0.525
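The ranking above was read off the correlation matrix by eye; the same ranking can also be produced programmatically. A minimal sketch on a small synthetic data frame (toy columns standing in for houses2; the numbers are illustrative, not the real correlations):

```r
# Rank all pairwise correlations, strongest first, on toy data.
set.seed(1)
n <- 100
df <- data.frame(sqft_living = rnorm(n, 2000, 500))
df$price  <- 300 * df$sqft_living + rnorm(n, 0, 50000)  # strongly related
df$floors <- sample(1:3, n, replace = TRUE)              # unrelated noise

cm <- cor(df)
cm[lower.tri(cm, diag = TRUE)] <- NA       # keep each pair only once
pairs <- as.data.frame(as.table(cm))       # long format: Var1, Var2, Freq
pairs <- pairs[!is.na(pairs$Freq), ]
pairs <- pairs[order(-abs(pairs$Freq)), ]  # strongest correlation first
head(pairs)
```

The same melt-and-sort pattern works unchanged on the full cor(houses2[, 3:21]) matrix.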

Page 8:

Correlation plots (1)

# Correlation between sqft_living and price
plot(houses2[, c(6, 3)], main="Correlation between SQFT Living and Price", xlab="Living area in square ft", ylab="Price", col="blue")

Page 9:

Correlation plots (2)

# Correlation between sqft_living and grade
plot(houses2[, c(6, 12)], main="Correlation between SQFT Living and Grade", xlab="Living area in square ft", ylab="Grade", col="dark green")

Page 10:

Correlation plots (3)

# Correlation between grade and price
plot(houses2[, c(12, 3)], main="Correlation between Grade and Price", xlab="Grade", ylab="Price", col="red")

Page 11:

Clustering zip codes and prices data

zip_and_price <- houses2[1:5000, c("zipcode", "price")]  # consider the first 5,000 observations
scaledZP <- scale(zip_and_price)                         # scale for comparability
dist_scaledZP <- dist(scaledZP, method="euclidean")      # use Euclidean distance
clusters <- hclust(dist_scaledZP, method="ward.D")
plot(clusters)                                           # plot clusters in a dendrogram

Page 12:

Clustering zip codes and prices data (cont.)

groups <- cutree(clusters, k=6)            # assign each observation to one of 6 clusters
rect.hclust(clusters, k=6, border="blue")  # draw the 6 clusters on the dendrogram
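Once cutree has assigned each row to a cluster, base R can summarize the groups directly. A small sketch on toy zipcode/price data (the values below are illustrative, not rows of houses2):

```r
# Two well-separated price groups, clustered hierarchically as in the deck.
set.seed(42)
toy <- data.frame(
  zipcode = c(rep(98001, 10), rep(98199, 10)),
  price   = c(rnorm(10, 3e5, 2e4), rnorm(10, 9e5, 5e4))
)
d   <- dist(scale(toy), method = "euclidean")
hc  <- hclust(d, method = "ward.D")
grp <- cutree(hc, k = 2)                            # cluster label per row
aggregate(toy$price, by = list(cluster = grp), FUN = mean)  # mean price per cluster
```

The aggregate() call is a quick way to put interpretable labels (e.g. "cheap area" vs "expensive area") on the dendrogram's groups.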

Page 13:

Predicting House Prices Using the Support Vector Machine Algorithm

# Creating a categorical variable from the prices data
quantile(houses2$price)

     0%     25%     50%     75%    100%
  75000  321950  450000  645000 7700000

So we can define our 4 classes like this:

Houses more expensive than 645,000   Expensive  5373 houses
Houses between 450,000 and 645,000   High       5376 houses
Houses between 321,950 and 450,000   Ok         5460 houses
Houses cheaper than 321,950          Cheap      5404 houses
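Since the four classes come straight from the quartiles, the breaks can be computed rather than typed in by hand. A sketch with a short illustrative price vector (not the real data):

```r
# Derive the class boundaries from quantile() and bin with cut().
price  <- c(75000, 200000, 321950, 400000, 450000, 500000, 645000, 7700000)
breaks <- quantile(price, probs = c(0, 0.25, 0.5, 0.75, 1))
categ  <- cut(price, breaks = breaks,
              labels = c("Cheap", "Ok", "High", "Expensive"),
              include.lowest = TRUE)  # keep the minimum price in the first bin
table(categ)
```

include.lowest = TRUE matters here: without it, the cheapest house falls outside the first interval and becomes NA.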

Page 14:

Predicting House Prices Using the Support Vector Machine Algorithm (cont.)

# add a column to the data to hold this categorical variable
houses2$price_categ <- cut(houses2$price, c(0, 321950, 450000, 645000, 7700000), labels=c("Cheap", "Ok", "High", "Expensive"))

# view a summary of the categorical variable
summary(houses2$price_categ)

    Cheap        Ok      High Expensive
     5404      5460      5376      5373

Page 15:

Predicting House Prices Using the Support Vector Machine Algorithm (cont.)

# choose relevant columns from the data set based on the correlation analysis done earlier
cols <- c("sqft_living", "grade", "sqft_living15", "bathrooms", "view", "price_categ")

# define a training data set and a testing data set
set.seed(100)
training_size <- round(0.7 * dim(houses2)[1])
training_sample <- sample(dim(houses2)[1], training_size, replace=FALSE)
training_houses <- houses2[training_sample, cols]
testing_houses <- houses2[-training_sample, cols]

# call the SVM function on the training data
library(e1071)
svmfit <- svm(price_categ ~ ., data=training_houses, kernel="linear", cost=0.1, scale=FALSE)
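The 70/30 split pattern used above (a seeded random sample of row indices, with the complement as the test set) can be seen in isolation on toy data:

```r
# Reproducible 70/30 train/test split, same pattern as the deck.
set.seed(100)
n   <- 100
toy <- data.frame(x = rnorm(n), y = rnorm(n))

train_size   <- round(0.7 * n)
train_sample <- sample(n, train_size, replace = FALSE)  # 70 distinct row indices
train <- toy[train_sample, ]
test  <- toy[-train_sample, ]                           # the complementary 30 rows
```

Negative indexing (-train_sample) guarantees the two sets are disjoint and together cover every row exactly once.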

Page 16:

Predicting House Prices Using the Support Vector Machine Algorithm (cont.)

# plotting the SVM classification regions, two variables at a time
plot(svmfit, training_houses, sqft_living ~ grade)
plot(svmfit, training_houses, bathrooms ~ sqft_living)

Page 17:

Predicting House Prices Using the Support Vector Machine Algorithm (cont.)

# use tune() to try to find the best variables, the best kernel, and the best cost parameter to minimize the error
tuned <- tune(svm, price_categ ~ bathrooms + sqft_living, data=training_houses, kernel="linear", ranges=list(cost=c(10)))
print(tuned)
Error estimation of 'svm' using 10-fold cross validation: 0.5506647

tuned <- tune(svm, price_categ ~ bathrooms + sqft_living, data=training_houses, kernel="linear", ranges=list(cost=c(100)))
print(tuned)
Error estimation of 'svm' using 10-fold cross validation: 0.5488125

Page 18:

Predicting House Prices Using the Support Vector Machine Algorithm (cont.)

# keep calling tune(), varying the independent variables in the formula, the kernel, and the cost parameter.
# Cost is passed as a list of 6 values ranging from 0.001 to 100 in 10x increments.
tuned <- tune(svm, price_categ ~ ., data=training_houses, kernel="linear", ranges=list(cost=c(0.001, 0.01, 0.1, 1, 10, 100)))
summary(tuned)

Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters: cost 10
- best performance: 0.5088919
- Detailed performance results:
   cost     error dispersion
1 1e-03 0.5197316 0.01126059
2 1e-02 0.5125271 0.01538330
3 1e-01 0.5109408 0.01457529
4 1e+00 0.5092224 0.01532396
5 1e+01 0.5088919 0.01564221
6 1e+02 0.5093546 0.01505520

cost=10 is the best value, since it gave the smallest error.
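tune() reports a 10-fold cross-validation error. To make that estimate concrete, here is a package-free sketch of the 10-fold CV loop, with a trivial majority-class predictor standing in for svm (toy labels; the error rates are illustrative only):

```r
# 10-fold cross-validation by hand, using a majority-class "model".
set.seed(1)
y <- factor(sample(c("Cheap", "Expensive"), 200, replace = TRUE, prob = c(0.7, 0.3)))
k <- 10
folds <- sample(rep(1:k, length.out = length(y)))  # random fold assignment, 1..10

errs <- sapply(1:k, function(i) {
  train_y <- y[folds != i]                  # fit on 9 folds...
  test_y  <- y[folds == i]                  # ...evaluate on the held-out fold
  pred <- names(which.max(table(train_y)))  # "model": predict the majority class
  mean(pred != test_y)                      # misclassification rate on this fold
})
mean(errs)                                  # the CV error estimate tune() reports
```

Each observation is held out exactly once, so mean(errs) estimates out-of-sample error; tune() does the same thing with an actual SVM fit per fold.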

Page 19:

Predicting House Prices Using the Support Vector Machine Algorithm (cont.)

# Call tune() again, varying the independent variables in the formula, the kernel, and the cost parameter.
# Summary of the output for different combinations (10-fold CV error by cost):

Independent Variables                          Kernel      0.001  0.01  0.1   1     10    100
All variables                                  Linear      0.52   0.51  0.51  0.51  0.51  0.51
All variables                                  Polynomial  0.60   0.56  0.54  0.53  0.53  0.53
All variables                                  Radial      0.53   0.51  0.50  0.50  0.50  0.50
All variables                                  Sigmoid     0.55   0.58  0.64  0.64  0.64  0.64
Grade                                          Radial      0.56   0.53  0.53  0.53  0.53  0.53
Grade & sqft_living                            Radial      0.55   0.51  0.51  0.51  0.51  0.51
Grade, sqft_living & sqft_living15             Radial      0.55   0.51  0.51  0.51  0.51  0.51
Grade, sqft_living, sqft_living15 & bathrooms  Radial      0.54   0.52  0.51  0.51  0.50  0.50

# We can see that the best case used all the variables, the radial kernel, and cost=10.
# Let's use these parameters (all variables, radial kernel, cost=10):
svmfit <- svm(price_categ ~ ., data=training_houses, kernel="radial", cost=10, scale=FALSE)

Page 20:

Predicting House Prices Using the Support Vector Machine Algorithm (cont.)

print(svmfit)
Call:
svm(formula = price_categ ~ ., data = training_houses, kernel = "radial", cost = 10, scale = FALSE)

Parameters:
   SVM-Type: C-classification
 SVM-Kernel: radial
       cost: 10
      gamma: 0.1111111

Number of Support Vectors: 14406

p <- predict(svmfit, testing_houses[, cols], type="class")
table(p, testing_houses[, 6])

p           Cheap   Ok High Expensive
  Cheap       463  144  106        22
  Ok          143  423  114        71
  High         95  133  343        94
  Expensive   871  968 1039      1455

mean(p == testing_houses[, 6])
[1] 0.413942

Prediction accuracy is about 41%, which is pretty low.
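The accuracy printed by mean(p == ...) is equivalently the trace of the confusion matrix divided by its total. A toy 2x2 sketch of that identity (the deck's matrix is 4x4; the counts below are made up):

```r
# Overall accuracy = correctly classified (diagonal) / all observations.
conf <- matrix(c(40, 10,
                  5, 45), nrow = 2, byrow = TRUE,
               dimnames = list(pred   = c("Cheap", "Expensive"),
                               actual = c("Cheap", "Expensive")))
accuracy <- sum(diag(conf)) / sum(conf)
accuracy  # (40 + 45) / 100 = 0.85
```

Reading the deck's 4x4 table the same way shows where the 41% comes from: the huge off-diagonal counts in the "Expensive" prediction row.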

Page 21:

Decreasing correlation between independent variables

o Data can contain attributes that are highly correlated with each other
o Many methods perform better if highly correlated attributes are removed
o Checking correlation between our independent variables:

               sqft_living grade sqft_living15 bathrooms view
sqft_living           1.00  0.76          0.76      0.75 0.28
grade                 0.76  1.00          0.71      0.66 0.25
sqft_living15         0.76  0.71          1.00      0.57 0.28
bathrooms             0.75  0.66          0.57      1.00 0.19
view                  0.28  0.25          0.28      0.19 1.00

o We can see many correlations above 0.65, which is high
o We want to eliminate that
o Choose other variables with low inter-correlation and high correlation with price_categ
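Flagging the offending pairs can be automated by thresholding the correlation matrix (the caret package's findCorrelation, covered in the feature-selection reference, automates a similar idea). A sketch on a small illustrative matrix:

```r
# Find variable pairs whose absolute correlation exceeds a cutoff.
cm <- matrix(c(1.00, 0.76, 0.28,
               0.76, 1.00, 0.25,
               0.28, 0.25, 1.00), nrow = 3,
             dimnames = rep(list(c("sqft_living", "grade", "view")), 2))

idx <- which(abs(cm) > 0.65 & upper.tri(cm), arr.ind = TRUE)  # each pair once
high_pairs <- data.frame(var1 = rownames(cm)[idx[, 1]],
                         var2 = colnames(cm)[idx[, 2]],
                         cor  = cm[idx])
high_pairs
```

On the real 5x5 matrix above, the same code would list every pair flagged in the bullets (sqft_living/grade, sqft_living/sqft_living15, sqft_living/bathrooms, grade/sqft_living15, grade/bathrooms).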

Page 22:

Decreasing correlation between independent variables (cont.)

o After analyzing the correlations in our data set, we chose the following variables: "sqft_living", "floors", "view", "sqft_basement", and "lat"

cols <- c("sqft_living", "floors", "view", "sqft_basement", "lat", "price_categ")

# Running everything again
set.seed(100)
training_size <- round(0.7 * dim(houses2)[1])
training_sample <- sample(dim(houses2)[1], training_size, replace=FALSE)
training_houses <- houses2[training_sample, cols]
testing_houses <- houses2[-training_sample, cols]

# Again, try different kernels and see which is best.
# Every time the kernel is changed, call summary(tuned) to see the best cost parameter.
tuned <- tune(svm, price_categ ~ ., data=training_houses, kernel="radial", ranges=list(cost=c(0.001, 0.01, 0.1, 1, 10, 100)))
tuned <- tune(svm, price_categ ~ ., data=training_houses, kernel="linear", ranges=list(cost=c(0.001, 0.01, 0.1, 1, 10, 100)))
tuned <- tune(svm, price_categ ~ ., data=training_houses, kernel="polynomial", ranges=list(cost=c(0.001, 0.01, 0.1, 1, 10, 100)))
tuned <- tune(svm, price_categ ~ ., data=training_houses, kernel="sigmoid", ranges=list(cost=c(0.001, 0.01, 0.1, 1, 10, 100)))
summary(tuned)

Page 23:

Decreasing correlation between independent variables (cont.)

o The best kernel was found to be the radial kernel again

svmfit <- svm(price_categ ~ ., data=training_houses, kernel="radial", cost=100, scale=FALSE)
p <- predict(svmfit, testing_houses[, cols], type="class")
mean(p == testing_houses[, 6])
[1] 0.514343

Prediction accuracy increased to about 51%.

Page 24:

Data Scaling for SVM

o "Scaling before applying SVM is very important" ~ Hsu et al.
o Advantages of scaling:
  o avoid attributes in greater numeric ranges dominating those in smaller numeric ranges
  o avoid numerical difficulties during the calculation

# Scaling the data
training_houses2 <- training_houses
training_houses2[1:5] <- scale(training_houses2[1:5])

testing_houses2 <- testing_houses
testing_houses2[1:5] <- scale(testing_houses2[1:5])

# Calling SVM again and checking the accuracy
svmfit <- svm(price_categ ~ ., data=training_houses2, kernel="radial", cost=100, scale=FALSE)
p <- predict(svmfit, testing_houses2[, cols], type="class")
mean(p == testing_houses2[, 6])
[1] 0.6904688

A big increase in accuracy, from 41% to 69%.
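One caveat worth noting: above, the test set is scaled with its own statistics. A common alternative is to reuse the training set's centers and scales, so both sets live in the same coordinate system. A sketch on toy data (the column names are illustrative):

```r
# Standardize the test set using the TRAINING set's means and sds.
set.seed(1)
train <- data.frame(sqft = rnorm(50, 2000, 500), lat = rnorm(50, 47.5, 0.1))
test  <- data.frame(sqft = rnorm(20, 2000, 500), lat = rnorm(20, 47.5, 0.1))

train_scaled <- scale(train)  # records the centers/scales as attributes
test_scaled  <- scale(test,
                      center = attr(train_scaled, "scaled:center"),
                      scale  = attr(train_scaled, "scaled:scale"))
colMeans(train_scaled)        # ~0 by construction; test means need not be 0
```

scale() stores what it used as the "scaled:center" and "scaled:scale" attributes, which is exactly what a deployed model would apply to new, unseen houses.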

Page 25:

References

o Support Vector Machines: The Interface to libsvm in Package e1071, David Meyer. https://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf
o Support Vector Machines (SVM) Overview and Demo using R. https://www.youtube.com/watch?v=ueKqDlMxueE
o DSO 530: Logistic Regression in R. https://www.youtube.com/watch?v=mteljf020EE
o DSO 530: Decision Trees in R (Classification). https://www.youtube.com/watch?v=GOJN9SKl_OE
o Feature Selection with the Caret R Package. http://machinelearningmastery.com/feature-selection-with-the-caret-r-package/
o A Practical Guide to Support Vector Classification, Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. https://cloud.scorm.com/content/courses/P0P5ZE81VQ/78c05b7f-2582-46cf-ab6b-160ea2e02a6a/0/story_content/external_files/A%20Practical%20Guide%20to%20Support%20Vector%20Machines.pdf
o Cluster Analysis in R. http://www.statmethods.net/advstats/cluster.html

Page 26:

Questions?
THANKS!