Page 1: Capstone Project in Business Intelligence

June 7, 2016 Seattle University

King County Real Estate Prices Analysis and Predictions

IS 5325 - Capstone Project in Business Intelligence

Prepared by:

Samantha Adriaan

Braden Simonsen

Yan Zhang

Executive Summary

This paper analyzes the use of data mining models in predicting real estate prices in King

County, Washington. Based on research done by other researchers regarding the real estate

market, data mining and machine learning, we decided to use Microsoft Decision Trees and

Neural Networks for our prediction models.

Using publicly available sales and residential building data from the King County Assessor’s

office, income data from the Internal Revenue Service, and educational attainment data from the

United States Census Bureau, we extracted, transformed, and then loaded the data into

Microsoft SQL Server. We then used Microsoft Visual Studio to build our prediction models and

perform data mining on over 27,000 transactions. In total, four prediction models were used: a

Decision Tree model for predicting sales price, a Decision Tree model for predicting home value

category, a Neural Network model for predicting sales price, and a Neural Network model for

predicting home value category.

To assess the accuracy of the models and to determine which model best represents our data, we

evaluated the performance of each of our mining models and assessed how well the models

performed against real data by calculating prediction errors using mean absolute percentage error

(MAPE) and mean absolute deviation (MAD). We found that our Neural Network model was

significantly more accurate than Decision Trees. For the two models built to predict the home

value category, we used a classification matrix to sort all cases from the models into categories,

by determining whether the predicted value matched the actual value. Because the overall

accuracy for both of these models was greater than 99%, we didn’t find that predicting the home

value category added value to our analysis.

To further our research, we analyzed the relationship between average income, educational

attainment, and the average predicted sales price from our Neural Network model for each zip

code in King County. Our findings indicate that, with a few exceptions, there is a positive

correlation between home values, income, and educational attainment.

Ultimately, we determined that the Neural Networks algorithm is more accurate than Decision

Trees in predicting real estate prices. Moving forward, we believe that the model could be

improved by using additional attributes, such as household size, marital status, median age, and

standardized test scores. Furthermore, we believe additional research should include the use of

other data mining algorithms, such as Naive Bayes, in order to compare model accuracy with

Neural Networks and Decision Trees.

Introduction

Home values in King County have surged in recent years as the economy has rebounded from

the global recession in 2008 as shown in Figure 1. The growth of companies, specifically

technology firms, in the Seattle area can be credited with attracting people from all over the

world, causing the population and demand in this region to grow exponentially in recent years.

This growth has caused real estate prices to hit new highs as demand continues to exceed

available inventory. According to Zillow’s Zestimate tool, the current median home price in

King County is $477,900, an increase of 11.6% since May 2015 (Zillow King County, 2016).

According to the Northwest Multiple Listing Service, median home prices sold in King County

increased 12.50% from April 2015 to April 2016 (Northwest Multiple Listing Service, 2016).

Figure 1: Zillow Price Trend of Home Values in King County

With such high value growth in residential real estate and with over 27,000 sales

transactions in 2015 (King County Assessor, 2016), the King County real estate market provides

us with a fascinating opportunity to predict future home values using Decision Trees and Neural

Network data mining models. Zillow predicts home values in King County will rise by 6.2%

between June 2016 and June 2017 (Zillow King County, 2016).

Using proprietary automated valuation models that apply advanced algorithms to analyze data,

Zillow identifies relationships within a specific geographic area, between home-related data and

actual sales prices. Home characteristics, such as square footage, location or the number of

bathrooms, are given different weights according to their influence on home sale prices in each

specific geography over a specific period of time, resulting in a set of valuation rules, or models

that are applied to generate each home's Zestimate. Specifically, some of the data used in their

algorithm include physical attributes like lot size and square footage, tax assessments, and prior

and current transactions (Zillow Zestimate, 2016).

For our sales price prediction models, we obtained 2015 sales data from the King County

Assessor, educational attainment data from the United States Census Bureau, and income data

from the IRS. With this data we created models using the Decision Trees and Neural Networks

algorithms to predict home values in King County. We evaluated the performance of our mining

models and assessed how well our models performed against real data by calculating prediction

errors, allowing us to determine which model has the highest accuracy.

Problem Identification

The real estate market is known to fluctuate over time and the problem we observed is that

housing prices are highly unpredictable. Using database management techniques, data mining

algorithms, and secondary research, we set out to forecast housing prices by taking into

account factors such as income and educational attainment along with livable

square footage.

Explanation of Domain Problems and Datasets

In gathering the datasets, we had difficulties finding reliable data sources. We made the

decision to use publicly available data from the King County Assessor's Office for residential

building details and sales data, the Internal Revenue Service for average income by zip code, and the

Census Bureau for the 2014 census data. We believe these four datasets are the most reliable because they

are collected directly by each of the governmental agencies.

The real estate domain is broad, so it was difficult to choose a specific geographical location and

property type to focus on. Additionally, our original intent was to perform an analysis on King

County data and Scholastic Aptitude Test (SAT) scores to determine whether there is a

correlation between house prices and SAT scores. We quickly ran into roadblocks with this

domain. The problem here was that we could not get access to the SAT scores from the College

Board due to its policies. We learned that we could access only general data, not

data specific to a zip code, and that the more detailed data is available only to school districts. We

moved on to other viable options that could potentially achieve the same goal: educational

attainment and average income.

The next problem was with the datasets. We downloaded two King County datasets: Real Estate

Sales and Residential Buildings. The King County Assessor's Office offers 24 datasets to the

general public for download. We reviewed the majority of these datasets to determine which sets

were useful. After some discussion, we agreed on the Real Estate Sales and the Residential

Buildings datasets. The Real Estate Sales dataset offers information on each sales transaction for

properties within King County, and the Residential Buildings dataset offers

building details for each property. We then combined the Parcel Major and Parcel Minor to

obtain the Parcel ID which identifies each unique property.
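
As a rough illustration of this step, the sketch below concatenates the two parts into a single key. The zero-padded widths (six digits for Major, four for Minor) and the function name are assumptions made for the example, not details taken from our ETL scripts.

```python
def make_parcel_id(major, minor):
    """Combine Parcel Major and Parcel Minor into one parcel identifier.

    Assumption: Major is padded to 6 digits and Minor to 4 digits.
    """
    return f"{int(major):06d}{int(minor):04d}"


# Hypothetical values, for illustration only.
print(make_parcel_id(12345, 67))
```

Padding both parts to a fixed width keeps the combined key unambiguous, since different Major and Minor pairs cannot collapse into the same string.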

For the educational attainment and average income datasets, we experienced difficulties

finding information that is relevant and can be organized by zip code. After a long search, we

were able to locate the dataset from the Census Bureau on educational attainment by zip code, age

group and education level; from the Internal Revenue Service we obtained the average income

by zip code. We then cleaned up these datasets to better align with the existing King County

datasets.

ETL (Extract, Transform, and Load)

We started the ETL process by obtaining publicly available datasets. The King County

Assessor’s Office provided real estate property sales and residential building datasets. We also

obtained data from the U.S. Census Bureau and the Internal Revenue Service on educational attainment

and average income, respectively. We used these datasets because the various government

agencies are reliable resources for the information needed. For the complete data dictionary on

our datasets, see attachments:

● Table 1: King County Real Estate Sales

● Table 2: King County Residential Building

● Table 3: Washington Educational Attainment

● Table 4: Washington State Income

We then imported the four datasets into Microsoft SQL Server where we faced a number of

challenges. Because each of our data sources defined data types differently,

the data import process was especially difficult. We started with four tables and

more than 150 attributes. Each attribute had to be reviewed and validated for accuracy in terms

of its data type. The data cleaning process consumed a significant portion of our efforts.

After the four datasets were successfully imported, we had a total of 2.3 million rows of data. We

joined the four individual datasets and isolated the data to 2015 real estate sales

transactions. Next, we eliminated sale price outliers, defined as prices lower than $100,000 or

greater than $2 million. Finally, we determined that some data were not pertinent to our research

and analysis. We were able to narrow down our final dataset to roughly 27,000 rows.
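
The filtering described above can be sketched in a few lines of Python. This is only an illustration: the field names and sample rows are hypothetical, and treating the two price thresholds as inclusive bounds is an assumption.

```python
from datetime import date

# Thresholds taken from the text; treating them as inclusive is an assumption.
MIN_PRICE = 100_000
MAX_PRICE = 2_000_000


def keep_sale(sale):
    """Return True for 2015 sales whose price falls inside the retained range."""
    in_2015 = sale["sale_date"].year == 2015
    in_range = MIN_PRICE <= sale["sale_price"] <= MAX_PRICE
    return in_2015 and in_range


# Hypothetical rows; field names are illustrative, not the Assessor's columns.
sales = [
    {"parcel_id": "0000000001", "sale_date": date(2015, 3, 2), "sale_price": 450_000},
    {"parcel_id": "0000000002", "sale_date": date(2014, 7, 9), "sale_price": 450_000},
    {"parcel_id": "0000000003", "sale_date": date(2015, 5, 1), "sale_price": 50_000},
    {"parcel_id": "0000000004", "sale_date": date(2015, 8, 20), "sale_price": 3_500_000},
]

filtered = [s for s in sales if keep_sale(s)]
print([s["parcel_id"] for s in filtered])
```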

In addition, we had to resolve a data duplication error. We discovered that it was due to

properties that were sold multiple times in 2015 under the same parcel ID, and to multiple

residential buildings listed under the same parcel ID. To solve this problem, we created a

primary key to uniquely identify each transaction. In an effort to prepare for the next phase in our

research, we normalized all the numeric data and discretized sales price and total square footage

of living area. The normalized numeric data we later used in our Neural Network mining model.
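
The two normalizations and the discretization can be sketched as follows. This is a simplified illustration, not our SQL view: it computes the minimum, maximum, mean, and standard deviation from the data instead of hard-coding them, it uses the population standard deviation (an assumption), and it assumes each home value category includes its lower bound.

```python
from statistics import mean, pstdev


def min_max(values):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Center on the mean and scale by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def price_level(price):
    """Map a sale price to one of four home value categories.

    Assumption: each category includes its lower bound.
    """
    if price < 250_000:
        return 1
    if price < 500_000:
        return 2
    if price < 750_000:
        return 3
    return 4


# Hypothetical prices, for illustration only.
prices = [200_000, 400_000, 600_000, 800_000]
print(min_max(prices))
print(z_score(prices))
print([price_level(p) for p in prices])
```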

As a result of the extraction, transform and load process, we created the following tables and

views from our four original datasets to capture more relevant information to be used in the data

mining models (Appendices 1-12):

● [dbo].[KC_Normalized_TestData4Transactions], Appendix 1

● [dbo].[KC_TestData4Transactions], Appendix 2

● [dbo].[KCPredictionsDTrees], created from data mining prediction model, Appendix 3

● [dbo].[KCPredictionsNNetworks], created from data mining prediction model, Appendix 4

● [dbo].[AvgIncome], Appendix 5

● [dbo].[KC Comprehensive Normalized], Appendix 6

● [dbo].[KC_Normalized_TrainingData4Transactions], Appendix 7

● [dbo].[KC_TrainingData4Transactions], Appendix 8

● [dbo].[PredictionEvaluationDTrees], Appendix 9

● [dbo].[PredictionEvaluationNNetworks], Appendix 10

● [dbo].[View4EvaluationKCDTrees], Appendix 11

● [dbo].[View4EvaluationKCNNetworks], Appendix 12

Data Mining Models and Analysis

We used Microsoft Visual Studio 2013 to apply the Decision Trees and Neural Network algorithms

and predict real estate prices in King County based on the 2015 dataset. Prior to creating

the mining structures, we divided the King County Comprehensive table into 80% training and

20% testing data. We then used the training data in our mining models.
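
A minimal sketch of an 80/20 split is shown below. It shuffles the rows once and slices the result, which guarantees that the training and testing sets do not overlap. The function name and the fixed seed are illustrative assumptions, and this is not the SQL we ran.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle once, then slice into disjoint training and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical transaction IDs, for illustration only.
train, test = split_train_test(range(100))
print(len(train), len(test))
```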

The Decision Trees algorithm is a classification and regression algorithm for use in predictive

modeling of both discrete and continuous attributes. For discrete attributes, such as the sale price

level, the algorithm makes predictions based on the relationships between input columns in the

dataset. It uses the values, known as states, of those columns to predict the states of a column

that we designate as predictable. Specifically, the algorithm identifies the input columns that are

correlated with the Sale Price Level column. The decision tree makes predictions based on this

tendency toward a particular outcome. For continuous attributes, such as the sales price, the

algorithm uses linear regression to determine where a decision tree splits. Since there is only one

column set to be predictable, the algorithm builds a single decision tree for the Sale Price

column. (Microsoft, 2016)

The Decision Tree Viewer for sale price (Appendix 13) shows a visual representation of the

decision rules that are created in the decision tree model. For example, “All” is the root of the

decision tree model and, as we go from left to right, the darker boxes are stronger

determinants of the sale price. We can see that the share of the population with a Bachelor’s degree

or higher is the key determinant of sale price, and the tree branches off from there. The Dependency

Network for sale price (Appendix 14) shows that, surprisingly, the strongest dependency

for the sale price is the share of the population aged 45 to 64 with a Bachelor's degree

or higher.

The Neural Network algorithm is an implementation of the popular and adaptable neural network

architecture for machine learning. The algorithm works by testing each possible state of the input

attribute against each possible state of the predictable attribute, and calculating probabilities for

each combination based on the 80% training data. As mentioned, we normalized numeric data,

such as sales price and total living square feet, to be incorporated into the Neural Network

mining model.

Based on the mining models, lift charts for the sales price prediction were created, as shown in

Appendices 15 and 16. The predicted sales price from the Neural Network model (Appendix 16) is

closer to the prediction line than the predicted sale price from the Decision Trees model

(Appendix 15). This indicates that the Neural Network model is more accurate, which we attribute to the

complexity of the analysis it performs.

Validation

Model evaluation is an integral part of the model development process and it helped us

determine the model that best represents our data. We evaluated the performance of our mining

models and assessed how well our models performed against real data by calculating prediction

errors using mean absolute percentage error (MAPE) and mean absolute deviation (MAD) for

the two models we used to predict home values in King County. Below we will discuss each of

these statistics and what they mean in relation to our data mining models.

The mean absolute percentage error (MAPE) measures the size of the error in percentage terms

and is calculated as the average of the unsigned percentage error. The primary purpose of the

MAPE is to assess forecast accuracy. Because it measures the size of the error in percentage

terms it is perhaps the easiest of the statistics to interpret.
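
Written out in its standard form, with $A_i$ the actual sale price and $P_i$ the predicted sale price for test transaction $i$ out of $n$ (symbols introduced here for illustration), the statistic is:

```latex
\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{A_i - P_i}{A_i} \right|
```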

The initial MAPE that was calculated for our Microsoft Decision Trees was 20.73%. However,

as mentioned earlier, the median price of homes sold in King County went up by 12.50% between

April 2015 and April 2016. Because we used 2015 data to create our models, we had to adjust

the predicted prices upwards by 12.50% in order to match the overall increase. After adjusting

the data to reflect the 12.50% increase in sales price, the size of the error increased further to

26.75%.

The initial MAPE that was calculated for our Neural Network model was 8.19%, but increased to

15.03% after adjusting for the aforementioned 12.50% increase in sales price. After comparing

the MAPE between the two models we can clearly determine that the Neural Network model has

a smaller error size and is therefore significantly more accurate than the Decision Trees model.

The mean absolute deviation (MAD) measures the size of the error in units and is calculated as

the average of the unsigned errors. Because MAD expresses accuracy in the same units as the

data, it is also easy to conceptualize the amount of error.
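
Both error measures can be sketched in a few lines of Python, assuming parallel lists of actual and predicted prices. The function names and sample prices are illustrative and do not come from our dataset. The 1.125 factor mirrors the 12.50% upward adjustment applied to the predicted prices.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100 * total / len(actual)


def mad(actual, predicted):
    """Mean absolute deviation, in the same units as the data."""
    total = sum(abs(a - p) for a, p in zip(actual, predicted))
    return total / len(actual)


# Hypothetical prices, for illustration only.
actual = [400_000, 500_000, 800_000]
predicted = [380_000, 550_000, 760_000]

# Adjust predictions upward by 12.50%, as described in the text.
adjusted = [p * 1.125 for p in predicted]

print(mape(actual, predicted), mad(actual, predicted))
print(mape(actual, adjusted), mad(actual, adjusted))
```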

The initial MAD calculated for our Decision Trees model was 96,375.18. Just as we adjusted the

MAPE for the 12.50% increased average sale price, we also adjusted the MAD to reflect this

increase. The adjusted MAD for our Decision Trees model increased the error to 115,897.20.

The initial MAD for our Neural Networks model was 34,843.53 but increased to 72,356.45 after

adjusting for the 12.50%. Based on these results we can conclude that the Neural Network model

is more accurate because it produces a lower mean prediction error when compared with the

Decision Trees model. See Table 5 for a full comparison of prediction error results between our

Decision Trees and Neural Network models.

As part of our model evaluation we calculated the average actual sale price for King County and

compared it with the average predicted value from our Decision Trees and Neural Network

models (see Table 6). We found that the average predicted value from our Decision Trees model

was $522,753.93 and the average predicted value from our Neural Network model was

$521,146.79, which are both very close to the average actual sales price of $522,809.69. After

adjusting these predictions for the 12.50% increase, the Decision Trees average predicted value

increases to $588,098.18 and the Neural Networks average predicted value increases to

$586,290.13. Although both predictions are very close, it is likely that the average predicted

value from the Neural Networks algorithm is more accurate based on our prediction error

calculations.

To further our analysis, we wanted to see the relationship between average income, educational

attainment, and the average predicted sales price from our Neural Network model for each zip

code in King County. We chose to use the Neural Network model for this analysis because our

previous evaluation indicated it is more accurate than the Decision Trees model. In Appendix 17

you will find a table showing the 5 zip codes with the highest average predicted sales price and

the 5 zip codes with the lowest predicted sales price along with the average income (in

thousands) and percentage of the population in each zip code that has a bachelor’s degree or

higher.

According to our analysis, the zip code corresponding with the Medina neighborhood (98039) in

King County had the highest average predicted home value at $1,457,870, had the highest

average household income at $547,754, and the highest percentage of the population who are 25

years and over and that hold a bachelor’s degree or higher (81.6%). This is not surprising since

Medina is widely known to be the wealthiest zip code in the state of Washington. In fact, Forbes

magazine listed the top three zip codes in our table as three of the most expensive zip

codes in the United States in 2015 (Schiffman, 2016). Looking at the five zip codes with the

lowest average predicted home value, we can see that these zip codes also have the lowest

average household income and lowest percentage of bachelor’s degree or higher for the 25 and

over population. As one might expect, our analysis indicates that, with a few exceptions,

there is a positive correlation between home values, income, and educational attainment.

We also divided our data into four categories (Table 7) based on home value to help us

determine which categories our Decision Trees and Neural Network models predict most

accurately. We created a classification matrix, which is a standard tool for evaluation of

statistical models and is sometimes referred to as a confusion matrix. We used the classification

matrix (Tables 8 & 9) to sort all cases from the models into categories, by determining whether

the predicted value matched the actual value. All the cases in each category were then counted,

and the totals are displayed in the matrix. The matrix compares actual to predicted values for

each of the four predicted home value levels. The rows in the matrix represent the predicted

values for the model, whereas the columns represent the actual values. The categories used in

analysis are false positive, true positive, false negative, and true negative.

The Decision Trees and Neural Network models we used to predict the home

value level were both extremely accurate, with precision and recall greater than 99%

for all home value levels in both models. Precision is the ratio of the number of relevant records

retrieved to the total number of irrelevant and relevant records retrieved and is expressed as a

percentage. Recall is the ratio of the number of relevant records retrieved to the total number of

relevant records in the database and is also expressed as a percentage. The overall accuracy of

both models was greater than 99%. This is likely because the home value levels we created

consisted of a large value range, making it more likely that the predicted level matches the

actual level. Going forward, it would likely make more sense to create more categories that are

made up of smaller home value ranges. This would help determine more precisely which home

value levels are predicted most accurately. Overall, we didn’t find that predicting the home value

level added value to our analysis because all four levels had an accuracy greater than 99%.
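
To make the precision, recall, and accuracy calculations concrete, the sketch below recomputes them from the Neural Network classification counts reported in Table 8, with rows as predicted levels and columns as actual levels. The function and variable names are illustrative.

```python
# Counts from Table 8 (rows = predicted level, columns = actual level).
# Order: 250k-500k, above 750k, below 250k, 500k-750k.
matrix = [
    [2465, 0, 0, 2],
    [0, 865, 0, 0],
    [0, 0, 727, 0],
    [0, 2, 0, 1388],
]


def precision(m, i):
    """Correct predictions for level i divided by all predictions of level i."""
    return m[i][i] / sum(m[i])


def recall(m, i):
    """Correct predictions for level i divided by all actual cases of level i."""
    return m[i][i] / sum(row[i] for row in m)


def accuracy(m):
    """Correct predictions (the diagonal) divided by all cases."""
    correct = sum(m[i][i] for i in range(len(m)))
    return correct / sum(sum(row) for row in m)


for i in range(len(matrix)):
    print(round(100 * precision(matrix, i), 2), round(100 * recall(matrix, i), 2))
print(round(100 * accuracy(matrix), 2))
```

Run on these counts, the overall accuracy is 5445/5449, or about 99.93%, matching Table 8. The recall for the "Above 750000" level is 865/867, or about 99.77%, which differs in the last digit from the 99.78% listed in the table.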

Conclusions

After making predictions using Decision Trees and Neural Network models, we evaluated our

models and concluded that the Neural Network model is more accurate than the Decision Trees model

because it generated lower mean errors. With an acceptable MAPE for our Neural Network

model, we can determine that this model works well and can offer suggestions on how the model

could be improved. The accuracy of the model would likely be improved if the training data

consisted of several years of sales transactions and not just one. Providing the model with

multiple years of data would allow it to recognize the year over year change in sales prices and

therefore make a more accurate prediction.

Moving forward, we believe it would be beneficial to include additional attributes

such as household size, marital status, median age, and standardized test scores, to compare King

County prices with those of other counties, and potentially to take a deeper dive into a more specific locality such as

the City of Seattle. From there, the City of Seattle could be compared with

other metropolitan cities in the Pacific Northwest region. Because each city provides different

attributes in their real estate sales datasets, this would be a good way to further test the accuracy

of the Neural Networks algorithm in predicting real estate sales prices. Additionally, we believe

further research should include the use of other data mining algorithms, such as Naive Bayes,

in order to compare accuracy with Neural Networks and Decision Trees.

Tables

Table 1: King County Real Estate Sales

Table 2: King County Residential Building

Table 3: Washington Educational Attainment

Table 4: Washington State Income

Table 5: Comparison of Error Results between Microsoft Decision Trees and Neural Networks

| Prediction Error | Microsoft Decision Trees | Microsoft Decision Trees After Adjusting For 12.5% | Neural Networks | Neural Networks After Adjusting For 12.5% |
|---|---|---|---|---|
| Mean absolute percentage error (MAPE) | 20.73% | 26.75% | 8.19% | 15.03% |
| Mean absolute deviation (MAD) | 96,375.18 | 115,897.20 | 34,843.53 | 72,356.45 |
| Mean squared deviation (MSD) | 2.537158 × 10^10 | 2.966596 × 10^10 | 3.230883 × 10^9 | 9.119909 × 10^9 |
| Standard Deviation | 155,775.98 | 159,387.28 | 63,742.08 | 71,346.55 |

Table 6: Average Actual Value vs. Average Predicted Value

| | Average Actual Sales Value | Average Predicted Value Decision Trees | Average Predicted Value Neural Networks |
|---|---|---|---|
| | $522,809.69 | $522,753.93 | $521,146.79 |
| After 12.50% Adjustment: | | $588,098.18 | $586,290.13 |

Table 7: Home Value Categories

| Home Value | Category |
|---|---|
| Below $250,000 | 1 |
| Between $250,000 and $500,000 | 2 |
| Between $500,000 and $750,000 | 3 |
| Above $750,000 | 4 |

Table 8: Classification Matrix - Counts for KC_NNetworks-Discretized on DIS Sale Price Level

| Predicted | Between 250000 and 500000 (Actual) | Above 750000 (Actual) | Below 250000 (Actual) | Between 500000 and 750000 (Actual) | Total | Precision |
|---|---|---|---|---|---|---|
| Between 250000 and 500000 | 2465 | 0 | 0 | 2 | 2467 | 99.92% |
| Above 750000 | 0 | 865 | 0 | 0 | 865 | 100% |
| Below 250000 | 0 | 0 | 727 | 0 | 727 | 100% |
| Between 500000 and 750000 | 0 | 2 | 0 | 1388 | 1390 | 99.86% |
| Total | 2465 | 867 | 727 | 1390 | 5449 | |
| Recall | 100% | 99.78% | 100% | 99.86% | | |

Overall Accuracy: 99.93%

Table 9: Classification Matrix - Counts for KC_DTrees_Discretized on Sale Price Level

| Predicted | Between 250000 and 500000 (Actual) | Above 750000 (Actual) | Below 250000 (Actual) | Between 500000 and 750000 (Actual) | Total | Precision |
|---|---|---|---|---|---|---|
| Between 250000 and 500000 | 2385 | 0 | 0 | 1 | 2386 | 99.96% |
| Above 750000 | 0 | 914 | 0 | 0 | 914 | 100% |
| Below 250000 | 0 | 0 | 749 | 0 | 749 | 100% |
| Between 500000 and 750000 | 0 | 3 | 0 | 1397 | 1400 | 99.786% |
| Total | 2385 | 917 | 749 | 1398 | 5449 | |
| Recall | 100% | 99.67% | 100% | 99.93% | | |

Overall Accuracy: 99.93%

Appendices

Appendix 1: KC_Normalized_TestData4Transactions

select top 20 percent * into KC_Normalized_TestData4Transactions
from [dbo].[KC Comprehensive Normalized]
order by newid()

Appendix 2: KC_TestData4Transactions

select top 20 percent * into KC_TestData4Transactions
from [dbo].[KC Comprehensive]
order by newid()

Appendix 3: KCPredictionsDTrees

CREATE TABLE [dbo].[KCPredictionDTrees](
    [Transaction ID] [int] NULL,
    [Parcel ID] [nvarchar](255) NULL,
    [Sale Price Prediction] [float] NULL,
    [Sale Price Prediction Probability] [float] NULL
) ON [PRIMARY]
GO

Appendix 4: KCPredictionsNNetworks

CREATE TABLE [dbo].[KCPredictionNNetworks](
    [Transaction ID] [int] NULL,
    [Parcel ID] [nvarchar](255) NULL,
    [Sale Price Prediction] [float] NULL,
    [Sale Price Prediction Probability] [float] NULL
) ON [PRIMARY]
GO

Appendix 5: AvgIncome

create view AvgIncome
as
select [ZIP_code ],
    SUM([Total income Amount])/SUM([Number of returns]) as AverageIncome
from [dbo].[Washington State Income]
where [Total income Number of returns] >= 1
GROUP BY [ZIP_code ]
go

Appendix 6: KC Comprehensive Normalized

create view [KC Comprehensive Normalized]
as
Select [Transaction ID], [Parcel ID],
    [Date] [ORG_Date],
    [Sale Price] [ORG_Sale Price],
    [Sale Price Level] [DIS_Sale Price Level],
    ([Sale Price] - 181979.00)/(1250000.00 - 181979.00) as NRM_minmaxSalePrice,
    ([Sale Price] - 500182.2922)/226698.929506941 as NRM_stdevSalePrice,
    [Number Living Units] [ORG_Number Living Units],
    ([Number Living Units] - 1)/(3 - 1) as NRM_minmaxNbrLivingUnits,
    ([Number Living Units] - 1.028132)/0.952903213570173 as NRM_stdevNbrLivingUnits,
    [Zip Code] [ORG_Zip Code],
    stories [ORG_House Stories],
    ([Stories] - 1)/(3 - 1) as NRM_minmaxStories,
    ([Stories] - 1.344954)/0.501266361492268 as NRM_stdevStories,
    SqftTotLivingLevel [DIS_SqftTotLivingLevel],
    ([Total Living Square Feet] - 70)/(18070 - 70) as NRM_minmaxSqft,
    ([Total Living Square Feet] - 2022.016033)/938.828419460636 as NRM_stdevSqft,
    [Number of Bedrooms] [ORG_Number of Bedrooms],
    ([Number of Bedrooms] - 0)/(11 - 0) as NRM_minmaxBedrooms,
    ([Number of Bedrooms] - 3.334238)/0.952903213570173 as NRM_stdevBedrooms,
    [Number of Half Baths] [ORG_Number of Half Baths],
    ([Number of Half Baths] - 0)/(3 - 0) as NRM_minmaxBathHalf,
    ([Number of Half Baths] - 0.420505)/0.522980565538599 as NRM_stdevBathHalf,
    [Number of Three Quarter Baths] [ORG_Number of Three Quarter Baths],
    ([Number of Three Quarter Baths] - 0)/(9 - 0) as NRM_minmaxBath3qtr,
    ([Number of Three Quarter Baths] - 0.439986)/0.60885733618247 as NRM_stdevBath3qtr,
    [Number of Full Baths] [ORG_Number of Full Baths],
    ([Number of Full Baths] - 0)/(6 - 0) as NRM_minmaxBathFull,
    ([Number of Full Baths] - 1.467859)/0.655766216478015 as NRM_stdevBathFull,
    [Fireplace Single Story] [ORG_Fireplace Single Story],
    ([Fireplace Single Story] - 0)/(5 - 0) as NRM_minmaxFpSingleStory,

    ([Fireplace Single Story] - 0.567135)/0.612244039524042 as NRM_stdevFpSingleStory,
    [Fireplace Multi Story] [ORG_Fireplace Multi Story],
    ([Fireplace Multi Story] - 0)/(5 - 0) as NRM_minmaxFpMultiStory,
    ([Fireplace Multi Story] - 0.349670)/0.517995522910392 as NRM_stdevFpMultiStory,
    [Fireplace Freestanding] [ORG_Fireplace Freestanding],
    ([Fireplace Freestanding] - 0)/(3 - 0) as NRM_minmaxFpFreestanding,
    ([Fireplace Freestanding] - 0.080819)/0.286805229416352 as NRM_stdevFpFreestanding,
    [Fireplace Additional] [ORG_Fireplace Additional],
    ([Fireplace Additional] - 0)/(5 - 0) as NRM_minmaxFpAdditional,
    ([Fireplace Additional] - 0.212358)/0.434057015463764 as NRM_stdevFpAdditional,
    [Year Built] [ORG_Year Built],
    [Year Renovated] [ORG_Year Renovated],
    [Percent Complete] [ORG_Percent Complete],
    ([Percent Complete] - 0)/(100 - 0) as NRM_minmaxPcntComplete,
    ([Percent Complete] - 3.339718)/5.03480573857016 as NRM_stdevPcntComplete,
    [Average Income] [ORG_Average Income],
    ([Average Income] - 40.60089916506101477199743)/(547.75486111111111111111111 - 40.60089916506101477199743) as NRM_minmaxAvgIncome,
    ([Average Income] - 94.70925108704326585435336)/49.6171100660926 as NRM_stdevAvgIncome,
    [Population 18 to 24 years] as [ORG_Population 18 to 24 years],
    ([Population 18 to 24 years] - 120 )/(20414 - 120) as [NRM_minmax Population 18 to 24 years],
    ([Population 18 to 24 years] - 2738.665345)/2348.78231419757 as [NRM_stdev Population 18 to 24 years],
    [Population 18 to 24 years Bachelor's degree or higher] as [ORG_Population 18 to 24 years Bachelor's degree or higher],
    ([Population 18 to 24 years Bachelor's degree or higher] - 0 )/(0.617 - 0) as [NRM_minmax Population 18 to 24 years Bachelor's degree or higher],
    ([Population 18 to 24 years Bachelor's degree or higher] - 0.184121)/0.123096325541119 as [NRM_stdev Population 18 to 24 years Bachelor's degree or higher],
    [Population 25 to 34 years] as [ORG_Population 25 to 34 years],
    ([Population 25 to 34 years] - 40 )/(13979 - 40) as [NRM_minmax Population 25 to 34 years],
    ([Population 25 to 34 years] - 5116.037267)/2832.64231547651 as [NRM_stdev Population 25 to 34 years],

[Population 25 to 34 years Bachelor's degree or higher] as [ORG_Population 25 to 34 years Bachelor's degree or higher],
([Population 25 to 34 years Bachelor's degree or higher] - 0.099)/(0.818 - 0.099) as [NRM_minmax Population 25 to 34 years Bachelor's degree or higher],
([Population 25 to 34 years Bachelor's degree or higher] - 0.477103)/0.20197370299301 as [NRM_stdev Population 25 to 34 years Bachelor's degree or higher],
[Population 35 to 44 years] as [ORG_Population 35 to 44 years],
([Population 35 to 44 years] - 67)/(11126 - 67) as [NRM_minmax Population 35 to 44 years],
([Population 35 to 44 years] - 4862.320140)/1855.36679073523 as [NRM_stdev Population 35 to 44 years],
[Population 35 to 44 years Bachelor's degree or higher] as [ORG_Population 35 to 44 years Bachelor's degree or higher],
([Population 35 to 44 years Bachelor's degree or higher] - 0.069)/(0.960 - 0.069) as [NRM_minmax Population 35 to 44 years Bachelor's degree or higher],
([Population 35 to 44 years Bachelor's degree or higher] - 0.513136)/0.207805829433468 as [NRM_stdev Population 35 to 44 years Bachelor's degree or higher],
[Population 45 to 64 years] as [ORG_Population 45 to 64 years],
([Population 45 to 64 years] - 306)/(13939 - 306) as [NRM_minmax Population 45 to 64 years],
([Population 45 to 64 years] - 8639.359863)/2963.56581762132 as [NRM_stdev Population 45 to 64 years],
[Population 45 to 64 years Bachelor's degree or higher] as [ORG_Population 45 to 64 years Bachelor's degree or higher],
([Population 45 to 64 years Bachelor's degree or higher] - 0.113)/(0.834 - 0.113) as [NRM_minmax Population 45 to 64 years Bachelor's degree or higher],
([Population 45 to 64 years Bachelor's degree or higher] - 0.463571)/0.174867046954704 as [NRM_stdev Population 45 to 64 years Bachelor's degree or higher],
[Population 65 years and over] as [ORG_Population 65 years and over],
([Population 65 years and over] - 123)/(7159 - 123) as [NRM_minmax Population 65 years and over],
([Population 65 years and over] - 3685.495150)/1362.5483431876 as [NRM_stdev Population 65 years and over],
[Population 65 years and over Bachelor's degree or higher] as [ORG_Population 65 years and over Bachelor's degree or higher],
([Population 65 years and over Bachelor's degree or higher] - 0.042)/(0.737 - 0.042) as [NRM_minmax Population 65 years and over Bachelor's degree or higher],
([Population 65 years and over Bachelor's degree or higher] - 0.376489)/0.146035021931026 as [NRM_stdev Population 65 years and over Bachelor's degree or higher]

from [dbo].[KC Comprehensive]
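The query above applies two rescalings to each numeric column: min-max scaling, (x - min)/(max - min), and standard-score scaling, (x - mean)/stdev, with the constants written in by hand. As a minimal sketch of the same arithmetic, the Python below derives those constants from the data instead. The function names are hypothetical, and the use of the sample standard deviation (what SQL Server's STDEV returns) is an assumption, since the appendix does not say how the constants were obtained.

```python
from statistics import mean, stdev


def min_max(values):
    """Rescale values to [0, 1] using the observed minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Center on the mean and divide by the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample (n - 1) standard deviation
    return [(v - mu) / sigma for v in values]


# Illustrative values only, not taken from the King County data.
full_baths = [1, 2, 3]
print(min_max(full_baths))  # [0.0, 0.5, 1.0]
print(z_score(full_baths))  # [-1.0, 0.0, 1.0]
```

One caveat worth checking against the actual table: if a source column is stored as an integer type, T-SQL divides integers with truncation, so an expression such as (x - 0)/(6 - 0) would need a decimal operand (for example 6.0) to return a fraction. The column types are not stated in this appendix.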

Appendix 7: KC_Normalized_TrainingData4Transactions

-- Draw a random 20 percent of the normalized rows and save them as the test table
select top 20 percent * into KC_Normalized_TestData4Transactions
from [dbo].[KC Comprehensive Normalized]
order by newid()

Appendix 8: KC_TrainingData4Transactions

create view KC_TrainingData4Transactions

as

select * from [dbo].[KC Comprehensive]

except

select * from [dbo].[KC_TestData4Transactions]
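Appendices 7 and 8 together describe a holdout split: a random 20 percent of rows becomes the test set, and the training set is everything that is not in the test set. The sketch below reproduces that logic in Python on a list of unique row identifiers. The function name and the seed parameter are hypothetical additions for illustration; rounding the test-set size up follows the documented behavior of TOP ... PERCENT in SQL Server.

```python
import random


def split_train_test(ids, test_percent=20, seed=None):
    """Randomly hold out test_percent of the unique ids; the rest is training."""
    ids = list(ids)
    rng = random.Random(seed)
    # Integer ceiling of len(ids) * test_percent / 100.
    n_test = (len(ids) * test_percent + 99) // 100
    test = set(rng.sample(ids, n_test))
    # Equivalent of "select all rows EXCEPT the test rows".
    train = [i for i in ids if i not in test]
    return train, sorted(test)


# Illustrative identifiers only.
train_ids, test_ids = split_train_test(range(1, 101), seed=42)
print(len(train_ids), len(test_ids))  # 80 20
```

Note that EXCEPT in SQL compares entire rows and returns distinct rows, so the view matches this sketch only when every row of the source table is unique.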

Appendix 9: PredictionEvaluationDTrees

create view PredictionEvaluationDTrees

as

select a.[Transaction ID], a.[Sale Price] [Actual Sale Price], p.[Sale Price Prediction]
from [dbo].[KC Comprehensive] a
inner join [dbo].[KCPredictionDTrees] p on a.[Transaction ID] = p.[Transaction ID]

Go

Appendix 10: PredictionEvaluationNNetworks

create view PredictionEvaluationNNetworks

as

select a.[Transaction ID], a.[ORG_Sale Price] [Actual Sale Price], p.[Sale Price Prediction]
from [dbo].[KC Comprehensive Normalized] a
inner join [dbo].[KCPredictionNNetworks] p on a.[Transaction ID] = p.[Transaction ID]

go

Appendix 11: View4EvaluationKCDTrees

create view [dbo].[View4EvaluationKCDTrees]

as

select *, ABS([Actual Sale Price] - [Sale Price Prediction]) [AbsoluteDeviation],

SQUARE ([Actual Sale Price] - [Sale Price Prediction]) [SquaredDeviation]

from [dbo].[PredictionEvaluationDTrees]

GO

Appendix 12: View4EvaluationKCNNetworks

create view [dbo].[View4EvaluationKCNNetworks]

as

select *, ABS([Actual Sale Price] - [Sale Price Prediction]) [AbsoluteDeviation],

SQUARE ([Actual Sale Price] - [Sale Price Prediction]) [SquaredDeviation]

from [dbo].[PredictionEvaluationNNetworks]

GO
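The two evaluation views expose an absolute deviation and a squared deviation for every prediction, which are the building blocks of the usual error summaries. As a hedged sketch, the Python below aggregates them into mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE), the measure named in the Executive Summary. The function name and the sample prices are hypothetical and are not results from this project.

```python
from math import sqrt


def evaluate(actual, predicted):
    """Summarize prediction error from paired actual and predicted prices."""
    abs_dev = [abs(a - p) for a, p in zip(actual, predicted)]   # AbsoluteDeviation
    sq_dev = [(a - p) ** 2 for a, p in zip(actual, predicted)]  # SquaredDeviation
    n = len(abs_dev)
    return {
        "MAE": sum(abs_dev) / n,
        "RMSE": sqrt(sum(sq_dev) / n),
        # Each absolute deviation is taken relative to its actual price.
        "MAPE": 100 * sum(d / a for d, a in zip(abs_dev, actual)) / n,
    }


# Made-up prices for illustration only.
print(evaluate([100.0, 200.0], [110.0, 180.0]))
```

MAPE assumes that no actual sale price is zero; rows with a zero price would have to be excluded before calling the function.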

Appendix 13: Decision Tree for Sale Price

Appendix 14: Dependency Network for Sale Price

Appendix 15: Lift Chart from Decision Trees Sale Price

Appendix 16: Lift Chart from Decision Trees Sale Price

Appendix 17: Top 5 & Lowest 5 Zip Codes By Predicted Value From Neural Networks Model

select i.[ZIP_code ] [Zip Code], avg(p.[Predicted Value]) [Average Predicted Value],
i.[AverageIncome] [Average Income], e.[Population 25 years and over Percent bachelor's degree or higher]
from [dbo].[PredictedNNetworksZipCode] p
inner join [dbo].[AvgIncome] i on p.[ORG_Zip Code] = i.[ZIP_code ]
inner join [dbo].[Washington Educational Attainment] e on i.[ZIP_code ] = e.[Zip Code]
group by i.[ZIP_code ], i.[AverageIncome], e.[Population 25 years and over Percent bachelor's degree or higher]

order by avg(p.[Predicted Value]) desc
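The query returns every zip code ranked from highest to lowest average predicted value, so the top 5 and lowest 5 are simply the first and last five rows of the result. A minimal sketch of that final step, assuming the result has been loaded into a dictionary keyed by zip code (the function name and the placeholder keys and values are hypothetical):

```python
def top_and_bottom(avg_by_zip, n=5):
    """Return the n highest and n lowest entries, both in descending order."""
    ranked = sorted(avg_by_zip.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n], ranked[-n:]


# Placeholder keys and values, not real zip codes or predictions.
sample = {f"zip{k:02d}": 100_000 * k for k in range(1, 13)}
top5, low5 = top_and_bottom(sample)
print([z for z, _ in top5])  # ['zip12', 'zip11', 'zip10', 'zip09', 'zip08']
print([z for z, _ in low5])  # ['zip05', 'zip04', 'zip03', 'zip02', 'zip01']
```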

References

Internal Revenue Service (2016). https://www.irs.gov/uac/SOI-Tax-Stats-Individual-Income-Tax-Statistics-2013-ZIP-Code-Data-(SOI)

King County Assessor (2016). http://info.kingcounty.gov/assessor/DataDownload

United States Census Bureau (2016). http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=bkmk

Zillow King County (2016). http://www.zillow.com/king-county-wa/home-values

Zillow Zestimate (2016). http://www.zillow.com/zestimate