June 7, 2016 Seattle University
King County Real Estate Prices Analysis and Predictions
IS 5325 - Capstone Project in Business Intelligence
Prepared by:
Samantha Adriaan
Braden Simonsen
Yan Zhang
Executive Summary
This paper analyzes the use of data mining models in predicting real estate prices in King
County, Washington. Based on prior research on the real estate market, data mining, and
machine learning, we chose the Microsoft Decision Trees and Neural Network algorithms for
our prediction models.
Using publicly available sales and residential building data from the King County Assessor’s
Office, income data from the Internal Revenue Service, and educational attainment data from the
United States Census Bureau, we extracted, transformed, and loaded the data into Microsoft SQL
Server. We then used Microsoft Visual Studio to build our prediction models and perform data
mining on over 27,000 transactions. In total, we built four prediction models: a Decision Tree
model for predicting sales price, a Decision Tree model for predicting home value category, a
Neural Network model for predicting sales price, and a Neural Network model for predicting home
value category.
To assess the accuracy of the models and determine which one best represents our data, we
evaluated each mining model against real data by calculating prediction errors using mean
absolute percentage error (MAPE) and mean absolute deviation (MAD). We found that our Neural
Network model was significantly more accurate than our Decision Trees model. For the two models
built to predict the home value category, we used a classification matrix to sort all cases from
the models into categories according to whether the predicted value matched the actual value.
Because the overall accuracy of both of these models was greater than 99%, we did not find that
predicting the home value category added value to our analysis.
To further our research, we analyzed the relationship between average income, educational
attainment, and the average predicted sales price from our Neural Network model for each zip
code in King County. Our findings indicate that, with a few exceptions, home values, income, and
educational attainment are positively correlated.
Ultimately, we determined that the Neural Network algorithm is more accurate than Decision
Trees in predicting real estate prices. Moving forward, we believe that the model could be
improved by using additional attributes, such as household size, marital status, median age, and
standardized test scores. Furthermore, we believe additional research should include other data
mining algorithms based on Bayes' theorem, such as Naive Bayes, in order to compare their
accuracy with that of Neural Networks and Decision Trees.
Introduction
Home values in King County have surged in recent years as the economy has rebounded from
the global recession of 2008, as shown in Figure 1. The growth of companies, specifically
technology firms, in the Seattle area has attracted people from all over the world, causing the
region's population and housing demand to grow rapidly in recent years. This growth has pushed
real estate prices to new highs as demand continues to exceed available inventory. According to
Zillow’s Zestimate tool, the current median home price in King County is $477,900, an increase
of 11.6% since May 2015 (Zillow King County, 2016). According to the Northwest Multiple Listing
Service, the median price of homes sold in King County increased 12.50% from April 2015 to
April 2016 (Northwest Multiple Listing Service, 2016).
Figure 1: Zillow Price Trend of Home Values in King County
With such strong value growth in residential real estate and over 27,000 sales
transactions in 2015 (King County Assessor, 2016), the King County real estate market provides
us with a fascinating opportunity to predict future home values using Decision Trees and Neural
Network data mining models. Zillow predicts home values in King County will rise by 6.2%
between June 2016 and June 2017 (Zillow King County, 2016).
Zillow uses proprietary automated valuation models that apply advanced algorithms to identify
relationships, within a specific geographic area, between home-related data and actual sales
prices. Home characteristics, such as square footage, location, or the number of bathrooms, are
given different weights according to their influence on home sale prices in each specific
geography over a specific period of time. The result is a set of valuation rules, or models, that
are applied to generate each home's Zestimate. The data used in the algorithm include physical
attributes such as lot size and square footage, tax assessments, and prior and current
transactions (Zillow Zestimate, 2016).
For our sales price prediction models, we obtained 2015 sales data from the King County
Assessor, educational attainment data from the United States Census Bureau, and income data
from the IRS. With this data we created models using the Decision Trees and Neural Networks
algorithms to predict home values in King County. We evaluated the performance of our mining
models and assessed how well our models performed against real data by calculating prediction
errors, allowing us to determine which model has the highest accuracy.
Problem Identification
The real estate market is known to fluctuate over time, and the problem we observed is that
housing prices are highly unpredictable. Using database management techniques, data mining
algorithms, and secondary research, we set out to forecast housing prices by taking into account
factors such as income and educational attainment along with the total square footage of living
area.
Explanation of Domain Problems and Datasets
In gathering the datasets, we had difficulty finding reliable data sources. We decided to use
publicly available data: residential building details and sales data from the King County
Assessor's Office, average income by zip code from the Internal Revenue Service, and 2014
educational attainment data from the Census Bureau. We believe these four datasets are the most
reliable because they are collected directly by the respective government agencies.
Because the real estate domain is broad, it was difficult to choose a specific geographical
location and property type to focus on. Our original intent was to analyze King County data
together with Scholastic Aptitude Test (SAT) scores to determine whether there is a correlation
between house prices and SAT scores. We quickly ran into a roadblock: we could not get access
to SAT scores from the College Board due to its policies. We learned that we could access only
general data, not data specific to a zip code, and that the more detailed data is available only
to school districts. We therefore moved on to other viable measures that could serve the same
goal: educational attainment and average income.
The next problem was with the datasets. The King County Assessor's Office offers 24 datasets to
the general public for download. We reviewed the majority of them to determine which were
useful and, after some discussion, agreed on two: Real Estate Sales and Residential Buildings.
The Real Estate Sales dataset provides information on each sales transaction for properties
within King County, and the Residential Buildings dataset provides building details for each
property. We then combined the Parcel Major and Parcel Minor fields to obtain the Parcel ID,
which uniquely identifies each property.
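Combining the two parcel fields can be sketched in Python as a simple string concatenation. This is an illustration only: the function name parcel_id, the sample values, and the zero-padding widths (6 digits for the major part, 4 for the minor part) are our assumptions and are not stated in the source data description.

```python
def parcel_id(major, minor):
    """Concatenate Parcel Major and Parcel Minor into one identifier string.

    Assumption (not stated in the paper): the major part is zero-padded to
    6 digits and the minor part to 4 digits before concatenation.
    """
    return f"{int(major):06d}{int(minor):04d}"


# Hypothetical values, for illustration only.
print(parcel_id(12345, 67))
```

Keeping the result as a string rather than a number preserves leading zeros, so two different major/minor pairs cannot collapse into the same identifier.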
For the educational attainment and average income datasets, we had difficulty finding
relevant information that could be organized by zip code. After a long search, we located a
Census Bureau dataset on educational attainment by zip code, age group, and education level,
and obtained average income by zip code from the Internal Revenue Service. We then cleaned
these datasets to better align them with the King County datasets.
ETL (Extract, Transform, and Load)
We started the ETL process by obtaining publicly available datasets. The King County
Assessor’s Office provided the real estate sales and residential building datasets. We obtained
educational attainment data from the U.S. Census Bureau and average income data from the
Internal Revenue Service. We used these datasets because government agencies are reliable
sources for the information we needed. For the complete data dictionary of our datasets, see
the attached tables:
● Table 1: King County Real Estate Sales
● Table 2: King County Residential Building
● Table 3: Washington Educational Attainment
● Table 4: Washington State Income
We then imported the four datasets into Microsoft SQL Server, where we faced a number of
challenges. Because each data source defined its data types differently, the import process was
especially difficult. We started with four tables and more than 150 attributes, and each
attribute had to be reviewed and validated for the accuracy of its data type. The data cleaning
process consumed a significant portion of our effort.
After the four datasets were successfully imported, we had a total of 2.3 million rows of data. We
joined the four individual datasets and isolated the data to 2015 real estate sales
transactions. Next, we eliminated sale price outliers, defined as prices lower than $100,000 or
greater than $2 million. Finally, we removed data that were not pertinent to our research
and analysis, narrowing our final dataset to roughly 27,000 rows.
We also had to resolve a data duplication error. We discovered that it was caused by properties
that were sold multiple times in 2015 under the same parcel ID and by multiple residential
buildings listed under the same parcel ID. To solve this problem, we created a primary key to
uniquely identify each transaction. To prepare for the next phase of our research, we
normalized all the numeric data and discretized sales price and total square footage of living
area. We later used the normalized numeric data in our Neural Network mining model.
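The view in Appendix 6 stores two normalized versions of each numeric column, named NRM_minmax... and NRM_stdev.... A minimal Python sketch of those two transformations follows. The function names and sample values are ours, and the use of the population standard deviation is an assumption, since the paper does not say which form was used.

```python
from statistics import mean, pstdev


def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Subtract the mean and divide by the standard deviation.

    Assumption: population standard deviation (pstdev); the source does not
    state whether a population or sample estimate was used.
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical living-area values in square feet, for illustration only.
sqft = [1000, 1500, 2000, 2500, 3000]
print(min_max(sqft))
print(z_score(sqft))
```

Min-max scaling bounds every value between 0 and 1 but is sensitive to extreme values, while the standard-deviation form is unbounded and centered on zero.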
As a result of the extract, transform, and load process, we created the following tables and
views from our four original datasets to capture the most relevant information for the data
mining models (Appendices 1-12):
● [dbo].[KC_Normalized_TestData4Transactions], Appendix 1
● [dbo].[KC_TestData4Transactions], Appendix 2
● [dbo].[KCPredictionsDTrees], created from data mining prediction model, Appendix 3
● [dbo].[KCPredictionsNNetworks], created from data mining prediction model, Appendix 4
● [dbo].[AvgIncome], Appendix 5
● [dbo].[KC Comprehensive Normalized], Appendix 6
● [dbo].[KC_Normalized_TrainingData4Transactions], Appendix 7
● [dbo].[KC_TrainingData4Transactions], Appendix 8
● [dbo].[PredictionEvaluationDTrees], Appendix 9
● [dbo].[PredictionEvaluationNNetworks], Appendix 10
● [dbo].[View4EvaluationKCDTrees], Appendix 11
● [dbo].[View4EvaluationKCNNetworks], Appendix 12
Data Mining Models and Analysis
We used Microsoft Visual Studio 2013 to apply the Decision Trees and Neural Network algorithms
to predict real estate prices in King County based on the 2015 dataset. Prior to creating
the mining structures, we divided the King County Comprehensive table into 80% training data
and 20% testing data. We then used the training data to build our mining models.
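The 80/20 division can be sketched in Python as a random shuffle followed by a cut. This mirrors the idea behind the "top 20 percent ... order by newid()" queries in Appendices 1 and 2 but is not the authors' code: the function name split_rows, the fixed seed, and the sample IDs are hypothetical.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and return (training, testing) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical transaction IDs, for illustration only.
transaction_ids = list(range(1, 101))
training, testing = split_rows(transaction_ids)
print(len(training), len(testing))
```

Drawing the test rows first and defining the training set as everything else guarantees that no transaction appears in both sets.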
The Decision Trees algorithm is a classification and regression algorithm for use in predictive
modeling of both discrete and continuous attributes. For discrete attributes, such as the sale price
level, the algorithm makes predictions based on the relationships between input columns in the
dataset. It uses the values, known as states, of those columns to predict the states of a column
that we designate as predictable. Specifically, the algorithm identifies the input columns that are
correlated with the Sale Price Level column. The decision tree makes predictions based on this
tendency toward a particular outcome. For continuous attributes, such as the sales price, the
algorithm uses linear regression to determine where a decision tree splits. Since only one
column is set to be predictable, the algorithm builds a single decision tree for the Sale Price
column (Microsoft, 2016).
The Decision Tree Viewer for sale price (Appendix 13) shows a visual representation of the
decision rules created in the decision tree model. “All” is the root of the decision tree, and
as we move from left to right, the darker boxes mark stronger determinants of the sale price.
We can see that the population holding a Bachelor’s degree or higher is the key determinant of
sale price, and the tree branches off from there. The Dependency Network for sale price
(Appendix 14) shows that, surprisingly, the strongest dependency link to the sale price is the
population between 45 and 64 years old holding a Bachelor's degree or higher.
The Neural Network algorithm is an implementation of the popular and adaptable neural network
architecture for machine learning. The algorithm works by testing each possible state of the input
attribute against each possible state of the predictable attribute, and calculating probabilities for
each combination based on the 80% training data. As mentioned, we normalized numeric data,
such as sales price and total living square feet, to be incorporated into the Neural Network
mining model.
Based on the mining models, we created lift charts for the sales price prediction, shown in
Appendices 15 and 16. The predicted sales prices from the Neural Network model (Appendix 16)
lie closer to the prediction line than those from the Decision Trees model (Appendix 15). This
suggests that the Neural Network model is more accurate, which we attribute to the greater
complexity of the analysis it performs.
Validation
Model evaluation is an integral part of the model development process and it helped us
determine the model that best represents our data. We evaluated the performance of our mining
models and assessed how well our models performed against real data by calculating prediction
errors using mean absolute percentage error (MAPE) and mean absolute deviation (MAD) for
the two models we used to predict home values in King County. Below we will discuss each of
these statistics and what they mean in relation to our data mining models.
The mean absolute percentage error (MAPE) measures the size of the error in percentage terms
and is calculated as the average of the unsigned percentage errors. The primary purpose of the
MAPE is to assess forecast accuracy. Because it measures the size of the error in percentage
terms, it is perhaps the easiest of the statistics to interpret.
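As a minimal illustration of this definition, the Python sketch below averages the absolute error of each prediction as a fraction of the actual price. The function name mape and the sample prices are hypothetical and are not taken from our dataset.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned as a percentage."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100 * sum(errors) / len(errors)


# Hypothetical sale prices, for illustration only.
actual_prices = [400_000, 500_000, 250_000]
predicted_prices = [440_000, 450_000, 250_000]
print(f"MAPE: {mape(actual_prices, predicted_prices):.2f}%")
```

Here the individual errors are 10%, 10%, and 0%, so the average is about 6.67%.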
The initial MAPE calculated for our Microsoft Decision Trees model was 20.73%. However,
as mentioned earlier, the median price of homes sold in King County rose by 12.50% between
April 2015 and April 2016. Because we used 2015 data to create our models, we adjusted
the predicted prices upward by 12.50% to match the overall increase. After this adjustment,
the error increased further to 26.75%.
The initial MAPE that was calculated for our Neural Network model was 8.19%, but increased to
15.03% after adjusting for the aforementioned 12.50% increase in sales price. After comparing
the MAPE between the two models we can clearly determine that the Neural Network model has
a smaller error size and is therefore significantly more accurate than the Decision Trees model.
The mean absolute deviation (MAD) measures the size of the error in units and is calculated as
the average of the unsigned errors. Because MAD expresses accuracy in the same units as the
data, it is also easy to conceptualize the amount of error.
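The MAD described here can be sketched in a few lines of Python as the average of the absolute differences between actual and predicted prices. The function name mad and the sample prices are hypothetical.

```python
def mad(actual, predicted):
    """Mean absolute deviation: average unsigned error, in the data's units."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)


# Hypothetical sale prices in dollars, for illustration only.
actual_sales = [400_000, 500_000, 250_000]
predicted_sales = [440_000, 450_000, 250_000]
print(f"MAD: ${mad(actual_sales, predicted_sales):,.2f}")
```

With errors of $40,000, $50,000, and $0, the result is $30,000, which reads directly as a typical dollar miss per prediction.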
The initial MAD calculated for our Decision Trees model was $96,375.18. Just as we adjusted the
MAPE for the 12.50% increase in sale prices, we also adjusted the MAD to reflect this
increase. The adjusted MAD for our Decision Trees model rose to $115,897.20.
The initial MAD for our Neural Network model was $34,843.53, which rose to $72,356.45 after
adjusting for the 12.50% increase. Based on these results, we can conclude that the Neural
Network model is more accurate because it produces a lower mean prediction error than the
Decision Trees model. See Table 5 for a full comparison of prediction error results between our
Decision Trees and Neural Network models.
As part of our model evaluation we calculated the average actual sale price for King County and
compared it with the average predicted value from our Decision Trees and Neural Network
models (see Table 6). We found that the average predicted value from our Decision Trees model
was $522,753.93 and the average predicted value from our Neural Network model was
$521,146.79, which are both very close to the average actual sales price of $522,809.69. After
adjusting these predictions for the 12.50% increase, the Decision Trees average predicted value
increases to $588,098.18 and the Neural Networks average predicted value increases to
$586,290.13. Although both predictions are very close, the average predicted value from the
Neural Network model is likely the more accurate of the two, based on our prediction error
calculations.
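The 12.50% adjustment amounts to multiplying each predicted price by 1.125. The short Python sketch below applies it to the two average predicted values reported in Table 6. The function name adjust is ours, and the results agree with the adjusted figures in Table 6 to within a cent of rounding.

```python
GROWTH = 0.125  # 12.50% increase from April 2015 to April 2016 (NWMLS figure cited above)


def adjust(predicted_price, growth=GROWTH):
    """Scale a 2015-based prediction upward by the observed price growth."""
    return predicted_price * (1 + growth)


decision_trees_avg = 522_753.93    # average predicted value, Decision Trees (Table 6)
neural_network_avg = 521_146.79    # average predicted value, Neural Network (Table 6)
print(f"{adjust(decision_trees_avg):,.2f}")
print(f"{adjust(neural_network_avg):,.2f}")
```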
To further our analysis, we examined the relationship between average income, educational
attainment, and the average predicted sales price from our Neural Network model for each zip
code in King County. We chose the Neural Network model for this analysis because our
previous evaluation indicated it is more accurate than the Decision Trees model. Appendix 17
contains a table showing the 5 zip codes with the highest average predicted sales price and
the 5 zip codes with the lowest average predicted sales price, along with the average income (in
thousands) and the percentage of the population in each zip code that holds a bachelor’s degree
or higher.
According to our analysis, the zip code corresponding to the Medina neighborhood (98039) in
King County had the highest average predicted home value at $1,457,870, the highest
average household income at $547,754, and the highest percentage of the population aged 25
and over holding a bachelor’s degree or higher (81.6%). This is not surprising, since
Medina's zip code is widely known as the wealthiest in the state of Washington. In fact, Forbes
magazine included the top three zip codes in our table among the most expensive zip
codes in the United States in 2015 (Schiffman, 2016). The five zip codes with the
lowest average predicted home value also have the lowest average household income and the
lowest percentage of residents aged 25 and over with a bachelor’s degree or higher. As one might
expect, our analysis indicates that, with a few exceptions, home values, income, and
educational attainment are positively correlated.
We also divided our data into four categories (Table 7) based on home value to help us
determine which categories our Decision Trees and Neural Network models predict most
accurately. We created a classification matrix, which is a standard tool for evaluation of
statistical models and is sometimes referred to as a confusion matrix. We used the classification
matrix (Tables 8 & 9) to sort all cases from the models into categories, by determining whether
the predicted value matched the actual value. All the cases in each category were then counted,
and the totals are displayed in the matrix. The matrix compares actual to predicted values for
each of the four predicted home value levels. The rows in the matrix represent the predicted
values for the model, whereas the columns represent the actual values. The categories used in
analysis are false positive, true positive, false negative, and true negative.
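The four home value categories of Table 7 can be expressed as a small Python function. This is a sketch only: the function name is ours, and treating each upper bound as inclusive (so that exactly $500,000 falls in category 2) is an assumption, because Table 7 does not say how boundary values are assigned.

```python
def home_value_category(sale_price):
    """Map a sale price in dollars to the category numbers of Table 7.

    Assumption: upper bounds are inclusive, except that "Below $250,000"
    is taken strictly, so exactly $250,000 falls in category 2.
    """
    if sale_price < 250_000:
        return 1          # Below $250,000
    if sale_price <= 500_000:
        return 2          # Between $250,000 and $500,000
    if sale_price <= 750_000:
        return 3          # Between $500,000 and $750,000
    return 4              # Above $750,000


# Hypothetical sale prices, for illustration only.
for price in (180_000, 477_900, 600_000, 1_457_870):
    print(price, home_value_category(price))
```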
The Decision Trees and Neural Network models we used to predict the home value level were
both extremely accurate, with precision and recall greater than 99% for all home value
levels in both models. Precision is the ratio of the number of relevant records retrieved to the
total number of records retrieved, relevant and irrelevant, expressed as a percentage. Recall is
the ratio of the number of relevant records retrieved to the total number of relevant records in
the database, also expressed as a percentage. The overall accuracy of both models was greater
than 99%. This is likely because the home value levels we created each cover a large value
range, making it more likely that the predicted level matches the actual level. Going forward, it
would make more sense to create more categories made up of smaller home value ranges. This would
help determine more precisely which home
value levels are predicted most accurately. Overall, we didn’t find that predicting the home value
level added value to our analysis because all four levels had an accuracy greater than 99%.
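To make the precision, recall, and accuracy figures concrete, the following Python sketch recomputes them from the Decision Trees counts in Table 9, where rows are predicted levels and columns are actual levels. The function names and the short level labels are ours.

```python
LEVELS = ["250k-500k", "above 750k", "below 250k", "500k-750k"]

# Rows are predicted levels, columns are actual levels (counts from Table 9).
matrix = [
    [2385, 0, 0, 1],
    [0, 914, 0, 0],
    [0, 0, 749, 0],
    [0, 3, 0, 1397],
]


def precision(m, i):
    """Correct predictions of level i divided by all predictions of level i."""
    return m[i][i] / sum(m[i])


def recall(m, i):
    """Correct predictions of level i divided by all actual cases of level i."""
    return m[i][i] / sum(row[i] for row in m)


def accuracy(m):
    """Correct predictions on the diagonal divided by all cases."""
    correct = sum(m[i][i] for i in range(len(m)))
    return correct / sum(sum(row) for row in m)


for i, level in enumerate(LEVELS):
    print(f"{level}: precision {precision(matrix, i):.2%}, recall {recall(matrix, i):.2%}")
print(f"overall accuracy {accuracy(matrix):.2%}")
```

Only 4 of the 5,449 test cases fall off the diagonal, which is why every figure exceeds 99%.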
Conclusions
After making predictions using the Decision Trees and Neural Network models, we evaluated both
and concluded that the Neural Network model is more accurate than the Decision Trees model
because it produced lower mean prediction errors. Given the acceptable MAPE of our Neural
Network model, we conclude that the model works well, and we can offer suggestions for how it
could be improved. Its accuracy would likely improve if the training data consisted of several
years of sales transactions rather than just one. Multiple years of data would allow the model
to recognize the year-over-year change in sales prices and therefore make more accurate
predictions.
Moving forward, we believe it would be beneficial to include additional attributes such as
household size, marital status, median age, and standardized test scores. Further work could
also compare King County prices with those of other counties and take a deeper dive into a
more specific locality, such as the City of Seattle. From there, the City of Seattle could be
compared with other metropolitan cities in the Pacific Northwest. Because each city provides
different attributes in its real estate sales datasets, this would be a good way to further test
the accuracy of the Neural Network algorithm in predicting real estate sales prices.
Additionally, we believe further research should include other data mining algorithms based on
Bayes' theorem, such as Naive Bayes, in order to compare their accuracy with that of Neural
Networks and Decision Trees.
Tables
Table 1: King County Real Estate Sales
Table 2: King County Residential Building
Table 3: Washington Educational Attainment
Table 4: Washington State Income
Table 5: Comparison of Error Results between Microsoft Decision Trees and Neural Networks

Prediction Error                        Microsoft          Microsoft Decision   Neural            Neural Networks
                                        Decision Trees     Trees After          Networks          After Adjusting
                                                           Adjusting For 12.5%                    For 12.5%
Mean absolute percentage error (MAPE)   20.73%             26.75%               8.19%             15.03%
Mean absolute deviation (MAD)           96,375.18          115,897.20           34,843.53         72,356.45
Mean squared deviation (MSD)            2.537158 × 10^10   2.966596 × 10^10     3.230883 × 10^9   9.119909 × 10^9
Standard Deviation                      155,775.98         159,387.28           63,742.08         71,346.55
Table 6: Average Actual Value vs. Average Predicted Value

                           Average Actual   Average Predicted Value   Average Predicted Value
                           Sales Value      Decision Trees            Neural Networks
                           $522,809.69      $522,753.93               $521,146.79
After 12.50% Adjustment:                    $588,098.18               $586,290.13
Table 7: Home Value Categories

Home Value                        Category
Below $250,000                    1
Between $250,000 and $500,000     2
Between $500,000 and $750,000     3
Above $750,000                    4
Table 8: Classification Matrix - Counts for KC_NNetworks-Discretized on DIS Sale Price Level

Predicted                   Between 250000    Above 750000    Below 250000    Between 500000    Total    Precision
                            and 500000        (Actual)        (Actual)        and 750000
                            (Actual)                                          (Actual)
Between 250000 and 500000   2465              0               0               2                 2467     99.92%
Above 750000                0                 865             0               0                 865      100%
Below 250000                0                 0               727             0                 727      100%
Between 500000 and 750000   0                 2               0               1388              1390     99.86%
Total                       2465              867             727             1390              5449
Recall                      100%              99.78%          100%            99.86%
Overall Accuracy: 99.93%
Table 9: Classification Matrix - Counts for KC_DTrees_Discretized on Sale Price Level

Predicted                   Between 250000    Above 750000    Below 250000    Between 500000    Total    Precision
                            and 500000        (Actual)        (Actual)        and 750000
                            (Actual)                                          (Actual)
Between 250000 and 500000   2385              0               0               1                 2386     99.96%
Above 750000                0                 914             0               0                 914      100%
Below 250000                0                 0               749             0                 749      100%
Between 500000 and 750000   0                 3               0               1397              1400     99.786%
Total                       2385              917             749             1398              5449
Recall                      100%              99.67%          100%            99.93%
Overall Accuracy: 99.93%
Appendices
Appendix 1: KC_Normalized_TestData4Transactions
select top 20 percent * into KC_Normalized_TestData4Transactions
from [dbo].[KC Comprehensive Normalized]
order by newid()
Appendix 2: KC_TestData4Transactions
select top 20 percent * into KC_TestData4Transactions
from [dbo].[KC Comprehensive]
order by newid()
Appendix 3: KCPredictionsDTrees
CREATE TABLE [dbo].[KCPredictionDTrees](
[Transaction ID] [int] NULL,
[Parcel ID] [nvarchar](255) NULL,
[Sale Price Prediction] [float] NULL,
[Sale Price Prediction Probability] [float] NULL
) ON [PRIMARY]
GO
Appendix 4: KCPredictionsNNetworks
CREATE TABLE [dbo].[KCPredictionNNetworks](
[Transaction ID] [int] NULL,
[Parcel ID] [nvarchar](255) NULL,
[Sale Price Prediction] [float] NULL,
[Sale Price Prediction Probability] [float] NULL
) ON [PRIMARY]
GO
Appendix 5: AvgIncome
create view AvgIncome
as
select [ZIP_code ], SUM([Total income Amount])/SUM([Number of returns]) as
AverageIncome from [dbo].[Washington State Income]
where [Total income Number of returns] >= 1
GROUP BY [ZIP_code ]
go
Appendix 6: KC Comprehensive Normalized
create view [KC Comprehensive Normalized]
as
Select [Transaction ID], [Parcel ID],
[Date] [ORG_Date],
[Sale Price] [ORG_Sale Price],
[Sale Price Level] [DIS_Sale Price Level],
([Sale Price] - 181979.00)/(1250000.00 - 181979.00) as NRM_minmaxSalePrice,
([Sale Price] -500182.2922)/226698.929506941 as NRM_stdevSalePrice,
[Number Living Units] [ORG_Number Living Units],
([Number Living Units] - 1)/(3 - 1) as
NRM_minmaxNbrLivingUnits,
([Number Living Units] - 1.028132)/0.952903213570173 as
NRM_stdevNbrLivingUnits,
[Zip Code] [ORG_Zip Code],
stories [ORG_House Stories],
([Stories] - 1)/(3 - 1) as NRM_minmaxStories,
([Stories] - 1.344954)/0.501266361492268 as NRM_stdevStories,
SqftTotLivingLevel [DIS_SqftTotLivingLevel],
([Total Living Square Feet] - 70)/(18070 - 70) as
NRM_minmaxSqft,
([Total Living Square Feet] - 2022.016033)/938.828419460636 as
NRM_stdevSqft,
[Number of Bedrooms] [ORG_Number of Bedrooms],
([Number of Bedrooms] - 0)/(11- 0) as NRM_minmaxBedrooms,
([Number of Bedrooms] - 3.334238)/0.952903213570173 as
NRM_stdevBedrooms,
[Number of Half Baths] [ORG_Number of Half Baths],
([Number of Half Baths] - 0)/(3 - 0) as NRM_minmaxBathHalf,
([Number of Half Baths] - 0.420505)/0.522980565538599 as
NRM_stdevBathHalf,
[Number of Three Quarter Baths] [ORG_Number of Three Quarter Baths],
([Number of Three Quarter Baths] - 0)/(9 - 0) as
NRM_minmaxBath3qtr,
([Number of Three Quarter Baths] - 0.439986)/0.60885733618247
as NRM_stdevBath3qtr,
[Number of Full Baths] [ORG_Number of Full Baths],
([Number of Full Baths] - 0)/(6 - 0) as NRM_minmaxBathFull,
([Number of Full Baths] - 1.467859)/0.655766216478015 as
NRM_stdevBathFull,
[Fireplace Single Story] [ORG_Fireplace Single Story],
([Fireplace Single Story] - 0)/(5 - 0) as
NRM_minmaxFpSingleStory,
([Fireplace Single Story] - 0.567135)/0.612244039524042 as
NRM_stdevFpSingleStory,
[Fireplace Multi Story] [ORG_Fireplace Multi Story],
([Fireplace Multi Story] - 0)/(5 - 0) as
NRM_minmaxFpMultiStory,
([Fireplace Multi Story] - 0.349670)/0.517995522910392 as
NRM_stdevFpMultiStory,
[Fireplace Freestanding] [ORG_Fireplace Freestanding],
([Fireplace Freestanding] - 0)/(3 - 0) as
NRM_minmaxFpFreestanding,
([Fireplace Freestanding] - 0.080819)/0.286805229416352 as
NRM_stdevFpFreestanding,
[Fireplace Additional] [ORG_Fireplace Additional],
([Fireplace Additional] - 0)/(5 - 0) as NRM_minmaxFpAdditional,
([Fireplace Additional] - 0.212358)/0.434057015463764 as
NRM_stdevFpAdditional,
[Year Built] [ORG_Year Built],
[Year Renovated] [ORG_Year Renovated],
[Percent Complete] [ORG_Percent Complete],
([Percent Complete] - 0)/(100 - 0) as NRM_minmaxPcntComplete,
([Percent Complete] - 3.339718)/5.03480573857016 as
NRM_stdevPcntComplete,
[Average Income] [ORG_Average Income],
([Average Income] -
40.60089916506101477199743)/(547.75486111111111111111111 -
40.60089916506101477199743) as NRM_minmaxAvgIncome,
([Average Income] -
94.70925108704326585435336)/49.6171100660926 as NRM_stdevAvgIncome,
[Population 18 to 24 years] as [ORG_Population 18 to 24 years],
([Population 18 to 24 years] - 120 )/(20414 - 120) as
[NRM_minmax Population 18 to 24 years],
([Population 18 to 24 years] - 2738.665345)/2348.78231419757 as
[NRM_stdev Population 18 to 24 years],
[Population 18 to 24 years Bachelor's degree or higher]
    as [ORG_Population 18 to 24 years Bachelor's degree or higher],
([Population 18 to 24 years Bachelor's degree or higher] - 0)/(0.617 - 0)
    as [NRM_minmax Population 18 to 24 years Bachelor's degree or higher],
([Population 18 to 24 years Bachelor's degree or higher] - 0.184121)/0.123096325541119
    as [NRM_stdev Population 18 to 24 years Bachelor's degree or higher],
[Population 25 to 34 years] as [ORG_Population 25 to 34 years],
([Population 25 to 34 years] - 40)/(13979 - 40)
    as [NRM_minmax Population 25 to 34 years],
([Population 25 to 34 years] - 5116.037267)/2832.64231547651
    as [NRM_stdev Population 25 to 34 years],
[Population 25 to 34 years Bachelor's degree or higher] as [ORG_Population 25 to 34 years Bachelor's degree or higher],
([Population 25 to 34 years Bachelor's degree or higher] - 0.099)/(0.818 - 0.099) as [NRM_minmax Population 25 to 34 years Bachelor's degree or higher],
([Population 25 to 34 years Bachelor's degree or higher] - 0.477103)/0.20197370299301 as [NRM_stdev Population 25 to 34 years Bachelor's degree or higher],
[Population 35 to 44 years] as [ORG_Population 35 to 44 years],
([Population 35 to 44 years] - 67)/(11126 - 67) as [NRM_minmax Population 35 to 44 years],
([Population 35 to 44 years] - 4862.320140)/1855.36679073523 as [NRM_stdev Population 35 to 44 years],
[Population 35 to 44 years Bachelor's degree or higher] as [ORG_Population 35 to 44 years Bachelor's degree or higher],
([Population 35 to 44 years Bachelor's degree or higher] - 0.069)/(0.960 - 0.069) as [NRM_minmax Population 35 to 44 years Bachelor's degree or higher],
([Population 35 to 44 years Bachelor's degree or higher] - 0.513136)/0.207805829433468 as [NRM_stdev Population 35 to 44 years Bachelor's degree or higher],
[Population 45 to 64 years] as [ORG_Population 45 to 64 years],
([Population 45 to 64 years] - 306)/(13939 - 306) as [NRM_minmax Population 45 to 64 years],
([Population 45 to 64 years] - 8639.359863)/2963.56581762132 as [NRM_stdev Population 45 to 64 years],
[Population 45 to 64 years Bachelor's degree or higher] as [ORG_Population 45 to 64 years Bachelor's degree or higher],
([Population 45 to 64 years Bachelor's degree or higher] - 0.113)/(0.834 - 0.113) as [NRM_minmax Population 45 to 64 years Bachelor's degree or higher],
([Population 45 to 64 years Bachelor's degree or higher] - 0.463571)/0.174867046954704 as [NRM_stdev Population 45 to 64 years Bachelor's degree or higher],
[Population 65 years and over] as [ORG_Population 65 years and over],
([Population 65 years and over] - 123)/(7159 - 123) as [NRM_minmax Population 65 years and over],
([Population 65 years and over] - 3685.495150)/1362.5483431876 as [NRM_stdev Population 65 years and over],
[Population 65 years and over Bachelor's degree or higher] as [ORG_Population 65 years and over Bachelor's degree or higher],
([Population 65 years and over Bachelor's degree or higher] - 0.042)/(0.737 - 0.042) as [NRM_minmax Population 65 years and over Bachelor's degree or higher],
([Population 65 years and over Bachelor's degree or higher] - 0.376489)/0.146035021931026 as [NRM_stdev Population 65 years and over Bachelor's degree or higher]
from [dbo].[KC Comprehensive]
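Each column in the query above is emitted three times: unchanged (ORG_), rescaled by its minimum and maximum (NRM_minmax), and standardized by its mean and standard deviation (NRM_stdev). The following is a minimal Python sketch of those two transformations, not the project's own code. The function names are hypothetical; the constants are copied from the [Population 25 to 34 years] expressions.

```python
def min_max(value, lo, hi):
    """Rescale value to the range [0, 1] using the column minimum and maximum."""
    return (value - lo) / (hi - lo)


def z_score(value, mean, stdev):
    """Center value on the column mean and scale by the column standard deviation."""
    return (value - mean) / stdev


# Constants taken from the [Population 25 to 34 years] expressions in the query.
POP_25_34_MIN = 40
POP_25_34_MAX = 13979
POP_25_34_MEAN = 5116.037267
POP_25_34_STDEV = 2832.64231547651


def normalize_population_25_34(value):
    """Return the three versions of the column that the query produces."""
    return {
        "ORG": value,
        "NRM_minmax": min_max(value, POP_25_34_MIN, POP_25_34_MAX),
        "NRM_stdev": z_score(value, POP_25_34_MEAN, POP_25_34_STDEV),
    }
```

With these constants, the smallest observed population (40) maps to 0 and the largest (13979) maps to 1 under min-max scaling, while a value equal to the mean maps to 0 under standardization.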
Appendix 7: KC_Normalized_TrainingData4Transactions
select top 20 percent * into KC_Normalized_TestData4Transactions
from [dbo].[KC Comprehensive Normalized]
order by newid()
Appendix 8: KC_TrainingData4Transactions
create view KC_TrainingData4Transactions
as
select * from [dbo].[KC Comprehensive]
except
select * from [dbo].[KC_TestData4Transactions]
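Appendices 7 and 8 together describe a holdout split: a random 20 percent of rows is copied into a test table (ordering by newid() shuffles the rows), and the training view is everything that is not in the test table. The sketch below illustrates the same idea in Python under stated assumptions. The function name and the sample rows are hypothetical, rows are assumed to be distinct hashable tuples, and the row count is rounded up because T-SQL rounds TOP ... PERCENT up to the next whole row.

```python
import random


def split_transactions(rows, test_percent=20, seed=None):
    """Return (training, test), holding out a random test_percent of the rows."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)  # plays the role of "order by newid()"

    # Ceiling division, mirroring how TOP ... PERCENT rounds fractional rows up.
    n_test = (len(shuffled) * test_percent + 99) // 100
    test = shuffled[:n_test]

    # Training data is the set difference, as in the EXCEPT view.
    held_out = set(test)
    training = [row for row in rows if row not in held_out]
    return training, test


# Hypothetical (transaction id, sale price) rows for illustration only.
sample_rows = [(i, 100000 + 5000 * i) for i in range(10)]
training_rows, test_rows = split_transactions(sample_rows, seed=0)
```

Because the training set is defined by exclusion, no transaction can appear in both sets, which keeps the evaluation on the test rows independent of the rows used to fit the models.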
Appendix 9: PredictionEvaluationDTrees
create view PredictionEvaluationDTrees
as
select a.[Transaction ID], a.[Sale Price] [Actual Sale Price], p.[Sale Price Prediction]
from [dbo].[KC Comprehensive] a
inner join [dbo].[KCPredictionDTrees] p on a.[Transaction ID] = p.[Transaction ID]
GO
Appendix 10: PredictionEvaluationNNetworks
create view PredictionEvaluationNNetworks
as
select a.[Transaction ID], a.[ORG_Sale Price] [Actual Sale Price], p.[Sale Price Prediction]
from [dbo].[KC Comprehensive Normalized] a
inner join [dbo].[KCPredictionNNetworks] p on a.[Transaction ID] = p.[Transaction ID]
GO
Appendix 11: View4EvaluationKCDTrees
create view [dbo].[View4EvaluationKCDTrees]
as
select *, ABS([Actual Sale Price] - [Sale Price Prediction]) [AbsoluteDeviation],
SQUARE ([Actual Sale Price] - [Sale Price Prediction]) [SquaredDeviation]
from [dbo].[PredictionEvaluationDTrees]
GO
Appendix 12: View4EvaluationKCNNetworks
create view [dbo].[View4EvaluationKCNNetworks]
as
select *, ABS([Actual Sale Price] - [Sale Price Prediction]) [AbsoluteDeviation],
SQUARE ([Actual Sale Price] - [Sale Price Prediction]) [SquaredDeviation]
from [dbo].[PredictionEvaluationNNetworks]
GO
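The evaluation views in Appendices 11 and 12 add a per-transaction absolute deviation and squared deviation; the summary error measures are then aggregates over those columns. The Python sketch below shows one way the aggregates could be computed from pairs of actual and predicted sale prices. The function name and the sample pairs are hypothetical. MAD and MAPE are the measures named in the Executive Summary; reporting the squared deviations as a root mean squared error (RMSE) is an assumption, since the appendix does not say how that column is summarized.

```python
import math


def evaluate_predictions(pairs):
    """Summarize prediction error for (actual, predicted) sale price pairs."""
    pairs = list(pairs)
    n = len(pairs)

    # Per-row columns, matching [AbsoluteDeviation] and [SquaredDeviation].
    absolute = [abs(actual - predicted) for actual, predicted in pairs]
    squared = [(actual - predicted) ** 2 for actual, predicted in pairs]

    # Mean absolute deviation, in the same units as the sale price.
    mad = sum(absolute) / n
    # Mean absolute percentage error, relative to the actual sale price.
    mape = 100 * sum(dev / actual for dev, (actual, _) in zip(absolute, pairs)) / n
    # Root mean squared error (assumed use of the squared deviations).
    rmse = math.sqrt(sum(squared) / n)
    return {"MAD": mad, "MAPE": mape, "RMSE": rmse}


# Hypothetical prices for illustration only.
summary = evaluate_predictions([(100.0, 90.0), (200.0, 220.0)])
```

For the two sample pairs the absolute deviations are 10 and 20, so MAD is 15; each prediction is off by 10 percent of the actual price, so MAPE is 10.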
Appendix 13: Decision Tree for Sale Price
Appendix 14: Dependency Network for Sale Price
Appendix 15: Lift Chart from Decision Trees Sale Price
Appendix 16: Lift Chart from Decision Trees Sale Price
Appendix 17: Top 5 & Lowest 5 Zip Codes By Predicted Value From Neural Networks Model
select i.[ZIP_code ] [Zip Code], avg(p.[Predicted Value]) [Average Predicted Value],
i.[AverageIncome] [Average Income], e.[Population 25 years and over Percent bachelor's degree or higher]
from [dbo].[PredictedNNetworksZipCode] p
inner join [dbo].[AvgIncome] i on p.[ORG_Zip Code] = i.[ZIP_code ]
inner join [dbo].[Washington Educational Attainment] e on i.[ZIP_code ] = e.[Zip Code]
group by i.[ZIP_code ], i.[AverageIncome], e.[Population 25 years and over Percent bachelor's degree or higher]
order by avg(p.[Predicted Value]) desc
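The query above returns every zip code ordered by average predicted value, so the top five and lowest five are the first and last five rows of its result. The Python sketch below illustrates that grouping and ranking step on its own, leaving out the joins to the income and education tables. The function name and the sample zip code labels are hypothetical.

```python
from collections import defaultdict


def rank_zip_codes(predictions, n=5):
    """Return (top n, lowest n) zip codes by average predicted value."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for zip_code, predicted_value in predictions:
        sums[zip_code] += predicted_value
        counts[zip_code] += 1

    # Equivalent of "group by zip code" with avg(predicted value).
    averages = {z: sums[z] / counts[z] for z in sums}

    # Equivalent of "order by avg(...) desc".
    ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n], ranked[-n:]


# Hypothetical zip code labels and predicted values for illustration only.
sample_predictions = [("A", 100.0), ("A", 300.0), ("B", 500.0), ("C", 50.0)]
top, lowest = rank_zip_codes(sample_predictions, n=1)
```

Both returned lists stay in descending order, so the last entry of the second list is the zip code with the lowest average predicted value.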
References
Internal Revenue Service (2016). https://www.irs.gov/uac/SOI-Tax-Stats-Individual-Income-Tax-Statistics-2013-ZIP-Code-Data-(SOI)
King County Assessor (2016). http://info.kingcounty.gov/assessor/DataDownload
United States Census Bureau (2016). http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=bkmk
Zillow King County (2016). http://www.zillow.com/king-county-wa/home-values
Zillow Zestimate (2016). http://www.zillow.com/zestimate