predictive modeling paper-team8 v0.1
July 26, 2016
Bike Sharing Team 8
AUTHORS
Arpita Majumder
Jenny (Qian) Zhao
Alicia Ramharack
Rajarshi Das
Table of Contents
1. Project Objective
2. Description
3. Data Source
4. Data Definition
5. Project Approach
6. Data Preparation (Explore-Modify Phase): Adding new variables to the data set
7. Data Preparation (Explore-Modify Phase): Missing value check
8. Explore Phase: Distribution and outlier analysis and key observations
9. Explore Phase: Robust outlier analysis and decision to delete
10. Explore Phase: K-means Clustering
11. Explore Phase: Hierarchical Clustering
12. Modify Phase: Data set Split
13. Modeling phase: Multiple regression model
14. Modeling phase: (Single) Decision tree Model
15. Modeling phase: Boosted tree model
16. Modeling phase: Bootstrap forest model
17. Modeling phase: Neural network model
18. Assess Phase: Model comparison
1. Project Objective
The objective of this project is to predict bike sharing and rental demand using the data generated by a kiosk
system throughout a city. The project aims to predict the bike demand per hour from key available
data such as season (spring/summer/fall/winter), weather, temperature and wind speed.
From a business perspective, the model can be used to forecast customer demand so that the rental
inventory is prepared for it. The rental company can also use the demand data to promote its business
by showcasing its demand handling capacity. If considerable demand is forecast, the company can in future
promote ancillary services such as biking gear and biking attire, assuming some repeat customers
will be willing to take up other offers as well.
2. Description
The project uses a publicly available data set containing data for the first 19 days of each month of
2011 and 2012. Each record contains the number of bikes rented in one hour, identified by date and
timestamp. Seasonal and weather related details are also available in the data set, and each record
shows how many rentals were made by registered customers and how many by casual customers.
3. Data Source
The Bike Sharing Demand data set is available at the following link:
https://www.kaggle.com/c/bike-sharing-demand/data
4. Data Definition
Table 1 gives high-level definitions of the attributes available in the data set used by the
project team.
Table 1:
Attribute Name | Attribute Definition | Sample value(s)
Datetime | Hourly date + timestamp | 1/20/2011 12:00:00 AM
Season | 1 = spring, 2 = summer, 3 = fall, 4 = winter | 1
Holiday | Whether the day is considered a holiday | 0
Working day | Whether the day is neither a weekend nor a holiday | 1
Weather | 1: Clear, few clouds, partly cloudy; 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog | 1
Temperature | Actual temperature in Celsius | 10.66
Feels like | "Feels like" temperature in Celsius | 11.365
Humidity | Relative humidity | 56
Wind speed | Wind speed | 26.0027
Casual | Number of non-registered user rentals initiated | 3
Registered | Number of registered user rentals initiated | 13
Count | Number of total rentals (Casual + Registered) | 16
5. Project Approach
This project follows the conventional SEMMA approach (Sample, Explore, Modify, Model, Assess) for
predictive analysis and modelling, to analyze the data and retrieve understandable information from it.
The following is an overview of how the SEMMA approach is applied in this project
and which technical activities are carried out in each phase.
The later sections of this report describe, with graphical output from JMP, the
stages we executed under the SEMMA process.
Sample:
The project team started with data sampling. We searched
through a wide variety of publicly available data sets from a range of domains, including
healthcare insurance, scientific clinical trials, presidential elections and customer demand (such as bike
sharing rental). Based on our project timeline and scope, we decided at the end of
the sampling phase to select the ‘Bike Sharing and Rental Demand’ data set. Its data
volume is suitable for analysis within a stringent schedule, and it lets us
learn some aspects of consumer demand analysis. We also did some minor data
partitioning in this phase to make sure the data set has a workable number of rows (neither too
big nor too small).
Explore:
In the explore phase, the project team worked to understand the data: digging deeper
into the data definitions, discovering anticipated and unanticipated relationships between the
variables, and examining the few abnormalities within the variables with the aid of the data
visualization techniques in JMP that we learned in class. We also checked whether
any values are missing from the data set, so that we were prepared to correct them as
needed.
Modify:
After the data exploration, the project team moved to the modification phase, where we
looked closely again at each variable in the bike sharing demand data set and decided by
team consensus which variables to treat as key variables. Some team
members explained the need for ‘massaging’ and minor ‘transformation’ of certain data
attributes and for adding some new variables as part of the data preparation. We
adopted these changes because they give the data more adequate variability and
enrich the predictor variables.
Model:
In the modelling phase, the project team applied several modelling techniques to the prepared data set:
regression, decision tree algorithms (including boosted tree and bootstrap forest) and a neural
network. Each model produces predicted values of
our target variable (Count), that is, the predicted bike rental demand.
Assess:
In the assess phase, the team compared the predicted responses of the
target variable obtained from the different models described under
the Model phase above. This comparison helped us evaluate the effectiveness, reliability
and usefulness of the models used to forecast the
target variable.
6. Data Preparation (Explore-Modify Phase): Adding new variables to the data set
The project team modified some of the existing data attributes and added the resulting
new columns to the data set.
These seven derived attributes were added for better understanding and
interpretation of the data, so that we can use them effectively in our modelling.
Table 2 shows how we derived them from the existing attributes. For each new variable, the table gives
the following details:
o Existing available attribute
o Derived attribute
o Derivation formula used to create the new variable
o Note: For the detailed definition of each existing attribute, please refer to Table 1 above.
Table 2:
Existing Attribute (Available) | Derived Attribute (New) | Derivation Formula
Datetime | Date | Abbrev Date(:datetime)
Datetime | Time (hour of the day) | Hour(:datetime)
Date | Day number of Week | Day Of Week(Informat(:Date))
Day number of Week | Day of the week | If(:Day number of Week == 1, "Sunday", If(:Day number of Week == 2, "Monday", If(:Day number of Week == 3, "Tuesday", If(:Day number of Week == 4, "Wednesday", If(:Day number of Week == 5, "Thursday", If(:Day number of Week == 6, "Friday", "Saturday"))))))
season | Season elaborated | If(:season == 1, "Spring", If(:season == 2, "Summer", If(:season == 3, "Fall", "Winter")))
holiday | National Holiday | If(:holiday == 0, "Not Holiday ", "National Holiday")
weather | Weather elaborated | If(:weather == 1, "Clear, few clouds, partly cloudy", If(:weather == 2, "Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist ", If(:weather == 3, "Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds ", "Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog ")))
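For readers without JMP, the derivations in Table 2 can be approximated in Python. This is a minimal sketch, not the team's implementation: the function name `derive_attributes` and the dictionary keys are hypothetical, the timestamp format is assumed from the sample value in Table 1, and the weather label is omitted for brevity.

```python
from datetime import datetime

DAY_NAMES = ["Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday"]
SEASON_NAMES = {1: "Spring", 2: "Summer", 3: "Fall", 4: "Winter"}


def derive_attributes(record):
    """Return a copy of `record` with the derived columns added."""
    # Assumed timestamp layout, e.g. "1/20/2011 12:00:00 AM".
    stamp = datetime.strptime(record["datetime"], "%m/%d/%Y %I:%M:%S %p")
    # Day number with 1 = Sunday ... 7 = Saturday, matching the Table 2 formula.
    day_number = stamp.isoweekday() % 7 + 1
    derived = dict(record)
    derived["date"] = stamp.date()
    derived["hour"] = stamp.hour
    derived["day_number_of_week"] = day_number
    derived["day_of_week"] = DAY_NAMES[day_number - 1]
    derived["season_elaborated"] = SEASON_NAMES[record["season"]]
    derived["national_holiday"] = (
        "Not Holiday" if record["holiday"] == 0 else "National Holiday")
    return derived


# Sample values taken from Table 1.
row = derive_attributes(
    {"datetime": "1/20/2011 12:00:00 AM", "season": 1, "holiday": 0})
```

Keeping the day number as an intermediate column mirrors Table 2, where the day name is derived from the day number and not directly from the date.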
7. Data Preparation (Explore-Modify Phase): Missing value check
The project team also analyzed the data set to check whether any values are missing.
Using the missing value exploration in JMP, we did not find any missing values.
Fig 1 below shows our missing value analysis in JMP.
Fig 1:
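The same kind of check can be sketched outside JMP by counting missing entries per column. The Python example below is illustrative only: `count_missing` is a hypothetical helper, the records are made up, and treating `None`, empty strings and NaN as missing is an assumption.

```python
import math


def count_missing(rows, columns):
    """Count missing entries (None, empty string or NaN) per column."""
    def is_missing(value):
        if value is None or value == "":
            return True
        return isinstance(value, float) and math.isnan(value)

    return {col: sum(is_missing(row.get(col)) for row in rows)
            for col in columns}


# Hypothetical records; the second one lacks a humidity value.
rows = [
    {"temp": 10.66, "humidity": 56, "windspeed": 26.0027},
    {"temp": 9.84, "humidity": None, "windspeed": 0.0},
]
missing = count_missing(rows, ["temp", "humidity", "windspeed"])
```

A result of zero for every column would correspond to the outcome reported above for the bike sharing data.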
8. Explore Phase: Distribution and outlier analysis and key observations
The data set used in the project is a mixture of continuous and nominal variables (as documented
below in the section for each type of variable).
Before starting our modelling, the team analyzed some of these variables more deeply and arrived at
the observations described below, which helped us understand the data and the
relationships in detail. These are preliminary observations based on
individual analysis of the data. Not all of them necessarily carried through directly to the final prediction when
we ran the variables through the modelling algorithms. However, they are key to understanding
how the individual data items can collectively influence demand. This
exploration let us analyze and predict informally without modelling, and it strengthened the
analytical ability of each project team member.
List of Nominal/ordinal variables Available in the Data-set:
o Date time
o Season
o Holiday
o Working day
o Weather
o Date
o Time
o Day number of the week
o Season elaborated
o National Holiday
o Weather elaborated
A few nominal variables are derived from other nominal variables, as shown
in Table 2 above.
Below are a few observations on the nominal variables:
Fig 2:
Fig 2a:
For example, the above tabulation (Fig 2) shows a propensity towards higher
bike demand on Saturdays.
The graph (Fig 2a) also shows that bike demand is higher from
late afternoon to early evening.
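A tabulation of this kind can be approximated by averaging the hourly count within each group. The Python sketch below is illustrative only: `mean_count_by_group` is a hypothetical helper and the rows are made-up values, not project data.

```python
from collections import defaultdict


def mean_count_by_group(rows, group_key, value_key="count"):
    """Return {group: mean of value_key} for a list of dict records."""
    totals = defaultdict(float)
    sizes = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += row[value_key]
        sizes[row[group_key]] += 1
    return {group: totals[group] / sizes[group] for group in totals}


# Hypothetical hourly records (made-up values, for illustration only).
rows = [
    {"day_of_week": "Saturday", "hour": 14, "count": 300},
    {"day_of_week": "Saturday", "hour": 3, "count": 20},
    {"day_of_week": "Monday", "hour": 14, "count": 150},
    {"day_of_week": "Monday", "hour": 3, "count": 10},
]

by_day = mean_count_by_group(rows, "day_of_week")
by_hour = mean_count_by_group(rows, "hour")
```

The same helper groups by day of the week or by hour of the day, which correspond to the two views discussed above.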
Similarly, the tabulation below (Fig 3) shows that people are most interested in renting bikes in fall,
and that demand is lowest in spring.
Fig 3:
Fig 3a:
Fig 3b:
Fig 3c:
From the graphs (Fig 3a to 3c) we can observe the following patterns in
bike rental demand:
o Fall is the peak season for demand.
o Renters prefer higher temperatures together with low or moderate
humidity. Days with high humidity or extremely low temperature see very low
demand.
o Each variable affects the target individually, but it also contributes
(some more, some less) to the collective influence of all variables on the target. For example, we
know from the individual results that moderate temperature with moderate humidity leads
to high demand. This explains why fall shows up as the season of
high demand: it has comfortable temperatures (not too high or low) and
moderate humidity.
Fig 4 below shows that people tend to rent more bikes in weeks without
holidays.
Fig 4:
Continuous variables:
We also explored the three continuous variables: Temp, humidity and wind speed.
The distributions of these variables are shown below (Fig 5).
According to these distributions, the ‘Temp’ variable has no outliers, whereas
‘humidity’ and ‘wind speed’ have a few outliers.
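The report does not state which rule was used to flag points as outliers in these distributions. A common box plot convention flags values lying more than 1.5 times the interquartile range beyond the quartiles, and the Python sketch below assumes that convention. The function name and the readings are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical wind speed readings; 50 is far from the rest.
wind_speed = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
flagged = iqr_outliers(wind_speed)
```

Quartile definitions differ slightly between tools, so counts from a rule like this may not match JMP exactly.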
Fig 5
The ‘Johnson Si’ transformation of the Humidity and Wind speed variables (see Fig 6) shows
the outliers in more detail.
Fig 6:
9. Explore Phase: Robust outlier analysis and decision to delete
Because the analysis above detected some outliers in the data set,
the team used robust outlier analysis to assess the volume of outliers in
the entire data set.
As Fig 7 to 9 show, we examined the Mahalanobis distance, which takes the
correlation structure of the variables into account. Many points (rows) lie above
the distance limit line (UCL = 3.75). These points are considered outliers.
The Mahalanobis distance was saved in the data set for each row, and rows with a distance
greater than 3.75 were marked. This was done to count the outlier rows.
We found that 669 of the 10886 rows are outliers, which is about 6% of the data. As the
outlier percentage is low, we decided as a team to delete these rows.
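The flagging step can be sketched in Python as follows. This is a simplified illustration, not JMP's robust procedure: it uses the ordinary sample mean and covariance, the helper names and sample rows are hypothetical, and only the limit 3.75 and the row counts come from the text above.

```python
import math


def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_matrix(rows, mean):
    """Sample covariance matrix (divides by n - 1)."""
    n, p = len(rows), len(mean)
    cov = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [x - m for x, m in zip(row, mean)]
        for i in range(p):
            for j in range(p):
                cov[i][j] += d[i] * d[j] / (n - 1)
    return cov


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    p = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(p)]
           for i, row in enumerate(matrix)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * c for v, c in zip(aug[r], aug[col])]
    return [row[p:] for row in aug]


def mahalanobis_distances(rows):
    """Distance of each row from the mean, scaled by the covariance."""
    mean = mean_vector(rows)
    inv = invert(covariance_matrix(rows, mean))
    p = len(mean)
    distances = []
    for row in rows:
        d = [x - m for x, m in zip(row, mean)]
        q = sum(d[i] * inv[i][j] * d[j] for i in range(p) for j in range(p))
        distances.append(math.sqrt(q))
    return distances


UCL = 3.75  # limit reported in the text

# Hypothetical two-variable rows, for illustration only.
rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
distances = mahalanobis_distances(rows)
outlier_flags = [d > UCL for d in distances]

# Share of rows removed, using the counts reported in the text.
outlier_share = 669 / 10886  # about 0.061
```

With the reported counts, the removed share works out to a little over 6%, consistent with the figure quoted above.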
Fig 7:
Fig 8:
Fig 9:
10. Explore Phase: K-means Clustering
The project team also ran the clustering methods learned in class on the data set (see
this section and section 11).
The clustering helped us understand the distribution of the data, but we did not need to take any further
data preparation or modification steps based on these clustering analyses.
Fig 10
11. Explore Phase: Hierarchical Clustering
Fig 11
12. Modify Phase: Data set Split
After all the individual data exploration, modification and preparation, the team moved on to
modelling. Before modelling, however, we split the entire data set into three partitions as
follows.
o Training data set
o Validation data set
o Testing data set
Although this is a forecasting (continuous response) model and NOT a classification model, we still used a stratified
partition, stratifying on the target variable, so that each partition has a similar distribution of the target.
This was not mandatory.
All our subsequent modelling was built on these partitions, so that we
could compare the fit and performance of the models on each partition.
The data set after partitioning is shown below.
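One way to stratify on a continuous target is to sort the rows by the target, cut them into equal-sized strata, and split each stratum at random. The Python sketch below illustrates that idea and is not JMP's partitioning routine: the 60/20/20 proportions, the number of strata, the seed and the function name are all assumptions, since the report does not state the proportions used.

```python
import random


def stratified_partition(targets, fractions=(0.6, 0.2, 0.2), n_strata=4, seed=1):
    """Assign each row to 'train', 'validation' or 'test'.

    Rows are sorted by target value and cut into n_strata equal-sized
    strata; within each stratum rows are shuffled and divided according
    to `fractions`. The 60/20/20 default is an assumption.
    """
    if not targets:
        return []
    rng = random.Random(seed)
    order = sorted(range(len(targets)), key=lambda i: targets[i])
    labels = [None] * len(targets)
    stratum_size = -(-len(order) // n_strata)  # ceiling division
    for start in range(0, len(order), stratum_size):
        stratum = order[start:start + stratum_size]
        rng.shuffle(stratum)
        n_train = round(fractions[0] * len(stratum))
        n_valid = round(fractions[1] * len(stratum))
        for pos, row_index in enumerate(stratum):
            if pos < n_train:
                labels[row_index] = "train"
            elif pos < n_train + n_valid:
                labels[row_index] = "validation"
            else:
                labels[row_index] = "test"
    return labels


# Hypothetical hourly counts, for illustration only.
counts = list(range(1, 41))
labels = stratified_partition(counts)
```

Because every stratum is split in the same proportions, low, medium and high demand hours are represented in each partition.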
Fig 12
13. Modeling phase: Multiple regression model
Response variable:
o Bike rent count
Predictor variables:
o Time (hour of the day)
o Day of the week
o Season elaborated
o National Holiday
o Atemp
o Humidity
o Windspeed
Prediction model outcomes:
Fig 13
Fig 14
Based on the initial model output, National Holiday and Wind speed appear to contribute little
to the prediction, as the p-values for these variables are very high.
These two variables were therefore removed from the model.
After removing them, we re-ran the regression model and obtained the
following outcome.
Fig 15
The RSquare value for the current model is 0.378.
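As a reminder of what this statistic measures, RSquare is one minus the ratio of the residual sum of squares to the total sum of squares, so a value of 0.378 means the model accounts for roughly 38% of the variation in the response. The Python sketch below shows the calculation on made-up numbers; the function name and the data are hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - sse / sst


# Hypothetical counts and predictions, for illustration only.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
fit = r_squared(actual, predicted)
```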
The prediction profile is represented as below.
Fig 16:
Importance of the variables as per the prediction profiler analysis:
Based on the prediction profiler's view of the influence of each individual predictor variable, we
observed the following patterns from this model.
o Bike rental demand increases as the day progresses.
o Demand increases from noon through the evening and beyond.
o Saturday is the day of the week on which demand is highest. On the other days of the
week the demand does not vary much.
o The model shows that bike renting peaks from fall to early winter.
o Temperature and humidity are also significant predictors of bike rental demand. Medium to
high temperature and moderate humidity are key to higher demand.
The prediction formula for this model is saved into the data set and is shown
below.
Fig 17:
The error for this model is calculated as shown below:
Fig 18:
14. Modeling phase: (Single) Decision tree Model
Response variable:
o Bike rent count
Predictor variables:
o Time (hour of the day)
o Day of the week
o Season elaborated
o National Holiday
o Atemp
o Humidity
o Windspeed
Predictive model outcomes:
Fig 19:
The RSquare value for the data set is given below.
RSquare is higher for this model than for the multiple regression model.
Fig 20:
Column contribution in this model is given below:
Fig 21:
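To give a feel for how a regression tree picks its splits, the Python sketch below searches one numeric predictor for the threshold that most reduces the sum of squares of the response. It is a generic illustration, not the algorithm JMP runs. One common way to read a column's contribution is as the total reduction achieved by splits on that column, and the sketch assumes that interpretation. The function names and data are hypothetical.

```python
def sum_of_squares(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Find the threshold on one numeric predictor that most reduces SS.

    Returns (threshold, reduction); rows with x < threshold go left.
    """
    total = sum_of_squares(ys)
    best = (None, 0.0)
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        reduction = total - sum_of_squares(left) - sum_of_squares(right)
        if reduction > best[1]:
            best = (threshold, reduction)
    return best


# Hypothetical hour-of-day and count values, for illustration only.
hours = [2, 4, 6, 8, 16, 18]
counts = [10, 10, 10, 100, 100, 100]
threshold, reduction = best_split(hours, counts)
```

A full tree repeats this search over every predictor and then recurses into the two resulting groups.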
Model prediction is saved in the dataset.
Fig 22:
Error is calculated for this dataset as well.
Fig 23:
15. Modeling phase: Boosted tree model
Response variable: Bike rent count
Predictor variables:
o Time (hour of the day)
o Day of the week
o Season elaborated
o National Holiday
o Atemp
o Humidity
o Windspeed
Prediction model outcomes:
Fig 24:
Prediction Formula is saved in the dataset.
Fig 25:
Error is calculated for this model:
Fig 26:
16. Modeling phase: Bootstrap forest model
Response variable: Bike rent count
Predictor variables:
o Time (hour of the day)
o Day of the week
o Season elaborated
o National Holiday
o Atemp
o Humidity
o Windspeed
Prediction model outcomes:
Fig 27:
Prediction model formula is saved in the dataset:
Fig 28:
Error is calculated for this model:
Fig 29:
17. Modeling phase: Neural network model
Response variable: Bike rent count
Predictor variables:
o Time (hour of the day)
o Day of the week
o Season elaborated
o National Holiday
o Atemp
o Humidity
o Windspeed
Prediction model outcomes:
Fig 30:
Fig 31:
Prediction model formula is saved in the dataset:
Fig 31:
Error is calculated for this model:
Fig 32:
18. Assess Phase: Model comparison
After fitting several models to this data set and obtaining a different prediction
of the bike rental count from each model, we now compare the
results to identify the best prediction model
for this data set.
The following are the steps we performed as a team, using JMP, to compare
the models across all partitions of the data set, i.e. training, validation and testing.
Modeling comparison outcome for training data:
Fig 33:
Modeling comparison outcome for validation data:
Fig 34:
Modeling comparison outcome for testing data:
Fig 35:
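A comparison of this kind can be sketched in Python by computing error measures for each saved prediction column within each partition. The measures used here (root average squared error and average absolute error) are assumptions, since the text does not list the statistics shown in the figures. The column names, helper names and rows are hypothetical.

```python
import math


def error_measures(actual, predicted):
    """Return root average squared error and average absolute error."""
    n = len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return {
        "rase": math.sqrt(sse / n),
        "aae": sum(abs(a - p) for a, p in zip(actual, predicted)) / n,
    }


def compare_models(rows, model_columns, partitions=("train", "validation", "test")):
    """Compute the measures for every (partition, model) pair."""
    table = {}
    for part in partitions:
        subset = [r for r in rows if r["partition"] == part]
        actual = [r["count"] for r in subset]
        for model in model_columns:
            table[(part, model)] = error_measures(
                actual, [r[model] for r in subset])
    return table


# Hypothetical rows with saved predictions from two models.
rows = [
    {"partition": "test", "count": 10, "tree": 12, "regression": 20},
    {"partition": "test", "count": 20, "tree": 18, "regression": 25},
    {"partition": "test", "count": 30, "tree": 33, "regression": 20},
    {"partition": "test", "count": 40, "tree": 37, "regression": 25},
]
table = compare_models(rows, ["tree", "regression"], partitions=("test",))
best = min(["tree", "regression"], key=lambda m: table[("test", m)]["rase"])
```

Comparing on the validation and testing partitions, and not only on training data, guards against preferring a model that merely fits the training rows well.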
Prediction Metrics – Numeric Distribution of Prediction error for each model
Fig 36:
Conclusion:
Based on the model comparison and the analysis of the prediction error distribution for each model
fitted to this data set, we reached the following conclusions.
The comparison statistics show that the decision tree model gives us the
most effective model for predicting rental demand.
Next in the ranking is the boosted tree model.
The error distribution shows the same ordering: the decision tree model has the smallest error
% (Error Mean = 0.33), the boosted tree model has a slightly higher error % (Error Mean = 0.44),
and the multiple regression model has the highest error % (Error Mean = 0.82). For this reason
we consider the regression model the least effective.
We also learned during our exploration phase that individual analysis of the
data helps in understanding the prediction outcome. Although the
regression model had a high prediction error, its prediction
profiler showed the same influential predictor variables, with the same patterns of influence, that we observed in the
individual analysis. So even though the regression model did not give the most
accurate result, it helped corroborate that our exploration and analysis were
going in the right direction in terms of understanding the influence of each variable. We
ultimately confirmed this with the column contributions in our decision tree model, which is the
best model as per our evaluation.
Business Solution:
In walking through SEMMA, we find that the data helps us draw conclusions that address business
problems. The data shows that casual customers and
registered customers have different bike rental habits. This is valuable information that can help grow the customer base of both populations. Rental
trends show that we can manage our inventory according to the season, offering more inventory during the
peak months to accommodate more users.
Casual customers include tourists and infrequent bike renters. For tourists in Washington DC, bike
rentals are a cost effective way of getting around the city for exploring and sightseeing. As a company, we can
offer recommendations and coupons for other attractions that they can reach by bike. Because this
incentive is not a coupon for the bike rental itself, it does not cut into profit by reducing the rental price.
To attract new customers, a first-time renter’s discount can be offered, which lets the
user try the bike rental at low risk. Our registered customers are the most valuable. To retain them,
accessory options can be offered, and registering can make a customer a member of a loyalty program with
exclusive access to amenities such as cooling centers or coupons for related products.