predictive modeling paper-team8 v0.1

July 26, 2016

Bike Sharing Team 8

AUTHORS

Arpita Majumder

Jenny (Qian) Zhao

Alicia Ramharack

Rajarshi Das

Table of Contents

1. Project Objective
2. Description
3. Data Source
4. Data Definition
5. Project Approach
6. Data Preparation (Explore-Modify Phase): Adding new variables to the data set
7. Data Preparation (Explore-Modify Phase): Missing value check
8. Explore Phase: Distribution and outlier analysis and key observations
9. Explore Phase: Robust outlier analysis and decision to delete
10. Explore Phase: K-means Clustering
11. Explore Phase: Hierarchical Clustering
12. Modify Phase: Data set Split
13. Modeling phase: Multiple regression model
14. Modeling phase: (Single) Decision tree Model
15. Modeling phase: Boosted tree model
16. Modeling phase: Bootstrap forest model
17. Modeling phase: Neural network model
18. Assess Phase: Model comparison

1. Project Objective

The objective of this project is to predict bike sharing and rental demand using data generated by the kiosk system throughout a city. The project aims to predict hourly bike demand from key available data such as season (spring/summer/fall/winter), weather, temperature and wind speed.

From a business perspective, the model can be used to forecast customer demand so that the rental company can prepare its rental inventory accordingly. The company can also use the demand data to promote its business by showcasing its demand-handling capacity. If considerable demand is forecast, the company can further consider promoting ancillary products such as biking gear and biking attire, on the assumption that some repeat customers will be willing to take up other offers in the future.

2. Description

The project uses a publicly available data set containing data for the first 19 days of each month in 2011 and 2012. Each record contains the number of bikes rented in one hour, identified by date and timestamp. Seasonal and weather-related details are also available in the data set. It also records how many rentals were made by registered customers and how many by casual customers.

3. Data Source

The Bike Sharing Demand data set is available at the following link:

https://www.kaggle.com/c/bike-sharing-demand/data

4. Data Definition

The following are high-level definitions of the attributes available in the data set used by the project team.

https://www.kaggle.com/c/bike-sharing-demand/data

Table 1:

| Attribute Name | Attribute Definition | Sample value(s) |
|---|---|---|
| Datetime | Hourly date + timestamp | 1/20/2011 12:00:00 AM |
| Season | 1 = spring, 2 = summer, 3 = fall, 4 = winter | 1 |
| Holiday | Whether the day is considered a holiday | 0 |
| Working day | Whether the day is neither a weekend nor a holiday | 1 |
| Weather | 1: Clear, few clouds, partly cloudy; 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog | 1 |
| Temperature | Actual temperature in Celsius | 10.66 |
| Feels like | "Feels like" temperature in Celsius | 11.365 |
| Humidity | Relative humidity | 56 |
| Wind speed | Wind speed | 26.0027 |
| Casual | Number of non-registered user rentals initiated | 3 |
| Registered | Number of registered user rentals initiated | 13 |
| Count | Number of total rentals (Casual + Registered) | 16 |

5. Project Approach

This project follows the conventional SEMMA approach (Sample, Explore, Modify, Model, Assess) for predictive analysis and modelling, in order to analyze the data and retrieve understandable information from the data set.

The following is an overview of how the SEMMA approach is applied in this project and which technical activities are carried out under each stage of the process. The subsequent sections of this report describe each stage we executed, with supporting graphical output from JMP.

Sample:

The project team started with data sampling, searching through a wide variety of publicly available data sets from domains including healthcare insurance, scientific clinical trials, presidential elections and customer demand (such as bike sharing rental). Given our project timeline and scope, we decided at the end of the sampling phase to select the ‘Bike Sharing and Rental Demand’ data set. Its data volume is well suited to analysis under a tight schedule, and it lets us learn some aspects of consumer demand analysis. We also did some minor data partitioning in this phase to make sure the data set had a workable number of rows (neither too big nor too small).

Explore:

In the explore phase, the project team worked to understand the data by digging deeper into the data definitions and discovering anticipated and unanticipated relationships between the variables. We also explored abnormalities within the variables with the aid of the data visualization techniques in JMP that we learned in class, and checked whether the data set contains any missing values so that we could correct them as needed.

Modify:

After the data exploration, the project team moved on to the modification phase, where we looked closely again at each variable in the bike sharing demand data set and decided by team consensus which variables to treat as key variables. Some team members explained the need for ‘massaging’ and minor ‘transformation’ of certain data attributes, and for adding new variables as part of the data preparation. We adopted these changes because they give the data more adequate variability and ultimately enrich the predictor variables.

Model:

In the modelling phase, the project team applied various modelling techniques to the prepared data set: multiple regression, decision tree algorithms (including boosted tree and bootstrap forest) and a neural network. Each model produces predicted values of our target variable (Count), which represents the bike rental demand.

Assess:

In the assess phase, the team compared the predicted responses of our target variable obtained from the different models described in the Model stage above. This comparison helped us evaluate the effectiveness, reliability and usefulness of each model used to forecast the target variable.

6. Data Preparation (Explore-Modify Phase): Adding new variables to the data set

The project team modified some of the existing data attributes to derive new columns and added them to the data set.

These seven derived attributes were added to make the data easier to understand and interpret, so that we can use them effectively in our modelling.

Table 2 shows how the existing attributes were modified. For each new variable it lists:

o Existing available attribute

o Derived attribute

o Derivation formula used to create the new variable

Note: For the detailed definition of each existing attribute, please refer to Table 1 above.

Table 2:

| Existing Attribute (Available) | Derived Attribute (New) | Derivation Formula |
|---|---|---|
| Datetime | Date | Abbrev Date(:datetime) |
| Datetime | Time (hour of the day) | Hour(:datetime) |
| Date | Day number of Week | Day Of Week(Informat(:Date)) |
| Day number of Week | Day of the week | If(:Day number of Week == 1, "Sunday", If(:Day number of Week == 2, "Monday", If(:Day number of Week == 3, "Tuesday", If(:Day number of Week == 4, "Wednesday", If(:Day number of Week == 5, "Thursday", If(:Day number of Week == 6, "Friday", "Saturday")))))) |
| season | Season elaborated | If(:season == 1, "Spring", If(:season == 2, "Summer", If(:season == 3, "Fall", "Winter"))) |
| holiday | National Holiday | If(:holiday == 0, "Not Holiday ", "National Holiday") |
| weather | Weather elaborated | If(:weather == 1, "Clear, few clouds, partly cloudy", If(:weather == 2, "Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist ", If(:weather == 3, "Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds ", "Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog "))) |
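For readers working outside JMP, the derivations in Table 2 can be sketched in Python. This is a minimal illustration and not the team's JSL: the function name `derive_attributes`, the dict-based row layout and the timestamp format string are assumptions, and the Weather elaborated column is omitted for brevity.

```python
from datetime import datetime

# Lookup tables mirroring the nested If() formulas in Table 2.
DAY_NAMES = {1: "Sunday", 2: "Monday", 3: "Tuesday", 4: "Wednesday",
             5: "Thursday", 6: "Friday", 7: "Saturday"}
SEASON_NAMES = {1: "Spring", 2: "Summer", 3: "Fall", 4: "Winter"}


def derive_attributes(row):
    """Return a copy of `row` with the derived columns added.

    `row` is assumed to be a dict with the keys "datetime" (text in the
    assumed format "%m/%d/%Y %I:%M:%S %p"), "season" and "holiday".
    Unlike the JSL formula, an unknown season code raises KeyError
    instead of falling through to "Winter".
    """
    stamp = datetime.strptime(row["datetime"], "%m/%d/%Y %I:%M:%S %p")
    # The Table 2 formula numbers days 1 = Sunday ... 7 = Saturday, while
    # isoweekday() uses 1 = Monday ... 7 = Sunday, so remap.
    day_number = stamp.isoweekday() % 7 + 1
    derived = dict(row)
    derived["Date"] = stamp.date().isoformat()
    derived["Time"] = stamp.hour
    derived["Day number of Week"] = day_number
    derived["Day of the week"] = DAY_NAMES[day_number]
    derived["Season elaborated"] = SEASON_NAMES[int(row["season"])]
    derived["National Holiday"] = (
        "Not Holiday" if int(row["holiday"]) == 0 else "National Holiday"
    )
    return derived


# Sample values taken from Table 1.
example = derive_attributes(
    {"datetime": "1/20/2011 12:00:00 AM", "season": 1, "holiday": 0}
)
print(example["Day of the week"], example["Time"])
```

Using lookup dictionaries instead of nested conditionals keeps each mapping in one place, which makes it easier to check against Table 1.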

7. Data Preparation (Explore-Modify Phase): Missing value check

The project team also analyzed the data set to check whether it contains any missing values.

Using the missing value exploration in JMP, we did not find any missing values.

Fig 1 below shows our missing value analysis in JMP.

Fig1:
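The same kind of check can be reproduced outside JMP by counting missing cells per column. The sketch below is illustrative only: the set of tokens treated as missing and the two-row sample are assumptions, not values from the real data set.

```python
import csv
import io

# Tokens treated as "missing"; this set is an assumption for illustration.
MISSING_TOKENS = {"", "NA", "NaN", "."}


def count_missing(rows):
    """Count missing cells per column in a list of dict rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            is_missing = value is None or str(value).strip() in MISSING_TOKENS
            counts[column] = counts.get(column, 0) + int(is_missing)
    return counts


# Tiny made-up sample with one deliberately empty cell.
sample_csv = io.StringIO(
    "datetime,temp,humidity,count\n"
    "1/20/2011 12:00:00 AM,10.66,56,16\n"
    "1/20/2011 1:00:00 AM,,60,9\n"
)
missing = count_missing(list(csv.DictReader(sample_csv)))
print(missing)
```

A result of zero for every column would correspond to the outcome reported above, where no missing values were found.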

8. Explore Phase: Distribution and outlier analysis and key observations

The bike data set used in the project is a mixture of continuous and nominal variables, as documented below for each type of variable.

Before starting our modelling, the team analyzed some of these variables in more depth and arrived at the observations described below, which helped us understand the data and its relationships in detail. These are preliminary observations based on analysis of individual variables. Not all of them directly affected the final prediction when we ran the variables through the modelling algorithms, but they are key to understanding how these individual data items can collectively influence demand. This exploration let us analyze and predict informally, without modelling, and strengthened the analytical ability of every project team member.

List of Nominal/ordinal variables Available in the Data-set:

o Date time

o Season

o Holiday

o Working day

o Weather

o Date

o Time

o Day number of the week

o Season elaborated

o National Holiday

o Weather elaborated

A few nominal variables are derived from other nominal variables, as shown in Table 2 above.

Below are a few observations on the nominal variables:

Fig 2:

Fig 2a:

For example, the tabulation above (Fig 2) shows a tendency towards higher bike demand on Saturdays.

The graph (Fig 2a) also shows that higher bike demand shifts towards late afternoon and early evening.
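A tabulation of this kind amounts to averaging the rental count within each level of a nominal variable. The sketch below shows the idea on made-up rows; the helper name `mean_by_group`, the column names and the choice of the mean as the summary statistic are assumptions, since the figures do not state which statistic JMP displayed.

```python
from collections import defaultdict
from statistics import mean


def mean_by_group(rows, group_key, value_key):
    """Average `value_key` within each distinct value of `group_key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {group: mean(values) for group, values in groups.items()}


# Made-up rows for illustration only; they are not from the real data set.
rows = [
    {"day": "Saturday", "hour": 17, "count": 300},
    {"day": "Saturday", "hour": 8, "count": 100},
    {"day": "Monday", "hour": 17, "count": 250},
    {"day": "Monday", "hour": 8, "count": 50},
]
by_day = mean_by_group(rows, "day", "count")
by_hour = mean_by_group(rows, "hour", "count")
print(by_day)
print(by_hour)
```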

Similarly, the tabulation below (Fig 3) shows that people are most interested in renting bikes in fall, and that demand is lowest in spring.

Fig 3:

Fig 3a:

Fig 3b:

Fig 3c:

Fig 4 below also shows that people tend to rent more bikes in weeks without holidays.

From the graphs (Fig 3a to 3c) we can also observe the following patterns in bike rental demand:

o Fall is the peak season for demand.

o Renters prefer higher temperatures together with low to moderate humidity; days with high humidity or extremely low temperature see very low demand.

o These individual analyses also show that each variable affects the target on its own and at the same time contributes, some more and some less, to the collective influence of all variables on the target. For example, the individual results show that moderate temperature with moderate humidity leads to high demand. This explains why fall appears as the high-demand season: it has comfortable temperatures (neither too high nor too low) and moderate humidity.

Fig 4:

Continuous variables:

We also explored the three continuous variables: Temp, humidity and wind speed.

The distributions of these variables are shown below.

As the distributions show, the ‘Temp’ variable has no outliers, whereas ‘humidity’ and ‘wind speed’ have a few outliers.

Fig 5

The ‘Johnson Sl’ transformation of the Humidity and Wind speed variables (see Fig 6) shows the outliers in more detail.

Fig 6:

9. Explore Phase: Robust outlier analysis and decision to delete

Because the analysis above detected outliers in the data set, the team used robust outlier analysis to assess how many outlier rows the entire data set contains.

As Figs 7 to 9 show, we examined the Mahalanobis distance, which accounts for the correlation structure of the variables. Many points (rows) lie above the upper control limit (UCL = 3.75); these points are considered outliers.

We saved the Mahalanobis distance to the data set for each row and marked the rows whose distance exceeds 3.75, in order to count the outlier rows.

We found that 669 of the 10886 rows are outliers, which is around 6% of the data. As the outlier percentage is low, we decided as a team to delete these rows.

Fig 7:

Fig 8:

Fig 9:
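The flagging rule described in this section can be sketched as follows. This is a simplified illustration under stated assumptions: it uses the classical sample mean and covariance rather than any robust estimate JMP may have applied, the two-column data is made up, and only the 3.75 threshold and the 669 / 10886 row counts come from the report.

```python
import math


def column_means(data):
    n = len(data)
    return [sum(row[j] for row in data) / n for j in range(len(data[0]))]


def covariance_matrix(data, means):
    """Sample covariance matrix (divides by n - 1)."""
    n, d = len(data), len(means)
    cov = [[0.0] * d for _ in range(d)]
    for row in data:
        for i in range(d):
            for j in range(d):
                cov[i][j] += (row[i] - means[i]) * (row[j] - means[j])
    return [[value / (n - 1) for value in line] for line in cov]


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    d = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(d)]
           for i, row in enumerate(matrix)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [value / scale for value in aug[col]]
        for r in range(d):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[d:] for row in aug]


def mahalanobis_distances(data):
    """Distance of every row from the mean, scaled by the covariance."""
    means = column_means(data)
    inv = invert(covariance_matrix(data, means))
    d = len(means)
    distances = []
    for row in data:
        diff = [row[j] - means[j] for j in range(d)]
        squared = sum(diff[i] * inv[i][j] * diff[j]
                      for i in range(d) for j in range(d))
        distances.append(math.sqrt(squared))
    return distances


# Made-up (temperature, humidity) rows: a regular grid plus one extreme row.
data = [(18 + i % 5, 50 + i // 5) for i in range(40)] + [(20, 5)]
distances = mahalanobis_distances(data)
UCL = 3.75  # threshold quoted in the report
flagged = [i for i, dist in enumerate(distances) if dist > UCL]
print(flagged)

# Share of rows reported as outliers: 669 of 10886.
outlier_share = 100 * 669 / 10886
print(round(outlier_share, 1))
```

In this toy data only the last row is flagged. The final two lines reproduce the percentage arithmetic behind the "around 6%" figure.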

10. Explore Phase: K-means Clustering

The project team also ran the clustering methods learned in class on the data set, as shown in this section and in section 11 below.

These analyses helped us understand the distribution of the data, but we did not need to take any further data preparation or modification action based on them.

Fig 10
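For reference, the basic k-means procedure (Lloyd's algorithm) can be sketched in a few lines. This is a generic illustration, not a reproduction of the JMP run: the points, the number of clusters and the starting centres are made up, and in practice the variables would normally be standardized first.

```python
import math


def k_means(points, centers, iterations=20):
    """Lloyd's algorithm: alternate assignment and centre update."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centre by Euclidean distance.
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: each centre moves to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centers.append(old)  # keep the centre of an empty cluster
        if new_centers == centers:
            break  # converged: no centre moved
        centers = new_centers
    return labels, centers


# Made-up (temperature, humidity) points forming two obvious groups.
points = [(10, 80), (12, 78), (11, 82), (30, 40), (31, 42), (29, 38)]
labels, centers = k_means(points, centers=[points[0], points[3]])
print(labels, centers)
```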

11. Explore Phase: Hierarchical Clustering

Fig 11

12. Modify Phase: Data set Split

After the individual data exploration, modification and preparation, the team moved on to modelling. Before modelling, we split the entire data set into three partitions:

o Training data set

o Validation data set

o Testing data set

Although this is a forecasting problem and not a classification problem, we still used a stratified partition, stratifying on the target variable, so that each partition has a similar distribution of the target. This was not mandatory.

All subsequent modelling was built on these partitions, so that we could compare each model's performance on each partition.

The partitioned data set is illustrated below.

Fig 12
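One way to stratify on a continuous target is to sort the rows by the target, cut them into equal-sized strata, and then split each stratum in the same proportions. The sketch below illustrates this idea; the 60/20/20 fractions, the four strata and the function name `stratified_partition` are assumptions, because the report does not state the settings used in JMP.

```python
import random
from collections import Counter


def stratified_partition(rows, target_key, fractions=(0.6, 0.2, 0.2),
                         n_strata=4, seed=0):
    """Label each row Training/Validation/Testing within target strata.

    The default fractions and number of strata are assumptions made for
    illustration only.
    """
    names = ("Training", "Validation", "Testing")
    rng = random.Random(seed)
    # Sort row indices by the target so equal-sized slices act as strata.
    order = sorted(range(len(rows)), key=lambda i: rows[i][target_key])
    labels = [None] * len(rows)
    stratum_size = -(-len(order) // n_strata)  # ceiling division
    for start in range(0, len(order), stratum_size):
        stratum = order[start:start + stratum_size]
        rng.shuffle(stratum)
        n_train = round(fractions[0] * len(stratum))
        n_valid = round(fractions[1] * len(stratum))
        for position, index in enumerate(stratum):
            if position < n_train:
                labels[index] = names[0]
            elif position < n_train + n_valid:
                labels[index] = names[1]
            else:
                labels[index] = names[2]
    return labels


# Made-up target values 1..40 standing in for the hourly rental count.
rows = [{"count": c} for c in range(1, 41)]
labels = stratified_partition(rows, "count")
print(Counter(labels))
```

Because every stratum is split in the same proportions, low-demand and high-demand hours are represented in all three partitions.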

13. Modeling phase: Multiple regression model

Response variable:

o Bike rent count

Predictor variables:

o Time (hour of the day)

o Day of the week

o Season elaborated

o National Holiday

o Atemp

o Humidity

o Windspeed

Prediction model outcomes:

Fig 13

Fig 14

In the initial model, National Holiday and Wind speed appeared to contribute little to the prediction, as the p-values for these variables are very high.

These two variables were therefore removed from the model.

After removing them, we re-ran the regression model and obtained the following outcome.

Fig 15

The RSquare value for the current model is 0.378.
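As a reminder of what this statistic measures, RSquare is one minus the ratio of the residual sum of squares to the total sum of squares, so a value of 0.378 means the model explains about 38% of the variance in the rental count. The sketch below computes it on made-up numbers; the function name `r_square` and the sample values are illustrative assumptions.

```python
def r_square(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up actual and predicted counts, for illustration only.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
score = r_square(actual, predicted)
print(round(score, 3))
```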

The prediction profiler is shown below.

Fig 16:

Importance of the variables as per the prediction profiler analysis:

Based on the prediction profiler's view of each predictor's influence, we observed the following patterns in this model:

o Bike rental demand increases as the day progresses.

o Demand increases from noon through the evening and beyond.

o Saturday is the day of the week on which demand is very high, whereas demand does not vary much across the other days of the week.

o The model shows bike rentals peaking from fall to early winter.

o Temperature and humidity are also significant predictors of bike rental demand. Medium to high temperature and moderate humidity are key to higher demand.

The prediction formula for this model was saved to the data set and is shown below.

Fig 17:

The error for this model is calculated as shown below:

Fig 18:

14. Modeling phase: (Single) Decision tree Model

Response variable:

o Bike rent count

Predictor variables:

o Day of the week

o Season elaborated

o National Holiday

o Atemp

o Humidity

o Windspeed

Predictive model outcomes:

Fig 19:

The RSquare values for the data set are given below.

The RSquare value for this model is higher than that of the previous (multiple regression) model.

Fig 20:

The column contributions in this model are given below:

Fig 21:

Model prediction is saved in the dataset.

Fig 22:

Error is calculated for this dataset as well.

Fig 23:

15. Modeling phase: Boosted tree model

Response variable: Bike rent count

Predictor variables:

o Day of the week

o Season elaborated

o National Holiday

o Atemp

o Humidity

o Windspeed


Fig 24:

Prediction Formula is saved in the dataset.

Fig 25:

Error is calculated for this model:

Fig 26:

16. Modeling phase: Bootstrap forest model

Response variable: Bike rent count

Predictor variables:

o Day of the week

o Season elaborated

o National Holiday

o Atemp

o Humidity

o Windspeed


Fig 27:

Prediction model formula is saved in the dataset:

Fig 28:


Fig 29:

17. Modeling phase: Neural network model

Response variable: Bike rent count

Predictor variables:

o Day of the week

o Season elaborated

o National Holiday

o Atemp

o Humidity

o Windspeed


Fig 30:

Fig 31:

Prediction model formula is saved in the dataset:

Fig 31:


Fig 32:

18. Assess Phase: Model comparison

After running multiple models on this data set and obtaining a different prediction of the bike rent count from each, we compared the results to identify the best prediction model for this data set.

The following are the outcomes of the comparison we performed as a team in JMP, covering each model on each partition of the data set: training, validation and testing.

Modeling comparison outcome for training data:

Fig 33:

Modeling comparison outcome for validation data:

Fig 34:

Modeling comparison outcome for testing data:

Fig 35:
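A per-partition comparison of this kind can be sketched once each model's predictions are saved as columns. The example below computes two common error measures, root average squared error (RASE) and average absolute error (AAE), for each model within each partition. The column names, the made-up rows and the choice of these two measures are assumptions for illustration; they are not the figures reported above.

```python
import math
from collections import defaultdict


def compare_models(rows, model_keys, actual_key="count",
                   partition_key="partition"):
    """Return {partition: {model: (RASE, AAE)}} for saved predictions."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[partition_key]].append(row)
    report = {}
    for partition, members in grouped.items():
        report[partition] = {}
        for model in model_keys:
            errors = [row[actual_key] - row[model] for row in members]
            rase = math.sqrt(sum(e * e for e in errors) / len(errors))
            aae = sum(abs(e) for e in errors) / len(errors)
            report[partition][model] = (rase, aae)
    return report


# Made-up rows: actual count plus two hypothetical prediction columns.
rows = [
    {"partition": "Training", "count": 10, "tree": 12, "regression": 16},
    {"partition": "Training", "count": 20, "tree": 18, "regression": 12},
    {"partition": "Testing", "count": 30, "tree": 33, "regression": 20},
    {"partition": "Testing", "count": 40, "tree": 39, "regression": 46},
]
report = compare_models(rows, ["tree", "regression"])
# Pick the model with the lowest RASE in each partition.
best = {partition: min(scores, key=lambda m: scores[m][0])
        for partition, scores in report.items()}
print(best)
```

Checking that the same model wins on the validation and testing partitions, and not only on training, guards against choosing a model that merely overfits.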

Prediction Metrics – Numeric Distribution of Prediction error for each model

Fig 36:

Conclusion:

Based on the model comparison and the analysis of the prediction error distribution for each model run on this data set, we reached the following conclusions.

The comparison statistics show that the decision tree model gives the most efficient and effective prediction of rental demand.

The next in order of ranking is the boosted tree model.

The error distribution shows the same ranking: the decision tree model has the smallest error percentage (Error Mean = 0.33), the boosted tree model a slightly higher one (Error Mean = 0.44), and the multiple regression model the highest (Error Mean = 0.82), which is why we consider the regression model the least effective.

However, an important lesson from our exploration phase is that analysis of individual variables also helps in understanding the prediction outcome. Although the regression model's prediction error was high, its prediction profiler showed the same predictor variables and the same patterns of influence that we had observed in the individual analysis. So even though the regression model did not give the most efficient and accurate result, it corroborated that our exploration and analysis were heading in the right direction in terms of understanding the influence of each variable. We ultimately confirmed this with the column contributions of the decision tree model, which is the best model according to our evaluation.

Business Solution:

In walking through SEMMA, we find that the data helps us draw conclusions that address business problems. The data shows that casual customers and registered customers have different bike rental habits. This is valuable information that can help grow both customer populations. Rental trends show that we can manage our inventory by season, offering more inventory during the peak months to accommodate more users.

Casual customers include tourists and infrequent bike renters. For tourists visiting Washington DC, bike rentals are a cost-effective way of getting around the city for exploring and sightseeing. As a company, we can offer recommendations and coupons for other attractions that can be reached by bike. This type of incentive does not cut into profit, because it does not reduce the price of a rental the way a bike rental coupon would. To attract new customers, a first-time renter's discount can be offered, which lets the user try bike rental at low risk. Our registered customers are the most valuable. To retain them, accessory options can be offered, and registration can enrol the customer in a loyalty program with exclusive access to amenities such as cooling centers or coupons for related products.
