FORECASTING
ANALYTICS PROJECT REPORT
Submitted By:
Arka Sarkar (61310610)
Kushal Paliwal (61310280)
Malvika Gaur (61310456)
Shwaitang Singh (61310261)
Executive Summary:
Problem Description: We aim to forecast daily sales (units sold) of the top 5 selling SKUs for the
coming week (1st August 2012 to 7th August 2012). We identify the top 5 selling SKUs in 2012 as
follows:
Item                                  Sales (INR)
Saffola Gold Oil 5 Lt. Jar            902,920.47
Bikaji Nmkn Bhujia Sev 1KG            819,325.89
Daawat Devaaya BM Rice 5 Kg. Pack     780,153.15
Ashirwaad Atta 10 Kg PO               587,513.58
Fortune Refine Soya Oil 5 Lt. Jar     538,473.99
From a total of nearly 10,000 SKUs, the top 5 SKUs alone account for nearly 3% of the store’s
revenues (see Appendix I). These 5 SKUs are also bellwether products for their classes, so store
managers can read a change in their sales as an indication of a change in sales
for the respective classes. Forecasts of the sales of the top 5 SKUs can help managers to:
Estimate the volatility of earnings and design promotion campaigns to smooth earnings
Protect against stock-outs to avoid lost business opportunities
Data Description: We are provided with 2 datasets, namely customer data and transaction data. The
transaction data comprises details (quantity sold, extended price, etc.)
Figure: Quantity Sold per day for Saffola Gold 5 Lt. Jar
Figure: Quantity Sold per day of the week for Saffola Gold 5 Lt. Jar
Key Data Characteristics:
Missing Values: There were a number of missing values for the Top 5 SKUs. These missing values
can be interpreted as either (1) an error in recording data or (2) no sales (i.e. data value = 0) due
to either stock-outs or no demand for the product.
Seasonality: The SKUs exhibit some level of seasonality. In most cases there is a pronounced
shift in sales over the weekends as compared to the weekdays. In some cases, a 7
day seasonality is exhibited.
Transaction size: We see that the transaction size (i.e. the number of units bought) is in
multiples of 3.
Outliers: There are a number of outliers (i.e. values more than 3 standard deviations
away from the mean). These occur due to (1) a higher than normal number of transactions, driven in some cases by
proximity to public holidays, or (2) a single ‘bulk buyer’ who dominates the sales.
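The 3 standard deviation rule for outliers can be illustrated with a short sketch. The function name, the use of the population standard deviation, and the daily figures below are assumptions made for illustration; they are not taken from the store data.

```python
import statistics

def flag_outliers(series, k=3.0):
    """Return indices of values more than k standard deviations from the mean."""
    mean = statistics.mean(series)
    sd = statistics.pstdev(series)  # population SD; an assumption of this sketch
    if sd == 0:
        return []
    return [i for i, x in enumerate(series) if abs(x - mean) > k * sd]

# Made-up daily unit sales; the spike mimics a single bulk buyer.
daily_units = [3, 6, 3, 0, 6, 3, 9, 3, 6, 3, 0, 3, 6, 3, 60, 3, 6, 3, 3, 6]
print(flag_outliers(daily_units))
```

Only the day with 60 units is flagged here; ordinary weekend bumps stay well inside the 3 standard deviation band.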
High-level description of the final forecasting method and performance:
Item                                  Forecasting Method*                 RMSE      MAPE
Saffola Gold Oil 5 Lt. Jar            MLR and Residual Forecast           5.262     71.66%
Bikaji Nmkn Bhujia Sev 1KG            Seasonal Naïve Forecast**           8.635     69.81%
Daawat Devaaya BM Rice 5 Kg. Pack     MLR and Residual Forecast           6.862     68.14%
Ashirwaad Atta 10 Kg PO               Multiple Linear Regression (MLR)    16.059    45.33%
Fortune Refine Soya Oil 5 Lt. Jar     Holt-Winters Smoothing              11.097    52.69%
*Please refer to Appendix III, IV, V and VI for the respective models tested, intermediate results and forecast
plot for the next 7 days. The forecast for Saffola is shown in the accompanying presentation.
**For the SKU Bikaji, although the MLR with Residual Forecast shows a much better visual fit (see Appendix III),
it does not outperform the seasonal naïve forecast on the error measures for the specific 7 day period that
we chose.
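The two error measures in the table can be computed as in the sketch below. The handling of zero-sales days in MAPE is an assumption: the report does not say how such days were treated, and the percentage error is undefined when the actual value is 0. The sample numbers are made up.

```python
import math

def rmse(actual, forecast):
    """Root mean squared error over paired observations."""
    errors = [a - f for a, f in zip(actual, forecast)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Assumption: days with zero actual sales are skipped, because the
    percentage error is undefined when the actual value is 0.
    """
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    return 100.0 * sum(abs((a - f) / a) for a, f in pairs) / len(pairs)

# Made-up numbers for illustration only (not from the store data).
actual = [3, 6, 0, 9, 12, 3, 6]
forecast = [4, 5, 1, 9, 10, 6, 6]
print(round(rmse(actual, forecast), 3), round(mape(actual, forecast), 2))
```

Because daily unit sales are small, an error of a few units is a large percentage of the actual value, which is one reason MAPE can look high even when RMSE is modest.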
Forecast assumptions:
Level of confidence for the forecast: These are just forecasts and should not be construed as
predictions. The methods used produce forecasts with a 95% level of confidence. We believe the
relevant error measures in this case are: RMSE and MAPE. Relatively higher values of
Customer preferences do not change over the 7 day forecast period and stay the same as those
experienced over the training and validation period (1st August, 2011 to 31st July, 2012).
The store has sufficient inventory, so there would be no stock-outs acting as an upper
limit on the number of units that can be sold.
Missing values imply that there were no sales on that given day for the SKU. The data has been
filled with zeros in cases where there was missing data.
Outliers due to ‘bulk purchases’ (more than 4 times the regular purchase size) made by a particular
customer are considered to be random events and cannot be forecast.
Impact of holidays is not incorporated, as we do not have an exhaustive set of holidays in the
region where the store is located. Additionally, we have limited information on how the effect
of a holiday should be built into the model. In some cases, the SKU sales peak 2 days
before the actual holiday, in others they peak on the actual day of the holiday, while in others
they peak 2 days after the actual holiday. In the absence of a credible source on the different types of
holiday, coupled with the fact that the data we have covers only 1 year (i.e. we cannot
confirm the type of holiday across years), we have not incorporated the effects of holidays into our
forecast.
Conclusions and recommendations:
Cash flow variations: Since these five SKUs combined account for nearly 3% of the revenues of the
store, accurate forecasting will help in understanding the cash flows via receipts over the
immediate future. This information can then be used to balance outflows of cash, including
payments to be made to suppliers, wholesalers, etc. We recommend that the store owner use this
information to manage cash flow positions.
Protection against stock-outs: Accurate forecasts will help to gauge demand for these
products over the next week. Since these SKUs are important from a revenue standpoint,
accurate forecasts will help to anticipate an increase in demand and give the store owner
ample lead time to order the SKUs and prevent stock-outs.
Inventory management: The store owner could contrast the revenues generated by these SKUs with those of
lower selling SKUs. In case of high anticipated demand and limited shelf space, the owner could
remove lower grossing items and order extra quantities of these top selling SKUs. Additionally,
these forecasts could be used to better manage inventory levels in general.
Holiday demand: It was observed that the sales of these 5 SKUs tend to peak just before
holidays. The store owner could use these forecasts to protect against stock-outs during such
periods, especially since these SKUs are the highest grossing items.
Technical Summary:
Data Preparation Issues:
Filtering: The first level of data preparation was filtering for the particular SKU of interest.
Missing Values Removal: We insert zeros (assuming that there were no sales on the particular
day) for the dates where there was no data in the dataset.
Outlier Removal: In this step, a frequency chart was created to understand whether there were any
aberrations in the purchase pattern. For example, if 3 and 6 are the most common purchase
quantities, we needed to understand how to deal with a customer who has bought 9 units. Our
treatment of outliers has been conservative: we have replaced only the aberrant rows, using the
highest-frequency value; for example, a value of ‘9’ has been replaced with ‘3’
in the case of Saffola Gold.
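The preparation steps above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the `max_multiple` threshold that decides which quantities count as aberrant, and the sample transactions are all hypothetical, and the report does not state the exact rule it applied.

```python
from collections import Counter
from datetime import date, timedelta

def clean_quantities(quantities, max_multiple=2):
    """Replace unusually large purchase quantities with the most common one.

    Assumption: a quantity is 'aberrant' when it exceeds max_multiple times
    the most frequent quantity; this threshold is illustrative only.
    """
    mode = Counter(quantities).most_common(1)[0][0]
    return [mode if q > max_multiple * mode else q for q in quantities]

def daily_series(transactions, start, end):
    """Sum quantities per day and insert 0 for dates with no rows."""
    totals = Counter()
    for day, qty in transactions:
        totals[day] += qty
    n_days = (end - start).days + 1
    return [totals.get(start + timedelta(days=i), 0) for i in range(n_days)]

# Hypothetical transactions for one SKU, already filtered: (date, units bought).
raw = [(date(2012, 7, 1), 3), (date(2012, 7, 1), 6),
       (date(2012, 7, 2), 3), (date(2012, 7, 4), 9), (date(2012, 7, 4), 3)]
cleaned_qty = clean_quantities([q for _, q in raw])
cleaned = [(d, q) for (d, _), q in zip(raw, cleaned_qty)]
series = daily_series(cleaned, date(2012, 7, 1), date(2012, 7, 5))
print(series)
```

In this toy input the purchase of 9 units becomes 3, the purchase of 6 units is left alone, and the two dates with no rows appear as zeros in the daily series.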
Methodology: The sequence of steps in the forecast is shown below:
1. Data Visualization
2. Random Walk Test
3. Benchmarking: Naive and Seasonal Forecast
4. Single Layer: Multiple Linear Regression Model
5. Dual Layer: Multiple Linear Regression Model with Residual Forecast
6. Smoothing Method (Additive Holt-Winters)
Caveats: As part of our investigations, we found that customers who purchased certain SKUs (e.g.
Saffola) did in fact return for a second purchase after a certain period of time. An alternate way to forecast the sales
of Saffola would have been to build a forecast of the number of customers
visiting the store on any given day to purchase Saffola Gold. We believe this would be
venturing into the realm of econometric models. Given the current data, such a model may have very high
explanatory power, but the correlation cannot be construed as causation and made the basis for a
forecast.
Data Visualization: We first perform a visual check to detect level, trend and seasonality in the data.
To detect seasonality, we draw ACF plots of the quantity sold to identify the seasonal lags. An
example of how this was done for the Saffola SKU is shown below, wherein we conclude from the ACF plot that there
is 7 day seasonality in the data. We also see that we would need an additive
model, based on the linear trend line plotted for the quantity of Saffola Gold SKU sold per day.
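The ACF check for seasonal lags can be sketched in a few lines. The synthetic series and the approximate 95% bound of 1.96 / sqrt(n) are assumptions for illustration; the report's own plots were produced with its analysis tool, not with this code.

```python
import math

def acf(series, max_lag=10):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(max_lag + 1)]

def significance_bound(n):
    """Approximate 95% bound for the ACF of white noise: 1.96 / sqrt(n)."""
    return 1.96 / math.sqrt(n)

# Synthetic series with a weekend bump every 7 days (illustration only).
series = [10 if day % 7 in (5, 6) else 3 for day in range(70)]
values = acf(series)
bound = significance_bound(len(series))
peak_lag = max(range(1, 11), key=lambda k: values[k])
print(peak_lag, round(values[peak_lag], 3), round(bound, 3))
```

A peak at lag 7 that clearly exceeds the bound is the pattern read as 7 day seasonality.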
Random Walk Test: We perform the random walk test to determine whether the series at hand can actually
be forecast. We find that none of the series is a random walk.
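One common way to run such a test is to fit an AR(1) model and check whether its slope coefficient differs from 1. The sketch below applies that idea to the AR(1) coefficients and standard errors reported in Appendix V and VI; the helper name is hypothetical, and treating this as the exact procedure the authors followed is an assumption.

```python
def random_walk_t_stat(ar1_coeff, std_err):
    """t-statistic for H0: AR(1) slope equals 1 (series is a random walk)."""
    return (ar1_coeff - 1.0) / std_err

# AR(1) coefficients and standard errors as reported in Appendix V and VI.
reported = {
    "Fortune Refine Soya Oil": (0.3708598, 0.04654105),
    "Aashirwaad Atta": (0.22535062, 0.02676752),
}
for sku, (coeff, se) in reported.items():
    t = random_walk_t_stat(coeff, se)
    # An absolute value far above about 2 suggests the slope differs from 1.
    print(f"{sku}: t = {t:.2f}")
```

A stricter unit-root test would compare the statistic against Dickey-Fuller critical values rather than a cutoff of about 2; the values here are far beyond either threshold.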
Benchmarking: We use the naïve and the seasonal naïve forecasts as the benchmarks for the final
model.
Error Measure    Naïve Forecast    Seasonal Naïve Forecast
RMSE             13.81094          11.31513
MAPE             150.20%           89.70%
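The two benchmarks are simple enough to state in code. This is a minimal sketch with made-up history; the function names are hypothetical and the 7 day season follows the seasonality described above.

```python
def naive_forecast(history, horizon):
    """Repeat the last observed value for every future day."""
    return [history[-1]] * horizon

def seasonal_naive_forecast(history, horizon, season=7):
    """Repeat the value observed on the same day of the last full season."""
    start = len(history) - season
    return [history[start + (h % season)] for h in range(horizon)]

# Two made-up weeks of daily unit sales.
history = [3, 3, 6, 3, 3, 12, 15, 3, 6, 3, 3, 6, 9, 18]
print(naive_forecast(history, 7))
print(seasonal_naive_forecast(history, 7))
```

The naïve forecast carries the last weekend spike through the whole week, while the seasonal naïve forecast reproduces last week's weekday and weekend pattern.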
Multiple Linear Regression Model: Based on the trend and seasonality involved, we forecast each
SKU using a multiple linear regression model with a linear trend, with seasonality built in using dummy variables,
and with Training, Validation and Testing partitions. The time periods for these partitions are:
o Training: 1st August 2011 to 24th July 2012
o Validation (1 week): 25th July 2012 to 31st July 2012
o Test (1 month): 1st August 2012 to 31st August 2012
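A regression of this shape (intercept, day index, and six weekday dummies) can be sketched with ordinary least squares. The code below is an illustration on synthetic, noise-free data, using a small hand-written solver so that it needs only the standard library; the function names and the data are assumptions, not the authors' implementation.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def design_row(day_index, season=7):
    """Intercept, linear trend, and season-1 weekday dummies."""
    weekday = day_index % season
    return [1.0, float(day_index)] + [1.0 if weekday == d else 0.0
                                      for d in range(1, season)]

def fit_ols(day_indices, y):
    """Least squares via the normal equations (X'X) beta = X'y."""
    x = [design_row(t) for t in day_indices]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)

def predict(beta, day_index):
    """Forecast for one day from the fitted coefficients."""
    return sum(b * v for b, v in zip(beta, design_row(day_index)))

# Synthetic data: slight upward trend plus a bump on weekdays 0 and 6.
days = list(range(1, 57))
sales = [5 + 0.01 * t + (6 if t % 7 in (0, 6) else 0) for t in days]
train_days, valid_days = days[:-7], days[-7:]   # last week held out
beta = fit_ols(train_days, sales[:-7])
valid_forecast = [predict(beta, t) for t in valid_days]
print([round(v, 2) for v in valid_forecast])
```

One weekday is left without a dummy and acts as the baseline, which matches the six Weekday_1 to Weekday_6 terms in the regression output shown in Appendix V.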
Multiple Linear Regression Model with Residual Forecast: To further improve the model, we add an
additional layer of residual forecast to the model. In the case of the Saffola Gold SKU, we notice that
there is still some seasonality (lag 4) that remains in the residuals, which can be captured by an AR(4)
model.
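The two-layer idea can be sketched as: fit an AR(p) model to the regression residuals, forecast the residuals forward, and add them to the regression forecast. The example below uses p = 1 and an exactly autoregressive synthetic residual series so the result is easy to check; the function names, the residuals, and the first-layer forecasts are all hypothetical, and the order p would be chosen from the residual ACF as described above.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ar(residuals, p):
    """Fit e_t = c + phi_1 e_{t-1} + ... + phi_p e_{t-p} by least squares."""
    rows = [[1.0] + [residuals[t - k] for k in range(1, p + 1)]
            for t in range(p, len(residuals))]
    target = residuals[p:]
    k = p + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, target)) for i in range(k)]
    return solve(xtx, xty)

def forecast_ar(residuals, coeffs, horizon):
    """Iterate the AR recursion forward, feeding forecasts back in."""
    p = len(coeffs) - 1
    buf = list(residuals)
    for _ in range(horizon):
        buf.append(coeffs[0] + sum(coeffs[k] * buf[-k] for k in range(1, p + 1)))
    return buf[-horizon:]

# Synthetic residuals that follow e_t = -0.5 * e_{t-1} exactly.
residuals = [8.0]
for _ in range(11):
    residuals.append(-0.5 * residuals[-1])
coeffs = fit_ar(residuals, p=1)
resid_fc = forecast_ar(residuals, coeffs, horizon=3)
regression_fc = [20.0, 21.0, 22.0]  # hypothetical first-layer forecasts
combined = [r + e for r, e in zip(regression_fc, resid_fc)]
print([round(c, 4) for c in coeffs], [round(v, 4) for v in combined])
```

After the second layer is added, the ACF of the new residuals should show no remaining significant lags; if it does, a different order p may be needed.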
Holt-Winters Smoothing Method: Last but not least, we try a smoothing method to estimate
the sales for the various SKUs and look for the lowest values of RMSE and MAPE for the 1 week
forecast that we intend to develop. To take the example of Saffola Gold, we find the results below by
running Holt-Winters smoothing with various parameters.
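For readers unfamiliar with the method, additive Holt-Winters smoothing keeps a level, a trend, and one seasonal index per day of the week, each updated with its own smoothing parameter. The sketch below is a minimal version under stated assumptions: the initialisation from the first two seasons is one simple choice among several, and the parameter values and data are made up rather than taken from the report.

```python
def holt_winters_additive(series, season, alpha, beta, gamma, horizon):
    """Additive Holt-Winters with level, trend, and seasonal components."""
    # Simple initial values (an assumption; other initialisations exist).
    level = sum(series[:season]) / season
    trend = (sum(series[season:2 * season]) - sum(series[:season])) / season ** 2
    seasonal = [series[i] - level for i in range(season)]
    for t in range(season, len(series)):
        s = seasonal[t % season]
        prev_level = level
        level = alpha * (series[t] - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % season] = gamma * (series[t] - level) + (1 - gamma) * s
    n = len(series)
    return [level + (h + 1) * trend + seasonal[(n + h) % season]
            for h in range(horizon)]

# Made-up weekly pattern repeated for four weeks (no trend, no noise).
pattern = [3, 3, 3, 3, 3, 9, 9]
series = pattern * 4
forecast = holt_winters_additive(series, season=7, alpha=0.2, beta=0.1,
                                 gamma=0.3, horizon=7)
print([round(f, 2) for f in forecast])
```

On a perfectly repeating series the forecast reproduces the weekly pattern; on real data the three parameters would be tuned against validation RMSE and MAPE.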
Summary of results for the various SKUs:
Figure: Time Plot of Actual Vs Forecast, Training Data (Quantity Sold on Y-axis and Dayindex on X-axis)
Appendix I: Selection of Top 5 SKUs
Top 5 selling SKUs in 2012 are as shown below:
Total Sales in the Department for the years 2011 and 2012:
Top 5 selling SKUs in 2012 as a percentage of total sales (in INR) in 2012, is shown below:
Saffola Gold Oil 5 Lt. Jar            902,920.47
Bikaji Nmkn Bhujia Sev 1KG            819,325.89
Daawat Devaaya BM Rice 5 Kg. Pack     780,153.15
Ashirwaad Atta 10 Kg PO               587,513.58
Fortune Refine Soya Oil 5 Lt. Jar     538,473.99
Sum of Top 5 SKUs                     3,628,387.08
Total (for all SKUs)                  122,781,098.6
% of sales                            2.96%
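The share figure can be reproduced directly from the numbers in the table. The snippet below is only an arithmetic check; the variable names are illustrative.

```python
# Sales figures (INR) as listed in the table above.
top5_sales = {
    "Saffola Gold Oil 5 Lt. Jar": 902_920.47,
    "Bikaji Nmkn Bhujia Sev 1KG": 819_325.89,
    "Daawat Devaaya BM Rice 5 Kg. Pack": 780_153.15,
    "Ashirwaad Atta 10 Kg PO": 587_513.58,
    "Fortune Refine Soya Oil 5 Lt. Jar": 538_473.99,
}
total_sales = 122_781_098.6

top5_total = sum(top5_sales.values())
share_pct = 100.0 * top5_total / total_sales
print(f"{top5_total:,.2f} INR, {share_pct:.2f}% of total sales")
```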
Appendix II: Line Graphs for SKUs
Figure: Haldiram Sev Bhujia (Quantity Sold on Y-axis and dates on X-axis)
Figure: Fortune Refined Oil (Quantity Sold on Y-axis and dates on X-axis)
Figure: Daawat Rice (Quantity Sold on Y-axis and dates on X-axis)
Figure: Aashirwaad Aata (Quantity Sold on Y-axis and dates on X-axis)
Appendix III: Bikaji Sev Bhujia
A histogram of the data is plotted after data cleaning was performed. The ACF plot shows that there is a
seven day seasonality that can be exploited during forecasting.
Visualizing the data: The data is fairly noisy, with close to no trend.
Naïve forecast and the Seasonal Naive Forecast:
Multiple Linear Regression Plot using a 7 day seasonality (modelled using dummy variables)
Figure: Histogram (Frequency on Y-axis and Quantity Sold on X-axis)
Figure: ACF Plot for Quantity Sold (ACF, UCI and LCI on Y-axis and Lags on X-axis)
Figure: Quantity Sold with linear trend line and 2-period moving average
Figure: Quantity Sold and Naïve Forecast
Figure: Quantity Sold and Seasonal Naïve Forecast
The ACF plot for the residuals shows that there is some signal that remains in the residuals.
We see a good fit once we capture the signal in the residuals using an AR(5) model and add it back
to our original model. However, for the one week forecast that we wish to make, we do not get a result
better than the seasonal naive forecast on the error measures chosen.
Figure: Quantity Sold and Forecast
Figure: Actual Value and New Forecast
ACF Plot for residuals shows that there is no further signal to be captured.
Forecast for Bikaji Namkeen, for the next 7 days using the multiple linear regression with AR(5)
model for the residuals:
Figure: Actual and Forecast for days 1 to 7
Appendix IV: Daawat Rice
Day of the week break down for Daawat Rice is shown below. We notice that there is a clear
difference in levels for weekends as compared to weekdays.
Data preparation: We see that there are a number of missing values and negative values. For the missing
values, we use our judgement to conclude that these were recording errors and replace them with
the average value in the data.
Quantity bought per Transaction Number of such transactions
-6 1
-3 2
3 1056
6 92
9 6
12 5
15 2
27 1
30 1
Grand Total 1166
Seasonal Naïve Forecast:
Figure: Quantity Sold and Naïve Forecast
We show a histogram of the data after we correct for missing values and negative values.
We also show the ACF plot and see that there is seven day seasonality in the data.
Training and Validation Actual vs. Forecasted (MLR with seven day seasonality) and the respective
residual plot are shown below. We then check the ACF plot for
Figure: Histogram (Frequency on Y-axis and Quantity Sold on X-axis)
Figure: ACF Plot for Quantity Sold (ACF, UCI and LCI on Y-axis and Lags on X-axis)
Figure: Actual Value and Predicted Value
Figure: Training and Validation Residuals
Figure: Actual and Predicted
Figure: Residual
We check the residual plots and see that there is some seasonality that is not captured; we
attempt to capture it via an AR model.
Results of the multiple linear regression with an overlay of the residual forecasts show a much better
fit. Additionally, the residual plots do not show any further signal that can be captured.
Figure: ACF Plot for Residual Combined (ACF, UCI and LCI on Y-axis and Lags on X-axis)
Figure: Quantity Sold and ARIMA Forecast (dates from 1-Aug-11 to 1-Jul-12 on X-axis)
Figure: ACF Plot for New Residuals (ACF, UCI and LCI on Y-axis and Lags on X-axis)
Appendix V: Fortune Oil
Visualizing the data:
Sales Dominant on Weekends:
Data Clean Up: Negative quantity values are replaced with the lowest value (assuming those were
data entry errors). For missing values we’ve inserted zero value rows.
Quantity bought per transaction No. of such transactions
-3 2
3 571
6 8
9 2
Grand Total 583
Test for Random Walk:
ARIMA Model
ARIMA Coeff StErr p-value
Const. term 2.80973697 0.36031583 0
AR1 0.3708598 0.04654105 0
Naïve Forecast:
Residuals & Errors:
Seasonal Naïve Forecast: Not much improvement since there are many intermediate missing values
which cause even the seasonal model to break quite often.
We then check for seasonality in data. We notice that seasonality is most pronounced for day 7.
Figure: Quantity Sold and Naïve Forecast
Figure: Residual
Figure: ACF Plot for Quantity Sold (ACF, UCI and LCI on Y-axis and Lags on X-axis)
Multiple Linear Regression with 7 day seasonality:
The Regression Model
Input variables Coefficient Std. Error p-value SS
Constant term 5.91226006 0.94501734 0 5970.183594
Dayindex 0.01308713 0.00286613 0.00000684 670.5512085
Weekday_1 -4.97293854 1.10906196 0.0000099 37.27494812
Weekday_2 -5.15910244 1.10904717 0.00000462 76.001297
Weekday_3 -3.24176908 1.11446846 0.0038592 22.27444077
Weekday_4 -5.84309149 1.11444271 0.00000023 235.9051971
Weekday_5 -5.44441414 1.11442423 0.00000153 332.5310669
Weekday_6 -4.63397169 1.11441314 0.00004032 547.5755615
Training Data scoring - Summary Report
Total sum of squared errors    RMS Error      Average Error
11115.70246                    5.564437033    -1.33148E-08
Validation Data scoring - Summary Report
Total sum of squared errors    RMS Error      Average Error
113.1180556                    4.019915699    -1.334990053
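The RMS Error in each summary report is the square root of the total squared error divided by the number of days in the partition. The short check below reproduces both values, assuming the partition dates given in the methodology section (training from 1st August 2011 to 24th July 2012, validation from 25th July 2012 to 31st July 2012); the helper name is hypothetical.

```python
import math
from datetime import date

def rms_error(total_squared_error, n_obs):
    """RMS error from the total sum of squared errors over n_obs days."""
    return math.sqrt(total_squared_error / n_obs)

# Partition lengths in days, counting both end dates.
train_days = (date(2012, 7, 24) - date(2011, 8, 1)).days + 1
valid_days = (date(2012, 7, 31) - date(2012, 7, 25)).days + 1

train_rms = rms_error(11115.70246, train_days)
valid_rms = rms_error(113.1180556, valid_days)
print(train_days, round(train_rms, 4), valid_days, round(valid_rms, 4))
```

The near-zero average error on the training data is expected for a least squares fit with an intercept, while the validation average error of about -1.33 units per day indicates a bias over that week.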
Actual Vs Forecast and Residuals:
Figure: Quantity Sold and Forecast
Figure: Residuals
ACF Plot for Residuals of the multiple linear regression
Multiple linear regression with overlay of an AR model for the residuals: We see that there is still
some signal remaining in the residuals. However, on running the model, we do not see a significant
improvement in terms of the error measures.
Figure: ACF Plot for E (ACF, UCI and LCI on Y-axis and Lags on X-axis)
Appendix VI: Aashirwaad Atta
Visualizing the data:
Sales dominant on weekends:
Outliers and Missing Data:
Transaction Size    Count of Customer_No
3                   1001
6                   114
9                   28
12                  8
15                  5
18                  1
21                  2
24                  2
27                  1
30                  3
36                  1
Grand Total         1166
Missing values are replaced with zero. Spikes are treated as outliers and are replaced with closest
values that fall within our tolerance range. Most such cases are due to one single customer doing
infrequent bulk purchases.
Random Walk Test: We see that the
ARIMA Coeff StErr p-value
Const. term 8.31235886 0.61765313 0
AR1 0.22535062 0.02676752 0
Naïve Forecast:
Residuals:
Figure: Quantity Sold and Naïve Forecast
Figure: Residual
ACF Plot to detect seasonality:
ACF Values
Lags    ACF
0       1
1       0.22546686
2       0.01018919
3       0.07472585
4       0.01640873
5       -0.0139351
6       0.03910767
7       0.21895115
8       0.03814469
9       -0.10238601
10      0.07832895
Multiple Linear Regression with seven day seasonality:
Residuals:
Figure: Quantity Sold and Forecast
Figure: Residuals
Figure: ACF Plot for Quantity Sold (ACF, UCI and LCI on Y-axis and Lags on X-axis)
Multiple Linear Regression with overlay of the AR(7) for the residuals:
Figure: Quantity Sold and Double Layer Forecast