
FORECASTING

ANALYTICS PROJECT REPORT

Submitted By:

Arka Sarkar (61310610)

Kushal Paliwal (61310280)

Malvika Gaur (61310456)

Shwaitang Singh (61310261)

Executive Summary:

Problem Description: We aim to forecast daily sales (units sold) of the top 5 selling SKUs for the coming week (1st August 2012 to 7th August 2012). We identify the top 5 selling SKUs in 2012 as follows:

Item                                Sales (INR)
Saffola Gold Oil 5 Lt. Jar          902,920.47
Bikaji Nmkn Bhujia Sev 1KG          819,325.89
Daawat Devaaya BM Rice 5 Kg. Pack   780,153.15
Ashirwaad Atta 10 Kg PO             587,513.58
Fortune Refine Soya Oil 5 Lt. Jar   538,473.99

From a total of nearly 10,000 SKUs, the top 5 SKUs alone account for nearly 3% of the store’s revenues (see Appendix I). These 5 SKUs are also bellwether products for their classes, so a change in their sales gives store managers an indication of a change in sales for the respective classes. Forecasts of the sales of the top 5 SKUs can help managers to:

Estimate the volatility of earnings and design promotion campaigns to smooth earnings

Protect against stock-outs to avoid lost business opportunities

Data Description: We are provided with 2 datasets, namely customer data and transaction data. The transaction data comprises details (quantity sold, extended price, etc.)

Figure: Quantity Sold per day for Saffola Gold 5 Lt. Jar

Figure: Quantity Sold per day of the week for Saffola Gold 5 Lt. Jar

Key Data Characteristics:

Missing Values: There were a number of missing values for the top 5 SKUs. These missing values can be interpreted either as (1) errors in recording data or (2) no sales (i.e. data value = 0) due to either stock-outs or no demand for the product.

Seasonality: The SKUs exhibit some level of seasonality. In most cases there is a pronounced shift in trend over the weekends as compared to the weekdays. In some cases, a 7 day seasonality is exhibited.

Transaction size: We see that the transaction size (i.e. the number of units bought) is in multiples of 3.

Outliers: There are a number of outliers (i.e. values more than 3 standard deviations away from the mean). These occur due to (1) a higher than normal number of transactions, driven in some cases by proximity to public holidays, or (2) a single ‘bulk buyer’ who dominates the sales.
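The 3 standard deviation rule for outliers can be illustrated with a minimal sketch. The function name `flag_outliers` and the daily figures are hypothetical, and the use of the population standard deviation is an assumption, since the report does not say which variant was used.

```python
from statistics import mean, pstdev


def flag_outliers(daily_units, k=3.0):
    """Return the indices of days whose sales lie more than k standard
    deviations away from the mean of the series."""
    mu = mean(daily_units)
    sigma = pstdev(daily_units)  # assumption: population standard deviation
    if sigma == 0:
        return []  # a constant series has no outliers
    return [i for i, units in enumerate(daily_units)
            if abs(units - mu) > k * sigma]


# Hypothetical daily unit sales, with one bulk purchase on the last day
daily_units = [6, 9, 3, 6, 12, 9, 6, 3, 9, 6, 6, 9, 3, 6, 90]
print(flag_outliers(daily_units))
```

Flagged days would still need a manual check to separate holiday-driven peaks from single bulk buyers, since the rule itself cannot tell the two causes apart.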

High Level description of the final forecasting method and performance:

Item                                Forecasting Method*                RMSE     MAPE
Saffola Gold Oil 5 Lt. Jar          MLR and Residual Forecast          5.262    71.66%
Bikaji Nmkn Bhujia Sev 1KG          Seasonal Naïve Forecast**          8.635    69.81%
Daawat Devaaya BM Rice 5 Kg. Pack   MLR and Residual Forecast          6.862    68.14%
Ashirwaad Atta 10 Kg PO             Multiple Linear Regression (MLR)   16.059   45.33%
Fortune Refine Soya Oil 5 Lt. Jar   Holt-Winters Smoothing             11.097   52.69%

*Please refer to Appendices III, IV, V and VI for the respective models tested, intermediate results and forecast plots for the next 7 days. The forecast for Saffola is shown in the accompanying presentation.

**For the Bikaji SKU, although the MLR with Residual Forecast shows a much better visual fit (see Appendix III), it does not outperform the seasonal naïve forecast on the error measures for the specific 7 day period that we chose.
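For reference, the two error measures reported in the table can be computed as in the sketch below. The function names and the sample numbers are illustrative. Skipping days with zero actual sales in MAPE is an assumption: the percentage error is undefined on those days, and the report does not state how they were handled.

```python
from math import sqrt


def rmse(actual, forecast):
    """Root mean squared error over paired observations."""
    errors = [a - f for a, f in zip(actual, forecast)]
    return sqrt(sum(e * e for e in errors) / len(errors))


def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Assumption: days with zero actual sales are skipped, because the
    percentage error cannot be computed for them.
    """
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when all actual values are zero")
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)


# Hypothetical one-week validation window (units sold per day)
actual = [6, 9, 3, 12, 6, 15, 18]
forecast = [5, 8, 5, 10, 7, 12, 15]
print(rmse(actual, forecast), mape(actual, forecast))
```

Because daily quantities for a single SKU are small, a miss of a few units is a large percentage of the actual value, which is one reason MAPE can look high even when RMSE is only a few units.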

Forecast assumptions:

Level of confidence for the forecast: These are just forecasts and should not be construed as predictions. The methods used produce forecasts with a 95% level of confidence. We believe the relevant error measures in this case are RMSE and MAPE. Relatively higher values of

Customer preferences do not change over the 7 day forecast period and stay the same as those experienced over the training and validation period (1st August 2011 to 31st July 2012)

The store has sufficient inventory, and there will be no stock-outs that would act as upper limits on the number of units that can be sold

Missing Values imply that there were no sales on that given day for the SKU. The data has been

fitted to incorporate zeros in cases where there was missing data.

Outliers due to ‘bulk purchases’ (more than 4 times the regular purchase size) made by a particular customer are considered to be random events and cannot be forecasted

The impact of holidays is not incorporated, as we do not have an exhaustive set of holidays for the region where the store is located. Additionally, we have limited information on how the effect of a holiday should be built into the model. In some cases SKU sales peak 2 days before the actual holiday, in others they peak on the day of the holiday, and in others they peak 2 days after it. In the absence of a credible source for the different types of holiday, and because the data covers only 1 year (i.e. we cannot confirm the effect of a type of holiday across years), we have not incorporated the effects of holidays into our forecast.

Conclusions and recommendations:

Cash flow variations: Since these five SKUs combined account for nearly 3% of the store’s revenues (2.96%, see Appendix I), accurate forecasting will help in understanding cash inflows from receipts over the immediate future. This information can then be used to balance cash outflows, including payments to suppliers, wholesalers, etc. We recommend that the store owner use this information to manage cash flow positions.

Protection against stock-outs: Accurate forecasts will help us to gauge demand for these

products over the next one week. Since these SKUs are important from a revenue standpoint,

accurate forecasts will help us to anticipate an increase in demand, and give the store-owner

ample lead time to order the SKUs and prevent stock-outs.

Inventory management: The store-owner could contrast revenues generated by these SKUs and

lower selling SKUs. In case of high anticipated demand and limited shelf space, the owner could

remove lower grossing items and order excess quantity of these top selling SKUs. Additionally,

these forecasts could be used to better manage inventory levels in general.

Holiday demand: It was observed that the sales of these 5 SKUs tend to peak just before

holidays. The store-owner could use these forecasts to protect against stock-outs during such

periods, especially since these SKUs are the highest grossing items.

Technical Summary:

Data Preparation Issues:

Filtering: The first level of data preparation included filtering for the particular SKU of interest

Missing Values Removal: We insert zeros (assuming that there were no sales on the particular day) for the dates on which there was no data in the dataset

Outlier Removal: In this step, a frequency chart was created to understand whether there were any aberrations in the purchase pattern. For example, if 3 and 6 are the most common purchase quantities, we needed to decide how to deal with a customer who has bought 9 units. Our treatment of outliers has been conservative: we have replaced only the aberrant rows, using the most frequent purchase quantity. For example, a value of ‘9’ has been replaced with ‘3’ in the case of Saffola Gold
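The filtering and zero-insertion steps above can be sketched as follows. This is a minimal illustration, not the procedure actually run for the report: the tuple layout `(date, sku, quantity)`, the function name `daily_series` and the sample rows are all assumptions.

```python
from collections import defaultdict
from datetime import date, timedelta


def daily_series(transactions, sku, start, end):
    """Build a gap-free daily series of units sold for one SKU.

    transactions: iterable of (date, sku, quantity) tuples (assumed layout).
    Days between start and end (inclusive) with no recorded sale get 0.
    """
    totals = defaultdict(int)
    for day, item, quantity in transactions:
        if item == sku and start <= day <= end:  # filtering step
            totals[day] += quantity
    n_days = (end - start).days + 1
    days = [start + timedelta(days=i) for i in range(n_days)]
    return [(day, totals.get(day, 0)) for day in days]  # zero insertion


# Hypothetical transaction rows
transactions = [
    (date(2012, 7, 1), "Saffola Gold Oil 5 Lt. Jar", 3),
    (date(2012, 7, 1), "Saffola Gold Oil 5 Lt. Jar", 6),
    (date(2012, 7, 2), "Ashirwaad Atta 10 Kg PO", 9),
    (date(2012, 7, 3), "Saffola Gold Oil 5 Lt. Jar", 3),
]
for day, units in daily_series(transactions, "Saffola Gold Oil 5 Lt. Jar",
                               date(2012, 7, 1), date(2012, 7, 4)):
    print(day.isoformat(), units)
```

A complete calendar of days matters for the later steps, because seasonal lags (for example lag 7) only line up with the day of the week when no dates are missing from the series.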

Methodology: The sequence of steps in the forecast is shown below:

1. Data Visualization
2. Random Walk Test
3. Benchmarking: Naïve and Seasonal Naïve Forecast
4. Single Layer: Multiple Linear Regression Model
5. Dual Layer: Multiple Linear Regression Model with Residual Forecast
6. Smoothing Method (Additive Holt-Winters)

Caveats: As part of our investigation, we found that customers who purchased certain SKUs (e.g. Saffola) did in fact return for a second purchase after a certain period of time. An alternative method for forecasting Saffola sales would have been to build a forecast of the number of customers visiting the store on any given day to purchase Saffola Gold. We believe this would venture into the realm of econometric models. Given the current data, such a model may have very high explanatory power, but the correlation could not be construed as causation and made the basis for a forecast.

Data Visualization: We first perform a visual check to detect level, trend and seasonality in the data. To detect seasonality, we draw ACF plots of the quantity sold and identify the seasonal lags. For the Saffola SKU, for example, we conclude from the ACF plot that there is 7 day seasonality in the data. The linear trend line plotted for the quantity of Saffola Gold sold per day also indicates that an additive model is needed.
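The ACF check for seasonal lags can be sketched as below. The series is synthetic (a repeating weekly pattern), and the function names are illustrative. The bound of 1.96 divided by the square root of n is the usual approximate 95% white-noise limit, and it is an assumption that the UCI and LCI lines in the ACF plots were computed this way.

```python
def acf(series, max_lag=10):
    """Sample autocorrelation for lags 1..max_lag."""
    n = len(series)
    mu = sum(series) / n
    dev = [x - mu for x in series]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(1, max_lag + 1)]


def significance_bound(n):
    """Approximate 95% bound for the ACF of white noise (assumption)."""
    return 1.96 / n ** 0.5


# Synthetic series: 8 weeks of a weekly pattern with higher weekend sales
weekly_pattern = [4, 5, 4, 6, 5, 12, 14]
series = weekly_pattern * 8

values = acf(series)
bound = significance_bound(len(series))
for lag, value in enumerate(values, start=1):
    marker = "significant" if abs(value) > bound else ""
    print(lag, round(value, 3), marker)
```

On a series with a weekly cycle, the largest autocorrelation among the first ten lags appears at lag 7, which is the signature used above to conclude 7 day seasonality.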

Random Walk Test: We perform the random walk test to determine whether the series at hand can actually be forecasted. We find that none of the series is a random walk.
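The report does not spell out the test procedure. A common textbook version fits an AR(1) model and checks whether the slope coefficient differs from 1, and the sketch below applies that check to the AR1 coefficients and standard errors reported in Appendices V and VI. The function name is hypothetical, and treating the ratio as an ordinary t-statistic is a simplification.

```python
def random_walk_t_stat(ar1_coeff, std_err):
    """t-statistic for the hypothesis that the AR(1) slope equals 1.

    A random walk has a slope of exactly 1, so a large negative value
    suggests the series is not a random walk.
    """
    return (ar1_coeff - 1.0) / std_err


# AR1 coefficient and standard error as reported in the appendices
reported = {
    "Fortune Oil (Appendix V)": (0.3708598, 0.04654105),
    "Aashirwaad Atta (Appendix VI)": (0.22535062, 0.02676752),
}
for name, (coeff, std_err) in reported.items():
    print(name, round(random_walk_t_stat(coeff, std_err), 2))
```

Both slopes lie many standard errors below 1, which is consistent with the conclusion that these series are not random walks.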

Benchmarking: We use the naïve and the seasonal naïve forecast as the benchmark for the final

model.

Error Measure   Naïve Forecast   Seasonal Naïve Forecast
RMSE            13.81094         11.31513
MAPE            150.20%          89.70%
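The two benchmarks are simple enough to state in code. The sketch below assumes a 7 day season; the function names and the sample history are illustrative.

```python
def naive_forecast(history, horizon):
    """Repeat the last observed value for every future day."""
    return [history[-1]] * horizon


def seasonal_naive_forecast(history, horizon, season=7):
    """Repeat the value observed on the same day of the previous cycle."""
    if len(history) < season:
        raise ValueError("need at least one full season of history")
    last_season = history[-season:]
    return [last_season[h % season] for h in range(horizon)]


# Hypothetical two weeks of daily unit sales (Monday to Sunday, twice)
history = [6, 3, 6, 9, 6, 15, 18, 3, 6, 6, 9, 3, 12, 21]
print(naive_forecast(history, 7))
print(seasonal_naive_forecast(history, 7))
```

Any candidate model should beat both of these on the validation week before it is preferred, which is the role the benchmarks play in the comparison above.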

Multiple Linear Regression Model: Based on the trend and seasonality observed, we forecast each SKU using a multiple linear regression model with a linear trend, with seasonality built in through dummy variables, and with Training, Validation and Test partitions. The time periods for these partitions are:

o Training: 1st August 2011 to 24th July 2012

o Validation (1 week): 25th July 2012 to 31st July 2012

o Test (1 month): 1st August 2012 to 31st August 2012
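A regression with a linear trend and six day-of-week dummies can be sketched with ordinary least squares as below. This is an illustration on synthetic, noise-free data rather than the tool used for the report. The helper names, the mapping from day index to position in the weekly cycle, and the choice of baseline day are all assumptions.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.
    Assumes the matrix is non-singular."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def design_row(day_index, season=7):
    """[intercept, trend, six dummies]; cycle position 0 is the baseline.
    Assumption: the position in the week is day_index modulo 7."""
    position = day_index % season
    dummies = [1.0 if position == k else 0.0 for k in range(1, season)]
    return [1.0, float(day_index)] + dummies


def fit_ols(rows, y):
    """Least squares coefficients from the normal equations."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
           for i in range(p)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(coeffs, day_index):
    return sum(c * x for c, x in zip(coeffs, design_row(day_index)))


# Synthetic training data: level 5, slope 0.01, lower sales on six days
effects = [0.0, -5.0, -5.0, -3.0, -6.0, -5.0, -4.0]
days = list(range(1, 71))
sales = [5.0 + 0.01 * t + effects[t % 7] for t in days]

coeffs = fit_ols([design_row(t) for t in days], sales)
print([round(c, 4) for c in coeffs])
print(round(predict(coeffs, 71), 4))
```

Using six dummies for seven days keeps the design matrix full rank: the omitted day is absorbed into the constant term, and each dummy coefficient is read as the difference from that baseline day.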

Multiple Linear Regression Model with Residual Forecast: To further improve the model, we add a layer of residual forecast to it. In the case of the Saffola Gold SKU, we notice that some seasonality (lag 4) still remains in the residuals, which can be captured by an AR(4) model.
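The second layer can be sketched as follows: an AR(p) model is fitted to the regression residuals by least squares, the residuals are forecast forward, and the result is added to the regression forecast. The helper names and the numbers are hypothetical. The AR model is fitted without an intercept, which assumes the residuals have a mean close to zero.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.
    Assumes the matrix is non-singular."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ar(residuals, p):
    """Least squares AR(p) coefficients, no intercept (assumption)."""
    rows = [[residuals[t - k] for k in range(1, p + 1)]
            for t in range(p, len(residuals))]
    y = residuals[p:]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
           for i in range(p)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def forecast_ar(residuals, coeffs, horizon):
    """Iterate the fitted AR model forward, feeding forecasts back in."""
    history = list(residuals)
    forecasts = []
    for _ in range(horizon):
        nxt = sum(c * history[-k] for k, c in enumerate(coeffs, start=1))
        forecasts.append(nxt)
        history.append(nxt)
    return forecasts


def two_layer_forecast(regression_forecast, residual_forecast):
    """Final forecast: regression forecast plus forecast residual."""
    return [a + b for a, b in zip(regression_forecast, residual_forecast)]


# Hypothetical residuals that follow an exact AR(2) pattern
residuals = [1.0, 2.0]
for _ in range(18):
    residuals.append(0.5 * residuals[-1] - 0.3 * residuals[-2])

coeffs = fit_ar(residuals, p=2)
residual_forecast = forecast_ar(residuals, coeffs, horizon=7)
regression_forecast = [10.0] * 7  # placeholder for the first-layer output
print([round(v, 4) for v in two_layer_forecast(regression_forecast,
                                               residual_forecast)])
```

The order p would be chosen from the residual ACF plot, for example p = 4 for Saffola Gold as described above.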

Holt-Winters Smoothing Method: Last but not least, we try a smoothing method to estimate the sales for the various SKUs and find the best values of RMSE and MAPE for the 1 week forecast that we intend to develop. To take the example of Saffola Gold, we obtain the results by running Holt-Winters smoothing with various parameter values.

Summary of results for the various SKUs:

Figure: Time Plot of Actual vs. Forecast (Training Data), with Quantity on the Y-axis and Dayindex on the X-axis

Appendix I: Selection of Top 5 SKUs

Top 5 selling SKUs in 2012 are as shown below:

Total Sales in the Department for the years 2011 and 2012:

Top 5 selling SKUs in 2012 as a percentage of total sales (in INR) in 2012, is shown below:

Item                                Sales (INR)
Saffola Gold Oil 5 Lt. Jar          902,920.47
Bikaji Nmkn Bhujia Sev 1KG          819,325.89
Daawat Devaaya BM Rice 5 Kg. Pack   780,153.15
Ashirwaad Atta 10 Kg PO             587,513.58
Fortune Refine Soya Oil 5 Lt. Jar   538,473.99
Sum of Top 5 SKUs                   3,628,387.08
Total (for all SKUs)                122,781,098.6
% of sales                          2.96%
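The share of sales can be re-derived from the figures in the table above. The short check below uses only those reported values; the variable names are illustrative.

```python
# Sales in INR for 2012, as reported in the table above
top5_sales = {
    "Saffola Gold Oil 5 Lt. Jar": 902_920.47,
    "Bikaji Nmkn Bhujia Sev 1KG": 819_325.89,
    "Daawat Devaaya BM Rice 5 Kg. Pack": 780_153.15,
    "Ashirwaad Atta 10 Kg PO": 587_513.58,
    "Fortune Refine Soya Oil 5 Lt. Jar": 538_473.99,
}
total_all_skus = 122_781_098.6

top5_total = sum(top5_sales.values())
share_percent = 100.0 * top5_total / total_all_skus

print(round(top5_total, 2))     # sum of the top 5 SKUs
print(round(share_percent, 2))  # share of total sales, in percent
```

The computed sum and share agree with the 3,628,387.08 and 2.96% shown in the table.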

Appendix II: Line Graphs for SKUs

Figure: Haldiram Sev Bhujia (Quantity Sold on Y-axis and dates on X-axis)

Figure: Fortune Refined Oil (Quantity Sold on Y-axis and dates on X-axis)

Figure: Daawat Rice (Quantity Sold on Y-axis and dates on X-axis)

Figure: Aashirwaad Atta (Quantity Sold on Y-axis and dates on X-axis)

Appendix III: Bikaji Sev Bhujia

A histogram of the data was plotted after data cleaning. The ACF plot shows a seven day seasonality that can be exploited during forecasting.

Visualizing the data: The data is fairly noisy, with close to no trend.

Naïve forecast and the Seasonal Naive Forecast:

Multiple Linear Regression Plot using a 7 day seasonality (modelled using dummy variables)

Figure: Histogram of Quantity Sold

Figure: ACF Plot for Quantity Sold (ACF, UCI, LCI)

Figure: Quantity Sold with linear trend line and 2 period moving average

Figure: Quantity Sold vs. Naïve Forecast

Figure: Quantity Sold vs. Seasonal Naïve Forecast

The ACF plot for the residuals shows that some signal remains in the residuals.

We see a good fit once we capture the signal in the residuals using an AR(5) model and add it back to our original forecast. However, for the one week forecast that we wish to make, we do not get a better result than the seasonal naïve forecast on the error measures chosen.

Figure: Quantity Sold vs. Forecast (Actual Value and New Forecast)

ACF Plot for residuals shows that there is no further signal to be captured.

Forecast for Bikaji Namkeen, for the next 7 days using the multiple linear regression with AR(5)

model for the residuals:

Figure: Actual vs. Forecast for days 1 to 7

Appendix IV: Daawat Rice

Day of the week break down for Daawat Rice is shown below. We notice that there is a clear

difference in levels for weekends as compared to weekdays.

Data preparation: We see that there are a number of missing values and negative values. For the missing values, we use our judgement to conclude that these were recording errors and replace them with the average value in the data.

Quantity bought per Transaction Number of such transactions

3 1056

Grand Total 1166

Figure: Seasonal Naïve Forecast (Quantity Sold and Naïve Forecast)

We show a histogram of the data after correcting for missing values and negative values. We also show the ACF plot, which indicates seven day seasonality in the data.

The Training and Validation Actual vs. Forecasted plot (MLR with seven day seasonality) and the respective residual plot are shown below. We then check the ACF plot for the residuals.

Figure: Histogram of Quantity Sold

Figure: ACF plot, lags 0 to 10 (ACF, UCI, LCI)

Figure: Training and Validation, Actual Value vs. Predicted Value

Figure: Training and Validation Residuals

We check the residual plots and see that some seasonality has not been captured, so we attempt to capture it with an AR model.

Results of the multiple linear regression with an overlay of the residual forecasts show a much better fit. Additionally, the residual plots do not show any further signal that can be captured.

Figure: ACF Plot for Residual Combined (ACF, UCI, LCI)

Figure: Quantity Sold vs. ARIMA Forecast

Figure: ACF Plot for New Residuals (ACF, UCI, LCI)

Appendix V: Fortune Oil

Visualizing the data:

Sales Dominant on Weekends:

Data Clean Up: Negative quantity values are replaced with the lowest value (assuming those were

data entry errors). For missing values we’ve inserted zero value rows.

Quantity bought per transaction No. of such transactions

Grand Total 583

Test for Random Walk:

ARIMA Model

ARIMA         Coeff        StErr        p-value
Const. term   2.80973697   0.36031583   0
AR1           0.3708598    0.04654105   0

Naïve Forecast:

Residuals & Errors:

Seasonal Naïve Forecast: Not much improvement since there are many intermediate missing values

which cause even the seasonal model to break quite often.

We then check for seasonality in data. We notice that seasonality is most pronounced for day 7.

Figure: Quantity Sold, Naïve Forecast and Residual

Figure: ACF plot, lags 0 to 10 (ACF, UCI, LCI)

Multiple Linear Regression with 7 day seasonality:

The Regression Model

Input variables   Coefficient    Std. Error   p-value      SS
Constant term     5.91226006     0.94501734   0            5970.183594
Dayindex          0.01308713     0.00286613   0.00000684   670.5512085
Weekday_1         -4.97293854    1.10906196   0.0000099    37.27494812
Weekday_2         -5.15910244    1.10904717   0.00000462   76.001297
Weekday_3         -3.24176908    1.11446846   0.0038592    22.27444077
Weekday_4         -5.84309149    1.11444271   0.00000023   235.9051971
Weekday_5         -5.44441414    1.11442423   0.00000153   332.5310669
Weekday_6         -4.63397169    1.11441314   0.00004032   547.5755615

Training Data scoring - Summary Report

Total sum of squared errors   RMS Error     Average Error
11115.70246                   5.564437033   -1.33148E-08

Validation Data scoring - Summary Report

Total sum of squared errors   RMS Error     Average Error
113.1180556                   4.019915699   -1.334990053
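The RMS errors in the two summary reports can be cross-checked against the total sums of squared errors. The sketch below assumes one observation per calendar day in the Training and Validation periods stated in the Technical Summary; the function name is illustrative.

```python
from datetime import date
from math import sqrt


def rms_error(total_squared_error, n_observations):
    """Root mean squared error from a total sum of squared errors."""
    return sqrt(total_squared_error / n_observations)


# Assumption: one observation per calendar day in each partition
n_train = (date(2012, 7, 24) - date(2011, 8, 1)).days + 1  # 359 days
n_valid = (date(2012, 7, 31) - date(2012, 7, 25)).days + 1  # 7 days

train_rms = rms_error(11115.70246, n_train)
valid_rms = rms_error(113.1180556, n_valid)
print(n_train, round(train_rms, 4))
print(n_valid, round(valid_rms, 4))
```

Under these assumptions the computed values agree with the reported RMS errors, which also suggests that the training partition contains one row for each day from 1st August 2011 to 24th July 2012.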

Actual Vs Forecast and Residuals:

Figure: Quantity Sold vs. Forecast, with Residuals

ACF Plot for Residuals of the multiple linear regression

Plot of the multiple linear regression with an overlay of the AR model for the residuals: we see that there is still some signal remaining in the residuals. However, on running the model, we do not see a significant improvement in the error measures.

Figure: ACF Plot for E (ACF, UCI, LCI)

Appendix VI: Aashirwaad Atta

Visualizing the data:

Sales dominant on weekends:

Outliers and Missing Data:

Transaction Size   Count of Customer_No
3                  1001
6                  114
9                  28
12                 8
15                 5
18                 1
21                 2
24                 2
27                 1
Grand Total        1166

Missing values are replaced with zero. Spikes are treated as outliers and are replaced with closest

values that fall within our tolerance range. Most such cases are due to one single customer doing

infrequent bulk purchases.

Random Walk Test: We see that the

ARIMA         Coeff        StErr        p-value
Const. term   8.31235886   0.61765313   0
AR1           0.22535062   0.02676752   0

Naïve Forecast:

Residuals:

Figure: Quantity Sold, Naïve Forecast and Residual

ACF Plot to detect seasonality:

ACF Values

0.22546686

0.01018919

0.07472585

0.01640873

-0.0139351

0.03910767

0.21895115

0.03814469

0.10238601

0.07832895

Multiple Linear Regression with seven day seasonality:

Residuals:

Figure: Quantity Sold vs. Forecast, with Residuals

Figure: ACF plot, lags 0 to 10 (ACF, UCI, LCI)

Multiple Linear Regression with overlay of the AR(7) for the residuals:

Figure: Quantity Sold vs. Double Layer Forecast
