mma863 - individual assignment - simon campbell - revised
MMA863 – Mathematical Foundations for Analytics Simon Campbell, submitted June 24th, 2015
MMA863 Mathematical Foundations for Analytics
Jeff McGill
Individual Assignment June 24th, 2015
Simon Campbell
Order of files:
Filename                                              Pages   Comments and/or Instructions
MMA863 – Individual Assignment – Simon Campbell.pdf   8
Additional Comments:
1. In screening the data for Total Number of Units by Case, it is quickly clear that cases 48, 83, and 141 each
have a total of 0 policies sold. This seems like unusual data, and for the purposes of determining how FTE
levels affect sales, we will remove these 3 points. It is also worth noting that these branches have 2, 3, and 4
FTEs respectively.
Having removed those cases, next I created a scatter plot illustrating the relationship between the number of
FTEs and the number of Clients.
From this, it is clear that 4 more data points are concerning. Case 134 has 30 FTEs and only 307 clients. This
is concerning because (with the modified data) the average number of FTEs is 5 and the average number of clients
is 1,241. There are also 3 cases that have 9,999 clients (two have 3 FTEs, so the points on the scatter plot overlap).
These cases are concerning for the same reason as case 134: the numbers are clearly outside of what we
would expect and will distort our analysis results. For this reason, we will remove cases 134, 96, 185, and 348
from the study. In doing so, we can see that the remaining points have a much clearer relationship between
the Total Number of Clients and FTEs. The relationship appears to be fairly linear.
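The screening rules described above can be sketched in Python. This is only a minimal sketch, not the procedure actually used for this assignment: the field names (`tot_units`, `clients`, `fte`), the `screen` helper, and the `MANUAL_EXCLUSIONS` set are hypothetical, and only values stated in this write-up are filled in (`None` marks values that are not reported).

```python
# Hypothetical field names; None marks values this write-up does not report.
RECORDS = [
    {"case": 48, "tot_units": 0, "clients": None, "fte": 2},
    {"case": 83, "tot_units": 0, "clients": None, "fte": 3},
    {"case": 141, "tot_units": 0, "clients": None, "fte": 4},
    {"case": 134, "tot_units": None, "clients": 307, "fte": 30},
    {"case": 96, "tot_units": 581, "clients": 9999, "fte": 2},
    {"case": 185, "tot_units": 803, "clients": 9999, "fte": 3},
    {"case": 348, "tot_units": 1008, "clients": 9999, "fte": 3},
]

SUSPECT_CLIENTS = 9999      # client count far outside the expected range
MANUAL_EXCLUSIONS = {134}   # flagged by eye on the FTE vs. clients scatter plot


def screen(records):
    """Return {case: reason} for every record that should be removed."""
    flagged = {}
    for rec in records:
        if rec["tot_units"] == 0:
            flagged[rec["case"]] = "zero policies sold"
        elif rec["clients"] == SUSPECT_CLIENTS:
            flagged[rec["case"]] = "client count of 9,999"
        elif rec["case"] in MANUAL_EXCLUSIONS:
            flagged[rec["case"]] = "FTE count out of line with clients"
    return flagged


flagged = screen(RECORDS)
for case, reason in sorted(flagged.items()):
    print(case, reason)
```

Writing the rules down this way keeps a record of why each case was dropped, which makes the cleaning step easier to repeat.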
2. Based on this frequency histogram, the data does not appear to follow a normal distribution. It is positively
skewed, and appears to better represent a binomial distribution.
Case  BranchCode  Number of Home Policies  Number of Extended Home Policies  Number of Auto Policies  TotUnits  Total Num of Clients  NumFTESales  BranchType  Region
96    4709        413                      161                               7                        581       9999                  2            Suburb      Other
185   5108        56                       305                               442                      803       9999                  3            Suburb      South
348   1934        396                      162                               450                      1008      9999                  3            Suburb      GTA
3. In both simple and multiple regression we assume that the dependent variable follows an
approximately linear relationship with the independent variable(s) and that the residuals
(errors) are approximately normally distributed. The dependent variable data itself does not have to follow a
normal distribution; therefore this is not an issue.
4. City branches appear to have fairly similar performance in terms of number of accounts across all regions.
They also tend to have less variability (wobble) in terms of number of accounts (particularly when compared to
Suburban branches). Suburban branches on average have the most accounts; however, their variability tends to be
much higher than that of the City branches. There are also far fewer Suburban branches than City branches. The
Rural branch types have a wide range of performance in terms of the number of accounts by region. Rural branches
also have a large discrepancy in terms of variation between regions. Rural branches make up the smallest
number of branches.
5. No, there does not appear to be a significant difference in mean total units across regions for City
branches. The means range from 1,234 to 1,310, a spread of only 76 units. The average numbers of units sold
per branch therefore fall relatively close to each other, as all of the standard deviation values (levels of
wobble) are far greater than 76 units.
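As a rough cross-check of this answer, the comparison can also be framed as a one-way ANOVA F statistic computed from the City summary statistics reported in this assignment (count, sum, and sample variance for each region). This is a sketch of a different technique from the eyeball comparison used above, not the method of the original answer, and the variable names are my own.

```python
# City branches by region: (count, mean, sample variance), taken from the
# descriptive-statistics output for TotUnits (mean = Sum / Count).
groups = {
    "GTA": (135, 176893 / 135, 288561.2),
    "Other": (26, 32075 / 26, 217388.7954),
    "South": (99, 126454 / 99, 300475.8295),
}

n_total = sum(n for n, _, _ in groups.values())
grand_mean = sum(n * m for n, m, _ in groups.values()) / n_total

# Between-group and within-group sums of squares.
ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups.values())
ss_within = sum((n - 1) * v for n, _, v in groups.values())

df_between = len(groups) - 1
df_within = n_total - len(groups)

# F = mean square between groups / mean square within groups.
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f"F({df_between}, {df_within}) = {f_stat:.3f}")
```

An F statistic well below 1 means the differences between the regional means are smaller than what the within-region wobble alone would produce, which agrees with the conclusion above.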
6. For the purposes of this question, I will conduct a two-sided test as I will assume that although management
expected there to be 1,200 unit sales in 2014, they would have been happy with any actual value within a
reasonable range of that number.
Significance level 5%
Expected Value 1200
H0: μ = 1200
Sum of TotUnits            GTA_City    Other_City     South_City
Mean                       1310        1234           1277
Standard Error             46.233      91.43909183    55.09182688
Median                     1189        1096           1159
Mode                       901         #N/A           #N/A
Standard Deviation         537.1789    466.2497135    548.1567563
Sample Variance            288561.2    217388.7954    300475.8295
Kurtosis                   0.62935     0.140056912    0.413468932
Skewness                   0.810531    0.665023783    0.850359333
Range                      2887        1857           2620
Minimum                    397         335            416
Maximum                    3284        2192           3036
Sum                        176893      32075          126454
Count                      135         26             99
Confidence Level(95.0%)    91.44081    188.3223349    109.3279375
Branch Type  Region  Mean    Standard Deviation  # of Branches
City         GTA     1,310   537                 135
City         Other   1,234   466                 26
City         South   1,277   548                 99
Rural        GTA     n/a     n/a                 n/a
Rural        Other   807     381                 4
Rural        South   1,780   865                 16
Suburb       GTA     1,784   1,074               55
Suburb       Other   1,881   808                 9
Suburb       South   1,516   794                 30
H1: μ ≠ 1200
Sample Mean 1779.625
Sample Standard deviation 864.614
Standard error 152.844
Mean - Expected Value 579.625
$s_{\bar{x}}$ 216.1536
t-value = 2.681543
p-value = 0.017079
With a null hypothesis that sales would be 1,200 units in 2014, the p-value of 0.017 is below the 5%
significance level, so we can reject the null hypothesis and say that the observed mean of 1,780 units
deviated significantly from what was expected.
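The test statistic and p-value above can be reproduced with a short Python sketch that uses only the standard library. The sample size `n = 16` is an assumption on my part: it is the branch count of the Rural/South group in the summary table, whose mean (1,780) and standard deviation (865) match this sample, and it reproduces the reported $s_{\bar{x}}$ of 216.1536. The helper names `t_pdf` and `two_sided_p` are hypothetical, and the p-value is obtained by numerically integrating the t density rather than with Excel's built-in function.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value using Simpson's rule on the t density over [0, |t|]."""
    t = abs(t)
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # P(0 < T < |t|)
    return 1 - 2 * area


sample_mean = 1779.625
sample_sd = 864.614
expected = 1200
n = 16  # assumption: inferred from the summary table, not stated in the answer

se = sample_sd / math.sqrt(n)            # standard error of the mean
t_stat = (sample_mean - expected) / se   # one-sample t statistic
p_value = two_sided_p(t_stat, n - 1)     # two-sided, n - 1 degrees of freedom
print(f"SE = {se:.4f}, t = {t_stat:.4f}, p = {p_value:.4f}")
```

Because the p-value is two-sided, it counts deviations above and below 1,200 equally, which matches the choice of a two-sided test made at the start of this question.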
7. It should not be surprising that each of the policy counts has a moderate correlation with the total number
of units (policies) sold, as the total number of units sold is calculated from these numbers. Also, it should
not be surprising to see a very strong correlation between the total number of FTE Sales people and the total
number of clients. We could reasonably expect one to follow the other, and after having cleaned the data, we
know from question 1 that they have a fairly linear relationship.
8. About 15% of the variability in Total Units sold can be accounted for by the number of FTEs. On average,
the model is missing by about 657 units (plus or minus).
9. Yes, this is highly statistically significant. With a p-value of 9.6E-15, it is extremely unlikely that we
would see a relationship this strong if the number of FTEs had no impact on Total Units sold. However, since we
are only accounting for about 15% of the variability in Total Units sold, there is room to improve this model
by adding other variables to the analysis.
                                  Number of      Number of Extended  Number of                Total Num   NumFTE
                                  Home Policies  Home Policies       Auto Policies  TotUnits  of Clients  Sales
Number of Home Policies           1
Number of Extended Home Policies  0.192937       1
Number of Auto Policies           0.084249       0.012225            1
TotUnits                          0.672859       0.658332            0.559753       1
Total Num of Clients              0.184491       0.17866             0.215022       0.304491  1
NumFTESales                       0.244188       0.242525            0.244813       0.386107  0.922868    1
Regression Statistics
Multiple R 0.386106925
R Square 0.149078558
Adjusted R Square 0.146791135
Standard Error 656.7274794
Observations 374
ANOVA
            df    SS         MS         F         Significance F
Regression  1     28108588   28108588   65.17314  9.60984E-15
Residual    372   1.6E+08    431291
Total       373   1.89E+08
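The pieces of this regression output are tied together by a few identities: R Square is the regression sum of squares over the total, F is the ratio of the mean squares, the Standard Error is the square root of the residual mean square, and in simple regression F equals the square of the slope's t Stat. The sketch below checks those identities with the reported figures. Since the residual SS is only printed as 1.6E+08, it is rebuilt here from MS × df, which is my own workaround rather than a value read from the output.

```python
import math

# Figures copied from the regression output in this assignment.
ss_regression = 28108588
ms_residual = 431291
df_regression, df_residual = 1, 372
multiple_r = 0.386106925
slope_t_stat = 8.072988  # t Stat for NumFTESales in the coefficient table

# Residual SS rebuilt from MS * df because the printed value is rounded.
ss_residual = ms_residual * df_residual
ss_total = ss_regression + ss_residual

r_square = ss_regression / ss_total                      # share of variation explained
f_stat = (ss_regression / df_regression) / ms_residual   # MS regression / MS residual
std_error = math.sqrt(ms_residual)                       # standard error of the regression

print(f"R Square       = {r_square:.6f} (Multiple R squared = {multiple_r ** 2:.6f})")
print(f"F              = {f_stat:.5f} (t Stat squared = {slope_t_stat ** 2:.5f})")
print(f"Standard Error = {std_error:.4f}")
```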
10. The intercept suggests that with no FTEs at all we would sell about 645 Total Units.
11. With each FTE we add, we can expect to sell approximately 143 additional units.
12. The slope estimate is 143, and with 90% confidence we can expect the true slope to be between 114 and 172
units.
Margin of error $= \pm\, t \cdot S_b$, where $S_b$ is the standard error of the slope
$t$ = T.INV.2T(1-0.9, 374-2) = 1.65
Margin of error $= \pm\, 1.65 \times 17.73 = 29.24$
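The same interval can be computed in a few lines of Python. This is a small sketch using the slope and its standard error from the coefficient table, with the critical value taken as the rounded 1.65 quoted above; the variable names are my own.

```python
slope = 143.1645792
slope_se = 17.73377743   # standard error of the slope
t_crit = 1.65            # rounded value of T.INV.2T(1-0.9, 374-2) quoted above

# Margin of error and the two ends of the 90% confidence interval.
margin = t_crit * slope_se
lower, upper = slope - margin, slope + margin
print(f"90% CI for the slope: {lower:.0f} to {upper:.0f} (margin {margin:.2f})")
```

With the rounded critical value the margin comes out at about 29.26 rather than 29.24, but the interval still rounds to 114 and 172 units.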
13. $\hat{y} = \text{intercept} + \text{slope} \cdot X$
$\hat{y} = 645 + 143(6) = 1{,}503$
Using the formula $\hat{y} \pm t_{n-2}\, S_e \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{(n-1) S_x^2}}$ we can determine that at 95% CI we can expect that this value
will be ± 67 units (from 1,437 to 1,570).
14. $\hat{y} = 645 + 143(6) = 1{,}503$
Using the formula $\hat{y} \pm t_{n-2}\, S_e \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{(n-1) S_x^2}}$ we can determine that at 95% CI we can expect that this
value will be ± 1,293 units (from 210 to 2,797).
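The intervals in questions 13 and 14 differ only by the extra 1 under the square root, and a short Python sketch makes the effect visible. Two assumptions are made here and are mine, not part of the original working: the 95% critical value is backed out of the reported Lower/Upper 95% bounds for the slope, and the $(x_p - \bar{x})^2 / ((n-1) S_x^2)$ term is set to 0 because $\bar{x}$ and $S_x$ are not reported in this write-up.

```python
import math

intercept = 644.568215
slope = 143.1645792
s_e = 656.7274794   # standard error of the regression
n = 374
x_p = 6             # number of FTEs being evaluated

# 95% critical value backed out of the reported bounds for the slope:
# (Upper 95% - coefficient) / standard error of the slope.
t_crit = (178.0355965 - 143.1645792) / 17.73377743

# Assumption: the distance term is treated as 0 because x_bar and S_x
# are not reported here.
distance_term = 0.0

y_hat = intercept + slope * x_p
ci_half_width = t_crit * s_e * math.sqrt(1 / n + distance_term)      # mean response
pi_half_width = t_crit * s_e * math.sqrt(1 + 1 / n + distance_term)  # single branch

print(f"y_hat = {y_hat:.1f}")
print(f"95% CI half-width = {ci_half_width:.0f}")
print(f"95% PI half-width = {pi_half_width:.0f}")
```

With unrounded coefficients the point estimate is about 1,503.6; the 1,503 above comes from the rounded 645 and 143. Under the stated assumptions the half-widths round to 67 and 1,293, in line with the values reported in the two answers.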
15. This model does not appear to be useful for predicting policies sold. The amount of variability (wobble)
around our predictions is extremely large. There is certainly a lot of room to improve this model; adding more
variables would potentially be helpful.
16. The four main regression assumptions are:
Approximate linearity: We know that we have approximate linearity in this model between our
independent and dependent variables.
Normal distribution of errors: The residual histogram appears to follow a somewhat normal
distribution, although it is fairly right-skewed. For our purposes, the actual distribution follows the
theoretical one closely enough for the errors to be considered approximately normally distributed.
Constant variance of errors: As depicted in the charts below, we do not appear to have a constant
variance in our errors. There appears to be evidence of heteroscedasticity, and a transformation of
the data appears to be required. A log transformation might be appropriate as it would accommodate
diminishing returns which would likely be realized from adding more FTEs.
Independence of errors: Our errors appear to have a greater variance as we increase the predicted
number of policies sold or add more FTEs. This pattern speaks mainly to non-constant variance, so the
charts do not allow us to confirm that the errors are independent.
             Coefficients  Standard Error  t Stat    P-value   Lower 95%   Upper 95%  Lower 95.0%  Upper 95.0%
Intercept    644.568215    100.8191        6.393317  4.87E-10  446.321481  842.8149   446.3215     842.8149
NumFTESales  143.1645792   17.73378        8.072988  9.61E-15  108.293562  178.0356   108.2936     178.0356

             Coefficients  Standard Error  t Stat    P-value   Lower 95%   Upper 95%    Lower 90.0%  Upper 90.0%
Intercept    644.568215    100.8190679     6.393317  4.87E-10  446.321481  842.8149489  478.3215984  810.8148315
NumFTESales  143.1645792   17.73377743     8.072988  9.61E-15  108.293562  178.0355965  113.9222885  172.40687
17. The “Other” region has the highest percent variation in Total Units explained by the number of
FTEs.
The “Other” region has the least statistical significance. This makes sense: statistical
significance is a measure of how unlikely a result would be if there were no real effect. The “Other” region
has the most variability, or wobble. Because of this, there is a wider margin (“safety-net”) around its
estimate, which means that the results are less precise and therefore not as surprising.
The “Other” region appears to be the most sensitive to increases in the number of FTEs.
This could potentially be due to random error, as the Other region has the highest wobble (as
measured by the standard error of the slope). This reflects that the Other region has the least precise
data, and therefore the answers to the previous questions could be a matter of random error.
Region       N    R-Square  Adjusted R-Square  Intercept  Slope   Standard Error of Slope  P-value
All Regions  374  0.1491    0.1468             644.57     143.16  17.73                    9.61E-15
GTA          190  0.1301    0.1254             672.18     142.06  26.80                    3.20E-07
South        145  0.1528    0.1469             674.71     135.51  26.68                    1.17E-06
Other        39   0.2683    0.2485             412.50     173.78  47.18                    7.31E-04
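The link between the slope, its standard error, and statistical significance can be made concrete by computing the t statistic (slope divided by its standard error) for each row of the table. The sketch below uses the rounded slopes and standard errors as reported; the dictionary layout and variable names are my own.

```python
# (slope, standard error of slope) for each regression, as reported.
regions = {
    "All Regions": (143.16, 17.73),
    "GTA": (142.06, 26.80),
    "South": (135.51, 26.68),
    "Other": (173.78, 47.18),
}

# t statistic for the slope: how many standard errors it sits away from 0.
t_stats = {name: slope / se for name, (slope, se) in regions.items()}

for name, t in sorted(t_stats.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:12s} t = {t:.2f}")

weakest = min(t_stats, key=t_stats.get)
print("Least significant slope:", weakest)
```

Although the Other region has the steepest slope, its large standard error gives it the smallest t statistic, which is why it also has the largest p-value.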