mma863 - individual assignment - simon campbell - revised

MMA863 – Mathematical Foundations for Analytics Simon Campbell, submitted June 24th, 2015


MMA863 Mathematical Foundations for Analytics

Jeff McGill

Individual Assignment June 24th, 2015

Simon Campbell

Order of files:

Filename | Pages | Comments and/or Instructions
MMA863 – Individual Assignment – Simon Campbell.pdf | 8 |

Additional Comments:



1. In screening the data for Total Number of Units by Case, it is quickly clear that cases 48, 83, and 141 each have a total of 0 policies sold. This seems like unusual data, and for the purposes of determining how FTE levels affect sales, we will remove these 3 points. It is also worth noting that these branches have 2, 3, and 4 FTEs respectively.

Having removed those cases, I next created a scatter plot illustrating the relationship between the number of FTEs and the number of Clients.

From this, it is clear that 4 more data points are concerning. Case 134 has 30 FTEs and only 307 clients. This is concerning as (with the modified data) the average number of FTEs is 5 and the average number of clients is 1,241. There are also 3 cases that have 9,999 clients (two have 3 FTEs, so the points on the scatter plot overlap). These cases are concerning for the same reason as case 134: the numbers are clearly outside of what we would expect and will distort our analysis results. For this reason, we will remove cases 134, 96, 185, and 348 from the study. In doing so, we can see that the remaining points have a much clearer relationship between the Total Number of Clients and FTEs. The relationship appears to be fairly linear.
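The screening rules described in this answer can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used (the screening was done by inspecting the data and the scatter plot). The rows for cases 96, 185 and 348 use the values reported in the assignment; the client count for case 48, the unit count for case 134, and the two "kept" branches (cases 9001 and 9002) are invented placeholders so that the example runs.

```python
# Mini-sample for illustration only. Values marked "invented" are placeholders,
# not figures from the assignment's data set.
branches = [
    {"case": 48, "tot_units": 0, "clients": 500, "fte": 2},       # clients invented
    {"case": 96, "tot_units": 581, "clients": 9999, "fte": 2},
    {"case": 134, "tot_units": 900, "clients": 307, "fte": 30},   # tot_units invented
    {"case": 185, "tot_units": 803, "clients": 9999, "fte": 3},
    {"case": 348, "tot_units": 1008, "clients": 9999, "fte": 3},
    {"case": 9001, "tot_units": 1200, "clients": 1100, "fte": 5},  # invented branch
    {"case": 9002, "tot_units": 1500, "clients": 1400, "fte": 6},  # invented branch
]

SUSPECT_CLIENTS = 9999    # client count the text treats as suspect
MANUAL_OUTLIERS = {134}   # flagged by eye on the FTE vs. clients scatter plot


def screening_reason(row):
    """Return why a branch is excluded, or None if it is kept."""
    if row["tot_units"] == 0:
        return "zero units sold"
    if row["clients"] == SUSPECT_CLIENTS:
        return "9,999 clients"
    if row["case"] in MANUAL_OUTLIERS:
        return "visual outlier (FTEs vs. clients)"
    return None


kept = [r["case"] for r in branches if screening_reason(r) is None]
removed = {r["case"]: screening_reason(r) for r in branches if screening_reason(r)}

print("kept:", kept)
print("removed:", removed)
```

Recording a reason for each excluded case keeps the screening auditable, which matters when points are removed on judgment rather than by a fixed rule.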

2. Based on this frequency histogram, the data does not appear to follow a normal distribution. It is positively

skewed, and appears to better represent a binomial distribution.

Case | BranchCode | Number of Home Policies | Number of Extended Home Policies | Number of Auto Policies | TotUnits | Total Num of Clients | NumFTESales | BranchType | Region
96 | 4709 | 413 | 161 | 7 | 581 | 9999 | 2 | Suburb | Other
185 | 5108 | 56 | 305 | 442 | 803 | 9999 | 3 | Suburb | South
348 | 1934 | 396 | 162 | 450 | 1008 | 9999 | 3 | Suburb | GTA



3. In both simple and multiple regression we assume that the dependent variable has an approximately linear relationship with the independent variable(s) and that the residuals (errors) are approximately normally distributed. The dependent variable itself does not have to follow a normal distribution, so this is not an issue.

4. City branches appear to have fairly similar performance in terms of number of accounts across all regions. They also tend to have less variability (wobble) in terms of number of accounts, particularly when compared to Suburban branches. Suburban branches on average have the most accounts; however, their variability tends to be much higher than that of the City branches. There are also far fewer Suburban branches than City branches. The Rural branches have a wide range of performance in terms of the number of accounts by region, and their variation also differs considerably between regions. Rural branches make up the smallest number of branches.

5. No, there does not appear to be a significant difference in mean total units across regions for City branches. The means range from 1,234 to 1,310, a spread of only 76 units. The regional averages therefore fall relatively close to each other, since every group's standard deviation (level of wobble) is far greater than 76 units.
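One rough way to put numbers on this comparison is a pairwise Welch t statistic built from the rounded City figures in the summary table (mean, standard deviation, number of branches). This is a sketch under that assumption and is not the method used in the assignment; as a rule of thumb, an absolute t value well below about 2 gives no evidence of a difference in means.

```python
import math
from itertools import combinations

# (mean, standard deviation, number of branches) for City branches,
# rounded values taken from the assignment's summary table.
city = {
    "GTA": (1310, 537, 135),
    "Other": (1234, 466, 26),
    "South": (1277, 548, 99),
}


def welch_t(a, b):
    """Welch t statistic for the difference between two group means."""
    (m1, s1, n1), (m2, s2, n2) = a, b
    standard_error = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (m1 - m2) / standard_error


t_stats = {
    (r1, r2): welch_t(city[r1], city[r2]) for r1, r2 in combinations(city, 2)
}

for pair, t in t_stats.items():
    print(pair, round(t, 2))
```

Every pairwise statistic comes out below 1 in absolute value, which is consistent with the conclusion that the City means do not differ significantly across regions.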

6. For the purposes of this question, I will conduct a two-sided test as I will assume that although management

expected there to be 1,200 unit sales in 2014, they would have been happy with any actual value within a

reasonable range of that number.

Significance level: 5%

Expected value: 1200

H0: μ = 1200

Statistic | Sum of TotUnits GTA_City | Sum of TotUnits Other_City | Sum of TotUnits South_City
Mean | 1310 | 1234 | 1277
Standard Error | 46.233 | 91.43909183 | 55.09182688
Median | 1189 | 1096 | 1159
Mode | 901 | #N/A | #N/A
Standard Deviation | 537.1789 | 466.2497135 | 548.1567563
Sample Variance | 288561.2 | 217388.7954 | 300475.8295
Kurtosis | 0.62935 | 0.140056912 | 0.413468932
Skewness | 0.810531 | 0.665023783 | 0.850359333
Range | 2887 | 1857 | 2620
Minimum | 397 | 335 | 416
Maximum | 3284 | 2192 | 3036
Sum | 176893 | 32075 | 126454
Count | 135 | 26 | 99
Confidence Level(95.0%) | 91.44081 | 188.3223349 | 109.3279375

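BranchType | Region | Mean | Standard Deviation | # of Branches
City | GTA | 1,310 | 537 | 135
City | Other | 1,234 | 466 | 26
City | South | 1,277 | 548 | 99
Rural | GTA | n/a | n/a | n/a
Rural | Other | 807 | 381 | 4
Rural | South | 1,780 | 865 | 16
Suburb | GTA | 1,784 | 1,074 | 55
Suburb | Other | 1,881 | 808 | 9
Suburb | South | 1,516 | 794 | 30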



H1: μ ≠ 1200

Sample mean: 1779.625

Sample standard deviation: 864.614

Standard error: 152.844

Mean - expected value: 579.625

$s_{\bar{x}}$: 216.1536

t-value = 2.681543

p-value = 0.017079

With a null hypothesis that sales would be 1,200 units in 2014, we can reject the null hypothesis at the 5% significance level (p = 0.017 < 0.05) and say that the observed mean of 1,780 units deviated significantly from what was expected.
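The test statistic and p-value above can be reproduced with a short Python sketch that uses only the standard library. The sample size n = 16 is an assumption: it is not stated alongside the test, but it is the value implied by the reported 216.1536 = 864.614 / √16. The p-value is obtained by numerically integrating the Student t density, which stands in for Excel's built-in t-distribution function.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value using Simpson's rule on the t density over [0, |t|]."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3  # P(0 < T < |t|)
    return 1 - 2 * central_half


sample_mean = 1779.625
sample_sd = 864.614
expected = 1200
n = 16  # ASSUMPTION: inferred from the reported standard error of the mean

se_mean = sample_sd / math.sqrt(n)
t_stat = (sample_mean - expected) / se_mean
p_value = two_sided_p(t_stat, df=n - 1)

print(round(se_mean, 4), round(t_stat, 4), round(p_value, 4))
```

Because the p-value falls below the 5% significance level, the sketch leads to the same decision: reject H0.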

7. It should not be surprising that each of the policy counts has a moderate correlation with the total number of units (policies) sold, as the total number of units sold is calculated from these numbers. Also, it should not be surprising to see a very strong correlation between the total number of FTE sales people and the total number of clients. We could reasonably expect one to follow the other, and after having cleaned the data, we know from question 1 that they have a fairly linear relationship.

8. About 15% of the variability in Total Units sold can be accounted for by the number of FTEs. On average,

the model is missing by about 657 units (plus or minus).
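Both figures can be recovered from the ANOVA table in the regression output. The short sketch below uses the reported regression sum of squares, residual mean square, and residual degrees of freedom; the variable names are illustrative only.

```python
import math

ss_regression = 28_108_588  # ANOVA table: Regression SS
ms_residual = 431_291       # ANOVA table: Residual MS
df_residual = 372           # ANOVA table: Residual df

ss_residual = ms_residual * df_residual
ss_total = ss_regression + ss_residual

r_squared = ss_regression / ss_total   # share of variability explained
std_error = math.sqrt(ms_residual)     # typical size of a prediction miss
f_stat = ss_regression / ms_residual   # regression df is 1, so MS equals SS

print(round(r_squared, 4), round(std_error, 1), round(f_stat, 2))
```

The results agree with the R Square, Standard Error, and F values in the regression output.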

9. Yes, this is highly statistically significant. With a p-value of 9.6E-15, it is extremely unlikely that we would see a relationship this strong if the number of FTEs had no impact on Total Units sold. However, since we are only accounting for about 15% of the variability in Total Units sold, there is room to improve this model by adding other variables to the analysis.

 | Number of Home Policies | Number of Extended Home Policies | Number of Auto Policies | TotUnits | Total Num of Clients | NumFTESales
Number of Home Policies | 1 | | | | |
Number of Extended Home Policies | 0.192937 | 1 | | | |
Number of Auto Policies | 0.084249 | 0.012225 | 1 | | |
TotUnits | 0.672859 | 0.658332 | 0.559753 | 1 | |
Total Num of Clients | 0.184491 | 0.17866 | 0.215022 | 0.304491 | 1 |
NumFTESales | 0.244188 | 0.242525 | 0.244813 | 0.386107 | 0.922868 | 1

Regression Statistics

Multiple R 0.386106925

R Square 0.149078558

Adjusted R Square 0.146791135

Standard Error 656.7274794

Observations 374

ANOVA

Source | df | SS | MS | F | Significance F
Regression | 1 | 28108588 | 28108588 | 65.17314 | 9.60984E-15
Residual | 372 | 1.6E+08 | 431291 | |
Total | 373 | 1.89E+08 | | |



10. With no FTEs at all, the model predicts sales of about 645 Total Units (the intercept).

11. With each FTE we add, we can expect to sell approximately 143 additional units.

12. The slope estimate is 143, and with 90% confidence we can expect the true slope to be between 114 and 172 units.

Margin of error: $\pm\, t \cdot s_b$, where $s_b$ is the standard error of the slope.

t = T.INV.2T(1-0.9, 374-2) ≈ 1.649

Margin of error: $\pm\, 1.649 \times 17.73 \approx \pm\, 29.24$
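The same interval can be checked with a few lines of Python. The critical value t ≈ 1.649 (about 1.65) is typed in rather than computed, since the standard library has no inverse t function; the slope and its standard error are the values reported in the regression output.

```python
slope = 143.1645792     # NumFTESales coefficient
se_slope = 17.73377743  # standard error of the slope
t_crit = 1.649          # approx. T.INV.2T(0.10, 372), entered by hand

margin = t_crit * se_slope
lower, upper = slope - margin, slope + margin

print(round(margin, 2), round(lower, 1), round(upper, 1))
```

The bounds agree, to rounding, with the Lower 90.0% and Upper 90.0% columns of the regression output.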

13. $\hat{y} = \text{intercept} + \text{slope} \cdot x$

$\hat{y} = 645 + 143(6) = 1{,}503$

Using the formula $\hat{y} \pm t_{n-2}\, S_e \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{(n-1) S_x^2}}$, we can determine that at a 95% confidence level this value will be within ± 67 units (from 1,437 to 1,570).

14. $\hat{y} = 645 + 143(6) = 1{,}503$

Using the formula $\hat{y} \pm t_{n-2}\, S_e \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{(n-1) S_x^2}}$, we can determine that at a 95% confidence level this value will be within ± 1,293 units (from 210 to 2,797).
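The two margins in questions 13 and 14 differ only by the extra "1 +" under the square root, which the following Python sketch makes explicit. Two inputs are assumptions rather than figures from the assignment: the mean number of FTEs in the cleaned data is not reported, so `x_bar` is set equal to the prediction point of 6 FTEs (making the distance term zero), and the critical value 1.9664 is entered by hand as an approximation of T.INV.2T(0.05, 372). The quantity (n − 1)·S_x² is backed out from the standard error of the slope using se(slope) = S_e / √((n − 1)·S_x²).

```python
import math


def interval_margins(t_crit, s_e, n, x_p, x_bar, sxx):
    """Half-widths for the mean response and for a single new observation.

    sxx is (n - 1) * S_x^2, the sum of squared deviations of x.
    """
    distance = (x_p - x_bar) ** 2 / sxx
    mean_margin = t_crit * s_e * math.sqrt(1 / n + distance)
    prediction_margin = t_crit * s_e * math.sqrt(1 + 1 / n + distance)
    return mean_margin, prediction_margin


s_e = 656.7274794    # regression standard error
n = 374              # observations
se_slope = 17.73377743
t_crit = 1.9664      # ASSUMPTION: approx. T.INV.2T(0.05, 372), entered by hand
sxx = (s_e / se_slope) ** 2  # from se(slope) = S_e / sqrt(sxx)
x_bar = 6.0          # ASSUMPTION: mean FTE count is not reported in the text

mean_margin, prediction_margin = interval_margins(
    t_crit, s_e, n, x_p=6, x_bar=x_bar, sxx=sxx
)

print(round(mean_margin), round(prediction_margin))
```

Under these assumptions the sketch gives margins close to the ± 67 and ± 1,293 reported above. A mean FTE count further from 6 would widen both intervals slightly.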

15. This model does not appear to be useful for predicting policies sold. The amount of variability (wobble) around our predictions is extremely large. There is certainly a lot of room to improve this model, and adding more variables would potentially be helpful.

16. The four main regression assumptions are:

Approximate linearity: We know that we have approximate linearity in this model between our independent and dependent variables.

Normal distribution of errors: The residual histogram appears to follow a somewhat normal distribution, although it is fairly right-skewed. One of our assumptions for regression analysis is that the errors should be somewhat normally distributed. For our purposes, the actual distribution follows the theoretical one closely enough to be considered somewhat normally distributed.

Constant variance of errors: As depicted in the charts below, we do not appear to have constant variance in our errors. There appears to be evidence of heteroscedasticity, and a transformation of the data appears to be required. A log transformation might be appropriate, as it would accommodate the diminishing returns that would likely be realized from adding more FTEs.

Independence of errors: Our errors appear to have a greater variance as we increase the predicted number of policies sold or add more FTEs. This pattern mainly reflects non-constant variance; it does not by itself show that the errors depend on one another, so this assumption cannot be confirmed from these charts alone.

 | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0%
Intercept | 644.568215 | 100.8191 | 6.393317 | 4.87E-10 | 446.321481 | 842.8149 | 446.3215 | 842.8149
NumFTESales | 143.1645792 | 17.73378 | 8.072988 | 9.61E-15 | 108.293562 | 178.0356 | 108.2936 | 178.0356

 | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 90.0% | Upper 90.0%
Intercept | 644.568215 | 100.8190679 | 6.393317 | 4.87E-10 | 446.321481 | 842.8149489 | 478.3215984 | 810.8148315
NumFTESales | 143.1645792 | 17.73377743 | 8.072988 | 9.61E-15 | 108.293562 | 178.0355965 | 113.9222885 | 172.40687




17.

The “Other” region has the highest percentage of variation in Total Units explained by the number of FTEs.

The “Other” region has the least statistical significance. This makes sense: the p-value measures how unlikely a result would be if there were no real relationship. The “Other” region has the most variability, or wobble, in its slope estimate. Because of this, the interval around its estimate is wider, which means that the result is less precise and therefore not as surprising.

The “Other” region appears to be the most sensitive to increases in the number of FTEs.

This could potentially be due to random error, as the Other region has the highest wobble (as measured by the standard error of the slope). The Other region therefore has the least precise estimate, and the answers to the previous questions could be a matter of random error.

Region | N | R-Square | Adjusted R-Square | Intercept | Slope | Standard Error of Slope | P-value
All Regions | 374 | 0.1491 | 0.1468 | 644.57 | 143.16 | 17.73 | 9.61E-15
GTA | 190 | 0.1301 | 0.1254 | 672.18 | 142.06 | 26.80 | 3.20E-07
South | 145 | 0.1528 | 0.1469 | 674.71 | 135.51 | 26.68 | 1.17E-06
Other | 39 | 0.2683 | 0.2485 | 412.50 | 173.78 | 47.18 | 7.31E-04
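The link between the slope, its standard error, and statistical significance can be made concrete by computing the t statistic (slope divided by its standard error) for each regression in the table. The sketch below is illustrative only and uses the rounded slopes and standard errors from the table.

```python
# (slope, standard error of slope) for each regression, from the table above
regions = {
    "All Regions": (143.16, 17.73),
    "GTA": (142.06, 26.80),
    "South": (135.51, 26.68),
    "Other": (173.78, 47.18),
}

# A larger t statistic means the slope is estimated more precisely
# relative to its size, which corresponds to a smaller p-value.
t_stats = {name: slope / se for name, (slope, se) in regions.items()}
weakest = min(t_stats, key=t_stats.get)

for name, t in t_stats.items():
    print(name, round(t, 2))
print("least significant:", weakest)
```

Although the Other region has the steepest slope, its standard error is large enough that it ends up with the smallest t statistic, which is why it is the least statistically significant of the four regressions.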
