
EXCEL AND ECONOMETRICS

Edward Omey

HUB

[email protected]

Introduction
1. Preparation
2. Linear models
3. Selection of variables part 1
   3.1. Example and first order QMC
   3.2. Higher order QMC
4. Regression analysis
   4.1. Example 1 (Ctd)
      4.1.1. MODEL 1: Y^ = a + bX1
      4.1.2. MODEL 2: Y^ = a + bX1 + cX2
   4.2. Basic Assumptions
      4.2.1. BA1: E(εi) = 0
      4.2.2. Homoscedasticity
      4.2.3. BA3: no autocorrelation
      4.2.4. BA4: εi ∼ N(0, σ²)

Econometrics and Excel E.Omey HUB BBA 1


http://www.edwardomey.com/

Introduction

This text is a guide to using EXCEL for some calculations related to linear models in econometrics. The data I use here (EDUCATION DATA) are available as an Excel sheet on my webpage.

1. Preparation

We need a TOOL to make the calculations. The tool is called “DATA ANALYSIS” and it can be found by using TOOLS ==> ADD-INS ==> DATA ANALYSIS:

Now we find “DATA ANALYSIS” as one of the options in TOOLS:

Selecting “DATA ANALYSIS”, we see the following screen:


We are going to use:
- Correlation: allows us to calculate data-correlation-coefficients;
- Descriptive Statistics;
- Regression: regression analysis for linear models.

2. Linear models

We will study linear models or models that can be made linear. By linear we mean that the model is linear in the parameters.

linear: Y^ = a + bX1 + cX2 + dX3
exponential: Y^ = a·e^(bX) (==> ln(Y^) = ln(a) + bX, or Y*^ = a* + bX with Y* = ln(Y) and a* = ln(a))
loglinear: Y^ = a·X^b·Z^c (==> ln(Y^) = ln(a) + b·ln(X) + c·ln(Z))
etc.

We can transform variables by using the FUNCTION WIZARD:

Complicated and non linear models should be treated by using more sophisticated software.
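The log transformation of the exponential model can also be sketched outside EXCEL. The following Python fragment is only an illustration under our own assumptions: the function names and the small data set are hypothetical, and the data are generated from Y = 2·e^(0.5X) so that the fit can be checked by eye.

```python
import math


def fit_line(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def fit_exponential(x, y):
    """Fit Y^ = a * exp(b*X) by regressing ln(Y) on X (requires all Y > 0)."""
    a_star, b = fit_line(x, [math.log(yi) for yi in y])
    return math.exp(a_star), b  # a = exp(a*), b is unchanged


# Hypothetical illustration data (not the EDUCATION DATA).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(0.5 * xi) for xi in xs]

a_hat, b_hat = fit_exponential(xs, ys)
print(a_hat, b_hat)
```

The estimate of a is obtained by transforming back with exp, because the regression on ln(Y) estimates a* = ln(a) and not a itself.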


In EXCEL we estimate parameters by using OLS, that is, the METHOD OF LEAST SQUARES. To construct and evaluate a good econometric model, we need to discuss several topics:
* selection of variables
  - to avoid QMC-problems
  - to find the marginal contribution of variables.
* the determination coefficient or R²: do the results of the analysis correspond to reality?
* theoretical consistency: is the sign of the estimated parameters theoretically correct?
* predictive power: can we use the model to make predictions?
* can we assume that the basic assumptions hold?


3. Selection of variables part 1

3.1. Example and first order QMC

We try to explain Y = the expenses for education in the states of the US. To this end we use a small number of variables:
X1 = number of people that live in cities per 1000;
X2 = mean income;
X3 = number of young people (under 18) per 1000.
These are the data:

state  Y    X1   X2     X3
1      235  508  394,4  325
2      231  564  457,8  323
3      270  322  401,1  328
4      261  846  523,3  305
5      300  871  478    303
6      317  774  588,9  307
7      387  856  566,3  301
8      285  889  575,9  310
9      300  715  489,4  300
10     221  753  501,2  324
11     264  649  490,8  329
12     308  830  575,3  320
13     379  738  543,9  337
14     342  659  463,4  328
15     378  664  492,1  330
16     232  572  486,9  318
17     231  701  467,2  309
18     246  443  478,2  333
19     230  446  429,6  330
20     268  615  482,7  318
21     337  661  505,7  304
22     344  722  554    328
23     330  766  533,1  323
24     261  631  741,5  317
25     214  390  382,8  310
26     245  450  412    321
27     233  576  381,7  342
28     250  603  424,3  339
29     243  805  464,7  287
30     216  523  396,7  325
31     212  588  394,6  315
32     208  584  372,4  332
33     215  445  344,8  358
34     221  500  368    320
35     244  661  382,5  355
36     234  680  418,9  306
37     269  797  433,6  335
38     269  534  441,8  335
39     268  541  432,3  344
40     323  605  482    331
41     304  785  504,6  324
42     317  698  376,4  366
43     332  796  450,4  340
44     315  804  400,5  378
45     291  809  556    330
46     312  726  498,9  313
47     316  671  469,7  305
48     332  909  543,8  307
49     311  831  530,9  333

We expect that each of the variables has a positive influence on Y. To find the order in which we will include the variables in our model, we calculate all correlation coefficients. We use TOOLS ==> DATA ANALYSIS ==> CORRELATION:

As INPUT RANGE we select all the data and the labels. As OUTPUT RANGE we first select the option button (the “white ball”) and then click in the white box that appears. Then we select an empty cell where we want our output. Then we click “OK”.


The result is:

      Y         X1        X2        X3
Y     1
X1    0,559354  1
X2    0,506961  0,55568   1
X3    -0,01823  -0,22613  -0,41237  1

The table serves several purposes:

a. We can sort the variables w.r.t. their importance for Y. The order is:
   X1 with r(Y, X1) = 0.559…
   X2 with r(Y, X2) = 0.506…
   X3 with r(Y, X3) = -0.018…
   Our first choice will be X1, then possibly X2 and then possibly X3.

b. We can check the sign of r(Y, .) and check if it is consistent with our theoretical expectations. Here we find:
   X1: r(Y, X1) > 0: OK;
   X2: r(Y, X2) > 0: OK;
   X3: r(Y, X3) < 0: problem! For X3 we expected a positive sign!
   There are several possible explanations for this problem:
   - we made a mistake when entering the data;
   - the theoretical expectations were wrong;
   - the correlation coefficient is not significantly different from zero.
   In our example the third reason applies! We confirm this by using the t-test for correlation coefficients.

We calculate the t-value:

$$ t(r) = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}} $$

Here n = 49 and r = -0.018… and we find t(r) = -0.125. The P-value (using the t-distribution with n – 2 degrees of freedom) can be found using FUNCTION WIZARD ==> TDIST:

As x-value we take the absolute value of the calculated t-value (TDIST does not accept a negative x); the degrees of freedom are n – 2; for ‘tails’ we choose “1” because we want the one-sided P-value.
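As a cross-check of the TDIST calculation, the t-value and its one-sided P-value can be sketched in Python. This is our own illustration, not part of the EXCEL procedure: the function names are hypothetical, and the tail probability is obtained by numerically integrating the t-density with Simpson's rule instead of calling a statistics library.

```python
import math


def t_value(r, n):
    """t-value of a correlation coefficient r based on n observations."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2.0) - math.lgamma(df / 2.0)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1.0 + x * x / df) ** (-(df + 1) / 2.0)


def one_sided_p(t, df, steps=2000):
    """P(T >= |t|), using Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 0.5 - total * h / 3.0


# Values taken from the correlation table: r(Y, X3) and n = 49.
t = t_value(-0.01823, 49)
p = one_sided_p(t, 49 - 2)
print(round(t, 3), round(p, 2))
```

Like TDIST, the sketch works with the absolute value of t, so the sign of the correlation does not matter for the P-value.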

We find a P-value of 0.45 or 45%. This large P-value shows that r(Y, X3) is not significantly different from ρ = 0, so the sign (+ or -) is unimportant.

c. As a third use, we have to take a closer look at the “small” correlation coefficients. It is possible that the relationship with Y is not linear but of another form.


We check this in a graph. We make an X-Y-scatter with X3 on the horizontal axis.

We get the following result:


[Figure: XY scatter of Y against X3 in the default EXCEL layout; vertical axis 0 to 400, horizontal axis 0 to 600, series label “Series1”]

Some make-up leads to the following graph:

[Figure: the same scatter after make-up; vertical axis 0 to 450, horizontal axis 0 to 400]

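We change the scale and set the axis minimum to 250 (horizontal axis) resp. 200 (vertical axis):

[Figure: scatter of Y against X3 with rescaled axes; horizontal axis 250 to 400, vertical axis 200 to 400]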

In the graph, no non-linear relationship is visible either. The correlation r(Y, X3) is small and we don’t see a good way to transform X3. So, possibly X3 was a bad choice of variable.

d. The table with correlations can be used to select 2 variables. In our example we choose:
   FIRST VARIABLE: we take X1 because r(Y, X1) is the largest number;
   SECOND VARIABLE: we check X2.
   * We are allowed to choose X2 if there are no QMC-problems. In our example we have r(X1, X2) = 0.556. Since this is less than the limit 0.60 (cf. classroom), we have no QMC-problem.

Conclusion

The analysis shows that we are ready to examine 2 models:

Model 1: Y^ = a + bX1
Model 2: Y^ = a + bX1 + cX2
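The selection steps of this section (rank the candidates by their correlation with Y, then check first order QMC against the limit 0.60) can be sketched as follows. The sketch starts from the correlations reported in the table above; the function names are hypothetical, and ranking by the absolute value of r is our assumption (the text ranks by r itself, which gives the same order here).

```python
# Correlations copied from the table in section 3.1 (decimal points instead of commas).
r_with_y = {"X1": 0.559354, "X2": 0.506961, "X3": -0.01823}
r_between = {
    ("X1", "X2"): 0.55568,
    ("X1", "X3"): -0.22613,
    ("X2", "X3"): -0.41237,
}

QMC_LIMIT = 0.60  # first order limit used in the text


def pair_corr(a, b):
    """Look up r(a, b) regardless of the order of the two names."""
    if (a, b) in r_between:
        return r_between[(a, b)]
    return r_between[(b, a)]


def rank_candidates(r_y):
    """Order the candidates by |r(Y, X)|, largest first (assumption: absolute value)."""
    return sorted(r_y, key=lambda name: abs(r_y[name]), reverse=True)


def first_order_qmc_ok(candidate, included, limit=QMC_LIMIT):
    """True if the candidate has |r| below the limit with every included variable."""
    return all(abs(pair_corr(candidate, x)) < limit for x in included)


order = rank_candidates(r_with_y)
x2_allowed = first_order_qmc_ok("X2", ["X1"])
x3_allowed = first_order_qmc_ok("X3", ["X1", "X2"])
print(order, x2_allowed, x3_allowed)
```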

3.2. Higher order QMC

The next variable to use is X3. To decide whether or not X3 will be included in the model, we have to check first order QMC and then higher order QMC.

First order QMC: we look at r(X1, X3) = -0,22 and r(X2, X3) = -0,41. In absolute value, both correlation coefficients are less than the limit 0,60.

Higher order QMC: we consider the model

X3^ = u + vX1 + wX2 (*)

in which we try to explain X3 (the candidate) with the variables that we already included in the model. If we can explain X3 well by this model, we have a problem of QMC. If we can’t explain X3 well with this model, we conclude that there are no QMC-problems. We use R² of model (*) and take as a limit the value of 36%: if the calculated R² is larger than 36%, we consider this as QMC and we do not include X3 in the model.
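The higher order check amounts to one auxiliary regression. A minimal Python sketch, assuming we solve the normal equations directly (in EXCEL one would run the REGRESSION tool with X3 as Y-range); the function names and the four-point illustration data are hypothetical and are not the EDUCATION DATA:

```python
def solve(a, b):
    """Solve a small linear system a*x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(target, regressors):
    """R² of the OLS regression of target on a constant and the regressors."""
    n = len(target)
    cols = [[1.0] * n] + [[float(v) for v in r] for r in regressors]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * t for u, t in zip(ci, target)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(bj * col[i] for bj, col in zip(beta, cols)) for i in range(n)]
    mean = sum(target) / n
    sse = sum((t - f) ** 2 for t, f in zip(target, fitted))
    sst = sum((t - mean) ** 2 for t in target)
    return 1.0 - sse / sst


def higher_order_qmc(candidate, included, limit=0.36):
    """True if the candidate is explained 'too well' by the included variables."""
    return r_squared(candidate, included) > limit


# Hypothetical data: one candidate is an exact linear combination, one is unrelated.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, -1.0, -1.0, 1.0]
dependent = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]
unrelated = [-1.0, 3.0, -3.0, 1.0]
print(higher_order_qmc(dependent, [x1, x2]), higher_order_qmc(unrelated, [x1, x2]))
```

Note that the limit of 36% for R² is the square of the first order limit 0,60.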


4. Regression analysis

4.1. Example 1 (Ctd)

4.1.1. MODEL 1: Y^ = a + bX1

Using “COPY PASTE” we select the data that we need. We create a NEW Worksheet: INSERT NEW WORKSHEET. Here we put the data (Y and X1) that we want:

state  Y    X1
1      235  508
2      231  564
3      270  322
4      261  846
5      300  871
6      317  774
7      387  856
8      285  889
9      300  715
10     221  753
11     264  649
12     308  830
13     379  738
14     342  659
15     378  664
16     232  572
17     231  701
18     246  443
19     230  446
20     268  615
21     337  661
22     344  722
23     330  766
24     261  631
25     214  390
26     245  450
27     233  576
28     250  603
29     243  805
30     216  523
31     212  588
32     208  584
33     215  445
34     221  500
35     244  661
36     234  680
37     269  797
38     269  534
39     268  541
40     323  605
41     304  785
42     317  698
43     332  796
44     315  804
45     291  809
46     312  726
47     316  671
48     332  909
49     311  831

Now we choose: TOOLS ==> DATA ANALYSIS ==> REGRESSION.

* Input Y-range: we select the data about Y (with LABEL)
* Input X-range: we select the variables (with LABEL)
* We select “labels”
* Confidence level: EXCEL gives 95 % confidence statements. If we also want 99 % statements, we change 95 % into 99 % (then we get both 95 % and 99 % confidence statements).
* Output range: click the option button (the “white ball”) and then select an empty cell in the white box;
* Residuals: the residuals are given by e(i) = Y(i) – Y^(i);
* We can ask for graphs, but the quality of the graphs is bad.

In our example we choose:


We get the following output:

PART 1

Regression Statistics
Multiple R          0,55935389
R Square            0,31287678
Adjusted R Square   0,29825713
Standard Error      40,862049
Observations        49

ANOVA
            df  SS           MS        F          Significance F
Regression  1   35733,60563  35733,61  21,401123  2,94097E-05
Residual    47  78476,23111  1669,707
Total       48  114209,8367

           Coefficients  Standard Error  t Stat    P-value    Lower 95%    Upper 95%  Lower 99,0%  Upper 99,0%
Intercept  151,419701    28,10288381     5,388049  2,24E-06   94,88404472  207,95536  75,9759497   226…
X1         0,19170106    0,041438711     4,626135  2,941E-05  0,108337205  0,2750649  0,08045654   0,30…

(The last column is cut off in the printout.)

* First we get general information about the model:
  R² = 31% and R = 56%
  R² adjusted (not for us)
  standard error = s(e) = sqrt(SSE/(n-p)) = 40.86, where SSE = e²(1) + e²(2) + ... and p is the number of estimated parameters (here p = 2)
* s²(e) is a good estimator for the variance σ² = Var(ε).
* The number of observations is n = 49
* ANOVA = the analysis of variations or variances


SSRegression = SSR = variation of Y^ = the explained variation = 35733
SSResidual = SSE = variation of the errors e = the unexplained variation = 78476
SSTotal = SST = variation of Y = the variation we have to explain = 114210
In the column MS, we calculate the mean squares (SS divided by df).
For linear models with a constant term, we always have:

SST = SSR + SSE and R² = SSR/SST = 1 – SSE/SST = (r(Y, Y^))²

* The value of R² can be evaluated with an F-value

$$ F = \frac{R^{2}/(p-1)}{(1-R^{2})/(n-p)} $$

and its P-value can be found by using the F(p – 1, n – p)-distribution. In EXCEL the P-value is given by “Significance F”. In the example, the P-value is small enough to conclude that R² is significantly different from zero. Our model makes sense!

* Next we get information about the parameter estimates (a^ for a, and b^ for b) and their statistical properties. For the parameter a we find:
  a^ = least-squares estimate = 151.4
  s(a^) = the estimated standard error of a^ = 28.1
  the t-value is t = a^/s(a^) = 5.39
  the P-value of a^ = 2 times the one-sided prob-value of the t-value 5.39
  95 % c.i. for the parameter a
  99 % c.i. for the parameter a
  For parameter b we get similar information. In this example we conclude that the parameter estimates differ significantly from zero (high t-value; small P-value; 0 is NOT in the c.i.).

PART 2: the residuals

We print a part of the residual output:

RESIDUAL OUTPUT

Observation  Predicted Y  Residuals
1            248,803839   -13,80383946
2            259,539099   -28,53909879
3            213,147442   56,85255759
4            313,598798   -52,59879754
5            318,391324   -18,39132403
6            299,796321   17,20367874
7            315,515808   71,48419186
8            321,841943   -36,8419431
9            288,485959   11,51404124
10           295,770599   -74,77059902
11           275,833689   -11,83368884
12           310,531581   -2,531580591
13           292,895083   86,10491687
14           277,750699   64,24930057
15           278,709205   99,29079527
16           261,072707   -29,07270727
17           285,802144   -54,80214393

In the second column, we find the Y^-values. In the last column, we find the errors e(i) = Y(i) – Y^(i). The residuals are needed to check the basic assumptions later.

PART 3: We get some bad graphs and we are not going to use these graphs: we will make our own graphs!

[Figure: “X1 Residual Plot”; residuals (vertical axis) against X1 (horizontal axis, 0 to 1000)]

[Figure: “X1 Line Fit Plot”; Y and Predicted Y (vertical axis) against X1 (horizontal axis, 0 to 1000)]

It is better to make our own graphs. In our example we make a scatter plot of (Y, Y^). By using COPY - PASTE we place Y and Y^ in separate columns, and then use CHART WIZARD ==> XY scatter:

Y    Predicted Y
235  248,803839
231  259,539099
270  213,147442
261  313,598798
300  318,391324
317  299,796321
387  315,515808
285  321,841943
300  288,485959
221  295,770599
264  275,833689
308  310,531581
379  292,895083
342  277,750699
378  278,709205
...  ...
212  264,139924
208  263,37312
215  236,726673
221  247,270231
244  278,134102
234  281,776422
269  304,205446
269  253,788067
268  255,129974
323  267,398842
304  301,905033
317  285,227041
332  304,013745
315  305,547353
291  306,505858
312  290,59467
316  280,051112
332  325,675964
311  310,723282

We get:

[Figure: XY scatter of Y^ against Y in the default EXCEL layout; series label “Series1”]

After make-up we get:

[Figure: scatter of Y^ (vertical axis, 200 to 340) against Y (horizontal axis, 200 to 400)]

In ideal situations we find (Y, Y^) on the first diagonal (45°).
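The numbers in the output of Model 1 are linked by the formulas given above, and this can be verified with a few lines of Python. The sketch below is our own illustration: simple_ols is a hypothetical helper for a model with one explanatory variable, tried on a small made-up data set, and the remaining lines only recombine values printed in the EXCEL output (SSR, SSE, SST and the coefficients).

```python
import math


def simple_ols(x, y):
    """OLS for Y^ = a + b*X with the ANOVA quantities used in the text."""
    n, p = len(x), 2  # p = number of estimated parameters (a and b)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    fitted = [a + b * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    sse = sum(e * e for e in residuals)
    sst = sum((yi - my) ** 2 for yi in y)
    ssr = sst - sse
    return {
        "a": a, "b": b, "fitted": fitted, "residuals": residuals,
        "SSR": ssr, "SSE": sse, "SST": sst, "R2": ssr / sst,
        "s_e": math.sqrt(sse / (n - p)),
        "F": (ssr / (p - 1)) / (sse / (n - p)),
    }


# Hypothetical data set, only to show the helper at work.
fit = simple_ols([1.0, 2.0, 3.0, 4.0], [2.0, 4.1, 5.9, 8.0])

# Consistency checks on the values printed in the EXCEL output of Model 1.
n, p = 49, 2
SSR, SSE, SST = 35733.60563, 78476.23111, 114209.8367
r2 = SSR / SST                                # R Square
s_e = math.sqrt(SSE / (n - p))                # Standard Error
f_value = (r2 / (p - 1)) / ((1 - r2) / (n - p))
a_hat, b_hat = 151.419701, 0.19170106
pred_state1 = a_hat + b_hat * 508             # state 1 has X1 = 508
resid_state1 = 235 - pred_state1              # state 1 has Y = 235
print(r2, s_e, f_value, pred_state1, resid_state1)
```

The recomputed values can be compared with “R Square”, “Standard Error”, “F” and the first row of the RESIDUAL OUTPUT.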

4.1.2. MODEL 2: Y^ = a + bX1 + cX2

We COPY – PASTE the data that we need in a new worksheet.

state  Y    X1   X2
1      235  508  394,4
2      231  564  457,8
3      270  322  401,1
4      261  846  523,3
5      300  871  478
6      317  774  588,9
7      387  856  566,3
8      285  889  575,9
9      300  715  489,4
10     221  753  501,2
11     264  649  490,8
12     308  830  575,3
13     379  738  543,9
14     342  659  463,4
15     378  664  492,1
16     232  572  486,9
17     231  701  467,2
18     246  443  478,2
19     230  446  429,6
20     268  615  482,7
21     337  661  505,7
22     344  722  554
23     330  766  533,1
24     261  631  741,5
25     214  390  382,8
26     245  450  412
27     233  576  381,7
28     250  603  424,3
29     243  805  464,7
30     216  523  396,7
31     212  588  394,6
32     208  584  372,4
33     215  445  344,8
34     221  500  368
35     244  661  382,5
36     234  680  418,9
37     269  797  433,6
38     269  534  441,8
39     268  541  432,3
40     323  605  482
41     304  785  504,6
42     317  698  376,4
43     332  796  450,4
44     315  804  400,5
45     291  809  556
46     312  726  498,9
47     316  671  469,7
48     332  909  543,8
49     311  831  530,9

In DATA ANALYSIS we choose regression and then fill in the form (without choosing plots)

We get the following output:


SUMMARY OUTPUT

Regression Statistics
Multiple R          0,607069231
R Square            0,368533051
Adjusted R Square   0,341077966
Standard Error      39,59571155
Observations        49

ANOVA
            df  SS           MS         F         Significance F
Regression  2   42090,09956  21045,05   13,42313  2,559E-05
Residual    46  72119,73717  1567,8204
Total       48  114209,8367

           Coefficients  Standard Error  t Stat     P-value   Lower 95%  Upper 95%  Lower 99,0%  Upper 99,0%
Intercept  100,1897013   37,26811096     2,6883493  0,009966  25,172956  175,20645  0,0498717    200,32…
X1         0,137661559   0,04829766      2,8502739  0,006517  0,0404435  0,2348796  0,0078852    0,2674…
X2         0,184833355   0,091795138     2,0135419  0,049929  5,95E-05   0,3696072  -0,061821    0,4314…

(The last column is cut off in the printout.)

* R² = 0.368 with an F-value = 13.42 and Prob-value = 2.559E-5: OK
* The parameter estimates show that the parameters are different from 0 at the 95% level; at the 99% level, the 3rd estimate (c^, for X2) is not significantly different from 0.
* The marginal contribution of X2 is equal to MC(X2) = R²(NEW model 2) – R²(OLD model 1) = 0.368… – 0.312… = 0.05565. The corresponding F-value is

F = MC(X2) * (n – p(NEW model 2)) / (1 – R²(NEW model 2)) = 0.05565 * 46 / 0.6315 = 4.05

The P-value of this number can be found using the F(1, n – p(new))-distribution. Using the FUNCTION WIZARD and FDIST we find a P-value = 5,0%. This F-value is the square of the t-value of X2 (2,0135² = 4,05), so its P-value coincides with the P-value 0,0499 of X2 in the output. If we decide to use α = 10%, this P-value shows that the marginal contribution of X2 is acceptable. If we choose another level, the decision can be different.
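The marginal contribution and its F-value can be recomputed from the two R² values in the outputs. The Python sketch below is an illustration under our own assumptions: the function names are hypothetical, and the P-value of an F(1, df) statistic is obtained from the t-distribution (F = T² when one variable is added) by Simpson integration instead of FDIST.

```python
import math


def partial_f(r2_old, r2_new, n, p_new):
    """F-value of the marginal contribution when ONE variable is added."""
    mc = r2_new - r2_old
    return mc * (n - p_new) / (1.0 - r2_new)


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2.0) - math.lgamma(df / 2.0)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1.0 + x * x / df) ** (-(df + 1) / 2.0)


def p_value_f1(f, df, steps=2000):
    """P(F(1, df) > f) = P(|T| > sqrt(f)) with T ~ t(df); Simpson's rule."""
    upper = math.sqrt(f)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 1.0 - 2.0 * total * h / 3.0


# R² values copied from the outputs of Model 1 (old) and Model 2 (new).
r2_old, r2_new = 0.31287678, 0.368533051
n, p_new = 49, 3

mc = r2_new - r2_old
f_mc = partial_f(r2_old, r2_new, n, p_new)
p_mc = p_value_f1(f_mc, n - p_new)
print(round(mc, 5), round(f_mc, 2), round(p_mc, 3))
```

Because only one variable is added, the result can be checked against the t Stat and the P-value of X2 in the output of Model 2.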


4.2. Basic Assumptions

In order to use the statistics of the previous section, several basic assumptions (BA) have to hold. It is necessary to check whether or not we can assume that these BA hold. We use the example and MODEL 2: Y = a + bX1 + cX2 + ε

4.2.1. BA1: E(εi) = 0

We make an XY-scatter with Y on the horizontal axis and the residuals on the vertical axis. If we see CLUSTERS or OUTLIERS in the graph, we might have problems with BA 1. Using COPY PASTE we collect the data we need (only part of the data is shown here):

Y    e
235  -8,02005
231  -31,4475
270  51,34662
261  -52,3747
300  -8,44326
317  1,411889
387  64,30088
285  -44,0164
300  10,92484
221  -75,4873
264  -16,2483
308  -12,7834
379  76,68521
342  65,43955
378  95,44653
232  -36,9275
231  -52,0446
246  -3,56108
230  -10,9912
268  -6,07062

As a graph we find (after make-up):


[Figure: residuals e (vertical axis, -100 to 150) against Y (horizontal axis, 200 to 400)]

We don’t see outliers. A closer look shows that we have around 3 clusters or groups of data. We need to investigate the origin of these clusters and possibly adapt the model by adding new variables.

4.2.2. Homoscedasticity

We have to check whether or not Var(ε) is a constant. We make graphs to check this assumption. By COPY PASTE, we make the following table and we add a new column in which we calculate e². We find (part of the table is here):

e         Y    X1   X2     e²
-8,0200   235  508  394,4  64,32118
-31,4475  231  564  457,8  988,9472
51,3466   270  322  401,1  2636,475
-52,3747  261  846  523,3  2743,107
-8,4433   300  871  478    71,28869
1,4119    317  774  588,9  1,993431
64,3009   387  856  566,3  4134,603
-44,0164  285  889  575,9  1937,44
10,9248   300  715  489,4  119,3521
-75,4873  221  753  501,2  5698,337
-16,2483  264  649  490,8  264,0061
-12,7834  308  830  575,3  163,4159
76,6852   379  738  543,9  5880,621
65,4396   342  659  463,4  4282,335
95,4465   378  664  492,1  9110,04
-36,9275  232  572  486,9  1363,638
-52,0446  231  701  467,2  2708,64
-3,5611   246  443  478,2  12,68131
-10,9912  230  446  429,6  120,8057
-6,0706   268  615  482,7  36,85243

Method 1: graphs

We make XY-scatters with e² on the vertical axis and with Y or X1 or X2 or the index i (for time series) on the horizontal axis. The ideal graph is a horizontal box. We find (after make-up):


[Figure: e² (vertical axis, 0 to 10000) against Y (horizontal axis, 200 to 400)]

[Figure: e² (vertical axis, 0 to 10000) against X1 (horizontal axis, 200 to 1000)]

[Figure: e² (vertical axis, 0 to 10000) against X2 (horizontal axis, 200 to 800)]

Clearly the graphs don’t really show a horizontal box. Moreover, the clusters can be seen more clearly now.

Method 2: correlations

Under ideal situations, we find r(e², Y) = r(e², X1) = r(e², X2) = 0. We calculate the correlations (using the same table as in Method 1) and find:

     Y        X1       X2      e²
e²   0,38436  0,15507  0,2688  1

We see that r(e², Y) is not “small”. More investigation is needed.
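Method 2 can be sketched in a few lines of Python. The function names are hypothetical, and the illustration uses only the first five rows of the table in Method 1, so its correlations will differ from the values above, which are based on all 49 observations.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def squared_residual_correlations(residuals, variables):
    """Return r(e², V) for every variable V in the dictionary."""
    e2 = [e * e for e in residuals]
    return {name: pearson(e2, values) for name, values in variables.items()}


# First five rows of the table (e, Y, X1, X2), with decimal points.
e_part = [-8.0200, -31.4475, 51.3466, -52.3747, -8.4433]
part = {
    "Y": [235, 231, 270, 261, 300],
    "X1": [508, 564, 322, 846, 871],
    "X2": [394.4, 457.8, 401.1, 523.3, 478.0],
}

result = squared_residual_correlations(e_part, part)
e2_first = (-8.02005) ** 2  # compare with the first entry of the e² column
print(result, e2_first)
```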


Method 3: Bartlett + Park + Feldstein + ... See handbook

4.2.3. BA3: no autocorrelation

See any handbook; we don’t treat this here.

4.2.4. BA4: εi ∼ N(0, σ²)

We use the test of Kolmogorov-Smirnov and compare the TDF (theoretical distribution function) and the EDF (empirical distribution function).

TDF: this is the distribution function of ε ∼ N(0, σ²). We estimate σ² by s²(e) = (39.60)² (cf. OUTPUT of model 2). For each error e(i), we calculate P(ε ≤ e(i)) by using the FUNCTION WIZARD and the function NORMDIST, cf. below.

EDF: for each error e(i) we calculate the proportion of errors that are less than or equal to e(i).

In practice, we proceed as follows:
- using COPY PASTE, we copy the errors;
- then we sort the errors from the smallest to the largest (DATA ==> SORT) and number them with 1, 2, 3, …
We get columns 1 and 2 in the following table:

number  e         EDF       TDF       |EDF-TDF|
1       -75,4873  0,020408  0,028296  0,007888
2       -63,1081  0,040816  0,055489  0,014673
3       -53,8993  0,061224  0,086719  0,025494
4       -52,3747  0,081633  0,092962  0,011329
5       -52,0446  0,102041  0,094356  0,007685
6       -44,0164  0,122449  0,133146  0,010697
7       -42,0699  0,142857  0,144007  0,00115
8       -41,416   0,163265  0,147787  0,015478
9       -37,2263  0,183673  0,173568  0,010105
10      -36,9275  0,204082  0,17551   0,028572
11      -31,4475  0,22449   0,213535  0,010955
12      -29,5101  0,244898  0,22805   0,016848
13      -23,3252  0,265306  0,277902  0,012596
14      -21,0497  0,285714  0,297496  0,011782
15      -17,8828  0,306122  0,325767  0,019644
16      -17,0337  0,326531  0,333529  0,006998
17      -16,2483  0,346939  0,340773  0,006166
18      -16,0392  0,367347  0,342712  0,024635
19      -12,7834  0,387755  0,373405  0,01435
20      -11,6244  0,408163  0,38454   0,023623
21      -10,9912  0,428571  0,390666  0,037906
22      -10,6319  0,44898   0,394153  0,054827
23      -10,1796  0,469388  0,398555  0,070833
24      -8,44326  0,489796  0,415571  0,074225
25      -8,02005  0,510204  0,419744  0,09046
26      -6,07062  0,530612  0,439075  0,091537
27      -3,56108  0,55102   0,464169  0,086851
28      -1,71448  0,571429  0,482731  0,088697
29      1,411889  0,591837  0,514222  0,077614
30      2,479064  0,612245  0,524961  0,087284
31      6,163563  0,632653  0,56185   0,070803
32      6,711255  0,653061  0,567296  0,085765
33      10,92484  0,673469  0,608691  0,064778
34      13,43194  0,693878  0,632781  0,061097
35      13,63965  0,714286  0,634755  0,079531
36      19,65465  0,734694  0,690188  0,044506
37      25,82688  0,755102  0,742885  0,012217
38      30,10465  0,77551   0,776463  0,000952
39      36,62317  0,795918  0,822498  0,02658
40      38,98275  0,816327  0,83757   0,021243
41      42,02097  0,836735  0,855712  0,018977
42      50,43538  0,857143  0,898625  0,041483
43      51,15126  0,877551  0,901793  0,024242
44      51,34662  0,897959  0,902645  0,004686
45      52,34578  0,918367  0,906917  0,01145
46      64,30088  0,938776  0,947805  0,00903
47      65,43955  0,959184  0,950803  0,00838
48      76,68521  0,979592  0,973609  0,005982
49      95,44653  1         0,992035  0,007965

- In Column 3 we calculated the EDF. Because the errors are sorted, we get EDF(e(i)) = i/n.
- In Column 4, for each e(i) we calculated the TDF with NORMDIST (mean 0, standard deviation 39.60, cumulative = TRUE).

We enter this formula once and then copy it down the column to get the TDF. The graph of EDF and TDF is the following (select the errors, the EDF and the TDF; graph 2 is graph 1 after make-up):

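[Figure: “EDF and TDF” against e (horizontal axis, -100 to 150) in the default EXCEL layout; vertical axis 0 to 1,2, series labels “Series1” and “Series2”]

[Figure: the same graph after make-up; vertical axis 0 to 1, horizontal axis e from -100 to 150]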

In Column 5 of the table, we calculated |EDF - TDF| and then the maximum of these numbers: KS = MAX(|EDF - TDF|) = 0.0915. This value should be compared with the theoretical (critical) values of KS. In this example, we find that KS = 0.0915 is sufficiently small and we don’t reject BA4.

As an extra, we can make a histogram of the errors e(i). If BA4 holds, we should find a bell-shaped curve. Using the other Excel manual (descriptive statistics) we find the following graph:

[Figure: “Histogram” of the errors e(i); class limits -75, -58, -41, -24, -7, 10, 27, 44, 61, 78, 95, 112]
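The Kolmogorov-Smirnov steps of this section (sort the errors, EDF = i/n, TDF from the normal distribution, maximum absolute difference) can be sketched in Python. The function names are hypothetical; normal_cdf plays the role of NORMDIST with mean 0, and, like the table above, the sketch compares the TDF only with i/n. The critical value 1.36/sqrt(n) is a common large-sample approximation at the 5% level and is our addition, not a value from the text.

```python
import math


def normal_cdf(x, sigma):
    """P(ε <= x) for ε ~ N(0, sigma²), i.e. NORMDIST(x; 0; sigma; TRUE)."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))


def ks_statistic(errors, sigma):
    """KS = max |EDF - TDF| with EDF(e(i)) = i/n for the sorted errors."""
    ordered = sorted(errors)
    n = len(ordered)
    return max(abs((i + 1) / n - normal_cdf(e, sigma))
               for i, e in enumerate(ordered))


S_E = 39.59571155  # Standard Error from the output of Model 2

# Spot checks against the table: row 1 (TDF) and row 26 (largest difference).
tdf_1 = normal_cdf(-75.4873, S_E)
gap_26 = abs(26 / 49 - normal_cdf(-6.07062, S_E))

# Assumed approximate 5% critical value for n = 49.
critical_5pct = 1.36 / math.sqrt(49)

# Tiny hypothetical example of the full statistic.
ks_small = ks_statistic([1.0, -1.0, 0.0], 1.0)
print(tdf_1, gap_26, critical_5pct, ks_small)
```

Since σ is estimated from the same errors, a stricter critical value could be argued for; the comparison above should be read as a rough guide only.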

