


© 2002-Present. Jeeshim and KUCC625 (2005-03-26) Dummy Variables in Regression: 1

http://mypage.iu.edu/~kucc625

Using Dummy Variables in Regression

Park, Hun Myoung Indiana University at Bloomington

This document explains how to use dummy variables in linear regression models only (as opposed to nonlinear models such as logit and probit). The primary focus is on fixed group effect models rather than random effect models.

1. Introduction

A dummy variable is a binary variable that takes a value of either 1 or 0. It is commonly used to examine group and time effects in regression. Panel data analysis estimates fixed effect and/or random effect models using dummy variables. The fixed effect model examines differences in intercepts among groups, assuming the same slopes. By contrast, the random effect model estimates error variances of groups, assuming the same intercept and slopes. An example of the random effect model is the groupwise heteroscedasticity model, which assumes that each group has a different variance (Greene 2000: 511-513).

The data used here are for the top 50 information technology firms, from page 308 of the OECD Information Technology Outlook 2004 (http://thesius.sourceoecd.org/). The data set contains revenue, R&D budget, and net income in current USD millions.

2. Regression without a Dummy

Consider a model regressing R&D budget in 2002 on net income in 2000 and firm type. The dummy variable d is set to 1 for equipment and software companies and 0 for other firms. Let us take a look at the data structure.

Table 1. Dummy Variable Coding
+-----------------------------------------------------------------------+
| firm        country    rd2002   net2000   type            d           |
|-----------------------------------------------------------------------|
| Samsung     Korea       2,500     4,768   Electronics     0           |
| AT&T        USA           254     4,669   Telecom         0           |
| IBM         USA         4,750     8,093   IT Equipment    1           |
| Siemens     Germany     5,490     6,528   Electronics     0           |
| Verizon     USA             .    11,797   Telecom         0           |
| Microsoft   USA         4,307     9,421   Service & S/W   1           |
| EDS         USA             0     1,143   Service & S/W   1           |
| …           …               …         …   …               …           |
+-----------------------------------------------------------------------+

Let us first think about a linear regression model, ordinary least squares (OLS), without the dummy variable. Note that β0 is the intercept, β1 is the slope of net income in 2000, and ε_i is the error term of the regression equation.

Model 1: research_i = β0 + β1*income_i + ε_i
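Model 1 can be estimated with the usual closed-form OLS formulas, b1 = cov(x, y)/var(x) and b0 = mean(y) − b1*mean(x). A minimal sketch in Python (the original uses SAS and STATA); the variable names follow the model, but the numbers are made-up illustration data, not the OECD firm data:

```python
# Closed-form OLS for one regressor: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar.
def ols_simple(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx           # slope
    b0 = my - b1 * mx        # intercept
    return b0, b1

# Illustration data only (not the OECD data): research = 1 + 2*income exactly,
# so the fit should recover intercept 1 and slope 2.
income = [1.0, 2.0, 3.0, 4.0]
research = [3.0, 5.0, 7.0, 9.0]
b0, b1 = ols_simple(income, research)
```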


The estimated model has an intercept of 1,482.697 and a slope of .223. For a $1 million increase in net income in 2000, a firm is likely to increase its R&D budget in 2002 by $.223 million, holding all other things constant.

Table 2. Regression without Dummy Variables (Model 1)
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  1,    37) =    7.07
       Model |  15902406.5     1  15902406.5           Prob > F      =  0.0115
    Residual |  83261299.1    37  2250305.38           R-squared     =  0.1604
-------------+------------------------------           Adj R-squared =  0.1377
       Total |  99163705.6    38   2609571.2           Root MSE      =  1500.1
------------------------------------------------------------------------------
      rd2002 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .2230523   .0839066     2.66   0.012     .0530414    .3930632
       _cons |   1482.697   314.7957     4.71   0.000     844.8599    2120.533
------------------------------------------------------------------------------

3. Regression with a Dummy: Binary Categories

Despite moderate goodness-of-fit statistics such as F and t, this is a naïve model. R&D investment tends to vary across industries. So let us take such a difference into account in the model, assuming that equipment and software firms invest more in R&D than do telecommunications and electronics companies. There may or may not be correlation (dependence) between the dummy variable (firm type) and the regressor (net income).

3.1 Model and Estimation

The new model with a dummy variable becomes,

Model 2: research_i = β0 + β1*income_i + δ*d_i + ε_i,

where δ is the coefficient of the dummy variable, which affects equipment and software companies only. Thus, this model implies two slightly different regression equations for the two groups.

research_i = β0 + β1*income_i + δ*1 + ε_i   for equipment and software firms

research_i = β0 + β1*income_i + δ*0 + ε_i   for telecom and electronics firms

The regression indicates a positive impact of two-year-lagged net income on a firm's R&D budget. Equipment and software firms on average invest $1,007 million more in R&D than do telecommunications and electronics companies. There is only a tiny difference in the slope (.223 versus .218) between the two models with/without the dummy, supporting the assumption that all firms have the same impact of net income on R&D investment. The regression equations of the two groups are,


Equipment and Software:   Research = 2140.205 + .218*income (1)
Telecom and Electronics:  Research = 1133.579 + .218*income

Table 3. Regression with a Dummy Variable (Model 2)
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  2,    36) =    6.06
       Model |  24987948.9     2  12493974.4           Prob > F      =  0.0054
    Residual |  74175756.7    36  2060437.69           R-squared     =  0.2520
-------------+------------------------------           Adj R-squared =  0.2104
       Total |  99163705.6    38   2609571.2           Root MSE      =  1435.4
------------------------------------------------------------------------------
      rd2002 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .2180066   .0803248     2.71   0.010     .0551004    .3809128
           d |   1006.626   479.3717     2.10   0.043     34.41498    1978.837
       _cons |   1133.579   344.0583     3.29   0.002     435.7962    1831.361
------------------------------------------------------------------------------

3.2 Comparison between Model 1 and Model 2

Let us draw a plot to highlight the difference between Models 1 and 2 more clearly. Look at the middle red line first. It is the regression line of Model 1, without the dummy variable. The top green line is the regression line for equipment and software companies, while the bottom yellow line is the one for telecommunications and electronics firms in Model 2. Of course, the green and yellow lines are parallel, with a difference of 1,006.626, the coefficient of the dummy variable in Table 3.

Figure 1. Comparison between Model 1 and Model 2 (Fixed Group Effect)

(1) The intercept of equipment and software firms is computed as 2140.205 = 1006.626 + 1133.579.
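The parallel-lines arithmetic can be checked numerically. A minimal sketch using the coefficients from Table 3; predict is a helper name invented here:

```python
# Model 2 coefficients from Table 3 (intercept, slope, dummy coefficient).
b0, b1, delta = 1133.579, 0.2180066, 1006.626

def predict(income, d):
    # Fitted value of Model 2 for a firm with the given net income and dummy.
    return b0 + b1 * income + delta * d

# At any income level, the two groups' fitted values differ by exactly delta,
# and the equipment/software intercept is b0 + delta = 2140.205.
gap = predict(5000, 1) - predict(5000, 0)
equip_intercept = predict(0, 1)
```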


This plot shows that Model 1 is canceling out the group difference, and thus reports a misleading intercept. The difference between the two groups of firms looks substantial. The t-test for the dummy parameter rejects the null hypothesis of no difference in intercepts at the .05 level (p=.043). Consequently, we conclude that Model 2, which considers fixed group effects, is better than the simple Model 1. You may also compare goodness-of-fit statistics (e.g., F, t, R-squared, and SSE) of the two models.(2)

3.3 Common Misunderstandings

Some people, especially those who do not know exactly how dummies work, may ask, "What if we code the dummy variable reversely?" The simplest answer is "It gives equivalent results." Let us set d0 to 1 if d is 0 (telecommunications and electronics firms) and to 0 if d is 1 (equipment and software firms), and then replace d with d0 in Model 2. The model becomes,

Model 2-1: research_i = β0' + β1'*income_i + δ'*d0_i + ε_i

Model 2-1 is equivalent to Model 2 in that both produce identical regression equations; the ANOVA tables of the two models are identical. The slope of the regressor remains unchanged: β1' = β1. The sign of the dummy parameter is switched: δ' = −δ. The intercept of Model 2-1 is the actual intercept of equipment and software companies, whose dummy variable is excluded in Model 2-1: β0' = β0 + δ. That is, one model implies the other. This is because the two models use different baseline categories, or reference points. Model 2 uses telecommunications and electronics firms as the baseline, while Model 2-1 switches to equipment and software companies. They see the same thing from different views.

Table 4. Regression with a Reversely Coded Dummy (Model 2-1)
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  2,    36) =    6.06
       Model |  24987948.9     2  12493974.4           Prob > F      =  0.0054
    Residual |  74175756.7    36  2060437.69           R-squared     =  0.2520
-------------+------------------------------           Adj R-squared =  0.2104
       Total |  99163705.6    38   2609571.2           Root MSE      =  1435.4
------------------------------------------------------------------------------
      rd2002 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .2180066   .0803248     2.71   0.010     .0551004    .3809128
          d0 |  -1006.626   479.3717    -2.10   0.043    -1978.837   -34.41498
       _cons |   2140.205   434.4846     4.93   0.000     1259.029     3021.38
------------------------------------------------------------------------------

Some may also ask, "Then, why don't we run the regression on a group-by-group basis?" Yes, we may get similar regression equations by running one regression only on equipment and software firms and another regression on telecommunications and electronics companies.

2 If the coefficient of the dummy variable d turns out statistically insignificant, we can conclude that there is no group effect, or that all firms have the same intercept, in favor of Model 1.


Model 1-1: research_i = β0 + β1*income_i + ε_i   for equipment and software firms
Model 1-2: research_j = λ0 + λ1*income_j + ε_j   for telecom and electronics firms

What is the difference between this group-by-group regression, Models 1-1 and 1-2, and Model 2 with a dummy? The former assumes that the two groups are different species, like monkey versus lemon. The parameters β and λ are not comparable in a strict statistical sense. Thus, we may not be able to examine the group differences by comparing (eyeballing) the goodness of fit of two separate regressions (Models 1-1 and 1-2). Another difference lies in the efficiency of the slope, which is improved by pooling data; thus, Model 2 produces more efficient estimates than Models 1-1 and 1-2.

What if you present Model 1 (pooled regression), Model 1-1, and Model 1-2 at the same time? What if you report Model 1 as well as Model 2? These attempts will end up in logical fallacy, because these models have contradictory assumptions. If Model 2 is true, for example, Model 1 must be false. Model 1-1 is not comparable to Models 1 and 1-2.

4. Meanings of Dummy Variable Coefficients

In order to directly get the regression equations for Model 2, you may run the regression with two dummy variables: one (d) for equipment and software firms and another (d0) for telecommunications and electronics firms (see Table 4). Let us call this model Model 2-2, since it is equivalent to Model 2. Note that the intercept β0 is suppressed to avoid perfect multicollinearity.

Model 2-2: research_i = β1*income_i + δ1*d_i + δ0*d0_i + ε_i

Table 5. Regression with Two Dummies without the Intercept (Model 2-2)
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  3,    36) =   29.88
       Model |   184685604     3  61561868.1           Prob > F      =  0.0000
    Residual |  74175756.7    36  2060437.69           R-squared     =  0.7135
-------------+------------------------------           Adj R-squared =  0.6896
       Total |   258861361    39  6637470.79           Root MSE      =  1435.4
------------------------------------------------------------------------------
      rd2002 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .2180066   .0803248     2.71   0.010     .0551004    .3809128
           d |   2140.205   434.4846     4.93   0.000     1259.029     3021.38
          d0 |   1133.579   344.0583     3.29   0.002     435.7962    1831.361
------------------------------------------------------------------------------

You may observe several differences in statistics between Table 3 (Model 2) and Table 5 (Model 2-2). In particular, the coefficients and t statistics of the dummy variables are different, although the two models are equivalent.(3) How do we explain these differences?

(3) The R2 and adjusted R2 are not well defined (incorrect) in Model 2-2, which suppresses the intercept.


The coefficients of the dummy variables in Models 2 and 2-2 have different meanings. In Model 2-2, the coefficient estimates of the dummies, δ1 and δ0, are the actual intercepts of the two groups (2,140.205 and 1,133.579). Accordingly, the null hypothesis of the t-test is that the parameters δ1 and δ0 are zero. By contrast, the coefficient of d in Model 2 estimates the difference of δ1 from δ0, where δ0 is the intercept of the baseline category, telecom and electronics firms. Accordingly, the null hypothesis is that the difference, not the actual intercepts, is zero: δ = δ1 − δ0 = 0.

Consider the following two plots of regression lines. The left plot depicts a situation where both δ0 and δ1 are close to zero in Model 2-2, and their difference δ = δ1 − δ0 is not substantial in Model 2. The t-tests in both models may not be rejected; there is no group effect. Thus, Model 1, a pooled model, may be better than Model 2. In the right plot, δ1 may turn out statistically different (far away) from zero (the t-test may be rejected), while δ0 is close to zero (not rejected). Accordingly, the difference δ = δ1 − δ0 is also substantial in Model 2 (rejected). This indicates that there is some fixed effect between the two groups; so Model 2 is superior to Model 1.

Figure 2. Meanings of Dummy Variable Coefficients
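The equivalences of sections 3.3 and 4 can be verified numerically: Models 2, 2-1, and 2-2 describe the same pair of regression lines. A minimal sketch using the coefficients copied from Tables 3, 4, and 5; the function names are invented here:

```python
# Common slope of net2000 in all three parameterizations.
b1 = 0.2180066

def model2(income, d):        # Table 3: baseline = telecom/electronics
    return 1133.579 + b1 * income + 1006.626 * d

def model2_1(income, d):      # Table 4: reversely coded dummy d0 = 1 - d
    return 2140.205 + b1 * income - 1006.626 * (1 - d)

def model2_2(income, d):      # Table 5: two dummies, no intercept
    return b1 * income + 2140.205 * d + 1133.579 * (1 - d)

# All three give identical fitted values (up to rounding) for every firm.
diffs = [abs(model2(inc, d) - m(inc, d))
         for m in (model2_1, model2_2)
         for inc in (0.0, 4768.0, 9421.0) for d in (0, 1)]
```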

Let us run the three regression models mentioned so far using SAS and STATA. In SAS, use the REG procedure as follows. Note that /* … */ is used for comments.

PROC REG;
  MODEL rd2002 = net2000;               /* Model 1 */
  MODEL rd2002 = net2000 d;             /* Model 2 */
  MODEL rd2002 = net2000 d d0 /NOINT;   /* Model 2-2 */
RUN;

In STATA, run the .regress command as follows. Note that // is used for comments.

. regress rd2002 net2000 // Model 1


. regress rd2002 net2000 d // Model 2

. regress rd2002 net2000 d d0, noconstant // Model 2-2

5. Regression with Dummies: Multiple Categories

Now, imagine a situation where three or more groups need to be considered in a model. We may classify the firms into three types: telecommunications, electronics, and equipment/software, assuming they have different intercepts. Researchers may similarly examine seasonal effects (spring, summer, fall, and winter; 1st through 4th quarters) by deseasonalizing data (Greene 2000: 319).

5.1 Model and Data Structure

Here is a regression model with multiple dummy variables.

research_i = β0 + β1*income_i + δ1*d1_i + δ2*d2_i + δ3*d3_i + ε_i

How do we make three dummy variables for the three firm types? The d1 is set to 1 for telecommunications firms and 0 for others; d2 is set to 1 for electronics firms and 0 for others. Similarly, d3 is set to 1 for equipment and software firms and 0 otherwise; so d3 is identical to d of Model 2 in section 3. Look at the data structure of the multiple dummies.

Table 6. Data Structure for Multiple Dummy Model
+----------------------------------------------------------------------+
| firm        rd2002   net2000   type            d1   d2   d3          |
|----------------------------------------------------------------------|
| Samsung      2,500     4,768   Electronics      0    1    0          |
| AT&T           254     4,669   Telecom          1    0    0          |
| IBM          4,750     8,093   IT Equipment     0    0    1          |
| Siemens      5,490     6,528   Electronics      0    1    0          |
| Verizon          .    11,797   Telecom          1    0    0          |
| Microsoft    3,772     9,421   Service & S/W    0    0    1          |
| EDS             24     1,143   Service & S/W    0    0    1          |
| …                …         …   …                …    …    …          |
+----------------------------------------------------------------------+
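The coding rule above can be sketched as a small function; make_dummies is a hypothetical helper name, and the category labels follow Table 6:

```python
# One indicator per firm type: d1 = telecom, d2 = electronics,
# d3 = equipment/software (IT Equipment and Service & S/W in the data).
def make_dummies(firm_type):
    d1 = 1 if firm_type == "Telecom" else 0
    d2 = 1 if firm_type == "Electronics" else 0
    d3 = 1 if firm_type in ("IT Equipment", "Service & S/W") else 0
    return d1, d2, d3

codes = {t: make_dummies(t) for t in
         ("Telecom", "Electronics", "IT Equipment", "Service & S/W")}
# Exactly one dummy is 1 for every firm, so d1 + d2 + d3 = 1 always; adding
# an intercept on top of all three dummies is what causes the "dummy
# variable trap" discussed below.
```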

5.2 Three Approaches to Running LSDV Regression

Now we are ready for regression analysis, called least squares dummy variable (LSDV) regression. However, here is the problem. When including all three dummy variables and an intercept, we will be caught in the so-called "dummy variable trap." This problem is perfect multicollinearity; the regression equation is not solvable, since the X matrix is not of full rank. There are three approaches to running regression analyses with multiple dummy variables. First, look at the functional forms below. The first approach--let us call it LSDV1--runs OLS with all dummy variables, ignoring the intercept. The second, LSDV2, omits one of the dummy variables and includes the intercept. The final approach, LSDV3, includes all


dummy variables and the intercept, but imposes a restriction that the sum of the parameters of all dummies is zero. Table 10 summarizes the features of the three LSDVs.

LSDV1: research_i = β1*income_i + δ1*d1_i + δ2*d2_i + δ3*d3_i + ε_i   (without the intercept)

LSDV2: research_i = β0 + β1*income_i + δ1*d1_i + δ2*d2_i + ε_i   (without one of the three dummy variables)

LSDV3: research_i = β0 + β1*income_i + δ1*d1_i + δ2*d2_i + δ3*d3_i + ε_i,   with the restriction δ1 + δ2 + δ3 = 0

The biggest difference is in the meanings of the dummy variable parameters and their hypothesis tests. The first approach reports coefficients that are easy to interpret substantively. They are the actual intercepts of the three groups, as in the following regression equations (see Table 7).

Telecom firms:    Research =  153.624 + .215*income
Electronics:      Research = 1695.486 + .215*income
Equipment & S/W:  Research = 2147.559 + .215*income

In the second approach, LSDV2, the intercept is the coefficient of the dropped dummy, playing the role of the baseline or reference point. The other coefficients are the differences between the corresponding actual intercepts and that baseline (see Table 8). For example, the intercept 2,147.559 in LSDV2 is the actual coefficient of d3, which is dropped. The coefficient -452.073 of d2 is computed as 1695.486 - 2147.559. Likewise, 153.624 in LSDV1 is computed as -1993.935 + 2147.559. What if we omit d2 instead of d3? We will get different parameter estimates and standard errors for the dummy variables. Note that the coefficient of net income is quite similar to those of Model 1 and Model 2 in sections 2 and 3 (.223 versus .218 versus .215).

Table 7. LSDV1 without the Intercept
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  4,    35) =   28.70
       Model |   198376404     4  49594101.1           Prob > F      =  0.0000
    Residual |  60484956.6    35  1728141.62           R-squared     =  0.7663
-------------+------------------------------           Adj R-squared =  0.7396
       Total |   258861361    39  6637470.79           Root MSE      =  1314.6
------------------------------------------------------------------------------
      rd2002 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .2151104   .0735702     2.92   0.006      .065755    .3644659
          d1 |   153.6238   469.5762     0.33   0.745    -799.6665    1106.914
          d2 |   1695.486   373.0145     4.55   0.000     938.2267    2452.746
          d3 |   2147.559   397.9181     5.40   0.000     1339.742    2955.375
------------------------------------------------------------------------------

The third approach, LSDV3, produces coefficients that indicate how far each actual group intercept is away from the averaged group effect, which is the intercept of LSDV3 (see Table 9). For example, the intercept 1,332.223 is computed as (153.624 + 1695.486 + 2147.559)/3. The coefficient 815.33581 of d3 is 2,147.559 - 1,332.223. Note that the 6.14175E-13 in the last part of the SAS output is virtually zero; this is the restriction.
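The arithmetic links among the three LSDVs can be checked directly from the reported coefficients. A minimal sketch using the rounded figures from Tables 7 through 9; the variable names are invented here:

```python
# LSDV1 coefficients (Table 7) are the actual group intercepts.
lsdv1 = {"d1": 153.624, "d2": 1695.486, "d3": 2147.559}

# LSDV2 drops d3: its intercept equals the d3 intercept, and each dummy
# coefficient is the gap between a group's intercept and that baseline.
lsdv2_d1 = lsdv1["d1"] - lsdv1["d3"]    # matches Table 8's -1993.935
lsdv2_d2 = lsdv1["d2"] - lsdv1["d3"]    # matches Table 8's -452.073

# LSDV3's intercept is the average group effect; each dummy coefficient is
# the deviation of a group's intercept from that average.
avg = sum(lsdv1.values()) / 3           # matches Table 9's 1332.223
lsdv3_d3 = lsdv1["d3"] - avg            # matches Table 9's 815.336
```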


Table 8. LSDV2 without One Dummy Variable
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  3,    35) =    7.46
       Model |    38678749     3  12892916.3           Prob > F      =  0.0005
    Residual |  60484956.6    35  1728141.62           R-squared     =  0.3900
-------------+------------------------------           Adj R-squared =  0.3378
       Total |  99163705.6    38   2609571.2           Root MSE      =  1314.6
------------------------------------------------------------------------------
      rd2002 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .2151104   .0735702     2.92   0.006      .065755    .3644659
          d1 |  -1993.935   561.9429    -3.55   0.001     -3134.74   -853.1303
          d2 |  -452.0725   481.2018    -0.94   0.354    -1428.964    524.8192
       _cons |   2147.559   397.9181     5.40   0.000     1339.742    2955.375
------------------------------------------------------------------------------

Table 9. LSDV3 with the Restriction Imposed (SAS output)
                              The REG Procedure
                                Model: MODEL1
                         Dependent Variable: rd2002

NOTE: Restrictions have been applied to parameter estimates.

            Number of Observations Read                    50
            Number of Observations Used                    39
            Number of Observations with Missing Values     11

                             Analysis of Variance
                                    Sum of        Mean
Source                    DF       Squares      Square    F Value    Pr > F
Model                      3      38678749    12892916       7.46    0.0005
Error                     35      60484957     1728142
Corrected Total           38      99163706

            Root MSE           1314.58800    R-Square    0.3900
            Dependent Mean     2023.56410    Adj R-Sq    0.3378
            Coeff Var            64.96399

                             Parameter Estimates
                          Parameter      Standard
Variable          DF       Estimate         Error    t Value    Pr > |t|
Intercept          1     1332.22301     280.18308       4.75      <.0001
net2000            1        0.21511       0.07357       2.92      0.0060
d1                 1    -1178.59917     333.36182      -3.54      0.0012
d2                 1      363.26336     288.19307       1.26      0.2158
d3                 1      815.33581     297.13197       2.74      0.0095
RESTRICT          -1    6.14175E-13             .        .           .
* Probability computed using beta distribution.

Table 10 summarizes the differences among the three approaches discussed so far.


Table 10. Three Approaches to Running Dummy Variable Models (LSDVs)

                      LSDV1                LSDV2                    LSDV3
                      (no intercept)       (dropping one dummy)     (imposing a restriction)
Dummies included      d1, d2, ..., dd      d1, ..., d(d-1)          d1, d2, ..., dd
Intercept?            No                   Yes                      Yes
All d dummies?        Yes (d)              No (d-1)                 Yes (d)
Restriction?          No                   No                       Σδi = 0 *
Meaning of a dummy    Fixed group effect   How far away from the    How far away from the
coefficient           (actual intercept)   reference point (the     average group effect?
                                           dropped category)?
Coefficients          di = di^a            di = di^a - d_dropped^a, di = di^a - (1/d)Σdj^a,
                                           intercept α =            intercept α =
                                           d_dropped^a              (1/d)Σdj^a
H0 of the t-test      di^a = 0             di^a - d_dropped^a = 0   di^a - (1/d)Σdj^a = 0 **

(Here di^a denotes the actual intercept of group i, and d is the number of groups.)
Source: http://mypage.iu.edu/~kucc625/documents/Panel_Data_Models.pdf

5.3 Comparing Statistics of the Three LSDVs

The t-tests for the dummy variable parameters should be interpreted with caution, since the three approaches give different meanings to the dummy coefficients (see Table 10). In LSDV1 the coefficients are easy to interpret because they are the actual intercepts. Keep in mind that LSDV2 examines the difference of an actual intercept from the baseline intercept, while LSDV3 checks how far an actual intercept is away from the averaged intercept. The null hypotheses of LSDV1 through LSDV3 are di^a = 0, di^a - d_dropped^a = 0, and di^a - (1/d)Σdj^a = 0, respectively.

Therefore, you may not conclude, for example, that the intercept of the first group (telecommunications) is statistically significant, or that the parameter of d1 is not zero, by referring to the t-test of LSDV2 (t=-3.55 and p<.001). That t-test just tells us that the intercept of telecommunications firms is substantially different from that of equipment and software companies; it does not tell whether the intercept is close to zero or not, because the reference point is not zero. Instead, you need to look at the t-test in LSDV1. The small t statistic of .33 and large p-value of .745 in Table 7 do not allow us to reject the null hypothesis that the actual intercept of telecommunications firms is zero: d1^a = 0.

Although LSDV1 without the intercept is easy to interpret, it has serious problems in reporting goodness-of-fit measures (see Table 11). This approach reports wrong SSM and MSM, and thus wrong R2 and F test of the joint hypothesis that all dummy parameters are zero. However, LSDV1 reports correct SSE, MSE, DFerror, and standard errors of the parameter estimates. By contrast, LSDV2 and LSDV3 report correct information, at the cost of interpreting the dummy coefficients in a more complicated manner.

* This restriction reduces the number of parameters to be estimated, making the model identified.
** In SAS, the H0 needs to be rearranged as di^a - (1/(d-1))Σdj^a = 0, where j ≠ i.


Table 11. Comparing Statistics of the Three LSDVs

                       LSDV1       LSDV2      LSDV3
R2 and adjusted R2     Incorrect   Correct    Correct
F test                 Incorrect   Correct    Correct
Standard error of b    Correct     Correct    Correct
SSM/MSM                Incorrect   Correct    Correct
SSE/MSE                Correct     Correct    Correct
DFerror (4)            N-K         N-K        N-K

5.4 Software Issues

All data analysis software supports LSDV1 and LSDV2. Only SAS and LIMDEP support linear regression with restrictions; however, LIMDEP reports slightly different parameter estimates across the approaches. Although providing various econometric models, LIMDEP is not good for working with data sets. SAS and STATA have the TSCSREG procedure and the .xtreg command, respectively, to run fixed/random effect models without dummies. The TSCSREG procedure works only on panel data.

Table 12. Comparing Estimation of the Three LSDVs
             LSDV1                 LSDV2             LSDV3
SAS 9.1      REG w/ NOINT          REG               REG w/ RESTRICT
STATA 8.2    .regress w/ nocon     .regress          N/A
LIMDEP 8.0   Regress w/o ONE       Regress w/ ONE    Regress w/ CLS
R 2.xx       > lm() w/ -1          > lm()            N/A
SPSS 12.0    Regression w/ Origin  Regression        N/A

The following script runs LSDV3 using the RESTRICT statement of the REG procedure.

PROC REG;
  MODEL rd2002 = net2000 d1-d3;
  RESTRICT d1 + d2 + d3 = 0;
RUN;

The following STATA .xtreg command runs the fixed (within) effect panel data model.(5) Note that the i(type2) option specifies the independent unit, and that type2 is recoded from type so that it has values 1, 2, and 3 for the three firm types (d1 through d3).

. xtreg rd2002 net2000, fe i(type2)

6. Regression with Dummies: Two-Way LSDVs

The previous section addresses the one-way LSDV, in which only one group variable is considered. Now, let us move on to the two-way LSDV.

(4) The K denotes the sum of the number of dummy variables, regressors, and the intercept included in the model. The N is the total number of observations used in the regression model.
(5) Individual dummy coefficients need to be computed, and their standard errors should be corrected (adjusted).
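As an aside on the .xtreg, fe command above: the fixed (within) effect estimator demeans each variable by its group mean, which removes the group intercepts without estimating any dummy coefficients. A minimal sketch of that within transformation, with illustration data only (within_transform is a hypothetical helper name):

```python
# Subtract each observation's group mean from it ("within" transformation).
def within_transform(values, groups):
    sums = {}
    for g, v in zip(groups, values):
        sums.setdefault(g, []).append(v)
    means = {g: sum(vs) / len(vs) for g, vs in sums.items()}
    return [v - means[g] for g, v in zip(groups, values)]

# Two groups with means 11 and 22; demeaning removes those group effects,
# so the transformed values sum to zero within each group.
y = [10.0, 12.0, 20.0, 24.0]
g = ["a", "a", "b", "b"]
demeaned = within_transform(y, g)
```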


6.1 Data Structure and Estimation

A new group variable is the area of a firm's ownership. Here is another set of three dummy variables: g1, g2, and g3. The g1 is set to 1 if a firm is owned by an Asian country and 0 otherwise. Similarly, g2 and g3 are coded for European and American companies, respectively. Look at the data structure.

Table 13. Data Structure of the Two-Way LSDV
+----------------------------------------------------------------------------+
| firm        type            d1   d2   d3   area      g1   g2   g3          |
|----------------------------------------------------------------------------|
| Samsung     Electronics      0    1    0   Asia       1    0    0          |
| AT&T        Telecom          1    0    0   America    0    0    1          |
| IBM         IT Equipment     0    0    1   America    0    0    1          |
| Siemens     Electronics      0    1    0   Europe     0    1    0          |
| Verizon     Telecom          1    0    0   America    0    0    1          |
| Microsoft   Service & S/W    0    0    1   America    0    0    1          |
| EDS         Service & S/W    0    0    1   America    0    0    1          |
| …           …                …    …    …   …          …    …    …          |
+----------------------------------------------------------------------------+

Now, our model becomes a little bit messy, since it has six dummy variables. In order to avoid perfect multicollinearity, we have to 1) omit two dummy variables, one from each set; 2) omit one dummy variable for ownership area and impose a restriction on firm type; 3) omit one dummy variable for firm type and impose a restriction on ownership area; or 4) impose two restrictions, one for firm type and the other for ownership area. Note that you must not omit the intercept in the two-way fixed effect model. The following is the simplest approach, which omits two dummy variables.

research_i = β0 + β1*income_i + δ1*d1_i + δ2*d2_i + γ1*g1_i + γ2*g2_i + ε_i

Table 14. Two-Way Fixed Effect Model (LSDV2)
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  5,    33) =    6.19
       Model |  47996204.2     5  9599240.84           Prob > F      =  0.0004
    Residual |  51167501.4    33  1550530.35           R-squared     =  0.4840
-------------+------------------------------           Adj R-squared =  0.4058
       Total |  99163705.6    38   2609571.2           Root MSE      =  1245.2
------------------------------------------------------------------------------
      rd2002 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .3008584   .0830277     3.62   0.001     .1319374    .4697795
          d1 |  -2446.278   579.9832    -4.22   0.000    -3626.262   -1266.293
          d2 |   -923.931    503.678    -1.83   0.076    -1948.672    100.8097
          g1 |   1375.542   579.5446     2.37   0.024     196.4499    2554.635
          g2 |   907.2314   570.3879     1.59   0.121    -253.2315    2067.694
       _cons |   1440.654    474.693     3.03   0.005     474.8843    2406.424
------------------------------------------------------------------------------

Note that this model has many parameters to be estimated, compared to the number of observations available. We can draw nine regression equations depending on the combinations of three firm types and three areas of ownership: 9 = 3 x 3.


(1) research_i = β0 + β1 income_i + ε_i (American equipment & S/W firms)
(2) research_i = β0 + β1 income_i + δ1 + ε_i (American telecom. firms)
…
(8) research_i = β0 + β1 income_i + δ2 + γ1 + ε_i (Asian electronics firms)
(9) research_i = β0 + β1 income_i + δ2 + γ2 + ε_i (European electronics firms)

For example, the regression equation for Asian telecommunication companies is
Research = 369.918 + .301*Income, where 369.918 = 1,440.654 - 2,446.278 + 1,375.542.

6.2 Full Model versus Restricted Model

Let us call this two-way fixed effect model the full (or unrestricted) model. We have four restricted (nested) models, each with a different subset of the independent variables. Note that models (3) and (4), which include a full set of dummies, should be estimated by one of the LSDV approaches.

(1) no fixed effect at all: research_i = β0 + β1 income_i + ε_i (Model 1)
(2) type effect only: research_i = β0 + β1 income_i + δ d_i + ε_i (Model 2)
(3) type effect only: research_i = β0 + β1 income_i + δ1 d1_i + δ2 d2_i + δ3 d3_i + ε_i
(4) area effect only: research_i = β0 + β1 income_i + γ1 g1_i + γ2 g2_i + γ3 g3_i + ε_i

Table 15. Fixed Area Effect Model
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  3,    35) =    3.02
       Model |  20395250.9     3  6798416.97           Prob > F      =  0.0426
    Residual |  78768454.7    35  2250527.28           R-squared     =  0.2057
-------------+------------------------------           Adj R-squared =  0.1376
       Total |  99163705.6    38   2609571.2           Root MSE      =  1500.2
------------------------------------------------------------------------------
      rd2002 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .2930783    .097524     3.01   0.005     .0950941    .4910626
          g1 |   788.3469   633.0243     1.25   0.221    -496.7608    2073.455
          g2 |  -29.29548   631.1481    -0.05   0.963    -1310.594    1252.003
       _cons |   996.9815   550.7665     1.81   0.079     -121.134    2115.097
------------------------------------------------------------------------------

Which one is the best model? It is not a good idea to compare the overall F statistics or the t-tests of individual parameter estimates. Instead, we may use the so-called incremental F-test to examine the change in goodness of fit between the full (unrestricted) model and a restricted model (Greene 2000; Fox 1997). This F-test requires the sum of squared errors (SSE), e'e, of the unrestricted and restricted models. The null hypothesis is that the parameters of the added regressors (the dummies here) are all zero (e.g., H0: δ1 = δ2 = δ3 = 0).

The formula of the F-test is

F(J, N-K) = [(e*'e* - e'e)/J] / [e'e/(N-K)] = [(R² - R*²)/J] / [(1 - R²)/(N-K)],


where e*'e* and R*² are respectively the SSE and R² of the restricted model. J is the number of dummy variables actually taken out of the full model (e.g., 2 for restricted models (3) and (4)). Keep in mind that R² in an LSDV model without the intercept is not well defined, so DO NOT plug that R² into the second formula! Let us compare the full model (Table 14) and the fixed area effect model (Table 15). The F statistic of 8.9005 is large enough to reject the null hypothesis (p<.0008), signaling the superiority of the full model: adding the two type dummies reduces the SSE (e'e) substantially.

F(2, 33) = [(78,768,454.7 - 51,167,501.4)/2] / [51,167,501.4/(39-6)] = 8.9005

Now consider the full model versus the fixed type effect model (Table 8). The small F statistic indicates that the full model does not improve the goodness of fit significantly by including the two area dummies (p<.0633). Thus, we do not reject the null hypothesis, favoring the restricted model.

F(2, 33) = [(60,484,956.6 - 51,167,501.4)/2] / [51,167,501.4/(39-6)] = 3.0046

How do we compare the fixed type effect models in Table 3, with one dummy (Model 2), and in Table 8, with two dummy variables? In this case, the model with two dummies becomes the full model. The large F statistic allows us to reject the null hypothesis in favor of the full model with two dummies (p<.0080).

F(1, 35) = [(74,175,756.7 - 60,484,956.6)/1] / [60,484,956.6/(39-4)] = 7.9223

7. Regression with Threshold Effect

Let us consider the fixed effect of academic degrees, grouped into Ph.D., Masters' degree, B.A., and diploma. In general, B.A. degree holders have a diploma as well; Masters' degree holders have a B.A. degree as well as a diploma; and so forth. The degree effect is cumulative; this is called a threshold effect. Suppose we want to know the threshold effects of academic degree on annual income. Note that one of the dummy variables, say t1, needs to be dropped in order to avoid perfect multicollinearity.
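The cumulative coding just described can be sketched in code; under threshold coding, a degree turns on its own dummy and every dummy below it (the scheme matches Table 16):

```python
# Degrees ordered from lowest to highest threshold
DEGREES = ["Diploma", "B.A.", "Masters", "Ph.D."]

def ordinary_dummies(degree):
    """Ordinary LSDV coding: only the holder's own degree dummy is 1."""
    i = DEGREES.index(degree)
    return [1 if j == i else 0 for j in range(len(DEGREES))]

def threshold_dummies(degree):
    """Cumulative coding: the degree's own dummy and all lower ones are 1."""
    i = DEGREES.index(degree)
    return [1 if j <= i else 0 for j in range(len(DEGREES))]

print(ordinary_dummies("Masters"))   # [0, 0, 1, 0]  -> d1..d4
print(threshold_dummies("Masters"))  # [1, 1, 1, 0]  -> t1..t4
```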

income_i = β0 + β1 effort_i + τ2 t2_i + τ3 t3_i + τ4 t4_i + ε_i

The data structure for the threshold effect model is different from that of ordinary LSDVs. In Table 16, compare d1 through d4 with t1 through t4 to check how differently they recode


academic degree. For the Masters' degree, for instance, only d3 is set to 1, while t1 through t3 are all coded 1.

Table 16. Data Structure for Threshold Effect Model
+---------------------------------------------------------------------------+
|  income    effort   degree    d1  d2  d3  d4   t1  t2  t3  t4             |
|---------------------------------------------------------------------------|
|  13.242   1.44977   Diploma    1   0   0   0    1   0   0   0             |
|  32.983   1.01713   B.A.       0   1   0   0    1   1   0   0             |
|  47.962    .67178   Masters    0   0   1   0    1   1   1   0             |
|  52.048   2.11554   Ph.D.      0   0   0   1    1   1   1   1             |
|  50.528   2.55896   B.A.       0   1   0   0    1   1   0   0             |
|  17.179    .68774   Ph.D.      0   0   0   1    1   1   1   1             |
|    …         …         …       …   …   …   …    …   …   …   …             |
+---------------------------------------------------------------------------+

There are four regression equations depending on the degree. They share the same slope, of course. Note that the intercepts are cumulative, in the sense that they are β0 (actually τ1); τ1 + τ2; τ1 + τ2 + τ3; and τ1 + τ2 + τ3 + τ4, respectively.

(1) income_i = β0 + β1 effort_i + τ2 + τ3 + τ4 for the Ph.D. degree holders
(2) income_i = β0 + β1 effort_i + τ2 + τ3 for the Masters' degree holders
(3) income_i = β0 + β1 effort_i + τ2 for the B.A. degree holders
(4) income_i = β0 + β1 effort_i for the diploma holders

It is notable that each τ captures the marginal value of an academic degree. For example, τ3 is the marginal value of the Masters' degree: we may say that Masters' degree holders on average earn τ3 more income than B.A. degree holders, holding all else constant.

8. Regression with Interaction Effect

We have so far discussed regression models with dummy variables that share the same slope; the only differences across groups lie in the intercepts. Now we move on to regression models with dummies that allow different slopes and/or intercepts.

8.1 Regression with Different Slope and Intercept

Let us reconsider Model 2 discussed in section 2. This time we add one regressor, an interaction term between net income in 2000 and the dummy variable. Now we have a revised regression model.

research_i = β0 + β1 income_i + β2 inc_d_i + δ d_i + ε_i

The interaction term is the product of the regressor net2000 and the dummy variable d: inc_d = net2000 * d. Note that the interaction term is identical to net2000 if the dummy variable is 1, and is zero otherwise.
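Constructing the interaction term is a simple element-wise product; a sketch using a few of the rows shown in Table 17:

```python
# A few rows of the data: (firm, net2000, d)
rows = [
    ("Samsung",   4768, 0),
    ("IBM",       8093, 1),
    ("Microsoft", 9421, 1),
]

# inc_d = net2000 * d: equals net2000 for equipment & S/W firms, 0 otherwise
inc_d = [net2000 * d for _, net2000, d in rows]
print(inc_d)  # [0, 8093, 9421]
```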


This model has two regression equations with different slopes and intercepts. You may compare them with those in section 3.

Equipment and Software:    Research = 2047.062 + .255*income (see footnote 6)
Telecom. and Electronics:  Research = 1181.956 + .198*income

Table 17. Data Structure for Interaction Effect Model
+-------------------------------------------------------------------+
| firm        type            rd2002   net2000    inc_d   d         |
|-------------------------------------------------------------------|
| Samsung     Electronics      2,500     4,768        0   0         |
| AT&T        Telecom            254     4,669        0   0         |
| IBM         IT Equipment     4,750     8,093     8093   1         |
| Siemens     Electronics      5,490     6,528        0   0         |
| Verizon     Telecom              .    11,797        0   0         |
| Microsoft   Service & S/W    4,307     9,421     9421   1         |
| EDS         Service & S/W        0     1,143     1143   1         |
|   …             …                …         …        …   …         |
+-------------------------------------------------------------------+

The interaction effect turns out statistically insignificant at the .05 level (p=.738). Thus, we conclude that the slope for equipment and software companies is not substantially different from that of telecommunications and electronics firms. However, you may not conclude that the intercept of equipment and software firms is statistically insignificant (close to zero) because of the small t statistic (p=.186); remember that this parameter indicates the difference between the actual intercepts of the two types of firms.

Table 18. Regression Model with Interaction Effect 1
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  3,    35) =    3.98
       Model |  25227993.1     3  8409331.02           Prob > F      =  0.0153
    Residual |  73935712.5    35  2112448.93           R-squared     =  0.2544
-------------+------------------------------           Adj R-squared =  0.1905
       Total |  99163705.6    38   2609571.2           Root MSE      =  1453.4
------------------------------------------------------------------------------
      rd2002 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .1975142   .1015407     1.95   0.060    -.0086244    .4036527
       net_d |   .0571731   .1696052     0.34   0.738    -.2871437    .4014899
           d |   865.1056    641.755     1.35   0.186    -437.7264    2167.938
       _cons |   1181.956   376.7763     3.14   0.003     417.0599    1946.853
------------------------------------------------------------------------------

8.2 Regression with Different Slope and the Same Intercept

Now, exclude the dummy variable so that only the regressor and the interaction term remain in the model. This model produces two regression equations with different slopes and the same intercept, which is less likely in the real world.

6 The equation for equipment & software firms is research_i = (β0 + δ) + (β1 + β2) income_i + ε_i. Thus, the intercept is 2,047.062 = 1,181.956 + 865.1056 and the slope is .255 = .1975142 + .0571731.
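The footnote's arithmetic can be verified directly from the Table 18 estimates:

```python
# Table 18 coefficients: _cons, d, net2000, net_d
intercept_base, d_coef = 1181.956, 865.1056
slope_base, interaction = 0.1975142, 0.0571731

# Equipment & software firms: the intercept shifts by d, the slope by net_d
print(round(intercept_base + d_coef, 3))   # 2047.062
print(round(slope_base + interaction, 3))  # 0.255
```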


research_i = β0 + β1 income_i + β2 inc_d_i + ε_i

Equipment and Software:    Research = 1480.15 + .353*income (see footnote 7)
Telecom. and Electronics:  Research = 1480.15 + .146*income

The t statistic of 1.59 for the interaction term indicates that there is no statistically significant interaction effect (p=.120). Note that the SEE, the square root of MSE, becomes larger than that of any other model discussed so far.

Table 19. Regression Model with Interaction Effect 2
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  2,    36) =    4.95
       Model |  21389278.1     2  10694639.1           Prob > F      =  0.0126
    Residual |  77774427.5    36  2160400.76           R-squared     =  0.2157
-------------+------------------------------           Adj R-squared =  0.1721
       Total |  99163705.6    38   2609571.2           Root MSE      =  1469.8
------------------------------------------------------------------------------
      rd2002 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .1463857   .0952541     1.54   0.133    -.0467987     .33957
       net_d |   .2067402   .1297268     1.59   0.120    -.0563579    .4698383
       _cons |    1480.15   308.4474     4.80   0.000     854.5894     2105.71
------------------------------------------------------------------------------

Figure 3 compares the two regression models with interaction effects. The left plot depicts regression equations with different slopes and intercepts; the regression equations on the right have different slopes but the same intercept.

Figure 3. Regression Model with Interaction Effect

7 The equation for equipment & software firms is research_i = β0 + (β1 + β2) income_i + ε_i. Thus, the slope is .353 = .1463857 + .2067402.

8.3 Limitation and Further Direction

The regression model with interaction effects is likely to suffer from multicollinearity, because an interaction term tends to be highly correlated with its dummy variable.8 As


many interaction terms are included in the model, the multicollinearity problem is likely to become more severe. If two groups show different disturbance variances, the pooled regression may result in a biased estimate of the disturbance variance and an incorrect estimate of the covariance matrix (Greene 2000: 323). This is the case for the groupwise heteroscedasticity model, an example of a random group effect model for panel data.

9. Spline Regression (Greene 2000: 324)

What is a spline? Smith (1979) put it as follows: "Splines are generally defined to be piecewise polynomials of degree n whose function values and first n-1 derivatives agree at points where they join. The abscissas of these joint points are called knots. Polynomials may be considered a special case of splines with no knots, and piecewise polynomials (sometimes also called grafted or segmented polynomials) with fewer than the maximum number of continuity restrictions may also be considered splines. The number and degrees of the polynomial pieces and the number and position of knots may vary in different situations" (Smith 1979: 57).

Suppose we know some threshold values, say ages 19 and 27, which significantly change the intercepts and slopes in the corresponding intervals of an independent variable. We need two dummy variables, d1 and d2. Individuals younger than 19, for example, are coded 0 in both d1 and d2; for those between 19 and 27, only d1 is set to 1; and those older than 27 have 1 in both d1 and d2. The regression equation will be

income_i = β0 + β1 age_i + δ1 d1_i + γ1 d1_i age_i + δ2 d2_i + γ2 d2_i age_i + ε_i. That is,

income_i = β0 + β1 age_i + ε_i for those younger than 19
income_i = (β0 + δ1) + (β1 + γ1) age_i + ε_i for those between 19 and 27
income_i = (β0 + δ1 + δ2) + (β1 + γ1 + γ2) age_i + ε_i for those older than 27

We need two conditions to make the regression function continuous.

(1) β0 + β1 t1* = (β0 + δ1) + (β1 + γ1) t1* at the age of 19
(2) (β0 + δ1) + (β1 + γ1) t2* = (β0 + δ1 + δ2) + (β1 + γ1 + γ2) t2* at the age of 27

Note that t1* and t2* represent the threshold values, often called knots (19 and 27 in this case).

8 Interaction is different from correlation in the sense that regressors may jointly affect the dependent variable whether or not they are correlated (Fox 1997).


From (1): δ1 + γ1 t1* = 0, or δ1 = -γ1 t1*
From (2): δ2 + γ2 t2* = 0, or δ2 = -γ2 t2*

Then plug these two results into the original regression model:

income_i = β0 + β1 age_i - γ1 t1* d1_i + γ1 d1_i age_i - γ2 t2* d2_i + γ2 d2_i age_i + ε_i
         = β0 + β1 age_i + γ1 d1_i (age_i - t1*) + γ2 d2_i (age_i - t2*) + ε_i

As shown in the last equation, we have to create two new variables: one for d1_i (age_i - t1*) and the other for d2_i (age_i - t2*). Finally, run OLS to estimate the spline regression model. We may test hypotheses about the knots: γ1 = 0, γ2 = 0, or γ1 = γ2 = 0. The SAS script for this spline regression will be:

PROC REG;
   MODEL income = age age19 age27;  /* age19 = d1*(age-19), age27 = d2*(age-27) */
   TEST age19=0, age27=0;           /* joint test of gamma1 = gamma2 = 0 */
RUN;

10. Conclusion

Using dummy variables in regression analysis is useful for capturing fixed and random effects. The technique can explain how group or time differences affect a model. However, it must be used with caution. First, keep in mind that each LSDV approach implies a different interpretation of the dummy parameters, and that the t-tests have different null hypotheses; otherwise you may be misled and end up with wrong conclusions. Second, be parsimonious: minimize the number of dummies, especially when you do not have many observations, to avoid the problem of "many parameters, small sample size." Try to hit the highlights, focusing on your main arguments. The third point is related to the second: be careful not to be caught in the "dummy variable trap" of perfect multicollinearity. As you include more dummies, the likelihood of getting into trouble increases sharply. Finally, do not try to compare "monkey and lemon." Categories should have something in common with each other so that the comparison is meaningful from an analytic and theoretical perspective. Comparing an apple and a pear is better than contrasting an apple and an onion. By the same token, telecommunications versus electronics firms makes much more sense than telecommunications firms versus universities.


References

Baltagi, Badi H. 2001. Econometric Analysis of Panel Data. John Wiley & Sons.
Fox, John. 1997. Applied Regression Analysis, Linear Models, and Related Methods. Newbury Park, CA: Sage.
Freund, Rudolf J., and Ramon C. Littell. 2000. SAS System for Regression, 3rd ed. Cary, NC: SAS Institute.
Greene, William H. 2000. Econometric Analysis, 4th ed. Prentice Hall.
SAS Institute. 2004. SAS 9.1 User's Guide. Cary, NC: SAS Institute. http://www.sas.com/
Smith, Patricia L. 1979. "Splines as a Useful and Convenient Statistical Tool." American Statistician 33(2) (May): 57-62.
STATA Press. 2003. STATA Base Reference Manual, Release 8. College Station, TX: STATA Press. http://www.stata.com/
STATA Press. 2003. STATA Cross-Sectional Time-Series Reference Manual, Release 8. College Station, TX: STATA Press. http://www.stata.com/

http://mypage.iu.edu/~kucc625/documents/Panel_Data_Models.pdf
http://socserv.socsci.mcmaster.ca/jfox/Courses/soc740/lecture-5.pdf