
2.6 One-Way Analysis of Variance

Let us group social setting into categories.

First we will make a copy, which I'll call setting_g for social setting grouped. (Everyone has their own conventions for naming variables. I try to keep variable names short, lowercase, and hopefully not too cryptic. Because we are just starting I will emphasize the 'not too cryptic' part, otherwise I might have used ssg. Stata allows variable names to have up to 32 characters, but most commands print only 12, so it is best to stick to a maximum of 12.)

. generate setting_g = setting

Then we recode it into categories <70, 70-79, and 80+, thus creating a discrete factor with three levels.

. recode setting_g min/69=1 70/79=2 80/max=3
(setting_g: 20 changes made)
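
One possible check on the recode is to look at the range of the original variable within each new category. This is only a sketch, assuming the data are still in memory with setting and setting_g as created above:

. tabstat setting, by(setting_g) statistics(min max n)   // range of setting within each group

. assert inrange(setting_g, 1, 3) if !missing(setting)   // every non-missing setting received a code

If the recode worked as intended, the minimum and maximum in each row should fall inside the corresponding interval, and the assert should complete silently.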

It might be a good idea to label the new variable and its categories. I will define a new set of labels called setting_g and assign it to the values of the variable. The names of the variable and the label don't have to be the same. For example, one could have a label called yesno assigned to the values of all variables that take "Yes" and "No" values. In this case it makes sense to use the same name.

. label var setting_g "Social Setting (Grouped)"
 
. label define setting_g 1 "Low" 2 "Medium" 3 "High"
 
. label values setting_g setting_g

By the way, one can shorten this process using options of the recode command, as shown in Section 2.7 of this log, but I think it's good to see all the steps once.
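
As a rough sketch of what such a shortcut might look like, recode can create the copy, define the value labels and attach them in a single command. The name setting_g2 is hypothetical, chosen here only to avoid clashing with the variable and label already defined:

. recode setting (min/69=1 "Low") (70/79=2 "Medium") (80/max=3 "High"), generate(setting_g2) label(setting_g2)

. label var setting_g2 "Social Setting (Grouped)"

. assert setting_g2 == setting_g     // same grouping as the step-by-step version

The generate() option leaves the original variable untouched, and label() names the value label that recode builds from the quoted strings.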

Let us look at the mean response by level of social setting:

. tabulate setting_g, summarize(change)
 
     Social | Summary of % Change in CBR between
    Setting |            1965 and 1975
  (Grouped) |        Mean   Std. Dev.       Freq.
------------+------------------------------------
        Low |   7.5714286   7.3452284           7
     Medium |         8.6   9.9398189           5
       High |       23.75   10.264363           8
------------+------------------------------------
      Total |        14.3   11.810343          20

We see substantially more fertility decline in countries with higher setting, but only a small difference between the low and medium categories.

Dummy Variables

Stata has an anova command that can fit linear models with discrete factors as predictors. We will use regress instead, to emphasize that all these models are in fact regression models. This will help us along when we move on to logit and Poisson models, which no longer make this distinction.

This means that we need to code the factor using dummy variables. This can be done in three different ways:

Let us start with the third way, which is to create the dummy variables ourselves. To represent a factor with three categories we need only two dummy variables. I will choose low setting (< 70) as the reference category and create dummies for medium (70-79) and high (80+):

. gen setting_med  = setting_g==2 // or setting >= 70 & setting < 80
 
. gen setting_high = setting_g==3 // or setting >= 80 & !missing(setting)

We could have coded the conditions in terms of the original variable as shown in the comments above, with exactly the same result. I probably would have used that approach if the dummies were called setting70to79 and setting80plus.
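
One way to convince yourself that the two codings agree is to assert the equivalence directly. This is a sketch that assumes setting, setting_g, setting_med and setting_high exist as created above:

. assert setting_med  == (setting >= 70 & setting < 80)

. assert setting_high == (setting >= 80 & !missing(setting))

. tabulate setting_g setting_med      // the 1s should fall only in the Medium row

The !missing(setting) guard matters because Stata treats missing values as larger than any number, so setting >= 80 on its own would be true for a missing value.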

We are now ready to fit the one-factor model:

. regress change setting_med setting_high
 
      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  2,    17) =    6.97
       Model |  1193.78571     2  596.892857           Prob > F      =  0.0062
    Residual |  1456.41429    17  85.6714286           R-squared     =  0.4505
-------------+------------------------------           Adj R-squared =  0.3858
       Total |      2650.2    19  139.484211           Root MSE      =  9.2559
 
------------------------------------------------------------------------------
      change |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 setting_med |   1.028571   5.419692     0.19   0.852    -10.40598    12.46312
setting_high |   16.17857   4.790376     3.38   0.004     6.071761    26.28538
       _cons |   7.571429   3.498396     2.16   0.045     .1904579     14.9524
------------------------------------------------------------------------------

We see that in countries with high setting fertility declined, on average, 16 percentage points more than in countries with low setting. Can you get these estimates from the means calculated earlier? Compare the parameter estimates with the values in Table 2.11 and the anova with Table 2.12 in the notes.
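
The arithmetic behind these estimates can be sketched with display, using the group means from the tabulate output above. The constant is the mean of the reference (low) group, 7.5714286, and each coefficient is the difference between a group's mean and that value:

. display 8.6 - 7.5714286        // medium minus low, about 1.03

. display 23.75 - 7.5714286      // high minus low, about 16.18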

The test command can be used to generate the Wald test on page 32 of the notes. Stata automatically converts the criterion to an F-test for linear models. The result is, of course, the same as in the anova table: the differences by setting are significant at the one-percent level.

. test setting_med setting_high
 
 ( 1)  setting_med = 0
 ( 2)  setting_high = 0
 
       F(  2,    17) =    6.97
            Prob > F =    0.0062
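
The F statistic can also be rebuilt from the anova table as the ratio of the model and residual mean squares. A minimal sketch, assuming the regress fit above is still the active estimation result, so that the stored results e(mss), e(rss), e(df_m), e(df_r) and e(F) refer to it:

. display (e(mss)/e(df_m)) / (e(rss)/e(df_r))   // ratio of mean squares, 596.892857/85.6714286

. display Ftail(e(df_m), e(df_r), e(F))         // upper-tail probability of F(2, 17)

. display e(mss) / (e(mss) + e(rss))            // R-squared as a share of the total SS

These should agree with the F of 6.97, the p-value of 0.0062 and the R-squared of 0.4505 reported in the output.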

Factor Variables

We now show how to obtain exactly the same results using Stata's factor variables. The idea here is to use an i. prefix in the list of regressors to tell Stata when a predictor is in fact a discrete factor, which takes integer codes such as 0,1,2,…, and should be treated as a set of dummies rather than a linear effect.

Stata will then fit the model picking the lowest code as the reference cell. You can change the base category using ib#. instead of just i. as the prefix, where # is the code for the reference category. Thus, i.setting_g treats grouped setting as a factor with low as the baseline, whereas ib3.setting_g sets high as the baseline.
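
For instance, a sketch of the same model with high setting as the reference cell, assuming setting_g is coded 1, 2, 3 as above, would be

. regress change ib3.setting_g

The fit itself is unchanged; only the parametrization differs. Given the group means shown earlier, the constant should now be the mean of the high group, 23.75, and the coefficient for level 1 should be -16.17857, the negative of the high-versus-low contrast.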

Here's the regression with the default reference cell:

. regress change i.setting_g
 
      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  2,    17) =    6.97
       Model |  1193.78571     2  596.892857           Prob > F      =  0.0062
    Residual |  1456.41429    17  85.6714286           R-squared     =  0.4505
-------------+------------------------------           Adj R-squared =  0.3858
       Total |      2650.2    19  139.484211           Root MSE      =  9.2559
 
------------------------------------------------------------------------------
      change |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   setting_g |
          2  |   1.028571   5.419692     0.19   0.852    -10.40598    12.46312
          3  |   16.17857   4.790376     3.38   0.004     6.071761    26.28538
             |
       _cons |   7.571429   3.498396     2.16   0.045     .1904579     14.9524
------------------------------------------------------------------------------

As you can see the results are exactly the same as before. The price to pay for this extraordinary convenience is that you have to remember what the codes 2 and 3 stand for. Here it's easy because the categories are ordered, but things can get trickier with variables such as ethnicity or type of health care provider.
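
If you forget the coding, the value label can be listed at any time. A small sketch, assuming the label setting_g defined earlier is attached to the variable:

. label list setting_g

. tabulate setting_g, nolabel     // the same table with numeric codes instead of labels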

You can even do the Wald test quite easily, but there is a twist. You can't test i.setting_g, which is what I first tried, because i.setting_g is not a term (or single variable) in the model. There is, however, an alternative command called testparm that lets you specify a variable list and then tests all the corresponding coefficients (read it as "test the parameters of…"). So the solution is

. testparm i.setting_g
 
 ( 1)  2.setting_g = 0
 ( 2)  3.setting_g = 0
 
       F(  2,    17) =    6.97
            Prob > F =    0.0062

As you can see from the output, Stata names the coefficients of a factor variable using the number of the level followed by a dot and the name of the factor, as in 2.setting_g. You could reproduce this F-test using the command test 2.setting_g 3.setting_g, which works fine because these are terms (single variables) in the model.

On a related matter, Stata stores the coefficients in a matrix called e(b), and you can list them using mat list e(b). This is how I first discovered the names of the coefficients representing factor variables.
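
A few related commands may help here. This is a sketch, assuming the factor-variable regression above is the active estimation result:

. matrix list e(b)

. regress, coeflegend                    // replay the fit showing the _b[...] names

. display _b[3.setting_g]                // high-versus-low coefficient

. lincom 3.setting_g - 2.setting_g       // high-versus-medium contrast

Once you know the coefficient names, they can be used in expressions and in post-estimation commands such as lincom, which here estimates the difference between the high and medium categories along with its standard error.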

Exercise: Obtain the parameter estimates and anova table for the model with family planning effort grouped into three categories: 0-4, 5-14 and 15+, labeled weak, moderate and strong.


Continue with 2.7 Two-Way Analysis of Variance