2.6 One-Way Analysis of Variance
Let us group social setting into categories.
First we will make a copy, which I'll call setting_g
for social setting grouped. (Everyone has their own conventions
for naming variables. I try to keep variable names short, lowercase,
and hopefully not too cryptic.
Because we are just starting I will emphasize the 'not too cryptic'
part, otherwise I might have used ssg.
Stata allows variable names to have up to 32 characters, but most
commands print only 12, so it is best to stick to a maximum of 12.)
. generate setting_g = setting
Then we recode it into categories <70, 70-79, and 80+, thus creating a discrete factor with three levels.
. recode setting_g min/69=1 70/79=2 80/max=3
(setting_g: 20 changes made)
It might be a good idea to label the new variable and its categories.
I will define a new set of labels called setting_g and
assign it to the values of the variable. The names of the variable
and the label don't have to be the same. For example one could have
a label called yesno assigned to the values of all
variables that take "Yes" and "No" values. In this case it makes
sense to use the same name.
. label var setting_g "Social Setting (Grouped)"
. label define setting_g 1 "Low" 2 "Medium" 3 "High"
. label values setting_g setting_g
By the way one can shorten this process using options of the
recode command as shown in
Section 2.7 in this log,
but I think it's good to see all the steps once.
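As a rough sketch of that shortcut, the three steps could be collapsed into a single recode using its generate() and label() options. The new variable name setting_g2 is hypothetical, chosen only so it does not clash with the variable we just created.
. recode setting (min/69=1 "Low") (70/79=2 "Medium") (80/max=3 "High"), generate(setting_g2) label(setting_g2)
. label var setting_g2 "Social Setting (Grouped)"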
Let us look at the mean response by level of social setting:
. tabulate setting_g, summarize(change)
Social | Summary of % Change in CBR between
Setting | 1965 and 1975
(Grouped) | Mean Std. Dev. Freq.
------------+------------------------------------
Low | 7.5714286 7.3452284 7
Medium | 8.6 9.9398189 5
High | 23.75 10.264363 8
------------+------------------------------------
Total | 14.3 11.810343 20
We see substantially more fertility decline in countries with higher setting, but only a small difference between the low and medium categories.
Dummy Variables
Stata has an anova command that can fit linear models
with discrete factors as predictors.
We will use regress instead, to emphasize that all
these models are in fact regression models.
This will help us along when we move on to logit and Poisson models,
which no longer make this distinction.
This means that we need to code the factor using dummy variables. This can be done in three different ways:
- Stata 11 introduced factor variables, a very powerful way of specifying main effects and interactions in regression models. This supersedes the xi: prefix that could be used with commands such as regress in versions 10 and earlier. This is the simplest and quickest way to proceed, but the results are not labeled. We'll have an example soon.
- A second way is to generate the dummies using the gen option of the tabulate command. This generates a dummy or indicator variable for each category of the factor. You specify a "stem" or prefix for the names and Stata adds a sequence number. This makes it very easy to generate dummies called setting_g1, setting_g2, etc.
- My preferred way is to generate the dummies "by hand" using gen, taking advantage of the fact that in Stata logical expressions take the value 1 when they are true and 0 when they are false. This leads to very readable code. Just one word of caution: you have to be careful with open-ended expressions such as if x > 100 because Stata stores missing values as very large numbers, so the expression is true if x = 200, but also true if x is missing. The safe way to code this condition is if x > 100 & !missing(x).
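For comparison, here is a minimal sketch of the second approach, followed by a variant of the missing-value precaution. The stem sg and the variable name setting_hi are hypothetical, used only for illustration.
. tabulate setting_g, generate(sg) // creates sg1, sg2 and sg3, one per category
. gen setting_hi = setting >= 80 if !missing(setting) // left missing, rather than set to 1, when setting is missing
The if qualifier in the second command leaves the dummy missing when setting is missing, whereas adding & !missing(setting) to the expression would set it to 0. Which behavior you want depends on the analysis.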
Let us start with the third way. To represent a factor with three categories we need only two dummy variables. I will choose low setting (< 70) as the reference category and create dummies for medium (70-79) and high (80+):
. gen setting_med = setting_g==2 // or setting >= 70 & setting < 80
. gen setting_high = setting_g==3 // or setting >= 80 & !missing(setting)
We could have coded the conditions in terms of the original variable
as shown in the comments above, with exactly the same result.
I probably would have used that approach if the dummies were called
setting70to79 and setting80plus.
We are now ready to fit the one-factor model:
. regress change setting_med setting_high
Source | SS df MS Number of obs = 20
-------------+------------------------------ F( 2, 17) = 6.97
Model | 1193.78571 2 596.892857 Prob > F = 0.0062
Residual | 1456.41429 17 85.6714286 R-squared = 0.4505
-------------+------------------------------ Adj R-squared = 0.3858
Total | 2650.2 19 139.484211 Root MSE = 9.2559
------------------------------------------------------------------------------
change | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
setting_med | 1.028571 5.419692 0.19 0.852 -10.40598 12.46312
setting_high | 16.17857 4.790376 3.38 0.004 6.071761 26.28538
_cons | 7.571429 3.498396 2.16 0.045 .1904579 14.9524
------------------------------------------------------------------------------
We see that in countries with high setting fertility declined, on average, 16 percentage points more than in countries with low setting. Can you get these estimates from the means calculated earlier? Compare the parameter estimates with the values in Table 2.11 and the anova with Table 2.12 in the notes.
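One way to check the answer to that question is to do the arithmetic with display, using the group means from the tabulate output above. This is only a sketch. The constant is the mean for low setting, and each coefficient is the difference between a group mean and that baseline.
. display 8.6 - 7.5714286 // medium minus low, compare with the coefficient of setting_med
. display 23.75 - 7.5714286 // high minus low, compare with the coefficient of setting_high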
The test command can be used to generate the Wald test
on page 32 of the notes. Stata automatically converts the criterion
to an F-test for linear models. The result is, of course, the same
as in the anova table: the differences by setting are significant
at the one-percent level.
. test setting_med setting_high
( 1) setting_med = 0
( 2) setting_high = 0
F( 2, 17) = 6.97
Prob > F = 0.0062
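As a cross-check, the same F-test should be available directly from the anova command mentioned earlier. This sketch assumes Stata 11 or later, where anova treats predictors as categorical by default.
. anova change setting_g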
Factor Variables
We now show how to obtain exactly the same results using Stata's
factor variables. The idea here is to use an i.
prefix in the list of regressors to tell Stata when a predictor is in
fact a discrete factor, which takes integer codes such as 0,1,2,…,
and should be treated as a set of dummies rather than a linear effect.
Stata will then fit the model picking the lowest code as the reference
cell. You can change the base category using ib#.
instead of just i. as the prefix, where #
is the code for the reference category. Thus, i.setting_g
treats grouped setting as a factor with low as the baseline, whereas
ib3.setting_g sets high as the baseline.
Here's the regression with the default reference cell:
. regress change i.setting_g
Source | SS df MS Number of obs = 20
-------------+------------------------------ F( 2, 17) = 6.97
Model | 1193.78571 2 596.892857 Prob > F = 0.0062
Residual | 1456.41429 17 85.6714286 R-squared = 0.4505
-------------+------------------------------ Adj R-squared = 0.3858
Total | 2650.2 19 139.484211 Root MSE = 9.2559
------------------------------------------------------------------------------
change | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
setting_g |
2 | 1.028571 5.419692 0.19 0.852 -10.40598 12.46312
3 | 16.17857 4.790376 3.38 0.004 6.071761 26.28538
|
_cons | 7.571429 3.498396 2.16 0.045 .1904579 14.9524
------------------------------------------------------------------------------
As you can see the results are exactly the same as before. The price to pay for this extraordinary convenience is that you have to remember what the codes 2 and 3 stand for. Here it's easy because the categories are ordered, but things can get trickier with variables such as ethnicity or type of health care provider.
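To illustrate the ib#. prefix described above, here is a sketch of the same model with high setting as the reference cell.
. regress change ib3.setting_g // high setting (code 3) as the baseline
With this coding the constant is the mean for high setting, and the coefficients for levels 1 and 2 are the differences between the low and medium means and that baseline. The F-test for the factor does not change.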
You can even do the Wald test quite easily, but there is a twist.
You can't test i.setting_g, which is what I first
tried, because i.setting_g is not a term (or single
variable) in the model. There is, however, an alternative command
called testparm that lets you specify a variable list
and then tests all the corresponding coefficients (read it as
"test the parameters of…"). So the solution is
. testparm i.setting_g
( 1) 2.setting_g = 0
( 2) 3.setting_g = 0
F( 2, 17) = 6.97
Prob > F = 0.0062
As you can see from the output, Stata names the coefficients of a
factor variable using the number of the level followed by a dot and
the name of the factor, as in 2.setting_g.
You could reproduce this F-test using the command
test 2.setting_g 3.setting_g, which works fine
because these are terms (single variables) in the model.
On a related matter, Stata stores the coefficients in a matrix
called e(b), and you can list them using
mat list e(b). This is how I first discovered
the names of the coefficients representing factor variables.
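A short sketch of how one might inspect and use these names after fitting the factor-variable model. It assumes the coeflegend display option is available in your version of Stata, as it should be wherever factor variables are.
. matrix list e(b) // lists the stored coefficients with their names
. regress, coeflegend // replays the last regression showing the _b[] name of each coefficient
. display _b[3.setting_g] - _b[2.setting_g] // difference between high and medium setting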
Exercise: Obtain the parameter estimates and anova table for the model with family planning effort grouped into three categories: 0-4, 5-14 and 15+, labeled weak, moderate and strong.
Continue with 2.7 Two-Way Analysis of Variance

