Home | GLMs | Multilevel | Survival | Demography | Stata | R
Home Lecture Notes Stata Logs Datasets Problem Sets

2.8 Analysis of Covariance Models

The anova command in Stata can also fit analysis of covariance model, but we will continue to use regression with dummy variables via regress.

Here's the model treating social setting as a covariate with a linear effect and program effort as a factor with three categories represented by two dummy variables

. regress change setting effort_mod effort_str
 
      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  3,    16) =   21.55
       Model |  2124.50633     3  708.168776           Prob > F      =  0.0000
    Residual |  525.693673    16  32.8558546           R-squared     =  0.8016
-------------+------------------------------           Adj R-squared =  0.7644
       Total |      2650.2    19  139.484211           Root MSE      =   5.732
 
------------------------------------------------------------------------------
      change |      Coef.   Std. Err.      T    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     setting |   .1692677   .1055505     1.60   0.128    -.0544894    .3930247
  effort_mod |   4.143915   3.191179     1.30   0.213    -2.621082    10.90891
  effort_str |   19.44761   3.729295     5.21   0.000     11.54186    27.35336
       _cons |  -5.954036    7.16597    -0.83   0.418    -21.14521    9.237141
------------------------------------------------------------------------------
 
. estimates store ancova // save results for later use

Exactly the same model can be fitted using factor variables. We just tell Stata that effort_g is a factor and, by omission, that setting is a covariate. (There is a c. prefix for covariates, but it is implied in regression commands.)

. regress change setting i.effort_g
 
      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  3,    16) =   21.55
       Model |  2124.50633     3  708.168776           Prob > F      =  0.0000
    Residual |  525.693673    16  32.8558546           R-squared     =  0.8016
-------------+------------------------------           Adj R-squared =  0.7644
       Total |      2650.2    19  139.484211           Root MSE      =   5.732
 
------------------------------------------------------------------------------
      change |      Coef.   Std. Err.      T    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     setting |   .1692677   .1055505     1.60   0.128    -.0544894    .3930247
             |
    effort_g |
          2  |   4.143915   3.191179     1.30   0.213    -2.621082    10.90891
          3  |   19.44761   3.729295     5.21   0.000     11.54186    27.35336
             |
       _cons |  -5.954036    7.16597    -0.83   0.418    -21.14521    9.237141
------------------------------------------------------------------------------

The results agree with Tables 2.23 and 2.24. We see that countries with strong programs show steeper declines, on average 19 percentage points more, than countries with weak programs and the same social setting.

This analysis has adjusted for linear effects of setting, whereas the analysis of Section 2.7 adjusted for difference by setting grouped in three categories. As it happens, both analyses lead to similar estimates of the difference between strong and weak programs at the same level of setting.

Plotting Observed and Fitted

Let us do Figure 2.5, a plot of change versus setting identifying the level of program effort corresponding to each point. I will also superimpose the three parallel lines corresponding to the fitted model.

This is easily done using the graph twoway command --which can be shortened to twoway-- to combine a scatter plot with three line plots, one for each category of effort.

. predict ancovafit
(option xb assumed; fitted values)
 
. 
. twoway  (scatter change setting, mlabel(effort_g))  ///
>         (line ancovafit setting if effort < 5, sort)    ///
>         (line ancovafit setting if effort_mod, sort)    ///
>         (line ancovafit setting if effort_str, sort)    ///
>         ,                                                                                   
>                                     ///
>         title("Figure 2.5: Fertility Change by Setting and Effort") ///
>         subtitle("Analysis of Covariance Fit") legend(off)
 
. graph export fig25.png, width(500)replace
(file fig25.png written in PNG format)

Note how each subplot can have its own if condition. We also specify the sort option, which determines the order in which the points are drawn, pretty important when they will be joined by lines. (Try line change setting to see what can happen if you do a line plot on unsorted data. With a linear fit like Figure 2.5 the consequence of leaving out sort is not apparent because all the points lie in a straight line.)

If you wanted to label the points using the letters w, m and s for weak, moderate and strong, as done in the notes, you would have to define a new label for effort_g.

Adjusted and Unadjusted Means

Let us turn to the comparison of adjusted and unadjusted declines in Table 2.26, a useful way to present regression results to a non-technical audience.

We start by restoring the ancova estimates and then predict fertility decline in each country at its observed level of effort, but with social setting set to 72.1, which is the sample mean:

. estimates restore ancova
(results ancova are active now)
 
. gen adj_change = _b[_cons] + effort_mod * _b[effort_mod] + ///
>         effort_str * _b[effort_str] + 72.1 * _b[setting]
 
. label var adj_change "Adjusted change at mean setting"

Here we accessed the coefficients from the last regression as _b[varname], but _coef[varname] also works. This is better than typing them from the output because it is less error prone and we don't lose precision.

Next we tabulate our data by level of effort and summarize observed and predicted change. We use the tabstat command because the tabulate, summarize command used ealier computes summary statistics for just one variable:

. tabstat change adj_change, by(effort_g) statistics(mean)
 
Summary statistics: mean
  by categories of: effort_g (RECODE of effort (Program Effort))
 
effort_g |    change  adj_ch~e
---------+--------------------
    Weak |         5  6.250163
Moderate |  9.333333  10.39408
  strong |  27.85714  25.69777
---------+--------------------
   Total |      14.3      14.3
------------------------------

Countries with strong program average a 28% decline in fertility, but they also tend to have higher settings; we estimate a slightly smaller decline of about 26% at average social setting. The estimate is based on the model, which adjusts linearly for setting and assumes that the slope is the same at all levels of effort. The next step will be to examine this assumption.

By the way Stata can automate this type of calculation using the margins command in version 11, or the adjust command in earlier versions. We proceeded from first principles because the calculations are easy to do and you can see exactly what is being one.

On the other hand, once you get used to the margins command you appreciate it's power. It understands factor variables, and has the added advantage of producing standard errors of the adjusted predictions. So here's how to predict fertility decline by levels of effort at mean setting:

. quietly regress change setting i.effort_g 
 
. margins i.effort_g, at( (mean) setting)
 
Adjusted predictions                              Number of obs   =         20
Model VCE    : OLS
 
Expression   : Linear prediction, predict()
at           : setting         =        72.1 (mean)
 
------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      Z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    effort_g |
          1  |   6.250162    2.30248     2.71   0.007     1.737384    10.76294
          2  |   10.39408   2.431767     4.27   0.000     5.627901    15.16025
          3  |   25.69777   2.550846    10.07   0.000      20.6982    30.69734
------------------------------------------------------------------------------

Wasn't that easy? I rerun the regression using factor variables, using quietly to supress the output, and then asked for the effort margin at mean setting … literally.

The Assumption of Parallelism

We will now allow the linear relationship between change and setting to vary with level of effort by introducing an interaction between setting and the indicators of effort. Before we do that we center the index of social setting by subtracting the mean, a practice I highly recommend to simplify the interpretation of 'main' effects when the model has interactions:

. quietly summarize setting
 
. gen setting_c = setting - r(mean)

Here we used the fact that the mean is available as r(mean) after summarize. (To see all the results that can be extracted this way type return list.)

We can now generate variables to represent the linear setting by effort interaction using the centered variable, and run the regression:

. gen se_c_mod = setting_c * effort_mod
 
. gen se_c_str = setting_c * effort_str
 
. regress change setting_c effort_mod effort_str se_c_mod se_c_str
 
      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  5,    14) =   12.13
       Model |  2153.09941     5  430.619882           Prob > F      =  0.0001
    Residual |   497.10059    14   35.507185           R-squared     =  0.8124
-------------+------------------------------           Adj R-squared =  0.7454
       Total |      2650.2    19  139.484211           Root MSE      =  5.9588
 
------------------------------------------------------------------------------
      change |      Coef.   Std. Err.      T    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   setting_c |   .1835741   .1396981     1.31   0.210    -.1160486    .4831968
  effort_mod |   3.583729   3.662354     0.98   0.344    -4.271239     11.4387
  effort_str |    13.3332   8.209163     1.62   0.127    -4.273701     30.9401
    se_c_mod |  -.0868366   .2325831    -0.37   0.714    -.5856776    .4120045
    se_c_str |   .4567037   .6039241     0.76   0.462    -.8385847    1.751992
       _cons |   6.355826   2.477298     2.57   0.022      1.04255     11.6691
------------------------------------------------------------------------------

Compare the parameter estimates with Table 2.27 in the notes. You also have all the information required to produce the hierarchical anova in Table 2.28.

Because we centered setting, the coefficients for moderate and strong programs summarize differences by effort at mean setting, rather than at setting zero (which is well outside the range of the data). Thus, fertility decline averages 13 percentage points more under strong than under weak programs in countries with average social setting.

The interaction terms can be used to compute how these differences vary as we move away from the mean. For example in countries which are ten points above the mean social setting, the strong versus weak difference is almost five percentage points more than at the mean. These differences, however, are not significant, as we can't reject the hypothesis that the three slopes are equal:

. test se_c_mod se_c_str
 
 ( 1)  se_c_mod = 0
 ( 2)  se_c_str = 0
 
       F(  2,    14) =    0.40
            Prob > F =    0.6761

Factor Variables in Ancova

We can reproduce these results exactly using factor variables, using the double hash ## convention to request main effects and interactions, and the i. prefix to specify that grouped effort is to be treated as a factor. I discovered that we also need to use the c. prefix to remind Stata that centered setting is to be treated as a covariate. (This was not needed in the previous section, but here Stata will attempt to build indicators for every value of centered setting if you omit it.)

. regress change c.setting_c##i.effort_g
 
      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  5,    14) =   12.13
       Model |  2153.09941     5  430.619882           Prob > F      =  0.0001
    Residual |   497.10059    14   35.507185           R-squared     =  0.8124
-------------+------------------------------           Adj R-squared =  0.7454
       Total |      2650.2    19  139.484211           Root MSE      =  5.9588
 
------------------------------------------------------------------------------
      change |      Coef.   Std. Err.      T    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   setting_c |   .1835741   .1396981     1.31   0.210    -.1160486    .4831968
             |
    effort_g |
          2  |   3.583729   3.662354     0.98   0.344    -4.271239     11.4387
          3  |    13.3332   8.209163     1.62   0.127    -4.273701     30.9401
             |
    effort_g#|
 c.setting_c |
          2  |  -.0868366   .2325831    -0.37   0.714    -.5856776    .4120045
          3  |   .4567037   .6039241     0.76   0.462    -.8385847    1.751992
             |
       _cons |   6.355826   2.477298     2.57   0.022      1.04255     11.6691
------------------------------------------------------------------------------

The testparm introduced earlier comes handy for testing the interaction term

. testparm c.setting_c#i.effort_g
 
 ( 1)  2.effort_g#c.setting_c = 0
 ( 2)  3.effort_g#c.setting_c = 0
 
       F(  2,    14) =    0.40
            Prob > F =    0.6761

Exercise. Plot the data and the regression lines implied by the model with a linear setting by effort interaction. Note how the difference between strong and weak programs increases with social setting. The interaction is not significant, however, so we have no evidence that the lines are not in fact parallel.


Continue with 2.9 Regression Diagnostics