Home | GLMs | Multilevel | Survival | Demography | Stata | R
Home Lecture Notes Stata Logs R Logs Datasets Problems Sets

3.4 The Comparison of Several Groups

These ideas extend easily to more than two groups. We will illustrate using the data on contraceptive use by age, where we compare four groups.

A k-by-two Table

These are the data on page 18 of the notes, entered as four age groups

. clear

. input ageg  users  n

          ageg      users          n
  1.         1    72   397
  2.         2   105   404
  3.         3   237   612
  4.         4    93   194
  5. end

. label define ageg 1 "<25" 2 "25-29" 3 "30-39" 4 "40-49"

. label values ageg ageg

We will also need dummy variables to represent the age groups. We could generate these using tab(ageg), gen(age), but we will calculate them explicity to use more descriptive names. We could also use factor variables, and will illustrate that approach below.

. gen age2529 = ageg == 2

. gen age30s  = ageg == 3

. gen age40s  = ageg == 4

The One-Factor Model

Here is the model treating age as a factor with four levels, which is of course saturated for the data:

. blogit users n age2529 age30s age40s

Logistic regression for grouped data              Number of obs   =       1607
                                                  LR chi2(3)      =      79.19
                                                  Prob > chi2     =     0.0000
Log likelihood = -962.25091                       Pseudo R2       =     0.0395

_outcome Coef. Std. Err. z P>|z| [95% Conf. Interval]
age2529 .4606758 .1727255 2.67 0.008 .1221401 .7992116
age30s 1.048293 .1544406 6.79 0.000 .7455952 1.350991
age40s 1.424638 .1939574 7.35 0.000 1.044488 1.804788
_cons -1.507159 .1302529 -11.57 0.000 -1.76245 -1.251868

Compare the parameter estimates with those on Table 3.5 of the notes. Can you obtain these estimates by hand directly from the raw frequencies? We see that the odds of using contraception increase steadily from one age group to the next. You could type blogit, or to convert from logit coefficients to odds ratios.

The chi2 statistic reported by Stata is the likelihood ratio chi-squared comparing the model at hand with the null model. The value of 79.19 on 3 d.f. means that we can reject the hypothesis that the probability of using contraception is the same in the four age groups. Stata's test command makes calculation of Wald tests easy. Here's the test for the age effect on page 20 of the notes:

. test age2529 age30s age40s

 ( 1)  [_outcome]age2529 = 0
 ( 2)  [_outcome]age30s = 0
 ( 3)  [_outcome]age40s = 0

           chi2(  3) =   74.36
         Prob > chi2 =    0.0000

Once again the likelihood ratio and Wald test are similar but not identical.

We will save the results for later use. Stata can store the estimates in memory using estimates store name or, save them to disk using estimates save filename. We'll just save them in memory as ageg

. estimates store ageg

Finally, we will compute the fitted logits, which we will need later. We can do this using the predict command, with the xb option to make predictions in the scale of the linear predictor, which in this case is the logit scale. (The default is to in the scale of the response, in this case counts.) We will name the prediction obslogit because it corresponds to the logits of the observed proportions

. predict obslogit, xb

Factor Variables

Before we leave this dataset, let us verify that we can obtain exactly the same results using Stata's factor variables, All we need is the i.prefix to request indicators of age groups:

. blogit users n i.ageg

Logistic regression for grouped data              Number of obs   =       1607
                                                  LR chi2(3)      =      79.19
                                                  Prob > chi2     =     0.0000
Log likelihood = -962.25091                       Pseudo R2       =     0.0395

_outcome Coef. Std. Err. z P>|z| [95% Conf. Interval]
25-29 .4606758 .1727255 2.67 0.008 .1221401 .7992116
30-39 1.048293 .1544406 6.79 0.000 .7455952 1.350991
40-49 1.424638 .1939574 7.35 0.000 1.044488 1.804788
_cons -1.507159 .1302529 -11.57 0.000 -1.76245 -1.251868

With Stata 12 and earlier we lose the labeling of the age groups but but don't have to worry about creating dummies. With Stata 13 we get the best of both worlds. Recall that we can get the Wald test using testparm:

. testparm i.ageg

 ( 1)  [_outcome]2.ageg = 0
 ( 2)  [_outcome]3.ageg = 0
 ( 3)  [_outcome]4.ageg = 0

           chi2(  3) =   74.36
         Prob > chi2 =    0.0000

We get exactly the same result as before.

A One-variate Model

We will now treat age as a covariate, using the mid-points of the four age groups; so we treat the group 15-24 as 20, 25-29 as 27.5, 30-39 as 35 and 40-49 as 45. (If these don't look like mid-points to you, note that age is usually reported in completed years, so 15-24 means between 15.0 and 25.0, and the mid-point is 20.0.) The easiest way to code the midpoints in this example is via the recode command

. recode ageg 1=20 2=27.5 3=35 4=45, gen(agem)
(4 differences between ageg and agem)

We can now fit the model on page 20 of the notes, which has a linear effect of age:

. blogit users n agem

Logistic regression for grouped data              Number of obs   =       1607
                                                  LR chi2(1)      =      76.79
                                                  Prob > chi2     =     0.0000
Log likelihood = -963.45258                       Pseudo R2       =     0.0383

_outcome Coef. Std. Err. z P>|z| [95% Conf. Interval]
agem .060671 .0071034 8.54 0.000 .0467486 .0745934
_cons -2.672667 .2332492 -11.46 0.000 -3.129827 -2.215507

We see that older women are more likely to use contraception, and that the odds of using contraception are about six percent higher for every year of age. (This comes from exponentiating the coefficient of age, which is now measured in years.)

We can formally test the assumption of linearity using a likelihood ratio test to compare this model with the saturated model of the previous section. The test can be calculated using Stata's lrtest command, which uses a dot to refer to the current model

. lrtest . ageg

Likelihood-ratio test                                 LR chi2(2)  =      2.40
(Assumption: . nested in ageg)                        Prob > chi2 =    0.3007

The statistic of 2.4 on one d.f. is not significant, indicating that we have no evidence against the assumption of linearity, and can happily save two degrees of freedom. This statistic is, of course, the deviance for the model with a linear effect of age, which we can compute using glm

. quietly glm users agem, family(binomial n)

. display e(deviance)

We can also calculate this "by hand" from first principles using the 'sum of observed times log(observed/expected)' formula . Just remember that you need to use observed and expected counts of both successes and failures, here users and non-users:

. predict pusers // predicted count of users
(option mu assumed; predicted mean users)

. gen di = 2*( users*log(users/pusers) + (n-users)*log((n-users)/(n-pusers)) )

. gen DI = sum(di)

. display "Deviance = " DI[_N]
Deviance = 2.4033537

Observed and Fitted Logits

The next step will be to compute fitted logits based on this model, and use them together with the observed logits calculated earlier to examine visually the adequacy of the linear specification, effectively reproducing Figure 3.2 in the notes. For added measure I will also consider a model with a quadratic term, centering age around 30 before squaring it, so the linear term reflects the slope at 30.

. predict lfit1, xb

. gen agemcsq = (agem-30)^2

. blogit users n  agem agemcsq

Logistic regression for grouped data              Number of obs   =       1607
                                                  LR chi2(2)      =      78.32
                                                  Prob > chi2     =     0.0000
Log likelihood = -962.68877                       Pseudo R2       =     0.0391

_outcome Coef. Std. Err. z P>|z| [95% Conf. Interval]
agem .0648033 .0079525 8.15 0.000 .0492167 .08039
agemcsq -.0010131 .0008214 -1.23 0.217 -.002623 .0005968
_cons -2.740736 .2434262 -11.26 0.000 -3.217842 -2.263629
. predict lfit2, xb . graph twoway (scatter obslogit agem) (line lfit1 agem) /// > (function f=_b[_cons]+_b[agem]*x+_b[agemcsq]*(x-30)^2, /// > range(20 45)) /// > , title("Figure 3.2: Observed and Fitted Logits By Age") /// > xtitle("age") ytitle("logit") legend(off) . graph export fig32.png, width(500) replace (note: file fig32.png not found) (file fig32.png written in PNG format)

Figure 3.2

The graph shows that the linear specification was adequate. There is a hint that a quadratic model might be better, particularly in terms of the fit for the oldest age group, but the quadratic term is not significant.

(You may wonder why I used the function plot type for the quadratic model. I could have predicted and plotted the logits just like I did for the linear model, but with only four points joined by straight lines we lose the curvilinearity. This also helps illustrate how to use estimation results in a plot.)

This analysis gives us a quick indication of whether we could treat age linearly if we were working with individual data and had the actual ages of the 1607 women. It is not equivalent, however, because we have grouped age, and therefore treated all women men aged 25-29 as if they were age 27.5. With individual data some would be 25, some 26, etc.

You may also wonder why we were able to do a likelihood ratio test, when a model treating age linearly is usually not nested in a model that treats it as a factor. The answer is that in this case both specifications are applied to grouped data. You can view the linear model as imposing constraints where the differences betwen the age groups are proportional to the difference in years between their midpoints. Alternatively, you can view the model that treats age as four groups as equivalent to having linear, quadratic, and cubic terms.

Continue with 3.5. Models With Two Predictors