Germán Rodríguez
Generalized Linear Models Princeton University

## 3. Logit Models in Stata

Stata has several commands that can be used to fit logistic regression models by maximum likelihood. The basic commands are `logit` for individual data and `blogit` for grouped data. There is also a `logistic` command that presents the results in terms of odd-ratios instead of log-odds and can produce a variety of summary and diagnostic statistics. Finally, one can fit a logistic regression model as a special case of a generalized linear model with Bernoulli or binomial errors and link logit, using the `glm` command. We will have occassion to use all of these commands but we will emphasize the first two, using `blogit` for grouped data in this log and `logit` for individual data in the problem sets.

### 3.3 The Comparison of Two Groups

Following the lecture notes we will consider comparing two groups and then move on to more than two.

#### A 2-by-2 Table

Consider the data on contraceptive use by desire for more children on Table 3.2 (page 14 of the notes). We can read these data into Stata as 2 binomial observations. To make life easier I will enter desire for more children as a dummy variable that takes the value 1 for women who want no more children and 0 otherwise

```. input nomore users n

nomore      users          n
1.         0   219   972
2.         1   288   635
3. end
```

#### Testing Homegeneity

let us start by fitting the null model. With `blogit` you specify the outcome in terms of the number of 'successes' and the binomial denominator, here `users` and `n`:

```. blogit users n

Logistic regression for grouped data              Number of obs   =       1607
LR chi2(0)      =       0.00
Prob > chi2     =          .
Log likelihood = -1001.8468                       Pseudo R2       =     0.0000

------------------------------------------------------------------------------
_outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons |  -.7745545   .0536794   -14.43   0.000    -.8797641   -.6693448
------------------------------------------------------------------------------
```

The estimate of the constant is simply the logit of the overall proportion using contraception, say p=y/n, and the standard error is the square root of 1/y + 1/(n-y). You may want to check these results by hand.

Stata's `blogit` does not calculate the model deviance, but we can obtain it 'by hand' using `predict` to obtain fitted counts:

```. predict pusers
(option n assumed; E(cases))

. gen di = 2*( users*log(users/pusers) + (n-users)*log((n-users)/(n-pusers)) )

. gen DI = sum(di)

. display "Deviance = " DI[_N]
Deviance = 91.674393
```

So the deviance is 91.67 on one d.f., providing ample evidence that the null model does not fit the data. Thus, we reject the hypothesis that the probability of using contraception is the same in the two groups.

Try a similar method to calculate Pearson's chi-squared, you should get 92.64. Alternatively, you can fit the model using `glm`, which reports both the deviance and Pearson's chi-squared by default. I'll do this quietly and just report the corresponding stored results, `e(deviance)` for the deviance and `e(deviance_p)` for Pearson's statistic.

```. quietly glm users, family(binomial n)

. display e(deviance), e(deviance_p)
91.674397 92.644243
```

#### The Odds Ratio

Let us now fit the model with 'want no more' children as the predictor. This model is saturated for this dataset, using two parameters to model two probabilities:

```. blogit users n nomore

Logistic regression for grouped data              Number of obs   =       1607
LR chi2(1)      =      91.67
Prob > chi2     =     0.0000
Log likelihood = -956.00957                       Pseudo R2       =     0.0458

------------------------------------------------------------------------------
_outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
nomore |   1.048629    .110672     9.48   0.000     .8317159    1.265542
_cons |  -1.234993   .0767739   -16.09   0.000    -1.385468   -1.084519
------------------------------------------------------------------------------

. di exp(_b[nomore])
2.8537363
```

The constant corresponds to the log-odds of using contraception among whomen who do want more children, and the coefficient of `nomore` is the difference in log-odds between the two groups.

Exponentiating this coefficient we get an odds ratio of about three. Contrary to popular belief, this does not mean that "women who want no more children are three times more likely to use contraception". There are two errors in this interpretation.

First, and more importantly, it is the odds of using contraception among women who want no more children that are three times those of women who want more, not the probability, which is what's usually understood by "likelihood". The interpretation would be approximately correct if the event under study was rare, because if p is small then 1-p is close to one and the odds ratio is approximately the same as the relative risk. Here the observed proportions are 0.454 and 0.225, and the ratio is 2.01, so women who want no more children are twice as likely to use contraception as those who want more.

Second, even if the probability was tripled, that would make the women three times as likely, or two times more likely, to use contraception, not three times more likely. In this case the probability is doubled, and that makes women twice as likely, not two times more likely.

#### Testing Significance

The z-statistic is as reported on page 16 of the notes. Let us square it:

```. di (_b[nomore]/_se[nomore])^2
89.777623
```

This is Wald's chi-squared statistic for the hypothesis that the coefficient of `nomore` is zero, or equivalently that the odds-ratio is one, and can be calculated more simply using Stata's `test` command:

```. test nomore

( 1)  [_outcome]nomore = 0

chi2(  1) =   89.78
Prob > chi2 =    0.0000
```

The `chi2` statistic reported by Stata in the second line of output is the likelihood ratio chi-squared comparing the current model with the null model. Can you explain why we get 91.67, which is the deviance of the null model? Hint: What's the deviance of this model?

A third test of the effect of want no more is given by Pearson's chi-squared statistic, which we calculated earlier as 92.64. This is equivalent to the standard z-test for comparing two proportions if you use the pooled proportion to estimate the standard error.

All three statistics are different, but they are asymptotically equivalent. In our example they are also close in value and lead to the same overwhelming rejection of the hypothesis that the probability of using contraception is the same in the two groups.

#### Confidence Intervals

Stata is kind enough to give us a 95% confidence interval for the logit coefficients. We can convert the interval for the coefficient of `nomore` into a 95% CI for the odds ratio by exponentiating the confidence bounds:

```. di exp(0.831716) "-" exp(1.265542)
2.2972575-3.5450136
```

An even easier way is to type `blogit, or`. The `blogit` command without any variables, like all estimation commands, simply retrieves the results of the last fit. The option `or` is short for odds-ratio and causes Stata to report exponentiated coefficients. (Versions 12 and earlier omit the constant but Stata 13 exponentiates that as well.)

```. blogit, or

Logistic regression for grouped data              Number of obs   =       1607
LR chi2(1)      =      91.67
Prob > chi2     =     0.0000
Log likelihood = -956.00957                       Pseudo R2       =     0.0458

------------------------------------------------------------------------------
_outcome | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
nomore |   2.853736   .3158288     9.48   0.000     2.297257    3.545015
_cons |   .2908367   .0223287   -16.09   0.000     .2502068    .3380642
------------------------------------------------------------------------------
```

So the odds of using contraception among women who want more kids are 0.291 to one, and for those who don't want more kids they are 2.85 times as high, or 0.830 to one.

The standard error of the odds ratio is calculated by the delta method, but the confidence bounds are calculated by exponentiating the bounds in the logit scale, not by addding and subtracting twice the standard error to the odds ratio. This is done because the normal approximation is more accurate (and makes more sense) in the logit scale, which has no range restrictions.

Exercise. Calculate the conventional z-test for comparing the proportions using contraception in the two groups and verify that the square coincides with Pearson's chi-squared statistic.