## 3. Logit Models in Stata

Stata has several commands that can be used to fit logistic regression
models by maximum likelihood. The basic commands are
`logit`

for individual data and
`blogit`

for grouped data.
There is also a `logistic`

command that presents the results
in terms of odd-ratios instead of log-odds and can produce a variety of
summary and diagnostic statistics.
Finally, one can fit a logistic regression model as a special case
of a generalized linear model with Bernoulli or binomial errors and
link logit, using the `glm`

command.
We will have occassion to use all of these commands but we will
emphasize the first two, using `blogit`

for grouped data
in this log and `logit`

for individual data in the problem
sets.

### 3.3 The Comparison of Two Groups

Following the lecture notes we will consider comparing two groups and then move on to more than two.

#### A 2-by-2 Table

Consider the data on contraceptive use by desire for more children on Table 3.2 (page 14 of the notes). We can read these data into Stata as 2 binomial observations. To make life easier I will enter desire for more children as a dummy variable that takes the value 1 for women who want no more children and 0 otherwise

. input nomore users n nomore users n 1. 0 219 972 2. 1 288 635 3. end

#### Testing Homegeneity

let us start by fitting the null model.
With `blogit`

you specify the outcome in terms of
the number of 'successes' and the binomial denominator,
here `users`

and `n`

:

. blogit users n Logistic regression for grouped data Number of obs = 1607 LR chi2(0) = 0.00 Prob > chi2 = . Log likelihood = -1001.8468 Pseudo R2 = 0.0000 ------------------------------------------------------------------------------ _outcome | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- _cons | -.7745545 .0536794 -14.43 0.000 -.8797641 -.6693448 ------------------------------------------------------------------------------

The estimate of the constant is simply the logit of the overall proportion
using contraception, say *p=y/n*, and
the standard error is the square root of *1/y + 1/(n-y)*.
You may want to check these results by hand.

Stata's `blogit`

*does not* calculate the model deviance,
but we can obtain it 'by hand' using `predict`

to obtain
fitted counts:

. predict pusers (option n assumed; E(cases)) . gen di = 2*( users*log(users/pusers) + (n-users)*log((n-users)/(n-pusers)) ) . gen DI = sum(di) . display "Deviance = " DI[_N] Deviance = 91.674393

So the deviance is 91.67 on one d.f., providing ample evidence that the null model does not fit the data. Thus, we reject the hypothesis that the probability of using contraception is the same in the two groups.

Try a similar method to calculate Pearson's chi-squared,
you should get 92.64.
Alternatively, you can fit the model using `glm`

, which
reports both the deviance and Pearson's chi-squared by default.
I'll do this quietly and just report the corresponding stored results,
`e(deviance)`

for the deviance and
`e(deviance_p)`

for Pearson's statistic.

. quietly glm users, family(binomial n) . display e(deviance), e(deviance_p) 91.674397 92.644243

#### The Odds Ratio

Let us now fit the model with 'want no more' children as the predictor. This model is saturated for this dataset, using two parameters to model two probabilities:

. blogit users n nomore Logistic regression for grouped data Number of obs = 1607 LR chi2(1) = 91.67 Prob > chi2 = 0.0000 Log likelihood = -956.00957 Pseudo R2 = 0.0458 ------------------------------------------------------------------------------ _outcome | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- nomore | 1.048629 .110672 9.48 0.000 .8317159 1.265542 _cons | -1.234993 .0767739 -16.09 0.000 -1.385468 -1.084519 ------------------------------------------------------------------------------ . di exp(_b[nomore]) 2.8537363

The constant corresponds to the log-odds of using contraception among
whomen who *do* want more children, and the coefficient of
`nomore`

is the difference in log-odds between the two
groups.

Exponentiating this coefficient we get an odds ratio of about three. Contrary to popular belief, this does not mean that "women who want no more children are three times more likely to use contraception". There are two errors in this interpretation.

First, and more importantly, it is the *odds* of using contraception
among women who want no more children that are three times those of women
who want more, not the probability, which is what's usually understood
by "likelihood".
The interpretation would be approximately correct if
the event under study was rare, because if *p* is small then
*1-p* is close to one and the odds ratio is approximately the
same as the relative risk.
Here the observed proportions are 0.454 and 0.225, and the ratio is 2.01,
so women who want no more children are twice as likely to use
contraception as those who want more.

Second, even if the probability was tripled, that would make the women
three times *as* likely, or two times *more* likely, to use
contraception, not three times more likely.
In this case the probability is doubled, and that makes women
twice *as* likely, not two times *more* likely.

#### Testing Significance

The z-statistic is as reported on page 16 of the notes. Let us square it:

. di (_b[nomore]/_se[nomore])^2 89.777623

This is Wald's chi-squared statistic for the hypothesis that the
coefficient of `nomore`

is zero, or equivalently that the
odds-ratio is one, and can be calculated more simply using Stata's
`test`

command:

. test nomore ( 1) [_outcome]nomore = 0 chi2( 1) = 89.78 Prob > chi2 = 0.0000

The `chi2`

statistic reported by Stata in the second line
of output is the likelihood ratio chi-squared comparing the current
model with the null model. Can you explain why we get 91.67, which
is the deviance of the null model?
*Hint:* What's the deviance of this model?

A third test of the effect of want no more is given by Pearson's chi-squared statistic, which we calculated earlier as 92.64. This is equivalent to the standard z-test for comparing two proportions if you use the pooled proportion to estimate the standard error.

All three statistics are different, but they are asymptotically equivalent. In our example they are also close in value and lead to the same overwhelming rejection of the hypothesis that the probability of using contraception is the same in the two groups.

#### Confidence Intervals

Stata is kind enough to give us a 95% confidence interval for
the logit coefficients. We can convert the interval for the
coefficient of `nomore`

into a 95% CI for the odds ratio
by exponentiating the confidence bounds:

. di exp(0.831716) "-" exp(1.265542) 2.2972575-3.5450136

An even easier way is to type `blogit, or`

.
The `blogit`

command without any variables, like all estimation
commands, simply retrieves the results of the last fit.
The option `or`

is short for __o__dds-__r__atio and
causes Stata to report exponentiated coefficients. (Versions 12 and earlier
omit the constant but Stata 13 exponentiates that as well.)

. blogit, or Logistic regression for grouped data Number of obs = 1607 LR chi2(1) = 91.67 Prob > chi2 = 0.0000 Log likelihood = -956.00957 Pseudo R2 = 0.0458 ------------------------------------------------------------------------------ _outcome | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- nomore | 2.853736 .3158288 9.48 0.000 2.297257 3.545015 _cons | .2908367 .0223287 -16.09 0.000 .2502068 .3380642 ------------------------------------------------------------------------------

So the odds of using contraception among women who want more kids are 0.291 to one, and for those who don't want more kids they are 2.85 times as high, or 0.830 to one.

The standard error of the odds ratio is calculated by the delta method,
but the confidence bounds are calculated by exponentiating the bounds
in the logit scale, __not__ by addding and subtracting twice the
standard error to the odds ratio. This is done because the normal
approximation is more accurate (and makes more sense) in the logit scale,
which has no range restrictions.

*Exercise*. Calculate the conventional z-test for comparing the
proportions using contraception in the two groups and verify that the
square coincides with Pearson's chi-squared statistic.