3 Logit Models in Stata
Stata has several commands that can be used to fit logistic regression
models by maximum likelihood. The basic commands are
logit for individual data and
blogit for grouped data.
There is also a
logistic command that presents the results
in terms of odd-ratios instead of log-odds and can produce a variety of
summary and diagnostic statistics.
Finally, one can fit a logistic regression model as a special case
of a generalized linear model with Bernoulli or binomial errors and
link logit, using the
We will have occasion to use all of these commands,
but we will emphasize the first two.
3.3 The Comparison of Two Groups
Following the lecture notes we will consider comparing two groups and then move on to more than two.
A 2-by-2 Table
Consider the data on contraceptive use by desire for more children on Table 3.2 (page 14 of the notes). We can read these data into Stata as 2 binomial observations. To make life easier I will enter desire for more children as a dummy variable that takes the value 1 for women who want no more children and 0 otherwise
. input nomore users n nomore users n 1. 0 219 972 2. 1 288 635 3. end
let us start by fitting the null model.
blogit you specify the outcome in terms of
the number of 'successes' and the binomial denominator,
. blogit users n Logistic regression for grouped data Number of obs = 1607 LR chi2(0) = 0.00 Prob > chi2 = . Log likelihood = -1001.8468 Pseudo R2 = 0.0000
|_outcome||Coef. Std. Err. z P>|z| [95% Conf. Interval]|
|_cons||-.7745545 .0536794 -14.43 0.000 -.8797641 -.6693448|
The estimate of the constant is simply the logit of the overall proportion using contraception, say p=y/n, and the standard error is the square root of 1/y + 1/(n-y). You may want to check these results by hand.
blogit does not calculate the model deviance,
but we can obtain it 'by hand' using
predict to obtain
. predict pusers (option n assumed; E(cases)) . gen di = 2*( users*log(users/pusers) + (n-users)*log((n-users)/(n-pusers)) ) . gen DI = sum(di) . display "Deviance = " DI[_N] Deviance = 91.674393
So the deviance is 91.67 on one d.f., providing ample evidence that the null model does not fit the data. Thus, we reject the hypothesis that the probability of using contraception is the same in the two groups.
Try a similar method to calculate Pearson's chi-squared,
you should get 92.64.
Alternatively, you can fit the model using
reports both the deviance and Pearson's chi-squared by default.
I'll do this quietly and just report the corresponding stored results,
e(deviance) for the deviance and
e(deviance_p) for Pearson's statistic.
. quietly glm users, family(binomial n) . display e(deviance), e(deviance_p) 91.674397 92.644243
The Odds Ratio
Let us now fit the model with 'want no more' children as the predictor. This model is saturated for this dataset, using two parameters to model two probabilities:
. blogit users n nomore Logistic regression for grouped data Number of obs = 1607 LR chi2(1) = 91.67 Prob > chi2 = 0.0000 Log likelihood = -956.00957 Pseudo R2 = 0.0458
|_outcome||Coef. Std. Err. z P>|z| [95% Conf. Interval]|
|nomore||1.048629 .110672 9.48 0.000 .8317159 1.265542|
|_cons||-1.234993 .0767739 -16.09 0.000 -1.385468 -1.084519|
The constant corresponds to the log-odds of using contraception among
whomen who do want more children, and the coefficient of
nomore is the difference in log-odds between the two
Exponentiating this coefficient we get an odds ratio of about three. Contrary to popular belief, this does not mean that "women who want no more children are three times more likely to use contraception". There are two errors in this interpretation.
First, and more importantly, it is the odds of using contraception among women who want no more children that are three times those of women who want more, not the probability, which is what's usually understood by "likelihood". The interpretation would be approximately correct if the event under study was rare, because if p is small then 1-p is close to one and the odds ratio is approximately the same as the relative risk. Here the observed proportions are 0.454 and 0.225, and the ratio is 2.01, so women who want no more children are twice as likely to use contraception as those who want more.
Second, even if the probability was tripled, that would make the women three times as likely, or two times more likely, to use contraception, not three times more likely. In this case the probability is doubled, and that makes women twice as likely, not two times more likely.
The z-statistic is as reported on page 16 of the notes. Let us square it:
. di (_b[nomore]/_se[nomore])^2 89.777623
This is Wald's chi-squared statistic for the hypothesis that the
nomore is zero, or equivalently that the
odds-ratio is one, and can be calculated more simply using Stata's
. test nomore ( 1) [_outcome]nomore = 0 chi2( 1) = 89.78 Prob > chi2 = 0.0000
chi2 statistic reported by Stata in the second line
of output is the likelihood ratio chi-squared comparing the current
model with the null model. Can you explain why we get 91.67, which
is the deviance of the null model?
Hint: What's the deviance of this model?
A third test of the effect of want no more is given by Pearson's chi-squared statistic, which we calculated earlier as 92.64. This is equivalent to the standard z-test for comparing two proportions if you use the pooled proportion to estimate the standard error.
All three statistics are different, but they are asymptotically equivalent. In our example they are also close in value and lead to the same overwhelming rejection of the hypothesis that the probability of using contraception is the same in the two groups.
Stata is kind enough to give us a 95% confidence interval for
the logit coefficients. We can convert the interval for the
nomore into a 95% CI for the odds ratio
by exponentiating the confidence bounds:
. di exp(0.831716) "-" exp(1.265542) 2.2972575-3.5450136
An even easier way is to type
blogit command without any variables, like all estimation
commands, simply retrieves the results of the last fit.
or is short for odds-ratio and
causes Stata to report exponentiated coefficients. (Versions 12 and earlier
omit the constant but Stata 13 exponentiates that as well.)
. blogit, or Logistic regression for grouped data Number of obs = 1607 LR chi2(1) = 91.67 Prob > chi2 = 0.0000 Log likelihood = -956.00957 Pseudo R2 = 0.0458
|_outcome||Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]|
|nomore||2.853736 .3158288 9.48 0.000 2.297257 3.545015|
|_cons||.2908367 .0223287 -16.09 0.000 .2502068 .3380642|
So the odds of using contraception among women who want more kids are 0.291 to one, and for those who don't want more kids they are 2.85 times as high, or 0.830 to one.
The standard error of the odds ratio is calculated by the delta method, but the confidence bounds are calculated by exponentiating the bounds in the logit scale, not by addding and subtracting twice the standard error to the odds ratio. This is done because the normal approximation is more accurate (and makes more sense) in the logit scale, which has no range restrictions.
Exercise. Calculate the conventional z-test for comparing the proportions using contraception in the two groups and verify that the square coincides with Pearson's chi-squared statistic.