Germán Rodríguez
Generalized Linear Models Princeton University

3.8 Regression Diagnostics for Binary Data

We now consider regression diagnostics for binary data, focusing on logistic regression models. We will work with the additive model of contraceptive use by age, education, and desire for more children, which we know to be inadequate.

Covariate Patterns

Recall that the deviance can be used as a goodness of fit test with grouped but not with individual data. An obvious approach with individual data is therefore to group them. One way to do this is to create a separate group for each distint combination of covariate values in the sample. Obviously this may not work if there are too many distinct values or covariate patterns in the data. We will discuss another way of creating groups further below.

In our application the point is moot, however, because the data are grouped. We have only 16 different groups of women, determined by 4 age groups, 2 education categories and 2 values of desire for more children. Moreover, the groups are fairly large, with all categories having more than 5 women, so the deviance can be treated safely as a chi-squared statistic. In our example the deviance of 29.92 on 10 d.f. categories indicates significant lack of fit.

Deviance and Pearson Residuals

The `residuals()` function can be used to obtain deviance and Pearson residuals (among others), by specifying `type="deviance"` or `type="pearson"`.

The default are deviance residuals, the signed square roots of the contributions to the deviance. The alternative, Pearson residuals, are the signed square roots of the contributions to Pearson's chi-squared statistic;

```> additive <- glm(Y ~ age + education + nomore, family=binomial, data=cuse)

> sum(dr^2)
[1] 29.91722

> sum(pr^2)
[1] 28.28834
```

Both are useful in identifying which groups contribute most to the lack of fit. Let us list the groups with deviance residuals with absolute values in excess of two:

```> cbind(
+   cuse[,c("age","education","nomore")],
+   obs = cuse\$using/cuse\$n,
+   dr,
+   pr
+ )[abs(dr)>2,]
age education nomore       obs       fit        dr        pr
4    <25      high      1 0.1666667 0.3082699 -2.514795 -2.375281
8  25-29      high      1 0.2934783 0.3967948 -2.065260 -2.025574
13 40-49       low      0 0.1463415 0.3149818 -2.491414 -2.324661
```

The additive model seems we do a particularly poor job with young educated women who want more children and older uneducated women who want no more; in all these cases the model overestimates contraceptive use.

It may be instructive to compute these residuals "by hand" using observed and predicted counts. Don't forget that for each group we need to consider the contributions from the "successes" and the "failures", which we add together to form a group residual. Conventionally the group residual takes the sign of the difference between observed and fitted probabilities.

```> obs <- cuse\$Y

> fit <- cbind(p * cuse\$n, (1-p) * cuse\$n)

> sgn <- sign(cuse\$using/cuse\$n - p)

> pr2 <- sqrt((  (obs-fit)^2/fit   ) %*% c(1,1)) * sgn

> dr2 <- sqrt(( 2*obs*log(obs/fit) ) %*% c(1,1)) * sgn

> all(abs(dr-dr2) < 1e-12)
[1] TRUE

> all(abs(pr-pr2) < 1e-12)
[1] TRUE
```

Leverage and Influence

Pregibon extended regression diagnostics to GLMs and introduced a weighted hat matrix. The diagonal elements or leverages can be calculated with `hatvalues()` and Cook's distance with `cooks.distance`. (Don't be surprised if these look like the same functions we used for linear models. Like many other R functions, these are generic functions; R looks at the class of the object and calls the appropriate function, depending on whether the object is a linear model fitted by `lm()` or a generalized linear model fitted by `glm()`.)

Let us calculate these diagnostics and list the groups that have potential and/or actual influence on the fit.

```> pfit <- p # for clarity

> pobs <- cuse\$using / cuse\$n

> i <- order(-lev) # sort descending

> cbind(cuse[,c("age","education","nomore")],pobs,pfit,lev,cd)[i[1:3],]
age education nomore      pobs      pfit       lev          cd
3    <25      high      0 0.1969697 0.1623053 0.6696332 2.385867366
7  25-29      high      0 0.2583732 0.2223899 0.5774811 0.843656898
14 40-49       low      1 0.5106383 0.5140024 0.5601446 0.002054933
```

The three cells with potentially the largest influence in the fit are young women with some education who want more children, and older women with no education who want no more children. The youngest group had the most influence on the fit, whereas older uneducated women had no actual influence at all, with a Cook's distance of zero.

Note: Cook's distance is calculated using Pregibon's one-step approximation, as described on page 49 of the notes. In short, it doesn't refit the model excluding an observation, but takes just one-step in the IRLS algorithm, starting from the full sample mle.

Standardized Residuals

The values of the (weighed) hat matrix can be used to compute standardized Pearson and deviance residuals, just as we did in linear models. These residuals take into account differences in residual variances originating from both the outcome and the fit.

```> spr <- pr/sqrt(1-lev)

> sdr <- dr/sqrt(1-lev)

> i <- order(-spr^2)

> cbind(cuse[,c("age","education","nomore")],pobs,pfit,spr,sdr)[i[1:3],]
age education nomore      pobs      pfit       spr       sdr
4    <25      high      1 0.1666667 0.3082699 -2.887789 -3.057406
13 40-49       low      0 0.1463415 0.3149818 -2.720763 -2.915930
8  25-29      high      1 0.2934783 0.3967948 -2.687572 -2.740228
```

We identify the same three observations picked up by the conventional residuals, but the absolute values are now closer to three, highlighting the lack of fit of these groups.

Goodness of Fit

Hosmer and Lemeshow have proposed a goodness of fit for logistic regression models that can be used with individual data. The basic idea is to create groups using predicted probabilities, and then compare observed and fitted counts of successes and failures on those groups using a chi-squared statistic. Simulation has shown that with g groups the large sample distribution of the test statistic is approximately chi-squared with g-2 degrees of freedom.

Let us apply this test to the Hosmer and Lemeshow low birth weight data, which happen to be available in Stata format on the Stata website. We can read the data using the `read.dta` function in the `foreign` package.

```> require(foreign)

> names(lbw)
[1] "id"    "low"   "age"   "lwt"   "race"  "smoke" "ptl"   "ht"    "ui"    "ftv"
[11] "bwt"
```

The outcome is an indicator of low birth weight (< 2500 grams). The predictors include mother's age, her weight at the last menstrual period, and indicators of ethnicity, smoking during pregnancy, history of premature labor, history of hypertension, and presence of uterine irritability. Here's the fit

```> model <- glm(low ~ age+lwt+race+smoke+ptl+ht+ui, family=binomial, data=lbw)

> summary(model)

Call:
glm(formula = low ~ age + lwt + race + smoke + ptl + ht + ui,
family = binomial, data = lbw)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-1.9116  -0.8175  -0.5224   0.9716   2.1805

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.461224   1.204575   0.383  0.70180
age         -0.027100   0.036450  -0.743  0.45718
lwt         -0.015151   0.006926  -2.188  0.02870 *
raceblack    1.262647   0.526405   2.399  0.01646 *
raceother    0.862079   0.439147   1.963  0.04964 *
smoke        0.923345   0.400821   2.304  0.02124 *
ptl          0.541837   0.346247   1.565  0.11761
ht           1.832518   0.691624   2.650  0.00806 **
ui           0.758513   0.459374   1.651  0.09870 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 201.45  on 180  degrees of freedom
AIC: 219.45

Number of Fisher Scoring iterations: 4
```

The sample of 189 observations has 182 different covariate patterns, so we can't really group this way. Instead we will compute predicted probabilities and create ten groups of approximately equal size by breaking at the deciles of the predicted probabilities. It pays to encapsulate the calculations in a function that can be reused:

```> hosmer <- function(y, fv, groups=10, table=TRUE, type=2) {
+ # A simple implementation of the Hosmer-Lemeshow test
+   q <- quantile(fv, seq(0,1,1/groups), type=type)
+   fv.g <- cut(fv, breaks=q, include.lowest=TRUE)
+   obs <- xtabs( ~ fv.g + y)
+   fit <- cbind( e.0 = tapply(1-fv, fv.g, sum), e.1 = tapply(fv, fv.g, sum))
+   if(table) print(cbind(obs,fit))
+   chi2 <- sum((obs-fit)^2/fit)
+   pval <- pchisq(chi2, groups-2, lower.tail=FALSE)
+   data.frame(test="Hosmer-Lemeshow",groups=groups,chi.sq=chi2,pvalue=pval)
+ }
```

We calculate quantiles, defaulting to deciles, and use these to create groups taking care to include all values. We then tabulate the observed and predicted outcomes, using the sum of predicted probablities as the expected number of "successes" in a group. There is an option to print a table of expected and observed counts. The function returns the chi-squared statistic, the d.f. and the p-value.

Here's the result of applying our function with all the defaults:

```> hosmer(lbw\$low, fitted(model))
0  1       e.0       e.1
[0.0273,0.0827] 19  0 17.822223  1.177777
(0.0827,0.128]  17  2 16.973902  2.026098
(0.128,0.201]   13  6 15.828544  3.171456
(0.201,0.243]   18  1 14.695710  4.304290
(0.243,0.279]   12  7 14.106205  4.893795
(0.279,0.314]   12  7 13.360124  5.639876
(0.314,0.387]   13  6 12.462805  6.537195
(0.387,0.483]   12  7 10.824166  8.175834
(0.483,0.594]    9 10  8.690142 10.309858
(0.594,0.839]    5 13  5.236180 12.763820
test groups   chi.sq    pvalue
1 Hosmer-Lemeshow     10 9.650683 0.2904041
```

We get a Hosmer-Lemeshow chi-squared value of 9.65 on 5 d.f. with a p-value of 0.2904, and thus no evidence of lack of fit.

Notes: Statistical packages differ in how they calculate quantiles. R implements 9 types; the default is 7 for compatibility with S and R < 2.0, but they recommend type 8. Type `?quantile` to learn more. We used type 2, the inverse of the empirical distribution function with averaging at discontinuities, for compatibility with Stata, but our function lets you try other types. A test using R's recommended definition of deciles yields a chi-squared of 10.55 on the same 8 d.f., with a p-value of 0.2283.

Paul Allison has noted that the Hosmer-Lemeshow test is sensitive to the number of groups used. He provides an example where a test with 10 groups yields a p-value just below 0.05, but working with 9 or 11 groups raises it to 0.11 and 0.64 respectively. In this example the p-values are also quite different, but the conclusion does not change.