Germán Rodríguez

Generalized Linear Models
Princeton University
We now consider regression diagnostics for binary data, focusing on logistic regression models. We will work with the additive model of contraceptive use by age, education, and desire for more children, which we know to be inadequate.

Recall that the deviance can be used as a goodness-of-fit test with grouped but not with individual data. An obvious approach with individual data is therefore to group them. One way to do this is to create a separate group for each distinct combination of covariate values in the sample. Obviously this may not work if there are too many distinct values or covariate patterns in the data. We will discuss another way of creating groups further below.
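To make the idea concrete, here is a minimal sketch on simulated individual-level data (the variable names `y`, `ed` and `wm` are hypothetical, not part of our dataset): collapsing to one row per covariate pattern yields grouped counts that can be fed to a binomial `glm()`, whose deviance is then usable as a goodness-of-fit test.

```r
# Sketch (simulated data): collapse individual binary responses into one row
# per covariate pattern, suitable for a grouped binomial fit
set.seed(1)
ind <- data.frame(y  = rbinom(200, 1, 0.4),
                  ed = sample(c("low", "high"), 200, replace = TRUE),
                  wm = sample(0:1, 200, replace = TRUE))
grp <- aggregate(cbind(yes = y, no = 1 - y) ~ ed + wm, data = ind, FUN = sum)
# the grouped fit would then be:
# glm(cbind(yes, no) ~ ed + wm, family = binomial, data = grp)
nrow(grp)  # one row per distinct covariate pattern
```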

In our application the point is moot, however, because the data are grouped. We have only 16 different groups of women, determined by 4 age groups, 2 education categories and 2 values of desire for more children. Moreover, the groups are fairly large, with all of them having more than 5 women, so the deviance can be treated safely as a chi-squared statistic. In our example the deviance of 29.92 on 10 d.f. indicates significant lack of fit.
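As a quick check, the p-value for a deviance of 29.92 on 10 d.f. can be computed directly with `pchisq()`:

```r
# Upper-tail chi-squared p-value for the additive model's deviance
pchisq(29.92, df = 10, lower.tail = FALSE)
```

The result is well below 0.05, confirming the significant lack of fit.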

The `residuals()` function can be used to obtain deviance and Pearson residuals (among others), by specifying `type="deviance"` or `type="pearson"`. The default is deviance residuals, the signed square roots of the contributions to the deviance. The alternative, Pearson residuals, are the signed square roots of the contributions to Pearson's chi-squared statistic.

```r
> additive <- glm(Y ~ age + education + nomore, family=binomial, data=cuse)
> dr <- residuals(additive)
> sum(dr^2)
[1] 29.91722
> pr <- residuals(additive, "pearson")
> sum(pr^2)
[1] 28.28834
```

Both are useful in identifying which groups contribute most to the lack of fit. Let us list the groups with deviance residuals with absolute values in excess of two:

```r
> cbind(
+   cuse[,c("age","education","nomore")],
+   obs = cuse$using/cuse$n,
+   fit = fitted(additive),
+   dr,
+   pr
+ )[abs(dr)>2,]
     age education nomore       obs       fit        dr        pr
4    <25      high      1 0.1666667 0.3082699 -2.514795 -2.375281
8  25-29      high      1 0.2934783 0.3967948 -2.065260 -2.025574
13 40-49       low      0 0.1463415 0.3149818 -2.491414 -2.324661
```

The additive model seems to do a particularly poor job with young educated women who want more children and with older uneducated women who want no more; in all three cases the model overestimates contraceptive use.

It may be instructive to compute these residuals "by hand" using observed and predicted counts. Don't forget that for each group we need to consider the contributions from the "successes" and the "failures", which we add together to form a group residual. Conventionally the group residual takes the sign of the difference between observed and fitted probabilities.

```r
> obs <- cuse$Y
> p <- fitted(additive)
> fit <- cbind(p * cuse$n, (1-p) * cuse$n)
> sgn <- sign(cuse$using/cuse$n - p)
> pr2 <- sqrt(( (obs-fit)^2/fit ) %*% c(1,1)) * sgn
> dr2 <- sqrt(( 2*obs*log(obs/fit) ) %*% c(1,1)) * sgn
> all(abs(dr-dr2) < 1e-12)
[1] TRUE
> all(abs(pr-pr2) < 1e-12)
[1] TRUE
```

Pregibon extended regression diagnostics to GLMs and introduced a weighted hat matrix. The diagonal elements or leverages can be calculated with `hatvalues()` and Cook's distances with `cooks.distance()`. (Don't be surprised if these look like the same functions we used for linear models. Like many other R functions, these are *generic* functions; R looks at the class of the object and calls the appropriate method, depending on whether the object is a linear model fitted by `lm()` or a generalized linear model fitted by `glm()`.)

Let us calculate these diagnostics and list the groups that have potential and/or actual influence on the fit.

```r
> pfit <- p # for clarity
> pobs <- cuse$using / cuse$n
> lev <- hatvalues(additive)
> cd <- cooks.distance(additive)
> i <- order(-lev) # sort descending
> cbind(cuse[,c("age","education","nomore")],pobs,pfit,lev,cd)[i[1:3],]
     age education nomore      pobs      pfit       lev          cd
3    <25      high      0 0.1969697 0.1623053 0.6696332 2.385867366
7  25-29      high      0 0.2583732 0.2223899 0.5774811 0.843656898
14 40-49       low      1 0.5106383 0.5140024 0.5601446 0.002054933
```

The three cells with potentially the largest influence on the fit are young women with some education who want more children, and older women with no education who want no more children. The youngest group had the most actual influence on the fit, whereas the older uneducated women had practically none, with a Cook's distance near zero.

Note: Cook's distance is calculated using Pregibon's one-step approximation, as described on page 49 of the notes. In short, it doesn't refit the model excluding an observation, but takes just one step in the IRLS algorithm, starting from the full-sample m.l.e.
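To see what the one-step approximation amounts to, here is a self-contained sketch on simulated data (the model and variables are hypothetical, not our contraceptive-use fit). It reproduces `cooks.distance()` for a binomial GLM from the formula (Pearson residual / (1 − h))² · h / (dispersion × rank):

```r
# One-step (Pregibon) Cook's distance computed by hand on simulated data
set.seed(2)
d   <- data.frame(x = rnorm(80), y = rbinom(80, 1, 0.5))
fit <- glm(y ~ x, family = binomial, data = d)
h   <- hatvalues(fit)            # leverages from the weighted hat matrix
pr  <- residuals(fit, type = "pearson")
cd  <- (pr / (1 - h))^2 * h / (1 * length(coef(fit)))  # dispersion = 1 for binomial
all(abs(cd - cooks.distance(fit)) < 1e-10)
```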

The values of the (weighted) hat matrix can be used to compute standardized Pearson and deviance residuals, just as we did in linear models. These residuals take into account differences in residual variances originating from both the outcome and the fit.

```r
> spr <- pr/sqrt(1-lev)
> sdr <- dr/sqrt(1-lev)
> i <- order(-spr^2)
> cbind(cuse[,c("age","education","nomore")],pobs,pfit,spr,sdr)[i[1:3],]
     age education nomore      pobs      pfit       spr       sdr
4    <25      high      1 0.1666667 0.3082699 -2.887789 -3.057406
13 40-49       low      0 0.1463415 0.3149818 -2.720763 -2.915930
8  25-29      high      1 0.2934783 0.3967948 -2.687572 -2.740228
```

We identify the same three observations picked up by the conventional residuals, but the absolute values are now closer to three, highlighting the lack of fit of these groups.

Hosmer and Lemeshow have proposed a goodness-of-fit test for logistic regression
models that can be used with individual data. The basic idea is to
create groups using predicted probabilities, and then compare observed and
fitted counts of successes and failures on those groups using a chi-squared
statistic. Simulation has shown that with *g* groups the large sample
distribution of the test statistic is approximately chi-squared with
*g-2* degrees of freedom.

Let us apply this test to the Hosmer and Lemeshow low birth weight data, which happen to be available in Stata format on the Stata website. We can read the data using the `read.dta()` function in the `foreign` package.

```r
> require(foreign)
> lbw <- read.dta("http://www.stata-press.com/data/r12/lbw.dta")
> names(lbw)
 [1] "id"    "low"   "age"   "lwt"   "race"  "smoke" "ptl"   "ht"    "ui"    "ftv"
[11] "bwt"
```

The outcome is an indicator of low birth weight (< 2500 grams). The predictors include mother's age, her weight at the last menstrual period, and indicators of ethnicity, smoking during pregnancy, history of premature labor, history of hypertension, and presence of uterine irritability. Here's the fit

```r
> model <- glm(low ~ age+lwt+race+smoke+ptl+ht+ui, family=binomial, data=lbw)
> summary(model)

Call:
glm(formula = low ~ age + lwt + race + smoke + ptl + ht + ui,
    family = binomial, data = lbw)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.9116  -0.8175  -0.5224   0.9716   2.1805

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.461224   1.204575   0.383  0.70180
age         -0.027100   0.036450  -0.743  0.45718
lwt         -0.015151   0.006926  -2.188  0.02870 *
raceblack    1.262647   0.526405   2.399  0.01646 *
raceother    0.862079   0.439147   1.963  0.04964 *
smoke        0.923345   0.400821   2.304  0.02124 *
ptl          0.541837   0.346247   1.565  0.11761
ht           1.832518   0.691624   2.650  0.00806 **
ui           0.758513   0.459374   1.651  0.09870 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 201.45  on 180  degrees of freedom
AIC: 219.45

Number of Fisher Scoring iterations: 4
```

The sample of 189 observations has 182 different covariate patterns, so we can't really group this way. Instead we will compute predicted probabilities and create ten groups of approximately equal size by breaking at the deciles of the predicted probabilities. It pays to encapsulate the calculations in a function that can be reused:

```r
> hosmer <- function(y, fv, groups=10, table=TRUE, type=2) {
+   # A simple implementation of the Hosmer-Lemeshow test
+   q <- quantile(fv, seq(0,1,1/groups), type=type)
+   fv.g <- cut(fv, breaks=q, include.lowest=TRUE)
+   obs <- xtabs( ~ fv.g + y)
+   fit <- cbind( e.0 = tapply(1-fv, fv.g, sum), e.1 = tapply(fv, fv.g, sum))
+   if(table) print(cbind(obs, fit))
+   chi2 <- sum((obs-fit)^2/fit)
+   pval <- pchisq(chi2, groups-2, lower.tail=FALSE)
+   data.frame(test="Hosmer-Lemeshow", groups=groups, chi.sq=chi2, pvalue=pval)
+ }
```

We calculate quantiles, defaulting to deciles, and use these to create groups, taking care to include all values. We then tabulate the observed and predicted outcomes, using the sum of predicted probabilities as the expected number of "successes" in a group. There is an option to print a table of observed and expected counts. The function returns a data frame with the number of groups, the chi-squared statistic, and its p-value.

Here's the result of applying our function with all the defaults:

```r
> hosmer(lbw$low, fitted(model))
                 0  1       e.0       e.1
[0.0273,0.0827] 19  0 17.822223  1.177777
(0.0827,0.128]  17  2 16.973902  2.026098
(0.128,0.201]   13  6 15.828544  3.171456
(0.201,0.243]   18  1 14.695710  4.304290
(0.243,0.279]   12  7 14.106205  4.893795
(0.279,0.314]   12  7 13.360124  5.639876
(0.314,0.387]   13  6 12.462805  6.537195
(0.387,0.483]   12  7 10.824166  8.175834
(0.483,0.594]    9 10  8.690142 10.309858
(0.594,0.839]    5 13  5.236180 12.763820
             test groups   chi.sq    pvalue
1 Hosmer-Lemeshow     10 9.650683 0.2904041
```

We get a Hosmer-Lemeshow chi-squared value of 9.65 on 8 d.f. with a p-value of 0.2904, and thus no evidence of lack of fit.

Notes:
Statistical packages differ in how they calculate quantiles. R implements 9 types; the default is type 7 for compatibility with S and R < 2.0, but the documentation recommends type 8. Type `?quantile` to learn more. We used type 2, the inverse of the empirical distribution function with averaging at discontinuities, for compatibility with Stata, but our function lets you try other types. A test using R's recommended definition of deciles yields a chi-squared of 10.55 on the same 8 d.f., with a p-value of 0.2283.
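As a tiny illustration of how much the type can matter, with just five data points the 10th percentile already differs between the Stata-like type 2 and R's default type 7:

```r
x <- 1:5
quantile(x, 0.1, type = 2)  # inverse ECDF with averaging: 1
quantile(x, 0.1, type = 7)  # linear interpolation (R default): 1.4
```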

Paul Allison has noted that the Hosmer-Lemeshow test is sensitive to the number of groups used. He provides an example where a test with 10 groups yields a p-value just below 0.05, but working with 9 or 11 groups raises it to 0.11 and 0.64, respectively. In our example the p-values obtained under different definitions of the deciles also differ somewhat, but the conclusion does not change.
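This sensitivity is easy to demonstrate on simulated data; the sketch below (hypothetical data and a small helper that mirrors the `hosmer()` function above) computes the test's p-value for 9, 10 and 11 groups on the same fit:

```r
# Sketch (simulated data): the H-L p-value can shift as the group count changes
set.seed(3)
n  <- 500
x  <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + x))
fv <- fitted(glm(y ~ x, family = binomial))
hl.p <- function(g) {
  q    <- quantile(fv, seq(0, 1, 1/g), type = 2)
  fv.g <- cut(fv, breaks = q, include.lowest = TRUE)
  obs  <- xtabs(~ fv.g + y)
  fit  <- cbind(tapply(1 - fv, fv.g, sum), tapply(fv, fv.g, sum))
  pchisq(sum((obs - fit)^2 / fit), g - 2, lower.tail = FALSE)
}
sapply(c(9, 10, 11), hl.p)  # one p-value per choice of group count
```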