Problem Set 3

Hosmer and Lemeshow have data from a study of risk factors associated with low birth weight (< 2500 grams), collected at Baystate Medical Center, Springfield, Massachusetts, in 1986. I downloaded the data and created a Stata file that you can read directly from the course website with the command

use http://data.princeton.edu/wws509/datasets/lowbwt

Use desc to see a brief description of the variables.

[1] Low Birth Weight by Race

(a) Calculate the proportions with low birth weight by ethnicity and report them together with the sample sizes. Comment briefly on the observed differentials.

(b) Fit a logistic regression model using dummy variables for blacks and other, and interpret the ethnicity coefficients in terms of odds ratios.

(c) Construct a 95% confidence interval for the ratio of the odds of low birth weight for blacks compared to whites and summarize it in one sentence.

(d) Verify 'by hand' the estimates of the constant and the coefficient for blacks.

(e) Test the significance of the ethnicity differentials using (i) a likelihood ratio test, and (ii) a Wald test.
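
As a reminder of the arithmetic behind 1(c), here is a minimal sketch in Python (not Stata) of a Wald-type interval for an odds ratio: build the interval on the logit scale and exponentiate the end points. The function name odds_ratio_ci and the coefficient and standard error are hypothetical, made up for illustration; they are not estimates from the low birth weight data.

```python
import math


def odds_ratio_ci(coef, se, z=1.96):
    """Return (lower, odds_ratio, upper) for a logit coefficient.

    The interval is computed on the log-odds scale as coef +/- z * se
    and then exponentiated, so it is not symmetric around the odds ratio.
    """
    lower = math.exp(coef - z * se)
    upper = math.exp(coef + z * se)
    return lower, math.exp(coef), upper


# Hypothetical coefficient and standard error, NOT from the actual data.
lower, odds_ratio, upper = odds_ratio_ci(coef=0.5, se=0.25)
print(f"OR = {odds_ratio:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
```

Because the interval is symmetric on the log scale, the product of the two end points equals the squared odds ratio, which is a quick check on any hand calculation.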

[2] Adjusting for Other Characteristics

(a) Fit a model including controls for age, weight at last menstrual period, smoking, history of premature labor, hypertension, and uterine irritability. Following Hosmer and Lemeshow, treat premature labor as a simple dichotomy (because of small counts) and model weight at last menstrual period using an indicator for weight below 110 lbs rather than a linear term.

(b) Test the significance of the ethnicity coefficients using (i) a likelihood ratio test, (ii) a Wald test. Comment.

(c) Construct a 95% confidence interval for the odds ratio comparing blacks to whites.

(d) Produce a display similar to 1a but showing proportions with low birth weight by ethnicity adjusted for the covariates and comment briefly. We will try two ways of doing this:

(i) Predict the probability of low birth weight for each ethnic group with all other variables set to their means in the entire sample.

(ii) Predict the probability of low birth weight for each observation setting race first to white, then to black, and finally to other, leaving all other variables as they are.

Why two methods? One problem with (i) is that it doesn't make a lot of sense to set dummy variables to their means. What does it mean to predict with premature labor set to 0.16? With linear models this is justified because predicting at the mean gives the mean prediction, so it is equivalent to predicting with a history of premature labor 16% of the time and without the remaining 84%. Unfortunately with logit models this is no longer true, hence the need for method (ii), which predicts at actual values while equalizing the composition of the groups being compared.

Try to do this from first principles. If you get stuck see the Stata help at the end. I also show how you can verify your results using the margins command in Stata 11 or the adjust command in Stata 10 and earlier.
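
To see why methods (i) and (ii) in 2(d) can disagree, here is a small numerical illustration in Python (not Stata). It compares predicting at the mean of a dummy variable with averaging the predictions over its two values, first on the linear scale and then on the probability scale. The intercept and slope are hypothetical values chosen only for illustration; the 0.16 share is the one quoted above.

```python
import math


def inv_logit(eta):
    """Map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical coefficients, NOT estimates from the low birth weight data.
intercept = -1.0
slope = 1.5    # coefficient of the premature-labor dummy (made up)
share = 0.16   # proportion with a history of premature labor (from the text)

# Linear scale: predicting at the mean equals the mean prediction.
linear_at_mean = intercept + slope * share
linear_averaged = share * (intercept + slope) + (1 - share) * intercept

# Probability scale: the two no longer agree because inv_logit is nonlinear.
prob_at_mean = inv_logit(intercept + slope * share)
prob_averaged = (share * inv_logit(intercept + slope)
                 + (1 - share) * inv_logit(intercept))

print(f"linear:      at mean {linear_at_mean:.4f}, averaged {linear_averaged:.4f}")
print(f"probability: at mean {prob_at_mean:.4f}, averaged {prob_averaged:.4f}")
```

The size of the gap depends on the coefficients and on how far the predictions are from 0.5, so it may be small or large in any given application.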

[3] Goodness of Fit

(a) One way to assess the goodness of fit of a model is to introduce additional terms. Hosmer and Lemeshow consider several interactions, including interactions between age and the indicator of mother's low weight (LWD), and between smoking and LWD. Verify that these two are not significant individually, and that a joint likelihood ratio test yields a p-value of 0.06. Unlike Hosmer and Lemeshow, we will not retain these terms.

(b) Try the estat gof command. How many covariate patterns did Stata find? Can the test be trusted? Repeat using 10 groups. The groups can be designed to have equal size or equal ranges of predicted probabilities. Which approach does Stata use?

(c) Calculate predicted probabilities and classify anyone with 0.5 or more as predicted to be low birth weight. Tabulate observed versus predicted birth weight status and verify your results using Stata's estat classif. What's the proportionate reduction in classification error, relative to the null model?

(d) Some authors report a pseudo R-squared, defined as the proportionate reduction in deviance compared to the null model. This is exactly analogous to R-squared in linear models, with the deviance playing the role of the RSS. Check that this is in fact what Stata reports as Pseudo R2 and comment on the value.
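
The definition in 3(d) can be written compactly as follows. The symbols are my own notation, not the original's: D is the deviance of the fitted model and D_0 the deviance of the null model.

```latex
R^2_{\text{pseudo}} = \frac{D_0 - D}{D_0} = 1 - \frac{D}{D_0}
```

With individual (ungrouped) binary data the saturated model has log-likelihood zero, so each deviance is just minus twice the corresponding log-likelihood, and the ratio D/D_0 equals the ratio of the two log-likelihoods.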

[4] Latent and Manifest Variables

(a) Run a probit model using the same predictors as in 2a and interpret the coefficients for race in terms of a latent variable.

(b) The dataset actually includes birth weight in grams, which can be seen as analogous to the latent variable, except of course, that it is observed. Regress birth weight on the same predictors as in 2a and interpret the coefficients of race.

(c) Compare the logit, probit, and linear coefficients, standardizing them so they have the same sign and the same variance of the error term. Why would one use low birth weight rather than weight in grams as the outcome?

Stata Help for 2d

Posted on Wednesday, October 21, 2009