Problem Set 4: Volunteering
Due Friday November 18, 2011

For this problem set we will use data from the 1996 General Social Survey, analyzed by Hoffmann, and available from the course website at http://data.princeton.edu/wws509/datasets/gss96.dta.

We will be primarily interested in volteer, a variable representing the number of volunteer activities in the past year. Note that gender is a dummy for females, better named female, and race is a dummy for non-whites, better named nonwhite; please rename both variables. Two other predictors of interest are education and income. These variables have many missing values; drop the observations with missing values at this stage.

[1] A Poisson Model

(a) Hoffmann fits a Poisson model to the number of volunteer activities using dummies for females and non-whites, a linear term on education, and a linear term on income. Fit the same model.

(b) Interpret the coefficients and comment briefly on their significance on the basis of the Wald test. (We could do likelihood ratio tests, but we'll stick to Wald tests for simplicity.)

(c) Add quadratic terms on education and income and an interaction between female and nonwhite. You should find that all three terms are significant.

(d) Does the Poisson model with these additions fit the data? Justify your answer with a likelihood ratio test.

(e) Compute Pearson residuals, defined as (y - μ)/sqrt(var(y)), where μ is the fitted mean and var(y) = μ. Are there any outliers? You may find a plot of residuals versus fitted values useful, but it is not required.
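
As a hedged illustration of the residual formula in (e), the short Python sketch below applies it to a handful of made-up counts and fitted means. The function name pearson_residuals and the toy numbers are assumptions for illustration only; they are not taken from the GSS data or from Stata's output.

```python
import math

def pearson_residuals(y, mu):
    """Poisson Pearson residuals: (y - mu) / sqrt(mu), because var(y) = mu."""
    return [(yi - mi) / math.sqrt(mi) for yi, mi in zip(y, mu)]

# Hypothetical observed counts and fitted means (not real data).
y = [0, 1, 4, 0]
mu = [0.25, 1.0, 1.0, 4.0]

residuals = pearson_residuals(y, mu)
for yi, mi, ri in zip(y, mu, residuals):
    print(f"y={yi} mu={mi:.2f} residual={ri:.2f}")
```

Residuals that are large in absolute value (a common informal threshold is beyond about 2 or 3) would be the candidates to inspect as outliers.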

(f) Verify that the model underestimates the probability of zero activities by 10 percentage points. (To do this compute the predicted mean for each person, use that to compute the predicted probability of zero activities for each person, and then compare the average with the proportion who have zero activities.)
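
The comparison described in (f) can be sketched as follows, assuming only the standard Poisson result that the probability of a zero count is exp(-mean). The helper names and the toy values are hypothetical; with the real data the fitted means would come from the model in 1c.

```python
import math

def mean_poisson_zero_prob(mu):
    """Average over individuals of the Poisson probability of zero, exp(-mu_i)."""
    return sum(math.exp(-m) for m in mu) / len(mu)

def observed_zero_share(y):
    """Proportion of observations with a count of exactly zero."""
    return sum(1 for yi in y if yi == 0) / len(y)

# Hypothetical fitted means and observed counts (not real data).
mu = [0.2, 0.5, 1.0, 0.3]
y = [0, 0, 2, 0]

predicted = mean_poisson_zero_prob(mu)
observed = observed_zero_share(y)
print(f"predicted P(0) = {predicted:.3f}, observed share = {observed:.3f}")
```

Note that the average is taken over the individual probabilities, which is not the same as exp(-mean of mu).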

[2] Over-Dispersed Poisson

(a) Refit the final model of part 1 using glm with the Poisson family and the scale(x2) option to correct for overdispersion. What's the estimate of the scale parameter?
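
For orientation, the scale(x2) option sets the scale parameter to Pearson's chi-squared statistic divided by the residual degrees of freedom. The sketch below reproduces that calculation by hand in Python on made-up numbers; the function name pearson_scale, the toy data, and the assumed parameter count are all hypothetical.

```python
def pearson_scale(y, mu, n_params):
    """Pearson chi-squared divided by residual df (n minus number of parameters)."""
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / (len(y) - n_params)

# Hypothetical counts and fitted means from a model assumed to have 2 parameters.
y = [0, 1, 4, 0, 2]
mu = [0.5, 1.0, 2.0, 1.0, 0.5]

phi = pearson_scale(y, mu, n_params=2)
print(f"estimated scale parameter = {phi:.3f}")
```

A value well above 1 suggests overdispersion, and the corrected standard errors are the Poisson ones multiplied by the square root of this estimate.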

(b) Revisit the question of which effects are significant using your corrected estimates. (We can't do likelihood ratio tests with over-dispersed data, so this time we have to use Wald tests.)

(c) Re-estimate the overdispersion parameter after dropping the squared terms on education and income and the interaction. We exclude these terms from all subsequent models.

(d) Compute new Pearson residuals corrected for overdispersion, which we define as in 1e but with var(y) = φμ, where φ is the overdispersion parameter. Any outliers? Again, a plot of residuals versus fitted values may be useful but is not required.

[3] Negative Binomial

(a) Fit a negative binomial model with the same predictors as in 1a and 2c, and compare your results to the over-dispersed Poisson model of 2c. Pay particular attention to differences in point estimates and standard errors.

(b) The variance in a negative binomial model is μ(1 + σ²μ). Use this information to compute Pearson residuals for the last model and comment on any outliers.

(c) Check how well this model estimates the probability of zero activities by predicting the probability using the negative binomial formula and comparing with the observed proportion.
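
As a sketch of the formula needed in (c): for a negative binomial with mean mu and variance mu(1 + alpha mu), where alpha plays the role of the overdispersion parameter in 3b, the probability of a zero count is (1 + alpha mu)^(-1/alpha). The Python snippet below evaluates it on made-up values; the function name and the numbers are hypothetical.

```python
import math

def nb_zero_prob(mu, alpha):
    """P(Y = 0) for a negative binomial with mean mu and variance mu * (1 + alpha * mu)."""
    return (1.0 + alpha * mu) ** (-1.0 / alpha)

# Hypothetical fitted means and overdispersion parameter (not real estimates).
mu = [0.2, 0.5, 1.0, 0.3]
alpha = 2.0

avg_zero_prob = sum(nb_zero_prob(m, alpha) for m in mu) / len(mu)
print(f"average predicted P(0) = {avg_zero_prob:.3f}")
```

As alpha approaches zero the expression tends to exp(-mu), the Poisson probability of zero, which is a useful sanity check.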

[4] Zero-Inflated Poisson (ZIP)

(a) Stata's zip command fits a zero-inflated Poisson, using a logit model to predict whether people volunteer at all, and then a Poisson model for the number of volunteer activities for people who do volunteer (including zero times). Fit this model using just a constant (or null model) for the "inflate" equation.

(b) Repeat the analysis, this time allowing all predictors to enter the logit equation for the probability of volunteering, and test the overall significance of this addition.

(c) Compute the predicted probability of no activities and show that this agrees very well with the observed proportion. (Note that zeroes come both from the logit and Poisson parts of the model.)
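
The note in (c) about the two sources of zeroes can be sketched as below. It assumes pi_zero is the logit-part probability of being an "always zero" case and mu is the Poisson mean for everyone else, so that P(0) = pi_zero + (1 - pi_zero) exp(-mu). The function name and the toy values are hypothetical, not model output.

```python
import math

def zip_zero_prob(pi_zero, mu):
    """P(Y = 0) in a zero-inflated Poisson: structural zeroes plus Poisson zeroes."""
    return pi_zero + (1.0 - pi_zero) * math.exp(-mu)

# Hypothetical per-person inflation probabilities and Poisson means (not real data).
pi_zero = [0.6, 0.7, 0.5, 0.8]
mu = [1.2, 0.9, 1.5, 1.0]

avg_zero_prob = sum(zip_zero_prob(p, m) for p, m in zip(pi_zero, mu)) / len(mu)
print(f"average predicted P(0) = {avg_zero_prob:.3f}")
```

The average of these per-person probabilities is the quantity to compare with the observed proportion of zeroes.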

Posted Tuesday, November 8, 2011