Germán Rodríguez

Generalized Linear Models
Princeton University
Due Monday, October 24, 2016

Cameron and Trivedi (2010) have an interesting dataset based on wave 5 of the Health and Retirement Study (HRS), a survey conducted in 2002 as part of a panel sponsored by NIH. The sample consists of Medicare beneficiaries and the question of interest is whether or not they purchase supplemental insurance (`ins`). The explanatory variables include socio-economic and demographic factors and an indicator of health status. The data are available from the Stata website, but for your convenience you will find `mus14data.dta` in the datasets section.

(a) What's the proportion of respondents who have supplemental insurance? What are the odds of having insurance? What's the logit? Construct a 95% confidence interval for the logit and translate it back into corresponding intervals for the odds and probability.

(b) Hispanics are less likely to have supplemental insurance than others. Estimate the proportions with insurance among hispanics and others and calculate the difference. Estimate and interpret the odds ratio, and test its significance using a Wald test and a likelihood ratio test.

(c) Let us jump directly to the model used by Cameron and Trivedi. Fit a logit model using retirement status (`retire`), `age`, health status (`hstatusg`, coded 1 for good, very good or excellent and 0 otherwise), household income (`hhincome`), education in years (`eduyears`), the indicator for `married`, and the indicator of hispanic ethnicity (`hisp`). You should find that all but age have significant effects at the five percent level.

(d) Interpret carefully the odds ratio for hispanics, comparing it with the result of part (b). Test the significance of the ethnicity effect using a Wald test and a likelihood ratio test.

(e) Turns out these estimates are sensitive to the specification of household income. An obvious alternative is log-income, but some households have no income. Instead we will group this variable into quartiles, which has been done in `qhhinc`. Refit the model using this alternative specification and comment on the odds ratio for hispanics. Use this alternative specification in what follows.
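The arithmetic behind parts (a) and (b) can be sketched in a few lines of Python. This is only an illustration of the formulas, not a solution: the counts are hypothetical and do not come from `mus14data.dta`, and the function names are made up for this sketch. The interval is built on the logit scale using the large-sample standard error and then transformed back, and the likelihood ratio test compares separate binomial fits for the two groups with a single pooled fit.

```python
import math

def logit_ci(successes, n, z=1.96):
    """Proportion, odds, logit, and a Wald interval built on the logit scale."""
    p = successes / n
    odds = p / (1 - p)
    logit = math.log(odds)
    # Large-sample standard error of the empirical logit
    se = math.sqrt(1 / successes + 1 / (n - successes))
    lo, hi = logit - z * se, logit + z * se
    expit = lambda x: 1 / (1 + math.exp(-x))
    return {
        "p": p, "odds": odds, "logit": logit, "se": se,
        "ci_logit": (lo, hi),
        "ci_odds": (math.exp(lo), math.exp(hi)),
        "ci_prob": (expit(lo), expit(hi)),
    }

def binom_loglik(y, n):
    """Binomial log-likelihood at the MLE p = y/n (constant term dropped)."""
    p = y / n
    return y * math.log(p) + (n - y) * math.log(1 - p)

def odds_ratio_tests(y1, n1, y0, n0):
    """Odds ratio of group 1 vs group 0 with Wald and likelihood ratio tests."""
    log_or = math.log(y1 / (n1 - y1)) - math.log(y0 / (n0 - y0))
    se = math.sqrt(1 / y1 + 1 / (n1 - y1) + 1 / y0 + 1 / (n0 - y0))
    wald_z = log_or / se
    # LR statistic: twice the gain in log-likelihood from separate proportions
    lr = 2 * (binom_loglik(y1, n1) + binom_loglik(y0, n0)
              - binom_loglik(y1 + y0, n1 + n0))
    # Upper tail of a chi-squared with 1 d.f.: P(X > x) = erfc(sqrt(x / 2))
    p_chi2 = lambda x: math.erfc(math.sqrt(x / 2))
    return {"or": math.exp(log_or), "wald_chi2": wald_z ** 2,
            "p_wald": p_chi2(wald_z ** 2), "lr_chi2": lr, "p_lr": p_chi2(lr)}

# Hypothetical counts, NOT taken from mus14data.dta
overall = logit_ci(400, 1000)
by_group = odds_ratio_tests(y1=20, n1=100, y0=380, n0=900)
```

With a single binary predictor the logit model is saturated, so these hand calculations should agree with the coefficient, Wald statistic and likelihood ratio statistic reported by a fitted logit model with `hisp` as the only covariate.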

We will estimate several types of marginal effects. I will refer to the calculation based on the derivative as "continuous" and to the unit change as "discrete".

(a) Estimate the marginal effect of hispanic using the continuous formula at the mean of all covariates. Try a quick approximation using the formula evaluated at the overall probability. This is known as the marginal effect at the mean.

(b) Predict insurance for hispanics and others with the other covariates set to their means, and calculate the difference. This is the unit change version of the marginal effect at the mean.

(c) Predict insurance for everyone with all variables as they are, compute the marginal effect of hispanic for each respondent using the continuous approximation, and average the results over all respondents. This is known as the average marginal effect.

(d) Make a copy of `hisp` for safekeeping. Now set `hisp` to 1 for everyone, predict the probability of having insurance and average. Next set `hisp` to 0 for everyone, predict again and average. The difference between these two means is the discrete version of the average marginal effect. Don't forget to set `hisp` back to its true value.

(e) The last two approaches avoid setting a variable such as `married` to its mean of 0.773, which is not meaningful. Another approach is to select a combination of predictors of interest and predict for that case. Cameron and Trivedi use a 75-year-old married person with good health status, 12 years of education and an income in the third quartile. Calculate the predicted probabilities if the person is hispanic and if not, and compute the difference.

Note: Stata's powerful `margins` command can do these calculations, but you should do them "by hand", so you know exactly what's being done.
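To make the distinction between the four quantities in parts (a) through (d) concrete, here is a minimal Python sketch of the "by hand" logic. Everything in it is an assumption made for illustration: the coefficients, the toy sample and the helper names are hypothetical, and the model has only three covariates rather than the full specification. The continuous version multiplies the coefficient by p(1 - p); the discrete version takes a difference of predicted probabilities with `hisp` set to 1 and to 0.

```python
import math

def expit(x):
    return 1 / (1 + math.exp(-x))

# Hypothetical logit coefficients and a toy sample, for illustration only
beta = {"const": -1.0, "hisp": -0.8, "age": 0.01, "married": 0.5}
rows = [
    {"hisp": 1, "age": 66, "married": 1},
    {"hisp": 0, "age": 70, "married": 0},
    {"hisp": 0, "age": 75, "married": 1},
    {"hisp": 1, "age": 68, "married": 0},
]

def predict(row, **override):
    """Predicted probability, optionally overriding some covariates."""
    x = {**row, **override}  # builds a copy, so the original row is untouched
    eta = beta["const"] + sum(beta[k] * v for k, v in x.items())
    return expit(eta)

def mean(values):
    values = list(values)
    return sum(values) / len(values)

means = {k: mean(r[k] for r in rows) for k in ("hisp", "age", "married")}

# Marginal effect at the mean (MEM): evaluate once, at the covariate means
p_bar = predict(means)
mem_continuous = beta["hisp"] * p_bar * (1 - p_bar)
mem_discrete = predict(means, hisp=1) - predict(means, hisp=0)

# Average marginal effect (AME): evaluate for each respondent, then average
ame_continuous = mean(beta["hisp"] * predict(r) * (1 - predict(r)) for r in rows)
ame_discrete = mean(predict(r, hisp=1) - predict(r, hisp=0) for r in rows)
```

Overriding `hisp` inside `predict` plays the role of setting the variable to 1 or 0 for everyone without modifying the data, so there is nothing to restore afterwards. The prediction for a specific profile, as in part (e), would use the same function with every covariate supplied explicitly.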

(a) With individual data the deviance does not have a $\chi^2$ distribution, and even with household income grouped into categories there are too many covariate patterns to trust the asymptotics. Calculate the Hosmer-Lemeshow goodness of fit test using ten groups of approximately equal size based on predicted probabilities.

(b) Compute predicted probabilities of having insurance and predict that a respondent will have insurance if the predicted probability is 0.5 or more. Tabulate predicted versus actual outcomes. What's the overall error rate? Comment on the numbers of false positives and false negatives. How well would we do if we just predicted that nobody would buy supplemental insurance? How much reduction in error did the model achieve? (Stata's `estat classification` will report more details, but you must also do this part "by hand".)
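Both calculations in this part take only a vector of outcomes and a vector of fitted probabilities, so they are easy to sketch in Python. The functions below are a hypothetical illustration, not a reference implementation: the names are made up, ties in the predicted probabilities are split arbitrarily across groups (Stata's grouping rule may differ), and the small example at the end uses invented numbers.

```python
def hosmer_lemeshow(y, p, groups=10):
    """Hosmer-Lemeshow chi-squared statistic and its degrees of freedom.

    Observations are sorted by predicted probability and cut into `groups`
    bins of nearly equal size; the reference distribution has groups - 2 d.f.
    """
    pairs = sorted(zip(p, y))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        observed = sum(yi for _, yi in chunk)
        expected = sum(pi for pi, _ in chunk)
        # Pearson-type contribution: (O - E)^2 / (n_g * pbar_g * (1 - pbar_g))
        stat += (observed - expected) ** 2 / (expected * (1 - expected / size))
    return stat, groups - 2

def classification_table(y, p, cutoff=0.5):
    """Cross-tabulate predicted (p >= cutoff) against actual outcomes."""
    table = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for yi, pi in zip(y, p):
        predicted = pi >= cutoff
        key = ("t" if predicted == bool(yi) else "f") + ("p" if predicted else "n")
        table[key] += 1
    n = len(y)
    error = (table["fp"] + table["fn"]) / n
    # Baseline: predict "no insurance" for everyone, so every actual 1 is an error
    baseline = sum(y) / n
    return table, error, baseline

# Hypothetical outcomes and fitted probabilities, for illustration only
table, error, baseline = classification_table([1, 0, 1, 0], [0.9, 0.6, 0.4, 0.1])
```

The proportional reduction in error can then be computed as `(baseline - error) / baseline`, which compares the model against the rule that predicts nobody buys supplemental insurance.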

(a) Fit the model using a probit specification, compute the continuous marginal effect using the overall probability of having insurance, and compare with the results of part 2 (a).

(b) Fit the same model using OLS. The coefficient of `hisp` is often justified as a quick estimate of the average marginal effect of ethnicity adjusted for the other variables. How does it compare with the logit and probit estimates?
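The continuous marginal effect in part (a) has the same structure under both links: the coefficient times the density of the link distribution evaluated at the chosen probability. A short Python sketch of that comparison follows; the function names, coefficients and overall probability are hypothetical placeholders, not estimates from the data.

```python
from statistics import NormalDist

def logit_marginal_effect(beta, p):
    """Continuous marginal effect under a logit link: beta * p * (1 - p)."""
    return beta * p * (1 - p)

def probit_marginal_effect(beta, p):
    """Continuous marginal effect under a probit link: beta * phi(Phi^{-1}(p))."""
    std_normal = NormalDist()
    return beta * std_normal.pdf(std_normal.inv_cdf(p))

# Hypothetical coefficients and overall probability, for illustration only
p_overall = 0.4
me_logit = logit_marginal_effect(-0.8, p_overall)
me_probit = probit_marginal_effect(-0.5, p_overall)
```

At p = 0.5 the two scale factors are 0.25 and about 0.399, which is one way to see why logit coefficients tend to be roughly 1.6 times their probit counterparts while the implied marginal effects are usually close. The OLS coefficient in part (b) is already on the probability scale, so it can be compared with these quantities directly.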

Posted October 14, 2016