Problem Set 1: Birth Weight in Philadelphia
Due Friday September 30, 2011
The datasets section of the course website has data on birth weight for a 10% sample of all births in Philadelphia in 1990. In this problem set we will explore the relationship between birth weight and education, adjusting for gestational age. For simplicity we focus on black mothers, who account for almost 60% of all births in that year. To access the data from Stata, type:
use http://data.princeton.edu/wws509/datasets/phbirths
Don't forget to drop non-black mothers.
[1] Simple Regressions
(a) Draw a scatterplot matrix for birth weight in grams, gestational age, and mother's education, and comment briefly on the relationship between each pair of variables.
(b) Run a simple linear regression of birth weight, measured in grams, on education, represented by years of schooling. Interpret the slope and test its significance using a t-test. Verify that the F-statistic reported by Stata for the model is simply the square of the t-statistic. Is the constant meaningful in this model?
(c) What proportion of the variation in birth weight is explained by education? How is this proportion related to Pearson's correlation coefficient?
(d) Run a regression of birth weight on gestational age, which is measured in weeks, and interpret the slope. Stata stores the RSS in e(rss); save it in a scalar for later use.
(e) Add a quadratic term on gestational age to test the linearity of this relationship. In general it is a good idea to center variables on the mean or a nearby value before squaring. This reduces collinearity and simplifies interpretation. You should conclude that we don't really need a quadratic term, but interpret the coefficient anyway.
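To see why centering helps before squaring, here is a minimal sketch in Python rather than Stata, so that it runs on its own. The gestational ages are made-up values, not the Philadelphia sample, and the helper corr and all variable names are hypothetical. It compares the correlation between a variable and its square before and after centering on 38 weeks.

```python
import math

def corr(u, v):
    """Pearson correlation between two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

# Made-up gestational ages in weeks (an assumption, not the course data).
gest = [34, 36, 37, 38, 38, 39, 40, 40, 41, 42]

raw_sq = [w ** 2 for w in gest]             # square of the uncentered variable
centered = [w - 38 for w in gest]           # center on a convenient nearby value
centered_sq = [c ** 2 for c in centered]    # square of the centered variable

r_raw = corr(gest, raw_sq)
r_centered = corr(centered, centered_sq)

print(f"corr(gest, gest^2)         = {r_raw:.4f}")
print(f"corr(gest-38, (gest-38)^2) = {r_centered:.4f}")
```

With values confined to a narrow range far from zero, the raw variable and its square are almost perfectly collinear, while the centered pair is only weakly correlated.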
[2] Multiple Regressions
(a) Run a regression of birth weight on gestational age and years of schooling and note that both slopes are highly significant. Interpret the estimate of the coefficient of education in this model. Compare it briefly with the estimate from the simple linear regression of 1.b and explain what you make of the fact that it is smaller.
(b) Construct an F-test for the net effect of education after adjusting for gestational age using the sums of squares you have already calculated. Verify that the test statistic coincides with the square of the t-statistic printed in the output. Save the RSS of the additive model for later use.
(c) Predict birth weight using education and gestational age, and calculate the simple linear correlation between observed and predicted values. How is this measure related to the R-squared reported by Stata for the multiple regression?
(d) Compute the proportion of the variation in birth weight left unexplained by gestational age that can be attributed to education, and verify that it is the square of the partial correlation coefficient, which you may calculate using Stata's pcorr.
(e) Add a new variable equal to the product of gestational age and education. I recommend you center the variables around convenient values such as 38 weeks and 12 years before computing the product. You should find no evidence against the assumption of additivity. Setting aside issues of significance, how would you interpret the fact that the estimate is positive? How do you interpret the coefficient of education in this model? How would you interpret it if we hadn't centered the variables?
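The arithmetic behind 2.b and 2.d can be sketched outside Stata. The Python fragment below is only an illustration under stated assumptions: the three lists are made-up numbers, not the Philadelphia sample, and the helpers centered and dot are hypothetical names. It builds the F-statistic for education from two residual sums of squares and compares it with the squared t-statistic, then computes the share of the variation left unexplained by gestational age that education picks up.

```python
import math

# Made-up toy data (an assumption, NOT the Philadelphia sample).
gest = [34, 36, 37, 38, 38, 39, 40, 40, 41, 42]            # weeks
educ = [10, 12, 9, 12, 14, 11, 16, 12, 13, 15]             # years of schooling
grams = [2300, 2750, 2800, 3100, 3250, 3200, 3600, 3350, 3500, 3800]

def centered(v):
    """Subtract the mean, which absorbs the constant of the regression."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

g, e, y = centered(gest), centered(educ), centered(grams)
n = len(y)

# Reduced model: birth weight on gestational age only.
b_g_only = dot(g, y) / dot(g, g)
rss_reduced = sum((yi - b_g_only * gi) ** 2 for gi, yi in zip(g, y))

# Additive model: solve the 2x2 normal equations on the centered data.
sgg, see, sge = dot(g, g), dot(e, e), dot(g, e)
det = sgg * see - sge ** 2
b_g = (see * dot(g, y) - sge * dot(e, y)) / det
b_e = (sgg * dot(e, y) - sge * dot(g, y)) / det
rss_full = sum((yi - b_g * gi - b_e * ei) ** 2 for gi, ei, yi in zip(g, e, y))

df_resid = n - 3                          # constant plus two slopes
s2 = rss_full / df_resid                  # residual mean square
t_educ = b_e / math.sqrt(s2 * sgg / det)  # t-statistic for education
f_educ = (rss_reduced - rss_full) / s2    # F on 1 and df_resid d.f.
share = (rss_reduced - rss_full) / rss_reduced

print(f"t^2 = {t_educ ** 2:.4f}, F = {f_educ:.4f}")
print(f"share of remaining variation due to education = {share:.4f}")
```

The same share can be recovered from the test statistic as F / (F + residual d.f.), which is a handy check on your own calculations.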
[3] Added-Variable Plots
(a) Regress birth weight on gestational age and save the raw residuals in a variable called gramsNetGest. Regress education on gestational age and save the raw residuals in a variable called educNetGest. (Stata is happiest with variable names of up to 12 letters. Feel free to use longer or shorter names than I have.)
(b) Plot birth weight net of gestational age against education net of gestational age. Do we have any indication that this relationship may not be linear? Use a lowess smoother to aid the eye.
Stata's lowess command does a locally weighted regression;
for each value of x it generates a fitted value using a line computed
from neighboring observations with weights that decline as you move away from x.
The size of the neighborhood is called the bandwidth, and is chosen as a tradeoff
between smoothness and goodness of fit.
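The idea in this note can be sketched in a few lines of Python. This is a simplified illustration, not Stata's implementation: the function name lowess_sketch, the tricube weight function, the nearest-neighbor definition of the neighborhood, and the toy data are all assumptions made for the example.

```python
def lowess_sketch(x, y, bandwidth=0.8):
    """Locally weighted linear fit at each x[i] (simplified; no robustness steps)."""
    n = len(x)
    k = max(2, int(round(bandwidth * n)))   # observations in each neighborhood
    fitted = []
    for x0 in x:
        # The k observations closest to x0 form the neighborhood.
        nearest = sorted(range(n), key=lambda j: abs(x[j] - x0))[:k]
        h = max(abs(x[j] - x0) for j in nearest) or 1.0
        # Tricube weights: 1 at x0, declining to 0 at the edge of the neighborhood.
        w = {j: (1 - (abs(x[j] - x0) / h) ** 3) ** 3 for j in nearest}
        sw = sum(w.values())
        mx = sum(w[j] * x[j] for j in nearest) / sw
        my = sum(w[j] * y[j] for j in nearest) / sw
        sxx = sum(w[j] * (x[j] - mx) ** 2 for j in nearest)
        sxy = sum(w[j] * (x[j] - mx) * (y[j] - my) for j in nearest)
        slope = sxy / sxx if sxx > 0 else 0.0
        # Fitted value from the weighted line, evaluated at x0.
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Made-up toy data with a bend in the middle (not the course data).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.0, 6.2, 5.9, 6.1, 6.0]

smooth = lowess_sketch(x, y, bandwidth=0.5)
for xi, yi, si in zip(x, y, smooth):
    print(f"x={xi:2d}  y={yi:4.1f}  smooth={si:6.3f}")
```

A smaller bandwidth follows the data more closely and a larger one gives a smoother curve, which is the tradeoff described above.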
(c) Compute the correlation between the constructed variables in 3.a, namely birth weight and education both net of gestational age, and verify that it is the same as the partial correlation of 2.d.
(d) Regress birth weight net of gestational age on education net of gestational age. The estimated constant should be essentially zero. Compare the estimated slope with the coefficient of education in 2.a.
(e) Construct an added-variable plot of birth weight on gestational age, both net of education, to check the linearity of that relationship in the multiple regression model.
Note: Stata's avplot command can do these calculations for you.
We do it 'by hand' here because it is instructive and it lets us use a lowess smoother,
not to mention the opportunity to verify the connection with partial correlations.
The idea extends easily to models with more than two predictors: you just plot y net of
all X's but one against that predictor net of all other X's.
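The 'by hand' construction can also be sketched outside Stata. The Python fragment below is an illustration under stated assumptions: the data are made-up numbers, not the Philadelphia sample, and the helpers residuals and corr are hypothetical names. It forms birth weight and education net of gestational age, checks that the constant of the net regression is essentially zero, and compares the correlation between the two sets of residuals with the partial correlation computed from the three pairwise correlations.

```python
import math

# Made-up toy data (an assumption, NOT the Philadelphia sample).
gest = [34, 36, 37, 38, 38, 39, 40, 40, 41, 42]            # weeks
educ = [10, 12, 9, 12, 14, 11, 16, 12, 13, 15]             # years of schooling
grams = [2300, 2750, 2800, 3100, 3250, 3200, 3600, 3350, 3500, 3800]

def corr(u, v):
    """Pearson correlation between two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def residuals(y, x):
    """Raw residuals from a simple regression of y on x with a constant."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (c - my) for a, c in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    const = my - slope * mx
    return [c - const - slope * a for a, c in zip(x, y)]

grams_net = residuals(grams, gest)   # birth weight net of gestational age
educ_net = residuals(educ, gest)     # education net of gestational age

# Regression of one set of residuals on the other.
n = len(grams_net)
me, mg = sum(educ_net) / n, sum(grams_net) / n
slope_net = (sum((a - me) * (c - mg) for a, c in zip(educ_net, grams_net))
             / sum((a - me) ** 2 for a in educ_net))
const_net = mg - slope_net * me

# Partial correlation from the pairwise correlations.
r_ye, r_yg, r_eg = corr(grams, educ), corr(grams, gest), corr(educ, gest)
r_partial = (r_ye - r_yg * r_eg) / math.sqrt((1 - r_yg ** 2) * (1 - r_eg ** 2))
r_resid = corr(grams_net, educ_net)

print(f"constant of net regression = {const_net:.2e}")
print(f"slope of net regression    = {slope_net:.4f}")
print(f"corr of residuals = {r_resid:.6f}, partial corr = {r_partial:.6f}")
```

Plotting grams_net against educ_net gives the added-variable plot for education; swapping the roles of education and gestational age in the calls to residuals gives the plot asked for in 3.e.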
Posted Thursday September 22, 2011
