Solutions to Problem Set 1
Birth Weight in Philadelphia
The datasets section of the course website has data on birth weight for a 10% sample of all births in Philadelphia in 1990. In this problem set we will explore the relationship between birth weight and years of schooling adjusting for gestational age. For simplicity we focus on black mothers, who account for almost 60% of all births in that year. To access the data from Stata type
use http://data.princeton.edu/wws509/datasets/phbirths
. use http://data.princeton.edu/wws509/datasets/phbirths
(Births in Philadelphia in 1990)
. desc
Contains data from http://data.princeton.edu/wws509/datasets/phbirths.dta
  obs:         1,115                          Births in Philadelphia in 1990
 vars:             5                          16 Sep 2009 22:21
 size:        22,300                          (_dta has notes)
----------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------------
black           float   %9.0g      yesno      Mother is black
educ            float   %9.0g                 Mother's years of education
smoke           float   %9.0g      yesno      Whether mother smoked during pregnancy
gestate         float   %9.0g                 Gestational age in weeks
grams           float   %9.0g                 Birth weight in grams
----------------------------------------------------------------------------------------------------
Sorted by:
. keep if black
(453 observations deleted)
[1] Simple Regressions
(a) Draw a scatterplot matrix for birth weight in grams, gestational age, and mother's education, and comment briefly on the relationship between each pair of variables.
. graph matrix grams gestate educ
. graph export ps1fig1.png, width(500) replace
(file ps1fig1.png written in PNG format)

Birth weight increases with gestational age as one would expect. The association between birth weight and years of schooling is weaker and harder to ascertain from this graph, as is the association between gestational age and education.
(b) Run a simple linear regression of birth weight, measured in grams, on education, represented by years of schooling. Interpret the slope and test its significance using a t-test. Verify that the F-test reported by Stata for the model is simply the square of the t-statistic. Is the constant meaningful in this model?
. reg grams educ
Source | SS df MS Number of obs = 662
-------------+------------------------------ F( 1, 660) = 9.24
Model | 3723325.9 1 3723325.9 Prob > F = 0.0025
Residual | 265897137 660 402874.451 R-squared = 0.0138
-------------+------------------------------ Adj R-squared = 0.0123
Total | 269620463 661 407897.826 Root MSE = 634.72
------------------------------------------------------------------------------
grams | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ | 40.9268 13.46254 3.04 0.002 14.49222 67.36137
_cons | 2593.515 163.6041 15.85 0.000 2272.268 2914.763
------------------------------------------------------------------------------
. di (_b[educ]/_se[educ])^2
9.2419013
The coefficient of 40.9 indicates that births to more educated mothers
tend to weigh more, on average about 41 grams per year of schooling.
The t-statistic of 3.04 is highly significant, with a p-value of 0.002.
Its square is 9.24, which is exactly the reported F-statistic. The constant
represents the expected birth weight for black mothers with no education,
2.6 kg, and technically it is meaningful, although the graph shows that
we would be extrapolating outside the bulk of the data. (The central 90%
has between 9 and 16 years of schooling, and there is only one
mother with no schooling; in fact only one with less than 6 years, as
you can learn from Stata's sum educ, d.)
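The check mentioned in the parenthesis can be sketched as follows. This is a minimal sketch written as do-file lines (no dot prompt), assuming the data are reloaded and restricted to black mothers as in the introduction; the display and count lines are illustrative additions, not part of the original solution.

```stata
* Reload the data and keep black mothers, as in the introduction
use http://data.princeton.edu/wws509/datasets/phbirths, clear
keep if black

* Detailed summary; percentiles are left behind in r(p5), r(p95), etc.
summarize educ, detail
display r(p5) "  " r(p95)     // bounds of the central 90% of schooling

* How many mothers have fewer than 6 years of schooling
count if educ < 6
```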
(c) What proportion of the variation in birth weight is explained by education? How is this proportion related to Pearson's correlation coefficient?
. di e(mss)/(e(mss)+e(rss))
.01380951
. cor grams educ
(obs=662)
| grams educ
-------------+------------------
grams | 1.0000
educ | 0.1175 1.0000
. di r(rho)^2
.01380951
Stata's R-sq, which we calculate using the stored model and residual sum of squares, shows that only 1.38% of the variation in birth weight can be attributed to education. The simple linear correlation of 0.1175 is the square root of the proportion of variance explained.
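Written out with the sums of squares from the regression output in 1.b, the calculation is as follows. The square root is taken with the sign of the slope, which is positive here.

```latex
\[
R^2 = \frac{\text{MSS}}{\text{MSS} + \text{RSS}}
    = \frac{3723325.9}{3723325.9 + 265897137} \approx 0.0138,
\qquad
r = +\sqrt{R^2} \approx \sqrt{0.0138} \approx 0.1175 .
\]
```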
(d) Run a regression of birth weight on gestational age, which is
measured in weeks, and interpret the slope. Stata stores the RSS in
e(rss); save it in a scalar for later use.
. reg grams gestate
Source | SS df MS Number of obs = 662
-------------+------------------------------ F( 1, 660) = 839.54
Model | 150950894 1 150950894 Prob > F = 0.0000
Residual | 118669569 660 179802.378 R-squared = 0.5599
-------------+------------------------------ Adj R-squared = 0.5592
Total | 269620463 661 407897.826 Root MSE = 424.03
------------------------------------------------------------------------------
grams | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gestate | 160.2748 5.531529 28.97 0.000 149.4133 171.1363
_cons | -3078.608 213.3673 -14.43 0.000 -3497.569 -2659.647
------------------------------------------------------------------------------
. scalar rss_g = e(rss)
The coefficient shows that birth weight increases with gestational age: a difference of one week in gestational age is associated with a difference of about 160 grams in birth weight.
(e) Add a quadratic term on gestational age to test the linearity of this relationship. In general it is a good idea to center variables on the mean or a nearby value before squaring. This reduces collinearity and simplifies interpretation. You should conclude that we don't really need a quadratic term, but interpret the coefficient anyway.
. gen gestcsq = (gestate-38)^2
. reg grams gestate gestcsq
Source | SS df MS Number of obs = 662
-------------+------------------------------ F( 2, 659) = 422.41
Model | 151468871 2 75734435.3 Prob > F = 0.0000
Residual | 118151593 659 179289.215 R-squared = 0.5618
-------------+------------------------------ Adj R-squared = 0.5605
Total | 269620463 661 407897.826 Root MSE = 423.43
------------------------------------------------------------------------------
grams | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gestate | 149.6621 8.336406 17.95 0.000 133.2929 166.0312
gestcsq | -1.38969 .8175982 -1.70 0.090 -2.995101 .2157217
_cons | -2657.839 326.6151 -8.14 0.000 -3299.171 -2016.508
------------------------------------------------------------------------------
The coefficient of -1.4 is not significant at the conventional 5% level, with a p-value of 0.09. The fact that it is negative suggests that the difference in birth weight by gestational age is smaller at later ages. At around 38 weeks a difference of one week is associated with a gain of 150 grams, but at 40 weeks the estimated difference is 144 grams.
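The two figures quoted for 38 and 40 weeks come from the slope of the fitted quadratic, which changes with gestational age because the squared term is centered at 38 weeks. A worked version using the rounded estimates from the output:

```latex
\[
\frac{\partial\,\widehat{\text{grams}}}{\partial\,\text{gestate}}
  = 149.66 + 2(-1.390)(\text{gestate} - 38)
\]
\[
\text{at 38 weeks: } 149.66, \qquad
\text{at 40 weeks: } 149.66 - 2(1.390)(2) = 149.66 - 5.56 = 144.10 .
\]
```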
[2] Multiple Regressions
(a) Run a regression of birth weight on gestational age and years of schooling and note that both slopes are highly significant. Interpret the estimate of the coefficient of education in this model. Compare it briefly with the estimate from the simple linear regression of 1.b and explain what you make of the fact that it is smaller.
. reg grams gestate educ
Source | SS df MS Number of obs = 662
-------------+------------------------------ F( 2, 659) = 431.79
Model | 152923162 2 76461581 Prob > F = 0.0000
Residual | 116697301 659 177082.399 R-squared = 0.5672
-------------+------------------------------ Adj R-squared = 0.5659
Total | 269620463 661 407897.826 Root MSE = 420.81
------------------------------------------------------------------------------
grams | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gestate | 159.489 5.494577 29.03 0.000 148.7 170.278
educ | 29.81428 8.933653 3.34 0.001 12.27243 47.35613
_cons | -3406.564 233.4393 -14.59 0.000 -3864.939 -2948.19
------------------------------------------------------------------------------
Both slopes are indeed significant, with p-values of 0.001 for education and less than 0.001 for gestational age. If we compare births with the same gestational age we find that birth weight is higher for mothers with more years of education, on average 30 grams per year of schooling. The estimate from the simple linear regression was 41 grams, indicating that part of the differences we see by schooling can be attributed to differences in gestational age. In other words, mothers with less education tend to have shorter gestational ages (i.e. are more likely to deliver premature births).
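The gap between the two education estimates can be made explicit with the standard least-squares identity linking simple and adjusted slopes. Here δ denotes the slope from regressing gestational age on education; that regression is not reported in these solutions, so the value below is only what the printed coefficients imply, not an estimate taken from the output.

```latex
\[
b_{\text{educ}}^{\text{simple}}
  = b_{\text{educ}}^{\text{adj}} + b_{\text{gestate}}^{\text{adj}}\,\delta,
\qquad
40.93 = 29.81 + 159.49\,\delta
\;\Rightarrow\;
\delta = \frac{40.93 - 29.81}{159.49} \approx 0.070 .
\]
```

In words, each additional year of schooling goes with roughly 0.07 more weeks of gestation, and at about 159 grams per week this accounts for the 11 grams by which the simple estimate exceeds the adjusted one.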
(b) Construct an F-test for the net effect of education after adjusting for gestational age using the sums of squares you have already calculated. Verify that the test statistic coincides with the square of the t-test printed in the output. Save the RSS of the additive model for later use.
We need the RSS for the model with gestational age only, which we saved
earlier, and that for the model with gestational age and education, which
is the current e(rss):
. scalar rss_ge = e(rss)
. scalar F = ( (rss_g - rss_ge)/1 ) / ( rss_ge/659 )
. di F, sqrt(F)
11.137572 3.3373001
The F-statistic is 11.14 on 1 and 659 d.f., and the square root is
3.34, the reported t-statistic. (You can verify the F-test using Stata's
test educ after fitting the multiple regression model.)
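The verification suggested in the parenthesis might look like this. It is a sketch written as do-file lines, assuming the data are reloaded as in the introduction; the final display line is an illustrative addition.

```stata
* Reload the data and keep black mothers, as in the introduction
use http://data.princeton.edu/wws509/datasets/phbirths, clear
keep if black

* Fit the additive model, then test the net effect of education
regress grams gestate educ
test educ                        // Wald test of H0: coefficient of educ = 0

* test leaves the statistic in r(F); its square root is the t-statistic in absolute value
display r(F) "  " sqrt(r(F))
```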
(c) Predict birth weight using education and gestational age, and calculate the simple linear correlation between observed and predicted values. How is this measure related to the R-squared reported by Stata for the multiple regression?
. predict pgrams
(option xb assumed; fitted values)
. cor grams pgrams
(obs=662)
| grams pgrams
-------------+------------------
grams | 1.0000
pgrams | 0.7531 1.0000
. di r(rho)^2
.56717936
The simple linear correlation between observed and predicted values is 0.753. This is the same as the multiple correlation coefficient, and the square is, of course, R-squared. Our two predictors account for 57% of the variation in birth weight.
(d) Compute the proportion of the variation in birth weight
left unexplained by gestational age that can be attributed to education, and
verify that it is the square of the partial correlation coefficient (which
you may calculate using Stata's pcorr).
. scalar propex = (rss_g - rss_ge)/rss_g
. di sqrt(propex)
.12891792
. pcorr grams gestate educ
(obs=662)
Partial and semipartial correlations of grams with
Partial Semipartial Partial Semipartial Significance
Variable | Corr. Corr. Corr.^2 Corr.^2 Value
------------+-----------------------------------------------------------------
gestate | 0.7491 0.7439 0.5611 0.5534 0.0000
educ | 0.1289 0.0855 0.0166 0.0073 0.0009
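The arithmetic behind this check, using the residual sums of squares printed in 1.d and 2.a:

```latex
\[
\frac{\text{RSS}_g - \text{RSS}_{ge}}{\text{RSS}_g}
  = \frac{118669569 - 116697301}{118669569}
  = \frac{1972268}{118669569} \approx 0.0166,
\qquad
\sqrt{0.0166} \approx 0.129 .
\]
```

Education therefore accounts for about 1.7% of the variation in birth weight left unexplained by gestational age, which agrees with the squared partial correlation of 0.0166 and the partial correlation of 0.1289 that pcorr reports for educ.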
(e) Add a new variable equal to the product of gestational age and education. I recommend you center the variables around convenient values such as 38 weeks and 12 years before computing the product. You should find no evidence against the assumption of additivity. Setting aside issues of significance, how would you interpret the fact that the estimate is positive? How do you interpret the coefficient of education in this model? How would you interpret it if we hadn't centered the variables?
. gen gestXeduc = (gestate-38) * (educ-12)
. reg grams gestate educ gestXeduc
Source | SS df MS Number of obs = 662
-------------+------------------------------ F( 3, 658) = 289.85
Model | 153480518 3 51160172.8 Prob > F = 0.0000
Residual | 116139945 658 176504.475 R-squared = 0.5692
-------------+------------------------------ Adj R-squared = 0.5673
Total | 269620463 661 407897.826 Root MSE = 420.12
------------------------------------------------------------------------------
grams | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gestate | 159.541 5.485682 29.08 0.000 148.7694 170.3125
educ | 30.68609 8.932546 3.44 0.001 13.14636 48.22583
gestXeduc | 4.011002 2.25717 1.78 0.076 -.4211222 8.443126
_cons | -3419.999 233.1806 -14.67 0.000 -3877.867 -2962.131
------------------------------------------------------------------------------
The interaction term is indeed not significant, with a p-value of 0.076. The fact that it is positive indicates that the estimated differences by education are larger at higher gestational ages. The coefficient of education shows that at 38 weeks a year of schooling is associated with a gain of 31 grams. At 40 weeks the estimated difference is 39 grams. If we had not centered the variables the coefficient of education would represent an estimated difference at zero gestational age, which doesn't make much sense.
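Because the interaction is centered at 38 weeks, the education difference at any other gestational age is the education coefficient plus the interaction coefficient times the distance from 38 weeks. A sketch of that calculation using Stata's lincom, written as do-file lines and assuming the data are reloaded as in the introduction (the lincom calls are an illustrative addition, not part of the original solution):

```stata
* Reload the data and keep black mothers, as in the introduction
use http://data.princeton.edu/wws509/datasets/phbirths, clear
keep if black

* Refit the interaction model with centered variables
generate gestXeduc = (gestate-38) * (educ-12)
regress grams gestate educ gestXeduc

* Education difference at 38 weeks: the educ coefficient itself
lincom educ

* Education difference at 40 weeks: educ + (40 - 38) * interaction
lincom educ + 2*gestXeduc
```

With the estimates printed above, the second combination is 30.69 + 2(4.011) = 38.71, the 39 grams quoted in the text; lincom also supplies a standard error and confidence interval.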
[3] Added-Variable Plots
(a) Regress birth weight on gestational age and save the raw residuals in a variable called gramsNetGest. Regress education on gestational age and save the raw residuals in a variable called educNetGest. (Stata is happiest with variable names of up to 12 letters. Feel free to use longer or shorter names than I have.)
. quietly reg grams gestate
. predict gramsNetGest, r
. label var gramsNetGest "Birth weight net of gestational age"
. quietly reg educ gestate
. predict educNetGest, r
. label var educNetGest "Years of schooling net of gestational age"
(b) Plot birth weight net of gestational age against education net of gestational age. Do we have any indication that this relationship may not be linear? Use a lowess smoother to aid the eye.
Stata's lowess command does a locally weighted regression;
for each value of x it generates a fitted value using a line computed
from neighboring observations with weights that decline as you move away from x.
The size of the neighborhood is called the bandwidth, and is chosen as a tradeoff
between smoothness and goodness of fit.
. lowess gramsNetGest educNetGest
. graph export ps1fig2.png, width(400) replace
(file ps1fig2.png written in PNG format)

The relationship looks reasonably linear. There is a hint that it may be
less pronounced at lower years of schooling, but this is largely due to the
woman with zero years of schooling. In fact we should probably exclude her;
try the same plot using if educ > 0.
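The suggested re-plot can be sketched as follows, written as do-file lines and assuming the data are reloaded as in the introduction. The residuals are computed from all 662 births exactly as in 3.a, and only the plot is restricted.

```stata
* Reload the data and keep black mothers, as in the introduction
use http://data.princeton.edu/wws509/datasets/phbirths, clear
keep if black

* Residuals of birth weight and of education, each net of gestational age
quietly regress grams gestate
predict gramsNetGest, residuals
quietly regress educ gestate
predict educNetGest, residuals

* Same lowess plot, leaving out the mother with zero years of schooling
lowess gramsNetGest educNetGest if educ > 0
```

A different neighborhood size can be tried with lowess's bwidth() option to see how sensitive the smooth is to the bandwidth.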
(c) Compute the correlation between the constructed variables in 3.a, namely birth weight and education both net of gestational age, and verify that it is the same as the partial correlation of 2.d.
. cor gramsNetGest educNetGest
(obs=662)
| gramsN~t educNe~t
-------------+------------------
gramsNetGest | 1.0000
educNetGest | 0.1289 1.0000
The correlation between the adjusted variables is 0.1289, which is exactly the partial correlation of 2.d.
(d) Regress birth weight net of gestational age on education net of gestational age. The estimated constant should be essentially zero. Compare the estimated slope with the regression coefficient of education in 2.a.
. reg gramsNetGest educNetGest
Source | SS df MS Number of obs = 662
-------------+------------------------------ F( 1, 660) = 11.15
Model | 1972267.99 1 1972267.99 Prob > F = 0.0009
Residual | 116697301 660 176814.093 R-squared = 0.0166
-------------+------------------------------ Adj R-squared = 0.0151
Total | 118669569 661 179530.362 Root MSE = 420.49
------------------------------------------------------------------------------
gramsNetGest | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
educNetGest | 29.81428 8.926882 3.34 0.001 12.28577 47.34279
_cons | -8.24e-08 16.34291 -0.00 1.000 -32.09037 32.09037
------------------------------------------------------------------------------
The constant is indeed essentially zero (the printed value of -8.24e-08 is rounding error) and the estimated slope is the same as the coefficient of education in the multiple regression equation of 2.a, reinforcing the interpretation of multiple regression coefficients as adjusted coefficients. (The bit that may not be obvious about this is that you need to adjust both the outcome and the other predictor.)
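In symbols, writing e_y and e_x for the residuals of birth weight and education after regressing each on gestational age (notation introduced here for illustration), the slope of 3.d is the ratio below. The standard errors in 2.a and 3.d differ slightly even though the residual sum of squares is the same, 116697301, because the residual regression divides it by 660 degrees of freedom rather than 659.

```latex
\[
\hat\beta_{\text{educ}}
  = \frac{\sum_i e_{y,i}\, e_{x,i}}{\sum_i e_{x,i}^2} = 29.814,
\qquad
\frac{\text{SE}_{\text{2.a}}}{\text{SE}_{\text{3.d}}}
  = \sqrt{\frac{660}{659}} \approx 1.00076,
\qquad
8.926882 \times 1.00076 \approx 8.9337 .
\]
```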
(e) Construct an added-variable plot of birth weight on gestational age, both net of education, to check the linearity of that relationship in the multiple regression model.
We follow the same steps as before:
. quietly reg grams educ
. predict gramsNetEduc, r
. label var gramsNetEduc "Birth weight net of education"
. quietly reg gestate educ
. predict gestNetEduc, r
. label var gestNetEduc "Gestational age net of education"
. lowess gramsNetEduc gestNetEduc
. graph export ps1fig3.png, width(400) replace
(file ps1fig3.png written in PNG format)

The relationship looks reasonably linear. There's also an indication that birth weight varies less at very low gestational ages.
Note: Stata's avplot command can do these calculations for you.
We do it 'by hand' here because it is instructive and it lets us use a lowess smoother,
not to mention the opportunity to verify the connection with partial correlations.
The idea extends easily to models with more than two predictors: you just plot y net of
all X's but one against that predictor net of all other X's.
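For comparison, the built-in route mentioned in the note might look like this. It is a sketch written as do-file lines, assuming the data are reloaded as in the introduction.

```stata
* Reload the data and keep black mothers, as in the introduction
use http://data.princeton.edu/wws509/datasets/phbirths, clear
keep if black

* avplot is a postestimation command, so fit the multiple regression first
regress grams gestate educ

* Added-variable plot for gestational age, net of education
avplot gestate

* All added-variable plots at once, one panel per predictor
avplots
```

These plots show the fitted regression line rather than a lowess smooth, which is one reason for building the residuals by hand as above.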
