
Solutions to Problem Set 1
Birth Weight in Philadelphia

The datasets section of the course website has data on birth weight for a 10% sample of all births in Philadelphia in 1990. In this problem set we will explore the relationship between birth weight and years of schooling adjusting for gestational age. For simplicity we focus on black mothers, who account for almost 60% of all births in that year. To access the data from Stata type

use http://data.princeton.edu/wws509/datasets/phbirths

.  use http://data.princeton.edu/wws509/datasets/phbirths
(Births in Philadelphia in 1990)

.  desc

Contains data from http://data.princeton.edu/wws509/datasets/phbirths.dta
  obs:         1,115                          Births in Philadelphia in 1990
 vars:             5                          16 Sep 2009 22:21
 size:        22,300                          (_dta has notes)
----------------------------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
----------------------------------------------------------------------------------------------------
black           float  %9.0g       yesno      Mother is black
educ            float  %9.0g                  Mother's years of education
smoke           float  %9.0g       yesno      Whether mother smoked during pregnancy
gestate         float  %9.0g                  Gestational age in weeks
grams           float  %9.0g                  Birth weight in grams
----------------------------------------------------------------------------------------------------
Sorted by:  

.  keep if black
(453 observations deleted)

[1] Simple Regressions

(a) Draw a scatterplot matrix for birth weight in grams, gestational age, and mother's education, and comment briefly on the relationship between each pair of variables.

. graph matrix grams gestate educ

. graph export ps1fig1.png, width(500) replace
(file ps1fig1.png written in PNG format)

Birth weight increases with gestational age as one would expect. The association between birth weight and years of schooling is weaker and harder to ascertain from this graph, as is the association between gestational age and education.

(b) Run a simple linear regression of birth weight, measured in grams, on education, represented by years of schooling. Interpret the slope and test its significance using a t-test. Verify that the F-test reported by Stata for the model is simply the square of the t-statistic. Is the constant meaningful in this model?

. reg grams educ

      Source |       SS       df       MS              Number of obs =     662
-------------+------------------------------           F(  1,   660) =    9.24
       Model |   3723325.9     1   3723325.9           Prob > F      =  0.0025
    Residual |   265897137   660  402874.451           R-squared     =  0.0138
-------------+------------------------------           Adj R-squared =  0.0123
       Total |   269620463   661  407897.826           Root MSE      =  634.72

------------------------------------------------------------------------------
       grams |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |    40.9268   13.46254     3.04   0.002     14.49222    67.36137
       _cons |   2593.515   163.6041    15.85   0.000     2272.268    2914.763
------------------------------------------------------------------------------

. di (_b[educ]/_se[educ])^2
9.2419013

The coefficient of 40.9 indicates that births to more educated mothers tend to weigh more, on average about 41 grams per year of schooling. The t-test of 3.04 is highly significant, with a p-value of 0.002. The square is 9.24, which is exactly the reported F-test. The constant represents the expected birth weight for black mothers with no education, 2.6 kg, and technically it is meaningful, although the graph shows that we would be extrapolating outside the bulk of the data. (The central 90% has between 9 and 16 years of schooling, and there is only one mother with no schooling; in fact only one with less than 6 years, as you can learn from Stata's sum educ, d.)
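Both claims in this answer can be checked directly. The lines below are a minimal sketch, assuming the data are still in memory and restricted to black mothers; they rely only on results that regress stores and on standard commands, and no output is reproduced here.

```stata
* Refit the simple regression of birth weight on education
quietly regress grams educ
* Squared t-statistic next to the model F stored by regress
display (_b[educ]/_se[educ])^2, e(F)
* Percentiles and extreme values of schooling, to judge how far educ = 0 is from the data
summarize educ, detail
* Number of mothers with fewer than 6 years of schooling
count if educ < 6
```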

(c) What proportion of the variation in birth weight is explained by education? How is this proportion related to Pearson's correlation coefficient?

. di e(mss)/(e(mss)+e(rss))
.01380951

. cor grams educ
(obs=662)

             |    grams     educ
-------------+------------------
       grams |   1.0000
        educ |   0.1175   1.0000

. di r(rho)^2
.01380951

Stata's R-sq, which we calculate using the stored model and residual sum of squares, shows that only 1.38% of the variation in birth weight can be attributed to education. The simple linear correlation of 0.1175 is the square root of the proportion of variance explained.
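A slightly shorter route, sketched here on the assumption that the same data are in memory: regress stores R-squared in e(r2), and in a simple regression the correlation is its square root taken with the sign of the slope. The last line should reproduce the 0.1175 shown above.

```stata
quietly regress grams educ
* R-squared as stored by regress
display e(r2)
* Correlation recovered from R-squared, signed like the slope
display sign(_b[educ])*sqrt(e(r2))
```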

(d) Run a regression of birth weight on gestational age, which is measured in weeks, and interpret the slope. Stata stores the RSS in e(rss); save it in a scalar for later use.

. reg grams gestate

      Source |       SS       df       MS              Number of obs =     662
-------------+------------------------------           F(  1,   660) =  839.54
       Model |   150950894     1   150950894           Prob > F      =  0.0000
    Residual |   118669569   660  179802.378           R-squared     =  0.5599
-------------+------------------------------           Adj R-squared =  0.5592
       Total |   269620463   661  407897.826           Root MSE      =  424.03

------------------------------------------------------------------------------
       grams |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestate |   160.2748   5.531529    28.97   0.000     149.4133    171.1363
       _cons |  -3078.608   213.3673   -14.43   0.000    -3497.569   -2659.647
------------------------------------------------------------------------------

. scalar rss_g = e(rss)

The coefficient shows that birth weight increases with gestational age, so a difference of one week in gestational age is associated with a gain of 160 grams.
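The constant of -3,079 grams refers to a gestational age of zero and has no useful interpretation on its own. As an illustration that goes beyond the original solution, and assuming the regression above is the active set of estimates, lincom can evaluate the fitted line at a realistic value such as 38 weeks. From the printed coefficients this works out to about 3,012 grams.

```stata
quietly regress grams gestate
* Expected birth weight at 38 weeks, with standard error and confidence interval
lincom _cons + 38*gestate
* Expected difference between births two weeks apart
lincom 2*gestate
```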

(e) Add a quadratic term on gestational age to test the linearity of this relationship. In general it is a good idea to center variables on the mean or a nearby value before squaring. This reduces collinearity and simplifies interpretation. You should conclude that we don't really need a quadratic term, but interpret the coefficient anyway.

. gen gestcsq = (gestate-38)^2

. reg grams gestate gestcsq

      Source |       SS       df       MS              Number of obs =     662
-------------+------------------------------           F(  2,   659) =  422.41
       Model |   151468871     2  75734435.3           Prob > F      =  0.0000
    Residual |   118151593   659  179289.215           R-squared     =  0.5618
-------------+------------------------------           Adj R-squared =  0.5605
       Total |   269620463   661  407897.826           Root MSE      =  423.43

------------------------------------------------------------------------------
       grams |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestate |   149.6621   8.336406    17.95   0.000     133.2929    166.0312
     gestcsq |   -1.38969   .8175982    -1.70   0.090    -2.995101    .2157217
       _cons |  -2657.839   326.6151    -8.14   0.000    -3299.171   -2016.508
------------------------------------------------------------------------------

The coefficient of -1.4 fails to meet the conventional 5% significance level, with a p-value of 0.09. The fact that it is negative suggests that the difference in birth weight by gestational age is smaller at later ages. At around 38 weeks a difference of one week is associated with a gain of 150 grams, but at 40 weeks the estimated difference is 144 grams.
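The 150 and 144 grams quoted here are slopes of the fitted curve. With the quadratic term centered at 38, the slope at week w is the coefficient of gestate plus 2(w - 38) times the coefficient of gestcsq. A minimal sketch using lincom, assuming gestcsq was generated as above:

```stata
quietly regress grams gestate gestcsq
* Slope at 38 weeks: the quadratic term contributes nothing
lincom gestate
* Slope at 40 weeks: 2*(40 - 38) = 4 times the quadratic coefficient
lincom gestate + 4*gestcsq
```

From the printed coefficients the second combination is 149.66 + 4(-1.39) = 144.1; lincom adds a standard error and confidence interval.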

[2] Multiple Regressions

(a) Run a regression of birth weight on gestational age and years of schooling and note that both slopes are highly significant. Interpret the estimate of the coefficient of education in this model. Compare it briefly with the estimate from the simple linear regression of 1.b and explain what you make of the fact that it is smaller.

. reg grams gestate educ

      Source |       SS       df       MS              Number of obs =     662
-------------+------------------------------           F(  2,   659) =  431.79
       Model |   152923162     2    76461581           Prob > F      =  0.0000
    Residual |   116697301   659  177082.399           R-squared     =  0.5672
-------------+------------------------------           Adj R-squared =  0.5659
       Total |   269620463   661  407897.826           Root MSE      =  420.81

------------------------------------------------------------------------------
       grams |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestate |    159.489   5.494577    29.03   0.000        148.7     170.278
        educ |   29.81428   8.933653     3.34   0.001     12.27243    47.35613
       _cons |  -3406.564   233.4393   -14.59   0.000    -3864.939    -2948.19
------------------------------------------------------------------------------

Both slopes are indeed significant, with p-values of 0.001 for education and less than 0.001 for gestational age. If we compare births with the same gestational age we find that birth weight is higher for mothers with more years of education, on average 30 grams per year of schooling. The estimate from the simple linear regression was 41 grams, indicating that part of the differences we see by schooling can be attributed to differences in gestational age. In other words, mothers with less education tend to have shorter gestational ages (i.e. are more likely to deliver prematurely).
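The link between the two education coefficients can be made explicit. In least squares the simple-regression slope equals the adjusted slope plus the coefficient of gestational age times the slope from regressing gestational age on education. The sketch below checks this identity; the scalar name slope_ge is made up for the illustration.

```stata
* Slope of gestational age on education: weeks of gestation per year of schooling
quietly regress gestate educ
scalar slope_ge = _b[educ]
* Adjusted model with both predictors
quietly regress grams gestate educ
* Adjusted slope plus the part that operates through gestational age
display _b[educ] + _b[gestate]*slope_ge
```

The displayed value should match the 40.93 of 1.b. Working backwards from the printed coefficients, (40.93 - 29.81)/159.49 implies that gestational age increases by roughly 0.07 weeks per year of schooling.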

(b) Construct an F-test for the net effect of education after adjusting for gestational age using the sums of squares you have already calculated. Verify that the test statistic coincides with the square of the t-test printed in the output. Save the RSS of the additive model for later use.

We need the RSS for the model with gestational age only, which we saved earlier, and that for the model with gestational age and education, which is the current e(rss):

. scalar rss_ge = e(rss)

. scalar F = ( (rss_g - rss_ge)/1 ) / ( rss_ge/659 )

. di F, sqrt(F)
11.137572 3.3373001

The F-statistic is 11.14 on 1 and 659 d.f., and the square root is 3.34, the reported t-statistic. (You can verify the F-test using Stata's test educ after fitting the multiple regression model.)
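The verification with test can be sketched as follows, assuming the additive model is refit first. The test command leaves the statistic and its p-value in r(F) and r(p).

```stata
quietly regress grams gestate educ
* Wald test of the education coefficient given gestational age
test educ
* Statistic and p-value saved by test, for comparison with the hand calculation
display r(F), r(p)
```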

(c) Predict birth weight using education and gestational age, and calculate the simple linear correlation between observed and predicted values. How is this measure related to the R-squared reported by Stata for the multiple regression?

. predict pgrams
(option xb assumed; fitted values)

. cor grams pgrams
(obs=662)

             |    grams   pgrams
-------------+------------------
       grams |   1.0000
      pgrams |   0.7531   1.0000

. di r(rho)^2
.56717936

The simple linear correlation between observed and predicted values is 0.753. This is the same as the multiple correlation coefficient, and the square is, of course, R-squared. Our two predictors account for 57% of the variation in birth weight.

(d) Compute the proportion of the variation in birth weight left unexplained by gestational age that can be attributed to education, and verify that it is the square of the partial correlation coefficient (which you may calculate using Stata's pcorr).

. scalar propex = (rss_g - rss_ge)/rss_g

. di sqrt(propex)
.12891792

. pcorr grams gestate educ
(obs=662)

Partial and semipartial correlations of grams with

               Partial   Semipartial      Partial   Semipartial   Significance
   Variable |    Corr.         Corr.      Corr.^2       Corr.^2          Value
------------+-----------------------------------------------------------------
    gestate |   0.7491        0.7439       0.5611        0.5534         0.0000
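       educ |   0.1289        0.0855       0.0166        0.0073         0.0009

Education accounts for 1.66% of the variation in birth weight left unexplained by gestational age. The square root of this proportion, 0.1289, is exactly the partial correlation reported by pcorr.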
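The partial correlation can also be recovered from the t-statistic of education in the multiple regression, since its square equals t squared divided by t squared plus the residual degrees of freedom. A minimal sketch, with t_educ as a hypothetical scalar name:

```stata
quietly regress grams gestate educ
* t-statistic for education in the additive model
scalar t_educ = _b[educ]/_se[educ]
* Partial correlation of birth weight and education given gestational age
display t_educ/sqrt(t_educ^2 + e(df_r))
```

With t = 3.337 and 659 residual d.f. this gives 0.1289, in agreement with the table.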

(e) Add a new variable equal to the product of gestational age and education. I recommend you center the variables around convenient values such as 38 weeks and 12 years before computing the product. You should find no evidence against the assumption of additivity. Setting aside issues of significance, how would you interpret the fact that the estimate is positive? How do you interpret the coefficient of education in this model? How would you interpret it if we hadn't centered the variables?

. gen gestXeduc = (gestate-38) * (educ-12)

. reg grams gestate educ gestXeduc

      Source |       SS       df       MS              Number of obs =     662
-------------+------------------------------           F(  3,   658) =  289.85
       Model |   153480518     3  51160172.8           Prob > F      =  0.0000
    Residual |   116139945   658  176504.475           R-squared     =  0.5692
-------------+------------------------------           Adj R-squared =  0.5673
       Total |   269620463   661  407897.826           Root MSE      =  420.12

------------------------------------------------------------------------------
       grams |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestate |    159.541   5.485682    29.08   0.000     148.7694    170.3125
        educ |   30.68609   8.932546     3.44   0.001     13.14636    48.22583
   gestXeduc |   4.011002    2.25717     1.78   0.076    -.4211222    8.443126
       _cons |  -3419.999   233.1806   -14.67   0.000    -3877.867   -2962.131
------------------------------------------------------------------------------

The interaction term is indeed not significant, with a p-value of 0.076. The fact that it is positive indicates that the estimated differences by education are larger at higher gestational ages. The coefficient of education shows that at 38 weeks a year of schooling is associated with a gain of 31 grams. At 40 weeks the estimated difference is 39 grams. If we had not centered the variables the coefficient of education would represent an estimated difference at zero gestational age, which doesn't make much sense.
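The education effect at a given gestational age is the coefficient of educ plus (weeks - 38) times the coefficient of the product term. The figures above can be sketched with lincom, assuming gestXeduc was built with the centering used here:

```stata
quietly regress grams gestate educ gestXeduc
* Difference per year of schooling at 38 weeks, where the product term vanishes
lincom educ
* Difference per year of schooling at 40 weeks: (40 - 38) = 2 times the interaction
lincom educ + 2*gestXeduc
```

From the printed coefficients the second combination is 30.69 + 2(4.01) = 38.7 grams.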

[3] Added-Variable Plots

(a) Regress birth weight on gestational age and save the raw residuals in a variable called gramsNetGest. Regress education on gestational age and save the raw residuals in a variable called educNetGest. (Stata is happiest with variable names of up to 12 letters. Feel free to use longer or shorter names than I have.)

. quietly reg grams gestate

. predict gramsNetGest, r

. label var gramsNetGest "Birth weight net of gestational age"

. quietly reg educ gestate

. predict educNetGest, r

. label var educNetGest "Years of schooling net of gestational age"

(b) Plot birth weight net of gestational age against education net of gestational age. Do we have any indication that this relationship may not be linear? Use a lowess smoother to aid the eye.

Stata's lowess command does a locally weighted regression; for each value of x it generates a fitted value using a line computed from neighboring observations with weights that decline as you move away from x. The size of the neighborhood is called the bandwidth, and is chosen as a tradeoff between smoothness and goodness of fit.

. lowess gramsNetGest educNetGest

. graph export ps1fig2.png, width(400) replace
(file ps1fig2.png written in PNG format)

The relationship looks reasonably linear. There is a hint that it may be less pronounced at lower years of schooling, but this is largely due to the woman with zero years of schooling. In fact we should probably exclude her; try the same plot using if educ > 0.
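The suggested variation might look like the sketch below. The second line also narrows the bandwidth from Stata's default of 0.8 to 0.5, a value picked only for illustration, to see how sensitive the smooth is to the size of the neighborhood.

```stata
* Same plot without the mother reporting zero years of schooling
lowess gramsNetGest educNetGest if educ > 0
* Narrower neighborhood: less smooth, but closer to the local pattern
lowess gramsNetGest educNetGest if educ > 0, bwidth(0.5)
```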

(c) Compute the correlation between the constructed variables in 3.a, namely birth weight and education both net of gestational age, and verify that it is the same as the partial correlation of 2.d.

. cor gramsNetGest educNetGest
(obs=662)

             | gramsN~t educNe~t
-------------+------------------
gramsNetGest |   1.0000
 educNetGest |   0.1289   1.0000

The correlation between the adjusted variables is 0.1289, which is exactly the partial correlation of 2.d.

(d) Regress birth weight net of gestational age on education net of gestational age. The estimated constant should be essentially zero. Compare the estimated slope with the regression coefficient of education in 2.a.

. reg gramsNetGest educNetGest

      Source |       SS       df       MS              Number of obs =     662
-------------+------------------------------           F(  1,   660) =   11.15
       Model |  1972267.99     1  1972267.99           Prob > F      =  0.0009
    Residual |   116697301   660  176814.093           R-squared     =  0.0166
-------------+------------------------------           Adj R-squared =  0.0151
       Total |   118669569   661  179530.362           Root MSE      =  420.49

------------------------------------------------------------------------------
gramsNetGest |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 educNetGest |   29.81428   8.926882     3.34   0.001     12.28577    47.34279
       _cons |  -8.24e-08   16.34291    -0.00   1.000    -32.09037    32.09037
------------------------------------------------------------------------------

The constant is indeed zero and the estimated slope is the same as the coefficient of education in the multiple regression equation of 2.a, reinforcing their interpretation as adjusted coefficients. (The bit that may not be obvious about this is that you need to adjust both the outcome and the other predictor.)
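One detail does differ from 2.a: the standard error here is 8.927 rather than 8.934. The residual sum of squares is identical, but this regression divides it by 660 degrees of freedom because it does not know that a parameter was already spent on gestational age, whereas the multiple regression uses 659. A minimal sketch of the correction, assuming the residual variables from 3.a exist:

```stata
quietly regress gramsNetGest educNetGest
* Rescale the standard error from 660 to 659 residual degrees of freedom
display _se[educNetGest]*sqrt(e(df_r)/(e(df_r) - 1))
```

The result, 8.926882 times the square root of 660/659, is 8.9337, the standard error of education in 2.a.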

(e) Construct an added-variable plot of birth weight on gestational age, both net of education, to check the linearity of that relationship in the multiple regression model.

We follow the same steps as before:

. quietly reg grams educ

. predict gramsNetEduc, r

. label var gramsNetEduc "Birth weight net of education"

. quietly reg gestate educ

. predict gestNetEduc, r

. label var gestNetEduc "Gestational age net of education"

. lowess gramsNetEduc gestNetEduc

. graph export ps1fig3.png, width(400) replace
(file ps1fig3.png written in PNG format)

The relationship looks reasonably linear. There's also an indication that birth weight varies less at very low gestational ages.

Note: Stata's avplot command can do these calculations for you. We do it 'by hand' here because it is instructive and it lets us use a lowess smoother, not to mention the opportunity to verify the connection with partial correlations. The idea extends easily to models with more than two predictors: you just plot y net of all X's but one against that predictor net of all other X's.
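For comparison, the built-in route might look like the sketch below. It assumes the additive model of 2.a is refit first, because avplot works from the most recent regression.

```stata
quietly regress grams gestate educ
* Added-variable plot for education, adjusting for gestational age
avplot educ
* Added-variable plot for gestational age, adjusting for education
avplot gestate
* Both plots in a single graph
avplots
```

These plots draw the fitted least-squares line rather than a lowess smooth, which is one reason to keep the residual variables constructed by hand.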