Germán Rodríguez
Generalized Linear Models Princeton University

## Solutions to Problem Set 1: Mental Health, Stress and SES

Agresti and Finlay(1997) report data from a Florida study investigating the relationship between mental health and several explanatory variables using a random sample of 40 subjects. The outcome of interest is an index of mental impairment that incorporates measures of anxiety and depression. We will consider two predictors: a life-events score that combines the number and severity of various stressful life events, and an index of socio-economic status (SES).

The data are available on this website as `http://data.princeton.edu/wws509/datasets/afMentalHealth.dta`, a Stata file that can be read natively into Stata or from R using the `foreign` library. As usual we show solutions in Stata and R.

###  Simple Regressions

(a) Draw a scatterplot matrix with the three variables of interest, and comment briefly on the relationship between each pair of variables.

```. use http://data.princeton.edu/wws509/datasets/afMentalHealth.dta
(Agresti Finlay data on Mental Health)

. graph matrix mentalImp lifeEv ses

. graph export ps1fig1.png, width(600) replace
(file ps1fig1.png written in PNG format)
```
```> library(foreign)

> af <- read.dta("http://data.princeton.edu/wws509/datasets/afMentalHealth.dta")

> png("ps1fig1r.png", width=600, height=600, units = "px")

> pairs(af)

> dev.off()
windows
2
```  Mental impairment shows a positive association with stressful life events and a negative association with SES, with no obvious evidence of non-linearities. Life events and SES show a rather weak positive relationship.

(b) Run a simple linear regression of the mental impairment index on the life events index, interpret the slope, and test its significance using a t-test.

```. reg mentalImp lifeEv

Source |       SS           df       MS      Number of obs   =        40
-------------+----------------------------------   F(1, 38)        =      6.11
Model |  161.048429         1  161.048429   Prob > F        =    0.0180
Residual |  1001.35157        38  26.3513571   R-squared       =    0.1385
-------------+----------------------------------   Adj R-squared   =    0.1159
Total |      1162.4        39  29.8051282   Root MSE        =    5.1334

------------------------------------------------------------------------------
mentalImpair |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lifeEvents |   .0898257   .0363349     2.47   0.018     .0162696    .1633818
_cons |   23.30949   1.806751    12.90   0.000     19.65192    26.96707
------------------------------------------------------------------------------

. di e(F), (_b[lifeEv]/_se[lifeEv])^2
6.1115801 6.1115801

. sum lifeEv

Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
lifeEvents |         40      44.425    22.62276          3         97
```
```> m1 <- lm(mentalImpair ~ lifeEvents, data = af)

> summary(m1)

Call:
lm(formula = mentalImpair ~ lifeEvents, data = af)

Residuals:
Min       1Q   Median       3Q      Max
-10.4415  -3.6899  -0.5973   3.6215  13.2890

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.30949    1.80675  12.901 1.85e-15 ***
lifeEvents   0.08983    0.03633   2.472    0.018 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.133 on 38 degrees of freedom
Multiple R-squared:  0.1385,	Adjusted R-squared:  0.1159
F-statistic: 6.112 on 1 and 38 DF,  p-value: 0.01802

> 2.472^2
 6.110784
```

A one point difference in the life events index is associated with an average difference of 0.09 points in the mental impairment index. The t-test printed in the output is 2.47 on 38 d.f. and is significant at the usual five percent level. Squaring this value gives 6.11, the same as the F test. The constant represents the average mental impairment score among people who score zero in the life events index, which combines number and severity of events. Assuming that people who score zero have experienced no stressful life events the constant would be meaningful, even if just outside the range of the data, which goes from 3 to 97.

(c) What proportion of the variation across subjects in the index of mental health is explained by the life events index? How is this proportion related to Pearson's correlation coefficient?

```. di e(mss)/(e(mss) + e(rss))
.1385482

. quietly cor mentalImp lifeEv

. di r(rho)^2
.1385482
```
```> m0 <- lm(mentalImpair ~ 1, data = af)

> rss <- function(model) sum(residuals(model)^2)

> c(p, sqrt(p))
 0.8614518 0.9281443
```

The proportion of variance explained, calculated above from the sums of squares, is 0.1385 or 14%, and is of course the same as R-squared, which for simple regression is the squared of Pearson's r.

(d) Check the linearity of this relationship by adding a quadratic term on the index of life events. In general it is a good idea to center variables on their mean before squaring; this reduces collinearity and simplifies interpretation. Either way you should find that we don't really need a quadratic term.

```. quietly sum lifeEv

. gen lifeCsq = (lifeEv - r(mean))^2

. reg mentalImp lifeEv lifeCsq

Source |       SS           df       MS      Number of obs   =        40
-------------+----------------------------------   F(2, 37)        =      3.04
Model |  164.054361         2  82.0271805   Prob > F        =    0.0599
Residual |  998.345639        37  26.9823146   R-squared       =    0.1411
-------------+----------------------------------   Adj R-squared   =    0.0947
Total |      1162.4        39  29.8051282   Root MSE        =    5.1945

------------------------------------------------------------------------------
mentalImpair |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lifeEvents |   .0943156   .0391509     2.41   0.021     .0149884    .1736428
lifeCsq |  -.0004038   .0012098    -0.33   0.740     -.002855    .0020474
_cons |   23.31152   1.828264    12.75   0.000      19.6071    27.01593
------------------------------------------------------------------------------
```
```> library(dplyr)

> af <- mutate(af, lifeCsq = (lifeEvents - mean(lifeEvents))^2)

> m2 <- lm(mentalImpair ~ lifeEvents + lifeCsq, data = af)

> summary(m2)

Call:
lm(formula = mentalImpair ~ lifeEvents + lifeCsq, data = af)

Residuals:
Min       1Q   Median       3Q      Max
-10.6490  -3.2865  -0.3971   3.7625  13.0755

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.3115159  1.8282640  12.751 4.14e-15 ***
lifeEvents   0.0943156  0.0391509   2.409   0.0211 *
lifeCsq     -0.0004038  0.0012098  -0.334   0.7404
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.194 on 37 degrees of freedom
Multiple R-squared:  0.1411,	Adjusted R-squared:  0.09471
F-statistic:  3.04 on 2 and 37 DF,  p-value: 0.05993
```

The t-statistic of -0.33 on 37 d.f. for the quadratic term is clearly not significant, so we have no evidence against the linearity of the association. By centering the square term the coefficient of the linear term represents the slope at the mean.

(e) Regress the index of mental impairment on SES, to verify the hypothesis that whether or not money buys happiness, it is certainly associated with better mental health. Calculate Pearson's correlation as a summary of the association. Interpretation of the regression coefficient is hampered by the arbitrary nature of both scales. Rerun the regression standardizing both indices and interpret the resulting slope. Compare it with Pearson's r.

```. reg mentalImp ses

Source |       SS           df       MS      Number of obs   =        40
-------------+----------------------------------   F(1, 38)        =      7.18
Model |  184.654398         1  184.654398   Prob > F        =    0.0109
Residual |  977.745602        38  25.7301474   R-squared       =    0.1589
-------------+----------------------------------   Adj R-squared   =    0.1367
Total |      1162.4        39  29.8051282   Root MSE        =    5.0725

------------------------------------------------------------------------------
mentalImpair |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
ses |  -.0860779   .0321317    -2.68   0.011    -.1511251   -.0210308
_cons |   32.17201   1.987649    16.19   0.000     28.14823     36.1958
------------------------------------------------------------------------------

. scalar rssSes = e(rss) // for use in 2.b

. cor mentalImp ses
(obs=40)

| mental~r      ses
-------------+------------------
mentalImpair |   1.0000
ses |  -0.3986   1.0000

. egen mentalStd = std(mentalImp)

. egen sesStd = std(ses)

. reg mentalStd sesStd

Source |       SS           df       MS      Number of obs   =        40
-------------+----------------------------------   F(1, 38)        =      7.18
Model |  6.19539031         1  6.19539031   Prob > F        =    0.0109
Residual |  32.8046105        38  .863279224   R-squared       =    0.1589
-------------+----------------------------------   Adj R-squared   =    0.1367
Total |  39.0000008        39  1.00000002   Root MSE        =    .92913

------------------------------------------------------------------------------
mentalStd |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
sesStd |  -.3985676   .1487796    -2.68   0.011    -.6997562   -.0973791
_cons |   5.07e-09   .1469081     0.00   1.000    -.2973998    .2973998
------------------------------------------------------------------------------
```
```> m3 <- lm(mentalImpair ~ ses, data = af)

> cor(af\$mentalImpair, af\$ses)
 -0.3985676

> std <- function(x) (x - mean(x))/sqrt(var(x))

> m3s <- lm(std(mentalImpair) ~ std(ses), data = af)

> summary(m3s)

Call:
lm(formula = std(mentalImpair) ~ std(ses), data = af)

Residuals:
Min       1Q   Median       3Q      Max
-1.99304 -0.50321  0.05384  0.50155  2.79954

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.244e-16  1.469e-01   0.000   1.0000
std(ses)    -3.986e-01  1.488e-01  -2.679   0.0109 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9291 on 38 degrees of freedom
Multiple R-squared:  0.1589,	Adjusted R-squared:  0.1367
F-statistic: 7.177 on 1 and 38 DF,  p-value: 0.01085
```

The estimated regression coefficient shows that high SES tends to be associated with lower mental impairment, respondents score on average 0.086 impairment pointw lower for every point in the SES scale. Pearson's correlation coefficient is –0.40, indicating a moderate negative association.

If we standardize the variables subtracting the mean and dividing by the standard deviation, the constant becomes zero and the slope coincides with Pearson's correlation coefficient. Subjects who are one standard deviation above the mean SES are, on average, 0.4 standard deviations below the mean in mental impairment. (BTW Stata will do these calculations if you use the option `beta`.)

####  Multiple Regression

(a) Run a regression of the index of mental impairment on both SES and the index of live events and note that both slopes are highly significant. Interpret briefly the estimate of the coefficient of life events, and compare it with the estimate from the simple linear regression of 1.b.

```. reg mentalImp ses lifeEv

Source |       SS           df       MS      Number of obs   =        40
-------------+----------------------------------   F(2, 37)        =      9.49
Model |  394.238399         2  197.119199   Prob > F        =    0.0005
Residual |  768.161601        37  20.7611244   R-squared       =    0.3392
-------------+----------------------------------   Adj R-squared   =    0.3034
Total |      1162.4        39  29.8051282   Root MSE        =    4.5564

------------------------------------------------------------------------------
mentalImpair |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
ses |  -.0974755   .0290848    -3.35   0.002    -.1564069   -.0385441
lifeEvents |   .1032595   .0324995     3.18   0.003     .0374093    .1691096
_cons |   28.22981   2.174222    12.98   0.000     23.82442    32.63521
------------------------------------------------------------------------------
```
```> m4 <- lm(mentalImpair ~ lifeEvents + ses, data = af)

> summary(m4)

Call:
lm(formula = mentalImpair ~ lifeEvents + ses, data = af)

Residuals:
Min     1Q Median     3Q    Max
-8.678 -2.494 -0.336  2.886 10.891

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.22981    2.17422  12.984 2.38e-15 ***
lifeEvents   0.10326    0.03250   3.177  0.00300 **
ses         -0.09748    0.02908  -3.351  0.00186 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.556 on 37 degrees of freedom
Multiple R-squared:  0.3392,	Adjusted R-squared:  0.3034
F-statistic: 9.495 on 2 and 37 DF,  p-value: 0.0004697
```

We find highly significant net effects of both SES and life events, as evidenced by t-statistics of -3.35 and 3.18, respectively, both on 37 d.f. Comparing subjects with the same SES, those who score one point higher in the stressful life events index score on average one-tenth of a point higher in the mental impairment index. The estimate is very similar to the value in 1.b, 0.10 versus 0.09, with the very slight increase indicating that a small part of the initial difference could be attributed to the fact that people with higher life events scores tend to have lower SES.

(b) Construct an F-test for the net effect of life events after adjusting for SES using the sums of squares you have already calculated. Verify that it coincides with the square of the t-test in 2.a.

```. scalar rssAdditive = e(rss)

10.095022

. di (_b[lifeEv]/_se[lifeEv])^2
10.095022
```
```> F <- (rss(m3) - rss(m4))/(rss(m4)/m4\$df) # why we don't use F for FALSE :)

> F; sqrt(F)
 10.09502
 3.177266
```

(c) What proportion of the variation in mental health has been 'explained' by the two variables together? What's the square root of this value?

```. di e(r2)
.33915898

. di sqrt(e(r2))
.58237357
```
```> R2 <- 1 -  rss(m4)/rss(m0)

> R2; sqrt(R2)
 0.339159
 0.5823736
```

The proportion of variance explained is of course R-squared. The square root it the multiple correlation coefficient, in this case 0.5824.

(d) Compute fitted values for this model and calculate Pearson's correlation between observed and fitted values. Does this number look familiar?

```. predict fv
(option xb assumed; fitted values)

. corr mentalImp fv
(obs=40)

| mental~r       fv
-------------+------------------
mentalImpair |   1.0000
fv |   0.5824   1.0000
```
```> fv <- fitted(m4)

> cor(af\$mentalImpair, fv)
 0.5823736
```

The correlation between observed and fitted values is 0.5824, the same as the multiple correlation coefficient.

(e) What proportion of the variation left unexplained by SES can be attributed to life events? How is this number related to the partial correlation between mental impairment and life events given SES?

```. di (rssSes - rssAdditive)/rssSes
.21435433

. pcorr mentalImp lifeEv ses
(obs=40)

Partial and semipartial correlations of mentalImpair with

Partial   Semipartial      Partial   Semipartial   Significance
Variable |    Corr.         Corr.      Corr.^2       Corr.^2          Value
------------+-----------------------------------------------------------------
lifeEvents |   0.4630        0.4246       0.2144        0.1803         0.0030
ses |  -0.4826       -0.4479       0.2329        0.2006         0.0019

. mata st_matrix("r(p_corr)")[1,1]^2
.2143543273
```
```> library(ppcor) # for partial correlation

> pr2; sqrt(pr2)
 0.2143543
 0.4629842

> pcor(af[, -4])\$estimate # skip lifeCsq
mentalImpair lifeEvents        ses
mentalImpair    1.0000000  0.4629842 -0.4825715
lifeEvents      0.4629842  1.0000000  0.3191731
ses            -0.4825715  0.3191731  1.0000000
```

The partial correlation between mental impairment and life events adjusting for SES is 0.463, and its square is 0.214, the proportion of variance in mental impairment left unexplained by SES that can be attributed to the life events index.

####  Added Variable Plots

(a) Regress the index of mental impairment on SES and save the raw residuals in a variable called mentalNetSes. Regress the index of life events on SES and save the raw residuals in a variable valled lifeNetSes.

```. quietly reg mentalImp ses

. predict mentalNetSes, r

. quietly reg lifeEv  ses

. predict lifeNetSes, r
```
```> ms <- lm(mentalImpair ~ ses, data = af)

> af\$mentalNetSes = residuals(ms)

> ls <- lm(lifeEvents ~ ses, data = af)

> af\$lifeNetSes = residuals(ls)
```

(b) Plot mental impairment net of SES against life events net of SES. Do we have any indication that this relationship may not be linear?

```. scatter mentalNetSes lifeNetSes || lfit mentalNetSes lifeNetSes ///
>      , title(Mental Impairment by Life Events both Net of SES) legend(off) ///
>         xtitle(Life events net of SES) ytitle(Mental Imp net of SES)

. graph export ps1fig2.png, width(500) replace
(file ps1fig2.png written in PNG format)
```
```> library(ggplot2)

> ggplot(af, aes(mentalNetSes, lifeNetSes)) + geom_point() + geom_smooth(method="lm")

> ggsave("ps1fig2r.png", width = 500/72, height = 400/72, dpi = 72)
```  We see no evidence of non-linearities. (See note on visual aids at the end.)

(c) Compute the correlation between the constructed variables mental impairment net of SES and life events net of SES, and verify that it is the same as the partial correlation of 2.e.

```. cor mentalNetSes lifeNetSes
(obs=40)

| mental~s lifeNe~s
-------------+------------------
mentalNetSes |   1.0000
lifeNetSes |   0.4630   1.0000
```
```> cor(af\$mentalNetSes , af\$lifeNetSes)
 0.4629842
```

This is indeed the same 0.463 we obtained in 2.d; the life events index accounts for 21.4% of the variation in mental impairment after adjusting for linear effects of SES.

(d) Regress mental impairment net of SES on life events net of SES. The estimated constant should be be essentially zero. Compare the estimated slope with the regression coefficient of life events in 2.a.

```. reg mentalNetSes lifeNetSes

Source |       SS           df       MS      Number of obs   =        40
-------------+----------------------------------   F(1, 38)        =     10.37
Model |  209.583996         1  209.583996   Prob > F        =    0.0026
Residual |  768.161621        38  20.2147795   R-squared       =    0.2144
-------------+----------------------------------   Adj R-squared   =    0.1937
Total |  977.745617        39  25.0704004   Root MSE        =    4.4961

------------------------------------------------------------------------------
mentalNetSes |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lifeNetSes |   .1032595    .032069     3.22   0.003     .0383392    .1681797
_cons |   6.48e-09   .7108934     0.00   1.000    -1.439129    1.439129
------------------------------------------------------------------------------
```
```> lm(mentalNetSes ~ lifeNetSes, data = af)

Call:
lm(formula = mentalNetSes ~ lifeNetSes, data = af)

Coefficients:
(Intercept)   lifeNetSes
2.211e-16    1.033e-01
```

The constant is indeed zero and the slope is exactly the same as the coefficient of life events in the multiple regression equation. This result helps illustrate the usual interpretation of multiple regression coefficients as representing differences net of all other variables.

(e) Construct an added-variable plot of mental impairment net of life events versus SES net of life events, to check the linearity of that relationship after adjusting for the index of life events.

```. quietly regress mentalImp lifeEv

. predict mentalNetLife, r

. quietly regress ses lifeEv

. predict sesNetLife, r

. scatter mentalNetLife sesNetLife || lfit mentalNetLife sesNetLife ///
>         , title(Mental Impairment by SES both Net of Life Events) legend(off) ///
>         xtitle(SES net of life events) ytitle(Mental Imp net of life events)

. graph export ps1fig3.png, width(500) replace
(file ps1fig3.png written in PNG format)
```
```> af\$mentalNetLife = residuals(lm(mentalImpair ~ lifeEvents, data=af))

> af\$sesNetLife = residuals(lm(ses ~ lifeEvents, data=af))

> ggplot(af, aes(mentalNetLife, sesNetLife)) + geom_point() + geom_smooth(method="lm")

> ggsave("ps1fig3r.png", width = 500/72, height = 400/72, dpi = 72)
```  The relationship looks reasonably linear, with a slight hint that it might be a bit steeper at lower SES levels after adjusting for the frequency and severity of life events.

Note: Stata's `avplot` and R's `avPlot()` in the `car` package can do added variable plots, but here you have to do them “by hand”, using the software only to compute and plot estimates and fitted values. This is instructive and allows us to verify the connection with partial correlations. It also opens the door to using scatterplot smoothers to help judge the linearity of the relationship.

Stata's `lowess` command R's `lowess() function does a locally weighted regression; for each value of x it generates a fitted value using a line computed from neighboring observations with weights that decline as you move away from x. The size of the neighborhood is called the bandwidth, and is chosen to obtain a balance between smoothness and goodness of fit. Try the commands `lowess mentalNetSes lifeNetSes` and `lowess mentalNetLife sesNetLife` as supplements to the analyses in 3.b and 3.e. See also `loess()`. Or just use the default smoother in `ggplot`! Try this on 3.b and 3.e.