Home | GLMs | Multilevel | Survival | Demography | Stata | R
Home Lecture Notes Stata Logs Datasets Problem Sets

Solutions for Problem Set 2

The first task is to read the data. I cut and pasted the data from the web page into a text file called pubexpend.raw and then read it into Stata using the command

 
. infile ex ecab met grow young old west str2 state using pubexpend.raw
(48 observations read)

[1] Expenditure and ability to pay.

(a) Regress per capita expenditure on the economic ability index and interpret the resulting slope. Plot the data and the regression line. (No need to turn in this plot until you do part e.)

 
. reg ex ecab
 
      Source |       SS       df       MS              Number of obs =      48
-------------+------------------------------           F(  1,    46) =   34.72
       Model |  69887.8068     1  69887.8068           Prob > F      =  0.0000
    Residual |  92583.1724    46  2012.67766           R-squared     =  0.4302
-------------+------------------------------           Adj R-squared =  0.4178
       Total |  162470.979    47  3456.82934           Root MSE      =  44.863
 
------------------------------------------------------------------------------
          ex |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        ecab |   1.732872   .2940715     5.89   0.000     1.140937    2.324807
       _cons |   118.9832   29.18019     4.08   0.000     60.24656    177.7199
------------------------------------------------------------------------------
 
. quietly sum ecab, d
 
. di r(p25), r(p75)
85.299999 105.9
 
. di _b[ecab] * (r(p75)-r(p25))
35.697168

Public expenditure per capita is higher in states that score higher in the economic ability index, with an estimated difference of $1.73 per index point . It may be easier to appreciate the magnitude of the effect if we note that the estimated difference between the first and third quartiles of economic ability is $35.70 in per capita expenditure. The plot is shown below.

(b) Is Nevada an outlier? Compute the jack-knifed residuals and list any states with residuals exceeding two in absolute value. Surprised?

Nevada has an economic ability index much higher than all other states, with 205 points (the next highest is Wyoming at 126) but that doesn't necessarily make it an outlier, a concept that pertains to the outcome, not the predictor. Here are the jack-knifed residuals

 
. predict jack, rstu
 
. list state ex ecab jack if abs(jack) > 2 
 
     +--------------------------------+
     | state    ex    ecab       jack |
     |--------------------------------|
 30. |    ND   369    93.4   2.054942 |
 42. |    WY   454   125.8   2.893478 |
     +--------------------------------+
 
. list state ex ecab jack if state == "NV"
 
     +--------------------------------+
     | state    ex   ecab        jack |
     |--------------------------------|
 47. |    NV   421    205   -1.758696 |
     +--------------------------------+

We see two states with jack-knifed residuals in excess of two: Wyoming and North Dakota. The residual for Nevada is less than two in magnitude. You may be surprised because Nevada is far from the other states in economic ability but that doesn't make it an outlier. It does, however, give it a lot of leverage, as we are about to see.

(c) Does Nevada have leverage? Compute the leverages and list any values exceeding 2p/n where p is the number of predictors and n the number of observations.

 
. predict lev, lev
 
. scalar levrot = 2*2/_N // rule of thumb
 
. list state ex ecab lev if lev > levrot
 
     +-------------------------------+
     | state    ex   ecab        lev |
     |-------------------------------|
 26. |    MS   231   57.4    .087378 |
 47. |    NV   421    205   .5242805 |
     +-------------------------------+

We see that Nevada has a lot of leverage, with a value of 0.52. The next highest is Mississippi but its leverage is only 0.09, far behind. This makes Nevada potentially very influential. Did it have actual influence?

(d) Do you think Nevada influenced the results? Compute Cook's distance and list any values exceeding 4/(n-p). Fit the model without Nevada and add the regression line to the plot of part (a). Comment.

 
. predict cook, cook
 
. scalar cookrot = 4/(_N-2) // rule of thumb
 
. list state ex ecab cook if cook > cookrot
 
     +--------------------------------+
     | state    ex    ecab       cook |
     |--------------------------------|
 42. |    WY   454   125.8   .2184154 |
 47. |    NV   421     205   1.630198 |
     +--------------------------------+

We see that two states had influence in the fit, Nevada with a Cook's distance of 1.63 and Wyoming, far behind with 0.22. We fit the model without Nevada:

 
. reg ex ecab if state != "NV"
 
      Source |       SS       df       MS              Number of obs =      47
-------------+------------------------------           F(  1,    45) =   29.82
       Model |  57407.0137     1  57407.0137           Prob > F      =  0.0000
    Residual |  86628.8587    45  1925.08575           R-squared     =  0.3986
-------------+------------------------------           Adj R-squared =  0.3852
       Total |  144035.872    46  3131.21462           Root MSE      =  43.876
 
------------------------------------------------------------------------------
          ex |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        ecab |   2.253207   .4126138     5.46   0.000      1.42216    3.084254
       _cons |   70.96941   39.49381     1.80   0.079    -8.575204     150.514
------------------------------------------------------------------------------

We see that the slope increases from $1.78 to $2.25 when we exclude Nevada. The difference is almost three times the standard error of the coefficient and substantively important. Expenditure rises faster with economic ability than previously estimated.

The last step is to construct the graph. In addition to the scatter and lines requested I will identify the three states that came up so far:

 
. twoway (scatter ex ecab)  ///
>         (lfit ex ecab) (lfit ex ecab if state!="NV", lpat(dash)) ///
>         (scatter ex ecab if inlist(state,"NV","WY","MS"), mlab(state) mlabp(9
> )) ///
>   , title(Public Expenditure and Ability to Pay) ///
>         xtitle(Economic Ability Index) ytitle(Public Expenditure Per Capita) 
> ///
>         legend(order(2 "All" 3 "w/o NV") ring(0) pos(5))
 
. graph export set2fig1.png, width(500) replace   
(file set2fig1.png written in PNG format)

Nevada has substantial leverage (because of its position on the index) and actual influence (pulling the line down), ensuring it wouldn't be much of an outlier. Wyoming is a bit of an outlier, with much higher expenditure than expected from its economic ability index, even after excluding Nevada. Mississippi has leverage because it has the lowest score in economic ability, and some influence. In fact it worked in tandem with Nevada. With Nevada gone, however, it doesn't have much influence at all, as you can verify by excluding it as well.

(e) Fit the model to all the data adding a dummy variable for Nevada. Does the t-test for the coefficient of this dummy variable look familiar? How about the constant and the coefficient of ability to pay?

 
. gen nv = state == "NV"
 
. reg ex ecab nv
 
      Source |       SS       df       MS              Number of obs =      48
-------------+------------------------------           F(  2,    45) =   19.70
       Model |  75842.1205     2  37921.0603           Prob > F      =  0.0000
    Residual |  86628.8587    45  1925.08575           R-squared     =  0.4668
-------------+------------------------------           Adj R-squared =  0.4431
       Total |  162470.979    47  3456.82934           Root MSE      =  43.876
 
------------------------------------------------------------------------------
          ex |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        ecab |   2.253207   .4126138     5.46   0.000      1.42216    3.084254
          nv |  -111.8769   63.61354    -1.76   0.085    -240.0012    16.24736
       _cons |   70.96941   39.49381     1.80   0.079    -8.575204     150.514
------------------------------------------------------------------------------

The t-test for the Nevada dummy is exactly the same as the jack-knifed residual, verifying the equivalence between these two. The constant and slope are exactly the same as in the regression without Nevada. This result makes sense if you think that the dummy for Nevada, by giving that state its own parameter, effectively takes it out of the line fitting procedure. We now drop Nevada

 
. drop if state == "NV"
(1 observation deleted)

[2] Modeling Log Expenditure

(a) Take the log of per capita expenditure and regress it on the index of economic ability, the indicator of western states, and linear and quadratic terms on the percent of the population living in metropolitan areas. Center the percent metropolitan around the mean or a central value such as 50 before squaring it. Can you drop the coefficient of the linear term on percent metropolitan, given that it is not significant?

 
. gen logex = log(ex)
 
. gen metcsq = (met-50)^2
 
. reg logex ecab west met metcsq
 
      Source |       SS       df       MS              Number of obs =      47
-------------+------------------------------           F(  4,    42) =   21.51
       Model |  1.21318321     4  .303295803           Prob > F      =  0.0000
    Residual |  .592173934    42  .014099379           R-squared     =  0.6720
-------------+------------------------------           Adj R-squared =  0.6408
       Total |  1.80535715    46  .039246894           Root MSE      =  .11874
 
------------------------------------------------------------------------------
       logex |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        ecab |   .0080679   .0013498     5.98   0.000     .0053438     .010792
        west |   .1459827   .0373659     3.91   0.000     .0705752    .2213902
         met |    .000195   .0009251     0.21   0.834     -.001672     .002062
      metcsq |   .0000846   .0000296     2.86   0.007     .0000249    .0001443
       _cons |   4.725246   .1106459    42.71   0.000     4.501954    4.948538
------------------------------------------------------------------------------
 
. estimates store additive // needed in part d

You can, but I wouldn't. The problem is that dropping the linear term may change the shape of the fitted curve, so at least one would need to check that a pure quadratic makes sense. In general we prefer hierarchical models that keep the linear term when adding a quadratic, and keep main effects when adding interactions.

(b) Interpret the coefficient for western states and its 95% confidence interval in terms of per capita expenditure (rather than the log).

 
. di exp(_b[west]), exp(_b[west]-1.96*_se[west]),exp(_b[west]+1.96*_se[west])
1.1571762 1.0754568 1.2451051

The exponentiated coefficient shows that on average western states spend about 14.8% more than no-western states with the same economic ability and percent metropolitan. Exponentiating the confidence bounds we estimate with 95% confidence that the relative difference in expenditure betwees western and no-western states with the same economic ability and percent metropolitan is between six and 24.3%.

It is customary with a logged outcome to interpret coefficients as relative changes, but this is only a first-order approximation, so it should be used only for small coefficients or small changes. Here the approximation gives 13.8 instead of 14.8, so it's not too far off. In terms of the 95% confidence interval it works much better for the lower bound (5.6 vs. 6.0) than for the upper bound (22.0 vs. 24.3).

(c) Is there any evidence that the difference in expenditure per capita between western and non-western states with the same economic ability and percent metropolitan varies with economic ability? Justify your answer with a likelihood ratio test.

The question is whether there is an interaction between the indicator of western states and the index of economic ability. (It would have been even clearer if I had asked whether the relative difference between western and non-western states varies with economic ability, because the absolute difference in expenditure varies even without the interaction term. Interactions are alwasys specific to the scale one is working on.)

As usual, we center the continuous variable before computing an interaction.

 
. quietly sum ecab
 
. di r(mean)
94.451064
 
. gen westXecabc   = west * (ecab - r(mean) )
 
. reg logex ecab west met metcsq westXecabc
 
      Source |       SS       df       MS              Number of obs =      47
-------------+------------------------------           F(  5,    41) =   17.44
       Model |   1.2279246     5  .245584921           Prob > F      =  0.0000
    Residual |  .577432541    41  .014083721           R-squared     =  0.6802
-------------+------------------------------           Adj R-squared =  0.6412
       Total |  1.80535715    46  .039246894           Root MSE      =  .11867
 
------------------------------------------------------------------------------
       logex |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        ecab |   .0069115     .00176     3.93   0.000     .0033571    .0104659
        west |   .1499681   .0375478     3.99   0.000     .0741388    .2257974
         met |   .0004364   .0009542     0.46   0.650    -.0014907    .0023635
      metcsq |   .0000894   .0000299     2.98   0.005     .0000289    .0001499
  westXecabc |   .0024196    .002365     1.02   0.312    -.0023566    .0071958
       _cons |   4.819771   .1441015    33.45   0.000     4.528752     5.11079
------------------------------------------------------------------------------
 
. test westXecabc
 
 ( 1)  westXecabc = 0
 
       F(  1,    41) =    1.05
            Prob > F =    0.3123

Because this is a linear model the Wald and likelihood ratio tests are the same, and given that we are testing a single term we can simply look at the output: the t-test of 1.02 on 41 d.f. is nowhere near significant. The square of this is the F statistic reported by Stata's test command, 1.05 on 1 and 41 d.f.

If you thought of using Stata's lrtest command you should know that Stata compares log-likelihoods maximized with respect to both β and σ2 and gives a chi-squared of 1.18 on one d.f. The usual approach with GLMs, however, is to treat the variance as a nuisance parameter and compare log-likelihoods maximized only w.r.t. β, using the estimate of σ as a scaling factor. This leads to a chi-squared of 1.05 on one d.f., which we treat as an F on one and 41 d.f. The two approaches as asymptotically equivalent and, in this case, give very similar results. This point is discussed at the end of section 2.3.2 in the notes (p.13).

(d) Compute jack-knifed residuals and plot them against fitted values using the model of part (a). Any interesting observations?

 
. estimates restore additive
(results additive are active now)
 
. predict fv
(option xb assumed; fitted values)
 
. predict jack2, rstu
 
. graph twoway (scatter jack2 fv) , legend(off) ///
>    title(Residual Plot for Model of Log Expenditure)  
 
. graph export set2fig2.png, width(500) replace
(note: file set2fig2.png not found)
(file set2fig2.png written in PNG format)

The plot shows jack-knifed residuals mostly between -2 and 2 with no pattern indicating reasons to be concerned.

(e) How large would a jack-knifed residual have to be in this dataset before you could conclude with 95% confidence that it is an outlier?

We can use the Bonferroni approximation to produce an upper bound. We have 47 states and the t-test would have 41 d.f.:

 
. di invttail(41, .025/47)
3.5227954

We would conclude with at least 95% confidence that an observation is an outlier if its jack-knifed residual exceeded 3.52 in magnitude. As the plot shows, we have no residuals anywhere near that large.

[3] Box-Cox Transformations

(a) Check whether taking logs was a good idea by formally testing the reciprocal, log and identity transformations in a Box-Cox framework using the model with economic ability, western states, and linear and quadratic terms on percent metropolitan.

 
. boxcox ex ecab west met metcsq
Fitting comparison model
 
Iteration 0:   log likelihood = -255.34036  
Iteration 1:   log likelihood = -254.60344  
Iteration 2:   log likelihood = -254.60278  
Iteration 3:   log likelihood = -254.60278  
 
Fitting full model
 
Iteration 0:   log likelihood = -229.40211  
Iteration 1:   log likelihood = -228.38537  
Iteration 2:   log likelihood = -228.38445  
Iteration 3:   log likelihood = -228.38445  
 
                                                  Number of obs   =         47
                                                  LR chi2(4)      =      52.44
Log likelihood = -228.38445                       Prob > chi2     =      0.000
 
------------------------------------------------------------------------------
          ex |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      /theta |   .2286441   .5487097     0.42   0.677    -.8468071    1.304095
------------------------------------------------------------------------------
 
Estimates of scale-variant parameters
----------------------------
             |      Coef.
-------------+--------------
Notrans      |
        ecab |   .0291118
        west |   .5264081
         met |   .0005376
      metcsq |   .0003101
       _cons |   8.226761
-------------+--------------
      /sigma |   .4058387
----------------------------
 
---------------------------------------------------------
   Test         Restricted     LR statistic      P-value
    H0:       log likelihood       chi2       Prob > chi2
---------------------------------------------------------
theta = -1      -230.74013         4.71           0.030
theta =  0      -228.47037         0.17           0.678
theta =  1      -229.40211         2.04           0.154
---------------------------------------------------------

The estimates show that the optimal transformation has λ=0.23, which is very close to the log. The tests listed at show that we can not reject the log transformation. It is also the case that we can not reject the hypothesis that no transformation is needed, although the likelihood is somewhat lower. We can, however, reject the reciprocal transformation. In conclusion it looks like taking logs was a good idea, although we could have gotten away with no transformation at all.

(b) Verify your results by computing and plotting the Box-Cox profile log-likelihood for values of the transformation parameter in the range from -2 to 2.

We follow the procedures outlined in the Stata logs, storing the power transformation and resulting log-likelihood in two variables.

 
. scalar max_logL = e(ll)
 
. gen lambda = .
(47 missing values generated)
 
. gen logL = .
(47 missing values generated)
 
. local i = 0
 
. forvalues p=-2(.2)2 {
  2.   quietly boxcox ex ecab west met metcsq, iterate(0) from(`p', copy)
  3.   local i = `i' + 1
  4.   quietly replace lambda = `p' in `i'
  5.   quietly replace logL = e(ll) in `i'
  6. }
 
. line logL lambda  || function y=max_logL-3.84/2, range(-1.1 1.5) ///
>   , title(Box-Cox Profile Log-Likelihood) legend(off)
 
. graph export set2fig3.png, width(500) replace
(file set2fig3.png written in PNG format)

Looks like transformations strictly between the reciprocal and a power of about 1.4 would be aceptable.

(c) Use the Box-Tidwell procedure described in the notes to get an indication of whether we should have taken the log of the economic ability index as well, assuming we continue to work with the log of expenditure.

The procedure involves adding an auxiliary variable equal to the product of the variable of interest and its log.

 
. gen aux = ecab * log(ecab)
 
. reg logex west met metcsq ecab aux
 
      Source |       SS       df       MS              Number of obs =      47
-------------+------------------------------           F(  5,    41) =   18.43
       Model |  1.24937888     5  .249875776           Prob > F      =  0.0000
    Residual |  .555978267    41  .013560446           R-squared     =  0.6920
-------------+------------------------------           Adj R-squared =  0.6545
       Total |  1.80535715    46  .039246894           Root MSE      =  .11645
 
------------------------------------------------------------------------------
       logex |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        west |   .1378212   .0369838     3.73   0.001      .063131    .2125114
         met |   .0002988   .0009095     0.33   0.744    -.0015379    .0021356
      metcsq |   .0000933   .0000295     3.16   0.003     .0000337    .0001528
        ecab |   .1020919   .0575655     1.77   0.084     -.014164    .2183479
         aux |  -.0170506   .0104364    -1.63   0.110    -.0381273     .004026
       _cons |    3.18415   .9494957     3.35   0.002     1.266605    5.101696
------------------------------------------------------------------------------
 
. di 1 + _b[aux]/_b[ecab]
.83298755

The procedure suggests a transformation with power 0.83, which is close enough to unity to conclude that a transformation is not needed. (It turns out this approximation is not very good, as explained below.)

Just for fun: you may want to try Stata's Box-Cox command to find optimal transformations of both expenditure and economic ability but not the dummy for western states or the polynomial terms on percent metropolitan.

The command you would need to transform expenditure and economic ability using possibly different powers while including west, metropolitan and its square without transforming them is boxcox ex ecab, notrans(west met metcsq) model(theta). The optimal Box-Cox transformation has power -0.07 for expenditure, suggesting again the log, and -0.99 for economic ability, suggesting the reciprocal.

I was surprised that the Box-Tidwell approximation did not pick this up, and went back to the original sources. It turns out you get much better results if you divide by the coefficient of the variable of interest in the original model, not the model with the auxiliary variable. Doing this yields λ = -1.11, suggesting the reciprocal tranformation in agreement with the boxcox command. I will revise the notes accordingly.