
2.9 Regression Diagnostics

All of the diagnostic measures discussed in the lecture notes can be calculated in Stata, some in more than one way. In particular, you may want to read the Stata manual entry on using the predict command after regress.

In this section we will be working with the additive analysis of covariance model of the previous section. To save typing the model each time we need it, we can define a local macro:

. local predictors "setting effort_mod effort_str"

Now we can fit our model with the following command, where `predictors' (left quote, macro name, apostrophe) expands to the three variable names:

. regress change `predictors'
 
      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  3,    16) =   21.55
       Model |  2124.50633     3  708.168776           Prob > F      =  0.0000
    Residual |  525.693673    16  32.8558546           R-squared     =  0.8016
-------------+------------------------------           Adj R-squared =  0.7644
       Total |      2650.2    19  139.484211           Root MSE      =   5.732
 
------------------------------------------------------------------------------
      change |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     setting |   .1692677   .1055505     1.60   0.128    -.0544894    .3930247
  effort_mod |   4.143915   3.191179     1.30   0.213    -2.621082    10.90891
  effort_str |   19.44761   3.729295     5.21   0.000     11.54186    27.35336
       _cons |  -5.954036    7.16597    -0.83   0.418    -21.14521    9.237141
------------------------------------------------------------------------------

Let us start with the residuals. The easiest way to get them is through options of the predict command. Specify the option res for the raw residuals, rstand for the standardized residuals, and rstud for the studentized (or jackknifed) residuals; the commands below use the shorter abbreviations rsta and rstu. Let us obtain all three:

. predict ri, res
 
. predict si, rsta 
 
. predict ti, rstu
 
. label var ti "Jack-knifed residuals"

To get the leverages (the diagonal elements of the hat matrix) and Cook's distances we use two more options of predict, hat and cook:

. predict hii, hat
 
. predict di, cook
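For reference, the five quantities just computed are tied together by the usual textbook formulas, summarized here in my own notation (it should agree with the lecture notes up to symbols). I write r_i for the raw residual (ri), h_ii for the leverage (hii), \hat\sigma for the Root MSE of the fit, n for the number of cases, and p for the number of parameters including the constant:

```latex
\begin{align*}
s_i &= \frac{r_i}{\hat\sigma\,\sqrt{1-h_{ii}}}
      && \text{(standardized residual, si)} \\
t_i &= s_i\,\sqrt{\frac{n-p-1}{\,n-p-s_i^{2}\,}}
      && \text{(jackknifed residual, ti)} \\
D_i &= \frac{s_i^{2}}{p}\cdot\frac{h_{ii}}{1-h_{ii}}
      && \text{(Cook's distance, di)}
\end{align*}
```

As a rough check with n = 20, p = 4 and a Root MSE of 5.732, Cuba has r_i = 11.44 and h_ii = 0.1487, which gives s_i of about 11.44/(5.732 x 0.9227) = 2.16, t_i of about 2.16 x sqrt(15/11.32) = 2.49, and D_i of about (4.68/4) x 0.1746 = 0.204.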

We are now ready to print Table 2.29 in the notes:

. list country ri si ti hii di, clean
 
              country          ri          si          ti        hii         di  
  1.          Bolivia   -.8322767   -.1689738   -.1637543   .2616128    .002529  
  2.           Brazil    3.428229    .6573142     .645213   .1720945   .0224529  
  3.            Chile    .4416054    .0834989    .0808651   .1486769   .0003044  
  4.         Colombia   -1.527183   -.2913581   -.2828576   .1637904   .0041569  
  5.        CostaRica    1.287944     .242732    .2354582   .1431063   .0024599  
  6.             Cuba    11.44161    2.163383    2.490349   .1486769   .2043412  
  7.     DominicanRep    11.29992    2.161597    2.487445   .1682585   .2363079  
  8.          Ecuador   -10.03862   -1.925296   -2.126719   .1725536   .1932498  
  9.       ElSalvador    4.654061    .8956616    .8898143    .178205   .0434895  
 10.        Guatemala     -3.4996   -.6853749   -.6735727    .206462    .030554  
 11.            Haiti    .0296676    .0069303    .0067103   .4422478   9.52e-06  
 12.         Honduras    .1774703    .0355449    .0344175   .2412746   .0001004  
 13.          Jamaica   -7.219859   -1.361729   -1.402245   .1444142   .0782469  
 14.           Mexico      .90482    .1830367    .1774104   .2562359   .0028855  
 15.        Nicaragua    1.443835    .2726553    .2646128   .1465179   .0031905  
 16.           Panama   -5.712056   -1.076521   -1.082269   .1431063   .0483857  
 17.         Paraguay   -.5717711    -.109629   -.1061877   .1720945   .0006246  
 18.             Peru   -4.402503   -.8410965   -.8330122   .1661363   .0352372  
 19.   TrinidadTobago    1.287944     .242732    .2354582   .1431063   .0024599  
 20.        Venezuela   -2.593236   -.5752294   -.5628135   .3814295    .051009  
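As a cross-check, the standardized residuals, jackknifed residuals and Cook's distances can be rebuilt from the raw residuals and leverages alone. The sketch below is not part of the original log. It assumes the usual textbook formulas (given in the comments), types in the 20 observations by hand, and recreates the two dummies with cutoffs of 5-14 and 15+, which I take to be those of the previous section. The names nobs, npar, si2, ti2 and di2 are made up for the comparison. If the data are already in memory, prefer them and skip the input step.

```stata
* Assumption: data typed in by hand; prefer the course dataset if you have it.
clear
input str14 country setting effort change
Bolivia        46  0  1
Brazil         74  0 10
Chile          89 16 29
Colombia       77 16 25
CostaRica      84 21 29
Cuba           89 15 40
DominicanRep   68 14 21
Ecuador        70  6  0
ElSalvador     60 13 13
Guatemala      55  9  4
Haiti          35  3  0
Honduras       51  7  7
Jamaica        87 23 21
Mexico         83  4  9
Nicaragua      68  0  7
Panama         84 19 22
Paraguay       74  3  6
Peru           73  0  2
TrinidadTobago 84 15 29
Venezuela      91  7 11
end

* Assumed cutoffs for the effort dummies (weak 0-4 is the reference group)
generate effort_mod = effort >= 5 & effort <= 14
generate effort_str = effort >= 15

quietly regress change setting effort_mod effort_str
predict ri, residuals
predict si, rstandard
predict ti, rstudent
predict hii, hat
predict di, cooksd

* n cases and p parameters (model df plus the constant)
scalar nobs = e(N)
scalar npar = e(df_m) + 1

* Hand-built versions of the three derived diagnostics
generate si2 = ri/(e(rmse)*sqrt(1 - hii))
generate ti2 = si2*sqrt((nobs - npar - 1)/(nobs - npar - si2^2))
generate di2 = si2^2*hii/((1 - hii)*npar)

* assert is silent on success and stops with an error on any mismatch
assert reldif(si, si2) < 1e-5
assert reldif(ti, ti2) < 1e-5
assert reldif(di, di2) < 1e-5
```

Each assert checks all 20 cases at once, so a clean run means the hand-built columns agree with what predict returned, up to storage precision.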

Here is an easy way to find the cases highlighted in Table 2.29, those with standardized or jackknifed residuals greater than 2 in magnitude:

. list country ri si ti hii di if abs(si) > 2 | abs(ti) > 2, clean
 
            country         ri         si         ti        hii         di  
  6.           Cuba   11.44161   2.163383   2.490349   .1486769   .2043412  
  7.   DominicanRep   11.29992   2.161597   2.487445   .1682585   .2363079  
  8.        Ecuador  -10.03862  -1.925296  -2.126719   .1725536   .1932498  

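We will use a scalar to calculate the maximum acceptable leverage, which is 2p/n in general, where p is the number of parameters (here 4, counting the constant) and n is the number of observations (here 20). We then list the cases exceeding that value (if any).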

. scalar hiimax = 2*4/20
 
. list country ri si ti hii di if hii > hiimax, clean 
 
       country         ri         si         ti        hii         di  
 11.     Haiti   .0296676   .0069303   .0067103   .4422478   9.52e-06  
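The cutoff above hard-codes p = 4 and n = 20. A minimal sketch of the same calculation using the results that regress stores is shown below. To keep it runnable on its own it uses Stata's shipped auto dataset as a stand-in, not the effort data, and the names lev, npar and levmax are hypothetical. The scalar names were chosen so that they cannot be mistaken for an abbreviation of a variable already in the data.

```stata
* Stand-in example on the auto data shipped with Stata (not the effort data)
sysuse auto, clear
quietly regress price weight mpg

* Leverages: diagonal elements of the hat matrix
predict lev, hat

* p counts the constant, so p = model degrees of freedom + 1
scalar npar = e(df_m) + 1

* Rule-of-thumb cutoff 2p/n, with n taken from the estimation sample
scalar levmax = 2*npar/e(N)
display "cutoff 2p/n = " levmax

* List any cases whose leverage exceeds the cutoff
list make lev if lev > levmax & !missing(lev), clean
```

Applied right after the regression in this section, the same two scalar lines would reproduce 2*4/20 = 0.4 without any numbers being typed by hand.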

So, Haiti has a lot of leverage, but very little actual influence. Let us list the six most influential countries. I will do this by sorting the data in descending order of influence and then listing the first six. Stata's regular sort command sorts only in ascending order, but gsort sorts in descending order when the variable name is prefixed with a minus sign, as in -di.

. gsort -di
 
. list country di in 1/6, clean
 
            country         di  
  1.   DominicanRep   .2363079  
  2.           Cuba   .2043412  
  3.        Ecuador   .1932498  
  4.        Jamaica   .0782469  
  5.      Venezuela    .051009  
  6.         Panama   .0483857  

So, the D.R., Cuba, and Ecuador are fairly influential observations. Try refitting the model without the D.R. to verify what I say on page 57 of the lecture notes.

Residual Plots

On to plots! Here is the standard residual plot in Figure 2.6, produced using the following commands:

. predict yhat
(option xb assumed; fitted values)
 
. label var yhat "Fitted values"
 
. scatter ti yhat, title("Figure 2.6: Residual Plot for Ancova Model") 
 
. graph export fig26.png, width(500) replace
(file fig26.png written in PNG format)

Q-Q Plots

Now for that lovely Q-Q plot in Figure 2.7 of the notes:

. qnorm ti, title("Figure 2.7: Q-Q Plot for Residuals of Ancova Model")
 
. graph export fig27.png, width(500) replace
(file fig27.png written in PNG format)

Wasn't that easy? Stata's qnorm evaluates the inverse normal cdf at i/(n+1) rather than at (i-3/8)/(n+1/4) or some of the other approximations discussed in the notes. Of course you can use any approximation you want, at the expense of doing a bit more work. I will illustrate the general idea by calculating Filliben's approximation to the expected order statistics or rankits, using Stata's built-in system variables _n for the observation number and _N for the number of cases.

. sort si
 
. gen pi = (_n-0.3175)/(_N+0.365)
 
. replace pi = 1-0.5^(1/_N) if _n == 1
(1 real change made)
 
. replace pi = 0.5^(1/_N)   if _n ==_N 
(1 real change made)
 
. gen filliben = invnorm(pi)
 
. corr si filliben
(obs=20)
 
             |       si filliben
-------------+------------------
          si |   1.0000
    filliben |   0.9655   1.0000

As you can see, the Filliben correlation agrees with the value in the notes: 0.9655. I will skip the graph because it looks almost identical to the one produced by qnorm.
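The probabilities used above can be written compactly as follows. This only restates the gen and replace commands, with i standing for _n (the rank after sorting on si) and n for _N; the symbol p_i corresponds to the variable pi.

```latex
p_i =
\begin{cases}
1 - 0.5^{1/n}, & i = 1, \\[4pt]
\dfrac{i - 0.3175}{n + 0.365}, & 1 < i < n, \\[8pt]
0.5^{1/n}, & i = n,
\end{cases}
\qquad
\text{filliben}_i = \Phi^{-1}(p_i).
```

With n = 20 the two end points are roughly p_1 = 0.034 and p_20 = 0.966, so the extreme rankits are placed symmetrically about zero.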


Continue with 2.10 Transforming the Data