## 2.9 Regression Diagnostics

All of the diagnostic measures discussed in the lecture notes can be calculated in Stata, some in more than one way. In particular, you may want to read about the command `predict` after `regress` in the Stata manual.

In this section we will be working with the additive analysis of covariance model of the previous section. To save typing the model each time we need it, we can define a local macro

```. local predictors "setting effort_mod effort_str"
```

Now we can fit our model using the following command

```. regress change `predictors'

Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  3,    16) =   21.55
Model |  2124.50633     3  708.168776           Prob > F      =  0.0000
Residual |  525.693673    16  32.8558546           R-squared     =  0.8016
Total |      2650.2    19  139.484211           Root MSE      =   5.732

------------------------------------------------------------------------------
change |      Coef.   Std. Err.      T    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
setting |   .1692677   .1055505     1.60   0.128    -.0544894    .3930247
effort_mod |   4.143915   3.191179     1.30   0.213    -2.621082    10.90891
effort_str |   19.44761   3.729295     5.21   0.000     11.54186    27.35336
_cons |  -5.954036    7.16597    -0.83   0.418    -21.14521    9.237141
------------------------------------------------------------------------------
```

Let us start with the residuals. The easiest way to get them is as options of the `predict` command. Specify the option `res` for the raw residuals, `rstand` for the standardized residuals, and `rstud` for the studentized (or jackknifed) residuals. Let us obtain all three:

```. predict ri, res

. predict si, rsta

. predict ti, rstu

. label var ti "Jack-knifed residuals"
```

To get the hat matrix and Cook's distance we use two more options of `predict`, `hat` and `cook`:

```. predict hii, hat

. predict di, cook
```

We are now ready to print Table 2.29 in the notes

```. list country ri si ti hii di, clean

country          ri          si          ti        hii         di
1.          Bolivia   -.8322767   -.1689738   -.1637543   .2616128    .002529
2.           Brazil    3.428229    .6573142     .645213   .1720945   .0224529
3.            Chile    .4416054    .0834989    .0808651   .1486769   .0003044
4.         Colombia   -1.527183   -.2913581   -.2828576   .1637904   .0041569
5.        CostaRica    1.287944     .242732    .2354582   .1431063   .0024599
6.             Cuba    11.44161    2.163383    2.490349   .1486769   .2043412
7.     DominicanRep    11.29992    2.161597    2.487445   .1682585   .2363079
8.          Ecuador   -10.03862   -1.925296   -2.126719   .1725536   .1932498
9.       ElSalvador    4.654061    .8956616    .8898143    .178205   .0434895
10.        Guatemala     -3.4996   -.6853749   -.6735727    .206462    .030554
11.            Haiti    .0296676    .0069303    .0067103   .4422478   9.52e-06
12.         Honduras    .1774703    .0355449    .0344175   .2412746   .0001004
13.          Jamaica   -7.219859   -1.361729   -1.402245   .1444142   .0782469
14.           Mexico      .90482    .1830367    .1774104   .2562359   .0028855
15.        Nicaragua    1.443835    .2726553    .2646128   .1465179   .0031905
16.           Panama   -5.712056   -1.076521   -1.082269   .1431063   .0483857
17.         Paraguay   -.5717711    -.109629   -.1061877   .1720945   .0006246
18.             Peru   -4.402503   -.8410965   -.8330122   .1661363   .0352372
19.   TrinidadTobago    1.287944     .242732    .2354582   .1431063   .0024599
20.        Venezuela   -2.593236   -.5752294   -.5628135   .3814295    .051009
```

Here is an easy way to find the cases highlighted in Table 2.29, those with standardized or jackknifed residuals greater than 2 in magnitude:

```. list country ri si ti hii di if abs(si) > 2 | abs(ti) > 2, clean

country         ri         si         ti        hii         di
6.           Cuba   11.44161   2.163383   2.490349   .1486769   .2043412
7.   DominicanRep   11.29992   2.161597   2.487445   .1682585   .2363079
8.        Ecuador  -10.03862  -1.925296  -2.126719   .1725536   .1932498
```

We will use a scalar to calculate the maximum acceptable leverage, which is 2p/n in general, and then list the cases exceeding that value (if any).

```. scalar hiimax = 2*4/20

. list country ri si ti hii di if hii > hiimax, clean

country         ri         si         ti        hii         di
11.     Haiti   .0296676   .0069303   .0067103   .4422478   9.52e-06
```

So, Haiti has a lot of leverage, but very little actual influence. Let us list the six most influential countries. I will do this by sorting the data in descending order of influence and then listing the first six. Stata's regular `sort` command sorts only in ascending order, but `gsort` can do descending if you specify `-di`.

```. gsort -di

. list country di in 1/6, clean

country         di
1.   DominicanRep   .2363079
2.           Cuba   .2043412
4.        Jamaica   .0782469
5.      Venezuela    .051009
6.         Panama   .0483857
```

So, the D.R., Cuba, and Ecuador are fairly influential observations. Try refitting the model without the D.R. to verify what I say on page 57 of the lecture notes.

#### Residual Plots

On to plots! Here is the standard residual plot in Figure 2.6, produced using the following commands:

```. predict yhat
(option xb assumed; fitted values)

. label var yhat "Fitted values"

. scatter ti yhat, title("Figure 2.6: Residual Plot for Ancova Model")

. graph export fig26.png, width(500) replace
(file fig26.png written in PNG format)
```

#### Q-Q Plots

Now for that lovely Q-Q-plot in Figure 2.7 of the notes:

```. qnorm ti, title("Figure 2.7: Q-Q Plot for Residuals of Ancova Model")

. graph export fig27.png, width(500) replace
(file fig27.png written in PNG format)
```

Wasn't that easy? Stata's `qnorm` evaluates the inverse normal cdf at i/(n+1) rather than at (i-3/8)/(n+1/4) or some of the other approximations discussed in the notes. Of course you can use any approximation you want, at the expense of doing a bit more work. I will illustrate the general idea by calculating Filliben's approximation to the expected order statistics or rankits, using Stata's built-in system variables `_n` for the observation number and `_N` for the number of cases.

```. sort si

. gen pi = (_n-0.3175)/(_N+0.365)

. replace pi = 1-0.5^(1/_N) if _n == 1

. replace pi = 0.5^(1/_N)   if _n ==_N

. gen filliben = invnorm(pi)

. corr si filliben
(obs=20)

|       si filliben
-------------+------------------
si |   1.0000
filliben |   0.9655   1.0000
```

As you can see, the Filliben correlation agrees with the value in the notes: 0.9655. I will skip the graph because it looks almost identical to the one produced by `qnorm`.

Continue with 2.10 Transforming the Data