2.9 Regression Diagnostics
All of the diagnostic measures discussed in the lecture notes can
be calculated in Stata, some in more than one way.
In particular, you may want to read about the command
predict after regress in the Stata manual.
In this section we will be working with the additive analysis of covariance model of the previous section. To avoid retyping the predictors each time we need them, we can define a local macro:
. local predictors "setting effort_mod effort_str"
Now we can fit our model using the following command:
. regress change `predictors'
      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  3,    16) =   21.55
       Model |  2124.50633     3  708.168776           Prob > F      =  0.0000
    Residual |  525.693673    16  32.8558546           R-squared     =  0.8016
-------------+------------------------------           Adj R-squared =  0.7644
       Total |      2650.2    19  139.484211           Root MSE      =   5.732

------------------------------------------------------------------------------
      change |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     setting |   .1692677   .1055505     1.60   0.128    -.0544894    .3930247
  effort_mod |   4.143915   3.191179     1.30   0.213    -2.621082    10.90891
  effort_str |   19.44761   3.729295     5.21   0.000     11.54186    27.35336
       _cons |  -5.954036    7.16597    -0.83   0.418    -21.14521    9.237141
------------------------------------------------------------------------------
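In case local macros are new to you, here is a small sketch of how this one behaves. Stata substitutes the contents of the macro wherever its name appears between a left quote (`) and an apostrophe ('), so the regress command above is equivalent to typing the three predictors in full. You can check the expansion with display:
. display "`predictors'"
. regress change setting effort_mod effort_str
Keep in mind that a local macro exists only in the session or do-file where it was defined. If you work from a do-file, run the local line together with the commands that use it; otherwise `predictors' expands to nothing and regress fits a model with just the constant.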
Let us start with the residuals.
The easiest way to get them is with options of the
predict command. Specify the option
res for the raw residuals,
rstand for the standardized residuals, and
rstud for the studentized (or jackknifed) residuals.
Let us obtain all three:
. predict ri, res
. predict si, rsta
. predict ti, rstu
. label var ti "Jack-knifed residuals"
To get the leverages (the diagonal elements of the hat matrix) and
Cook's distance we use two more options of predict, hat and cook:
. predict hii, hat
. predict di, cook
We are now ready to print Table 2.29 in the notes:
. list country ri si ti hii di, clean
            country         ri         si         ti        hii         di
 1.         Bolivia  -.8322767  -.1689738  -.1637543   .2616128    .002529
 2.          Brazil   3.428229   .6573142    .645213   .1720945   .0224529
 3.           Chile   .4416054   .0834989   .0808651   .1486769   .0003044
 4.        Colombia  -1.527183  -.2913581  -.2828576   .1637904   .0041569
 5.       CostaRica   1.287944    .242732   .2354582   .1431063   .0024599
 6.            Cuba   11.44161   2.163383   2.490349   .1486769   .2043412
 7.    DominicanRep   11.29992   2.161597   2.487445   .1682585   .2363079
 8.         Ecuador  -10.03862  -1.925296  -2.126719   .1725536   .1932498
 9.      ElSalvador   4.654061   .8956616   .8898143    .178205   .0434895
10.       Guatemala    -3.4996  -.6853749  -.6735727    .206462    .030554
11.           Haiti   .0296676   .0069303   .0067103   .4422478   9.52e-06
12.        Honduras   .1774703   .0355449   .0344175   .2412746   .0001004
13.         Jamaica  -7.219859  -1.361729  -1.402245   .1444142   .0782469
14.          Mexico     .90482   .1830367   .1774104   .2562359   .0028855
15.       Nicaragua   1.443835   .2726553   .2646128   .1465179   .0031905
16.          Panama  -5.712056  -1.076521  -1.082269   .1431063   .0483857
17.        Paraguay  -.5717711   -.109629  -.1061877   .1720945   .0006246
18.            Peru  -4.402503  -.8410965  -.8330122   .1661363   .0352372
19.  TrinidadTobago   1.287944    .242732   .2354582   .1431063   .0024599
20.       Venezuela  -2.593236  -.5752294  -.5628135   .3814295    .051009
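The three residuals and Cook's distance are closely related, and it can be reassuring to reproduce them by hand. The sketch below assumes the regress results are still in memory and uses the standard formulas: the standardized residual divides ri by s*sqrt(1-hii), the jackknifed residual rescales si using the residual degrees of freedom, and Cook's distance combines si and hii. The names si_check, ti_check and di_check are mine, not from the notes.
. gen si_check = ri/(e(rmse)*sqrt(1 - hii))
. gen ti_check = si*sqrt((e(df_r) - 1)/(e(df_r) - si^2))
. gen di_check = si^2*hii/((1 - hii)*(e(df_m) + 1))
. list country si si_check ti ti_check di di_check, clean
Here e(rmse) is the root MSE, e(df_r) is the residual degrees of freedom (16), and e(df_m)+1 is the number of parameters including the constant (4). Each check column should agree with the corresponding predict result up to rounding.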
Here is an easy way to find the cases highlighted in Table 2.29, those with standardized or jackknifed residuals greater than 2 in magnitude:
. list country ri si ti hii di if abs(si) > 2 | abs(ti) > 2, clean
            country         ri         si         ti        hii         di
 6.            Cuba   11.44161   2.163383   2.490349   .1486769   .2043412
 7.    DominicanRep   11.29992   2.161597   2.487445   .1682585   .2363079
 8.         Ecuador  -10.03862  -1.925296  -2.126719   .1725536   .1932498
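We will use a scalar to calculate the maximum acceptable leverage, which is 2p/n in general, where p is the number of parameters (here 4, counting the constant) and n is the number of observations (here 20). We then list the cases exceeding that value (if any).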
. scalar hiimax = 2*4/20
. list country ri si ti hii di if hii > hiimax, clean
            country         ri         si         ti        hii         di
11.           Haiti   .0296676   .0069303   .0067103   .4422478   9.52e-06
So, Haiti has a lot of leverage, but very little actual influence.
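If you would rather not hard-code the 4 and the 20, the same cutoff can be computed from the results that regress leaves behind. This is only a sketch, and it assumes the model above is still the most recent estimation command:
. scalar hiimax = 2*(e(df_m) + 1)/e(N)
. display hiimax
. list country hii di if hii > hiimax, clean
Here e(df_m)+1 counts the parameters including the constant and e(N) is the number of observations, so the scalar should again equal 0.4.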
Let us list the six most influential countries. I will do
this by sorting the data in descending order of influence
and then listing the first six. Stata's regular sort
command sorts only in ascending order, but gsort
can do descending if you specify -di.
. gsort -di
. list country di in 1/6, clean
            country         di
 1.    DominicanRep   .2363079
 2.            Cuba   .2043412
 3.         Ecuador   .1932498
 4.         Jamaica   .0782469
 5.       Venezuela    .051009
 6.          Panama   .0483857
So, the D.R., Cuba, and Ecuador are fairly influential observations. Try refitting the model without the D.R. to verify what I say on page 57 of the lecture notes.
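One way to try this without losing the current fit is sketched below. It assumes country is a string variable holding DominicanRep exactly as listed above (if it is a labelled numeric variable you will need a different condition), and the name ancova for the stored results is my own choice.
. estimates store ancova
. regress change `predictors' if country != "DominicanRep"
. estimates restore ancova
Storing and restoring the estimates matters because predict always works from the most recent fit, and we still need the full model for the plots that follow.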
Residual Plots
On to plots! Here is the standard residual plot in Figure 2.6, produced using the following commands:
. predict yhat
(option xb assumed; fitted values)
. label var yhat "Fitted values"
. scatter ti yhat, title("Figure 2.6: Residual Plot for Ancova Model")
. graph export fig26.png, width(500) replace
(file fig26.png written in PNG format)
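Reference lines and labels can make a plot like this easier to read. As a sketch (the lines at -2, 0 and 2 and the country labels are my additions, not part of Figure 2.6):
. scatter ti yhat, yline(-2 0 2) mlabel(country)
Stata also has a built-in post-estimation command, rvfplot, which plots the raw residuals against the fitted values from the last regression:
. rvfplot, yline(0)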

Q-Q Plots
Now for that lovely Q-Q-plot in Figure 2.7 of the notes:
. qnorm ti, title("Figure 2.7: Q-Q Plot for Residuals of Ancova Model")
. graph export fig27.png, width(500) replace
(file fig27.png written in PNG format)

Wasn't that easy?
Stata's qnorm evaluates the inverse normal cdf at
i/(n+1) rather than at (i-3/8)/(n+1/4) or some
of the other approximations discussed in the notes.
Of course you can use any approximation you want, at the
expense of doing a bit more work.
I will illustrate the general idea by calculating Filliben's
approximation to the expected order statistics or rankits,
using Stata's built-in system variables
_n for the observation number and
_N for the number of cases.
. sort si
. gen pi = (_n-0.3175)/(_N+0.365)
. replace pi = 1-0.5^(1/_N) if _n == 1
(1 real change made)
. replace pi = 0.5^(1/_N) if _n == _N
(1 real change made)
. gen filliben = invnorm(pi)
. corr si filliben
(obs=20)
             |       si filliben
-------------+------------------
          si |   1.0000
    filliben |   0.9655   1.0000
As you can see, the Filliben correlation agrees with
the value in the notes: 0.9655.
I will skip the graph because it looks almost identical to the one
produced by qnorm.
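The same recipe works for the other approximations. As a sketch, here are the rankits based on (i-3/8)/(n+1/4), usually credited to Blom, and on the i/(n+1) positions that qnorm uses. The variable names blom and vdw are mine, and I use invnormal(), which is the name current Stata documentation gives to the function called invnorm() above.
. sort si
. gen blom = invnormal((_n - 3/8)/(_N + 1/4))
. gen vdw = invnormal(_n/(_N + 1))
. corr si filliben blom vdw
. scatter si filliben || lfit si filliben
The last line draws the Q-Q plot against Filliben's rankits, with a fitted line added for reference.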
Continue with 2.10 Transforming the Data

