Anscombe (1973) has a nice example where he uses a constructed dataset to emphasize the importance of using graphs in statistical analysis. If you have access to JSTOR you can get the article at the following link: http://links.jstor.org/sici?sici=0003-1305%28197302%2927%3A1%3C17%3AGISA%3E2.0.CO%3B2-J.
The data are available in the Stata bookstore as part of the support for Kohler and Kreuter's Data Analysis Using Stata, and can be read using the following command
. use http://www.stata-press.com/data/kk/anscombe
(synthetical data (Anscombe 1973))
There are 8 variables, representing four pairings of an outcome and a predictor. All sets have 11 observations, the same mean of x (9) and y (7.5), the same fitted regression line (y = 3 + 0.5 x), the same regression and residual sum of squares and therefore exactly the same multiple R-squared (0.667).
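These summary claims are easy to check directly. The following is a minimal sketch, assuming the eight variables are named x1 to x4 and y1 to y4, as in the commands used elsewhere in this note:

```stata
* Means and standard deviations of all eight variables
summarize x1 x2 x3 x4 y1 y2 y3 y4

* Correlation within each pair; in a simple regression,
* the squared correlation equals R-squared
forvalues i = 1/4 {
    correlate y`i' x`i'
}
```

The means should agree across the four sets, and the four correlations should be nearly identical.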
For example here is the regression for the first pair.
. regress y1 x1
      Source |       SS       df       MS              Number of obs =      11
-------------+------------------------------           F(  1,     9) =   17.99
       Model |  27.5100011     1  27.5100011           Prob > F      =  0.0022
    Residual |  13.7626904     9  1.52918783           R-squared     =  0.6665
-------------+------------------------------           Adj R-squared =  0.6295
       Total |  41.2726916    10  4.12726916           Root MSE      =  1.2366

------------------------------------------------------------------------------
          y1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .5000909   .1179055     4.24   0.002     .2333701    .7668117
       _cons |   3.000091   1.124747     2.67   0.026     .4557369    5.544445
------------------------------------------------------------------------------
You should try the other three. The interesting fact is that the relationships are very different, each one illustrating a different effect, which I have labeled pure error, lack of fit, outlier, and influence.
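One way to run all four regressions and compare them side by side is sketched below. It assumes the variable names y1 to y4 and x1 to x4; the stored-estimate names m1 to m4 are arbitrary labels chosen for this illustration.

```stata
* Fit the same linear model to each pair and store the results
forvalues i = 1/4 {
    quietly regress y`i' x`i'
    estimates store m`i'
}

* One column per pair: coefficients, standard errors, N and R-squared
estimates table m1 m2 m3 m4, b(%9.4f) se(%9.4f) stats(N r2)
```

Because each model uses a differently named predictor, the slopes appear on separate rows of the table, while the constants and the R-squared values line up across the columns.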
The following commands will plot each outcome versus its predictor with the corresponding regression line, and will then combine all four graphs into one.
. twoway scatter y1 x1 || lfit y1 x1, title("Pure Error") legend(off) name(p1)
. twoway scatter y2 x2 || lfit y2 x2, title("Lack of Fit") legend(off) name(p2)
. twoway scatter y3 x3 || lfit y3 x3, title("An Outlier") legend(off) name(p3)
. twoway scatter y4 x4 || lfit y4 x4, title("Influence") legend(off) name(p4)
. graph combine p1 p2 p3 p4, rows(2) cols(2) title("Anscombe's Datasets") scale(.8)
. graph export anscombe.png, replace
(file anscombe.png written in PNG format)

Hopefully this graph will persuade you of the importance of looking at the data. Anscombe noted some of the difficulties involved in producing plots like this back in 1973, but with software such as Stata there really isn't any excuse today.