The Anscombe Datasets
Anscombe (1973) has a nice example in which he uses constructed data to emphasize the importance of graphs in statistical analysis. The data are available in the Stata bookstore as part of the support for Kohler and Kreuter's Data Analysis Using Stata, and can be read using the following command:
. use http://www.stata-press.com/data/kk/anscombe
(synthetical data (Anscombe 1973))
There are 8 variables, representing four pairings of an outcome and a predictor. All sets have 11 observations, the same mean of x (9) and y (7.5), the same fitted regression line (y = 3 + 0.5 x), and, up to rounding, the same regression and residual sums of squares and therefore the same multiple R-squared of 0.67. But they represent very different situations, as the four cases below show.
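These claims are easy to check directly. The following is a minimal sketch, assuming the data have already been loaded with the use command shown earlier and that the variables are named x1 to x4 and y1 to y4, as the regression commands on this page suggest. It summarizes all eight variables and then loops over the four pairs, printing the fitted constant, slope and R-squared. The commands are written as they would appear in a do-file, without the dot prompt.

```stata
* Means and standard deviations of all eight variables
summarize x1 x2 x3 x4 y1 y2 y3 y4

* Fit each pair quietly and print the key results on one line per dataset
forvalues i = 1/4 {
    quietly regress y`i' x`i'
    display "dataset `i': cons = " %6.3f _b[_cons] "  slope = " %6.3f _b[x`i'] "  R-squared = " %6.4f e(r2)
}
```

The four lines printed by the loop should agree to two or three decimal places, which is exactly why the numbers alone cannot tell the datasets apart.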
Here's the first dataset:
This is an example of pure error: the observations appear randomly distributed around the regression line, just as the model assumes. For the record, here is the regression output.
. regress y1 x1
      Source |       SS       df       MS              Number of obs =      11
-------------+------------------------------           F(  1,     9) =   17.99
       Model |  27.5100011     1  27.5100011           Prob > F      =  0.0022
    Residual |  13.7626904     9  1.52918783           R-squared     =  0.6665
-------------+------------------------------           Adj R-squared =  0.6295
       Total |  41.2726916    10  4.12726916           Root MSE      =  1.2366

------------------------------------------------------------------------------
          y1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .5000909   .1179055     4.24   0.002     .2333701    .7668117
       _cons |   3.000091   1.124747     2.67   0.026     .4557369    5.544445
------------------------------------------------------------------------------
The graph was produced using the following command:
. twoway scatter y1 x1 || lfit y1 x1, title("Pure Error") ytitle(y1) legend(off)
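Another way to see "pure error" is to plot the residuals against the fitted values, where no pattern should be visible. The sketch below uses rvfplot, a standard postestimation command after regress that is not part of the original example. The graph title is just a label chosen here.

```stata
* Refit the first pair, then plot residuals against fitted values
quietly regress y1 x1
rvfplot, yline(0) title("Residuals vs. fitted, dataset 1")
```

Running the same two lines with y2 and x2 would be expected to show a clear curve instead of a random scatter.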
The data in the second dataset look very different, even though the R-squared is the same:
This is an example of lack of fit: the model assumes a linear relationship, but the dependence is in fact curvilinear. If you add a quadratic term you can increase R-squared to 1. For the record, this is the regression for the second pair:
. reg y2 x2
      Source |       SS       df       MS              Number of obs =      11
-------------+------------------------------           F(  1,     9) =   17.97
       Model |  27.5000024     1  27.5000024           Prob > F      =  0.0022
    Residual |   13.776294     9  1.53069933           R-squared     =  0.6662
-------------+------------------------------           Adj R-squared =  0.6292
       Total |  41.2762964    10  4.12762964           Root MSE      =  1.2372

------------------------------------------------------------------------------
          y2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x2 |         .5   .1179638     4.24   0.002     .2331475    .7668526
       _cons |   3.000909   1.125303     2.67   0.026     .4552978     5.54652
------------------------------------------------------------------------------
The graph was produced using this command:
. twoway scatter y2 x2 || lfit y2 x2, title("Lack of Fit") ytitle(y2) legend(off)
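The remark about the quadratic term can be checked with one extra regression. This is a minimal sketch showing two equivalent ways to add the squared term. The first relies on factor-variable notation, which assumes Stata 11 or later. The second generates the square explicitly, and x2sq is simply a name chosen here.

```stata
* Add a squared term using factor-variable notation (Stata 11 or later)
regress y2 c.x2##c.x2

* Equivalent fit with an explicitly generated squared term
generate x2sq = x2^2
regress y2 x2 x2sq
```

Both regressions should report an R-squared that rounds to 1, confirming that the curvature accounts for nearly all of the residual variation left by the straight line.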
Here is the graph for the third pair. R-squared is still 0.666, but the data look very different.
This is an example of an outlier: the model specifies a linear relationship and all points but one follow it exactly. You could increase R-squared to 1 by omitting the outlier. Here's the regression output:
. reg y3 x3
      Source |       SS       df       MS              Number of obs =      11
-------------+------------------------------           F(  1,     9) =   17.97
       Model |  27.4700075     1  27.4700075           Prob > F      =  0.0022
    Residual |  13.7561905     9  1.52846561           R-squared     =  0.6663
-------------+------------------------------           Adj R-squared =  0.6292
       Total |  41.2261979    10  4.12261979           Root MSE      =  1.2363

------------------------------------------------------------------------------
          y3 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x3 |   .4997273   .1178777     4.24   0.002     .2330695    .7663851
       _cons |   3.002455   1.124481     2.67   0.026     .4587014    5.546208
------------------------------------------------------------------------------
The graph was produced using this command:
. twoway scatter y3 x3 || lfit y3 x3, title("An Outlier") ytitle(y3) legend(off)
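To see the effect of omitting the outlier without hard-coding its value, one option is to flag the observation with the largest absolute residual and refit without it. This is only a sketch of that idea. The variable names res3 and absres3 are made up for the illustration, and treating the largest residual as the outlier is an assumption that happens to suit this dataset.

```stata
* Residuals from the fit to all 11 observations
quietly regress y3 x3
predict double res3, residuals
generate double absres3 = abs(res3)

* Find the largest absolute residual and refit without that observation
summarize absres3, meanonly
regress y3 x3 if absres3 < r(max)
```

The second regression should use 10 observations and report an R-squared that rounds to 1, with a slope and constant noticeably different from 0.5 and 3.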
Finally, this is the graph for the fourth dataset. Once again, R-squared is 0.666, the same as in the previous three cases:
This is an example of influence: the slope is completely determined by the observation on the right. Without that observation we would not be able to estimate the regression, as all x's would be the same. Here's the regression:
. reg y4 x4
      Source |       SS       df       MS              Number of obs =      11
-------------+------------------------------           F(  1,     9) =   18.00
       Model |  27.4900007     1  27.4900007           Prob > F      =  0.0022
    Residual |  13.7424908     9  1.52694342           R-squared     =  0.6667
-------------+------------------------------           Adj R-squared =  0.6297
       Total |  41.2324915    10  4.12324915           Root MSE      =  1.2357

------------------------------------------------------------------------------
          y4 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x4 |   .4999091   .1178189     4.24   0.002     .2333841    .7664341
       _cons |   3.001727   1.123921     2.67   0.026     .4592411    5.544213
------------------------------------------------------------------------------
The graph was produced using this command:
. twoway scatter y4 x4 || lfit y4 x4, title("Influence") ytitle(y4) legend(off)
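Influence of this kind shows up in the leverage values, the diagonal elements of the hat matrix. The following sketch tabulates the predictor, computes leverage after the regression, and then tries to refit without the high-leverage point. The variable name lev4 and the cutoff of 0.5 are choices made here for illustration, not part of the original example.

```stata
* How many distinct values does the predictor take?
tabulate x4

* Leverage for each observation after the fit to all 11 points
quietly regress y4 x4
predict double lev4, leverage
list x4 y4 lev4 if lev4 > 0.5

* Refit without the high-leverage point: x4 is then constant
regress y4 x4 if lev4 <= 0.5
```

The listing should show a single observation with leverage very close to 1, and in the last regression Stata should omit x4 because of collinearity, since a predictor with no variation cannot be separated from the constant.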
Hopefully these examples will persuade you of the importance of looking at the data. Anscombe noted some of the difficulties involved in producing plots like this back in 1973, but with software such as Stata there really isn't any excuse these days.
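To view all four cases side by side, the individual graph commands can be wrapped in a loop and the results combined. This is a sketch only: the graph names g1 to g4 and the local macros t1 to t4 are introduced here, and the code is meant to be run from a do-file so that the local macros are still defined inside the loop.

```stata
* Panel titles taken from the four graphs above
local t1 "Pure Error"
local t2 "Lack of Fit"
local t3 "An Outlier"
local t4 "Influence"

* Draw each scatterplot with its fitted line and keep it in memory
forvalues i = 1/4 {
    twoway scatter y`i' x`i' || lfit y`i' x`i', title("`t`i''") ytitle(y`i') legend(off) name(g`i', replace)
}

* Arrange the four graphs in a 2 x 2 grid
graph combine g1 g2 g3 g4, cols(2)
```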
Reference: F. J. Anscombe (1973). Graphs in Statistical Analysis. The American Statistician, 27(1):17-21. If you have access to JSTOR you can get the article at the following link: http://www.jstor.org/stable/2682899.
