## The Anscombe Datasets

Anscombe (1973) uses four constructed datasets
to emphasize the importance of using graphs in statistical analysis.
The data are available from Stata Press as part of the support materials for
Kohler and Kreuter's *Data Analysis Using Stata*, and can be read
using the following command:

    . use http://www.stata-press.com/data/kk/anscombe
    (synthetical data (Anscombe 1973))

There are 8 variables, representing four pairings of an outcome and a predictor. All four sets have 11 observations, the same mean of x (9) and y (7.5), the same fitted regression line (y = 3 + 0.5 x), and, up to rounding, the same regression and residual sums of squares and therefore the same multiple R-squared of 0.67. But they represent very different situations, as the graph of each dataset shows.
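The shared summaries can be checked directly. The following is a minimal do-file sketch, assuming the variables are named `x1`-`x4` and `y1`-`y4` as in the regressions on this page; the `clear` option and the display formats are additions made here for convenience.

```stata
* Load the data (clear drops any dataset currently in memory)
use http://www.stata-press.com/data/kk/anscombe, clear

* Means of the four predictors and the four outcomes
summarize x1 x2 x3 x4 y1 y2 y3 y4

* Fit each pair and print intercept, slope and R-squared on one line
forvalues i = 1/4 {
    quietly regress y`i' x`i'
    display "pair `i':  intercept = " %6.3f _b[_cons] ///
        "  slope = " %6.3f _b[x`i'] "  R-squared = " %6.4f e(r2)
}
```

The four printed lines should agree to about two decimal places, which is the point of the example: the numerical summaries alone cannot tell the datasets apart.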

Here's the first dataset:

This is an example of *pure error*:
the observations appear randomly distributed around the regression line,
just as the model assumes.

For the record, here is the regression output.

    . regress y1 x1

          Source |       SS       df       MS              Number of obs =      11
    -------------+------------------------------           F(  1,     9) =   17.99
           Model |  27.5100011     1  27.5100011           Prob > F      =  0.0022
        Residual |  13.7626904     9  1.52918783           R-squared     =  0.6665
    -------------+------------------------------           Adj R-squared =  0.6295
           Total |  41.2726916    10  4.12726916           Root MSE      =  1.2366

    ------------------------------------------------------------------------------
              y1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
              x1 |   .5000909   .1179055     4.24   0.002     .2333701    .7668117
           _cons |   3.000091   1.124747     2.67   0.026     .4557369    5.544445
    ------------------------------------------------------------------------------

The graph was produced using the following command:

. twoway scatter y1 x1 || lfit y1 x1, title("Pure Error") ytitle(y1) legend(off)

The data in the second dataset look very different, even though the R-squared is the same:

This is an example of *lack of fit*:
the model assumes a linear relationship, but the dependence is in fact
curvilinear. If you add a quadratic term you can increase R-squared to 1.
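The effect of the quadratic term can be verified with a short sketch like the one below. It assumes the Anscombe data are already in memory; the variable name `x2sq` is a hypothetical choice made here.

```stata
* Add the square of the predictor (x2sq is an illustrative name)
generate double x2sq = x2^2

* Refit with both the linear and the quadratic term
regress y2 x2 x2sq

* Show R-squared with extra decimals to see how close it is to 1
display "R-squared with quadratic term = " %8.6f e(r2)
```

Because the published y values are rounded, the displayed R-squared may fall very slightly short of exactly 1.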

For the record this is the regression for the second pair:

    . reg y2 x2

          Source |       SS       df       MS              Number of obs =      11
    -------------+------------------------------           F(  1,     9) =   17.97
           Model |  27.5000024     1  27.5000024           Prob > F      =  0.0022
        Residual |   13.776294     9  1.53069933           R-squared     =  0.6662
    -------------+------------------------------           Adj R-squared =  0.6292
           Total |  41.2762964    10  4.12762964           Root MSE      =  1.2372

    ------------------------------------------------------------------------------
              y2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
              x2 |         .5   .1179638     4.24   0.002     .2331475    .7668526
           _cons |   3.000909   1.125303     2.67   0.026     .4552978     5.54652
    ------------------------------------------------------------------------------

The graph was produced using this command:

. twoway scatter y2 x2 || lfit y2 x2, title("Lack of Fit") ytitle(y2) legend(off)

Here is the graph for the third pair. R-squared is still 0.666, but the data look very different.

This is an example of *an outlier*:
the model specifies a linear relationship and all points but one follow it
exactly. You could increase R-squared to 1 by omitting the outlier.
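One way to check this is to refit the line without the offending point. The sketch below assumes the data are in memory and that the outlier is the observation with the largest residual from the full fit; the names `res3` and `maxres3` are hypothetical.

```stata
* Residuals from the fit to all 11 observations
quietly regress y3 x3
predict double res3, residuals

* Store the largest residual, which is taken here to mark the outlier
quietly summarize res3
scalar maxres3 = r(max)
list x3 y3 res3 if res3 == maxres3

* Refit excluding that single observation
regress y3 x3 if res3 < maxres3
```

The second regression uses the remaining 10 observations, and its R-squared should be essentially 1.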

Here's the regression output:

    . reg y3 x3

          Source |       SS       df       MS              Number of obs =      11
    -------------+------------------------------           F(  1,     9) =   17.97
           Model |  27.4700075     1  27.4700075           Prob > F      =  0.0022
        Residual |  13.7561905     9  1.52846561           R-squared     =  0.6663
    -------------+------------------------------           Adj R-squared =  0.6292
           Total |  41.2261979    10  4.12261979           Root MSE      =  1.2363

    ------------------------------------------------------------------------------
              y3 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
              x3 |   .4997273   .1178777     4.24   0.002     .2330695    .7663851
           _cons |   3.002455   1.124481     2.67   0.026     .4587014    5.546208
    ------------------------------------------------------------------------------

The graph was produced using this command:

. twoway scatter y3 x3 || lfit y3 x3, title("An Outlier") ytitle(y3) legend(off)

Finally, this is the graph for the fourth dataset. Once again, R-squared is about 0.67, the same as in the previous three cases:

This is an example of *influence*:
the slope is completely determined by the observation on the right.
Without that observation we would not be able to estimate the slope,
as all x's would be the same.
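Leverage values make the influence of that observation explicit. The following sketch assumes the data are in memory; the variable name `lev4` is hypothetical, and the observation on the right is identified as the one with the largest value of `x4`.

```stata
* Confirm that all but one observation share the same x value
tabulate x4

* Leverage (diagonal of the hat matrix) from the fit to all 11 observations
quietly regress y4 x4
predict double lev4, leverage
list x4 y4 lev4

* Try to refit without the observation that has the largest x
quietly summarize x4
scalar maxx4 = r(max)
regress y4 x4 if x4 < maxx4
```

The isolated point should show a leverage of essentially 1, and in the last regression Stata should report that `x4` is omitted because of collinearity, since the predictor no longer varies.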

Here's the regression:

    . reg y4 x4

          Source |       SS       df       MS              Number of obs =      11
    -------------+------------------------------           F(  1,     9) =   18.00
           Model |  27.4900007     1  27.4900007           Prob > F      =  0.0022
        Residual |  13.7424908     9  1.52694342           R-squared     =  0.6667
    -------------+------------------------------           Adj R-squared =  0.6297
           Total |  41.2324915    10  4.12324915           Root MSE      =  1.2357

    ------------------------------------------------------------------------------
              y4 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
              x4 |   .4999091   .1178189     4.24   0.002     .2333841    .7664341
           _cons |   3.001727   1.123921     2.67   0.026     .4592411    5.544213
    ------------------------------------------------------------------------------

The graph was produced using this command:

. twoway scatter y4 x4 || lfit y4 x4, title("Influence") ytitle(y4) legend(off)

*Reference*: F. J. Anscombe (1973). Graphs in Statistical Analysis.
*The American Statistician*, **27**(1):17-21.
If you have access to JSTOR you can get the article at the following link:
http://www.jstor.org/stable/2682899.