Germán Rodríguez
Generalized Linear Models, Princeton University

## The Anscombe Datasets

Anscombe (1973) has a nice example where he uses constructed datasets to emphasize the importance of using graphs in statistical analysis. The data are available from the Stata bookstore as part of the support for Kohler and Kreuter's Data Analysis Using Stata, and can be read using the following command:

```
. use http://www.stata-press.com/data/kk/anscombe
(synthetical data (Anscombe 1973))
```

There are 8 variables, representing four pairings of an outcome and a predictor. All sets have 11 observations, the same mean of x (9) and y (7.5), the same fitted regression line (y = 3 + 0.5x), the same regression and residual sums of squares, and therefore the same R-squared of 0.67. But they represent very different situations, as the graph for each dataset makes clear.
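You can verify the shared summary statistics directly after loading the data. A quick check (the variable names follow the dataset above; `correlate` for any of the other pairs gives essentially the same correlation of about 0.82):

```
. summarize x1 y1 x2 y2 x3 y3 x4 y4
. correlate y1 x1
```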

Here's the first dataset. This is an example of pure error: the observations appear randomly distributed around the regression line, just as the model assumes.

For the record, here is the regression output.

```
. regress y1 x1

      Source |       SS       df       MS              Number of obs =      11
-------------+------------------------------           F(  1,     9) =   17.99
       Model |  27.5100011     1  27.5100011           Prob > F      =  0.0022
    Residual |  13.7626904     9  1.52918783           R-squared     =  0.6665
       Total |  41.2726916    10  4.12726916           Root MSE      =  1.2366

------------------------------------------------------------------------------
          y1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .5000909   .1179055     4.24   0.002     .2333701    .7668117
       _cons |   3.000091   1.124747     2.67   0.026     .4557369    5.544445
------------------------------------------------------------------------------
```

The graph was produced using the following command:

```
. twoway scatter y1 x1 || lfit y1 x1, title("Pure Error") ytitle(y1) legend(off)
```

The data in the second dataset look very different, even though the R-squared is the same. This is an example of lack of fit: the model assumes a linear relationship, but the dependence is in fact curvilinear. If you add a quadratic term you can increase R-squared to 1.

For the record this is the regression for the second pair:

```
. reg y2 x2

      Source |       SS       df       MS              Number of obs =      11
-------------+------------------------------           F(  1,     9) =   17.97
       Model |  27.5000024     1  27.5000024           Prob > F      =  0.0022
    Residual |   13.776294     9  1.53069933           R-squared     =  0.6662
       Total |  41.2762964    10  4.12762964           Root MSE      =  1.2372

------------------------------------------------------------------------------
          y2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x2 |         .5   .1179638     4.24   0.002     .2331475    .7668526
       _cons |   3.000909   1.125303     2.67   0.026     .4552978     5.54652
------------------------------------------------------------------------------
```

The graph was produced using this command:

```
. twoway scatter y2 x2 || lfit y2 x2, title("Lack of Fit") ytitle(y2) legend(off)
```
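Since the dependence in the second pair is curvilinear, a quadratic term should fit essentially perfectly. A sketch (the variable name x2sq is made up here):

```
. generate x2sq = x2^2
. regress y2 x2 x2sq
```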

Here is the graph for the third pair. R-squared is still 0.666, but the data look very different. This is an example of an outlier: the model specifies a linear relationship, and all points but one follow it exactly. You could increase R-squared to 1 by omitting the outlier.

Here's the regression output

```
. reg y3 x3

      Source |       SS       df       MS              Number of obs =      11
-------------+------------------------------           F(  1,     9) =   17.97
       Model |  27.4700075     1  27.4700075           Prob > F      =  0.0022
    Residual |  13.7561905     9  1.52846561           R-squared     =  0.6663
       Total |  41.2261979    10  4.12261979           Root MSE      =  1.2363

------------------------------------------------------------------------------
          y3 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x3 |   .4997273   .1178777     4.24   0.002     .2330695    .7663851
       _cons |   3.002455   1.124481     2.67   0.026     .4587014    5.546208
------------------------------------------------------------------------------
```

The graph was produced using this command:

```
. twoway scatter y3 x3 || lfit y3 x3, title("An Outlier") ytitle(y3) legend(off)
```
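To see the effect of omitting the outlier, you can flag the observation with the largest residual and refit without it. A sketch (res3 is a new variable created here; the cutoff of 2 simply separates the one large residual from the rest):

```
. quietly regress y3 x3
. predict res3, residuals
. regress y3 x3 if abs(res3) < 2
```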

Finally, this is the graph for the fourth dataset. Once again R-squared is 0.666, the same as in the previous three cases. This is an example of influence: the slope is completely determined by the observation on the right. Without that observation we would not be able to estimate the regression, as all the remaining x's would be the same.

Here's the regression:

```
. reg y4 x4

      Source |       SS       df       MS              Number of obs =      11
-------------+------------------------------           F(  1,     9) =   18.00
       Model |  27.4900007     1  27.4900007           Prob > F      =  0.0022
    Residual |  13.7424908     9  1.52694342           R-squared     =  0.6667
       Total |  41.2324915    10  4.12324915           Root MSE      =  1.2357

------------------------------------------------------------------------------
          y4 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x4 |   .4999091   .1178189     4.24   0.002     .2333841    .7664341
       _cons |   3.001727   1.123921     2.67   0.026     .4592411    5.544213
------------------------------------------------------------------------------
```

The graph was produced using this command:

```
. twoway scatter y4 x4 || lfit y4 x4, title("Influence") ytitle(y4) legend(off)
```
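The influence of the single point on the right also shows up in the leverage statistic. A sketch (lev4 is a new variable created here; with the ten other observations sharing the same value of x, the influential point has leverage close to 1, while the rest sit near 0.1):

```
. quietly regress y4 x4
. predict lev4, leverage
. list x4 y4 lev4 if lev4 > 0.9
```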

Reference: F. J. Anscombe (1973). Graphs in Statistical Analysis. The American Statistician, 27(1):17-21. If you have access to JSTOR you can get the article at the following link: http://www.jstor.org/stable/2682899.