Models for Clustered and Panel Data
We will illustrate the analysis of clustered or panel data using three examples, two dealing with linear models and one with logit models. The linear model examples use clustered school data on IQ and language ability, and longitudinal state-level data on Aid to Families with Dependent Children (AFDC).
Example 1: IQ and Language Ability
Snijders and Bosker (1999), Multilevel Analysis, have data for 2287 8th-grade children in 131 schools in The Netherlands. The data are available from http://stat.gamma.rug.nl/snijders; follow the link to the ML book. The data are in the file MLBOOK1.DAT, which includes variable names as well as the data. I split that into two separate files and made all variable names lowercase.
. infile schoolnr pupilnr iq_verb iq_perf sex minority repeatgr ///
> aritpret classnr aritpost langpret langpost ses denomina schoolses ///
> satiprin natitest meetings currmeet mixedgra percmino aritdiff ///
> homework classsiz groupsiz using snijders.dat
(2287 observations read)
OLS
We are interested in the relationship between verbal IQ and the score in a language test. OLS gives a highly significant coefficient of 2.65 with a standard error of 0.072:
. reg langpost iq_verb
Source | SS df MS Number of obs = 2287
-------------+------------------------------ F( 1, 2285) = 1352.84
Model | 68915.7639 1 68915.7639 Prob > F = 0.0000
Residual | 116401.529 2285 50.941588 R-squared = 0.3719
-------------+------------------------------ Adj R-squared = 0.3716
Total | 185317.293 2286 81.0661822 Root MSE = 7.1373
------------------------------------------------------------------------------
langpost | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
iq_verb | 2.653896 .0721541 36.78 0.000 2.512401 2.79539
_cons | 9.528484 .8668206 10.99 0.000 7.828646 11.22832
------------------------------------------------------------------------------
Random Effects
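We now take into account the fact that observations are probably correlated within each school because of unobserved school characteristics that affect language scores (such as a good language teacher). A random-effects model represents these characteristics as a school-level random intercept: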
. xtreg langpost iq_verb, i(schoolnr) mle
Fitting constant-only model:
Iteration 0: log likelihood = -8259.3698
Iteration 1: log likelihood = -8143.3601
Iteration 2: log likelihood = -8127.2437
Iteration 3: log likelihood = -8126.6128
Iteration 4: log likelihood = -8126.6092
Fitting full model:
Iteration 0: log likelihood = -7629.2356
Iteration 1: log likelihood = -7625.8966
Iteration 2: log likelihood = -7625.8865
Iteration 3: log likelihood = -7625.8865
Random-effects ML regression Number of obs = 2287
Group variable (i): schoolnr Number of groups = 131
Random effects u_i ~ Gaussian Obs per group: min = 4
avg = 17.5
max = 35
LR chi2(1) = 1001.45
Log likelihood = -7625.8865 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
langpost | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
iq_verb | 2.488094 .0705261 35.28 0.000 2.349865 2.626323
_cons | 11.16511 .8822371 12.66 0.000 9.435956 12.89426
-------------+----------------------------------------------------------------
/sigma_u | 3.081719 .2552303 12.07 0.000 2.581476 3.581961
/sigma_e | 6.498244 .0991428 65.54 0.000 6.303928 6.69256
-------------+----------------------------------------------------------------
rho | .1836084 .0255577 .137803 .237875
------------------------------------------------------------------------------
Likelihood-ratio test of sigma_u=0: chibar2(01)= 225.92 Prob>=chibar2 = 0.000
The coefficient of verbal IQ is 2.49 with a standard error of 0.071 and is still highly significant. We have also learned that language scores are correlated within schools: 18.4% of the variation in language scores net of verbal IQ can be attributed to the schools (the rest is due to the pupils). The intra-class correlation is highly significant, as shown by a likelihood-ratio test statistic of 225.9 (conservatively, a chi-squared with 1 d.f.).
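The intra-class correlation can be checked by hand from the two standard deviations in the output, since rho = sigma_u^2 / (sigma_u^2 + sigma_e^2). The lines below are a minimal sketch of that arithmetic; the scalar names sig_u, sig_e and rho are my own choices, and the values are typed in from the table above. The last line halves the chi-squared tail area, on the assumption that chibar2(01) denotes a 50:50 mixture of a point mass at zero and a chi-squared with 1 d.f.

```stata
* Variance components copied from the random-effects ML output above
scalar sig_u = 3.081719
scalar sig_e = 6.498244

* Share of the residual variance that lies between schools
scalar rho = scalar(sig_u)^2 / (scalar(sig_u)^2 + scalar(sig_e)^2)
display "rho = " %7.4f scalar(rho)

* Boundary test of sigma_u = 0: half the chi-squared(1) tail area
display "p = " 0.5 * chi2tail(1, 225.92)
```

The first display should agree with the rho of .1836 reported by xtreg. Using the full chi-squared(1) tail instead of half of it doubles the p-value, which is why that reference distribution is conservative.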
Fixed-Effects (Within)
We now consider a fixed-effects model that allows for the possibility of a correlation between unobserved school characteristics and verbal IQ (the school with the good teacher attracts brighter students):
. xtreg langpost iq_verb, i(schoolnr) fe
Fixed-effects (within) regression Number of obs = 2287
Group variable (i): schoolnr Number of groups = 131
R-sq: within = 0.3452 Obs per group: min = 4
between = 0.5985 avg = 17.5
overall = 0.3719 max = 35
F(1,2155) = 1135.95
corr(u_i, Xb) = 0.1463 Prob > F = 0.0000
------------------------------------------------------------------------------
langpost | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
iq_verb | 2.414772 .0716466 33.70 0.000 2.274269 2.555276
_cons | 12.35828 .858667 14.39 0.000 10.67438 14.04219
-------------+----------------------------------------------------------------
sigma_u | 3.7161754
sigma_e | 6.4913354
rho | .2468383 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(130, 2155) = 4.67 Prob > F = 0.0000
Our results are very robust: the coefficient of verbal IQ is 2.41 with a standard error of 0.072, close to the random-effects estimate. We feel pretty confident in our conclusions. Note that we also get an F-test for the school effects, which are highly significant.
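To see why this is called the within estimator, it may help to reproduce it by hand: subtract the school mean from each pupil's score and verbal IQ, then run OLS on the deviations. The following is a sketch under the assumption that the data are loaded as above; the variable names mean_lang, mean_iq, dev_lang and dev_iq are introduced here for illustration only.

```stata
* School means of the outcome and the predictor
egen mean_lang = mean(langpost), by(schoolnr)
egen mean_iq   = mean(iq_verb), by(schoolnr)

* Deviation of each pupil from his or her school mean
generate dev_lang = langpost - mean_lang
generate dev_iq   = iq_verb - mean_iq

* OLS on the deviations: the slope is the within estimate, but the
* standard error is based on 2286 rather than 2155 residual d.f.
regress dev_lang dev_iq, noconstant

* areg absorbs the school dummies and uses the correct d.f.
areg langpost iq_verb, absorb(schoolnr)
```

The slope from the regression on deviations should match the fixed-effects coefficient, while its standard error comes out slightly too small because the 131 estimated school means are not counted. The areg command avoids that problem by fitting the school dummies implicitly.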
Group Means (Between)
If you are not deterred by the ecological fallacy you could have analyzed group means. Stata makes this easy with the be option. We also use wls to weight schools in proportion to the number of students (not that it makes a huge difference):
. xtreg langpost iq_verb, i(schoolnr) be wls
Between regression (regression on group means) Number of obs = 2287
Group variable (i): schoolnr Number of groups = 131
R-sq: within = 0.3452 Obs per group: min = 4
between = 0.5137 avg = 17.5
overall = 0.3719 max = 35
F(1,129) = 136.29
sd(u_i + avg(e_i.))= 3.173519 Prob > F = 0.0000
------------------------------------------------------------------------------
langpost | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
iq_verb | 3.899369 .3340076 11.67 0.000 3.238527 4.560211
_cons | -5.210525 3.962379 -1.31 0.191 -13.05019 2.62914
------------------------------------------------------------------------------
This gives a much larger coefficient of 3.90, albeit with a larger standard error of 0.334. Clearly, working with aggregate data would overestimate the pupil-level relationship between verbal IQ and language scores. Note that the random-effects estimate lies between the within and between estimates (with a single predictor it is essentially a weighted average of the two).
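The between regression can also be sketched by hand: collapse the data to one record per school and regress the mean language score on the mean verbal IQ, weighting by school size. This is an illustration rather than a description of what xtreg does internally; the variable n is a name chosen here, and I am assuming that analytic weights equal to the number of pupils correspond to the wls option, so the coefficients should be close to those shown above.

```stata
* Work on a temporary copy so the pupil-level data can be restored
preserve

* One record per school: mean scores and the number of pupils
collapse (mean) langpost iq_verb (count) n = pupilnr, by(schoolnr)

* Regression on school means, weighted by school size
regress langpost iq_verb [aweight = n]

* Return to the original pupil-level data
restore
```

Dropping the weight from the regress command gives the unweighted version, which is what the be option fits by default.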
The following figure (which I did in R because I couldn't figure out how to do it in Stata without a lot of work) shows the data, separate regression fits for each of the 131 schools, and the between, within, and random-effects estimates.

Continue with Example 2: Longitudinal Linear Model

