Likelihood via Quadrature
These notes document the January 2001 version of lr3, a library of S (and R) functions for fitting two and threelevel logistic regression models using GaussHermite quadrature. I am keeping this page here for the record, but these functions are no longer needed in light of the excellent adaptive quadrature routines implemented in Stata.
1. Overview
Please refer to Section 4 for installation instructions. Section 2 describes fitting twolevel models and serves to introduce the software. Section 3 deals with issues that are specific to threelevel models and therefore builds on the previous section.
2. TwoLevel Models
The first two sections describe the main fitting procedure and present an example. The remaining (starred) sections deal with more technical issues and may be omitted at first reading.
2.1 Model Fitting with lr2
The toplevel function here is lr2
.
It's structure is very similar to lm
and glm
:
lr2(formula, level2.id, data = sys.parent(), frame = F, quad.pts = 12, beta = NULL, sigma = NULL, ...)
You specify the model using a model formula, a variable that represents the level2 id, and optionally a data frame to be used in evaluating the outcome, predictors and id. The data frame defaults to the session frame.
The frame
parameter is used to indicate that the
function should set up all data structures to be used in estimation
but stop short of actually fitting the model. (This option is
useful in complex problems when one may want to call the function
optimizer directly, as described in Section 2.4 below.)
Note that the data must be sorted by id. The function will check that the id's are indeed sorted before constructing a frame for estimation.
If a model is to be fitted then one may specify the number of quadrature points to be used, which defaults to 12. One may use fewer points to obtain initial estimates. One may also fit the model with different numbers of points to check if the quadrature procedure is working, as done by Stata.
Optionally one may provide starting values for the fixed effects. If this parameter is omitted, the function will use glm to fit an ordinary logit model and use the resulting estimates as starting values.
One may also provide starting values for the standard deviation of the random effects. If this parameter is omitted the function uses 0.6. Note that 0 is not an acceptable starting value, this parameter must be nonnegative. Also, the iterative procedure works internally with the log of sigma, to avoid problems with small values becoming negative in the course of iteration.
The final dieresis ...
refers to parameters that
are passed directly to the iterative procedure. In S the procedure
used is ms
. I recommend setting trace=T
to trace the iterations; it provides reassurance that the computer
is actually doing something.
When the iterative procedure completes we make a final call to the evaluation routine to obtain not just the loglikelihood and its first derivatives, but also the second derivatives, which are needed for computing standard errors. This evaluation tends to take much longer than those that bypass the hessian calculation.
The return value of the function is an object of class "lr2".
This is basically the object returned by ms
, except
that we try to tidy up the names of the parameters and add a
couple of additional elements to the list, including the
number of quadrature points used. The class has its own print
and summary methods, as illustrated below.
2.2 Example: Hospital Deliveries
The example we use is one of the samples in Lillard and Panis (2000),
Chapter 3. The data are available as a data frame that can be read
into S with dget
, and has 5 variables: id, hospital,
income, distance, and education (1= less than high school, 2= high school,
3 = college). Here is a listing of the first three lines:
> hosp[1:3,] id educ income distance hospital 1 4 3 76 1.7 0 2 17 2 275 7.9 0 3 17 2 200 1.8 1
The following instruction fits a twolevel logit model using the
natural log of income and introducing dummies for dropouts and
college graduates, as done in Lillard and Panis (2000). We
assign the resulting object to h12
.
I suggest you use the trace=T
option when you
try this.
> h12 < lr2(hospital~log(income)+distance+(educ==1)+(educ==3), + level2.id=id, data=hosp) RELATIVE FUNCTION CONVERGENCE.
Typing the name of the object prints it:
> h12 2level logit model observations: 1060 501 logL = 522.654827451429 Coefficients: (Intercept) log(income) distance educ == 1 educ == 3 log(sigma) sigma 3.368158 0.5620118 0.07661271 1.997531 1.033676 0.2174697 1.242928
In the spirit of S, if you want a more conventional table with
estimates and standard errors, use the summary
method:
> summary(h12) n2level logit model observations: 1060 501 logL = 522.654827451429 Coef Std. Error t value (Intercept) 3.36815845 0.47892802 7.032703 log(income) 0.56201176 0.07269351 7.731251 distance 0.07661271 0.03234586 2.368548 educ == 1 1.99753084 0.25573359 7.810983 educ == 3 1.03367611 0.38846467 2.660927 log(sigma) 0.21746975 0.15736486 1.381946 sigma 1.24292783 0.19559316 6.354659
The results agree with aML, and agree even better with the output of Stata's xtlogit procedure, shown below for the same data. Note that Stata uses log(sigma^2) as the parameter. Note also that aML uses a different method for estimating standard errors, which gives results close to, but distinct from ours and Stata's (not shown).
Randomeffects logit Number of obs = 1060 Group variable (i) : id Number of groups = 501 Random effects u_i ~ Gaussian Obs per group: min = 1 avg = 2.1 max = 10 Wald chi2(4) = 110.08 Log likelihood = 522.65483 Prob > chi2 = 0.0000  hospital  Coef. Std. Err. z P>z [95% Conf. Interval] + login  .5620118 .0726935 7.731 0.000 .4195351 .7044885 distance  .0766127 .0323459 2.369 0.018 .1400094 .013216 dropout  1.997532 .2557338 7.811 0.000 2.498761 1.496303 college  1.033677 .3884649 2.661 0.008 .2723 1.795055 _cons  3.368159 .4789281 7.033 0.000 4.306841 2.429477 + /lnsig2u  .4349417 .3147296 1.382 0.167 .1819169 1.0518 + sigma_u  1.242929 .1955933 .9130556 1.691981 rho  .6070531 .0750755 .4546458 .7411205  Likelihood ratio test of rho=0: chi2(1) = 29.61 Prob > chi2 = 0.0000
*2.3 Strategies for Convergence
Good starting values can help. You may want to try
starting the procedure using 6 quadrature points or so,
perhaps imposing a limit of 10 iterations.
You can do this with the parameters
control = ms.control(maxiter = 10), trace= T
,
which are passed directly to ms
.
How do you get the previous estimates, so you can
start from there? You extract them from the
"lr2" (or "ms") object using the coef()
function, or directly as the par
component.
For 2level models, if np
is the length
of the parameter vector then object$par[np]
gives the betas or fixed effects (everything but the
last parameter), and exp(object$par[np])
gives the sigma.
*2.4 Data Structures
If you examine the code for lr2
you will notice that it
does two main things: it sets up some data structures and then calls
S's function minimizer, which is turn calls back a function called
lr2.g
to compute (minus) the loglikelihood and its
derivative.
The structures are simple: we need a vector y with the outcome and
a model matrix X that codes all factors using dummy variables and
has an extra column for the random effect. You could obtain
a base X matrix by calling lm
or glm
with
the model formula and x=T
, and then use
cbind(X,0)
to add a new column.
The function also tabulates the number of level1 units in each
level2 unit, something you could do yourself with
table(id2)
. Setting frame=T
returns
a list with y, X and the tabulation, called n12
,
which are what the criterion function actually needs.
The other thing this function does is calculate the quadrature
abscissas and weights, by calling gauher(quad.pts)
.
You could do this yourself, and save oftenused objects
such as gq12=gauher(12)
so you don't recalculate
them all the time. A quadrature object is a data frame with
a row for each quadrature point and two columns called x and w.
Once you have y, X, n12 and a quadrature object you can call the actual criterion function yourself:
lr2.g( y, X, theta, n12, gq12)
Where theta
is a vector of parameter estimates with
the betas followed by log(sigma). This function calls C code in
the lr3 library, and returns the (negative) loglikelihood
with the gradient as an attribute. It's twin lr2.h
also calculates and returns the hessian as an attribute.
Minimizing the function is a simple call to ms
, with
an indication of the parameters and their starting values:
ms(~lr2.g(y, X, par, n12, gq12), start=list(par=c(b,ls)))
Where c(b, ls)
is a vector of starting values for
the betas and log(sigma). The minimizer takes the names of the
parameters from the name used in the call (in the example par)
and the names, if any, of the starting values, which will be
informative if they come from a glm fit.
Note that if you are close to the m.l.e.'s, you might try
using second derivatives by calling lr2.h
instead of lr2.g
. In practice, however, you
are unlikely to gain much: each iteration will take longer
and you might not save as many iterations as you might expect.
When you call ms
yourself the resulting object is
of class "ms", not "lr2",
so you get the print, summary and coef methods for ms objects.
You can assign the "lr2" class to the object or call
the lr2.print
and lr2.summary
methods,
but this is not guaranteed to work.
3. ThreeLevel Models
Please read the discussion of twolevel models if you haven't done so already, as we describe only what's new with threelevel models.
3.1 Model Fitting with lr3
The toplevel function for fitting threelevel models is lr3
,
with arguments as follows:
lr3(formula, level2.id, level3.id, data = sys.parent(), frame = F, quad.pts = 12, beta = NULL, sigma = NULL, ...)
The structure of the function and the arguments are eactly the same as
for the twolevel model, except that we now have a level2.id
and a level.3
, and obviously sigma
must have
two components, representing variation at levels 2 and 3 in that order.
3.2 Example: Immunization in Guatemala
The example we will use is the model of complete immunization
in Guatemala used in RodrÃguez and Goldman(2001). The
data are provided as an S data frame that can be read with
dget
. I called the data frame guimr
.
Here are the first three rows:
> guimr[1:3,] kid mom cluster numvac kid2p mom25p orderc eth wed hed everwork rural pcind81 1 2 2 1 1 T F 1 Ladino Sec+ Sec+ F F 0.1075042 2 269 185 36 0 T F 23 Ladino Prim Prim T F 0.0437295 3 272 186 36 0 T F 1 Ladino Prim Sec+ T F 0.0437295
Note that in the spirit of S, categorical variables are coded as true/false when they are binary and using labels when they have more than two categories. Here is the model used in our paper:
> g12 < lr3( numvac ~ kid2p + mom25p + orderc + eth + wed + hed + everwork + rural + pcind81, mom, cluster, data=guimr) RELATIVE FUNCTION CONVERGENCE.
Note that we specify "mom" as our level2.id
and "cluster"
(or community) as our level3.id
. Here are the results of
printing the object:
level logit model observations: 2159 1595 161 logL = 1323.9618222914 Coefficients: (Intercept) kid2p mom25p orderc23 orderc46 orderc716 ethIndNoSpa ethIndSpa wedPrim 1.2334 1.716657 0.2159945 0.260719 0.1793735 0.4310829 0.1724736 0.08364888 0.433026 wedSec+ hedPrim hedSec+ hedDK everwork rural pcind81 log(sigma2) log(sigma3) 0.4200052 0.5393486 0.5051097 0.007241173 0.3900436 0.8882568 1.149777 0.8397855 0.02431688 sigma2 sigma3 2.31587 1.024615
And here is the more detailed summary:
> summary(g12) 3level logit model observations: 2159 1595 161 logL = 1323.9618222914 Coef Std. Error t value (Intercept) 1.233400431 0.4831806 2.55266939 kid2p 1.716657175 0.2173257 7.89900598 mom25p 0.215994520 0.2316288 0.93250286 orderc23 0.260719016 0.2319183 1.12418458 orderc46 0.179373547 0.2944780 0.60912382 orderc716 0.431082926 0.3720310 1.15872859 ethIndNoSpa 0.172473621 0.4893662 0.35244284 ethIndSpa 0.083648877 0.3633018 0.23024623 wedPrim 0.433026006 0.2224304 1.94679307 wedSec+ 0.420005173 0.4843674 0.86712108 hedPrim 0.539348585 0.2322901 2.32187475 hedSec+ 0.505109742 0.4142667 1.21928633 hedDK 0.007241173 0.3568837 0.02029001 everwork 0.390043560 0.2026816 1.92441556 rural 0.888256819 0.3062250 2.90066765 pcind81 1.149777002 0.5003787 2.29781355 log(sigma2) 0.839785461 0.1129482 7.43513559 log(sigma3) 0.024316884 0.1562851 0.15559313 sigma2 2.315870079 0.2615734 8.85361314 sigma3 1.024614951 0.1601320 6.39856332
These results are actually quite close to the published results obtained with an earlier version of the program using 20 quadrature points.
We have also verified the estimates by comparison with aML runs with 5, 12 and 20 quadrature points. As noted earlier, aML uses a different method for estimating standard errors. We have verified ours by using numerical estimates of the hessian or matrix of second derivatives.
*3.3. Strategies for Convergence
Everything we said about twolevel models applies here as well, except that there is some help for extracting the coefficients.
The coef()
method for lr3 fits has
two true/false parameters that are used to indicate whether one
wants beta and sigma, respectively. Thus, the function has
signature
coef(object, beta=T, sigma=T)
The defaults ensure that the usual call, for example
coef(g12)
, returns all estimates, converting
log(sigmas) to sigmas. You can also use

coef(g12,T,F)
, to get the fixed effects, and 
coef(g12,F,T)
, to get the standard deviations of the random effects.
The next example shows how to estimate the model using 20
quadrature points starting from the results obtained using 12,
which have been saved as g12
g20 < lr3(guim.model, mom, kid, data=guimr, quad.pts=20, beta=coef(g12,T,F), sigma=coef(g12,F,T))
3.4 The Threelevel Structures
The threelevel function has to do all the work of its twolevel counterpart, plus some. Here is what's new at this level.
In addition to checking that the data are sorted by
level2 id, we need to check that they are sorted by level3
id. This is done by a helper function is.sorted(id)
.
Also, we need to check that the level2 id is nested on
level3. What we mean by this is that if the level3 id
changes, the level2 id must change as well, a fact checked
by the function is.nested(id2,id3)
.
If all goes well the id's are tabulated to count the number
of level1 units per level2, which is a vector n12
,
and the number of level1 units per level3, which goes in
n13
. We also need the number of unit2 units
per level3, which is calculated by the function make.n23
and stored in n23
.
The X matrix is expanded to add two columns. Note that
forgetting to add room for the random effects will usually
crash S. This is one reason why going through lr3
pays.
The frame
option returns a list with y, X and the vectors
n12, n13 and n23, which together with a quadrature object are
the needed input for the criterion function:
lr3.g(y, X, theta, n12, n13, n23, gq)
This function calculates the (negative) loglikelihoood and its
gradient by calling C code.
The twin function lr3.h
calculates the hessian as well.
4. Installation
The Ssource for all the functions discussed here is available on
a file called lr3.s
, which can be sourced to create
the necessary objects. If the file is in the working directory use
source("lr3.s")
Otherwise you need a fully qualified path.
You also need the compiled C code, which is available as
relocatable code lr3.o
for our Unix system
and as a dynamic link library lr3.dll
for
Windows systems.
This code has to be loaded before it can be used. On Unix use
dyn.load("lr3.o")
On Splus for Windows use
dll.load("lr3.dll",lr3.entry.points)
where lr3.entry.points
is an object listing
all entry points in the DLL. This object is created when
you source "lr3.s".
The two sample datasets are available as "hospital.dat" and "guimr.dat".