Germán Rodríguez
Statistics and Population, Princeton University

Likelihood via Quadrature

These notes document the January 2001 version of lr3, a library of S (and R) functions for fitting two and three-level logistic regression models using Gauss-Hermite quadrature. I am keeping this page here for the record, but these functions are no longer needed in light of the excellent adaptive quadrature routines implemented in Stata.

1. Overview

Please refer to Section 4 for installation instructions. Section 2 describes fitting two-level models and serves to introduce the software. Section 3 deals with issues that are specific to three-level models and therefore builds on the previous section.

2. Two-Level Models

The first two sections describe the main fitting procedure and present an example. The remaining (starred) sections deal with more technical issues and may be omitted at first reading.

2.1 Model Fitting with lr2

The top-level function here is lr2. Its structure is very similar to that of lm and glm:

lr2(formula, level2.id, data = sys.parent(), 
	frame = F, quad.pts = 12, beta = NULL, sigma = NULL, ...)

You specify the model using a model formula, a variable that represents the level-2 id, and optionally a data frame to be used in evaluating the outcome, predictors and id. The data frame defaults to the session frame.

The frame parameter is used to indicate that the function should set up all data structures to be used in estimation but stop short of actually fitting the model. (This option is useful in complex problems when one may want to call the function optimizer directly, as described in Section 2.4 below.)

Note that the data must be sorted by id. The function will check that the id's are indeed sorted before constructing a frame for estimation.
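
For example, if the check fails you can sort the data frame yourself before fitting. A one-line fix, assuming a data frame called hosp with an id column, as in the example of Section 2.2:

> hosp <- hosp[order(hosp$id), ]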

If a model is to be fitted then one may specify the number of quadrature points to be used, which defaults to 12. One may use fewer points to obtain initial estimates. One may also fit the model with different numbers of points to check if the quadrature procedure is working, as done by Stata.

Optionally one may provide starting values for the fixed effects. If this parameter is omitted, the function will use glm to fit an ordinary logit model and use the resulting estimates as starting values.

One may also provide starting values for the standard deviation of the random effects. If this parameter is omitted the function uses 0.6. Note that 0 is not an acceptable starting value; this parameter must be strictly positive. Also, the iterative procedure works internally with the log of sigma, to avoid problems with small values becoming negative in the course of iteration.

The final ellipsis (...) refers to parameters that are passed directly to the iterative procedure. In S the procedure used is ms. I recommend setting trace=T to trace the iterations; it provides reassurance that the computer is actually doing something.

When the iterative procedure completes we make a final call to the evaluation routine to obtain not just the log-likelihood and its first derivatives, but also the second derivatives, which are needed for computing standard errors. This evaluation tends to take much longer than those that bypass the hessian calculation.

The return value of the function is an object of class "lr2". This is basically the object returned by ms, except that we try to tidy up the names of the parameters and add a couple of additional elements to the list, including the number of quadrature points used. The class has its own print and summary methods, as illustrated below.

2.2 Example: Hospital Deliveries

The example we use comes from Chapter 3 of Lillard and Panis (2000). The data are available as a data frame that can be read into S with dget, and has five variables: id, hospital, income, distance, and educ (coded 1 = less than high school, 2 = high school, 3 = college). Here is a listing of the first three lines:

> hosp[1:3,]
  id educ income distance hospital 
1  4    3     76      1.7        0
2 17    2    275      7.9        0
3 17    2    200      1.8        1
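
To reproduce this yourself, read the file hospital.dat distributed with the software (see Section 4); the Guatemala data of Section 3, in guimr.dat, are read the same way:

> hosp <- dget("hospital.dat")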

The following instruction fits a two-level logit model using the natural log of income and introducing dummies for dropouts and college graduates, as done in Lillard and Panis (2000). We assign the resulting object to h12. I suggest you use the trace=T option when you try this.

> h12 <- lr2(hospital~log(income)+distance+(educ==1)+(educ==3),
+ level2.id=id, data=hosp)
RELATIVE FUNCTION CONVERGENCE. 

Typing the name of the object prints it:

> h12
2-level logit model
observations:  1060 501 
logL =  -522.654827451429 
Coefficients:
 (Intercept) log(income)    distance educ == 1 educ == 3 log(sigma)    sigma 
   -3.368158   0.5620118 -0.07661271 -1.997531  1.033676  0.2174697 1.242928

In the spirit of S, if you want a more conventional table with estimates and standard errors, use the summary method:

> summary(h12)
2-level logit model
observations:  1060 501 
logL =  -522.654827451429 
                   Coef Std. Error   t value 
(Intercept) -3.36815845 0.47892802 -7.032703
log(income)  0.56201176 0.07269351  7.731251
   distance -0.07661271 0.03234586 -2.368548
  educ == 1 -1.99753084 0.25573359 -7.810983
  educ == 3  1.03367611 0.38846467  2.660927
 log(sigma)  0.21746975 0.15736486  1.381946
      sigma  1.24292783 0.19559316  6.354659

The results agree with aML, and agree even better with the output of Stata's xtlogit procedure, shown below for the same data. Note that Stata uses log(sigma^2) as the parameter. Note also that aML uses a different method for estimating standard errors, which gives results close to, but distinct from ours and Stata's (not shown).

Random-effects logit                            Number of obs      =      1060
Group variable (i) : id                         Number of groups   =       501
 
Random effects u_i ~ Gaussian                   Obs per group: min =         1
                                                               avg =       2.1
                                                               max =        10
 
                                                Wald chi2(4)       =    110.08
Log likelihood  = -522.65483                    Prob > chi2        =    0.0000
 
------------------------------------------------------------------------------
hospital |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
   login |   .5620118   .0726935      7.731   0.000       .4195351    .7044885
distance |  -.0766127   .0323459     -2.369   0.018      -.1400094    -.013216
 dropout |  -1.997532   .2557338     -7.811   0.000      -2.498761   -1.496303
 college |   1.033677   .3884649      2.661   0.008          .2723    1.795055
   _cons |  -3.368159   .4789281     -7.033   0.000      -4.306841   -2.429477
---------+--------------------------------------------------------------------
/lnsig2u |   .4349417   .3147296      1.382   0.167      -.1819169      1.0518
---------+--------------------------------------------------------------------
 sigma_u |   1.242929   .1955933                          .9130556    1.691981
     rho |   .6070531   .0750755                          .4546458    .7411205
------------------------------------------------------------------------------
Likelihood ratio test of rho=0:      chi2(1) =    29.61   Prob > chi2 = 0.0000
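
The correspondence between the two parameterizations is easy to check by hand: doubling our log(sigma) estimate recovers Stata's /lnsig2u up to convergence tolerance.

> 2 * 0.21746975          # twice our log(sigma)
[1] 0.4349395             # compare Stata's /lnsig2u of 0.4349417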

*2.3 Strategies for Convergence

Good starting values can help. You may want to start the procedure using 6 quadrature points or so, perhaps imposing a limit of 10 iterations. You can do this with the parameters control = ms.control(maxiter = 10), trace = T, which are passed directly to ms, as in the sketch below.
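
Here is what that looks like for the hospital example of Section 2.2; this is just a sketch combining the options described above:

> h6 <- lr2(hospital~log(income)+distance+(educ==1)+(educ==3),
+ level2.id=id, data=hosp, quad.pts=6,
+ control=ms.control(maxiter=10), trace=T)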

How do you get the previous estimates, so you can start from there? You extract them from the "lr2" (or "ms") object using the coef() function, or directly as the par component.

For 2-level models, if np is the length of the parameter vector then object$par[-np] gives the betas or fixed effects (everything but the last parameter), and exp(object$par[np]) gives the sigma.
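
In code, continuing from the six-point fit sketched above (h6 is a name I made up for that object), we can refit the model of Section 2.2 with 12 points:

> np <- length(h6$par)
> h12 <- lr2(hospital~log(income)+distance+(educ==1)+(educ==3),
+ level2.id=id, data=hosp, quad.pts=12,
+ beta=h6$par[-np], sigma=exp(h6$par[np]))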

*2.4 Data Structures

If you examine the code for lr2 you will notice that it does two main things: it sets up some data structures and then calls S's function minimizer, which in turn calls back a function called lr2.g to compute (minus) the log-likelihood and its derivative.

The structures are simple: we need a vector y with the outcome and a model matrix X that codes all factors using dummy variables and has an extra column for the random effect. You could obtain a base X matrix by calling lm or glm with the model formula and x=T, and then use cbind(X,0) to add a new column.
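
For instance, here is one way to build these pieces by hand for the hospital example; a sketch only, since lr2 does this for you:

> fit0 <- glm(hospital~log(income)+distance+(educ==1)+(educ==3),
+ family=binomial, data=hosp, x=T)
> y <- hosp$hospital      # the outcome
> X <- cbind(fit0$x, 0)   # model matrix plus a column for the random effect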

The function also tabulates the number of level-1 units in each level-2 unit, something you could do yourself with table(id2). Setting frame=T returns a list with y, X and the tabulation, called n12, which are what the criterion function actually needs.

The other thing this function does is calculate the quadrature abscissas and weights, by calling gauher(quad.pts). You could do this yourself, and save often-used objects such as gq12=gauher(12) so you don't recalculate them all the time. A quadrature object is a data frame with a row for each quadrature point and two columns called x and w.

Once you have y, X, n12 and a quadrature object you can call the actual criterion function yourself:

lr2.g(y, X, theta, n12, gq12)

Here theta is a vector of parameter values with the betas followed by log(sigma). This function calls C code in the lr3 library, and returns the (negative) log-likelihood with the gradient as an attribute. Its twin lr2.h also calculates and returns the hessian as an attribute.
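
Putting the pieces together for the hospital example, here is a sketch of a direct evaluation at the default starting values (fit0 is the glm fit sketched above; the frame components y, X and n12 are the ones described earlier):

> fr <- lr2(hospital~log(income)+distance+(educ==1)+(educ==3),
+ level2.id=id, data=hosp, frame=T)
> gq12 <- gauher(12)
> theta <- c(coef(fit0), log(0.6))   # betas followed by log(sigma)
> lr2.g(fr$y, fr$X, theta, fr$n12, gq12)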

Minimizing the function is a simple call to ms, with an indication of the parameters and their starting values:

ms(~lr2.g(y, X, par, n12, gq12), start=list(par=c(b,ls)))

Here c(b, ls) is a vector of starting values for the betas and log(sigma). The minimizer takes the names of the parameters from the name used in the call (par in the example) and from the names, if any, of the starting values, which will be informative if they come from a glm fit.

Note that if you are close to the m.l.e.'s you might try using second derivatives by calling lr2.h instead of lr2.g. In practice, however, you are unlikely to gain much: each iteration will take longer and you may not save as many iterations as you expect.

When you call ms yourself the resulting object is of class "ms", not "lr2", so you get the print, summary and coef methods for ms objects. You can assign the "lr2" class to the object or call the lr2.print and lr2.summary methods, but this is not guaranteed to work.

3. Three-Level Models

Please read the discussion of two-level models if you haven't done so already, as we describe only what's new with three-level models.

3.1 Model Fitting with lr3

The top-level function for fitting three-level models is lr3, with arguments as follows:

lr3(formula, level2.id, level3.id, data = sys.parent(), 
	frame = F, quad.pts = 12, beta = NULL, sigma = NULL, ...)

The structure of the function and the arguments are exactly the same as for the two-level model, except that we now have a level2.id and a level3.id, and obviously sigma must have two components, representing variation at levels 2 and 3, in that order.

3.2 Example: Immunization in Guatemala

The example we will use is the model of complete immunization in Guatemala used in Rodríguez and Goldman (2001). The data are provided as an S data frame that can be read with dget. I called the data frame guimr. Here are the first three rows:

> guimr[1:3,]
  kid mom cluster numvac kid2p mom25p orderc    eth  wed  hed everwork rural   pcind81 
1   2   2       1      1     T      F      1 Ladino Sec+ Sec+        F     F 0.1075042
2 269 185      36      0     T      F    2-3 Ladino Prim Prim        T     F 0.0437295
3 272 186      36      0     T      F      1 Ladino Prim Sec+        T     F 0.0437295

Note that in the spirit of S, categorical variables are coded as true/false when they are binary and using labels when they have more than two categories. Here is the model used in our paper:

> g12 <- lr3( numvac ~ kid2p + mom25p + orderc + eth + wed + hed + everwork + 
	rural + pcind81, mom, cluster, data=guimr)
RELATIVE FUNCTION CONVERGENCE. 

Note that we specify "mom" as our level2.id and "cluster" (or community) as our level3.id. Here are the results of printing the object:

> g12
3-level logit model
observations:  2159 1595 161 
logL =  -1323.9618222914 
Coefficients:
 (Intercept)    kid2p     mom25p orderc2-3 orderc4-6 orderc7-16 ethIndNoSpa   ethIndSpa  wedPrim 
     -1.2334 1.716657 -0.2159945 -0.260719 0.1793735  0.4310829  -0.1724736 -0.08364888 0.433026
   wedSec+   hedPrim   hedSec+        hedDK  everwork      rural   pcind81 log(sigma2) log(sigma3) 
 0.4200052 0.5393486 0.5051097 -0.007241173 0.3900436 -0.8882568 -1.149777   0.8397855  0.02431688
  sigma2   sigma3 
 2.31587 1.024615

And here is the more detailed summary:

> summary(g12)
3-level logit model
observations:  2159 1595 161 
logL =  -1323.9618222914 
                    Coef Std. Error     t value 
(Intercept) -1.233400431  0.4831806 -2.55266939
      kid2p  1.716657175  0.2173257  7.89900598
     mom25p -0.215994520  0.2316288 -0.93250286
  orderc2-3 -0.260719016  0.2319183 -1.12418458
  orderc4-6  0.179373547  0.2944780  0.60912382
 orderc7-16  0.431082926  0.3720310  1.15872859
ethIndNoSpa -0.172473621  0.4893662 -0.35244284
  ethIndSpa -0.083648877  0.3633018 -0.23024623
    wedPrim  0.433026006  0.2224304  1.94679307
    wedSec+  0.420005173  0.4843674  0.86712108
    hedPrim  0.539348585  0.2322901  2.32187475
    hedSec+  0.505109742  0.4142667  1.21928633
      hedDK -0.007241173  0.3568837 -0.02029001
   everwork  0.390043560  0.2026816  1.92441556
      rural -0.888256819  0.3062250 -2.90066765
    pcind81 -1.149777002  0.5003787 -2.29781355
log(sigma2)  0.839785461  0.1129482  7.43513559
log(sigma3)  0.024316884  0.1562851  0.15559313
     sigma2  2.315870079  0.2615734  8.85361314
     sigma3  1.024614951  0.1601320  6.39856332

These results are actually quite close to the published results obtained with an earlier version of the program using 20 quadrature points.

We have also verified the estimates by comparison with aML runs with 5, 12 and 20 quadrature points. As noted earlier, aML uses a different method for estimating standard errors. We have verified ours by using numerical estimates of the hessian or matrix of second derivatives.

*3.3 Strategies for Convergence

Everything we said about two-level models applies here as well; in addition, there is some help for extracting the coefficients.

The coef() method for lr3 fits has two true/false parameters that are used to indicate whether one wants beta and sigma, respectively. Thus, the function has signature

coef(object, beta=T, sigma=T)

The defaults ensure that the usual call, for example coef(g12), returns all estimates, converting log(sigmas) to sigmas. You can also request just the betas with coef(g12, sigma=F), or just the sigmas with coef(g12, beta=F).

The next example shows how to estimate the model using 20 quadrature points, starting from the results obtained using 12, which we saved as g12. Here guim.model stands for the model formula used in Section 3.2:

g20 <- lr3(guim.model, mom, cluster, data=guimr, quad.pts=20,
	beta=coef(g12,T,F), sigma=coef(g12,F,T))

*3.4 The Three-Level Structures

The three-level function has to do all the work of its two-level counterpart, plus some. Here is what's new at this level.

In addition to checking that the data are sorted by the level-2 id, we need to check that they are sorted by the level-3 id. This is done by a helper function, is.sorted(id). We also need to check that the level-2 id is nested within the level-3 id; what we mean by this is that if the level-3 id changes, the level-2 id must change as well, a fact checked by the function is.nested(id2,id3).
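
Minimal sketches of what these two helpers must do, assuming numeric ids (the library's own versions may differ in detail):

is.sorted <- function(id) all(diff(id) >= 0)

is.nested <- function(id2, id3) {
	# whenever the level-3 id changes, the level-2 id must change too
	n <- length(id3)
	all(id3[-1] == id3[-n] | id2[-1] != id2[-n])
}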

If all goes well the id's are tabulated to count the number of level-1 units per level-2 unit, which goes in a vector n12, and the number of level-1 units per level-3 unit, which goes in n13. We also need the number of level-2 units per level-3 unit, which is calculated by the function make.n23 and stored in n23.
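
A sketch of the counting logic, relying on the data being sorted as required; the actual make.n23 may be coded differently:

n12 <- as.vector(table(id2))   # level-1 units per level-2 unit
n13 <- as.vector(table(id3))   # level-1 units per level-3 unit
n23 <- as.vector(tapply(id2, id3, function(x) length(unique(x))))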

The X matrix is expanded to add two columns. Note that forgetting to add room for the random effects will usually crash S. This is one reason why going through lr3 pays.

The frame option returns a list with y, X and the vectors n12, n13 and n23, which together with a quadrature object are the needed input for the criterion function:

lr3.g(y, X, theta, n12, n13, n23, gq)

This function calculates the (negative) log-likelihood and its gradient by calling C code. The twin function lr3.h calculates the hessian as well.
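
The manual workflow thus parallels the two-level case. A sketch for the Guatemala example, reusing the fit g12 from Section 3.2 and relying on theta ending with log(sigma2) and log(sigma3):

fr <- lr3(numvac~kid2p+mom25p+orderc+eth+wed+hed+everwork+rural+pcind81,
	mom, cluster, data=guimr, frame=T)
gq <- gauher(12)
theta <- c(coef(g12,T,F), log(coef(g12,F,T)))
lr3.g(fr$y, fr$X, theta, fr$n12, fr$n13, fr$n23, gq)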

4. Installation

The S source for all the functions discussed here is in a file called lr3.s, which can be sourced to create the necessary objects. If the file is in the working directory use

source("lr3.s")

Otherwise you need a fully qualified path.

You also need the compiled C code, which is available as relocatable code lr3.o for our Unix system and as a dynamic link library lr3.dll for Windows systems.

This code has to be loaded before it can be used. On Unix use

dyn.load("lr3.o")

On S-Plus for Windows use

dll.load("lr3.dll",lr3.entry.points)

where lr3.entry.points is an object listing all entry points in the DLL. This object is created when you source "lr3.s".

The two sample datasets are available as "hospital.dat" and "guimr.dat".