Germán Rodríguez

Survival Analysis
Princeton University
We will show how to fit a shared frailty model using data on
child mortality in Guatemala. The data come from a retrospective
survey conducted in 1974-76 by the *Instituto de Nutrición de
Centroamérica y Panamá* (INCAP) and RAND,
and include data on multiple children per family.

The data were first analyzed ignoring the clustering by
Pebley and Stupp (1987), Reproductive Patterns and Child Mortality in Guatemala.
*Demography*,24:43-60. (This link requires access to JSTOR.)

The data were then reanalized by
Guo and Rodríguez (1992), Estimating a Multivariate Proportional Hazards Model
for Clustered Data Using the EM Algorithm, with an Application to Child Survival
in Guatemala, *Journal of the American Statistical Association*,87:969-976,
using what we now call a shared frailty model.

The dataset is available as a Stata file in long format (one record per child):

. use https://data.princeton.edu/pop509/PebleyStupp (Child mortality in Guatemala) . desc Contains data from https://data.princeton.edu/pop509/PebleyStupp.dta obs: 3,120 Child mortality in Guatemala vars: 19 8 Mar 2005 20:48 size: 137,280 (99.6% of memory free) (_dta has notes) ------------------------------------------------------------------------------- storage display value variable name type format label variable label ------------------------------------------------------------------------------- momid float %9.0g kid byte %9.0g kids byte %8.0g time float %9.0g income float %9.0g death byte %8.0g sex byte %8.0g mage float %9.0g borde float %9.0g home byte %8.0g pdead byte %8.0g edyears float %9.0g p0014 byte %8.0g p1523 byte %8.0g p2435 byte %8.0g p36up byte %8.0g f0011 byte %8.0g f1223 byte %8.0g kidid float %9.0g ------------------------------------------------------------------------------- Sorted by: momid kid

The first step is to reproduce exactly the results in Pebley and Stupp.
We start by `stset`

ing the data noting that `time`

is age at death or at interview, `death`

is an indicator of
whether the child died, and `kidid`

uniquely identifies each
kid. (This is the 10 times the `momid`

in the original dataset
plus the kid number.)

. stset time, fail(death) id(kidid) id: kidid failure event: death != 0 & death < . Obs. Time interval: (time[_n-1], time] exit on or before: failure ------------------------------------------------------------------------------ 3120 total obs. 0 exclusions ------------------------------------------------------------------------------ 3120 obs. Remaining, representing 3120 subjects 403 failures in single failure-per-subject data 131512.5 total analysis time at risk, at risk from t = 0 earliest observed entry t = 0 last observed exit t = 60

Pebley and Stupp used a piece-wise exponential model where the hazard
is constant in intervals with boundaries at 1, 6, 12 and 24 months.
We can fit that kind of model in Stata using the standard Poisson trick
or using `streg`

to fit an exponential model with dummy
variables for the duration categories. We will do the latter because
it allows us to introduce gamma frailty later. So we `stsplit`

the dataset and generate dummies for all categories. (Note that the
original paper shows the hazard for each age, so we generate dummies
for all categories and omit the constant.)

. stsplit dur, at(0 1 6 12 24 60) (10474 observations (episodes) created) . gen a0 = dur==0 . gen a1to5 = dur==1 . gen a6to11 = dur==6 . gen a12to23 = dur==12 . gen a24up = dur==24 . local age "a0 a1to5 a6to11 a12to23 a24up"

The model includes mother's age and age squared, a linear term on birth order, and an indicator of whether the previous kid in the family died. There are also indicators for the length of the previous birth interval with categories 0-14, 15-23, 24-35 and 36 or more, with first births serving as the reference cell. We need to compute only one of these variables. It is useful to create macros for all:

. gen mage2 = mage^2 . local fixed "mage mage2 borde pdead" . local previous "p0014 p1523 p2435 p36up"

The model also includes a time-varying covariate with time-varying
effects. The variable in question is the length of the *following*
birth interval, with indicators `f0011`

for intervals under a year
and `f1223`

for intervals of one to two years. These dummies
capture very short and short intervals where a child would face
competition from a subsequent sibling. Pebley and Stupp treat these as
time-varying covariates because a sibling is assumed to affect the life
of the index child only after it is born. So we consider very short
intervals (< 12) only at ages 12 months and higher, and short intervals
(12-23 months) only at ages 24 months and higher. This is the time-varying
part. But the effect of very short intervals is also allowed to vary,
with intervals < 12 having different effects at ages 12-23 and 24 or more.
This means that we need to create three variables that interact the
length of the following interval with the age of the child:

. gen i011a1223 = f0011 * (dur == 12) . gen i011a24p = f0011 * (dur == 24) . gen i1223a24p = f1223 * (dur == 24) . local follow "i011a1223 i011a24p i1223a24p"We are now ready to fit the model. You should verify that this reproduces exactly the results in Pebley and Stupp, which is quite remarkable.

. streg `age' `fixed' `previous' `follow', dist(exponential) noconstant failure _d: death analysis time _t: time id: kidid Iteration 0: log likelihood = -2819.899 Iteration 1: log likelihood = -2115.6375 Iteration 2: log likelihood = -1864.5861 Iteration 3: log likelihood = -1849.9913 Iteration 4: log likelihood = -1848.7872 Iteration 5: log likelihood = -1848.7838 Iteration 6: log likelihood = -1848.7838 Exponential regression -- log relative-hazard form No. of subjects = 3120 Number of obs = 13594 No. of failures = 403 Time at risk = 131512.5 Wald chi2(16) = 9905.61 Log likelihood = -1848.7838 Prob > chi2 = 0.0000 ------------------------------------------------------------------------------ _t | Haz. Ratio Std. Err. Z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- a0 | .3375566 .2532262 -1.45 0.148 .0775884 1.468575 a1to5 | .0245332 .0186095 -4.89 0.000 .0055473 .1084995 a6to11 | .0303346 .0229222 -4.63 0.000 .0068981 .1333968 a12to23 | .0180935 .0136752 -5.31 0.000 .0041132 .0795915 a24up | .0032229 .0025034 -7.39 0.000 .0007032 .0147712 mage | .8612339 .0501274 -2.57 0.010 .7683828 .9653051 mage2 | 1.002581 .001035 2.50 0.013 1.000555 1.004612 borde | 1.063634 .0355569 1.85 0.065 .9961777 1.135658 pdead | 1.103248 .1649263 0.66 0.511 .8230493 1.478839 p0014 | 1.71365 .3640741 2.54 0.011 1.130004 2.59875 p1523 | .8845005 .1648386 -0.66 0.510 .6138541 1.274474 p2435 | .7718101 .1424211 -1.40 0.160 .5375754 1.108107 p36up | .6764905 .141174 -1.87 0.061 .4493947 1.018346 i011a1223 | 2.246906 1.609712 1.13 0.258 .5517892 9.149482 i011a24p | 4.933551 3.633516 2.17 0.030 1.164816 20.89594 i1223a24p | 1.07556 .4053648 0.19 0.847 .5138396 2.251343 ------------------------------------------------------------------------------ note: no constant term was estimated in the main equation . estimates store pwe

We now introduce a shared frailty term at the mother's level to allow
for intra-family correlation in child survival. Following standard practice
we assume that the frailty term has a gamma distribution independent of
observed covariates. Guo and Rodriguez derive an EM algorithm to fit this
model (and Guo's dissertation includes the Fortran program used to fit the
model). I now show how to reproduce their results using `streg`

.
All we need to do is specify the frailty distribution and the variable that
identifies the clusters:

. streg `age' `fixed' `previous' `follow', dist(exponential) noconstant /// > frailty(gamma) shared(momid) failure _d: death analysis time _t: time id: kidid Fitting exponential model: Fitting full model: Iteration 0: log likelihood = -1864.1367 Iteration 1: log likelihood = -1848.4841 Iteration 2: log likelihood = -1847.2191 Iteration 3: log likelihood = -1847.139 Iteration 4: log likelihood = -1847.1382 Iteration 5: log likelihood = -1847.1382 Exponential regression -- log-relative hazard form Number of obs = 13594 Gamma shared frailty Number of groups = 851 Group variable: momid No. of subjects = 3120 Obs per group: min = 1 No. of failures = 403 avg = 15.97415 Time at risk = 131512.5 max = 40 Wald chi2(16) = 8796.54 Log likelihood = -1847.1382 Prob > chi2 = 0.0000 ------------------------------------------------------------------------------ _t | Haz. Ratio Std. Err. Z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- a0 | .3710262 .2874212 -1.28 0.201 .0812846 1.693562 a1to5 | .0271691 .0212782 -4.60 0.000 .0058538 .1261004 a6to11 | .0337345 .0263341 -4.34 0.000 .0073047 .1557928 a12to23 | .0202894 .0158516 -4.99 0.000 .0043878 .0938193 a24up | .0036443 .0029231 -7.00 0.000 .0007566 .0175539 mage | .8559554 .0514063 -2.59 0.010 .7609048 .9628796 mage2 | 1.002689 .0010689 2.52 0.012 1.000596 1.004786 borde | 1.059148 .0376965 1.61 0.106 .9877825 1.13567 pdead | .9272842 .1655893 -0.42 0.672 .6534463 1.315878 p0014 | 1.7738 .3859226 2.63 0.008 1.158004 2.717059 p1523 | .9075818 .1718846 -0.51 0.609 .6261508 1.315505 p2435 | .7960679 .149776 -1.21 0.225 .5505554 1.151063 p36up | .6898139 .1463948 -1.75 0.080 .4550791 1.045628 i011a1223 | 2.210407 1.593665 1.10 0.271 .5379858 9.081837 i011a24p | 4.959532 3.677954 2.16 0.031 1.1593 21.21708 i1223a24p | 1.076996 .4067595 0.20 0.844 .5137269 2.257852 -------------+---------------------------------------------------------------- /ln_the | -1.540714 .632364 -2.44 0.015 -2.780125 -.3013038 -------------+---------------------------------------------------------------- theta | .214228 .1354701 .0620307 .739853 ------------------------------------------------------------------------------ Likelihood-ratio test of theta=0: chibar2(01) = 3.29 Prob>=chibar2 = 0.035 note: no constant term was estimated in the main equation . estimates store gamma

You should verify that the results agree quite closely with the paper. (The labels for the complex interaction terms involving the following birth interval are out of order in the published paper, but OK in Guo's dissertation.)

The estimated variance of frailty of 0.21 (or 0.22 in the paper) implies modest association between the lifetimes of siblings. Following Oakes it corresponds to a rank correlation of 0.09. Even this small association, however, translates into substantial hazard ratios in the sense of Clayton. The conditional hazard for a child knowing that a sibling died at age a is 22% higher than if the sibling survived to age a, holding all covariates constant. Knowing that two siblings died raises the hazard by 44%. Note that to test the significance of the variance we have to be careful because the test is on a boundary of the parameter space. In the JASA paper we obtain a t-ratio of 1.6 for the variance, which agrees with Stata (0.214/0.135=1.6, although working in the log-scale may be better). Stata's likelihood ratio test gives a chi-squared of 3.3 with an estimated p-value of 0.035, essentially the same result.

. estimates table pwe gamma, t eform ---------------------------------------- Variable | pwe gamma -------------+-------------------------- _t | a0 | .33755656 .37102625 | -1.45 -1.28 a1to5 | .02453322 .02716914 | -4.89 -4.60 a6to11 | .03033465 .03373445 | -4.63 -4.34 a12to23 | .01809347 .02028935 | -5.31 -4.99 a24up | .00322289 .00364434 | -7.39 -7.00 mage | .86123393 .85595543 | -2.57 -2.59 mage2 | 1.0025814 1.0026886 | 2.50 2.52 borde | 1.0636339 1.0591481 | 1.85 1.61 pdead | 1.1032484 .92728419 | 0.66 -0.42 p0014 | 1.7136501 1.7737999 | 2.54 2.63 p1523 | .88450054 .90758181 | -0.66 -0.51 p2435 | .77181013 .79606791 | -1.40 -1.21 p36up | .67649048 .68981395 | -1.87 -1.75 i011a1223 | 2.2469057 2.2104071 | 1.13 1.10 i011a24p | 4.9335514 4.9595321 | 2.17 2.16 i1223a24p | 1.0755599 1.0769956 | 0.19 0.20 -------------+-------------------------- ln_the | _cons | .214228 | -2.44 ---------------------------------------- legend: b/t

The other aspect of interest in the results is that the estimates of the covariate effects are remarkably stable. The one change worth mentioning is the coefficient for 'previous child died', which changes sign. This variable was clearly acting as a proxy for unobserved family effects. Note also that the t-ratios for observed covariate effects have generally declined in magnitude, which is consistent with the notion that correlated observations have less information than independent observations.

Guo and Rodríguez also fit a non-parametric frailty model using a two-point distribution. This distribution is not built into Stata, but the model can be fit in Stata with only modest effort by following the method described in Section 3.2 of the paper. Details are left as an exercise for the reader.

We continue our analysis of this dataset with a discussion of model interpretation, including calculation of predicted risks and probabilities.