Child Mortality in Guatemala: A Shared Frailty Model
We will show how to fit a shared frailty model using data on child mortality in Guatemala. The data come from a retrospective survey conducted in 1974-76 by the Instituto de Nutrición de Centroamérica y Panamá (INCAP) and RAND, and include data on multiple children per family.
The data were first analyzed ignoring the clustering by Pebley and Stupp (1987), Reproductive Patterns and Child Mortality in Guatemala. Demography,24:43-60. (This link requires access to JSTOR.)
The data were then reanalized by Guo and Rodríguez (1992), Estimating a Multivariate Proportional Hazards Model for Clustered Data Using the EM Algorithm, with an Application to Child Survival in Guatemala, Journal of the American Statistical Association,87:969-976, using what we now call a shared frailty model.
The dataset is available as a Stata file in long format (one record per child):
. use http://data.princeton.edu/pop509/PebleyStupp
(Child mortality in Guatemala)
. desc
Contains data from http://data.princeton.edu/pop509/PebleyStupp.dta
obs: 3,120 Child mortality in Guatemala
vars: 19 8 Mar 2005 20:48
size: 137,280 (99.6% of memory free) (_dta has notes)
-------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
momid float %9.0g
kid byte %9.0g
kids byte %8.0g
time float %9.0g
income float %9.0g
death byte %8.0g
sex byte %8.0g
mage float %9.0g
borde float %9.0g
home byte %8.0g
pdead byte %8.0g
edyears float %9.0g
p0014 byte %8.0g
p1523 byte %8.0g
p2435 byte %8.0g
p36up byte %8.0g
f0011 byte %8.0g
f1223 byte %8.0g
kidid float %9.0g
-------------------------------------------------------------------------------
Sorted by: momid kid
A Piece-wise exponential model
The first step is to reproduce exactly the results in Pebley and Stupp.
We start by stseting the data noting that time
is age at death or at interview, death is an indicator of
whether the child died, and kidid uniquely identifies each
kid. (This is the 10 times the momid in the original dataset
plus the kid number.)
. stset time, fail(death) id(kidid)
id: kidid
failure event: death != 0 & death < .
Obs. Time interval: (time[_n-1], time]
exit on or before: failure
------------------------------------------------------------------------------
3120 total obs.
0 exclusions
------------------------------------------------------------------------------
3120 obs. Remaining, representing
3120 subjects
403 failures in single failure-per-subject data
131512.5 total analysis time at risk, at risk from t = 0
earliest observed entry t = 0
last observed exit t = 60
Pebley and Stupp used a piece-wise exponential model where the hazard
is constant in intervals with boundaries at 1, 6, 12 and 24 months.
We can fit that kind of model in Stata using the standard Poisson trick
or using streg to fit an exponential model with dummy
variables for the duration categories. We will do the latter because
it allows us to introduce gamma frailty later. So we stsplit
the dataset and generate dummies for all categories. (Note that the
original paper shows the hazard for each age, so we generate dummies
for all categories and omit the constant.)
. stsplit dur, at(0 1 6 12 24 60) (10474 observations (episodes) created) . gen a0 = dur==0 . gen a1to5 = dur==1 . gen a6to11 = dur==6 . gen a12to23 = dur==12 . gen a24up = dur==24 . local age "a0 a1to5 a6to11 a12to23 a24up"
The model includes mother's age and age squared, a linear term on birth order, and an indicator of whether the previous kid in the family died. There are also indicators for the length of the previous birth interval with categories 0-14, 15-23, 24-35 and 36 or more, with first births serving as the reference cell. We need to compute only one of these variables. It is useful to create macros for all:
. gen mage2 = mage^2 . local fixed "mage mage2 borde pdead" . local previous "p0014 p1523 p2435 p36up"
The model also includes a time-varying covariate with time-varying
effects. The variable in question is the length of the following
birth interval, with indicators f0011 for intervals under a year
and f1223 for intervals of one to two years. These dummies
capture very short and short intervals where a child would face
competition from a subsequent sibling. Pebley and Stupp treat these as
time-varying covariates because a sibling is assumed to affect the life
of the index child only after it is born. So we consider very short
intervals (< 12) only at ages 12 months and higher, and short intervals
(12-23 months) only at ages 24 months and higher. This is the time-varying
part. But the effect of very short intervals is also allowed to vary,
with intervals < 12 having different effects at ages 12-23 and 24 or more.
This means that we need to create three variables that interact the
length of the following interval with the age of the child:
. gen i011a1223 = f0011 * (dur == 12) . gen i011a24p = f0011 * (dur == 24) . gen i1223a24p = f1223 * (dur == 24) . local follow "i011a1223 i011a24p i1223a24p"We are now ready to fit the model. You should verify that this reproduces exactly the results in Pebley and Stupp, which is quite remarkable.
. streg `age' `fixed' `previous' `follow', dist(exponential) noconstant
failure _d: death
analysis time _t: time
id: kidid
Iteration 0: log likelihood = -2819.899
Iteration 1: log likelihood = -2115.6375
Iteration 2: log likelihood = -1864.5861
Iteration 3: log likelihood = -1849.9913
Iteration 4: log likelihood = -1848.7872
Iteration 5: log likelihood = -1848.7838
Iteration 6: log likelihood = -1848.7838
Exponential regression -- log relative-hazard form
No. of subjects = 3120 Number of obs = 13594
No. of failures = 403
Time at risk = 131512.5
Wald chi2(16) = 9905.61
Log likelihood = -1848.7838 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
_t | Haz. Ratio Std. Err. Z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
a0 | .3375566 .2532262 -1.45 0.148 .0775884 1.468575
a1to5 | .0245332 .0186095 -4.89 0.000 .0055473 .1084995
a6to11 | .0303346 .0229222 -4.63 0.000 .0068981 .1333968
a12to23 | .0180935 .0136752 -5.31 0.000 .0041132 .0795915
a24up | .0032229 .0025034 -7.39 0.000 .0007032 .0147712
mage | .8612339 .0501274 -2.57 0.010 .7683828 .9653051
mage2 | 1.002581 .001035 2.50 0.013 1.000555 1.004612
borde | 1.063634 .0355569 1.85 0.065 .9961777 1.135658
pdead | 1.103248 .1649263 0.66 0.511 .8230493 1.478839
p0014 | 1.71365 .3640741 2.54 0.011 1.130004 2.59875
p1523 | .8845005 .1648386 -0.66 0.510 .6138541 1.274474
p2435 | .7718101 .1424211 -1.40 0.160 .5375754 1.108107
p36up | .6764905 .141174 -1.87 0.061 .4493947 1.018346
i011a1223 | 2.246906 1.609712 1.13 0.258 .5517892 9.149482
i011a24p | 4.933551 3.633516 2.17 0.030 1.164816 20.89594
i1223a24p | 1.07556 .4053648 0.19 0.847 .5138396 2.251343
------------------------------------------------------------------------------
note: no constant term was estimated in the main equation
. estimates store pwe
A Shared-Frailty Model
We now introduce a shared frailty term at the mother's level to allow
for intra-family correlation in child survival. Following standard practice
we assume that the frailty term has a gamma distribution independent of
observed covariates. Guo and Rodriguez derive an EM algorithm to fit this
model (and Guo's dissertation includes the Fortran program used to fit the
model). I now show how to reproduce their results using streg.
All we need to do is specify the frailty distribution and the variable that
identifies the clusters:
. streg `age' `fixed' `previous' `follow', dist(exponential) noconstant ///
> frailty(gamma) shared(momid)
failure _d: death
analysis time _t: time
id: kidid
Fitting exponential model:
Fitting full model:
Iteration 0: log likelihood = -1864.1367
Iteration 1: log likelihood = -1848.4841
Iteration 2: log likelihood = -1847.2191
Iteration 3: log likelihood = -1847.139
Iteration 4: log likelihood = -1847.1382
Iteration 5: log likelihood = -1847.1382
Exponential regression --
log-relative hazard form Number of obs = 13594
Gamma shared frailty Number of groups = 851
Group variable: momid
No. of subjects = 3120 Obs per group: min = 1
No. of failures = 403 avg = 15.97415
Time at risk = 131512.5 max = 40
Wald chi2(16) = 8796.54
Log likelihood = -1847.1382 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
_t | Haz. Ratio Std. Err. Z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
a0 | .3710262 .2874212 -1.28 0.201 .0812846 1.693562
a1to5 | .0271691 .0212782 -4.60 0.000 .0058538 .1261004
a6to11 | .0337345 .0263341 -4.34 0.000 .0073047 .1557928
a12to23 | .0202894 .0158516 -4.99 0.000 .0043878 .0938193
a24up | .0036443 .0029231 -7.00 0.000 .0007566 .0175539
mage | .8559554 .0514063 -2.59 0.010 .7609048 .9628796
mage2 | 1.002689 .0010689 2.52 0.012 1.000596 1.004786
borde | 1.059148 .0376965 1.61 0.106 .9877825 1.13567
pdead | .9272842 .1655893 -0.42 0.672 .6534463 1.315878
p0014 | 1.7738 .3859226 2.63 0.008 1.158004 2.717059
p1523 | .9075818 .1718846 -0.51 0.609 .6261508 1.315505
p2435 | .7960679 .149776 -1.21 0.225 .5505554 1.151063
p36up | .6898139 .1463948 -1.75 0.080 .4550791 1.045628
i011a1223 | 2.210407 1.593665 1.10 0.271 .5379858 9.081837
i011a24p | 4.959532 3.677954 2.16 0.031 1.1593 21.21708
i1223a24p | 1.076996 .4067595 0.20 0.844 .5137269 2.257852
-------------+----------------------------------------------------------------
/ln_the | -1.540714 .632364 -2.44 0.015 -2.780125 -.3013038
-------------+----------------------------------------------------------------
theta | .214228 .1354701 .0620307 .739853
------------------------------------------------------------------------------
Likelihood-ratio test of theta=0: chibar2(01) = 3.29 Prob>=chibar2 = 0.035
note: no constant term was estimated in the main equation
. estimates store gamma
You should verify that the results agree quite closely with the paper. (The labels for the complex interaction terms involving the following birth interval are out of order in the published paper, but OK in Guo's dissertation.)
The estimated variance of frailty of 0.21 (or 0.22 in the paper) implies modest association between the lifetimes of siblings. Following Oakes it corresponds to a rank correlation of 0.09. Even this small association, however, translates into substantial hazard ratios in the sense of Clayton. The conditional hazard for a child knowing that a sibling died at age a is 22% higher than if the sibling survived to age a, holding all covariates constant. Knowing that two siblings died raises the hazard by 44%. Note that to test the significance of the variance we have to be careful because the test is on a boundary of the parameter space. In the JASA paper we obtain a t-ratio of 1.6 for the variance, which agrees with Stata (0.214/0.135=1.6, although working in the log-scale may be better). Stata's likelihood ratio test gives a chi-squared of 3.3 with an estimated p-value of 0.035, essentially the same result.
. estimates table pwe gamma, t eform
----------------------------------------
Variable | pwe gamma
-------------+--------------------------
_t |
a0 | .33755656 .37102625
| -1.45 -1.28
a1to5 | .02453322 .02716914
| -4.89 -4.60
a6to11 | .03033465 .03373445
| -4.63 -4.34
a12to23 | .01809347 .02028935
| -5.31 -4.99
a24up | .00322289 .00364434
| -7.39 -7.00
mage | .86123393 .85595543
| -2.57 -2.59
mage2 | 1.0025814 1.0026886
| 2.50 2.52
borde | 1.0636339 1.0591481
| 1.85 1.61
pdead | 1.1032484 .92728419
| 0.66 -0.42
p0014 | 1.7136501 1.7737999
| 2.54 2.63
p1523 | .88450054 .90758181
| -0.66 -0.51
p2435 | .77181013 .79606791
| -1.40 -1.21
p36up | .67649048 .68981395
| -1.87 -1.75
i011a1223 | 2.2469057 2.2104071
| 1.13 1.10
i011a24p | 4.9335514 4.9595321
| 2.17 2.16
i1223a24p | 1.0755599 1.0769956
| 0.19 0.20
-------------+--------------------------
ln_the |
_cons | .214228
| -2.44
----------------------------------------
legend: b/t
The other aspect of interest in the results is that the estimates of the covariate effects are remarkably stable. The one change worth mentioning is the coefficient for 'previous child died', which changes sign. This variable was clearly acting as a proxy for unobserved family effects. Note also that the t-ratios for observed covariate effects have generally declined in magnitude, which is consistent with the notion that correlated observations have less information than independent observations.
Guo and Rodríguez also fit a non-parametric frailty model using a two-point distribution. This distribution is not built into Stata, but the model can be fit in Stata with only modest effort by following the method described in Section 3.2 of the paper. Details are left as an exercise for the reader.
