Germán Rodríguez

Generalized Linear Models
Princeton University
This is an illustration of piecewise exponential survival using Stata,
relying on the commands `stset`

and `stsplit`

to create pseudo-observations
and `poisson`

to fit the model using the Poisson equivalence.
Stata can also fit this model using `streg`

with
`distribution(exponential)`

on the split data.

The dataset we will consider is analyzed in Wooldridge (2002) and credited to Chung, Schmidt and Witte (1991). The data pertain to a random sample of convicts released from prison between July 1, 1977 and June 30, 1978. Of interest is the time until they return to prison. The information was collected retrospectively by looking at records in April 1984, so the maximum length of observation is 81 months. The data are available from the Stata website in Stata format.

. use http://www.stata.com/data/jwooldridge/eacsap/recid.dta, clear . desc, s Contains data from http://www.stata.com/data/jwooldridge/eacsap/recid.dta obs: 1,445 vars: 18 28 Sep 1998 13:28 size: 33,235 Sorted by:

The file has a censoring indicator, which we subtract from 1 to get a failure indicator. We also create an id variable and list observation number 9, which goes back to prison after 54 months.

. gen fail = 1 - cens . gen id = _n . list id durat fail in 9

id durat fail | |

9. | 9 54 1 |

To create pseudo-observations for survival analysis using the
piecewise exponential model we `stset`

the data making sure
we specify an id variable, and then use `stsplit`

to
split the data into single-year
intervals of duration from 0-12 to 48-60 with an open-ended
category 60+.
The first command generates the built-in variables
`_t0`

for entering time, `_t`

for exit time
and `_d`

for failure. These are adjusted after the split
to reflect what happens in each interval.
We compute exposure as the difference between the exit and entering
times. I also create a variable for the number of events, but this is
not necessary as `_d`

would serve the same purpose.
We list these variables for individual 9 before and after the split
to illustrate how the episodes are created.

. stset durat, fail(fail) id(id) id: id failure event: fail != 0 & fail < . obs. time interval: (durat[_n-1], durat] exit on or before: failure

1445 total observations |

0 exclusions |

1445 observations remaining, representing |

1445 subjects |

552 failures in single-failure-per-subject data |

80013 total analysis time at risk and under observation |

at risk from t = 0 |

earliest observed entry t = 0 |

last observed exit t = 81 |

id _t0 _t _d | |

9. | 9 0 54 1 |

id _t0 _t _d year exposure events | |

41. | 9 0 12 0 0 12 0 |

42. | 9 12 24 0 12 12 0 |

43. | 9 24 36 0 24 12 0 |

44. | 9 36 48 0 36 12 0 |

45. | 9 48 54 1 48 6 1 |

The sample observation going back to prison after 54 months contributes
five episodes or pseudo-observations,
one each for years one to four with 12 months of exposure and no events,
and another for year five with 6 months of exposure and one event.
Note also that the variable generated by Stata to identify episodes,
here `year`

, reflects the time at which the interval *starts*,
the same as `_t0`

.

We are now ready to fit a proportional hazards model with
a piecewise exponential baseline where the hazard changes from
year to year. We use the same model as Wooldridge(2002),
involving ten predictors, all fixed covariates.
I specify the offset using the `exposure()`

option.
I could, of course, take logs and then use the `offset()`

option.

. local predictors workprg priors tserved felon alcohol drugs black married educ age . poisson events i.year `predictors', exposure(exposure) Iteration 0: log likelihood = -2244.0023 Iteration 1: log likelihood = -2241.5492 Iteration 2: log likelihood = -2241.5353 Iteration 3: log likelihood = -2241.5353 Poisson regression Number of obs = 6718 LR chi2(15) = 312.36 Prob > chi2 = 0.0000 Log likelihood = -2241.5353 Pseudo R2 = 0.0651

events | Coef. Std. Err. z P>|z| [95% Conf. Interval] |

year | |

12 | .036532 .1093659 0.33 0.738 -.1778212 .2508851 |

24 | -.3738156 .1296172 -2.88 0.004 -.6278607 -.1197706 |

36 | -.8115436 .1564067 -5.19 0.000 -1.118095 -.5049921 |

48 | -.9382311 .1683272 -5.57 0.000 -1.268146 -.6083159 |

60 | -1.547178 .2033594 -7.61 0.000 -1.945755 -1.148601 |

workprg | .0838291 .0907983 0.92 0.356 -.0941323 .2617906 |

priors | .0872458 .0134763 6.47 0.000 .0608327 .113659 |

tserved | .0130089 .0016863 7.71 0.000 .0097039 .0163139 |

felon | -.2839252 .1061534 -2.67 0.007 -.491982 -.0758684 |

alcohol | .4324425 .1057254 4.09 0.000 .2252245 .6396605 |

drugs | .2747141 .0978667 2.81 0.005 .0828989 .4665293 |

black | .433556 .0883658 4.91 0.000 .2603622 .6067497 |

married | -.1540477 .1092154 -1.41 0.158 -.3681059 .0600104 |

educ | -.0214162 .0194453 -1.10 0.271 -.0595283 .016696 |

age | -.00358 .0005223 -6.85 0.000 -.0046037 -.0025563 |

_cons | -3.830127 .280282 -13.67 0.000 -4.37947 -3.280785 |

ln(exposure) | 1 (exposure) |

The results are exactly the same as in the sister page using R. We see that the risk of recidivism is about the same in the first two years, but then decreases substantially with duration since release. At any given duration felons have 25% lower risk of recidivism than non-felons with the same observed characteristics. Subjects imprisoned for alcohol or drug related offenses have much higher risk of recidivism, everything else being equal.

We now illustrate the calculation of survival probabilities,
starting with the baseline hazard, which requires setting all
predictors to zero.
There are different ways to do these calculations in Stata,
but I will proceed from first principles using Mata.
An alternative is to use or create a dummy variable with the
required predictor values and then use `predict`

.
If you go this way make sure to specify the options `xb`

and `nooffset`

to predict the log-hazard.

Stata stores the constant as the last coefficient, here with index 17 . We add that to the year coefficients to obtain the log-hazard for each year, exponentiate to obtain hazards, and multiply by 12 and sum to obtain the cumulative or integrated hazard. Note that we only need the first five years.

. mata // mata (type end to exit) : b = st_matrix("e(b)") : logh = b[17] :+ b[1..5] : h = exp(logh) : H = runningsum(12 * h) : S = exp(-H) : S 1 2 3 4 5 1 .7706798852 .5882188201 .4916958096 .4379748047 .3955311901 : end

These calculations apply to the reference cell and are not very meaningful because they set age to zero (and age, by the way, is measured in months).

We will now estimate the probability of staying out of prison
for five years given average values of the predictors.
In calculating the mean of each predictor we have to be careful
to include only one observation per person, so we restrict the
calculation to the first interval. The easiest way to compute
means in Stata is to `collapse`

. keep if year == 0 (5273 observations deleted) . save recid1, replace file recid1.dta saved . collapse `predictors'

Now that we have the means we multiply each by the corresponding
coefficient to obtain the linear predictor `xb`

. scalar xb = 0 . foreach var of local predictors { 2. scalar xb = xb + `var' * _b[`var'] 3. } . di "xb=" xb -.79219682 . mata exp( -H[5] * exp(st_numscalar("xb")) ) .6570278064

Thus, the probability of staying out of prison for the average person is 65.7%. We can easily calculate this probability for felons and non-felons keeping all other variables at their means. All we need to do is subtract the coefficient of felon times the mean to get the linear predictor for a non-felon, and then add the coefficient to get the linear predictor for felons:

. scalar xb0 = xb - felon*_b[felon] . scalar xb1 = xb0 + _b[felon] . mata exp( -H[5] * exp(st_numscalar("xb0")) ) .6317763048 . mata exp( -H[5] * exp(st_numscalar("xb1")) ) .7077167906

The predicted probability is 70.8% for felons and 63.2% for non-felons when all other characteristics are set to the mean, a difference of 7.6 percentage points.

An alternative calculation sets every person to be a felon or non-felon leaving
all other characteristics as they are, and then averages the predicted
probability of surviving five years without returning to prison.
To do this we need the file with the first episode for each person, which
conveniently I saved. I'll also store the cumulative hazard at duration 60
in scalar `H5`

. use recid1, clear . scalar drop _all // to reuse names xb0 and xb1 . mata st_numscalar("H5",H[5]) . gen xb0 = 0 . local predictors workprg priors tserved felon alcohol drugs black married educ age . foreach var of local predictors { 2. if "`var'" == "felon" continue 3. quietly replace xb0 = xb0 + `var' * _b[`var'] 4. } . gen xb1 = xb0 + _b[felon] . gen S0 = exp(-H5 * exp(xb0) ) . gen S1 = exp(-H5 * exp(xb1) ) . sum S0 S1

Variable | Obs Mean Std. Dev. Min Max |

S0 | 1445 .6118797 .1549424 .0021267 .9595686 |

S1 | 1445 .6857928 .1392872 .0097329 .9694076 |

The average probability of staying out of prison for five years is 68.6% if a felon and 61.2% if not, a difference of 7.4 percentage points. This can be interpreted as a discrete marginal effect.