## Piecewise Exponential Models in R:

Recidivism in the U.S.

This is an illustration of piecewise exponential survival using R,
relying on the functions `pwe()`

to create pseudo-observations
and `glm()`

to fit the model using the Poisson equivalence.

#### The Data

The dataset we will consider is analyzed in Wooldridge (2002) and
credited to Chung, Schmidt and Witte (1991). The data pertain to a
random sample of convicts released from prison between July 1, 1977
and June 30, 1978. Of interest is the time until they return to
prison. The information was collected retrospectively by looking at
records in April 1984, so the maximum length of observation is 81
months. The data are available from the Stata website in Stata
format. We use the `foreign`

library to read them into R

> library(foreign) > recid <- read.dta("http://www.stata.com/data/jwooldridge/eacsap/recid.dta") > nrow(recid) [1] 1445

I get a warning that "R cannot read factor labels from Stata 5 files" but it seems harmless, as the data look Ok. The file has a censoring indicator, which we subtract from 1 to get a failure indicator. We also create an id variable and list observation number 9, which goes back to prison after 54 months.

> recid$fail = 1 - recid$cens > recid$id = 1:nrow(recid) > recid[9,c("id","durat","fail")] id durat fail 9 9 54 1

#### Creating Pseudo-Observations

To create pseudo-observations for survival analysis using the
piecewise exponential model I wrote the function `pwe()`

,
which is listed below. You can copy and paste it, or download it
from the website, where I saved it with extension `txt`

so it can be listed from any browser by navigating to the url
`http://data.princeton.edu/wws509/stata/pwe.txt`

.
You can also source it directly from R. The encoding is ASCII.
Here's the function:

pwe <- function(data, timevar, deathvar, bounds) { # pwe: expands an S data frame for piece-wise exponential survival # G. Rodriguez, Nov 29, 1992 # # Check arguments: time and death must be variables in the data frame # and boundaries must be non-negative and strictly increasing if(!is.data.frame(data)) stop("First argument must be a data frame") if(is.na(match(tn <- deparse(substitute(timevar)), names(data)))) stop(paste("\n\tSurvival time", tn, "must be a variable in the data frame")) if(is.na(match(dn <- deparse(substitute(deathvar)), names(data)))) stop(paste("\n\tDeath indicator", dn, "must be a variable in the data frame")) width <- diff(bounds) if(any(bounds < 0) | any(width <= 0)) stop(paste( "Invalid interval boundaries in", deparse(substitute( bounds)))) # # Expand the data frame creating one pseudo-observation for each # interval visited, add interval number, events and exposure time # (existing variables with these names will be overwriten) n <- cut(data[, tn], bounds) data <- data[rep(seq(along = n), n), ] i <- NULL for(k in 1:length(n)) i <- c(i, 1:n[k]) data$events <- ifelse(data[, tn] > bounds[i + 1], 0, data[, dn]) data$exposure <- ifelse(data[, tn] > bounds[i + 1], width[i], data[, tn ] - bounds[i]) data$interval <- i attr(data$interval, "levels") <- attr(n, "levels") data }

We now use this function to split the data into single-year intervals of duration from 0-12 to 48-60 with an open-ended category 60+. The function codes the interval variable using integer codes and a set of levels, and we turn that into a factor. We also show how the observation for subject 9 above becomes five pseudo-observations, with 12 months of exposure in years one to four with no events, and 6 months of exposure in year five with one event.

> source("http://data.princeton.edu/wws509/stata/pwe.txt") > recidx = pwe(recid,durat,fail,c(0,12,24,36,48,60,100)) > nrow(recidx) [1] 6718 > recidx$interval = factor(recidx$interval,labels=levels(recidx$interval)) > recidx[recidx$id==9, c("id","durat","fail","interval","exposure","events")] id durat fail interval exposure events 9 9 54 1 (0,12] 12 0 9.1 9 54 1 (12,24] 12 0 9.2 9 54 1 (24,36] 12 0 9.3 9 54 1 (36,48] 12 0 9.4 9 54 1 (48,60] 6 1

#### A PWE Proportional Hazards Model

We are now ready to fit a proportional hazards model with a piecewise exponential baseline where the hazard changes from year to year. We use the same model as Wooldridge(2002), involving ten predictors, all fixed covariates.

> fit=glm(events~interval+workprg+priors+tserved+felon+alcohol+drugs+ + black+married+educ+age+offset(log(exposure)), + data=recidx, family=poisson) > summary(fit) Call: glm(formula = events ~ interval + workprg + priors + tserved + felon + alcohol + drugs + black + married + educ + age + offset(log(exposure)), family = poisson, data = recidx) # Deviance Residuals: Min 1Q Median 3Q Max -1.6130 -0.4598 -0.3468 -0.2543 3.4746 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.8301275 0.2802673 -13.666 < 2e-16 *** interval(12,24] 0.0365320 0.1093618 0.334 0.73834 interval(24,36] -0.3738156 0.1296119 -2.884 0.00393 ** interval(36,48] -0.8115436 0.1564015 -5.189 2.12e-07 *** interval(48,60] -0.9382311 0.1683212 -5.574 2.49e-08 *** interval(60,100] -1.5471779 0.2033489 -7.608 2.77e-14 *** workprg 0.0838291 0.0907942 0.923 0.35586 priors 0.0872458 0.0134735 6.475 9.46e-11 *** tserved 0.0130089 0.0016859 7.716 1.20e-14 *** felon -0.2839252 0.1061488 -2.675 0.00748 ** alcohol 0.4324425 0.1057211 4.090 4.31e-05 *** drugs 0.2747141 0.0978635 2.807 0.00500 ** black 0.4335560 0.0883623 4.907 9.27e-07 *** married -0.1540477 0.1092119 -1.411 0.15838 educ -0.0214162 0.0194440 -1.101 0.27071 age -0.0035800 0.0005222 -6.855 7.13e-12 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 3691.4 on 6717 degrees of freedom Residual deviance: 3379.1 on 6702 degrees of freedom AIC: 4515.1 Number of Fisher Scoring iterations: 6

The results are exactly the same as in the sister page using Stata. We see that the risk of recidivism is about the same in the first two years, but then decreases substantially with duration since release. At any given duration felons have 25% lower risk of recidivism than non-felons with the same observed characteristics. Subjects imprisoned for alcohol or drug related offenses have much higher risk of recidivism, everything else being equal.

#### Survival Probabilities

We now illustrate the calculation of survival probabilities. We start with the baseline hazard, which we obtain by adding the constant to the interval coefficients using zero for (0-12] to obtain log-hazards, and exponentiating to get hazards. To obtain the cumulative or integrated hazard we multiply each hazard by the width of the interval, which happens to be 12 for all intervals, and sum. We then obtain the survival as the exponential of the negative cumulative hazard. Note that we only need the first five years.> b = coef(fit) > h = exp( b[1] + c(0,b[2:6]) ) > H = cumsum( 12*h) > S = exp(-H) > S interval(12,24] interval(24,36] 0.7706799 0.5882188 0.4916958 interval(36,48] interval(48,60] interval(60,100] 0.4379748 0.3955312 0.3741986

These calculations apply to the reference cell and are not very meaningful because they set age to zero (and age, by the way, is measured in months).

We will now estimate the probability of staying out of prison
for five years given average values of the predictors.
In calculating the mean of each predictor we have to be careful
to include only one observation per person, so we restrict the
calculation to the first interval, which is "(0,12]".
To do this I extract the first row for each person and the
columns corresponding to fixed predictors into a matrix
`X`

.
The relevant coefficients are in slots 7 to 16.

> xvars = c("workprg","priors","tserved","felon","alcohol","drugs", + "black","married","educ","age") > X = recidx[recidx$interval=="(0,12]", xvars] > xbar = colMeans(X) > bx = b[7:16] > xb = sum(xbar * bx) > exp(-H[5] * exp(xb)) interval(48,60] 0.6570278

Thus, the probability of staying out of prison for the average
person is 65.7%. We can easily calculate this probability for
felons and non-felons keeping all other variables at their
means. Note that felon is the 4-th predictor in `X`

> xb0 = sum(xbar[-4] * bx[-4]) > exp(-H[5] * exp(xb0)) interval(48,60] 0.6317763 > xb1 = xb0 + bx[4] > exp(-H[5] * exp(xb1)) interval(48,60] 0.7077168

The predicted probability is 70.8% for felons and 63.2% for non-felons when all other characteristics are set to the mean, a difference of 7.6 percentage points.

An alternative calculation sets every person to be a felon or non-felon leaving all other characteristics as they are, and then averages the predicted probability of surviving five years without returning to prison.

> felon = X[,"felon"] > X[,"felon"]= 0 > xb = as.matrix(X) %*% bx > mean(exp(-(H[5] * exp(xb)))) [1] 0.6118797 > X[,"felon"]= 1 > xb = as.matrix(X) %*% bx > mean(exp(-(H[5] * exp(xb)))) [1] 0.6857928 > X[,"felon"] = felon

The average probability of staying out of prison for five years is 68.6% if a felon and 61.2% if not, a difference of 7.4 percentage points.