Germán Rodríguez
Generalized Linear Models Princeton University

Piecewise Exponential Models in R:
Recidivism in the U.S.

This is an illustration of piecewise exponential survival using R, relying on the functions pwe() to create pseudo-observations and glm() to fit the model using the Poisson equivalence.

The Data

The dataset we will consider is analyzed in Wooldridge (2002) and credited to Chung, Schmidt and Witte (1991). The data pertain to a random sample of convicts released from prison between July 1, 1977 and June 30, 1978. Of interest is the time until they return to prison. The information was collected retrospectively by looking at records in April 1984, so the maximum length of observation is 81 months. The data are available from the Stata website in Stata format. We use the foreign library to read them into R

> library(foreign)
> recid <- read.dta("http://www.stata.com/data/jwooldridge/eacsap/recid.dta")
> nrow(recid)
[1] 1445

I get a warning that "R cannot read factor labels from Stata 5 files" but it seems harmless, as the data look Ok. The file has a censoring indicator, which we subtract from 1 to get a failure indicator. We also create an id variable and list observation number 9, which goes back to prison after 54 months.

> recid$fail = 1 - recid$cens
> recid$id = 1:nrow(recid)
> recid[9,c("id","durat","fail")]
   id durat fail
9   9    54    1

Creating Pseudo-Observations

To create pseudo-observations for survival analysis using the piecewise exponential model I wrote the function pwe(), which is listed below. You can copy and paste it, or download it from the website, where I saved it with extension txt so it can be listed from any browser by navigating to the url http://data.princeton.edu/wws509/stata/pwe.txt. You can also source it directly from R. The encoding is ASCII. Here's the function:

  pwe <- function(data, timevar, deathvar, bounds) {
    # pwe: expands an S data frame for piece-wise exponential survival
    # G. Rodriguez, Nov 29, 1992
    #
    # Check arguments: time and death must be variables in the data frame
    # and boundaries must be non-negative and strictly increasing
    if(!is.data.frame(data)) stop("First argument must be a data frame")
    if(is.na(match(tn <- deparse(substitute(timevar)), names(data))))
      stop(paste("\n\tSurvival time", tn, 
                 "must be a variable in the data frame"))
    if(is.na(match(dn <- deparse(substitute(deathvar)), names(data))))
      stop(paste("\n\tDeath indicator", dn, 
                 "must be a variable in the data frame"))
    width <- diff(bounds)
    if(any(bounds < 0) | any(width <= 0)) stop(paste(
      "Invalid interval boundaries in", deparse(substitute(
        bounds))))      #
    # Expand the data frame creating one pseudo-observation for each
    # interval visited, add interval number, events and exposure time
    # (existing variables with these names will be overwriten)
    n <- cut(data[, tn], bounds)
    data <- data[rep(seq(along = n), n),  ]
    i <- NULL
    for(k in 1:length(n))
      i <- c(i, 1:n[k])
    data$events <- ifelse(data[, tn] > bounds[i + 1], 0, data[, dn])
    data$exposure <- ifelse(data[, tn] > bounds[i + 1], width[i], data[, tn
                                                                       ] - bounds[i])
    data$interval <- i
    attr(data$interval, "levels") <- attr(n, "levels")
    data
  }

We now use this function to split the data into single-year intervals of duration from 0-12 to 48-60 with an open-ended category 60+. The function codes the interval variable using integer codes and a set of levels, and we turn that into a factor. We also show how the observation for subject 9 above becomes five pseudo-observations, with 12 months of exposure in years one to four with no events, and 6 months of exposure in year five with one event.

> source("http://data.princeton.edu/wws509/stata/pwe.txt")
> recidx = pwe(recid,durat,fail,c(0,12,24,36,48,60,100))
> nrow(recidx)
[1] 6718
> recidx$interval = factor(recidx$interval,labels=levels(recidx$interval))
> recidx[recidx$id==9, c("id","durat","fail","interval","exposure","events")]
    id durat fail interval exposure events
9    9    54    1   (0,12]       12      0
9.1  9    54    1  (12,24]       12      0
9.2  9    54    1  (24,36]       12      0
9.3  9    54    1  (36,48]       12      0
9.4  9    54    1  (48,60]        6      1

A PWE Proportional Hazards Model

We are now ready to fit a proportional hazards model with a piecewise exponential baseline where the hazard changes from year to year. We use the same model as Wooldridge(2002), involving ten predictors, all fixed covariates.

> fit=glm(events~interval+workprg+priors+tserved+felon+alcohol+drugs+
+     black+married+educ+age+offset(log(exposure)), 
+      data=recidx, family=poisson)
> summary(fit)

Call:
glm(formula = events ~ interval + workprg + priors + tserved + 
    felon + alcohol + drugs + black + married + educ + age + 
    offset(log(exposure)), family = poisson, data = recidx)
#
Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6130  -0.4598  -0.3468  -0.2543   3.4746  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -3.8301275  0.2802673 -13.666  < 2e-16 ***
interval(12,24]   0.0365320  0.1093618   0.334  0.73834    
interval(24,36]  -0.3738156  0.1296119  -2.884  0.00393 ** 
interval(36,48]  -0.8115436  0.1564015  -5.189 2.12e-07 ***
interval(48,60]  -0.9382311  0.1683212  -5.574 2.49e-08 ***
interval(60,100] -1.5471779  0.2033489  -7.608 2.77e-14 ***
workprg           0.0838291  0.0907942   0.923  0.35586    
priors            0.0872458  0.0134735   6.475 9.46e-11 ***
tserved           0.0130089  0.0016859   7.716 1.20e-14 ***
felon            -0.2839252  0.1061488  -2.675  0.00748 ** 
alcohol           0.4324425  0.1057211   4.090 4.31e-05 ***
drugs             0.2747141  0.0978635   2.807  0.00500 ** 
black             0.4335560  0.0883623   4.907 9.27e-07 ***
married          -0.1540477  0.1092119  -1.411  0.15838    
educ             -0.0214162  0.0194440  -1.101  0.27071    
age              -0.0035800  0.0005222  -6.855 7.13e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 3691.4  on 6717  degrees of freedom
Residual deviance: 3379.1  on 6702  degrees of freedom
AIC: 4515.1

Number of Fisher Scoring iterations: 6

The results are exactly the same as in the sister page using Stata. We see that the risk of recidivism is about the same in the first two years, but then decreases substantially with duration since release. At any given duration felons have 25% lower risk of recidivism than non-felons with the same observed characteristics. Subjects imprisoned for alcohol or drug related offenses have much higher risk of recidivism, everything else being equal.

Survival Probabilities

We now illustrate the calculation of survival probabilities. We start with the baseline hazard, which we obtain by adding the constant to the interval coefficients using zero for (0-12] to obtain log-hazards, and exponentiating to get hazards. To obtain the cumulative or integrated hazard we multiply each hazard by the width of the interval, which happens to be 12 for all intervals, and sum. We then obtain the survival as the exponential of the negative cumulative hazard. Note that we only need the first five years.

> b = coef(fit)
> h = exp( b[1] + c(0,b[2:6]) )
> H = cumsum( 12*h)
> S = exp(-H)
> S
                  interval(12,24]  interval(24,36] 
       0.7706799        0.5882188        0.4916958 
 interval(36,48]  interval(48,60] interval(60,100] 
       0.4379748        0.3955312        0.3741986 

These calculations apply to the reference cell and are not very meaningful because they set age to zero (and age, by the way, is measured in months).

We will now estimate the probability of staying out of prison for five years given average values of the predictors. In calculating the mean of each predictor we have to be careful to include only one observation per person, so we restrict the calculation to the first interval, which is "(0,12]". To do this I extract the first row for each person and the columns corresponding to fixed predictors into a matrix X. The relevant coefficients are in slots 7 to 16.

> xvars = c("workprg","priors","tserved","felon","alcohol","drugs",
+   "black","married","educ","age")
> X = recidx[recidx$interval=="(0,12]", xvars]
> xbar = colMeans(X)
> bx = b[7:16]
> xb = sum(xbar * bx)
> exp(-H[5] * exp(xb))
interval(48,60] 
      0.6570278 

Thus, the probability of staying out of prison for the average person is 65.7%. We can easily calculate this probability for felons and non-felons keeping all other variables at their means. Note that felon is the 4-th predictor in X

> xb0 = sum(xbar[-4] * bx[-4])  
> exp(-H[5] * exp(xb0))
interval(48,60] 
      0.6317763 
> xb1 = xb0 + bx[4]
> exp(-H[5] * exp(xb1))
interval(48,60] 
      0.7077168 

The predicted probability is 70.8% for felons and 63.2% for non-felons when all other characteristics are set to the mean, a difference of 7.6 percentage points.

An alternative calculation sets every person to be a felon or non-felon leaving all other characteristics as they are, and then averages the predicted probability of surviving five years without returning to prison.

> felon = X[,"felon"]
> X[,"felon"]= 0
> xb = as.matrix(X) %*% bx
> mean(exp(-(H[5] * exp(xb))))
[1] 0.6118797
> X[,"felon"]= 1
> xb = as.matrix(X) %*% bx
> mean(exp(-(H[5] * exp(xb))))
[1] 0.6857928
> X[,"felon"] = felon

The average probability of staying out of prison for five years is 68.6% if a felon and 61.2% if not, a difference of 7.4 percentage points.