Constructing Pseudo-Observations for Piece-Wise Exponential Survival Analysis

Recividivism Duration

We will use data on recidivism used in Wooldridge's text and credited to Chung, Schmidt and Witte (1991). The data pertain to a random sample of convicts released from prison between July 1, 1977 and June 30, 1978. Of interest is the time until they return to prison. The information was collected retrospectively by looking at records in April 1984, so the maximum possible length of observation is 81 months. The data are available from the Stata website and can be accessed using the command

. use http://www.stata.com/data/jwooldridge/eacsap/recid

We will keep only ten observations on three variables: black, duract, cens, so we can see exactly what's going on. We will also generate an id, and make sure it is the first variable in the dataset. Finally we generate a variable called fail to indicate return to prison. We list the 10 cases and check that we have 4 failures in 601 months of exposure.

. keep in 1/10
(1435 observations deleted)

. keep black durat cens

. gen id = _n

. move id black

. gen fail = 1 - cens

. save ten, replace
(note: file ten.dta not found)
file ten.dta saved

. // list
. tabstat fail durat, stats(sum)

   stats |      fail     durat
---------+--------------------
     sum |         4       601
------------------------------

From First Principles

Our first approach will do all calculations "by hand" so you get a better appreciation of what's involved. We will group time in years combining 61-81 in the last category, so time is 1-12, 13-24, 25-36, 37-48, 49-60, 61+, and replicate each observation so we have one record per year:

. gen nyears = int((durat-1)/12) + 1

. replace nyears = 5 if nyears > 5
(6 real changes made)

. expand nyears
(34 observations created)

. // list

Stata adds the new copies at the end of the datase. We will sort by id and add a new variable to keep track of the year. We also fix our copies of fail (only the last one can be a failure) and compute time exposed in each year (you are always exposed 12 months except in the last year):

. sort id

. quietly by id: gen year = _n

. replace fail = 0 if year < nyears
(10 real changes made)

. gen expo = 12

. replace expo = durat - 12*(year-1) if year == nyears
(9 real changes made)

. // list
. tabstat fail expo, stats(sum) by(year)

Summary statistics: sum
  by categories of: year 

    year |      fail      expo
---------+--------------------
       1 |         1       117
       2 |         0       108
       3 |         1        97
       4 |         0        96
       5 |         2       183
---------+--------------------
   Total |         4       601
------------------------------

Not bad for eight lines of Stata code. Now we will reduce that to three.

Using the Stata Method

Stata has facilities for managing survival data and "knows" how to create pseudo-observations or "episodes". We first use stset to tell Stata we have survival data, defining the time variable and failure indicator, and then use stsplit to create the yearly episodes:

. clear 

. use ten

. stset durat, failure(fail) id(id)

                id:  id
     failure event:  fail != 0 & fail < .
obs. time interval:  (durat[_n-1], durat]
 exit on or before:  failure

------------------------------------------------------------------------------
       10  total obs.
        0  exclusions
------------------------------------------------------------------------------
       10  obs. remaining, representing
       10  subjects
        4  failures in single failure-per-subject data
      601  total analysis time at risk, at risk from t =         0
                             earliest observed entry t =         0
                                  last observed exit t =        81


. stsplit year, at(12 24 36 48 100)
(34 observations (episodes) created)
. // list

stsplit generates two time variables, _t0 and _t, to indicate the start and end of each episode, and a death indicator _d. We use the time variables to compute exposure and list our results in tabular form:

. gen expo = _t - _t0

. tabstat _d expo, stats(sum) by (year)

Summary statistics: sum
  by categories of: year 

    year |        _d      expo
---------+--------------------
       0 |         1       117
      12 |         0       108
      24 |         1        97
      36 |         0        96
      48 |         2       183
---------+--------------------
   Total |         4       601
------------------------------
The results agree exactly with those obtained from first principles.