Constructing Person-Year Files
We discuss how to prepare data for piecewise exponential modeling. We assume time or duration is split into intervals such that the hazard is constant in each interval. To fit the model we need to expand the data so that each individual contributes a separate record for each duration category. These types of files are often called person-year files, although the intervals don't have to be years.
Recidivism Duration
We will use data on recidivism used in Wooldridge's text and credited to Chung, Schmidt and Witte (1991). The data pertain to a random sample of convicts released from prison between July 1, 1977 and June 30, 1978. Of interest is the time until they return to prison. The information was collected retrospectively by looking at records in April 1984, so the maximum possible length of observation is 81 months. The data are available from the Stata website and can be accessed using the command
. use http://www.stata.com/data/jwooldridge/eacsap/recid
We will keep only ten observations on three variables (black, durat,
cens) so we can see exactly what's going on. We will also generate an
id, and make sure it is the first variable in the dataset. Finally,
we generate a variable called fail to indicate return to
prison. We list the 10 cases and check that we have 4 failures in 601
months of exposure.
. keep in 1/10
(1435 observations deleted)
. keep black durat cens
. gen id = _n
. move id black
. gen fail = 1 - cens
. save ten, replace
file ten.dta saved
. // list
. tabstat fail durat, stats(sum)
stats | fail durat
---------+--------------------
sum | 4 601
------------------------------
From First Principles
Our first approach will do all calculations "by hand" so you get a better appreciation of what's involved. We will group time in years combining 61-81 months in the last category, so the intervals are 1-12, 13-24, 25-36, 37-48, 49-60, and 61+. We then replicate each observation so we have one record per year:
. gen nyears = int((durat-1)/12) + 1
. replace nyears = 6 if nyears > 6
(5 real changes made)
. expand nyears
(40 observations created)
. // list
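To see how the formula for nyears maps months to years, it may help to evaluate it for a few durations. The values 24, 30 and 81 below are chosen purely for illustration and are not taken from the data. A duration of 24 months ends in year 2, one of 30 months ends in year 3, and one of 81 months would end in year 7 before the cap at 6, which is written here with min() as a compact stand-in for the replace step:
. display int((24-1)/12) + 1
2
. display int((30-1)/12) + 1
3
. display min(int((81-1)/12) + 1, 6)
6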
Stata adds the new copies at the end of the dataset. We will sort by id and add a new variable to keep track of the year. We also fix our copies of fail (only the last entry for each individual can be a failure) and compute the time exposed in each year (each individual is exposed for the full 12 months in every year except possibly the last year of observation):
. sort id
. quietly by id: gen year = _n
. replace fail = 0 if year < nyears
(10 real changes made)
. gen expo = 12
. replace expo = durat - 12*(year-1) if year == nyears
(8 real changes made)
. // list
. tabstat fail expo, stats(sum) by(year)
Summary statistics: sum
by categories of: year
year | fail expo
---------+--------------------
1 | 1 117
2 | 0 108
3 | 1 97
4 | 0 96
5 | 2 90
6 | 0 93
---------+--------------------
Total | 4 601
------------------------------
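As an optional check, not counted among the steps above, one could verify that the exposure pieces for each individual add back up to the original duration. This is only a sketch: totexpo is a hypothetical scratch variable introduced here, and the assert command produces no output when the condition holds for every record.
. egen totexpo = total(expo), by(id)
. assert totexpo == durat
. drop totexpo
The check passes because every year before the last contributes 12 months and the last contributes durat - 12*(nyears-1), so the pieces sum to durat whether or not the individual was capped at six years.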
Not bad for eight lines of Stata code. Now we will reduce that to three.
Using the Stata Method
Stata has facilities for managing survival data and "knows" how to create pseudo-observations or "episodes". We first use stset to tell Stata that we have survival data, defining the time variable and failure indicator, and then use stsplit to create the yearly episodes:
. clear
. use ten
. stset durat, failure(fail) id(id)
id: id
failure event: fail != 0 & fail < .
obs. time interval: (durat[_n-1], durat]
exit on or before: failure
------------------------------------------------------------------------------
10 total obs.
0 exclusions
------------------------------------------------------------------------------
10 obs. remaining, representing
10 subjects
4 failures in single failure-per-subject data
601 total analysis time at risk, at risk from t = 0
earliest observed entry t = 0
last observed exit t = 81
. stsplit year, at(12 24 36 48 60 100)
(40 observations (episodes) created)
. // list
. gen expo = _t - _t0
. tabstat _d expo, stats(sum) by(year)
Summary statistics: sum
by categories of: year
year | _d expo
---------+--------------------
0 | 1 117
12 | 0 108
24 | 1 97
36 | 0 96
48 | 2 90
60 | 0 93
---------+--------------------
Total | 4 601
------------------------------
And that's that. We obtain exactly the same results in three lines of code.
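With the data in this form, the piecewise exponential model mentioned at the outset can be fitted as a Poisson regression of the failure indicator on the interval, with exposure entering as an offset. The following is only a sketch of that next step, under the assumption that the full sample has been split in the same way. With just ten cases, several intervals have no failures, so the interval effects cannot be estimated sensibly. The choice of black as the only covariate is ours, made for illustration.
. poisson _d i.year black, exposure(expo)
The exposure() option adds the log of expo to the linear predictor with its coefficient fixed at one, so the remaining coefficients describe the log-hazard. Because the data are stset, an equivalent fit should also be available from Stata's survival commands:
. streg i.year black, distribution(exponential) nohr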

