Home | GLMs | Multilevel | Survival | Demography | Stata | R
Home Lecture Notes Stata Logs Datasets Problem Sets

Constructing Person-Year Files

We discuss how to prepare data for piecewise exponential modeling. We assume time or duration is split into intervals such that the hazard is constant in each interval. To fit the model we need to expand the data so that each individual contributes a separate record for each duration category. These types of files are often called person-year files, although the intervals don't have to be years.

Recividivism Duration

We will use data on recidivism used in Wooldridge's text and credited to Chung, Schmidt and Witte (1991). The data pertain to a random sample of convicts released from prison between July 1, 1977 and June 30, 1978. Of interest is the time until they return to prison. The information was collected retrospectively by looking at records in April 1984, so the maximum possible length of observation is 81 months. The data are available from the Stata website and can be accessed using the command

. use http://www.stata.com/data/jwooldridge/eacsap/recid

We will keep only ten observations on three variables (black, durat, cens) so we can see exactly what's going on. We will also generate an id, and make sure it is the first variable in the dataset. Finally, we generate a variable called fail to indicate return to prison. We list the 10 cases and check that we have 4 failures in 601 months of exposure.

. keep in 1/10
(1435 observations deleted)
 
. keep black durat cens
 
. gen id = _n
 
. move id black
 
. gen fail = 1 - cens
 
. save ten, replace
file ten.dta saved
 
. // list
. tabstat fail durat, stats(sum)
 
   stats |      fail     durat
---------+--------------------
     sum |         4       601
------------------------------

From First Principles

Our first approach will do all calculations "by hand" so you get a better appreciation of what's involved. We will group time in years combining 61-81 months in the last category, so the intervals are 1-12, 13-24, 25-36, 37-48, 49-60, and 61+. We then replicate each observation so we have one record per year:

. gen nyears = int((durat-1)/12) + 1
 
. replace nyears = 6 if nyears > 6
(5 real changes made)
 
. expand nyears
(40 observations created)
 
. // list

Stata adds the new copies at the end of the datase. We will sort by id and add a new variable to keep track of the year. We also fix our copies of fail (only the last entry for each individual can be a failure) and compute time exposed in each year (you are always exposed 12 months, except possibly in the last year of observation for each individual):

. sort id
 
. quietly by id: gen year = _n
 
. replace fail = 0 if year < nyears
(10 real changes made)
 
. gen expo = 12
 
. replace expo = durat - 12*(year-1) if year == nyears
(8 real changes made)
 
. // list
. tabstat fail expo, stats(sum) by(year)
 
Summary statistics: sum
  by categories of: year 
 
    year |      fail      expo
---------+--------------------
       1 |         1       117
       2 |         0       108
       3 |         1        97
       4 |         0        96
       5 |         2        90
       6 |         0        93
---------+--------------------
   Total |         4       601
------------------------------

Not bad for eight lines of Stata code. Now we will reduce that to three.

Using the Stata Method

Stata has facilities for managing survival data and "knows" how to create pseudo-observations or "episodes". We first use stset to tell Stata that we have survival data, defining the time variable and failure indicator, and then use stsplit to create the yearly episodes:
. clear 
 
. use ten
 
. stset durat, failure(fail) id(id)
 
                id:  id
     failure event:  fail != 0 & fail < .
obs. time interval:  (durat[_n-1], durat]
 exit on or before:  failure
 
------------------------------------------------------------------------------
       10  total obs.
        0  exclusions
------------------------------------------------------------------------------
       10  obs. remaining, representing
       10  subjects
        4  failures in single failure-per-subject data
      601  total analysis time at risk, at risk from t =         0
                             earliest observed entry t =         0
                                  last observed exit t =        81
 
. stsplit year, at(12 24 36 48 60 100)
(40 observations (episodes) created)
 
. // list
. gen expo = _t - _t0
 
. tabstat _d expo, stats(sum) by (year)
 
Summary statistics: sum
  by categories of: year 
 
    year |        _d      expo
---------+--------------------
       0 |         1       117
      12 |         0       108
      24 |         1        97
      36 |         0        96
      48 |         2        90
      60 |         0        93
---------+--------------------
   Total |         4       601
------------------------------

And that's that. We obtain exactly the same results in three lines of code.