![]() | ![]() | ![]() |
We will use data on recidivism used in Wooldridge's text and credited to Chung, Schmidt and Witte (1991). The data pertain to a random sample of convicts released from prison between July 1, 1977 and June 30, 1978. Of interest is the time until they return to prison. The information was collected retrospectively by looking at records in April 1984, so the maximum possible length of observation is 81 months. The data are available from the Stata website and can be accessed using the command
. use http://www.stata.com/data/jwooldridge/eacsap/recid
We will keep only ten observations on three variables: black, duract, cens,
so we can see exactly what's going on. We will also generate an id, and make
sure it is the first variable in the dataset. Finally we generate a variable
called fail to indicate return to prison. We list the 10 cases
and check that we have 4 failures in 601 months of exposure.
. keep in 1/10
(1435 observations deleted)
. keep black durat cens
. gen id = _n
. move id black
. gen fail = 1 - cens
. save ten, replace
(note: file ten.dta not found)
file ten.dta saved
. // list
. tabstat fail durat, stats(sum)
stats | fail durat
---------+--------------------
sum | 4 601
------------------------------
Our first approach will do all calculations "by hand" so you get a better appreciation of what's involved. We will group time in years combining 61-81 in the last category, so time is 1-12, 13-24, 25-36, 37-48, 49-60, 61+, and replicate each observation so we have one record per year:
. gen nyears = int((durat-1)/12) + 1 . replace nyears = 5 if nyears > 5 (6 real changes made) . expand nyears (34 observations created) . // list
Stata adds the new copies at the end of the datase. We will sort by id and add a new variable to keep track of the year. We also fix our copies of fail (only the last one can be a failure) and compute time exposed in each year (you are always exposed 12 months except in the last year):
. sort id
. quietly by id: gen year = _n
. replace fail = 0 if year < nyears
(10 real changes made)
. gen expo = 12
. replace expo = durat - 12*(year-1) if year == nyears
(9 real changes made)
. // list
. tabstat fail expo, stats(sum) by(year)
Summary statistics: sum
by categories of: year
year | fail expo
---------+--------------------
1 | 1 117
2 | 0 108
3 | 1 97
4 | 0 96
5 | 2 183
---------+--------------------
Total | 4 601
------------------------------
Not bad for eight lines of Stata code. Now we will reduce that to three.
Stata has facilities for managing survival data and "knows" how to create
pseudo-observations or "episodes".
We first use stset to tell Stata we have survival data,
defining the time variable and failure indicator, and then use
stsplit to create the yearly episodes:
. clear
. use ten
. stset durat, failure(fail) id(id)
id: id
failure event: fail != 0 & fail < .
obs. time interval: (durat[_n-1], durat]
exit on or before: failure
------------------------------------------------------------------------------
10 total obs.
0 exclusions
------------------------------------------------------------------------------
10 obs. remaining, representing
10 subjects
4 failures in single failure-per-subject data
601 total analysis time at risk, at risk from t = 0
earliest observed entry t = 0
last observed exit t = 81
. stsplit year, at(12 24 36 48 100)
(34 observations (episodes) created)
. // list
stsplit generates two time variables,
_t0 and _t, to indicate the start and
end of each episode, and a death indicator _d.
We use the time variables to compute exposure and list
our results in tabular form:
. gen expo = _t - _t0
. tabstat _d expo, stats(sum) by (year)
Summary statistics: sum
by categories of: year
year | _d expo
---------+--------------------
0 | 1 117
12 | 0 108
24 | 1 97
36 | 0 96
48 | 2 183
---------+--------------------
Total | 4 601
------------------------------
The results agree exactly with those obtained from first principles.