Comparing Continuous and Discrete Models
Let's have another look at the recidivism data. We will split duration into single years with an open-ended category at 5+ and fit a piecewise exponential model with the same covariates as Wooldridge.
We will then treat the data as discrete, assuming all we know is that recidivism occurred somewhere in the year. We will fit a binary data model with a logit link, which corresponds to the discrete-time model, and another with a complementary log-log link, which corresponds to a grouped continuous-time model.
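In symbols, the three models can be sketched as follows. The notation is ours and is only an assumption for illustration: $\lambda_{ij}$ is the hazard for individual $i$ in year $j$, $h_{ij}$ is the conditional probability of failing in year $j$ given survival to the start of that year, $\alpha_j$ is the effect of year $j$ (constant plus year coefficient) and $x_i'\beta$ collects the covariate effects.

```latex
\begin{align*}
\text{piecewise exponential:} \quad & \log \lambda_{ij} = \alpha_j + x_i'\beta \\
\text{logit:} \quad & \log \frac{h_{ij}}{1 - h_{ij}} = \alpha_j + x_i'\beta \\
\text{complementary log-log:} \quad & \log\{-\log(1 - h_{ij})\} = \alpha_j + x_i'\beta
\end{align*}
```

The same symbols are reused for readability, but the parameters live on a different scale in each model and need not be numerically equal.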
A Piecewise Exponential Model
We read, stset and split the data, and then fit our model.
. use http://www.stata.com/data/jwooldridge/eacsap/recid, clear
. gen fail = 1-cens
. gen id = _n
. stset durat, fail(fail) id(id)
id: id
failure event: fail != 0 & fail < .
obs. time interval: (durat[_n-1], durat]
exit on or before: failure
------------------------------------------------------------------------------
1445 total obs.
0 exclusions
------------------------------------------------------------------------------
1445 obs. remaining, representing
1445 subjects
552 failures in single failure-per-subject data
80013 total analysis time at risk, at risk from t = 0
earliest observed entry t = 0
last observed exit t = 81
. stsplit year, at(12 24 36 48 60 100) // max is 81
(5273 observations (episodes) created)
. replace year = year/12
(5273 real changes made)
. local x workprg priors tserved felon alcohol drugs ///
> black married educ age
. streg i.year `x', distribution(exponential) nohr
failure _d: fail
analysis time _t: durat
id: id
Iteration 0: log likelihood = -1739.8944
Iteration 1: log likelihood = -1653.6678
Iteration 2: log likelihood = -1587.2575
Iteration 3: log likelihood = -1583.7542
Iteration 4: log likelihood = -1583.7129
Iteration 5: log likelihood = -1583.7129
Exponential regression -- log relative-hazard form
No. of subjects = 1445 Number of obs = 6718
No. of failures = 552
Time at risk = 80013
LR chi2(15) = 312.36
Log likelihood = -1583.7129 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
_t | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
year |
1 | .036532 .1093659 0.33 0.738 -.1778212 .2508851
2 | -.3738156 .1296172 -2.88 0.004 -.6278607 -.1197706
3 | -.8115436 .1564067 -5.19 0.000 -1.118095 -.5049921
4 | -.9382311 .1683272 -5.57 0.000 -1.268146 -.6083159
5 | -1.547178 .2033594 -7.61 0.000 -1.945755 -1.148601
|
workprg | .0838291 .0907983 0.92 0.356 -.0941324 .2617906
priors | .0872458 .0134763 6.47 0.000 .0608327 .113659
tserved | .0130089 .0016863 7.71 0.000 .0097039 .0163139
felon | -.2839252 .1061534 -2.67 0.007 -.491982 -.0758685
alcohol | .4324425 .1057254 4.09 0.000 .2252245 .6396605
drugs | .2747141 .0978667 2.81 0.005 .0828989 .4665293
black | .4335559 .0883658 4.91 0.000 .2603622 .6067497
married | -.1540477 .1092154 -1.41 0.158 -.3681059 .0600104
educ | -.0214162 .0194453 -1.10 0.271 -.0595283 .016696
age | -.00358 .0005223 -6.85 0.000 -.0046037 -.0025563
_cons | -3.830127 .280282 -13.67 0.000 -4.37947 -3.280785
------------------------------------------------------------------------------
. estimates store phaz
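The episode splitting done by stsplit above can be sketched outside Stata. The following Python fragment is a minimal illustration, not a replacement for stsplit: the function name `split_episodes` and the two example subjects are hypothetical, and the cut points simply mirror the at(12 24 36 48 60 ...) list, with the last interval left open-ended.

```python
import math

# Cut points in months, mirroring at(12 24 36 48 60 ...) in the stsplit call;
# the last interval (60+) is open-ended.
CUTS = [0, 12, 24, 36, 48, 60]


def split_episodes(durat, fail, cuts=CUTS):
    """Split one subject's follow-up into yearly episodes.

    Returns a list of (t0, t1, d, year) tuples, where (t0, t1] is the
    episode, d is 1 only in the episode where the failure occurs, and
    year is the interval index 0..5.
    """
    uppers = cuts[1:] + [math.inf]
    episodes = []
    for year, (lower, upper) in enumerate(zip(cuts, uppers)):
        if durat <= lower:
            break  # subject left the risk set before this interval
        t1 = min(durat, upper)
        d = fail if t1 == durat else 0
        episodes.append((lower, t1, d, year))
    return episodes


# Hypothetical subjects, not rows from the recidivism file.
print(split_episodes(30, 1))  # fails in month 30
print(split_episodes(75, 0))  # censored at month 75
```

Each subject contributes one record per year entered, the exposure times add up to the original duration, and the failure indicator is set only in the last record.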
A Logit Model
For a discrete-time survival analysis we have to make sure we only include intervals with complete exposure, where we can classify the outcome as failure or survival. The convicts were released between July 1, 1977 and June 30, 1978 and the data were collected in April 1984, so the length of observation ranges between 70 and 81 months. We therefore restrict our attention to 5 years or 60 months. (We could go up to 6 years or 72 months for some convicts, but unfortunately we don't have the date of release, so we can't identify these cases and must censor everyone at 60.)
. drop if _t0 >= 60
(921 observations deleted)
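The figures quoted for the observation window, and the number of records left after the drop, can be checked with a little arithmetic. This is a rough sketch in Python that counts whole calendar months and ignores the day of the month, which is our simplifying assumption; `months_between` is a hypothetical helper.

```python
def months_between(start, end):
    """Whole calendar months from (year, month) start to (year, month) end."""
    return (end[0] - start[0]) * 12 + (end[1] - start[1])


collected = (1984, 4)                           # April 1984
longest = months_between((1977, 7), collected)  # released July 1977
shortest = months_between((1978, 6), collected) # released June 1978
print(shortest, longest)                        # 70 81

# Records kept for the discrete-time models: the 6718 episodes of the split
# data minus the 921 open-ended (60+ month) episodes dropped above.
remaining = 6718 - 921
print(remaining)                                # 5797
```

Since every convict was followed for at least 70 months, the first five yearly intervals are fully observed for everyone still at risk.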
. logit _d i.year `x'
Iteration 0: log likelihood = -1759.0583
Iteration 1: log likelihood = -1654.9242
Iteration 2: log likelihood = -1637.1916
Iteration 3: log likelihood = -1637.1267
Iteration 4: log likelihood = -1637.1267
Logistic regression Number of obs = 5797
LR chi2(14) = 243.86
Prob > chi2 = 0.0000
Log likelihood = -1637.1267 Pseudo R2 = 0.0693
------------------------------------------------------------------------------
_d | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
year |
1 | .0305282 .1193583 0.26 0.798 -.2034098 .2644661
2 | -.4131403 .1384065 -2.98 0.003 -.6844119 -.1418686
3 | -.8641487 .1639958 -5.27 0.000 -1.185575 -.5427229
4 | -.9936625 .1756322 -5.66 0.000 -1.337895 -.6494297
|
workprg | .1109887 .1003087 1.11 0.269 -.0856129 .3075902
priors | .0992921 .0164654 6.03 0.000 .0670205 .1315636
tserved | .0149221 .0021429 6.96 0.000 .0107221 .0191222
felon | -.3196621 .1178117 -2.71 0.007 -.5505687 -.0887555
alcohol | .4724998 .1184177 3.99 0.000 .2404055 .7045941
drugs | .316729 .1086092 2.92 0.004 .1038589 .5295992
black | .4580275 .0973977 4.70 0.000 .2671315 .6489235
married | -.2048073 .1204593 -1.70 0.089 -.4409032 .0312885
educ | -.0267259 .0215052 -1.24 0.214 -.0688754 .0154235
age | -.0040231 .000584 -6.89 0.000 -.0051678 -.0028784
_cons | -1.140803 .3084159 -3.70 0.000 -1.745287 -.5363185
------------------------------------------------------------------------------
. estimates store logit
A Complementary Log-Log Model
Finally, we fit the same binary data model using a complementary log-log link.
. glm _d i.year `x', family(binomial) link(cloglog)
Iteration 0: log likelihood = -1818.8831
Iteration 1: log likelihood = -1640.7899
Iteration 2: log likelihood = -1637.5235
Iteration 3: log likelihood = -1637.5083
Iteration 4: log likelihood = -1637.5083
Generalized linear models No. of obs = 5797
Optimization : ML Residual df = 5782
Scale parameter = 1
Deviance = 3275.016541 (1/df) Deviance = .5664159
Pearson = 5908.117581 (1/df) Pearson = 1.021812
Variance function: V(u) = u*(1-u) [Bernoulli]
Link function : g(u) = ln(-ln(1-u)) [Complementary log-log]
AIC = .5701253
Log likelihood = -1637.50827 BIC = -46826.57
------------------------------------------------------------------------------
| OIM
_d | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
year |
1 | .0216127 .1095455 0.20 0.844 -.1930925 .2363179
2 | -.3926148 .1296978 -3.03 0.002 -.6468179 -.1384117
3 | -.8249973 .1564479 -5.27 0.000 -1.131629 -.5183651
4 | -.9483385 .1683997 -5.63 0.000 -1.278396 -.6182811
|
workprg | .1044651 .0932517 1.12 0.263 -.0783049 .2872351
priors | .0887063 .0139849 6.34 0.000 .0612964 .1161163
tserved | .013267 .0017417 7.62 0.000 .0098534 .0166806
felon | -.2885449 .1091355 -2.64 0.008 -.5024465 -.0746433
alcohol | .4397795 .1090665 4.03 0.000 .226013 .6535459
drugs | .2991025 .1002774 2.98 0.003 .1025625 .4956425
black | .4272096 .0909458 4.70 0.000 .248959 .6054602
married | -.1830403 .1137539 -1.61 0.108 -.405994 .0399133
educ | -.0233346 .0201545 -1.16 0.247 -.0628367 .0161674
age | -.003851 .0005466 -7.04 0.000 -.0049224 -.0027796
_cons | -1.238797 .2893845 -4.28 0.000 -1.80598 -.6716138
------------------------------------------------------------------------------
. estimates store cloglog
Comparison of Estimates
All that remains is to compare the estimates.
. estimates table phaz cloglog logit, eq(1:1:1)
-----------------------------------------------------
    Variable |     phaz       cloglog       logit
-------------+---------------------------------------
        year |
          1  |   .03653199    .02161269    .03052816
          2  |  -.37381564   -.39261481   -.41314026
          3  |  -.81154363   -.82499726    -.8641487
          4  |  -.93823111   -.94833849   -.99366251
          5  |  -1.5471779
             |
     workprg |    .0838291     .1044651    .11098865
      priors |   .08724582    .08870634    .09929206
     tserved |   .01300886    .01326703    .01492214
       felon |  -.28392523   -.28854488    -.3196621
     alcohol |   .43244249    .43977945    .47249981
       drugs |   .27471411    .29910246    .31672903
       black |   .43355595    .42720963     .4580275
     married |  -.15404774   -.18304032   -.20480734
        educ |  -.02141618   -.02333462   -.02672593
         age |     -.00358   -.00385099   -.00402309
       _cons |  -3.8301275    -1.238797   -1.1408026
-----------------------------------------------------
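The columns of the table can be compared more directly by measuring how far each discrete-time coefficient lies from its piecewise exponential counterpart. A minimal sketch in Python, run outside the Stata session with the covariate coefficients typed in by hand from the table above (the year effects are left out):

```python
# Covariate coefficients copied from the comparison table above,
# in the order (phaz, cloglog, logit).
coefs = {
    "workprg": (0.0838291, 0.1044651, 0.11098865),
    "priors": (0.08724582, 0.08870634, 0.09929206),
    "tserved": (0.01300886, 0.01326703, 0.01492214),
    "felon": (-0.28392523, -0.28854488, -0.3196621),
    "alcohol": (0.43244249, 0.43977945, 0.47249981),
    "drugs": (0.27471411, 0.29910246, 0.31672903),
    "black": (0.43355595, 0.42720963, 0.4580275),
    "married": (-0.15404774, -0.18304032, -0.20480734),
    "educ": (-0.02141618, -0.02333462, -0.02672593),
    "age": (-0.00358, -0.00385099, -0.00402309),
}

# Absolute distance of each discrete-time estimate from the
# piecewise exponential estimate, and which link comes closer.
closer = {}
for name, (phaz, cloglog, logit) in coefs.items():
    d_cloglog = abs(cloglog - phaz)
    d_logit = abs(logit - phaz)
    closer[name] = "cloglog" if d_cloglog < d_logit else "logit"
    print(f"{name:8s} {d_cloglog:.5f} {d_logit:.5f} {closer[name]}")
```

The two distance columns make it easy to see, covariate by covariate, which discrete-time fit tracks the continuous-time fit more closely.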
As one would expect, the estimates of the relative risks based on the c-log-log link are closer to the continuous time estimates than those based on the logit link.
This result makes sense because the piecewise exponential and c-log-log models are estimating the same continuous-time hazard, one from continuous and one from grouped data, while the logit model is estimating a discrete-time hazard.
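The connection between the grouped continuous-time model and the c-log-log link can be made explicit with a short derivation. The notation is ours and is an assumption for illustration: $\lambda_0(t)$ is the baseline hazard, hazards are proportional so that $\lambda(t \mid x) = \lambda_0(t)\,e^{x'\beta}$, and $h_j(x)$ is the probability of failing in the interval $(t_{j-1}, t_j]$ given survival to $t_{j-1}$.

```latex
\begin{align*}
1 - h_j(x) &= \exp\Big\{-\int_{t_{j-1}}^{t_j} \lambda_0(t)\, e^{x'\beta}\, dt\Big\}
            = \{1 - h_j(0)\}^{\exp(x'\beta)} \\
\log\{-\log(1 - h_j(x))\} &= \alpha_j + x'\beta,
  \qquad \alpha_j = \log \int_{t_{j-1}}^{t_j} \lambda_0(t)\, dt
\end{align*}
```

The coefficients $\beta$ are the same in the hazard and in the c-log-log equation; only the interval effects change meaning. If the baseline hazard is constant at $\lambda_j$ over a 12-month interval, then $\alpha_j = \log 12 + \log \lambda_j$. This is roughly what the constants in the table show: $-3.830 + \log 12 = -3.830 + 2.485 = -1.345$, not far from the c-log-log constant of $-1.239$. The agreement is only approximate because the two models are fitted to slightly different data.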
Recall that in a continuous time model the relative risk multiplies the hazard or instantaneous failure rate, whereas in a discrete time model it multiplies the conditional odds of failure at a given time (or in a given time interval) given survival to that time (or interval).
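To see what the two interpretations imply for a probability, consider a sketch in Python that applies the alcohol coefficient from each discrete-time model to a hypothetical baseline. The baseline conditional probability of 0.08 is an assumption chosen for illustration, not an estimate from the data.

```python
import math

h0 = 0.08  # hypothetical baseline conditional probability of failure in a year

b_cloglog = 0.4397795  # alcohol coefficient, c-log-log model
b_logit = 0.4724998    # alcohol coefficient, logit model

# c-log-log: exp(b) multiplies the cumulative hazard over the interval,
# so the conditional survival probability 1 - h0 is raised to the power exp(b).
h1_cloglog = 1 - (1 - h0) ** math.exp(b_cloglog)

# logit: exp(b) multiplies the conditional odds of failure.
odds1 = h0 / (1 - h0) * math.exp(b_logit)
h1_logit = odds1 / (1 + odds1)

print(round(h1_cloglog, 4), round(h1_logit, 4))
```

When the conditional probability of failure is small, odds and hazards are close, so the two models give nearly the same answer even though the coefficients differ.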
All three approaches, however, lead to similar predicted survival probabilities.
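A rough check of this claim can be done by hand from the table. The Python sketch below computes the probability of surviving five years without recidivism under each model for one hypothetical profile. The profile is an assumption, not a row of the data: 12 months served, 10 years of schooling, age 300 months, and all other covariates set to zero. It also assumes that age and time served are recorded in months.

```python
import math

# Coefficients copied from the comparison table, order (phaz, cloglog, logit).
cons = (-3.8301275, -1.238797, -1.1408026)
year = [  # years 0-4; year 0 is the reference category
    (0.0, 0.0, 0.0),
    (0.03653199, 0.02161269, 0.03052816),
    (-0.37381564, -0.39261481, -0.41314026),
    (-0.81154363, -0.82499726, -0.8641487),
    (-0.93823111, -0.94833849, -0.99366251),
]
beta = {
    "tserved": (0.01300886, 0.01326703, 0.01492214),
    "educ": (-0.02141618, -0.02333462, -0.02672593),
    "age": (-0.00358, -0.00385099, -0.00402309),
}

# Hypothetical profile; covariates not listed are taken to be zero.
profile = {"tserved": 12, "educ": 10, "age": 300}


def eta(model, j):
    """Linear predictor for model index 0..2 in year j."""
    xb = sum(beta[k][model] * v for k, v in profile.items())
    return cons[model] + year[j][model] + xb


# Piecewise exponential: monthly hazard with 12 months of exposure per year.
s_phaz = math.exp(-sum(12 * math.exp(eta(0, j)) for j in range(5)))
# C-log-log: yearly conditional survival is exp(-exp(eta)).
s_cloglog = math.exp(-sum(math.exp(eta(1, j)) for j in range(5)))
# Logit: yearly conditional survival is 1 / (1 + exp(eta)).
s_logit = math.prod(1 / (1 + math.exp(eta(2, j))) for j in range(5))

print(round(s_phaz, 3), round(s_cloglog, 3), round(s_logit, 3))
```

For this profile the three five-year survival probabilities differ by less than one percentage point, although other profiles may show somewhat larger gaps.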
