Germán Rodríguez
Survival Analysis Princeton University

## Continuous and Discrete Models

Let's have another look at the recidivism data. We will split duration into single years with an open-ended category at 5+ and fit a piecewise exponential model with the same covariates as Wooldridge.

We will then treat the data as discrete, assuming that all we know is that recidivism occured somewhere in the year. We will fit a binary data model with a logit link, which corresponds to the discrete time model, and using a complementary-log-log link, which corresponds to a grouped continuous time model.

### A Piece-wise Exponential Model

. use http://www.stata.com/data/jwooldridge/eacsap/recid, clear

. gen fail = 1-cens

. gen id = _n

. stset durat, fail(fail) id(id)

id:  id
failure event:  fail != 0 & fail < .
obs. time interval:  (durat[_n-1], durat]
exit on or before:  failure

------------------------------------------------------------------------------
1445  total observations
0  exclusions
------------------------------------------------------------------------------
1445  observations remaining, representing
1445  subjects
552  failures in single-failure-per-subject data
80013  total analysis time at risk and under observation
at risk from t =         0
earliest observed entry t =         0
last observed exit t =        81

. stsplit year, at(12 24 36 48 60 100) // max is 81
(5,273 observations (episodes) created)

. replace year = year/12

. local x workprg priors tserved felon alcohol drugs ///
>         black married educ age

. streg i.year x', distribution(exponential) nohr

failure _d:  fail
analysis time _t:  durat
id:  id

Iteration 0:   log likelihood = -1739.8944
Iteration 1:   log likelihood = -1653.6678
Iteration 2:   log likelihood = -1587.2575
Iteration 3:   log likelihood = -1583.7542
Iteration 4:   log likelihood = -1583.7129
Iteration 5:   log likelihood = -1583.7129

Exponential regression -- log relative-hazard form

No. of subjects =        1,445                  Number of obs    =       6,718
No. of failures =          552
Time at risk    =        80013
LR chi2(15)      =      312.36
Log likelihood  =   -1583.7129                  Prob > chi2      =      0.0000

------------------------------------------------------------------------------
_t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
year |
1  |    .036532   .1093659     0.33   0.738    -.1778212    .2508851
2  |  -.3738156   .1296172    -2.88   0.004    -.6278607   -.1197706
3  |  -.8115436   .1564067    -5.19   0.000    -1.118095   -.5049921
4  |  -.9382311   .1683272    -5.57   0.000    -1.268146   -.6083159
5  |  -1.547178   .2033594    -7.61   0.000    -1.945755   -1.148601
|
workprg |   .0838291   .0907983     0.92   0.356    -.0941324    .2617906
priors |   .0872458   .0134763     6.47   0.000     .0608327     .113659
tserved |   .0130089   .0016863     7.71   0.000     .0097039    .0163139
felon |  -.2839252   .1061534    -2.67   0.007     -.491982   -.0758685
alcohol |   .4324425   .1057254     4.09   0.000     .2252245    .6396605
drugs |   .2747141   .0978667     2.81   0.005     .0828989    .4665293
black |   .4335559   .0883658     4.91   0.000     .2603622    .6067497
married |  -.1540477   .1092154    -1.41   0.158    -.3681059    .0600104
educ |  -.0214162   .0194453    -1.10   0.271    -.0595283     .016696
age |    -.00358   .0005223    -6.85   0.000    -.0046037   -.0025563
_cons |  -3.830127    .280282   -13.67   0.000     -4.37947   -3.280785
------------------------------------------------------------------------------

. estimates store pwe

> library(foreign)

> library(survival)

> recid$fail <- 1 - recid$cens

> recidx <- survSplit(recid, cut = seq(12, 60, 12),
+   start = "t0", end = "durat", event = "fail", episode = "interval")

> labels <- paste("(",seq(0,60,12),",",c(seq(12,60,12),81), "]",sep="")

> recidx <- mutate(recidx,
+   exposure = durat - t0,
+   interval = factor(interval + 1, labels = labels))

> mf <-  fail ~  interval + workprg + priors + tserved + felon +
+   alcohol + drugs + black + married + educ + age

> pwe <- glm(mf, offset = log(exposure), data = recidx, family = poisson)

> coef(summary(pwe))
Estimate  Std. Error     z value     Pr(>|z|)
(Intercept)     -3.830127469 0.280267334 -13.6659789 1.621090e-42
interval(12,24]  0.036531989 0.109361775   0.3340471 7.383440e-01
interval(24,36] -0.373815644 0.129611909  -2.8841150 3.925154e-03
interval(36,48] -0.811543632 0.156401452  -5.1888497 2.115971e-07
interval(48,60] -0.938231113 0.168321156  -5.5740534 2.488794e-08
interval(60,81] -1.547177936 0.203348918  -7.6084886 2.773196e-14
workprg          0.083829106 0.090794162   0.9232874 3.558575e-01
priors           0.087245826 0.013473463   6.4753825 9.457203e-11
tserved          0.013008862 0.001685901   7.7162667 1.197865e-14
felon           -0.283925203 0.106148770  -2.6747856 7.477705e-03
alcohol          0.432442493 0.105721133   4.0904073 4.306163e-05
drugs            0.274714115 0.097863462   2.8071162 4.998720e-03
black            0.433555955 0.088362277   4.9065729 9.268154e-07
married         -0.154047742 0.109211869  -1.4105403 1.583802e-01
educ            -0.021416177 0.019444026  -1.1014271 2.707108e-01
age             -0.003580003 0.000522249  -6.8549738 7.132557e-12


The estimates are very close to those obtained using a Cox model.

### A Logit Model

For a discrete-time survival analysis we have to make sure we only include intervals with complete exposure, where we can classify the outcome as failure or survival. The convicts were released between July 1, 1977 and June 30, 1978 and the data were collected in April 1984, so the length of observation ranges between 70 and 81 months. We therefore restrict our attention to 5 years or 60 months. (We could go up to 6 years or 72 months for some convicts, but unfortunately we don't have the date of release, so we can't identify these cases and must censor everyone at 60.)

. drop if _t0 >= 60
(921 observations deleted)

. logit _d i.year x'

Iteration 0:   log likelihood = -1759.0583
Iteration 1:   log likelihood = -1654.9242
Iteration 2:   log likelihood = -1637.1916
Iteration 3:   log likelihood = -1637.1267
Iteration 4:   log likelihood = -1637.1267

Logistic regression                             Number of obs     =      5,797
LR chi2(14)       =     243.86
Prob > chi2       =     0.0000
Log likelihood = -1637.1267                     Pseudo R2         =     0.0693

------------------------------------------------------------------------------
_d |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
year |
1  |   .0305282   .1193583     0.26   0.798    -.2034098    .2644661
2  |  -.4131403   .1384065    -2.98   0.003    -.6844119   -.1418686
3  |  -.8641487   .1639958    -5.27   0.000    -1.185575   -.5427229
4  |  -.9936625   .1756322    -5.66   0.000    -1.337895   -.6494297
|
workprg |   .1109887   .1003087     1.11   0.269    -.0856129    .3075902
priors |   .0992921   .0164654     6.03   0.000     .0670205    .1315636
tserved |   .0149221   .0021429     6.96   0.000     .0107221    .0191222
felon |  -.3196621   .1178117    -2.71   0.007    -.5505687   -.0887555
alcohol |   .4724998   .1184177     3.99   0.000     .2404055    .7045941
drugs |    .316729   .1086092     2.92   0.004     .1038589    .5295992
black |   .4580275   .0973977     4.70   0.000     .2671315    .6489235
married |  -.2048073   .1204593    -1.70   0.089    -.4409032    .0312885
educ |  -.0267259   .0215052    -1.24   0.214    -.0688754    .0154235
age |  -.0040231    .000584    -6.89   0.000    -.0051678   -.0028784
_cons |  -1.140803   .3084159    -3.70   0.000    -1.745287   -.5363185
------------------------------------------------------------------------------

. estimates store logit

> recidx <- filter(recidx, interval != "(60,81]")

> logit <- glm(mf, data = recidx, family = binomial)  # no offset

> coef(summary(logit))
Estimate   Std. Error    z value     Pr(>|z|)
(Intercept)     -1.140802599 0.3084159337 -3.6989094 2.165279e-04
interval(12,24]  0.030528163 0.1193582701  0.2557692 7.981291e-01
interval(24,36] -0.413140262 0.1384064532 -2.9849783 2.835984e-03
interval(36,48] -0.864148699 0.1639957690 -5.2693353 1.369186e-07
interval(48,60] -0.993662524 0.1756321916 -5.6576332 1.534747e-08
workprg          0.110988653 0.1003087410  1.1064704 2.685230e-01
priors           0.099292063 0.0164653717  6.0303566 1.635983e-09
tserved          0.014922136 0.0021429307  6.9634244 3.320994e-12
felon           -0.319662098 0.1178116529 -2.7133318 6.661038e-03
alcohol          0.472499810 0.1184176515  3.9901130 6.604183e-05
drugs            0.316729032 0.1086092071  2.9162264 3.542934e-03
black            0.458027506 0.0973977193  4.7026512 2.568049e-06
married         -0.204807338 0.1204592720 -1.7002206 8.908944e-02
educ            -0.026725931 0.0215052145 -1.2427651 2.139544e-01
age             -0.004023087 0.0005840427 -6.8883431 5.644594e-12


We will compare all estimates below.

### A Complementary Log-log Model

Finally we use a complementary log-log link

. glm _d i.year x', family(binomial) link(cloglog)

Iteration 0:   log likelihood = -1818.8831
Iteration 1:   log likelihood = -1640.7899
Iteration 2:   log likelihood = -1637.5235
Iteration 3:   log likelihood = -1637.5083
Iteration 4:   log likelihood = -1637.5083

Generalized linear models                         No. of obs      =      5,797
Optimization     : ML                             Residual df     =      5,782
Scale parameter =          1
Deviance         =  3275.016541                   (1/df) Deviance =   .5664159
Pearson          =  5908.117581                   (1/df) Pearson  =   1.021812

Variance function: V(u) = u*(1-u)                 [Bernoulli]
Link function    : g(u) = ln(-ln(1-u))            [Complementary log-log]

AIC             =   .5701253
Log likelihood   =  -1637.50827                   BIC             =  -46826.57

------------------------------------------------------------------------------
|                 OIM
_d |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
year |
1  |   .0216127   .1095455     0.20   0.844    -.1930925    .2363179
2  |  -.3926148   .1296978    -3.03   0.002    -.6468179   -.1384117
3  |  -.8249973   .1564479    -5.27   0.000    -1.131629   -.5183651
4  |  -.9483385   .1683997    -5.63   0.000    -1.278396   -.6182811
|
workprg |   .1044651   .0932517     1.12   0.263    -.0783049    .2872351
priors |   .0887063   .0139849     6.34   0.000     .0612964    .1161163
tserved |    .013267   .0017417     7.62   0.000     .0098534    .0166806
felon |  -.2885449   .1091355    -2.64   0.008    -.5024465   -.0746433
alcohol |   .4397795   .1090665     4.03   0.000      .226013    .6535459
drugs |   .2991025   .1002774     2.98   0.003     .1025625    .4956425
black |   .4272096   .0909458     4.70   0.000      .248959    .6054602
married |  -.1830403   .1137539    -1.61   0.108     -.405994    .0399133
educ |  -.0233346   .0201545    -1.16   0.247    -.0628367    .0161674
age |   -.003851   .0005466    -7.04   0.000    -.0049224   -.0027796
_cons |  -1.238797   .2893845    -4.28   0.000     -1.80598   -.6716138
------------------------------------------------------------------------------

. estimates store cloglog

> cloglog <- glm(mf, data = recidx, family = binomial(link = cloglog))

> coef(summary(cloglog))
Estimate   Std. Error    z value     Pr(>|z|)
(Intercept)     -1.238795113 0.2893607427 -4.2811444 1.859347e-05
interval(12,24]  0.021613951 0.1095604758  0.1972787 8.436094e-01
interval(24,36] -0.392613793 0.1297681301 -3.0255024 2.482204e-03
interval(36,48] -0.824996440 0.1566132100 -5.2677321 1.381194e-07
interval(48,60] -0.948338328 0.1684247392 -5.6306356 1.795467e-08
workprg          0.104466422 0.0934228228  1.1182109 2.634769e-01
priors           0.088706984 0.0145113085  6.1129556 9.780261e-10
tserved          0.013266906 0.0018142064  7.3127875 2.616567e-13
felon           -0.288542238 0.1096491770 -2.6315039 8.500789e-03
alcohol          0.439780479 0.1090998881  4.0309893 5.554258e-05
drugs            0.299102966 0.1003895869  2.9794222 2.887925e-03
black            0.427210098 0.0910947168  4.6897352 2.735589e-06
married         -0.183040394 0.1136073451 -1.6111669 1.071433e-01
educ            -0.023334468 0.0202349457 -1.1531767 2.488379e-01
age             -0.003851008 0.0005486362 -7.0192372 2.230827e-12


### Comparison of Estimates

All that remains is to compare the estimates

. estimates table pwe cloglog logit, eq(1:1:1)

-----------------------------------------------------
Variable |    pwe        cloglog       logit
-------------+---------------------------------------
year |
1  |  .03653199    .02161269    .03052816
2  | -.37381564   -.39261481   -.41314026
3  | -.81154363   -.82499726    -.8641487
4  | -.93823111   -.94833849   -.99366251
5  | -1.5471779
|
workprg |   .0838291     .1044651    .11098865
priors |  .08724582    .08870634    .09929206
tserved |  .01300886    .01326703    .01492214
felon | -.28392523   -.28854488    -.3196621
alcohol |  .43244249    .43977945    .47249981
drugs |  .27471411    .29910246    .31672903
black |  .43355595    .42720963     .4580275
married | -.15404774   -.18304032   -.20480734
educ | -.02141618   -.02333462   -.02672593
age |    -.00358   -.00385099   -.00402309
_cons | -3.8301275    -1.238797   -1.1408026
-----------------------------------------------------

> cbind(coef(pwe)[-6], coef(cloglog), coef(logit))
[,1]         [,2]         [,3]
(Intercept)     -3.830127469 -1.238795113 -1.140802599
interval(12,24]  0.036531989  0.021613951  0.030528163
interval(24,36] -0.373815644 -0.392613793 -0.413140262
interval(36,48] -0.811543632 -0.824996440 -0.864148699
interval(48,60] -0.938231113 -0.948338328 -0.993662524
workprg          0.083829106  0.104466422  0.110988653
priors           0.087245826  0.088706984  0.099292063
tserved          0.013008862  0.013266906  0.014922136
felon           -0.283925203 -0.288542238 -0.319662098
alcohol          0.432442493  0.439780479  0.472499810
drugs            0.274714115  0.299102966  0.316729032
black            0.433555955  0.427210098  0.458027506
married         -0.154047742 -0.183040394 -0.204807338
educ            -0.021416177 -0.023334468 -0.026725931
age             -0.003580003 -0.003851008 -0.004023087
`

As one would expect, the estimates of the relative risks based on the c-log-log link are closer to the continuous time estimates than those based on the logit link.

This result makes sense because the piece wise exponential and c-log-log link models are estimating the same continuous time hazard, one from continuous and one from grouped data, while the logit model is estimating a discrete time hazard.

Recall that in a continuous time model the relative risk multiplies the hazard or instantaneous failure rate, whereas in a discrete time logit model it multiplies the conditional odds of failure at a given time (or in a given time interval) given survival to that time (or the start of the interval). Interpretation of the results should take this fact into account.

All three approaches, however, lead to similar predicted survival probabilities.