Germán Rodríguez

Survival Analysis
Princeton University
The dataset considered here is analyzed in Wooldridge (2002) and credited to Chung, Schmidt and Witte (1991). The data pertain to a random sample of convicts released from prison between July 1, 1977 and June 30, 1978. Of interest is the time until they return to prison. The information was collected retrospectively by looking at records in April 1984, so the maximum possible length of observation is 81 months.

Wooldridge fits Weibull and log-normal models using the following predictors

workprg | an indicator of participation in a work program |

priors | the number of previous convictions |

tserved | the time served rounded to months |

felon | and indicator of felony sentences |

alcohol | an indicator of alcohol problems |

drugs | an indicator of drug use history |

black | an indicator for African Americans |

married | an indicator if married when incarcerated |

educ | the number of years of schooling, and |

age | in months. |

We will reproduce his results using R and Stata.
The data are available in binary format from the Stata website
and consists of 1445 observations on 18 variables.
The duration variable is called `durat`

and represents time in months
until return to prison or end of follow up.
The censoring indicator is called `cens`

and is coded 1 if the observation
was censored (i.e. the individual had *not* returned to prison).

We create a new variable `fail`

coded 1 for failures.

. use https://www.stata.com/data/jwooldridge/eacsap/recid . desc, short Contains data from https://www.stata.com/data/jwooldridge/eacsap/recid.dta obs: 1,445 vars: 18 28 Sep 1998 13:28 size: 33,235 Sorted by: . gen fail = 1 - cens

> library(foreign) > recid <- read.dta("https://www.stata.com/data/jwooldridge/eacsap/recid.dta") > recid <- mutate(recid, fail = 1 - cens) > head(recid) durat cens workprg priors tserved felon alcohol drugs black married educ age fail 1 72 1 1 0 30 0 1 0 0 1 7 441 0 2 75 1 1 0 19 1 0 0 1 0 12 307 0 3 9 0 1 0 27 0 0 0 0 0 9 262 1 4 25 0 1 2 38 1 0 1 0 0 9 253 1 5 81 1 0 0 4 0 0 1 0 0 9 244 0 6 79 1 1 1 13 0 0 0 1 0 12 277 0

Let us first fit a proportional hazards model with a Weibull baseline,
using `stset`

to set the data and `survreg`

to fit the model.
To avoid repetition we will save the predictors in a macro.
which we can do using `survreg`

with `dist`

set to "weibull".
For convenience we save the model formula so we can reuse it later

. stset durat, failure(fail) failure event: fail != 0 & fail < . obs. time interval: (0, durat] exit on or before: failure ------------------------------------------------------------------------------ 1445 total observations 0 exclusions ------------------------------------------------------------------------------ 1445 observations remaining, representing 552 failures in single-record/single-failure data 80013 total analysis time at risk and under observation at risk from t = 0 earliest observed entry t = 0 last observed exit t = 81 . local predictors workprg priors tserved felon alcohol drugs /// > black married educ age . streg `predictors', distrib(weibull) failure _d: fail analysis time _t: durat Fitting constant-only model: Iteration 0: log likelihood = -1739.8944 Iteration 1: log likelihood = -1716.1367 Iteration 2: log likelihood = -1715.7712 Iteration 3: log likelihood = -1715.7711 Fitting full model: Iteration 0: log likelihood = -1715.7711 Iteration 1: log likelihood = -1669.1785 Iteration 2: log likelihood = -1634.3693 Iteration 3: log likelihood = -1633.0405 Iteration 4: log likelihood = -1633.0325 Iteration 5: log likelihood = -1633.0325 Weibull regression -- log relative-hazard form No. of subjects = 1,445 Number of obs = 1,445 No. of failures = 552 Time at risk = 80013 LR chi2(10) = 165.48 Log likelihood = -1633.0325 Prob > chi2 = 0.0000 ------------------------------------------------------------------------------ _t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- workprg | 1.095148 .0992728 1.00 0.316 .9168814 1.308074 priors | 1.092848 .014683 6.61 0.000 1.064445 1.122008 tserved | 1.013655 .0017037 8.07 0.000 1.010321 1.017 felon | .7412054 .0785485 -2.83 0.005 .6021898 .9123128 alcohol | 1.564179 .165389 4.23 0.000 1.271406 1.92437 drugs | 1.325064 .1296765 2.88 0.004 1.093791 1.605237 black | 1.574149 .1390031 5.14 0.000 1.32398 1.871587 married | .8593436 .0938794 -1.39 0.165 .6937084 1.064527 educ | .9769709 .0189724 -1.20 0.230 .9404845 1.014873 age | .9962823 .000523 -7.09 0.000 .9952577 .997308 _cons | .0333035 .0100249 -11.30 0.000 .0184613 .0600781 -------------+---------------------------------------------------------------- /ln_p | -.2158398 .0389149 -5.55 0.000 -.2921115 -.1395681 -------------+---------------------------------------------------------------- p | .8058644 .0313601 .7466852 .8697338 1/p | 1.240904 .0482896 1.149777 1.339252 ------------------------------------------------------------------------------

By default Stata exponentiates the coefficients to show relative risks.
Use the option `nohr`

for *no* hazard ratios to obtain the coefficients.

. streg, nohr Weibull regression -- log relative-hazard form No. of subjects = 1,445 Number of obs = 1,445 No. of failures = 552 Time at risk = 80013 LR chi2(10) = 165.48 Log likelihood = -1633.0325 Prob > chi2 = 0.0000 ------------------------------------------------------------------------------ _t | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- workprg | .0908893 .0906478 1.00 0.316 -.0867772 .2685558 priors | .0887867 .0134355 6.61 0.000 .0624535 .1151198 tserved | .0135625 .0016808 8.07 0.000 .0102682 .0168567 felon | -.2994775 .105974 -2.83 0.005 -.5071826 -.0917723 alcohol | .4473611 .1057353 4.23 0.000 .2401236 .6545985 drugs | .2814605 .0978644 2.88 0.004 .0896499 .4732711 black | .4537147 .0883037 5.14 0.000 .2806426 .6267867 married | -.1515864 .1092454 -1.39 0.165 -.3657035 .0625307 educ | -.0232984 .0194196 -1.20 0.230 -.0613601 .0147633 age | -.0037246 .000525 -7.09 0.000 -.0047536 -.0026956 _cons | -3.402094 .3010177 -11.30 0.000 -3.992077 -2.81211 -------------+---------------------------------------------------------------- /ln_p | -.2158398 .0389149 -5.55 0.000 -.2921115 -.1395681 -------------+---------------------------------------------------------------- p | .8058644 .0313601 .7466852 .8697338 1/p | 1.240904 .0482896 1.149777 1.339252 ------------------------------------------------------------------------------

> require(survival) > mf <- Surv(durat, fail) ~ workprg + priors + tserved + felon + + alcohol + drugs + black + married + educ + age > wf <- survreg(mf, data=recid, dist="weibull")

By default R reports coefficients in the accelerated failure time metric.
I wrote a simple function to convert them to a proportional hazards metric,
which you can download from this website.(I also work with *p = 1/scale*.)

> source("https://data.princeton.edu/pop509/aft2ph.txt") > phfit <- aft2ph(wf); phfit term estimate std.error statistic p.value 1 (Intercept) -3.402093509 0.3010177314 -11.301970 1.283063e-29 2 workprg 0.090889277 0.0906478280 1.002664 3.160232e-01 3 priors 0.088786675 0.0134355280 6.608350 3.886262e-11 4 tserved 0.013562467 0.0016807811 8.069145 7.079190e-16 5 felon -0.299477494 0.1059739596 -2.825954 4.714009e-03 6 alcohol 0.447361071 0.1057353368 4.230951 2.327050e-05 7 drugs 0.281460523 0.0978643554 2.876027 4.027153e-03 8 black 0.453714680 0.0883036888 5.138117 2.775052e-07 9 married -0.151586401 0.1092454397 -1.387576 1.652661e-01 10 educ -0.023298411 0.0194196040 -1.199737 2.302416e-01 11 age -0.003724612 0.0005249981 -7.094526 1.297959e-12 12 log.p -0.215839833 0.0389148545 -5.546464 2.915049e-08

This reproduces Table 20.1 in Wooldridge. All but three of the predictors have significant coefficients, the exceptions being participation in a work program, marital status, and education.

To obtain *hazard ratios* we exponentiate the coefficients.
We also report the Weibull parameter *p*, while R uses scale = 1/p.

> transmute(phfit, term, hazard.ratio = exp(estimate)) term hazard.ratio 1 (Intercept) 0.03330348 2 workprg 1.09514774 3 priors 1.09284750 4 tserved 1.01365485 5 felon 0.74120540 6 alcohol 1.56417898 7 drugs 1.32506369 8 black 1.57414880 9 married 0.85934363 10 educ 0.97697090 11 age 0.99628232 12 log.p 0.80586436 > exp(tail(phfit,1)$estimate) [1] 0.8058644

The Weibull parameter *p* is 0.8, indicating that the risk of recidivism declines
over time, about 21% per month. The hypothesis that the risk is constant over time
would be soundly rejected.

The exponentiated coefficient of drugs indicates that former inmates with a history of drug use have 32.5% higher risk or returning to jail at any given time than peers with identical characteristics but no history of drug use.

We can also work with the Weibull model in an accelerated failure time framework,
which we can do by simply adding the `time`

option:
which is in fact the default in R. We'll use the `summary()`

function.
(An alternative is to use `tidy`

to produce a data frame.)

. streg `predictors', distrib(weibull) time failure _d: fail analysis time _t: durat Fitting constant-only model: Iteration 0: log likelihood = -1739.8944 Iteration 1: log likelihood = -1716.1367 Iteration 2: log likelihood = -1715.7712 Iteration 3: log likelihood = -1715.7711 Fitting full model: Iteration 0: log likelihood = -1715.7711 Iteration 1: log likelihood = -1669.1785 Iteration 2: log likelihood = -1634.3693 Iteration 3: log likelihood = -1633.0405 Iteration 4: log likelihood = -1633.0325 Iteration 5: log likelihood = -1633.0325 Weibull regression -- accelerated failure-time form No. of subjects = 1,445 Number of obs = 1,445 No. of failures = 552 Time at risk = 80013 LR chi2(10) = 165.48 Log likelihood = -1633.0325 Prob > chi2 = 0.0000 ------------------------------------------------------------------------------ _t | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- workprg | -.1127848 .1125346 -1.00 0.316 -.3333486 .107779 priors | -.1101757 .0170675 -6.46 0.000 -.1436273 -.0767241 tserved | -.0168297 .0021303 -7.90 0.000 -.021005 -.0126544 felon | .3716227 .1319951 2.82 0.005 .112917 .6303284 alcohol | -.555132 .1322427 -4.20 0.000 -.8143229 -.295941 drugs | -.3492654 .1218801 -2.87 0.004 -.5881461 -.1103847 black | -.5630162 .110817 -5.08 0.000 -.7802135 -.3458189 married | .1881041 .1357519 1.39 0.166 -.0779647 .4541729 educ | .0289111 .0241153 1.20 0.231 -.0183541 .0761763 age | .0046219 .0006648 6.95 0.000 .0033189 .0059249 _cons | 4.22167 .3413114 12.37 0.000 3.552712 4.890628 -------------+---------------------------------------------------------------- /ln_p | -.2158398 .0389149 -5.55 0.000 -.2921115 -.1395681 -------------+---------------------------------------------------------------- p | .8058644 .0313601 .7466852 .8697338 1/p | 1.240904 .0482896 1.149777 1.339252 ------------------------------------------------------------------------------

> summary(wf) Call: survreg(formula = mf, data = recid, dist = "weibull") Value Std. Error z p (Intercept) 4.22167 0.341311 12.37 3.85e-35 workprg -0.11278 0.112535 -1.00 3.16e-01 priors -0.11018 0.017067 -6.46 1.08e-10 tserved -0.01683 0.002130 -7.90 2.78e-15 felon 0.37162 0.131995 2.82 4.87e-03 alcohol -0.55513 0.132243 -4.20 2.69e-05 drugs -0.34927 0.121880 -2.87 4.16e-03 black -0.56302 0.110817 -5.08 3.76e-07 married 0.18810 0.135752 1.39 1.66e-01 educ 0.02891 0.024115 1.20 2.31e-01 age 0.00462 0.000665 6.95 3.60e-12 Log(scale) 0.21584 0.038915 5.55 2.92e-08 Scale= 1.24 Weibull distribution Loglik(model)= -3192.1 Loglik(intercept only)= -3274.8 Chisq= 165.48 on 10 degrees of freedom, p= 0 Number of Newton-Raphson Iterations: 5 n= 1445

The substantive results are the same as before, which is not surprising given that
it is really the same model. You may want to verity that the AFT parameters are
exactly the same as the PH parameters with opposite sign and divided by *p*.
For example the coefficient for drugs is -0.28/0.8 = -0.35.

The coefficients may be exponentiated, which allows interpreting them in terms
of *time ratios*.
You can do this in Stata using the `tr`

option. We will focus on one coefficient.

> exp(coef(wf)) (Intercept) workprg priors tserved felon alcohol 68.1472035 0.8933429 0.8956767 0.9833111 1.4500858 0.5739965 drugs black married educ age 0.7052060 0.5694888 1.2069592 1.0293331 1.0046326

Exponentiating the drug coefficient that see that former inmates with a history of drug use spend 29% less time out of prison than peers with the same observed characteristics by no history of drug use. This is easier to see substracting one

. di exp(_b[drugs]) - 1 -.29479404

> exp(coef(wf)["drugs"]) - 1 drugs -0.294794

We can also say that time outside of prison passes 42% faster for former inmates with a history of drug use than for those without, everything else being equal, which you can verify by changing signs and exponentiating

. di exp(-_b[drugs]) - 1 .41802546

> exp(-coef(wf)["drugs"]) - 1 drugs 0.4180255

The two interpretations are, of course, equivalent.

The Weibull model allows the hazard to increase or decrease with time, but at a constant rate. Wooldridge notes that the log-normal distribution provides a better fit to the data. We can fit a log-normal model with the same tools, just by changing the distribution to "lognormal".

. streg `predictors', distrib(lognormal) failure _d: fail analysis time _t: durat Fitting constant-only model: Iteration 0: log likelihood = -1999.58 Iteration 1: log likelihood = -1695.747 Iteration 2: log likelihood = -1681.0153 Iteration 3: log likelihood = -1680.4273 Iteration 4: log likelihood = -1680.427 Iteration 5: log likelihood = -1680.427 Fitting full model: Iteration 0: log likelihood = -1680.427 Iteration 1: log likelihood = -1608.1657 Iteration 2: log likelihood = -1597.1838 Iteration 3: log likelihood = -1597.0591 Iteration 4: log likelihood = -1597.059 Lognormal regression -- accelerated failure-time form No. of subjects = 1,445 Number of obs = 1,445 No. of failures = 552 Time at risk = 80013 LR chi2(10) = 166.74 Log likelihood = -1597.059 Prob > chi2 = 0.0000 ------------------------------------------------------------------------------ _t | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- workprg | -.0625714 .1200369 -0.52 0.602 -.2978394 .1726965 priors | -.1372528 .0214587 -6.40 0.000 -.179311 -.0951946 tserved | -.0193305 .0029779 -6.49 0.000 -.0251671 -.0134939 felon | .4439944 .1450865 3.06 0.002 .1596302 .7283586 alcohol | -.6349088 .1442165 -4.40 0.000 -.9175681 -.3522496 drugs | -.2981599 .1327355 -2.25 0.025 -.5583168 -.0380031 black | -.5427175 .1174427 -4.62 0.000 -.772901 -.312534 married | .3406835 .139843 2.44 0.015 .0665962 .6147707 educ | .0229195 .0253974 0.90 0.367 -.0268584 .0726975 age | .0039103 .0006062 6.45 0.000 .0027221 .0050984 _cons | 4.099386 .3475349 11.80 0.000 3.41823 4.780542 -------------+---------------------------------------------------------------- /ln_sig | .5935861 .0344122 17.25 0.000 .5261395 .6610327 -------------+---------------------------------------------------------------- sigma | 1.810469 .0623022 1.692386 1.936791 ------------------------------------------------------------------------------

We do not need to specify `time`

as this distribution is only available
in the AFT metric.

> lnf <- survreg(mf, data=recid, dist="lognormal") > summary(lnf) Call: survreg(formula = mf, data = recid, dist = "lognormal") Value Std. Error z p (Intercept) 4.09939 0.347535 11.796 4.11e-32 workprg -0.06257 0.120037 -0.521 6.02e-01 priors -0.13725 0.021459 -6.396 1.59e-10 tserved -0.01933 0.002978 -6.491 8.51e-11 felon 0.44399 0.145087 3.060 2.21e-03 alcohol -0.63491 0.144217 -4.402 1.07e-05 drugs -0.29816 0.132736 -2.246 2.47e-02 black -0.54272 0.117443 -4.621 3.82e-06 married 0.34068 0.139843 2.436 1.48e-02 educ 0.02292 0.025397 0.902 3.67e-01 age 0.00391 0.000606 6.450 1.12e-10 Log(scale) 0.59359 0.034412 17.249 1.13e-66 Scale= 1.81 Log Normal distribution Loglik(model)= -3156.1 Loglik(intercept only)= -3239.5 Chisq= 166.74 on 10 degrees of freedom, p= 0 Number of Newton-Raphson Iterations: 4 n= 1445

We see that the log-likelihood is indeed higher for the log-normal model, -1597.1 compared to -1633.0 -3156.1 compared to -3192.1 for the Weibull, so we now have a better fit to the data.

You may be interested to know that Stata and R report different log-likelihoods.
For example the log-normal log-likelihood is -1597.1 in Stata and -3156.1 in R.
This is because R works with the distribution of *time* and Stata with the distribution
of *log(time)*. The difference turns out to be the Jacobian of the transformation,
which is the sum of *log(t)* over all failures or 1559.1. Differences between
log-likelihood are not affected by this constant term.

Most of the effects are robust to the choice of distribution, but note that the protective effect of marriage is now significant. The coefficient for drugs, at -0.30, is smaller in magnituded and less significant than before.

Fitting a generalized gamma model leads to similar conclusions, except that the effect of drugs is no longer significant. This result suggest that there may be an interaction between drug history and duration, as the effect depends on how the hazard is specified. We will return to this issue later.

The generalized gama model is
available in Stata as part of `streg`

,
*not* available in R's `survival`

package, but you will find it in the package `flexsurv`

,
which also allows fitting Gompertz models.