Germán Rodríguez
Survival Analysis Princeton University

Recidivism in the U.S.

The dataset considered here is analyzed in Wooldridge (2002) and credited to Chung, Schmidt and Witte (1991). The data pertain to a random sample of convicts released from prison between July 1, 1977 and June 30, 1978. Of interest is the time until they return to prison. The information was collected retrospectively by looking at records in April 1984, so the maximum possible length of observation is 81 months.

Wooldridge fits Weibull and log-normal models using the following predictors

workprgan indicator of participation in a work program
priorsthe number of previous convictions
tservedthe time served rounded to months
felonand indicator of felony sentences
alcoholan indicator of alcohol problems
drugsan indicator of drug use history
black an indicator for African Americans
marriedan indicator if married when incarcerated
educthe number of years of schooling, and
agein months.

We will reproduce his results using R and Stata. The data are available in binary format from the Stata website and consists of 1445 observations on 18 variables. The duration variable is called durat and represents time in months until return to prison or end of follow up. The censoring indicator is called cens and is coded 1 if the observation was censored (i.e. the individual had not returned to prison).

We create a new variable fail coded 1 for failures.

. use http://www.stata.com/data/jwooldridge/eacsap/recid

. desc, short

Contains data from http://www.stata.com/data/jwooldridge/eacsap/recid.dta
  obs:         1,445                          
 vars:            18                          28 Sep 1998 13:28
 size:        33,235                          
Sorted by: 

. gen fail = 1 - cens
> library(foreign)

> recid <- read.dta("http://www.stata.com/data/jwooldridge/eacsap/recid.dta")

> recid <- mutate(recid, fail = 1 - cens)

> head(recid)
  durat cens workprg priors tserved felon alcohol drugs black married educ age fail
1    72    1       1      0      30     0       1     0     0       1    7 441    0
2    75    1       1      0      19     1       0     0     1       0   12 307    0
3     9    0       1      0      27     0       0     0     0       0    9 262    1
4    25    0       1      2      38     1       0     1     0       0    9 253    1
5    81    1       0      0       4     0       0     1     0       0    9 244    0
6    79    1       1      1      13     0       0     0     1       0   12 277    0

A Proportional Hazards Weibull Model

Let us first fit a proportional hazards model with a Weibull baseline, using stset to set the data and survreg to fit the model. To avoid repetition we will save the predictors in a macro. which we can do using survreg with dist set to "weibull". For convenience we save the model formula so we can reuse it later

. stset durat, failure(fail)

     failure event:  fail != 0 & fail < .
obs. time interval:  (0, durat]
 exit on or before:  failure

------------------------------------------------------------------------------
       1445  total observations
          0  exclusions
------------------------------------------------------------------------------
       1445  observations remaining, representing
        552  failures in single-record/single-failure data
      80013  total analysis time at risk and under observation
                                                at risk from t =         0
                                     earliest observed entry t =         0
                                          last observed exit t =        81

. local predictors workprg priors tserved felon alcohol drugs ///
>         black married educ age

. streg `predictors', distrib(weibull)

         failure _d:  fail
   analysis time _t:  durat

Fitting constant-only model:

Iteration 0:   log likelihood = -1739.8944
Iteration 1:   log likelihood = -1716.1367
Iteration 2:   log likelihood = -1715.7712
Iteration 3:   log likelihood = -1715.7711

Fitting full model:

Iteration 0:   log likelihood = -1715.7711  
Iteration 1:   log likelihood = -1669.1785  
Iteration 2:   log likelihood = -1634.3693  
Iteration 3:   log likelihood = -1633.0405  
Iteration 4:   log likelihood = -1633.0325  
Iteration 5:   log likelihood = -1633.0325  

Weibull regression -- log relative-hazard form 

No. of subjects =        1,445                  Number of obs    =       1,445
No. of failures =          552
Time at risk    =        80013
                                                LR chi2(10)      =      165.48
Log likelihood  =   -1633.0325                  Prob > chi2      =      0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     workprg |   1.095148   .0992728     1.00   0.316     .9168814    1.308074
      priors |   1.092848    .014683     6.61   0.000     1.064445    1.122008
     tserved |   1.013655   .0017037     8.07   0.000     1.010321       1.017
       felon |   .7412054   .0785485    -2.83   0.005     .6021898    .9123128
     alcohol |   1.564179    .165389     4.23   0.000     1.271406     1.92437
       drugs |   1.325064   .1296765     2.88   0.004     1.093791    1.605237
       black |   1.574149   .1390031     5.14   0.000      1.32398    1.871587
     married |   .8593436   .0938794    -1.39   0.165     .6937084    1.064527
        educ |   .9769709   .0189724    -1.20   0.230     .9404845    1.014873
         age |   .9962823    .000523    -7.09   0.000     .9952577     .997308
       _cons |   .0333035   .0100249   -11.30   0.000     .0184613    .0600781
-------------+----------------------------------------------------------------
       /ln_p |  -.2158398   .0389149    -5.55   0.000    -.2921115   -.1395681
-------------+----------------------------------------------------------------
           p |   .8058644   .0313601                      .7466852    .8697338
         1/p |   1.240904   .0482896                      1.149777    1.339252
------------------------------------------------------------------------------

By default Stata exponentiates the coefficients to show relative risks. Use the option nohr for no hazard ratios to obtain the coefficients.

. streg, nohr

Weibull regression -- log relative-hazard form 

No. of subjects =        1,445                  Number of obs    =       1,445
No. of failures =          552
Time at risk    =        80013
                                                LR chi2(10)      =      165.48
Log likelihood  =   -1633.0325                  Prob > chi2      =      0.0000

------------------------------------------------------------------------------
          _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     workprg |   .0908893   .0906478     1.00   0.316    -.0867772    .2685558
      priors |   .0887867   .0134355     6.61   0.000     .0624535    .1151198
     tserved |   .0135625   .0016808     8.07   0.000     .0102682    .0168567
       felon |  -.2994775    .105974    -2.83   0.005    -.5071826   -.0917723
     alcohol |   .4473611   .1057353     4.23   0.000     .2401236    .6545985
       drugs |   .2814605   .0978644     2.88   0.004     .0896499    .4732711
       black |   .4537147   .0883037     5.14   0.000     .2806426    .6267867
     married |  -.1515864   .1092454    -1.39   0.165    -.3657035    .0625307
        educ |  -.0232984   .0194196    -1.20   0.230    -.0613601    .0147633
         age |  -.0037246    .000525    -7.09   0.000    -.0047536   -.0026956
       _cons |  -3.402094   .3010177   -11.30   0.000    -3.992077    -2.81211
-------------+----------------------------------------------------------------
       /ln_p |  -.2158398   .0389149    -5.55   0.000    -.2921115   -.1395681
-------------+----------------------------------------------------------------
           p |   .8058644   .0313601                      .7466852    .8697338
         1/p |   1.240904   .0482896                      1.149777    1.339252
------------------------------------------------------------------------------
> require(survival)

> mf <- Surv(durat, fail) ~  workprg + priors + tserved + felon +
+   alcohol + drugs + black + married + educ + age

> wf <- survreg(mf, data=recid, dist="weibull")

By default R reports coefficients in the accelerated failure time metric. I wrote a simple function to convert them to a proportional hazards metric, which you can download from this website.(I also work with p = 1/scale.)

> source("http://data.princeton.edu/pop509/aft2ph.txt")

> phfit <- aft2ph(wf); phfit
          term     estimate    std.error  statistic      p.value
1  (Intercept) -3.402093509 0.3010177314 -11.301970 1.283063e-29
2      workprg  0.090889277 0.0906478280   1.002664 3.160232e-01
3       priors  0.088786675 0.0134355280   6.608350 3.886262e-11
4      tserved  0.013562467 0.0016807811   8.069145 7.079190e-16
5        felon -0.299477494 0.1059739596  -2.825954 4.714009e-03
6      alcohol  0.447361071 0.1057353368   4.230951 2.327050e-05
7        drugs  0.281460523 0.0978643554   2.876027 4.027153e-03
8        black  0.453714680 0.0883036888   5.138117 2.775052e-07
9      married -0.151586401 0.1092454397  -1.387576 1.652661e-01
10        educ -0.023298411 0.0194196040  -1.199737 2.302416e-01
11         age -0.003724612 0.0005249981  -7.094526 1.297959e-12
12       log.p -0.215839833 0.0389148545  -5.546464 2.915049e-08

This reproduces Table 20.1 in Wooldridge. All but three of the predictors have significant coefficients, the exceptions being participation in a work program, marital status, and education.

To obtain hazard ratios we exponentiate the coefficients. We also report the Weibull parameter p, while R uses scale = 1/p.

> transmute(phfit, term, hazard.ratio = exp(estimate))
          term hazard.ratio
1  (Intercept)   0.03330348
2      workprg   1.09514774
3       priors   1.09284750
4      tserved   1.01365485
5        felon   0.74120540
6      alcohol   1.56417898
7        drugs   1.32506369
8        black   1.57414880
9      married   0.85934363
10        educ   0.97697090
11         age   0.99628232
12       log.p   0.80586436

> exp(tail(phfit,1)$estimate)
[1] 0.8058644

The Weibull parameter p is 0.8, indicating that the risk of recidivism declines over time, about 21% per month. The hypothesis that the risk is constant over time would be soundly rejected.

The exponentiated coefficient of drugs indicates that former inmates with a history of drug use have 31% higher risk or returning to jail at any given time that peers with identical characteristics but no history of drug use.

Accelerated Failure Time Weibull

We can also work with the Weibull model in an accelerated failure time framework, which we can do by simply adding the time option: which is in fact the default in R. We'll use the summary() function. (An alternative is tidy, but the current version fails with survreg objects.)

. streg `predictors', distrib(weibull) time

         failure _d:  fail
   analysis time _t:  durat

Fitting constant-only model:

Iteration 0:   log likelihood = -1739.8944
Iteration 1:   log likelihood = -1716.1367
Iteration 2:   log likelihood = -1715.7712
Iteration 3:   log likelihood = -1715.7711

Fitting full model:

Iteration 0:   log likelihood = -1715.7711  
Iteration 1:   log likelihood = -1669.1785  
Iteration 2:   log likelihood = -1634.3693  
Iteration 3:   log likelihood = -1633.0405  
Iteration 4:   log likelihood = -1633.0325  
Iteration 5:   log likelihood = -1633.0325  

Weibull regression -- accelerated failure-time form 

No. of subjects =        1,445                  Number of obs    =       1,445
No. of failures =          552
Time at risk    =        80013
                                                LR chi2(10)      =      165.48
Log likelihood  =   -1633.0325                  Prob > chi2      =      0.0000

------------------------------------------------------------------------------
          _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     workprg |  -.1127848   .1125346    -1.00   0.316    -.3333486     .107779
      priors |  -.1101757   .0170675    -6.46   0.000    -.1436273   -.0767241
     tserved |  -.0168297   .0021303    -7.90   0.000     -.021005   -.0126544
       felon |   .3716227   .1319951     2.82   0.005      .112917    .6303284
     alcohol |   -.555132   .1322427    -4.20   0.000    -.8143229    -.295941
       drugs |  -.3492654   .1218801    -2.87   0.004    -.5881461   -.1103847
       black |  -.5630162    .110817    -5.08   0.000    -.7802135   -.3458189
     married |   .1881041   .1357519     1.39   0.166    -.0779647    .4541729
        educ |   .0289111   .0241153     1.20   0.231    -.0183541    .0761763
         age |   .0046219   .0006648     6.95   0.000     .0033189    .0059249
       _cons |    4.22167   .3413114    12.37   0.000     3.552712    4.890628
-------------+----------------------------------------------------------------
       /ln_p |  -.2158398   .0389149    -5.55   0.000    -.2921115   -.1395681
-------------+----------------------------------------------------------------
           p |   .8058644   .0313601                      .7466852    .8697338
         1/p |   1.240904   .0482896                      1.149777    1.339252
------------------------------------------------------------------------------
> summary(wf)

Call:
survreg(formula = mf, data = recid, dist = "weibull")
               Value Std. Error     z        p
(Intercept)  4.22167   0.341311 12.37 3.85e-35
workprg     -0.11278   0.112535 -1.00 3.16e-01
priors      -0.11018   0.017067 -6.46 1.08e-10
tserved     -0.01683   0.002130 -7.90 2.78e-15
felon        0.37162   0.131995  2.82 4.87e-03
alcohol     -0.55513   0.132243 -4.20 2.69e-05
drugs       -0.34927   0.121880 -2.87 4.16e-03
black       -0.56302   0.110817 -5.08 3.76e-07
married      0.18810   0.135752  1.39 1.66e-01
educ         0.02891   0.024115  1.20 2.31e-01
age          0.00462   0.000665  6.95 3.60e-12
Log(scale)   0.21584   0.038915  5.55 2.92e-08

Scale= 1.24 

Weibull distribution
Loglik(model)= -3192.1   Loglik(intercept only)= -3274.8
    Chisq= 165.48 on 10 degrees of freedom, p= 0 
Number of Newton-Raphson Iterations: 5 
n= 1445 

The substantive results are the same as before, which is not surprising given that it is really the same model. You may want to verity that the AFT parameters are exactly the same as the PH parameters with opposite sign and divided by p. For example the coefficient for drugs is -0.28/0.8 = -0.35.

The coefficients may be exponentiated, which allows interpreting them in terms of time ratios. You can do this in Stata using the tr option. We will focus on one coefficient.

> exp(coef(wf))
(Intercept)     workprg      priors     tserved       felon     alcohol 
 68.1472035   0.8933429   0.8956767   0.9833111   1.4500858   0.5739965 
      drugs       black     married        educ         age 
  0.7052060   0.5694888   1.2069592   1.0293331   1.0046326 

Exponentiating the drug coefficient that see that former inmates with a history of drug use spend 29% less time out of prison than peers with the same observed characteristics by no history of drug use. This is easier to see substracting one

. di exp(_b[drugs]) - 1
-.29479404
> exp(coef(wf)["drugs"]) - 1
    drugs 
-0.294794 

We can also say that time outside of prison passes 42% faster for former inmates with a history of drug use than for those without, everything else being equal, which you can verify by changing signs and exponentiating

. di exp(-_b[drugs]) - 1
.41802546
>  exp(-coef(wf)["drugs"]) - 1
    drugs 
0.4180255 

The two interpretations are, of course, equivalent.

A Log-Normal AFT Model

The Weibull model allows the hazard to increase or decrease with time, but at a constant rate. Wooldridge notes that the log-normal distribution provides a better fit to the data. We can fit a log-normal model with the same tools, just by changing the distribution to "lognormal".

. streg `predictors', distrib(lognormal)

         failure _d:  fail
   analysis time _t:  durat

Fitting constant-only model:

Iteration 0:   log likelihood =   -1999.58  
Iteration 1:   log likelihood =  -1695.747  
Iteration 2:   log likelihood = -1681.0153  
Iteration 3:   log likelihood = -1680.4273  
Iteration 4:   log likelihood =  -1680.427  
Iteration 5:   log likelihood =  -1680.427  

Fitting full model:

Iteration 0:   log likelihood =  -1680.427  
Iteration 1:   log likelihood = -1608.1657  
Iteration 2:   log likelihood = -1597.1838  
Iteration 3:   log likelihood = -1597.0591  
Iteration 4:   log likelihood =  -1597.059  

Lognormal regression -- accelerated failure-time form 

No. of subjects =        1,445                  Number of obs    =       1,445
No. of failures =          552
Time at risk    =        80013
                                                LR chi2(10)      =      166.74
Log likelihood  =    -1597.059                  Prob > chi2      =      0.0000

------------------------------------------------------------------------------
          _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     workprg |  -.0625714   .1200369    -0.52   0.602    -.2978394    .1726965
      priors |  -.1372528   .0214587    -6.40   0.000     -.179311   -.0951946
     tserved |  -.0193305   .0029779    -6.49   0.000    -.0251671   -.0134939
       felon |   .4439944   .1450865     3.06   0.002     .1596302    .7283586
     alcohol |  -.6349088   .1442165    -4.40   0.000    -.9175681   -.3522496
       drugs |  -.2981599   .1327355    -2.25   0.025    -.5583168   -.0380031
       black |  -.5427175   .1174427    -4.62   0.000     -.772901    -.312534
     married |   .3406835    .139843     2.44   0.015     .0665962    .6147707
        educ |   .0229195   .0253974     0.90   0.367    -.0268584    .0726975
         age |   .0039103   .0006062     6.45   0.000     .0027221    .0050984
       _cons |   4.099386   .3475349    11.80   0.000      3.41823    4.780542
-------------+----------------------------------------------------------------
     /ln_sig |   .5935861   .0344122    17.25   0.000     .5261395    .6610327
-------------+----------------------------------------------------------------
       sigma |   1.810469   .0623022                      1.692386    1.936791
------------------------------------------------------------------------------

We do not need to specify time as this distribution is only available in the AFT metric.

> lnf <- survreg(mf, data=recid, dist="lognormal")  

> summary(lnf)

Call:
survreg(formula = mf, data = recid, dist = "lognormal")
               Value Std. Error      z        p
(Intercept)  4.09939   0.347535 11.796 4.11e-32
workprg     -0.06257   0.120037 -0.521 6.02e-01
priors      -0.13725   0.021459 -6.396 1.59e-10
tserved     -0.01933   0.002978 -6.491 8.51e-11
felon        0.44399   0.145087  3.060 2.21e-03
alcohol     -0.63491   0.144217 -4.402 1.07e-05
drugs       -0.29816   0.132736 -2.246 2.47e-02
black       -0.54272   0.117443 -4.621 3.82e-06
married      0.34068   0.139843  2.436 1.48e-02
educ         0.02292   0.025397  0.902 3.67e-01
age          0.00391   0.000606  6.450 1.12e-10
Log(scale)   0.59359   0.034412 17.249 1.13e-66

Scale= 1.81 

Log Normal distribution
Loglik(model)= -3156.1   Loglik(intercept only)= -3239.5
    Chisq= 166.74 on 10 degrees of freedom, p= 0 
Number of Newton-Raphson Iterations: 4 
n= 1445 

We see that the log-likelihood is indeed higher for the log-normal model, -1597.1 compared to -1633.0 -3156.1 compared to -3192.1 for the Weibull, so we now have a better fit to the data.

You may be interested to know that Stata and R report different log-likelihoods. For example the log-normal log-likelihood is -1597.1 in Stata and -3156.1 in R. This is because R works with the distribution of time and Stata with the distribution of log(time). The difference turns out to be the Jacobian of the transformation, which is the sum of log(t) over all failures or 1559.1. Differences between log-likelihood are not affected by this constant term.

Most of the effects are robust to the choice of distribution, but note that the protective effect of marriage is now significant. The coefficient for drugs, at -0.30, is smaller in magnituded and less significant than before.

Other Parametric Models

Fitting a generalized gamma model leads to similar conclusions, except that the effect of drugs is no longer significant. This result suggest that there may be an interaction between drug history and duration, as the effect depends on how the hazard is specified. We will return to this issue later.

The generalized gama model is available in Stata as part of streg, not available in R's survival package, but you will find it in the package flexsurv, which also allows fitting Gompertz models.