
Recidivism in the U.S.

The dataset considered here is analyzed in Wooldridge (2002) and credited to Chung, Schmidt and Witte (1991). The data pertain to a random sample of convicts released from prison between July 1, 1977 and June 30, 1978. Of interest is the time until they return to prison. The information was collected retrospectively by looking at records in April 1984, so the maximum possible length of observation is 81 months. The data are available from the Stata website and can be accessed using the command

. use http://www.stata.com/data/jwooldridge/eacsap/recid

. desc, short

Contains data from http://www.stata.com/data/jwooldridge/eacsap/recid.dta
  obs:         1,445                          
 vars:            18                          28 Sep 1998 13:28
 size:        39,015 (99.6% of memory free)
Sorted by:  

You should have 1445 observations on 18 variables. The duration variable is called durat and represents time in months until return to prison or end of follow up. The censoring indicator is called cens and is coded 1 if the observation was censored (i.e. the individual had not returned to prison).

Setting the Data

Before using any of Stata's survival commands one has to stset the data. This tells Stata that we have duration data and specifies the time variable and the failure indicator. For this example we need to calculate the latter:

. gen fail = 1 - cens

. stset durat, failure(fail)

     failure event:  fail != 0 & fail < .
obs. time interval:  (0, durat]
 exit on or before:  failure

------------------------------------------------------------------------------
     1445  total obs.
        0  exclusions
------------------------------------------------------------------------------
     1445  obs. remaining, representing
      552  failures in single record/single failure data
    80013  total analysis time at risk, at risk from t =         0
                             earliest observed entry t =         0
                                  last observed exit t =        81

Stata runs a few sanity checks. Note that we have 552 events in 80,013 months of exposure, which gives a crude annual recidivism rate of 8.3% (or 8.3 events per 100 person-years of exposure). A crude rate should always be interpreted cautiously, particularly if there is duration dependence.
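
As a quick check on the arithmetic, and bearing in mind that durat is measured in months, the crude rate can be recomputed from the totals that stset reports. The sketch below types those totals in by hand; the stptime line assumes the data are still stset as above and asks for rates per 12 person-months, that is, per person-year.

```stata
* crude rate per person-month, from the stset totals
display 552/80013
* convert to an annual rate: 12 months per year
display 12*552/80013
* alternative: let Stata tabulate person-time and the rate per person-year
stptime, per(12)
```

The second display should come to about 0.083, or roughly 8.3 events per 100 person-years.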

A Proportional Hazards Weibull

Wooldridge fits a Weibull model using as predictors

workprg an indicator of participation in a work program
priors the number of previous convictions
tserved the time served rounded to months
felon an indicator of felony sentences
alcohol an indicator of alcohol problems
drugs an indicator of drug use history
black an indicator for African Americans
married an indicator if married when incarcerated
educ the number of years of schooling, and
age in months.

Let us first fit a proportional hazards model, which we can do using the streg command with the option distrib(weibull) to specify a Weibull distribution.

. local predictors workprg priors tserved felon alcohol drugs ///
>  black married educ age

.  streg `predictors', distrib(weibull)

         failure _d:  fail
   analysis time _t:  durat

Fitting constant-only model:

Iteration 0:   log likelihood = -1739.8944
Iteration 1:   log likelihood = -1716.1367
Iteration 2:   log likelihood = -1715.7712
Iteration 3:   log likelihood = -1715.7711

Fitting full model:

Iteration 0:   log likelihood = -1715.7711  
Iteration 1:   log likelihood = -1669.1785  
Iteration 2:   log likelihood = -1634.3693  
Iteration 3:   log likelihood = -1633.0405  
Iteration 4:   log likelihood = -1633.0325  
Iteration 5:   log likelihood = -1633.0325  

Weibull regression -- log relative-hazard form 

No. of subjects =         1445                     Number of obs   =      1445
No. of failures =          552
Time at risk    =        80013
                                                   LR chi2(10)     =    165.48
Log likelihood  =   -1633.0325                     Prob > chi2     =    0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     workprg |   1.095148   .0992728     1.00   0.316     .9168814    1.308074
      priors |   1.092848    .014683     6.61   0.000     1.064445    1.122008
     tserved |   1.013655   .0017037     8.07   0.000     1.010321       1.017
       felon |   .7412054   .0785485    -2.83   0.005     .6021898    .9123128
     alcohol |   1.564179    .165389     4.23   0.000     1.271406     1.92437
       drugs |   1.325064   .1296765     2.88   0.004     1.093791    1.605237
       black |   1.574149   .1390031     5.14   0.000      1.32398    1.871587
     married |   .8593436   .0938794    -1.39   0.165     .6937084    1.064527
        educ |   .9769709   .0189724    -1.20   0.230     .9404845    1.014873
         age |   .9962823    .000523    -7.09   0.000     .9952577     .997308
-------------+----------------------------------------------------------------
       /ln_p |  -.2158398   .0389149    -5.55   0.000    -.2921115   -.1395681
-------------+----------------------------------------------------------------
           p |   .8058644   .0313601                      .7466852    .8697338
         1/p |   1.240904   .0482896                      1.149777    1.339252
------------------------------------------------------------------------------

Note that we do not specify the outcome, as this has been done with stset; we just list the predictors.

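The Weibull parameter p is 0.8, indicating that the risk of recidivism declines over time. The hazard is proportional to t^(p-1), or roughly t^(-0.19), so doubling the time since release lowers the risk by about 13%. The hypothesis that the risk is constant over time (p = 1) would be soundly rejected.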
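
To see what p = 0.8 implies for the shape of the hazard, recall that under the Weibull model the hazard is proportional to t^(p-1). A minimal sketch, typing in the estimate of p from the output above rather than reading it from stored results:

```stata
* exponent of time in the Weibull hazard, p - 1
display .8058644 - 1
* ratio of the hazard at time 2t to the hazard at time t, for any t
display 2^(.8058644 - 1)
* ratio of the hazard at month 12 to the hazard at month 1
display 12^(.8058644 - 1)
```

The second line gives about 0.87, so each doubling of the time since release cuts the risk by roughly 13%. The third gives about 0.62, the risk at one year relative to the risk at one month.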

By default Stata exponentiates the coefficients to show hazard ratios. Use the option nohr, for no hazard ratios, to obtain the coefficients. This can be done by issuing the streg command with no predictors, which replays the last fit, and reproduces Table 20.1 in Wooldridge:

. streg, nohr

Weibull regression -- log relative-hazard form 

No. of subjects =         1445                     Number of obs   =      1445
No. of failures =          552
Time at risk    =        80013
                                                   LR chi2(10)     =    165.48
Log likelihood  =   -1633.0325                     Prob > chi2     =    0.0000

------------------------------------------------------------------------------
          _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     workprg |   .0908893   .0906478     1.00   0.316    -.0867772    .2685558
      priors |   .0887867   .0134355     6.61   0.000     .0624535    .1151198
     tserved |   .0135625   .0016808     8.07   0.000     .0102682    .0168567
       felon |  -.2994775    .105974    -2.83   0.005    -.5071826   -.0917723
     alcohol |   .4473611   .1057353     4.23   0.000     .2401236    .6545985
       drugs |   .2814605   .0978644     2.88   0.004     .0896499    .4732711
       black |   .4537147   .0883037     5.14   0.000     .2806426    .6267867
     married |  -.1515864   .1092454    -1.39   0.165    -.3657035    .0625307
        educ |  -.0232984   .0194196    -1.20   0.230    -.0613601    .0147633
         age |  -.0037246    .000525    -7.09   0.000    -.0047536   -.0026956
       _cons |  -3.402094   .3010177   -11.30   0.000    -3.992077    -2.81211
-------------+----------------------------------------------------------------
       /ln_p |  -.2158398   .0389149    -5.55   0.000    -.2921115   -.1395681
-------------+----------------------------------------------------------------
           p |   .8058644   .0313601                      .7466852    .8697338
         1/p |   1.240904   .0482896                      1.149777    1.339252
------------------------------------------------------------------------------

All but three of the predictors have significant effects on recidivism, the exceptions being participation in a work program, marital status and education.

The coefficient of drugs indicates that former inmates with a history of drug use have a 33% higher risk of returning to prison at any given time than peers with identical characteristics but no history of drug use.
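
The percentage quoted for drugs comes from exponentiating the coefficient. A small sketch, assuming the proportional hazards Weibull fit above is still the most recent estimation, so that _b[drugs] refers to it:

```stata
* hazard ratio for a history of drug use
display exp(_b[drugs])
* percent difference in the hazard relative to no drug history
display 100*(exp(_b[drugs]) - 1)
* the same hazard ratio with a confidence interval
lincom drugs, hr
```

The first value should match the hazard ratio of 1.325 reported for drugs in the first table.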

Accelerated Failure Time Weibull

Let us fit the Weibull model in the accelerated failure time framework. We can do this simply by adding the time option:

. streg `predictors', distrib(weibull) time

         failure _d:  fail
   analysis time _t:  durat

Fitting constant-only model:

Iteration 0:   log likelihood = -1739.8944
Iteration 1:   log likelihood = -1716.1367
Iteration 2:   log likelihood = -1715.7712
Iteration 3:   log likelihood = -1715.7711

Fitting full model:

Iteration 0:   log likelihood = -1715.7711  
Iteration 1:   log likelihood = -1669.1785  
Iteration 2:   log likelihood = -1634.3693  
Iteration 3:   log likelihood = -1633.0405  
Iteration 4:   log likelihood = -1633.0325  
Iteration 5:   log likelihood = -1633.0325  

Weibull regression -- accelerated failure-time form 

No. of subjects =         1445                     Number of obs   =      1445
No. of failures =          552
Time at risk    =        80013
                                                   LR chi2(10)     =    165.48
Log likelihood  =   -1633.0325                     Prob > chi2     =    0.0000

------------------------------------------------------------------------------
          _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     workprg |  -.1127848   .1125346    -1.00   0.316    -.3333486     .107779
      priors |  -.1101757   .0170675    -6.46   0.000    -.1436273   -.0767241
     tserved |  -.0168297   .0021303    -7.90   0.000     -.021005   -.0126544
       felon |   .3716227   .1319951     2.82   0.005      .112917    .6303284
     alcohol |   -.555132   .1322427    -4.20   0.000    -.8143229    -.295941
       drugs |  -.3492654   .1218801    -2.87   0.004    -.5881461   -.1103847
       black |  -.5630162    .110817    -5.08   0.000    -.7802135   -.3458189
     married |   .1881041   .1357519     1.39   0.166    -.0779647    .4541729
        educ |   .0289111   .0241153     1.20   0.231    -.0183541    .0761763
         age |   .0046219   .0006648     6.95   0.000     .0033189    .0059249
       _cons |    4.22167   .3413114    12.37   0.000     3.552712    4.890628
-------------+----------------------------------------------------------------
       /ln_p |  -.2158398   .0389149    -5.55   0.000    -.2921115   -.1395681
-------------+----------------------------------------------------------------
           p |   .8058644   .0313601                      .7466852    .8697338
         1/p |   1.240904   .0482896                      1.149777    1.339252
------------------------------------------------------------------------------

By default Stata does not exponentiate the coefficients in AFT models. You can exponentiate them using the option tr, which stands for time ratios.
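
As a brief illustration, the fit can be replayed with exponentiated coefficients. This sketch assumes the AFT Weibull model above is the most recent estimation:

```stata
* redisplay the last streg fit, reporting time ratios instead of coefficients
streg, tr
```

The entry for drugs should be exp(-0.349), or about 0.705.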

The substantive results are the same as before, which is not surprising because we have fitted exactly the same model. You may want to verify that the AFT parameters are exactly the same as the PH parameters with opposite sign and divided by p. For example the coefficient for drugs is -0.28/0.8 = -0.35.
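
One way to carry out this check is to save the proportional hazards coefficient and the estimate of p before refitting in the AFT metric. The sketch below assumes the data are loaded and stset as above. The scalar names b_ph and p_hat are made up for the illustration, and e(aux_p) is assumed to be where streg leaves the Weibull shape parameter (ereturn list will show what is actually stored).

```stata
local predictors workprg priors tserved felon alcohol drugs black married educ age
* fit in the log relative-hazard metric and save what we need
quietly streg `predictors', distrib(weibull) nohr
scalar b_ph  = _b[drugs]
* assumed name of the stored shape parameter
scalar p_hat = e(aux_p)
* refit in the accelerated failure-time metric
quietly streg `predictors', distrib(weibull) time
* the two numbers displayed should agree
display -b_ph/p_hat
display _b[drugs]
```

The same comparison can be repeated for any other predictor by changing the name inside _b[].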

However, we have two new interpretations of these effects. Exponentiating the drug coefficient we see that former inmates with a history of drug use spend 29% less time out of prison than peers with the same characteristics but no history of drug use. This is because

. di 1-exp(_b[drugs])
.29479404

Also, we can say that time outside of prison passes 42% faster for former inmates with a history of drug use than for those without, everything else being equal. (So they get into trouble more quickly.) This is because

. di exp(-_b[drugs])
1.4180255

A Log-Normal AFT Model

The Weibull allows the hazard to increase or decrease with time, but only monotonically. Wooldridge notes that the log-normal distribution provides a better fit to the data. We can fit a log-normal in Stata just by changing the distrib option to lognormal:

. streg `predictors', distrib(lognormal)

         failure _d:  fail
   analysis time _t:  durat

Fitting constant-only model:

Iteration 0:   log likelihood =   -1999.58  
Iteration 1:   log likelihood =  -1695.747  
Iteration 2:   log likelihood = -1681.0153  
Iteration 3:   log likelihood = -1680.4273  
Iteration 4:   log likelihood =  -1680.427  
Iteration 5:   log likelihood =  -1680.427  

Fitting full model:

Iteration 0:   log likelihood =  -1680.427  
Iteration 1:   log likelihood = -1608.1657  
Iteration 2:   log likelihood = -1597.1838  
Iteration 3:   log likelihood = -1597.0591  
Iteration 4:   log likelihood =  -1597.059  

Lognormal regression -- accelerated failure-time form 

No. of subjects =         1445                     Number of obs   =      1445
No. of failures =          552
Time at risk    =        80013
                                                   LR chi2(10)     =    166.74
Log likelihood  =    -1597.059                     Prob > chi2     =    0.0000

------------------------------------------------------------------------------
          _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     workprg |  -.0625714   .1200369    -0.52   0.602    -.2978394    .1726965
      priors |  -.1372528   .0214587    -6.40   0.000     -.179311   -.0951946
     tserved |  -.0193305   .0029779    -6.49   0.000    -.0251671   -.0134939
       felon |   .4439944   .1450865     3.06   0.002     .1596302    .7283586
     alcohol |  -.6349088   .1442165    -4.40   0.000    -.9175681   -.3522496
       drugs |  -.2981599   .1327355    -2.25   0.025    -.5583168   -.0380031
       black |  -.5427175   .1174427    -4.62   0.000     -.772901    -.312534
     married |   .3406835    .139843     2.44   0.015     .0665962    .6147707
        educ |   .0229195   .0253974     0.90   0.367    -.0268584    .0726975
         age |   .0039103   .0006062     6.45   0.000     .0027221    .0050984
       _cons |   4.099386   .3475349    11.80   0.000      3.41823    4.780542
-------------+----------------------------------------------------------------
     /ln_sig |   .5935861   .0344122    17.25   0.000     .5261395    .6610327
-------------+----------------------------------------------------------------
       sigma |   1.810469   .0623022                      1.692386    1.936791
------------------------------------------------------------------------------

We do not need to specify time, as this distribution is available in Stata only in the AFT framework.

We see that the log-likelihood is indeed higher, -1597.1 compared to -1633.0 for the Weibull, so the model provides a better fit to the data.
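
Because the Weibull and log-normal models here have the same number of parameters (ten coefficients, a constant and one ancillary parameter), comparing log-likelihoods is equivalent to comparing AIC. A sketch using stored estimates, assuming the data are stset as above; the names weib and lnorm are arbitrary labels chosen for this example:

```stata
local predictors workprg priors tserved felon alcohol drugs black married educ age
quietly streg `predictors', distrib(weibull) time
estimates store weib
quietly streg `predictors', distrib(lognormal)
estimates store lnorm
* log-likelihood, degrees of freedom, AIC and BIC side by side
estimates stats weib lnorm
```

The model with the smaller AIC is preferred, which here should again be the log-normal.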

Most of the effects are robust to the choice of distribution, but note that the protective effect of marriage is now significant. The coefficient for drugs, at -0.30, is smaller in magnitude and less significant than before.

The command stcurve can plot some aspects of the fit. Try the hazard option to have a look at the log-normal hazard evaluated at the mean of all predictors. You'll see that it rises very rapidly in the first seven months or so and then declines.
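
A minimal sketch of the plotting step, assuming the data are stset as above. The model is refit quietly so that the log-normal is the active estimation when stcurve is called:

```stata
local predictors workprg priors tserved felon alcohol drugs black married educ age
quietly streg `predictors', distrib(lognormal)
* fitted hazard, evaluated at the mean of the predictors
stcurve, hazard
* fitted survival function, for comparison
stcurve, survival
```

Analysis time on the horizontal axis is in months, the unit of durat.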

Fitting a generalized gamma model leads to similar conclusions except that the effect of drugs loses significance. These results suggest that there may be an interaction between drug history and duration, as the effect depends on how the hazard is specified. We will return to this issue.
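
For readers who want to try the generalized gamma fit themselves, a sketch follows. The spelling of the distribution name is an assumption that depends on the Stata release: recent versions document it as ggamma, while older ones, such as the release that produced the output above, call it gamma.

```stata
local predictors workprg priors tserved felon alcohol drugs black married educ age
* generalized gamma AFT model; in older releases use distrib(gamma) instead
streg `predictors', distrib(ggamma)
```

The output reports an additional shape parameter, kappa. In Stata's parameterization the log-normal corresponds to kappa = 0 and the Weibull to kappa = 1, so its estimate gives a rough indication of which of the two simpler models is closer to the data.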