6.5 Models for Ordinal Response Data Table of Contents

7  Survival Models in Stata

Stata has extensive facilities for fitting survival models. We will discuss only the use of Poisson regression to fit piece-wise exponential survival models.

7.5  Infant and Child Mortality in Colombia

The datasets page has the original Somoza tabulation of children by sex, cohort, age and survival status (dead or still alive at interview). We will read the data and collapse over sex, and then compute events and exposure to reproduce Table 7.1 in the lecture notes.

. infile sex cohort age dead alive using ///
>         http://data.princeton.edu/wws509/datasets/somoza.raw
(48 observations read)

. collapse (sum) alive (sum) dead, by (cohort age)

We make sure the data are sorted by cohort and then age, use egen to count the total number of children in each cohort, and then use replace with a by cohort prefix to [re]compute the number who start an age other than the first as the number who started the previous age minus those still alive at the previous age and minus those who died at the previous age. Having done this we can drop kids older than 10, as we are only interested in survival to age ten:

. sort cohort age // make sure data are sorted

. egen start = total(alive+dead), by(cohort) 

. by cohort: replace start = start[_n-1] - alive[_n-1] - dead[_n-1] if _n > 1
(21 real changes made)

. drop if age > 7
(3 observations deleted)

The next step is to use recode to generate a variable representing the width of the age intervals in months. We then use generate to compute exposure, assuming everyone is exposed the full width of the interval except those censored or who die in the interval, who are exposed on average half the interval. Note that we divide by 12 to express exposure in person-years.

. recode age 4=6 5=12 6=36 7=60, gen(width) // interval widths in months
(12 differences between age and width)

. gen exposure = width * (start - 0.5 * (alive + dead)) / 12  // in years

Finally we list the results. For convenience we rename dead to deaths and set a format so exposure prints with one decimal. The results coincide with Table 7.1 in the notes.

. rename dead deaths

. format expo %8.1f

. list cohort age deaths expo, sep(7)

     +----------------------------------+
     | cohort   age   deaths   exposure |
     |----------------------------------|
  1. |      1     1      168      278.4 |
  2. |      1     2       48      538.8 |
  3. |      1     3       63      794.4 |
  4. |      1     4       89     1550.8 |
  5. |      1     5      102     3006.0 |
  6. |      1     6       81     8743.5 |
  7. |      1     7       40    14270.0 |
     |----------------------------------|
  8. |      2     1      197      403.2 |
  9. |      2     2       48      786.0 |
 10. |      2     3       62     1165.3 |
 11. |      2     4       81     2294.8 |
 12. |      2     5       97     4500.5 |
 13. |      2     6      103    13201.5 |
 14. |      2     7       39    19525.0 |
     |----------------------------------|
 15. |      3     1      195      495.3 |
 16. |      3     2       55      956.7 |
 17. |      3     3       58     1381.4 |
 18. |      3     4       85     2604.5 |
 19. |      3     5       87     4618.5 |
 20. |      3     6       70     9814.5 |
 21. |      3     7       10     5802.5 |
     +----------------------------------+

We will label the variables for reference (and for use in the last section)

. label define age 1 "0-1m" 2 "1-3m" 3 "3-6m" 4 "6-12m" 5 "1-2y" 6 "2-5y" 7"5-1
> 0y"

. label values age age

. label define cohort 1 "1941-59" 2 "1960-67" 3 "1968-76"

. label values cohort cohort

Lets us calculate the logarithm of exposure (to be used as an offset) as well as the usual dummy variables for age and cohort. The quickest way is to use tab,gen() but we want more informative variable names

. gen logexp = log(exposure)
. gen age_1_3m  = age==2

. gen age_3_6m  = age==3

. gen age_6_12m = age==4

. gen age_1_2y  = age==5

. gen age_2_5y  = age==6

. gen age_5_10y = age==7

. global age "age_*"
. gen cohort_60_67 = cohort == 2

. gen cohort_68_76 = cohort == 3

. global cohort "cohort_*"

Exponential Survival

Let us fit the null model, which is equivalent to a simple exponential survival model. We will also store the estimates for use in later tests

. poisson deaths, offset(logexp)

Iteration 0:   log likelihood = -2184.0965
Iteration 1:   log likelihood = -2184.0965  (backed up)

Poisson regression                                Number of obs   =         21
                                                  LR chi2(0)      =       0.00
                                                  Prob > chi2     =          .
Log likelihood = -2184.0965                       Pseudo R2       =     0.0000

------------------------------------------------------------------------------
      deaths |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |  -3.996451   .0237156  -168.52   0.000    -4.042933   -3.949969
      logexp |   (offset)
------------------------------------------------------------------------------

. poisgof

         Goodness-of-fit chi2  =   4239.85
         Prob > chi2(20)       =    0.0000

. estimates store null

Note the astronomical deviance. The estimate of the constant happens to be the log of the overall mortality rate. Check it out.

. di "Fitted rate = " exp(_b[_cons])
Fitted rate = .01838076

. quietly summarize deaths

. mac def deaths =  r(sum)

. quietly summarize exposure

. di "Observed Rate = " $deaths/r(sum)
Observed Rate = .01838076

Now on to the one-factor models. We start with the cohort model, which is equivalent to a separate exponential survival model for each cohort

. poisson deaths $cohort, offset(logexp)

Iteration 0:   log likelihood = -2160.0544
Iteration 1:   log likelihood = -2159.5162
Iteration 2:   log likelihood = -2159.5159
Iteration 3:   log likelihood = -2159.5159

Poisson regression                                Number of obs   =         21
                                                  LR chi2(2)      =      49.16
                                                  Prob > chi2     =     0.0000
Log likelihood = -2159.5159                       Pseudo R2       =     0.0113

------------------------------------------------------------------------------
      deaths |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
cohort_60_67 |  -.3020405   .0573319    -5.27   0.000    -.4144089   -.1896721
cohort_68_76 |   .0742143   .0589726     1.26   0.208    -.0413698    .1897983
       _cons |  -3.899488   .0411345   -94.80   0.000     -3.98011   -3.818866
      logexp |   (offset)
------------------------------------------------------------------------------

. poisgof

         Goodness-of-fit chi2  =  4190.689
         Prob > chi2(18)       =    0.0000

Compare these results with the gross effect estimates in Table 7.3. Note that the hazard rate declined 26% between the 1941-59 and 1960-67 cohorts (1-exp(0.302) = 0.261) but appears to have increased almost 8% for the 1968-76 cohort (exp(0.074)=1.077). Here's the likelihood ratio test of the gross cohort effect

. lrtest null .

likelihood-ratio test                                  LR chi2(2)  =     49.16
(Assumption: null nested in .)                         Prob > chi2 =    0.0000

Piece-Wise Exponential Survival

Now on to the age model.

. poisson deaths $age, offset(logexp)

Iteration 0:   log likelihood = -100.90573
Iteration 1:   log likelihood = -100.49824
Iteration 2:   log likelihood = -100.49817
Iteration 3:   log likelihood = -100.49817

Poisson regression                                Number of obs   =         21
                                                  LR chi2(6)      =    4167.20
                                                  Prob > chi2     =     0.0000
Log likelihood = -100.49817                       Pseudo R2       =     0.9540

------------------------------------------------------------------------------
      deaths |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    age_1_3m |  -1.972606   .0916964   -21.51   0.000    -2.152328   -1.792885
    age_3_6m |  -2.161867   .0851481   -25.39   0.000    -2.328754    -1.99498
   age_6_12m |  -2.487885   .0755466   -32.93   0.000    -2.635954   -2.339817
    age_1_2y |  -3.004331   .0726789   -41.34   0.000    -3.146779   -2.861883
    age_2_5y |  -4.085911   .0756487   -54.01   0.000    -4.234179   -3.937642
   age_5_10y |  -5.355183   .1141125   -46.93   0.000    -5.578839   -5.131526
       _cons |  -.7427022   .0422577   -17.58   0.000    -.8255258   -.6598786
      logexp |   (offset)
------------------------------------------------------------------------------

. poisgof

         Goodness-of-fit chi2  =  72.65357
         Prob > chi2(14)       =    0.0000

. estimates store age

The age model is equivalent to a piece-wise exponential survival model with no cohort effects. Compare the results with the gross effects in Table 7.3. Note the dramatic decrease in risk with age. At age 1 the mortality rate is only 5% of what it is in the first month of life (exp(-3.00) = 0.496). Here's the likelihood ratio tests of the age effect:

. lrtest null .

likelihood-ratio test                                  LR chi2(6)  =   4167.20
(Assumption: null nested in age)                       Prob > chi2 =    0.0000

Can you compute Wald tests for the cohort and age effects?

The Proportional Hazards Model

Now on to the additive model with main effects of age and cohort, which is equivalent to a proportional hazards model:

. poisson deaths $age $cohort, offset(logexp)

Iteration 0:   log likelihood =  -67.79376
Iteration 1:   log likelihood = -67.262772
Iteration 2:   log likelihood = -67.262634
Iteration 3:   log likelihood = -67.262634

Poisson regression                                Number of obs   =         21
                                                  LR chi2(8)      =    4233.67
                                                  Prob > chi2     =     0.0000
Log likelihood = -67.262634                       Pseudo R2       =     0.9692

------------------------------------------------------------------------------
      deaths |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    age_1_3m |  -1.972671   .0916965   -21.51   0.000    -2.152393    -1.79295
    age_3_6m |  -2.163342   .0851488   -25.41   0.000     -2.33023   -1.996453
   age_6_12m |  -2.491684    .075551   -32.98   0.000    -2.639761   -2.343607
    age_1_2y |  -3.014045   .0727035   -41.46   0.000    -3.156541   -2.871548
    age_2_5y |  -4.115378   .0758263   -54.27   0.000    -4.263995   -3.966761
   age_5_10y |  -5.435889   .1147682   -47.36   0.000     -5.66083   -5.210947
cohort_60_67 |  -.3242573   .0573352    -5.66   0.000    -.4366323   -.2118824
cohort_68_76 |  -.4784147   .0593257    -8.06   0.000    -.5946909   -.3621385
       _cons |  -.4484664   .0545394    -8.22   0.000    -.5553618   -.3415711
      logexp |   (offset)
------------------------------------------------------------------------------

. poisgof

         Goodness-of-fit chi2  =  6.182494
         Prob > chi2(12)       =    0.9066

Note that this model fits reasonably well. Compare the results with the net effect estimates in Table 7.3. Note that the anomaly with the youngest cohort has been corrected. The estimates now indicate a steady decline in mortality across cohorts. Taking the 1941-59 cohort as a baseline, mortality at every age from zero to ten was 28% lower for the 1960-67 cohort and 36% lower for the more recent 1968-76 cohort. The survey was conducted in 1976. Can you explain what was going on here?

Here's a likelihood ratio test for the cohort effect adjusted for age. Note that we compare the age model (which we saved) with the additive model that has age and cohort:

. lrtest age .

likelihood-ratio test                                  LR chi2(2)  =     66.47
(Assumption: age nested in .)                          Prob > chi2 =    0.0000

Estimating Survival Probabilities (Re: 7.5.4)

Let us calculate the fitted life table shown in Table 7.4 of the lecture notes. Note that predict in a Poisson regression calculates the expected number of events, so we need to divide by exposure to obtain fitted rates. An alternative is to use the xb and nooffsetoptions (you need both) to obtain the linear predictor or log-hazard, which you can then exponentiate to obtain the fitted hazard rate.

. predict events
(option n assumed; predicted number of events)

. gen hazard  = events/exposure

At this point recall that the age intervals have different widths. We stored the widths in months in the variable width. We will now convert it to years, so it is in the same units as exposure.

. quietly replace width=width/12

Now we will sort the data by age within each cohort and calculate the cumulative hazard for each cohort as a running sum of the hazard times the interval width. We then use the fact that S(t)= exp{-L(t)} to obtain the survival function.

. sort cohort age

. by cohort: gen cumhaz = sum(hazard * width)

. gen survival = exp( -cumhaz)

The last thing to do is print our handy work. I will use the tabulate command rather than list to obtain a suitable two-way layout. We specify the "mean" to list the single value in each combination of age and cohort.

. tab age cohort, sum(survival) mean

                             Means of survival

           |            cohort
       age |   1941-59    1960-67    1968-76 |     Total
-----------+---------------------------------+----------
      0-1m |   .948174  .96225148  .96755582 |  .9593271
      1-3m | .93424118  .95200664  .95871943 | .94832242
      3-6m | .91725379  .93945831  .94787771 | .93486327
     6-12m | .89332932  .92167592  .93247825 | .91582783
      1-2y | .86575711  .90101773  .91453487 |  .8937699
      2-5y | .83910733  .88087684   .8969841 | .87232276
     5-10y | .82751352  .87205952  .88928276 | .86295193
-----------+---------------------------------+----------
     Total | .88933946  .91847806  .92963328 |  .9124836

Stata has commands for fitting some of the parametric models discussed in the bibliographic notes, such as the Weibull model. It also has non-parametric methods, including procedures for calculating Kaplan-Meier estimates and for fitting Cox regression models by partial likelihood. Finally, Stata has facilities for generating person-year files.


Copyright © Germán Rodríguez, 1993-2003. Please send feedback to grodri@princeton.edu