Germán Rodríguez
Survival Analysis Princeton University

Competing Risks

Singer and Willet (2003) Applied Longitudinal Data Analysis, Oxford, analyze data on the length of service of U.S. Supreme Court Justices, treating death and retirement as competing risks and age at appointment and calendar year of appointment as predictors.

We will conduct a similar analysis using a dataset that I created by extracting data from Wikipedia in 2012 and then updating it to 1-Jan-2016. Dates are recorded in calendar years only for simplicity and consistency with the reference, although full dates would obviously be better. The file is available in Stata 11 format as justices.dta on this website.

. use http://data.princeton.edu/pop509/justices.dta
(Supreme court justices 1-Jan-2016)

. stset tenure, fail(event)

     failure event:  event != 0 & event < .
obs. time interval:  (0, tenure]
 exit on or before:  failure

------------------------------------------------------------------------------
        112  total observations
          0  exclusions
------------------------------------------------------------------------------
        112  observations remaining, representing
        103  failures in single-record/single-failure data
       1858  total analysis time at risk and under observation
                                                at risk from t =         0
                                     earliest observed entry t =         0
                                          last observed exit t =        36

. tab event

      event |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |          9        8.04        8.04
          1 |         49       43.75       51.79
          2 |         54       48.21      100.00
------------+-----------------------------------
      Total |        112      100.00
> library(foreign)

> sc <- read.dta("http://data.princeton.edu/pop509/justices.dta")

> table(sc$event)

 0  1  2 
 9 49 54 

The 'event' variable is coded 0 for censored (9 justices serving at the start of the year), 1 for those who left the court due to death (49), and 2 for justices who retired or resigned (54).

I ignore the return engagements by John Rutledge and Charles Evans Hughes, who resigned and later were reappointed as Chief Justice.

Overall Survival

The first thing we calculate is the Kaplan-Meier estimator of overall survival

. sts graph

         failure _d:  event
   analysis time _t:  tenure

. graph export justices-km.png, width(500) replace
(file justices-km.png written in PNG format)

. sts gen km = s // save Kaplan-Meier for later use

Overall Survival

> library(survival)

> km <- survfit(Surv(tenure, event > 0) ~ 1, data = sc)

> # plot(km) 
> kmd <- data.frame(time = c(0, km$time), survival = c(1, km$surv))

> ggplot(kmd, aes(time, survival)) + geom_step()

> ggsave("justices-kmr.png", width = 500/72, height = 400/72, dpi = 72)

Overall Survival The median length of service is about 15.5 years.

Incidence Functions

The command stcompet, written by V. Coviello and M. Boggess (2004) "Cumulative incidence estimation in the presence of competing risks" Stata Journal, 4(2):103-112, can estimate incidence functions. We specify death as the event of interest and retirement as the competing event.

We will use the function cuminc() in Bob Gray's cmprsk package to estimate the cumulative incidence functions.

. streset, fail(event == 1)
-> stset tenure, failure(event==1)

     failure event:  event == 1
obs. time interval:  (0, tenure]
 exit on or before:  failure

------------------------------------------------------------------------------
        112  total observations
          0  exclusions
------------------------------------------------------------------------------
        112  observations remaining, representing
         49  failures in single-record/single-failure data
       1858  total analysis time at risk and under observation
                                                at risk from t =         0
                                     earliest observed entry t =         0
                                          last observed exit t =        36

. stcompet ci = ci, compet(2) // you just specify the code
> library(cmprsk)   

> cif <- cuminc(sc$tenure, sc$event)

> cifd <- data.frame(       
+   cause = factor(rep(c("death","retirement"),c(length(cif[[1]]$time), length(cif[[2]]$time)))),
+   time  = c(cif[[1]]$time, cif[[2]]$time),
+   cif   = c(cif[[1]]$est,  cif[[2]]$est)) 

The code generates the cumulative incidence functions CIF for each competing cause. To plot them you need to select the appropriate event code. I stacked them into a data frame suitable for using ggplot() Here's the incidence of death

. twoway line ci _t if event==1, c(J) sort ///
>         title(Cumulative incidence of death) name(d, replace)
> g1 <- ggplot(filter(cifd, cause=="death"), aes(time, cif)) + 
+   geom_step() + ggtitle("Incidence of Death")

And here's the incidence of retirement

. twoway line ci _t if event==2 & _t > 0, c(J) sort ///
>    title(Cumulative incidence of retirement) name(r, replace)

. graph combine d r, xsize(6) ysize(3)

. graph export justices-cif.png, width(720) height(360) replace  
(file justices-cif.png written in PNG format)

Incidence

> g2 <- ggplot(filter(cifd, cause=="retirement"), aes(time, cif)) +     
+   geom_step() + ggtitle("Incidence of Retirement")

> g <- arrangeGrob(g1, g2, ncol=2); #plot(g)

> ggsave(plot = g, "justices-cifr.png", width = 10, height = 5, dpi = 72)

Incidence

The Status Plot

We now build a nice plot that combines the Kaplan-Meier estimate of overall attrition with the cumulative incidence functions showing atrition by cause.
This takes a bit of work, but IMHO the result is worth the effort.

We use the fact that the survival and incidence functions add up to one, S(t) + I1(t) + I2(t) = 1, so the areas in the plot have boundaries *1, 1 - I1(t), S(t) and 0. The survival can be obtained from Kaplan-Meier or as the complement of the sum of the incidence functions.

In Stata we start by extracting the CIF for death, and fill in the times for other types of events, which are coded as missing. We also add a dummy observation so the plot actually starts at time 0.

. sort _t

. gen cid = ci if event==1
(63 missing values generated)

. replace cid = cid[_n-1] if missing(cid)
(60 real changes made)

. replace cid = 0 if missing(cid)
(3 real changes made)

. expand 2 in -1 // new observation is at the end
(1 observation created)

. replace _d = . in -1
(1 real change made, 1 to missing)

. replace _t = 0 in -1
(1 real change made)

. replace km = 1 in -1
(1 real change made)

. replace cid = 0 in -1
(1 real change made)

. 
.  // We can now compute the boundaries and plot them
. gen b1 = 1

. gen b2 = 1 - cid

. gen b3 = km

. gen b4 = 0

. sort _t

. twoway rarea b1 b2 _t, color("160 160 255") c(J) ///
>    ||  rarea b2 b3 _t, color("255 160 160") c(J) ///
>    ||  rarea b3 b4 _t, color("160 255 160") c(J) ///
>  , legend(off)  xtitle(Years) text(.2 15 "Active") ///
>    text(.5 25 "Died") text(.8 30 "Retired") ///
> title("Status of Justices by Years since Appointment")     

. graph export justices-status.png, width(600) replace 
(file justices-status.png written in PNG format)

. drop in -1 //remove dummmy
(1 observation deleted)

Status Plot

In R we have to deal with the fact that survfit() and cuminc() arrange their results differently and that the times do not line up. My solution merges the estimates and uses a simple function to fill the gaps. For the plot itself I found it easier to draw the areas using `polygon()`.

> # align times and fill gaps
> death <- filter(cifd, cause=="death"); 

> retirement <- filter(cifd, cause=="retirement")

> dr <- full_join(death, retirement, by = "time") %>%
+   select(time, death=3, retirement=5) %>% arrange(time)

> fg <- function(x) {for(i in 2:length(x)){ if(is.na(x[i])) x[i] <- x[i-1]}; x}

> dr <- mutate(dr, death = fg(death), retirement = fg(retirement))

> # draw the graph using overlapping polygons
> png("justices-statusr.png", width=600, height=500)

> par(mar=c(2,2,1,1), mgp=c(2 ,.7, 0), tck=-.01, cex=0.75)

> tmax = max(dr$time)

> plot(c(0, tmax), c(0,1), type="n", xlab="time", ylab="survival")

> polygon(c(0,tmax,tmax,0), c(0,0,1,1), 
+   col = "#A0A0FF", border = "#A0A0FF")

> text(30, .8, "Retired")

> polygon(c(dr$time,tmax,0),c(1 - dr$retirement, 0, 0), 
+   col="#FFA0A0", border="#FFA0A0")

>  text(25, .5, "Died")

>  polygon(c(dr$time,tmax,0),c(1 - dr$death - dr$retirement, 0, 0), 
+   col="#A0FFA0", border ="#A0FFA0")

> text(15, .2, "Active")

> dev.off()     
windows 
      2 

Status Plot The green area is the survival function, showing the probability of remaining in the court, and the other two areas partition attrition into death and retirement based on the incidence functions. In the long run death and retirement are approximately equally likely.

Continue with Hazard Models