Germán Rodríguez
Demographic Methods Princeton University

Kaplan-Meir Survival

We'll illustrate the Kaplan-Meier estimator with the classic dataset used by Cox in his seminal paper on proportional hazard models. The data shows the length of remission in weeks for two groups of leukemia patients, treated and controls.

We start by reading the data, available in the datasets section of my GLM course at http://data.princeton.edu/wws509/datasets/ as a plain text file gehan.dat and a Stata file gehan.dta.

. use    http://data.princeton.edu/wws509/datasets/gehan, clear    
(Dataset used in Cox's 1972 paper, JRSS 34:187-220)

. stset weeks, failure(relapse) // define as survival data

     failure event:  relapse != 0 & relapse < .
obs. time interval:  (0, weeks]
 exit on or before:  failure

──────────────────────────────────────────────────────────────────────────────
         42  total observations
          0  exclusions
──────────────────────────────────────────────────────────────────────────────
         42  observations remaining, representing
         30  failures in single-record/single-failure data
        541  total analysis time at risk and under observation
                                                at risk from t =         0
                                     earliest observed entry t =         0
                                          last observed exit t =        35
> library(survival)
> library(dplyr)
> library(ggplot2)
> gehan <- read.table("http://data.princeton.edu/wws509/datasets/gehan.dat")
> summarize(gehan, events = sum(relapse), exposure = sum(weeks))
  events exposure
1     30      541

As you can see, 30 of the 42 patients had a relapse, with a total exposure time of 541 weeks. The weekly relapse rate is 7.8%.

Next we compute and plot the Kaplan-Meir estimator of the survival function in each of the two groups.

. sts graph, by(group)

         failure _d:  relapse
   analysis time _t:  weeks

. graph export kmg.png, width(500) replace
(file kmg.png written in PNG format)

Kaplan Meier

> kmg <- survfit(Surv(weeks, relapse) ~ group, data=gehan)
> #plot(kmg)
> a = 1:kmg$strata[1]
> kmdf <- data.frame(group = rep(c("control", "treated"), kmg$strata+1),
+   time = c(0, kmg$time[a], 0, kmg$time[-a]),
+   surv = c(1, kmg$surv[a], 1, kmg$surv[-a])) %>% group_by(group)
> ggplot(kmdf, aes(time, surv, color=group)) + geom_step()
> ggsave("kmgr.png", width=500/72, height=400/72, dpi=72)

Kaplan Meier
The graph shows that after 23 weeks all patients in the control group had relapsed, but about half those in the treated group remained in remission. We can list the survival function.

. sts list, by(group)

         failure _d:  relapse
   analysis time _t:  weeks

           Beg.          Net            Survivor      Std.
  Time    Total   Fail   Lost           Function     Error     [95% Conf. Int.]
───────────────────────────────────────────────────────────────────────────────
control 
     1       21      2      0             0.9048    0.0641     0.6700    0.9753
     2       19      2      0             0.8095    0.0857     0.5689    0.9239
     3       17      1      0             0.7619    0.0929     0.5194    0.8933
     4       16      2      0             0.6667    0.1029     0.4254    0.8250
     5       14      2      0             0.5714    0.1080     0.3380    0.7492
     8       12      4      0             0.3810    0.1060     0.1831    0.5778
    11        8      2      0             0.2857    0.0986     0.1166    0.4818
    12        6      2      0             0.1905    0.0857     0.0595    0.3774
    15        4      1      0             0.1429    0.0764     0.0357    0.3212
    17        3      1      0             0.0952    0.0641     0.0163    0.2612
    22        2      1      0             0.0476    0.0465     0.0033    0.1970
    23        1      1      0             0.0000         .          .         .
treated 
     6       21      3      1             0.8571    0.0764     0.6197    0.9516
     7       17      1      0             0.8067    0.0869     0.5631    0.9228
     9       16      0      1             0.8067    0.0869     0.5631    0.9228
    10       15      1      1             0.7529    0.0963     0.5032    0.8894
    11       13      0      1             0.7529    0.0963     0.5032    0.8894
    13       12      1      0             0.6902    0.1068     0.4316    0.8491
    16       11      1      0             0.6275    0.1141     0.3675    0.8049
    17       10      0      1             0.6275    0.1141     0.3675    0.8049
    19        9      0      1             0.6275    0.1141     0.3675    0.8049
    20        8      0      1             0.6275    0.1141     0.3675    0.8049
    22        7      1      0             0.5378    0.1282     0.2678    0.7468
    23        6      1      0             0.4482    0.1346     0.1881    0.6801
    25        5      0      1             0.4482    0.1346     0.1881    0.6801
    32        4      0      2             0.4482    0.1346     0.1881    0.6801
    34        2      0      1             0.4482    0.1346     0.1881    0.6801
    35        1      0      1             0.4482    0.1346     0.1881    0.6801
───────────────────────────────────────────────────────────────────────────────
> summary(kmg)
Call: survfit(formula = Surv(weeks, relapse) ~ group, data = gehan)

                group=control 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1     21       2   0.9048  0.0641      0.78754        1.000
    2     19       2   0.8095  0.0857      0.65785        0.996
    3     17       1   0.7619  0.0929      0.59988        0.968
    4     16       2   0.6667  0.1029      0.49268        0.902
    5     14       2   0.5714  0.1080      0.39455        0.828
    8     12       4   0.3810  0.1060      0.22085        0.657
   11      8       2   0.2857  0.0986      0.14529        0.562
   12      6       2   0.1905  0.0857      0.07887        0.460
   15      4       1   0.1429  0.0764      0.05011        0.407
   17      3       1   0.0952  0.0641      0.02549        0.356
   22      2       1   0.0476  0.0465      0.00703        0.322
   23      1       1   0.0000     NaN           NA           NA

                group=treated 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    6     21       3    0.857  0.0764        0.720        1.000
    7     17       1    0.807  0.0869        0.653        0.996
   10     15       1    0.753  0.0963        0.586        0.968
   13     12       1    0.690  0.1068        0.510        0.935
   16     11       1    0.627  0.1141        0.439        0.896
   22      7       1    0.538  0.1282        0.337        0.858
   23      6       1    0.448  0.1346        0.249        0.807

The convention is to report the survival function immediately after each time. In the control group there are no censored observations, and the Kaplan-Meier estimate is simply the proportion alive after each distinct failure time. You can also check that the standard error is the usual binomial estimate. For example just after 8 weeks there are 8 out of 21 alive, the proportion is 8/21 = 0.381 and the standard error is √0.381(1 - 0.381)/21) = 0.106.

In the treated group 12 cases are censored and 9 relapse. Can you compute the estimate by hand? The distinct times of death are 6, 7, 10, 13, 16, 22 and 23. The counts of relapses are 3, 1, 1, 1, 1, 1, 1. When there are ties between event and censoring times it is customary to assume that the event occurred first; that is, observations censored at t are assumed to be exposed at that time, effectively censored just after t. The counts of censored observations after each death time (but before the next) are 1, 1, 2, 0, 3, 0 and 5. When the last observation is censored the K-M estimate is greater than zero, and is usually considered undefined after that.

Continue with Cox