Germán Rodríguez
Demographic Methods Princeton University

The Lee-Carter Model

We apply the Lee-Carter approach to forecasting U.S. mortality, doing all calculations from first principles. This log is best read in conjunction with the class handout.

Lee and Carter use U.S. mortality rates for conventional 5-year age groups up to 85+ by single calendar years from 1933 to 1987. You can get the data from mortality.org, select U.S.A. and navigate to life tables.

The code below shows the steps I followed after saving the web page as a text file bltper_5x1.txt. We need to code age from the group labels and, more importantly, compute the rate for the open-ended age group 85+. For fitting purposes we select ages up to 85 and the years up to 1987, but later use all ages and years.

. import delimit using bltper_5x1.txt, rowrange(3) varnames(3) ///
>         delimit(" ", collapse) clear
(11 vars, 1,944 obs)

. replace age = "110" if age == "110+"
(81 real changes made)

. gen pos = strpos(age, "-")

. replace age = substr(age, 1, pos - 1) if pos > 0
(1,782 real changes made)

. rename age age_s

. gen age = real(age_s)   

. save us1x5, replace // for later use
file us1x5.dta saved

. replace mx = lx/tx if age == 85
(81 real changes made)

. keep year age mx

. keep if age <= 85 & year <= 1987
(899 observations deleted)
> us <- read.table("bltper_5x1.txt", skip = 2, header=TRUE)

> us <- mutate(us, age0 = as.numeric(str_extract(us$Age, "[0-9]+")),
+   age = ifelse(age0 >= 5, age0 + 2.5, ifelse(age0 > 0, 3, 0.5)))

> us85 <- mutate(us, mx = ifelse(Age == "85-89", lx/Tx, mx))

> usmx <- select(us85, Year, age, mx) %>% filter(age < 90)

Next we compute the log of the rates. The calculations that follow are a bit simpler is we reshape the data We will store them in a matrix of 55 years by 19 age groups.

. gen logm = log(mx)

. drop mx

. quietly reshape wide logm, i(age) j(year)
> rates <- filter(usmx, Year <= 1987)$mx

> M <- matrix(log(rates), 55, 19, byrow = TRUE)

Fitting the Model

The first thing we need is the mean log-rate for each age group. This is easily computed using the rowmeans function in egen colMeans().
The results are remarkably close to Table 1 in the original paper.

. egen a = rowmean(logm1933 - logm1987)

. list age a

     +-----------------+
     | age           a |
     |-----------------|
  1. |   0   -3.642263 |
  2. |   1   -6.696482 |
  3. |   5    -7.51463 |
  4. |  10   -7.565431 |
  5. |  15   -6.758131 |
     |-----------------|
  6. |  20   -6.448188 |
  7. |  25   -6.405933 |
  8. |  30    -6.22762 |
  9. |  35   -5.907345 |
 10. |  40   -5.514151 |
     |-----------------|
 11. |  45   -5.087705 |
 12. |  50   -4.652652 |
 13. |  55   -4.260813 |
 14. |  60   -3.857138 |
 15. |  65   -3.474784 |
     |-----------------|
 16. |  70   -3.059151 |
 17. |  75   -2.639279 |
 18. |  80   -2.217548 |
 19. |  85   -1.619349 |
     +-----------------+
> a <- colMeans(M)

> a
 [1] -3.642263 -6.696482 -7.514630 -7.565431 -6.758130 -6.448188 -6.405933
 [8] -6.227620 -5.907345 -5.514151 -5.087705 -4.652652 -4.260813 -3.857138
[15] -3.474784 -3.059151 -2.639279 -2.217548 -1.619349

The next step is to subtract the average age pattern a from all years. To do this we copy the data into Mata

. mata a = st_data(1::19, "a")

. mata Y = st_data(1::19, 2..56)

. mata Y = Y :- a
>  for(j in 1:19) M[,j] <- M[,j] - a[j]

We are now ready to compute the Singular Value Decomposition (SVD), which writes M = U D V__ where U and V' are orthogonal matrices and D is a diagonal matrix of singular values. Before we call the function we need to define the output matrices. In fact we need just the first left and right singular vectors. The first column of U times D1,1 times the first row of V' has the best rank-1 approximation to the input matrix.

. mata U = d = Vt = J(0, 0, .)

. mata svd(Y', U, d, Vt)
> d <- svd(M, 1, 1)

Lee and Carter normalize the first row of V so it sums to one and call it b. This vector models how the different age groups react to mortality change.

. mata b = Vt[1,]/sum(Vt[1,])

. mata b[1..5] // partial list ...
                 1             2             3             4             5
    +-----------------------------------------------------------------------+
  1 |  .0910547088   .1120915573   .0937907842   .0832350427   .0497888488  |
    +-----------------------------------------------------------------------+
> b <- d$v/sum(d$v)

> head(b, 5) 
           [,1]
[1,] 0.09105471
[2,] 0.11209155
[3,] 0.09379079
[4,] 0.08323504
[5,] 0.04978885

These values are remarkably close to the b's published in Table 2.
Lee and Carter also take the first column of U, multiply by D1,1 and multiply by the sum of the first row of V' (to cancel the division) and call that k. This vector captures overall mortality change over time.

. mata k = U[,1] * sum(Vt[1,]) * d[1]

. mata k[1..5]' // partial list...
                 1             2             3             4             5
    +-----------------------------------------------------------------------+
  1 |   11.4068805   11.86130518   11.36618889   11.65111415   10.85912354  |
    +-----------------------------------------------------------------------+
> k <- d$u * sum(d$v) * d$d[1]

> head(k, 5)
         [,1]
[1,] 11.40688
[2,] 11.86131
[3,] 11.36619
[4,] 11.65111
[5,] 10.85912

Plotting Parameters and Fits

The next task is to compute the Lee-Carter fits to the mortality rates in 1933 and 1987, reproducing part of Figure 2 in the original paper. We are done with Mata, so we copy b and k into Stata.

. mata st_store(1::19, st_addvar("float","b"), b')

. set obs 55 
number of observations (_N) was 19, now 55

. mata st_store(., st_addvar("float","k"), k)

. gen fit1933 = a + b * k[1]
(36 missing values generated)

. gen fit1987 = a + b * k[1987-1932]
(36 missing values generated)

. gen agem = age + 2.5
(36 missing values generated)

. replace agem = 0.5 in 1
(1 real change made)

. replace agem = 3   in 2
(1 real change made)

. twoway (scatter logm1933 agem) (line fit1933 agem) ///
>   (scatter logm1987 agem) (line fit1987 agem) ///
>   , title(Lee-Carter fits for 1933 and 1987) xt(age) legend(off)

. graph export lcfig1.png, width(500) replace
(file lcfig1.png written in PNG format)

Lee-Carter fits to 1933 and 1987

> usmx2 <- filter(usmx, Year==1933 | Year == 1987) %>% 
+   mutate(year = factor(Year),
+       fit = c(exp(a + b * k[1]), exp(a + b * k[55])))

> ggplot(usmx2, aes(age, mx, color=year)) + geom_point() + 
+   geom_line(aes(age,fit,color=year)) + scale_y_log10() +
+   ggtitle("Lee-Carter Fits for 1933 and 1987")

> ggsave("lcfig1r.png", width = 500/72, height = 400/72, dpi = 72)

Lee-Carter fits to 1933 and 1987 Here's the trajectory of k

. gen t = 1932 + _n 

. line k t, title("Lee-Carter k for 1933-1987") xt(year)

. graph export lcfig2.png, width(500) replace
(file lcfig2.png written in PNG format)

Lee-Carter k for 1933-1987

> trend <- data.frame(year = 1933:1987, k = k)

> ggplot(trend, aes(year, k)) + geom_line() +
+   ggtitle("Lee-Carter k for 1933-1987")

> ggsave("lcfig2r.png", width = 500/72, height = 400/72, dpi = 72)  

Lee-Carter k for 1933-1987

A Random Walk

We now simulate the stochastic process driving k, a random walk with drift -0.365 and innovations with standard deviation 0.652. The idea is to generate normally distributed errors, add the drift, and them add to each value the previous one, which is easiest done as a running sum. We do this 50 times. For clarity I use a temporary variable called e for the errors

. set obs 61
number of observations (_N) was 55, now 61

. gen n = _n

. gen e = .
(61 missing values generated)

. forvalues r = 1/50 {
  2.         quietly replace e = rnormal() * 0.652 
  3.         quietly gen     k`r' = -11.05     - 0.365 + e in 1
  4.         quietly replace k`r' = k`r'[_n-1] - 0.365 + e in 2/L
  5. }

. twoway (line k1-k50 n), legend(off) xt(year) title(50 Random Walks)

. graph export lcfig3.png, width(500) replace
(file lcfig3.png written in PNG format)

Fifty random walks

> S <- matrix(0,61,50)

> for(j in 1:50) S[,j] <- -11.05 + cumsum(-0.365 + rnorm(61, 0, 0.652))

> df <- data.frame(S); df$year <- 1990:2050

> ggplot(gather(df, sim, k, -year), aes(year, k, color = sim)) +
+   geom_line() + guides(color = FALSE) + 
+   ggtitle("Fifty Random Walks")

> ggsave("lcfig3r.png", width = 500/72, height = 400/72, dpi = 72)

Fifty random walks This being a simulation your results will differ from mine. You should note that there is considerable uncertainty about the level of mortality 50 years into the future.

Age-Specific Forecasts.

Let us forecast age-specific mortality in 2050 starting from 1989. To this end we will use the official values of a and b, which I saved in a text file called LeeCarter.dat in the dataset section. These values extend the two sets of parameters to age 105+, but are otherwise practically identical to the estimates we just obtained.

Next we compute k starting from -11.05 in 1989. I call the resulting scalar km to avoid conflict with the variable k. We also compute the standard deviation and a 95% confidence interval. At this point we restore the original data to get all ages

. use us1x5, clear

. keep if age < 110
(81 observations deleted)

. gen logm = log(mx)

. keep age year logm

. merge m:1 age using http://data.princeton.edu/eco572/datasets/LeeCarter_ab

    Result                           # of obs.
    -----------------------------------------
    not matched                             0
    matched                             1,863  (_merge==3)
    -----------------------------------------

. gen agem = age + 2.5

. quietly replace agem = 0.5 if age == 0

. quietly replace agem = 3 if age == 1

. scalar km  = -11.05 - 0.365*(2050 - 1989)

. scalar sd = sqrt(2050 - 1989)*0.652

. di km - 1.96*sd, km + 1.96*sd
-43.295874 -23.334126

. gen p2050  = lc_a + lc_b *  km

. gen lb2050 = lc_a + lc_b * (km - 1.96*sd)

. gen ub2050 = lc_a + lc_b * (km + 1.96*sd)

. twoway (rarea ub2050 lb2050 agem, color(ltblue)) (line p2050 agem) ///
> , legend(off) title(Forecast for 2050 with 95% CI) xt(age)

. graph export lcfig4.png, width(500) replace
(file lcfig4.png written in PNG format)

Forecast for 2050

> lc <- read.table("LeeCarter.dat", header = TRUE)

> k2050 <- 33.3

> z <- qnorm(0.975)

> se <- sqrt(2050 - 1989) * 0.652

> k2050 + c(-1, 1) * z * se
[1] 23.31931 43.28069

> forecast <- data.frame(age = lc$age,
+   fit = exp(lc$a + lc$b * k2050),
+   low = exp(lc$a + lc$b * (k2050 - z * se)),
+           hi  = exp(lc$a + lc$b * (k2050 + z * se)))

> ggplot(forecast, aes(age, fit)) + scale_y_log10() +
+   geom_ribbon(aes(ymin = low, ymax = hi), fill="#d0d0d0") +
+   geom_line() + ggtitle("Figure 4: Forecast for 2050 with 95% CI")

> ggsave("lcfig4r.png", width = 500/72, height = 400/72, dpi = 72)

Forecast for 2050 So we are 95% confident that k will be between -43.3 and -23.3 in 2050. Finally we combine k and the a and b schedules to forecast and plot age-specific mortality (in the log scale)

Corrected Jump-off

The final issue concerns the choice of jump-off. Below we will make a "forecast" for 2013, the latest year for which we have data. The prediction is not too good for young adults because of a slight lack of fit in 1989 which propagates in the forecast. The solution is to "jump-off" from the actual 1989 rates.

. keep if year == 1989 | year == 2013
(1,817 observations deleted)

. drop _merge

. reshape wide logm, i(age lc_a lc_b) j(year)     
(note: j = 1989 2013)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                       46   ->      23
Number of variables                   9   ->       9
j variable (2 values)              year   ->   (dropped)
xij variables:
                                   logm   ->   logm1989 logm2013
-----------------------------------------------------------------------------

. scalar drift = -0.365 * (2013 - 1989)

. scalar se    =  0.625 * sqrt(2013 - 1989)

. scalar z  = invnormal(0.975)

. gen p2013  =     lc_a + lc_b * (-11.05 + drift)

. gen j2013  = logm1989 + lc_b * (         drift)

. gen ub2013 = logm1989 + lc_b * (         drift - z * se)

. gen lb2013 = logm1989 + lc_b * (         drift + z * se)

. 
. twoway (rarea ub2013 lb2013 agem, color(ltblue)) ///
>   (scatter logm2013 agem if age < 110) ///
>   (line p2013 agem, lpat(dash)) (line j2013 agem if age < 110) ///
>   , title(Observed and Forecast for 2013) xt(age) ///
>   legend(order(3 "standard" 4 "jumpoff 1989") ring(0) pos(5) cols(1))   

. graph export lcfig5.png, width(500) replace
(file lcfig5.png written in PNG format)

Predictive check for 2013

> later <- select(us, age, Year, mx) %>% filter(Year == 1989 | Year == 2013)

> a89 <- log(filter(later, Year == 1989, age < 110)$mx)

> k89 <-  -(2013 - 1989) * 0.365

> se <- sqrt(2013 - 1989) * 0.652

> y2013 <- filter(later, Year == 2013, age < 110) %>%
+   mutate(standard = exp(lc$a - lc$b * 19.81),
+       jumpof   = exp(a89  + lc$b * k89),
+       low      = exp(a89  + lc$b * (k89 - z * se)),
+       hi       = exp(a89  + lc$b * (k89 + z * se)))

> ggplot(y2013, aes(age, mx)) + scale_y_log10() +
+   geom_ribbon(aes(ymin = low, ymax = hi), fill = "#d0d0d0") +
+   geom_point() + geom_line(aes(age, standard)) + geom_line(aes(age, jumpof)) +
+   ggtitle("Observed and Forecast for 2013")

> ggsave("lcfig5r.png", width = 500/72, height = 400/72, dpi = 72)

Predictive check for 2013 The performance of the model with the correct jump-off for a 24-year forecast is nothing short of remarkable. And that's all folks!