Home | GLMs | Multilevel | Survival | Demography | Stata | R

The Lee-Carter Model

We apply the Lee-Carter approach to forecasting U.S. mortality, doing all calculations from first principles. This log is best read in conjunction with the 2009 class handout.

First we get mortality rates data for U.S. females from 1933 to 2005 using conventional five-year age groups (abridged) from the Human Mortality Database, using a handy Stata command

. hmd rates, country(USA) abridged
The United States of America, Death rates (period 5x1)     Last modified: 18-Mar-2008, MPv5 (May07)
obs was 0, now 1752

. list if year==2005

      +------------------------------------------+
      | year   age    female      male     total |
      |------------------------------------------|
1729. | 2005     0   .006189    .00762   .006921 |
1730. | 2005     1   .000252   .000334   .000294 |
1731. | 2005     5   .000134   .000155   .000145 |
1732. | 2005    10   .000144   .000215    .00018 |
1733. | 2005    15   .000372   .000916   .000651 |
      |------------------------------------------|
1734. | 2005    20   .000482    .00144   .000976 |
1735. | 2005    25   .000542   .001386   .000974 |
1736. | 2005    30   .000737   .001483   .001114 |
1737. | 2005    35   .001091   .001893   .001494 |
1738. | 2005    40   .001749   .002931   .002337 |
      |------------------------------------------|
1739. | 2005    45   .002645   .004439   .003531 |
1740. | 2005    50   .003812   .006655   .005203 |
1741. | 2005    55   .005531   .009242   .007333 |
1742. | 2005    60   .008874   .014096   .011365 |
1743. | 2005    65   .013629   .020828   .016983 |
      |------------------------------------------|
1744. | 2005    70   .021613   .032641   .026548 |
1745. | 2005    75   .034664     .0511   .041575 |
1746. | 2005    80   .058175   .081439   .067086 |
1747. | 2005    85   .099697   .127603     .1092 |
1748. | 2005    90   .157889   .179698   .164124 |
      |------------------------------------------|
1749. | 2005    95   .258398   .278856   .262757 |
1750. | 2005   100   .375229   .383529   .376524 |
1751. | 2005   105   .486081   .415883   .477356 |
1752. | 2005   110    .49642   .382848   .480231 |
      +------------------------------------------+

Next we compute log-rates. Lee-Carter work with both sexes so we'll do the same. The calculations that follow are a bit simpler if we work with data in a wide age by period format, so we'll drop what we don't need and reshape

. gen logm = log(total)

. drop male female total

. quietly reshape wide logm, i(age) j(year)

Fitting the Model

The first thing we need is the mean log-rate for each age across years. This is easily computed with the rowmeans function in egen. We select the years up to 1987 and focus on ages up to 85-89 (not quite the same as 85+). The results are remarkably close to Table 1 in the original paper.

. egen a = rowmean(logm1933-logm1987)

. list age a if age <= 85

     +-----------------+
     | age           a |
     |-----------------|
  1. |   0    -3.64229 |
  2. |   1   -6.700243 |
  3. |   5   -7.511998 |
  4. |  10    -7.56499 |
  5. |  15   -6.761693 |
     |-----------------|
  6. |  20   -6.448143 |
  7. |  25   -6.405937 |
  8. |  30   -6.229009 |
  9. |  35   -5.908996 |
 10. |  40   -5.515899 |
     |-----------------|
 11. |  45   -5.089163 |
 12. |  50   -4.654123 |
 13. |  55   -4.262783 |
 14. |  60   -3.858791 |
 15. |  65   -3.477189 |
     |-----------------|
 16. |  70   -3.063755 |
 17. |  75   -2.645488 |
 18. |  80   -2.223745 |
 19. |  85   -1.799029 |
     +-----------------+

Next we copy the data into Mata using the function st_data(), recalling that we only want ages up to 85-89, or the first 19 rows. The function can refer to Stata variables by name or index; we want the age pattern, called "a", and the log rates for 1933-1987, which are in columns 2 to 56 (or 1933 to 1987 minus 1931, which may be clearer). We then subtract a from each column of Y, thus centering the rates

. mata a = st_data(1::19, "a")

. mata Y = st_data(1::19, 2..56)

. mata Y = Y :- a

The singular value decomposition can be computed easily with the function svd(), using the transpose to get a year by age array. Before we call the function we need to define the output, which consists of the matrix U, a vector with the diagonal of S and the matrix V transpose.

. mata U = s = Vt = J(0,0,.)

. mata svd(Y',U,s,Vt)

And that's it! The first column of U times s[1] times the first row of Vt has the best rank-1 approximation to the input matrix. Lee and Carter normalize the first row of Vt so it sums to one and call it b, indicating how the different age groups react to mortality change

. mata b = Vt[1,]/sum(Vt[1,])

. mata b[1..5] // partial list ...
                 1             2             3             4             5
    +-----------------------------------------------------------------------+
  1 |  .0906582504   .1107155039   .0930655715   .0825816339   .0491651399  |
    +-----------------------------------------------------------------------+

These values are remarkably close to the b's published in Table 2. Lee and Carter also take the first column of U, multiply by s[1] and multiply by the sum of the first row of Vt (to cancel the division) and call that k.

. mata k = U[,1]*sum(Vt[1,])*s[1]

. mata k[1..5]' // partial list...
                 1             2             3             4             5
    +-----------------------------------------------------------------------+
  1 |  11.41409805   11.84927137   11.37470162   11.68714388    10.9078721  |
    +-----------------------------------------------------------------------+

Plotting Parameters and Fits

We are done with Mata. I'll just copy b and k into Stata. Because k has 55 elements we need to increase the number of observations. I'll also add a variable for the years and plot k, reproducing part of Figure 2 in the Lee-Carter paper

. mata st_store(1::19, st_addvar("float","b"), b')

. set obs 55 
obs was 24, now 55

. mata st_store(., st_addvar("float","k"), k)

. gen t = 1932 + _n 

. line k t, title("Lee-Carter k for 1933-1987") xt(year)

. graph export lcf1b.png, width(400) replace
(file lcf1b.png written in PNG format)

Let us have a look at observed and fitted values. To compute fitted values for any given year we simply multiply b by the corresponding value of k and we add a. I also compute age midpoints for the plot. Here are the fits for 1933 and 1987, which as you can see are excellent. See also Figure 4 in the paper. (Our fit only goes to age 85-89, but I show below how to use the published schedules to go up to105+)

. gen fit1933 = a + b * k[1]
(36 missing values generated)

. gen fit1987 = a + b * k[1987-1932]
(36 missing values generated)

. gen agem = age + 2.5
(31 missing values generated)

. replace agem = 0.5 in 1
(1 real change made)

. replace agem = 3   in 2
(1 real change made)

. twoway (scatter logm1933 agem) (line fit1933 agem) ///
>        (scatter logm1987 agem) (line fit1987 agem) ///
>     , title(Lee-Carter fits for 1933 and 1987) xt(age) legend(off)

. graph export lcf1a.png, width(400) replace
(file lcf1a.png written in PNG format)

A Random Walk

We now simulate the stochastic process driving k, a random walk with drift -0.365 and innovations with standard deviation 0.652. The idea is to generate normally distributed erorrs, add the drift, and them add to each value the previous one. We do this 50 times. For clarity I use a temporary variable called e for the errors

. gen e = .
(55 missing values generated)

. forvalues r=1/50 {
  2.   quietly replace e = rnormal() * 0.652 in 1/50
  3.   quietly gen     k`r' = -11.05     - 0.365 + e in 1
  4.   quietly replace k`r' = k`r'[_n-1] - 0.365 + e in 2/50
  5. }

Before we plot these let us compute the upper and lower 95% confidence bands. I'll also create a variable for the future years:

. gen n = 1989 + _n in 1/50
(5 missing values generated)

. gen ub = -11.05 -(n-1989)*0.365 + 1.96*sqrt(n-1989)*0.652
(5 missing values generated)

. gen lb = -11.05 -(n-1989)*0.365 - 1.96*sqrt(n-1989)*0.652
(5 missing values generated)

Now we are ready to do our plot. We start with a range area plot to delimit the 95% confidence region and then draw the 50 simulations

. twoway (rarea ub lb n, color(gs12)) ///
>   (line k1-k50 n), legend(off) xt(year)title(50 Random Walks)

. graph export lcf2.png, width(400) replace
(file lcf2.png written in PNG format)

This being a simulation, your results will differ from mine. You should note that the lines stay pretty much in the gray area with the couple of exceptions one would expect, and that there is considerable uncertainty in the level of mortality 50 years in the future.

Age-Specific Forecasts

Let us forecast age-specific mortality in 2050 starting from 1989. First we will merge the official values of a and b, which I saved as lc_a and lc_b in a Stata file. These values extend the model schedules to age 105+, but are otherwise practically identical to the estimates we just obtained

. sort age

. merge age using http://data.princeton.edu/eco572/datasets/LeeCarter_ab
variable age does not uniquely identify observations in the master data

Next we compute k starting from -11.05 in 1989. (I call the resulting scalar km to avoid conflict with the variable k) We also compute the standard deviation and a 95% confidence interval

. scalar km  = -11.05 -0.365*(2050-1989)

. scalar sd = sqrt(2050-1989)*0.652

. di km - 1.96*sd, km + 1.96*sd
-43.295874 -23.334126

So we are 95% confident that k will be between -43.3 and -23.3 in 2050. Finally we combine k and the a and b schedules to forecast and plot age-specific mortality (in the log scale)

. gen p2050  = lc_a + lc_b *  km
(32 missing values generated)

. gen lb2050 = lc_a + lc_b * (km - 1.96*sd)
(32 missing values generated)

. gen ub2050 = lc_a + lc_b * (km + 1.96*sd)
(32 missing values generated)

. twoway (rarea ub2050 lb2050 agem, color(ltblue)) (line p2050 agem) ///
>      , legend(off) title(Forecast for 2050 with 95% CI) xt(age)

. graph export lcf3a.png, width(400) replace
(note: file lcf3a.png not found)
(file lcf3a.png written in PNG format)

Corrected Jump-Off

The final issue concerns the choice of jumpoff. Below we will make a "forecast" for 2005, the latest year for which we have data. The prediction is not too good for young adults because of a slight lack of fit in 1989 which propagates in the forecast. The solution is to "jump-off" from the actual 1989 rates.

. gen p2005 =     lc_a + lc_b * (-11.05 -0.365*(2005-1989))
(32 missing values generated)

. gen j2005 = logm1989 + lc_b * (       -0.365*(2005-1989))
(32 missing values generated)

. twoway (scatter logm2005 agem if age < 110) ///
>        (line p2005 agem, lpat(dash)) (line j2005 agem if age < 110) ///
>         , title(Observed and Forecast for 2005) xt(age) ///
>           legend(order(2 "standard" 3 "jumpoff 1989") ring(0) pos(5) cols(1))

. graph export lcf3b.png, width(400) replace
(note: file lcf3b.png not found)
(file lcf3b.png written in PNG format)

And that's all folks!