Eco572: Research Methods in Demography

Solutions to Problem Set 2

[1] U.S. Mortality

(a) Calculate age-specific period mortality rates for 1992 for blacks and whites

. insheet using http://data.princeton.edu/eco572/datasets/us1992a.dat
(5 vars, 19 obs)

. gen mw = deathsw/expow

. gen mb = deathsb/expob

. list age mw mb

     +--------------------------+
     | age        mw         mb |
     |--------------------------|
  1. |   0   .007809    .019579 |
  2. |   1   .000426    .000776 |
  3. |   5   .000213    .000375 |
  4. |  10   .000282    .000449 |
  5. |  15    .00106    .002184 |
     |--------------------------|
  6. |  20   .001354     .00321 |
  7. |  25   .001533    .003617 |
  8. |  30   .001958    .004644 |
  9. |  35   .002455    .006096 |
 10. |  40   .003122    .008032 |
     |--------------------------|
 11. |  45   .004325    .010657 |
 12. |  50   .006634    .014193 |
 13. |  55   .010715    .021036 |
 14. |  60   .017297    .029243 |
 15. |  65   .026885    .040291 |
     |--------------------------|
 16. |  70   .040124   .0572491 |
 17. |  75   .061488     .07502 |
 18. |  80   .097005   .1096976 |
 19. |  85   .179562   .1671715 |
     +--------------------------+

(b) Plot the rates against age on a log scale. Exclude the last group, which is actually 85+.

. gen agem = (age+age[_n+1])/2    // excludes 85+
(1 missing value generated)

. line mw mb agem, title(U.S. Males 1992) ///
>         yscale(log) ytitle(asmr) xtitle(age)  ///
>         legend(order(1 "White" 2 "Black") ring(0) pos(5) cols(1))

. graph export ps2fig1.png, replace
(file ps2fig1.png written in PNG format)

(c) Fit a Gompertz model over an appropriate age range.

Seems clear that black mortality rates are Gompertz over a wider range than those for whites, which seeem to show a change of slope around age 50. To avoid biases I decided to start at 40-44. To simplify interpretation of the intercept I subtract 40 from age. (You could plot the fits but this was not required.)

. gen lmw=log(mw)

. gen lmb=log(mb)

. gen lb40 = agem-40
(1 missing value generated)

. reg lmw lb40 if age >= 40

      Source |       SS       df       MS              Number of obs =       9
-------------+------------------------------           F(  1,     7) = 6841.82
       Model |  11.4650463     1  11.4650463           Prob > F      =  0.0000
    Residual |  .011730118     7  .001675731           R-squared     =  0.9990
-------------+------------------------------           Adj R-squared =  0.9988
       Total |  11.4767764     8  1.43459705           Root MSE      =  .04094

------------------------------------------------------------------------------
         lmw |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lb40 |   .0874263    .001057    82.72   0.000      .084927    .0899256
       _cons |  -6.053246   .0274181  -220.78   0.000    -6.118079   -5.988412
------------------------------------------------------------------------------

. di exp(_b[_cons]) ", " exp(_b[lb40])
.00235022, 1.0913619

. reg lmb lb40 if age >= 40

      Source |       SS       df       MS              Number of obs =       9
-------------+------------------------------           F(  1,     7) = 6962.81
       Model |  6.50170578     1  6.50170578           Prob > F      =  0.0000
    Residual |  .006536431     7  .000933776           R-squared     =  0.9990
-------------+------------------------------           Adj R-squared =  0.9989
       Total |  6.50824221     8  .813530276           Root MSE      =  .03056

------------------------------------------------------------------------------
         lmb |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lb40 |   .0658367    .000789    83.44   0.000      .063971    .0677024
       _cons |   -5.02427   .0204671  -245.48   0.000    -5.072667   -4.975873
------------------------------------------------------------------------------

. di exp(_b[_cons]) ", " exp(_b[lb40])
.00657639, 1.0680523

(d) Interpret the resulting parameters in terms of the underlying hazard rate and the rate of aging.

The hazard for blacks is almost triple that of whites at age 40 (.00658 versus .00235, from exponentiating the intercepts), but blacks "age" more slowly as the hazard increases only 6.8% per year, compared to 9.1% for whites (from exponentiating the slopes).

(e) Could one include the 85+ group? What assumption(s) might be needed?

Yes, one could, but assumptions are required. We attributed other hazards to the midpoint of the age group. The question is what to do with 85+. (1) Use 87.5, as if the group was 85-90, or maybe 90. (2) Assume constant hazard after 85. Then expectation of life at 85 is the reciprocal of the death rate, and adding 85 gives us midpoints of 90.6 for whites and 91 for blacks. (3) Assume Gompertz hazards. Use the model fitted to ages below 85 to see what age would correspond to the observed mortality rate. This gives 102 for whites and 95.3 for blacks. Given the uncertainty, the choice of omitting 85+ seems prudent. (Also, there is some evidence that after age 80 the hazard increases more slowly than in a Gompertz model, but that's a different issue.)

[2] Model Life Tables

Consider the life table of South African females you estimated in problem set 1. (I reestimate the table here using the latest data, which include an extra column for AIDS deaths to be used in [5].)

. drop _all

. set type double

. infile age expo deaths aids using ///
>         http://data.princeton.edu/eco572/datasets/safeda.dat
(18 observations read)

. gen m = deaths/expo

. // interval widths
. gen n = age[_n+1]-age  // interval width
(1 missing value generated)

. // time lived by deaths
. gen a = 2.5

. replace a =  .200 in 1  // or .16 if you prefer
(1 real change made)

. replace a = 1.442 in 2  // or 1.5
(1 real change made)

. replace a = 1/m in -1
(1 real change made)

. // life table
. gen q = n*m/(1+(n-a)*m)
(1 missing value generated)

. replace q = 1 in -1
(1 real change made)

. gen lx = 100000

. replace lx = lx[_n-1] * (1-q[_n-1]) in 2/-1
(17 real changes made)

. gen d = lx * q

. gen Lx = lx*n - d*a
(1 missing value generated)

. replace L = lx/m in -1
(1 real change made)

. quietly summarize L

. gen T = r(sum) - sum(L) + L

. gen e = T/lx

. format %9.0fc lx d L T

. format %6.3f a e

. list age a lx d L T e

     +---------------------------------------------------------------+
     | age       a        lx        d        Lx           T        e |
     |---------------------------------------------------------------|
  1. |   0   0.200   100,000    5,051    98,990   5,686,750   56.867 |
  2. |   1   1.442    94,949    5,246   372,233   5,587,760   58.850 |
  3. |   5   2.500    89,703      674   446,831   5,215,527   58.142 |
  4. |  10   2.500    89,029      327   444,329   4,768,696   53.563 |
  5. |  15   2.500    88,703    1,036   440,922   4,324,366   48.751 |
     |---------------------------------------------------------------|
  6. |  20   2.500    87,666    2,753   431,449   3,883,445   44.298 |
  7. |  25   2.500    84,913    4,579   413,119   3,451,996   40.653 |
  8. |  30   2.500    80,334    5,421   388,119   3,038,877   37.828 |
  9. |  35   2.500    74,913    5,413   361,036   2,650,758   35.384 |
 10. |  40   2.500    69,501    4,556   336,115   2,289,722   32.945 |
     |---------------------------------------------------------------|
 11. |  45   2.500    64,945    3,389   316,253   1,953,607   30.081 |
 12. |  50   2.500    61,556    3,022   300,226   1,637,354   26.599 |
 13. |  55   2.500    58,534    3,349   284,299   1,337,128   22.844 |
 14. |  60   2.500    55,185    4,342   265,072   1,052,829   19.078 |
 15. |  65   2.500    50,844    5,989   239,247     787,757   15.494 |
     |---------------------------------------------------------------|
 16. |  70   2.500    44,855    9,109   201,502     548,510   12.228 |
 17. |  75   2.500    35,746   10,690   152,004     347,008    9.708 |
 18. |  80   7.783    25,056   25,056   195,003     195,003    7.783 |
     +---------------------------------------------------------------+

. save safeda, replace  // for use in question 5
file safeda.dta saved

(a) Compute Brass's logit transformation and plot both the data and a linear fit against the general standard

. sort age

. merge age using http://data.princeton.edu/eco572/datasets/brassrlm5

. list age _merge if _merge != 3

     +--------------+
     | age   _merge |
     |--------------|
  1. |   0        1 |
 19. |  85        2 |
 20. |  90        2 |
 21. |  95        2 |
 22. | 100        2 |
     +--------------+

. drop if _merge != 3  // we don't want age 0 or ages above 80
(5 observations deleted)

. gen yobs = 0.5*log( (100000-lx)/lx )

. reg yobs ys

      Source |       SS       df       MS              Number of obs =      17
-------------+------------------------------           F(  1,    15) =  204.49
       Model |  4.61296171     1  4.61296171           Prob > F      =  0.0000
    Residual |  .338378129    15  .022558542           R-squared     =  0.9317
-------------+------------------------------           Adj R-squared =  0.9271
       Total |  4.95133984    16   .30945874           Root MSE      =   .1502

------------------------------------------------------------------------------
        yobs |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          ys |   .9615473   .0672413    14.30   0.000     .8182258    1.104869
       _cons |  -.4211604   .0366016   -11.51   0.000    -.4991748   -.3431459
------------------------------------------------------------------------------

. predict yfit
(option xb assumed; fitted values)

. twoway scatter yobs ys || line yfit ys, xtitle(Standard) ///
>         title(Brass Logit of Survival) legend(off) name(A, replace)

I will also compute and plot the qx's for comparison with the Coale-Demeny fit below. This was not required.

. gen qobs = d/lx

. gen lfit = 100000*exp(-2*yfit)/(1+exp(-2*yfit))

. gen qfit = (lfit-lfit[_n+1])/lfit
(1 missing value generated)

. gen agem=(age+age[_n+1])/2
(1 missing value generated)

. scatter qobs agem || line qfit agem, yscale(log) xtitle(age) ///
>         title(Conditional Probability of Dying) legend(off) name(B, replace)

. graph combine A B, xsize(6) ysize(3) ///
>         title("Relational Logit Fit") subtitle(South African Females) 

. graph export ps2fig2.png, replace
(file ps2fig2.png written in PNG format)

The model doesn't fit very well, it overestimates mortality at low and high ages and underestimates it in the middle.

(b) Can you find a table in the Coale-Demeny family that does a better job? (Your search doesn't have to be exhaustive, just pick one or two indicators to match in each family.)

The idea here is to pick one table from each family and see which one does best. Because these are one-parameter families, all we do is match one characteristic. The best one to pick is life expectancy at birth because it summarizes all age-specific rates. (It's also how the Coale-Demeny tables are indexed.) For a female life expectancy of 56.9 you should be looking at levels 15 and 16.

My own search suggested that the West family matched the rates a bit better (well, less badly) than the others, although North was also a possibility. The best choice among the published levels was 16, although a table with a life expectancy a bit above that would be marginally better. Here are the values for level 16

. input cdqx

           cdqx
  1.  .082306
  2.  .041572
  3.  .013368
  4.  .010379
  5.  .014964
  6.  .019634
  7.  .022534
  8.  .025731
  9.  .029521
 10.  .034348
 11.  .041981
 12.  .057336
 13.  .078602
 14.  .117346
 15.  .172555
 16.  .261811
 17.  .384501

The right-hand-side panel of the figure below shows the observed and fitted nqx's. I also computed the fitted lx and took Brass logits so we could compare the fit on that scale as well.

. scatter qobs agem || line cdqx agem , yscale(log) xtitle(age) ///
>         title(Conditional Probability of Dying) legend(off) name(B, replace)

. gen cdlx = 1

. qui replace cdlx = cdlx[_n-1]*(1-cdqx[_n-1]) if _n>1

. gen cdy = 0.5*log((1-cdlx)/cdlx)
(1 missing value generated)

. scatter yobs cdy || line cdy cdy, xtitle(Logit of Model) ///
>         title(Brass Logit of Survival) legend(off) name(A, replace)

. graph combine A B, xsize(6) ysize(3) ///
>         title("Coale-Demeny Fit") subtitle(South African Females) 

. graph export ps2fig3.png, replace
(file ps2fig3.png written in PNG format)

We see once again that the model life table underestimates mortality in the middle of the range. Obviously the South African pattern doesn't correspond to the structures represented in this model because of its excess middle age mortality. The reason will become apparent in part [5].

[3] Child Survival

Refer to the Pebley-Stupp data in the problem set. I will enter the age groups and hazards

. drop _all

. input age h

            age           h
  1.    0 .0457006
  2.    1 .0033105
  3.    6 .0040912
  4.   12 .0024781
  5.   24 .0004648
  6. end

(a) What's the probability of infant death (by 12.0 months) for the reference group?

. gen n = age[_n+1]-age // widths
(1 missing value generated)

. gen H = sum(n*h) // applies to END of each age group

. gen S = exp(-H)  // ditto

. gen agep = age[_n+1] // to avoid confusion
(1 missing value generated)

. list age agep S

     +------------------------+
     | age   agep           S |
     |------------------------|
  1. |   0      1   .95532794 |
  2. |   1      6   .93964503 |
  3. |   6     12   .91686017 |
  4. |  12     24   .88999672 |
  5. |  24      .   .88999672 |
     +------------------------+

. di 1 - S[3]
.08313983

The probability of infant death is 1 - 0.91686 = .08314 or 8.31%. (This is usually expressed as 83 per thousand.)

(b) Describe the effect of age of mother on child survival

There is a U-shaped relationship. Children whose mothers are under 20 have 20% higher risk, and those those mothers are 40 or older have 53% higher risk, than children whose mothers are 20-39. This applies at every age in the first couple of years of life.

(c) What's the probability of infant death for a 6th child of a 44 year old mother?

We introduce both risk factors (which effectively doubles the hazard, as 1.5327 times 1.3094 = 2.007) and note that multiplying the hazard by a proportionality factor is equivalent to raising the survival function to that factor.

. di 1 - S[3]^( 1.532668 * 1.309283)
.15985588

The probability of infant death for high-parity births to mothers age 40 or older is 0.16 (or 160 per thousand), almost double that of the reference group of lower parity births to younger mothers.

[4] Unobserved Heterogeneity

Consider two groups of individuals who face a constant hazard from age 20 onwards but are heterogenous, with gamma distributed frailty that averages 0.02 in group one and 0.03 in group two at age 20, and has a variance of one in both groups. Note that we only work from age 20 onwards.

(a) Group two has 50% higher risk at age 20. What's the risk ratio at age 40? At age 60? At age 80?

I'll create observations for ages 20, 40, 60, and 80. The population risk depends on the expected frailty of survivors as shown in Equation 3 in the handout:

. clear

. set obs 4
obs was 0, now 4

. gen age = 20*_n

. gen h1 = .02/(1+0.02*(age-20))

. gen h2 = .03/(1+0.03*(age-20))

. gen hratio = h2/h1

. list age h1 h2 hratio

     +-----------------------------------------+
     | age          h1          h2      hratio |
     |-----------------------------------------|
  1. |  20         .02         .03         1.5 |
  2. |  40   .01428571      .01875      1.3125 |
  3. |  60   .01111111   .01363636   1.2272727 |
  4. |  80   .00909091   .01071429   1.1785714 |
     +-----------------------------------------+

The risk start 50% higher but by age 80 it's only 18% higher.

(b) Suppose the high risk group represented half the total population at age 20. What fraction of the total would it be at age 40? 60? 80?

The population survival is given on Equation 2 in the handout

. gen S1 = 1/(1+0.02*(age-20))

. gen S2 = 1/(1+0.03*(age-20)) 

. gen prop2 = S2/(S1+S2)

. list age S1 S2 prop2

     +-----------------------------------------+
     | age          S1          S2       prop2 |
     |-----------------------------------------|
  1. |  20           1           1          .5 |
  2. |  40   .71428571        .625   .46666667 |
  3. |  60   .55555556   .45454545         .45 |
  4. |  80   .45454545   .35714286         .44 |
     +-----------------------------------------+

The high risk group starts at 50% but by age 80 it represents only 44% of the total, as the frail are selected out more quickly in higher risk groups.

Challenge: Can you answer these questions under the more realistic assumption that the baseline hazard is Gompertz from age 20 onwards, with a log hazard of -7 at age 20 and an aging rate of 0.08 for group one?

To answer this question we need to apply Equations 2 and 3 again, but with a Gompertz hazard. This can be done analytically or numerically.

Competing Risks

We now reuse the file we saved in question 2 above.

. use safeda, clear

(a) Construct a multiple decrement life table distinguishing deaths due to AIDS and to other causes.

. gen di = d * aids/deaths

. gen qi = di / lx

. format %9.0fc di 

. format %6.4f q qi

. list age lx d di q qi

     +--------------------------------------------------+
     | age        lx        d      di        q       qi |
     |--------------------------------------------------|
  1. |   0   100,000    5,051   1,271   0.0505   0.0127 |
  2. |   1    94,949    5,246   2,788   0.0553   0.0294 |
  3. |   5    89,703      674       0   0.0075   0.0000 |
  4. |  10    89,029      327     116   0.0037   0.0013 |
  5. |  15    88,703    1,036     754   0.0117   0.0085 |
     |--------------------------------------------------|
  6. |  20    87,666    2,753   2,347   0.0314   0.0268 |
  7. |  25    84,913    4,579   4,007   0.0539   0.0472 |
  8. |  30    80,334    5,421   4,727   0.0675   0.0588 |
  9. |  35    74,913    5,413   4,509   0.0723   0.0602 |
 10. |  40    69,501    4,556   3,358   0.0656   0.0483 |
     |--------------------------------------------------|
 11. |  45    64,945    3,389   1,790   0.0522   0.0276 |
 12. |  50    61,556    3,022     862   0.0491   0.0140 |
 13. |  55    58,534    3,349     438   0.0572   0.0075 |
 14. |  60    55,185    4,342     220   0.0787   0.0040 |
 15. |  65    50,844    5,989      98   0.1178   0.0019 |
     |--------------------------------------------------|
 16. |  70    44,855    9,109      33   0.2031   0.0007 |
 17. |  75    35,746   10,690       5   0.2991   0.0001 |
 18. |  80    25,056   25,056       0   1.0000   0.0000 |
     +--------------------------------------------------+

I list the conditional and unconditional probabilities of dying of AIDS at each age.

(b) What's the probability that a woman who dies before age 50 dies of AIDS?

. quietly summarize d if age < 50

. scalar db50 = r(sum)

. quietly summarize di if age < 50

. di r(sum)/db50
.66764641

So two-thirds of female deaths before age 50 are due to AIDS.

(c) Construct the cause-deleted table in the absence of AIDS.

There are different ways of doing this, depending on how you handle nax. You could use the same factors as for the overall lifetable, assume constant risk in each interval, or use the more elaborate procedure described in the text. I'll use the first approach for consistency with the table in Q. 2.

. gen Rd=(deaths-aids)/deaths

. gen pd = (1-q)^Rd       // Chiang's method

. gen ld = 100000

. replace ld = ld[_n-1] * pd[_n-1] in 2/-1
(17 real changes made)

. gen dd = ld*(1-pd)

. replace dd = ld in -1
(0 real changes made)

. gen Ld = ld * n - dd * a
(1 missing value generated)

. replace Ld = ld/(m*Rd) in -1
(1 real change made)

. quietly sum Ld

. gen Td = r(sum) - sum(Ld) + Ld

. gen ed = Td/ld

. format %9.0fc ld dd Ld Td

. format %6.3f ed

. list age a ld dd Ld Td ed

     +---------------------------------------------------------------+
     | age       a        ld       dd        Ld          Td       ed |
     |---------------------------------------------------------------|
  1. |   0   0.200   100,000    3,805    99,239   6,905,118   69.051 |
  2. |   1   1.442    96,195    2,528   381,136   6,805,879   70.751 |
  3. |   5   2.500    93,668      704   466,578   6,424,743   68.591 |
  4. |  10   2.500    92,964      221   464,267   5,958,165   64.091 |
  5. |  15   2.500    92,743      297   462,974   5,493,898   59.238 |
     |---------------------------------------------------------------|
  6. |  20   2.500    92,447      434   461,148   5,030,924   54.420 |
  7. |  25   2.500    92,013      635   458,477   4,569,776   49.665 |
  8. |  30   2.500    91,378      813   454,857   4,111,299   44.992 |
  9. |  35   2.500    90,565    1,126   450,009   3,656,442   40.374 |
 10. |  40   2.500    89,439    1,581   443,242   3,206,433   35.851 |
     |---------------------------------------------------------------|
 11. |  45   2.500    87,858    2,194   433,807   2,763,190   31.451 |
 12. |  50   2.500    85,665    3,028   420,753   2,329,384   27.192 |
 13. |  55   2.500    82,637    4,126   402,869   1,908,631   23.097 |
 14. |  60   2.500    78,511    5,875   377,866   1,505,762   19.179 |
 15. |  65   2.500    72,635    8,424   342,118   1,127,896   15.528 |
     |---------------------------------------------------------------|
 16. |  70   2.500    64,212   12,999   288,562     785,778   12.237 |
 17. |  75   2.500    51,213   15,310   217,791     497,216    9.709 |
 18. |  80   7.783    35,903   35,903   279,426     279,426    7.783 |
     +---------------------------------------------------------------+

(d) How many years of life expectancy are lost to AIDS?

. di ed[1]-e[1]
12.183684

Life expectancy would increase from 56.8 to 69.1, a gain of 12.2 years (a 21.4% increase).

(e) What assumption underlies the calculations in c and d? How do you think the true impact of AIDS on life expectancy differs from your estimate in d?

The key assumption is that the underlying risks are independent. If women who die of AIDS are also at high risk of dying due to other causes, as seems likely, the gain in life expectancy that would obtain from eliminating AIDS would be less than 12 years.