Eco572: Research Methods in Demography

Solutions to Problem Set 1

[1] World Population

(a) The U.S. Census Bureau (http://www.census.gov/ipc/www/popclockworld.html) estimates the world population at the time this problem set was due, 3/1/06, as 6,500,609,361. (For an up-to-the-second answer see http://opr.princeton.edu/popclock).

The same page shows the population going from 6,451,058,790 to 6,525,486,603 between 7/1/05 and 7/1/06, for an annual growth rate of

. scalar r = log(6525486603/6451058790)

. display r
.01147125

(Can you figure out how they interpolate between these numbers to get monthly figures or run the population clock?)

(b) The time it takes to reach 12 billion from now is

. scalar p0 = 6500609361

. scalar t = log(12000000000/p0)/r

. display t
53.438856

and the date is

. scalar d = date("3/1/2006","mdy") + t * 365.25

. display month(d) "/" day(d) "/" year(d)
8/8/2059

(c) To postpone reaching 12 billion until 12/31/2100 we would need an average rate of

. scalar y = (date("12/31/2100","mdy")-date("3/1/2006","mdy"))/365.25

. scalar ar = log(12000000000/p0)/y

. display ar
.00646406

If the growth rate were to decline linearly from its present value of r, it would have to be

. display 2 * ar - r

.00145687

by the time we reach 12 billion to meet the average. In other words, we would need to reach almost zero population growth by the end of the century.
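As a cross-check outside Stata, here is a minimal Python sketch of the same arithmetic. The variable names are mine; the population figures are the Census Bureau estimates quoted above. The last step uses the fact that a rate falling linearly from r to some final value averages half their sum.

```python
import math
from datetime import date, timedelta

# Census Bureau estimates quoted in part (a)
p_2005 = 6_451_058_790        # 7/1/05
p_2006 = 6_525_486_603        # 7/1/06
p0 = 6_500_609_361            # 3/1/06, when the problem set was due
target = 12_000_000_000

# (a) exponential growth rate over one year
r = math.log(p_2006 / p_2005)

# (b) years needed to reach the target at rate r, and the calendar date
t = math.log(target / p0) / r
start = date(2006, 3, 1)
reach = start + timedelta(days=t * 365.25)   # the fractional day is dropped

# (c) average rate that postpones the target to 12/31/2100
y = (date(2100, 12, 31) - start).days / 365.25
ar = math.log(target / p0) / y

# A rate declining linearly from r to r_end averages (r + r_end) / 2,
# so setting that average equal to ar gives r_end = 2 * ar - r
r_end = 2 * ar - r

print(f"r = {r:.8f}, t = {t:.4f} years, date = {reach}")
print(f"ar = {ar:.8f}, r_end = {r_end:.8f}")
```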

[2] Standardization

I cut and pasted the data, put quotes around the age group labels, and then read it into Stata:

. clear

. input str5 ageg urbann urbanuse ruraln ruraluse

          ageg      urbann    urbanuse      ruraln    ruraluse
  1. "15-19" 298 37 1451 88 
  2. "20-24" 332 116 1020 281 
  3. "25-29" 267 132 926 370 
  4. "30-34" 185 103 714 299 
  5. "35-39" 162 73 782 343 
  6. "40-44" 69 37 556 213 
  7. "45-49" 68 17 452 119 
  8. end

The calculations are a lot easier if we stack rural below urban (which you can do by hand or using a reshape command) as you can then follow the handout using tabstat.

(a) I reshape, compute the prevalence rates, and average them using the actual n's to get crude rates.

. quietly reshape long @n @use, i(ageg) j(tpr) string

. gen prev = use/n

. tabstat prev [fw=n], by(tpr)

Summary for variables: prev
     by categories of: tpr 

  tpr |      mean
------+----------
rural |  .2902898
urban |  .3729182
------+----------
Total |  .3059599
-----------------

The urban prevalence rate is much higher than the rural one, 37.3% versus 29.0%. To get standardized rates I compute the urban and rural age compositions, average them, and use the average as the weight:

. egen comp = pc(n), by(tpr)

. egen avgcomp = mean(comp), by(ageg)

. tabstat prev [aw=avgcomp], by(tpr)

Summary for variables: prev
     by categories of: tpr 

  tpr |      mean
------+----------
rural |  .2932567
urban |  .3690886
------+----------
Total |  .3311726
-----------------

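The rural sample is younger than the urban, but that accounts for only a small part of the difference, as the standardized rates are 36.9% and 29.3%. We can also use the overall age distribution. (As coded below, totcomp is the sum of the two percent compositions, which is just twice avgcomp, so the results are identical; pooling the counts instead would give the larger rural sample more weight.)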

. egen totcomp = sum(comp), by (ageg)

. tabstat prev [w=totcomp], by(tpr)
(analytic weights assumed)

Summary for variables: prev
     by categories of: tpr 

  tpr |      mean
------+----------
rural |  .2932567
urban |  .3690886
------+----------
Total |  .3311726
-----------------

(b) I now compute the average prevalence rates and average these using the observed numbers of women, to see how much difference the age composition makes when both areas share the same rates:

. egen avgprev = mean(prev), by(ageg)

. tabstat avgprev [fw=n], by(tpr)

Summary for variables: avgprev
     by categories of: tpr 

  tpr |      mean
------+----------
rural |  .3277744
urban |  .3345709
------+----------
Total |  .3290633
-----------------

If we made urban women younger the prevalence rate would go down only a little bit. Here's the final decomposition 'by hand'

. display _newline  "Difference= " .3729182 - .2902898 ///
>          "; Same Composition = " .3690886 - .2932567 ///
>          "; Same Rates = "       .3345709 - .3277744

Difference= .0826284; Same Composition = .0758319; Same Rates = .0067965

So 92% of the difference is due to differences in rates and only 8% to differences in age structure.
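The whole decomposition can be reproduced in a few lines of Python, which is a handy way to check that the two pieces add up. This is only a sketch with my own helper and variable names; the counts are the ones entered above. With average compositions and average rates as the standards, the rate effect and the composition effect sum exactly to the crude difference.

```python
# Women (n) and contraceptive users (use) by age group 15-19, ..., 45-49
urban_n = [298, 332, 267, 185, 162, 69, 68]
urban_use = [37, 116, 132, 103, 73, 37, 17]
rural_n = [1451, 1020, 926, 714, 782, 556, 452]
rural_use = [88, 281, 370, 299, 343, 213, 119]


def weighted_mean(values, weights):
    """Weighted average of values; weights need not sum to one."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


urban_prev = [u / n for u, n in zip(urban_use, urban_n)]
rural_prev = [u / n for u, n in zip(rural_use, rural_n)]

# Crude rates: age-specific rates weighted by each area's own counts
crude_urban = weighted_mean(urban_prev, urban_n)
crude_rural = weighted_mean(rural_prev, rural_n)

# Standardized rates: own rates, average of the two age compositions
urban_comp = [n / sum(urban_n) for n in urban_n]
rural_comp = [n / sum(rural_n) for n in rural_n]
avg_comp = [(u + r) / 2 for u, r in zip(urban_comp, rural_comp)]
std_urban = weighted_mean(urban_prev, avg_comp)
std_rural = weighted_mean(rural_prev, avg_comp)

# Same rates (average of urban and rural), own age compositions
avg_prev = [(u + r) / 2 for u, r in zip(urban_prev, rural_prev)]
same_rates_urban = weighted_mean(avg_prev, urban_n)
same_rates_rural = weighted_mean(avg_prev, rural_n)

total = crude_urban - crude_rural
rate_effect = std_urban - std_rural
comp_effect = same_rates_urban - same_rates_rural

print(f"difference = {total:.7f}")
print(f"rates = {rate_effect:.7f} ({rate_effect / total:.0%})")
print(f"composition = {comp_effect:.7f} ({comp_effect / total:.0%})")
```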

(c) The question here is whether to compare urban and rural we need all seven age-specific prevalence rates for each region, or can do with a summary. As it happens, contraceptive use is higher in urban than in rural areas in every age group except 45-49, where the rural rate is slightly higher (26.3% versus 25.0%). The urban rate averages about 7.5 percentage points higher, albeit with variations from age to age. This is very similar to the difference we get from the standardized prevalence rates, so reporting that number is not misleading; it provides a useful summary while controlling for age structure.

[3] Smoothing

We read the data for Ghana 1979-80 from the course website

. clear

. infile age n using ///
>         http://data.princeton.edu/eco572/datasets/ghhhpop.dat
(97 observations read)

(a) We are told that age is top coded at 95 and that 99 means not stated, so we only have information in single years up to 94. That means we can only go up to 89 to have each digit appear the same number of times. In order to leave room for blending, we need to work with the range 0 to 79:

. myers age [fw=n], range(0 79)

 Last digit |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |     36,759       15.13       15.13
          1 |     19,208        7.91       23.04
          2 |     25,324       10.42       33.46
          3 |     20,600        8.48       41.94
          4 |     21,470        8.84       50.77
          5 |     29,564       12.17       62.94
          6 |     23,852        9.82       72.76
          7 |     19,308        7.95       80.71
          8 |     24,837       10.22       90.93
          9 |     22,040        9.07      100.00
------------+-----------------------------------
      Total |    242,962      100.00

Myers' Blended Index = 7.9432998

We see substantial preference for ages ending in 0 and 5. The value of the index means that we would need to reshuffle almost 8% of the observations to obtain the expected 10% in each digit for the blended population.
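For those curious about what the blending does, here is a rough Python sketch of the calculation as it is usually described in textbooks. It is not the code behind the myers command used above, and the function name and arguments are my own. I assume that range(0 79) means the first sum runs over ages 0-79 and the second over ages 10-89, and I report half the sum of absolute deviations from 10%, which matches the "reshuffle" interpretation. The two test populations are synthetic, not the Ghana data.

```python
def myers_blended(counts, start=0, decades=8):
    """Blended percent by terminal digit and half the sum of |pct - 10|.

    counts[age] is the number of persons at each single year of age and
    must cover ages start .. start + 10 * decades + 9.
    """
    blended = [0.0] * 10
    for d in range(10):
        # Sum over ages start+d, start+d+10, ... within the first range
        first = sum(counts[start + d + 10 * k] for k in range(decades))
        # Same terminal digit, with the range shifted up by ten years
        second = sum(counts[start + d + 10 * k] for k in range(1, decades + 1))
        # Weights d+1 and 9-d add up to 10 for every digit
        blended[(start + d) % 10] = (d + 1) * first + (9 - d) * second
    total = sum(blended)
    pct = [100 * b / total for b in blended]
    index = sum(abs(p - 10) for p in pct) / 2
    return pct, index


# Synthetic check 1: a perfectly flat age distribution shows no preference
flat = [100] * 90
flat_pct, flat_index = myers_blended(flat)

# Synthetic check 2: everybody reports an age ending in zero
all_zero = [1 if age % 10 == 0 else 0 for age in range(90)]
zero_pct, zero_index = myers_blended(all_zero)

print(flat_index, zero_index)
```

The second case gives the maximum possible value, since 90% of the blended population would have to move to other digits.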

(b) For smoothing we can go all the way to 94, but you should still exclude 95 and 99. A good way to start is to plot the actual counts versus age and then try lowess with different bandwidths. You should find that the default over-smooths, but a bandwidth in the vicinity of 0.25 seems to do a reasonable job:

. lowess n age if age < 95, bwidth(.25) gen(lowess)

An alternative is a cubic spline. I thought I would try only two internal knots, one at 20 and one at 50, just because there is more going on at young ages. Here's my fit and a plot that compares the two smoothers:

. bspline if age < 95, xvar(age) knots(0 20 50 100) p(3) gen(_bs)

(2 missing values generated)

. quietly regress n _bs*, noconstant

. predict spline
(option xb assumed; fitted values)
(2 missing values generated)

. twoway (scatter n age) (line lowess age, lp(dash))      ///     
>         (line spline age)       ,                                       ///
>         title("Ghana 1979-80") subtitle(WFS Household Survey) ///
>         legend(order(2 "Lowess" 3 "Spline") ring(0) pos(2) cols(1))

. graph export ps1fig1.png, replace
(file ps1fig1.png written in PNG format)

There's very little to choose between these two. I think the spline is a bit better but then, isn't it always?

(c) We can now generate the last digit and tabulate with the smooth and observed frequencies, saving the results in a couple of matrices

. gen lastdigit = mod(age,10) if age < 95

(2 missing values generated)

. tab lastdigit [w=n], matcell(obs)
(frequency weights assumed)

  lastdigit |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      4,682       16.17       16.17
          1 |      2,654        9.17       25.34
          2 |      3,259       11.26       36.60
          3 |      2,720        9.40       45.99
          4 |      2,659        9.19       55.18
          5 |      3,381       11.68       66.86
          6 |      2,686        9.28       76.14
          7 |      2,116        7.31       83.45
          8 |      2,580        8.91       92.36
          9 |      2,212        7.64      100.00
------------+-----------------------------------
      Total |     28,949      100.00

. tab lastdigit [aw=spline], matcell(spline)

  lastdigit |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 | 11.0801683       11.66       11.66
          1 | 10.7378799       11.30       22.97
          2 | 10.3957421       10.94       33.91
          3 | 10.0552701       10.58       44.49
          4 |  9.7180278       10.23       54.72
          5 |  9.2876905        9.78       64.50
          6 | 8.93940788        9.41       73.91
          7 | 8.59612969        9.05       82.96
          8 | 8.25928994        8.69       91.65
          9 | 7.93039371        8.35      100.00
------------+-----------------------------------
      Total |         95      100.00

I had to use analytic weights the second time because the smooth frequencies are fractional. To compare the two vectors of percents I proceed just as in the handout, with a one-line call to Mata to compute half the sum of absolute differences:

. matrix diff = obs/28949 - spline/95

. mata: sum(abs(st_matrix("diff")))/2
  .0694565698

So we would need to reshuffle almost 7% of the observations. As you can see the Myers index does a reasonable job.
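The same index of dissimilarity can be checked in Python by typing in the two frequency columns tabulated above. This is just a sketch; the function name is mine, and the smooth frequencies are the rounded values printed by tab.

```python
# Observed and spline-smoothed frequencies by last digit 0, 1, ..., 9
obs = [4682, 2654, 3259, 2720, 2659, 3381, 2686, 2116, 2580, 2212]
smooth = [11.0801683, 10.7378799, 10.3957421, 10.0552701, 9.7180278,
          9.2876905, 8.93940788, 8.59612969, 8.25928994, 7.93039371]


def dissimilarity(a, b):
    """Half the sum of absolute differences between two distributions."""
    pa = [x / sum(a) for x in a]
    pb = [x / sum(b) for x in b]
    return sum(abs(x - y) for x, y in zip(pa, pb)) / 2


index = dissimilarity(obs, smooth)
print(f"{index:.6f}")
```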

[4] Life Tables

Let us read the data from South Africa. (I just added another column to the dataset, with AIDS deaths, hence my infile may differ from yours.) I will also compute the age-specific rates.

. clear

. infile ageg expo deaths aids using ///
>         http://data.princeton.edu/eco572/datasets/safed.dat
(18 observations read)

. set type double // all generates will be double precision

. gen m = deaths/expo

(a) The big question here concerns the choice of "separation" factors. You might be tempted to borrow factors from Austria or the textbook, but it is not clear why those values should apply to South Africa.

Fortunately, it is only the very young ages that matter. I will use 2.5 everywhere, and will "make up" values for the first two age groups using the Coale-Demeny formula for West female model life tables, just as in the handout.

. gen a = 2.5

. replace a = cond(m[1] >= 0.107,  .350,  .053 + 2.800 * m[1]) in 1
(1 real change made)

. replace a = cond(m[1] >= 0.107, 1.361, 1.522 - 1.518 * m[1]) in 2
(1 real change made)

. list age a in 1/3

     +------------------+
     | ageg           a |
     |------------------|
  1. |    0    .2003725 |
  2. |    1   1.4421031 |
  3. |    5         2.5 |
     +------------------+

The blackboard site for last year's class has a calculated life table that used 1.5 for ages 1-4 and 0.159 for ages 0-1. I don't know the origin of the 0.159, but considering that South Africa has pretty high infant mortality it might be a better estimate than 0.200.

The next step is to set the width of the age groups, estimate nqx, and compute the remaining life table functions. The last (open-ended) age group requires special treatment, as usual.

. gen n = age[_n+1]-age // width of age intervals

(1 missing value generated)

. gen q = n * m/(1+(n-a)*m)
(1 missing value generated)

. replace q = 1 in -1
(1 real change made)

. gen p = 1-q

. gen lx = 100000 in 1
(17 missing values generated)

. quietly replace lx = lx[_n-1] * p[_n-1] in 2/-1

. gen d = lx - lx[_n+1]
(1 missing value generated)

. replace d = lx in -1
(1 real change made)

. gen L = n * lx[_n+1] + a*d
(1 missing value generated)

. replace L = lx/m in -1
(1 real change made)

. quietly summarize L

. gen T = r(sum) - sum(L) + L

. gen e = T/lx

For printing I will use the same formats as in the handout

. format %6.3f a e

. format %8.6f m q 

. format %9.0fc lx d L T

. list age m a q lx d L T e

     +--------------------------------------------------------------------------------------+
     | ageg          m       a          q        lx        d         L           T        e |
     |--------------------------------------------------------------------------------------|
  1. |    0   0.052633   0.200   0.050507   100,000    5,051    95,961   5,677,861   56.779 |
  2. |    1   0.014319   1.442   0.055251    94,949    5,246   366,378   5,581,900   58.788 |
  3. |    5   0.001508   2.500   0.007514    89,703      674   446,831   5,215,522   58.142 |
  4. |   10   0.000735   2.500   0.003670    89,029      327   444,329   4,768,691   53.563 |
  5. |   15   0.002350   2.500   0.011683    88,702    1,036   440,921   4,324,362   48.751 |
     |--------------------------------------------------------------------------------------|
  6. |   20   0.006380   2.500   0.031401    87,666    2,753   431,448   3,883,441   44.298 |
  7. |   25   0.011084   2.500   0.053927    84,913    4,579   413,119   3,451,992   40.653 |
  8. |   30   0.013967   2.500   0.067478    80,334    5,421   388,119   3,038,874   37.828 |
  9. |   35   0.014992   2.500   0.072250    74,913    5,412   361,036   2,650,755   35.384 |
 10. |   40   0.013555   2.500   0.065552    69,501    4,556   336,115   2,289,719   32.945 |
     |--------------------------------------------------------------------------------------|
 11. |   45   0.010716   2.500   0.052180    64,945    3,389   316,253   1,953,605   30.081 |
 12. |   50   0.010066   2.500   0.049093    61,556    3,022   300,226   1,637,352   26.599 |
 13. |   55   0.011780   2.500   0.057215    58,534    3,349   284,298   1,337,126   22.844 |
 14. |   60   0.016379   2.500   0.078673    55,185    4,342   265,072   1,052,828   19.078 |
 15. |   65   0.025031   2.500   0.117783    50,844    5,988   239,247     787,756   15.494 |
     |--------------------------------------------------------------------------------------|
 16. |   70   0.045207   2.500   0.203082    44,855    9,109   201,502     548,510   12.228 |
 17. |   75   0.070327   2.500   0.299054    35,746   10,690   152,004     347,008    9.708 |
 18. |   80   0.128490   2.500   1.000000    25,056   25,056   195,003     195,003    7.783 |
     +--------------------------------------------------------------------------------------+

We get a life expectancy of only 56.8 years. We will look further into this result in the next problem set.
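If you want to check the steps without Stata, the following Python sketch rebuilds the table from the rates. The function name and layout are my own, and the rates are typed in from the m column printed above, so they are rounded to six decimals and the results agree with the listing only to about that precision.

```python
# Age-group starting ages and death rates from the m column above
ages = [0, 1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80]
m = [0.052633, 0.014319, 0.001508, 0.000735, 0.002350, 0.006380,
     0.011084, 0.013967, 0.014992, 0.013555, 0.010716, 0.010066,
     0.011780, 0.016379, 0.025031, 0.045207, 0.070327, 0.128490]


def life_table(ages, m, radix=100000):
    """Abridged life table with Coale-Demeny West female a's under age 5."""
    k = len(ages)
    n = [ages[i + 1] - ages[i] for i in range(k - 1)]   # closed intervals
    a = [2.5] * k
    if m[0] >= 0.107:
        a[0], a[1] = 0.350, 1.361
    else:
        a[0] = 0.053 + 2.800 * m[0]
        a[1] = 1.522 - 1.518 * m[0]
    # Convert rates to probabilities; everybody dies in the open interval
    q = [n[i] * m[i] / (1 + (n[i] - a[i]) * m[i]) for i in range(k - 1)] + [1.0]
    lx = [float(radix)]
    for i in range(k - 1):
        lx.append(lx[i] * (1 - q[i]))
    d = [lx[i] * q[i] for i in range(k)]
    # Person-years: survivors contribute n years, deaths a years each
    L = [n[i] * lx[i + 1] + a[i] * d[i] for i in range(k - 1)]
    L.append(lx[-1] / m[-1])                             # open-ended group
    T = [sum(L[i:]) for i in range(k)]
    e = [T[i] / lx[i] for i in range(k)]
    return {"a": a, "q": q, "lx": lx, "d": d, "L": L, "T": T, "e": e}


table = life_table(ages, m)
print(f"1q0 = {table['q'][0]:.6f}, e0 = {table['e'][0]:.2f}")
```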

(b) If we assume that the hazard is constant in each age group we can estimate it with the observed rate and convert to a probability using nqx = 1-exp(-n nmx):

. gen q2 = 1 - exp(-n*m)

(1 missing value generated)

To see how much difference this makes I compute absolute and relative differences, summarize them, and list noteworthy ages

. gen absd = abs(q-q2) if q < 1

(1 missing value generated)

. gen reld = absd/q
(1 missing value generated)

. sum absd reld

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        absd |        17    .0002864    .0006474   4.13e-09   .0025926
        reld |        17    .0023034    .0042379   1.12e-06   .0151376

. list age q q2 absd reld if absd > 0.0005 | reld > .005 

     +-----------------------------------------------------+
     | ageg          q          q2        absd        reld |
     |-----------------------------------------------------|
  1. |    0   0.050507    .0512719   .00076456    .0151376 |
  2. |    1   0.055251   .05566552   .00041429   .00749823 |
 16. |   70   0.203082   .20230828   .00077322   .00380746 |
 17. |   75   0.299054   .29646151   .00259258   .00866928 |
 18. |   80   1.000000           .           .           . |
     +-----------------------------------------------------+

The method only makes noticeable differences at the extremes, in particular age 0-1, where the relative difference is 1.5%, and to a lesser extent 75-79, where it is almost one percent.

(c) The infant mortality "rate" is a bit of a misnomer because demographers usually mean 1q0, which we estimated to be 50.5 per thousand, so roughly five percent of South African girls born around 1998 died before their first birthday. Note that there is some uncertainty on this estimate due to the fact that we don't know the average age at death for infant deaths.

(The other common definition of infant mortality rate is deaths under age one in a calendar year divided by births in that year. This is also trying to approximate the probability of dying between birth and age one.)

(d) The probability of surviving from ages 20 to 40 is easily obtained from rows 6 and 10 in the survival function:

. di l[10]/l[6]

.79279105

The probability is 79.3%, which is remarkably low. At current mortality levels, one in five 20-year olds will die before age 40.

(e) The expectation of life at age one can be higher than at birth when infant mortality is very high relative to mortality after age one. It is easy to see how this can happen in a heterogeneous population, where the frail are more likely to die early, but it can also occur in a homogeneous population. The key is to remember that e1 is a mean conditional on having survived to age one.

Life expectancy at birth is the same as at age one (so children don't "age" in the first year) when the infant mortality rate 1q0 equals the proportion of person-years lived at age zero, L0/T0. To prove this result recall that ex = Tx/lx. Setting e0 = e1 leads to l1/l0 = T1/T0, so the probability of surviving to age one must equal the proportion of time lived after age one to balance things out. Subtracting both sides from one gives 1q0 = 1 - T1/T0 = L0/T0, because T0 - T1 = L0. When 1q0 exceeds L0/T0, e1 is higher than e0.
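A quick numerical check with the South African table from part (a) shows which side of this condition we are on. The numbers are copied from the first two rows of the listing; the variable names are mine.

```python
# First two rows of the life table in part (a)
l0, l1 = 100000, 94949
L0 = 95961
T0, T1 = 5677861, 5581900

q0 = 1 - l1 / l0            # probability of dying before age one
share_age0 = L0 / T0        # proportion of person-years lived at age zero
e0, e1 = T0 / l0, T1 / l1

# e1 exceeds e0 exactly when q0 exceeds the share of time lived at age zero
print(f"1q0 = {q0:.4f}, L0/T0 = {share_age0:.4f}")
print(f"e0 = {e0:.2f}, e1 = {e1:.2f}")
```

Here the infant mortality rate is about three times the share of person-years lived at age zero, so life expectancy rises by about two years after surviving infancy.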

[5] More on Life Tables

(a) The latest U.S. life table is for 2002 and is available on the NCHS website. The table is constructed using data on births and deaths by calendar year, combined with separation factors that attempt to distinguish infant deaths by cohort. See the NCHS documentation for further information. Our main interest is to discover the implied nax factors, so we can compare them with the Coale-Demeny estimate. The key quantities needed are

. clear

. input age lx deaths L

            age          lx      deaths           L
  1. 0 100000 697   99389
  2. 1  99303 122  396922
  3. 5  99180 . .
  4. end

We can then see the implied time lived by deaths in 0-1 and 1-5:

. di (L[1] - lx[2])/d[1]

.12338594

. di (L[2] - 4*lx[3])/d[2]
1.6557377

The Coale-Demeny factors for males can be estimated as

. scalar m = d[1]/L[1]

. di cond(m >= 0.107,  .330,  .045 + 2.684 * m)
.06382249

. di cond(m >= 0.107, 1.352, 1.651 - 2.816 * m)
1.6312518

As you can see, the values are quite different for age 0 but very similar for age 1-4.
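The implied factors come from solving nLx = n l(x+n) + nax ndx for nax. Here is a short Python sketch of that step and of the Coale-Demeny West male formulas, using the NCHS numbers entered above; the function names are mine.

```python
# NCHS 2002 life table values entered above
l1, l5 = 99303, 99180
d0, d1 = 697, 122
L0, L1 = 99389, 396922


def implied_a(L, l_next, n, d):
    """Solve L = n * l_next + a * d for a, the mean years lived by deaths."""
    return (L - n * l_next) / d


def coale_demeny_west_male(m0):
    """Coale-Demeny West male factors (1a0, 4a1) given the infant rate m0."""
    if m0 >= 0.107:
        return 0.330, 1.352
    return 0.045 + 2.684 * m0, 1.651 - 2.816 * m0


a0_nchs = implied_a(L0, l1, 1, d0)
a1_nchs = implied_a(L1, l5, 4, d1)
a0_cd, a1_cd = coale_demeny_west_male(d0 / L0)

print(f"age 0:   NCHS {a0_nchs:.4f}  Coale-Demeny {a0_cd:.4f}")
print(f"age 1-4: NCHS {a1_nchs:.4f}  Coale-Demeny {a1_cd:.4f}")
```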

(b) Social Security. The U.S. life tables actually used in this debate are the 2002 life tables from part (a), which are available by ethnicity. The relevant numbers are shown below. (I added lx to verify the life expectancies, but it is otherwise not needed.)

. clear

. set type double

. input age lw Tw lb Tb

            age          lw          Tw          lb          Tb
  1.  0 100000 7510719 100000 6876522
  2. 20  98605 5528046  97368 4914598
  3. 65  79874 1327634  65695  960629
  4. end

. gen ew = Tw/lw

. gen eb = Tb/lb

. list age ew eb

     +-----------------------------+
     | age          ew          eb |
     |-----------------------------|
  1. |   0    75.10719    68.76522 |
  2. |  20   56.062532   50.474468 |
  3. |  65   16.621604   14.622559 |
     +-----------------------------+

So all the statistics quoted in the debate are in fact correct. In particular, whites live only 13.7% longer than blacks conditional on reaching age 65, although unconditionally (per person born) they spend on average about 38% more years after age 65, 13.3 versus 9.6.

Let us calculate the person-years spent in retirement (collecting) over those spent working (paying)

. display Tw[3]/(Tw[2]-Tw[3])

.31607233

. display Tb[3]/(Tb[2]-Tb[3])
.24295309

So whites, as a cohort, spend 0.316 years (or almost four months) in retirement for every year they spend working. The comparable ratio for blacks is 0.243 years (or less than three months). So, relative to the time spent working, whites spend 30% longer in retirement.
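These ratios are easy to verify in Python from the T columns entered above. The sketch treats ages 20 to 65 as working years and ages 65 and over as retirement, exactly as in the Stata commands; the function and variable names are mine.

```python
# Person-years lived above ages 20 and 65 (T columns entered above)
T20_w, T65_w = 5528046, 1327634      # whites
T20_b, T65_b = 4914598, 960629       # blacks


def retirement_ratio(T20, T65):
    """Person-years lived after 65 per person-year lived between 20 and 65."""
    return T65 / (T20 - T65)


ratio_w = retirement_ratio(T20_w, T65_w)
ratio_b = retirement_ratio(T20_b, T65_b)
relative = ratio_w / ratio_b

print(f"whites {ratio_w:.3f} ({12 * ratio_w:.1f} months per year worked)")
print(f"blacks {ratio_b:.3f} ({12 * ratio_b:.1f} months per year worked)")
print(f"whites spend {relative - 1:.0%} longer in retirement, relatively")
```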

(c) To see how much of the difference in life expectancy is due to infant and child mortality we follow the procedure outlined in Preston et al., page 65. We need the little l, big L, and T columns for ages 0, 1, and 5:

. clear

. set type double

. input age lw Lw Tw lb Lb Tb

            age          lw          Lw          Tw          lb          Lb          Tb
  1. 0 100000  99439 7510719 100000  98650 6876522
  2. 1  99358 397142 7411080  98461 393347 6777872
  3. 5  99234 495971 7013938  98249 490958 6384525
  4. end

. gen delta = (Lw/lw - Lb/lb) * lb/100000 + ///
>  (lb/lw - lb[_n+1]/lw[_n+1]) * Tw[_n+1]/100000 
(1 missing value generated)

. scalar diff = (Tw[1]-Tb[1])/100000

. gen pc = delta/diff
(1 missing value generated)

. list age delta pc in 1/2

     +-----------------------------+
     | age       delta          pc |
     |-----------------------------|
  1. |   0    .6769593   .10674275 |
  2. |   1   .06508653   .01026283 |
     +-----------------------------+

So 10.7% of the difference in life expectancy between blacks and whites can be attributed to infant mortality, and another one percent is due to mortality between ages 1 and 5, for a total of 11.7% of the difference due to child mortality under age 5.
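The gen delta command packs two terms into one expression: a direct effect from the difference in years lived within the age group, and an indirect effect from the extra survivors who go on to live at later ages. The Python sketch below separates the two, using the numbers entered above; the function and variable names are mine.

```python
# Life table columns at ages 0, 1 and 5 for whites (w) and blacks (b)
lw = [100000, 99358, 99234]
Lw = [99439, 397142, 495971]
Tw = [7510719, 7411080, 7013938]
lb = [100000, 98461, 98249]
Lb = [98650, 393347, 490958]
Tb = [6876522, 6777872, 6384525]
radix = 100000


def contribution(i):
    """Contribution of age group i to the white-black gap in e0."""
    # Direct effect: difference in years lived within the age group
    direct = (lb[i] / radix) * (Lw[i] / lw[i] - Lb[i] / lb[i])
    # Indirect effect: extra survivors living at the ages that follow
    indirect = (Tw[i + 1] / radix) * (lb[i] / lw[i] - lb[i + 1] / lw[i + 1])
    return direct, indirect


gap = (Tw[0] - Tb[0]) / radix
delta = [sum(contribution(i)) for i in range(2)]
share = [d / gap for d in delta]

for age, d, s in zip([0, 1], delta, share):
    print(f"age {age}: delta = {d:.4f} years, {s:.1%} of the gap")
```

For infants almost all of the contribution is indirect, which is what one would expect given how few person-years are lived at age zero.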

It's interesting to do this for all ages, using either single-year or abridged life tables. The contributions from adult ages are larger than I would have guessed. (This graph, of course, was not required.)