Eco572: Research Methods in Demography
(a) The U.S. Census Bureau (http://www.census.gov/ipc/www/popclockworld.html) estimates the world population at the time this problem set was due, 3/1/06, as 6,500,609,361. (For an up-to-the-second answer see http://opr.princeton.edu/popclock).
The same page shows the population going from 6,451,058,790 to 6,525,486,603 between 7/1/05 and 7/1/06, for an annual growth rate of
. scalar r = log(6525486603/6451058790)
. display r
.01147125
(Can you figure out how they interpolate between these numbers to get monthly figures or run the population clock?)
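One way to explore the question in parentheses is to interpolate between the two mid-year figures and compare the result with the 3/1/06 estimate quoted in (a). The sketch below tries both a linear and an exponential path by days elapsed. The scalar names (`pa`, `pb`, `f`, `plin`, `pexp`) are made up for this illustration, and this is only a guess at the method, not a description of what the Census Bureau actually does.

```stata
* Interpolating the world population to 3/1/2006 (illustrative sketch)
scalar pa = 6451058790                  // estimate for 7/1/2005
scalar pb = 6525486603                  // estimate for 7/1/2006
* fraction of the year elapsed by 3/1/2006, counted in days
scalar f = (mdy(3,1,2006) - mdy(7,1,2005)) / (mdy(7,1,2006) - mdy(7,1,2005))
scalar plin = pa + f * (pb - pa)        // linear path: constant daily increment
scalar pexp = pa * exp(f * log(pb/pa))  // exponential path: constant growth rate
display %15.0fc plin
display %15.0fc pexp
```

The linear path rounds to 6,500,609,361, the figure quoted in (a), while the exponential path comes out somewhat lower, which suggests that the intermediate figures are linear interpolations.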
(b) The time it takes to reach 12 billion from now is
. scalar p0 = 6500609361
. scalar t = log(12000000000/p0)/r
. display t
53.438856
and the date is
. scalar d = date("3/1/2006","mdy") + t * 365.25
. display month(d) "/" day(d) "/" year(d)
8/8/2059
(c) To postpone reaching 12 billion until 12/31/2100 we would need an average rate of
. scalar y = (date("12/31/2100","mdy")-date("3/1/2006","mdy"))/365.25
. scalar ar = log(12000000000/p0)/y
. display ar
.00646406
If the growth rate were to decline linearly from its present value of r, it would have to be
. display 2 * ar - r
.00145687
by the time we reach 12 billion to meet the average. In other words, we would need to reach almost zero population growth by the end of the century.
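As a check on this last step: if the rate falls linearly from r to some final value, its average over the period is the midpoint of the two, so the final value must be 2*ar - r. The sketch below recomputes the quantities from (a) and (c) and confirms that such a path lands on 12 billion at the end of 2100. It builds the dates with `mdy()` rather than `date()`, and `rend` and `pend` are names introduced here for illustration.

```stata
* Verify that a linear decline from r to 2*ar - r reaches 12 billion (sketch)
scalar p0 = 6500609361                                // population on 3/1/2006
scalar r  = log(6525486603/6451058790)                // current annual growth rate
scalar y  = (mdy(12,31,2100) - mdy(3,1,2006))/365.25  // years available
scalar ar = log(12000000000/p0)/y                     // required average rate
scalar rend = 2*ar - r                                // final rate under linear decline
* the area under a linear rate path is the period length times the midpoint rate
scalar pend = p0 * exp(y * (r + rend)/2)
display %8.6f rend
display %15.0fc pend
```

The second line of output should read 12,000,000,000, because the midpoint of r and `rend` is `ar` by construction.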
I cut and pasted the data, put quotes around the age group labels, and then read it into Stata:
. clear
. input str5 ageg urbann urbanuse ruraln ruraluse
ageg urbann urbanuse ruraln ruraluse
1. "15-19" 298 37 1451 88
2. "20-24" 332 116 1020 281
3. "25-29" 267 132 926 370
4. "30-34" 185 103 714 299
5. "35-39" 162 73 782 343
6. "40-44" 69 37 556 213
7. "45-49" 68 17 452 119
8. end
The calculations are a lot easier if we stack rural below urban
(which you can do by hand or using a reshape command)
as you can then follow the handout using tabstat.
(a) I reshape, compute the prevalence rates, and average them using the actual n's to get crude rates.
. quietly reshape long @n @use, i(ageg) j(tpr) string
. gen prev = use/n
. tabstat prev [fw=n], by(tpr)
Summary for variables: prev
by categories of: tpr
tpr | mean
------+----------
rural | .2902898
urban | .3729182
------+----------
Total | .3059599
-----------------
The urban prevalence rate is much higher than the rural one, 37.3 versus 29.0%. To get standardized rates I get the urban and rural compositions, average them, and use that as the weight:
. egen comp = pc(n), by(tpr)
. egen avgcomp = mean(comp), by(ageg)
. tabstat prev [aw=avgcomp], by(tpr)
Summary for variables: prev
by categories of: tpr
tpr | mean
------+----------
rural | .2932567
urban | .3690886
------+----------
Total | .3311726
-----------------
The rural sample is younger than the urban, but that accounts for only a small part of the difference, as the standardized rates are 36.9 and 29.3%. We can also weight by the sum of the two compositions; because that is just twice their average, the standardized rates are identical:
. egen totcomp = sum(comp), by (ageg)
. tabstat prev [w=totcomp], by(tpr)
(analytic weights assumed)
Summary for variables: prev
by categories of: tpr
tpr | mean
------+----------
rural | .2932567
urban | .3690886
------+----------
Total | .3311726
-----------------
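The weights used so far give the urban and rural compositions an equal say. An alternative standard is the pooled age distribution of all women in the sample, urban and rural combined. The sketch below is self-contained: it re-enters the counts in long form and weights by the pooled number of women in each age group. The variable `totn` is a name introduced here, and `egen`'s `total()` function is assumed to be available (older releases call it `sum()`).

```stata
* Standardizing on the pooled age distribution (illustrative sketch)
clear
input str5 tpr str5 ageg n use
"urban" "15-19"  298  37
"urban" "20-24"  332 116
"urban" "25-29"  267 132
"urban" "30-34"  185 103
"urban" "35-39"  162  73
"urban" "40-44"   69  37
"urban" "45-49"   68  17
"rural" "15-19" 1451  88
"rural" "20-24" 1020 281
"rural" "25-29"  926 370
"rural" "30-34"  714 299
"rural" "35-39"  782 343
"rural" "40-44"  556 213
"rural" "45-49"  452 119
end
gen prev = use/n
* pooled standard: total women (urban + rural) in each age group
egen totn = total(n), by(ageg)
tabstat prev [aw=totn], by(tpr)
```

With these weights the standardized rates come out at about 36.7% urban and 29.1% rural, a gap of about 7.5 percentage points, so the choice of standard makes little difference here.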
(b) I now compute the average prevalence rates and average these using the observed numbers of women to see how much difference the age composition makes
. egen avgprev = mean(prev), by(ageg)
. tabstat avgprev [fw=n], by(tpr)
Summary for variables: avgprev
by categories of: tpr
tpr | mean
------+----------
rural | .3277744
urban | .3345709
------+----------
Total | .3290633
-----------------
If we made urban women younger the prevalence rate would go down only a little bit. Here's the final decomposition 'by hand'
. display _newline "Difference= " .3729182 - .2902898 ///
> "; Same Composition = " .3690886 - .2932567 ///
> "; Same Rates = " .3345709 - .3277744

Difference= .0826284; Same Composition = .0758319; Same Rates = .0067965
So 92% of the difference is due to differences in rates and only 8% to differences in age structure.
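The same decomposition can be computed without copying numbers from the tables. The sketch below assumes the wide layout of the original input and uses the average composition and the average rates as standards, as above. The names `pu`, `pr`, `cu`, `cr`, `rterm`, `cterm`, `rate_eff` and `comp_eff` are introduced here for illustration.

```stata
* Decomposing the urban-rural difference in crude prevalence (sketch)
clear
input str5 ageg urbann urbanuse ruraln ruraluse
"15-19" 298  37 1451  88
"20-24" 332 116 1020 281
"25-29" 267 132  926 370
"30-34" 185 103  714 299
"35-39" 162  73  782 343
"40-44"  69  37  556 213
"45-49"  68  17  452 119
end
* age-specific prevalence and proportionate age composition in each area
gen pu = urbanuse/urbann
gen pr = ruraluse/ruraln
egen cu = pc(urbann), prop
egen cr = pc(ruraln), prop
* rate effect: differences in rates weighted by the average composition
gen rterm = (pu - pr) * (cu + cr)/2
* composition effect: differences in composition weighted by the average rates
gen cterm = (cu - cr) * (pu + pr)/2
quietly summarize rterm
scalar rate_eff = r(sum)
quietly summarize cterm
scalar comp_eff = r(sum)
display "Rates = " %8.6f rate_eff "; Composition = " %8.6f comp_eff ///
    "; Total = " %8.6f rate_eff + comp_eff
```

The two components should reproduce the .0758 and .0068 shown above. They add up to the crude difference of .0826 exactly, which holds for any two populations when the averages are used as standards.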
(c) The question here is whether, to compare urban and rural areas, we need all seven age-specific prevalence rates for each region or can do with a summary. As it happens, contraceptive use is higher in urban than rural areas in every age group except 45-49, averaging about 7.5 percentage points higher, albeit with variations from age to age. This is very similar to the difference we get from the standardized prevalence rates, so reporting that number is not misleading; it provides a useful summary while controlling for age structure.
We read the data for Ghana 1979-80 from the course website
. clear
. infile age n using ///
> http://data.princeton.edu/eco572/datasets/ghhhpop.dat
(97 observations read)
(a) We are told that age is top coded at 95 and that 99 means not stated, so we only have information in single years up to 94. That means we can only go up to 89 to have each digit appear the same number of times. In order to leave room for blending, we need to work with the range 0 to 79:
. myers age [fw=n], range(0 79)
Last digit | Freq. Percent Cum.
------------+-----------------------------------
0 | 36,759 15.13 15.13
1 | 19,208 7.91 23.04
2 | 25,324 10.42 33.46
3 | 20,600 8.48 41.94
4 | 21,470 8.84 50.77
5 | 29,564 12.17 62.94
6 | 23,852 9.82 72.76
7 | 19,308 7.95 80.71
8 | 24,837 10.22 90.93
9 | 22,040 9.07 100.00
------------+-----------------------------------
Total | 242,962 100.00
Myers' Blended Index = 7.9432998
We see substantial preference for ages ending in 0 and 5. The value of the index means that we would need to reshuffle almost 8% of the observations to obtain the expected 10% in each digit for the blended population.
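The index itself is simple to recover from the blended counts: it is half the sum of the absolute deviations of the digit percentages from 10. A short sketch that re-enters the frequencies from the table above (the variable names are illustrative, and the blending itself is left to the `myers` command):

```stata
* Myers' index from the blended digit counts (illustrative sketch)
clear
input digit freq
0 36759
1 19208
2 25324
3 20600
4 21470
5 29564
6 23852
7 19308
8 24837
9 22040
end
egen tot = total(freq)
* absolute deviation of each digit's percentage from the expected 10%
gen dev = abs(100*freq/tot - 10)
quietly summarize dev
display "Myers' index = " %6.3f r(sum)/2
```

This returns 7.943, matching the value reported by the command.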
(b) For smoothing we can go all the way to 94, but you should
still exclude 95 and 99. A good way to start is to plot the
actual counts versus age and then try lowess
with different bandwidths. You should find that the default
over-smooths, but a bandwidth in the vicinity of 0.25 seems
to do a reasonable job:
. lowess n age if age < 95, bwidth(.25) gen(lowess)
An alternative is a cubic spline. I thought I would try only two internal knots, one at 20 and one at 50, just because there is more going on at young ages. Here's my fit and a plot that compares the two smoothers:
. bspline if age < 95, xvar(age) knots(0 20 50 100) p(3) gen(_bs)
(2 missing values generated)
. quietly regress n _bs*, noconstant
. predict spline
(option xb assumed; fitted values)
(2 missing values generated)
. twoway (scatter n age) (line lowess age, lp(dash)) ///
> (line spline age) , ///
> title("Ghana 1979-80") subtitle(WFS Household Survey) ///
> legend(order(2 "Lowess" 3 "Spline") ring(0) pos(2) cols(1))
. graph export ps1fig1.png, replace
(file ps1fig1.png written in PNG format)

There's very little to choose between these two. I think the spline is a bit better but then, isn't it always?
(c) We can now generate the last digit and tabulate with the smooth and observed frequencies, saving the results in a couple of matrices
. gen lastdigit = mod(age,10) if age < 95
(2 missing values generated)
. tab lastdigit [w=n], matcell(obs)
(frequency weights assumed)
lastdigit | Freq. Percent Cum.
------------+-----------------------------------
0 | 4,682 16.17 16.17
1 | 2,654 9.17 25.34
2 | 3,259 11.26 36.60
3 | 2,720 9.40 45.99
4 | 2,659 9.19 55.18
5 | 3,381 11.68 66.86
6 | 2,686 9.28 76.14
7 | 2,116 7.31 83.45
8 | 2,580 8.91 92.36
9 | 2,212 7.64 100.00
------------+-----------------------------------
Total | 28,949 100.00
. tab lastdigit [aw=spline], matcell(spline)
lastdigit | Freq. Percent Cum.
------------+-----------------------------------
0 | 11.0801683 11.66 11.66
1 | 10.7378799 11.30 22.97
2 | 10.3957421 10.94 33.91
3 | 10.0552701 10.58 44.49
4 | 9.7180278 10.23 54.72
5 | 9.2876905 9.78 64.50
6 | 8.93940788 9.41 73.91
7 | 8.59612969 9.05 82.96
8 | 8.25928994 8.69 91.65
9 | 7.93039371 8.35 100.00
------------+-----------------------------------
Total | 95 100.00
I had to use analytic weights the second time because the smooth frequencies are fractional. To compare the two vectors of percents I proceed just as in the handout, with a one-line call to Mata to compute half the sum of absolute differences:
. matrix diff = obs/28949 - spline/95
. mata: sum(abs(st_matrix("diff")))/2
.0694565698
So we would need to reshuffle almost 7% of the observations. As you can see the Myers index does a reasonable job.
Let us read the data from South Africa. (I just added another
column to the dataset, with AIDS deaths, hence my infile
may differ from yours.) I will also compute the age-specific rates.
. clear
. infile ageg expo deaths aids using ///
> http://data.princeton.edu/eco572/datasets/safed.dat
(18 observations read)
. set type double // all generates will be double precision
. gen m = deaths/expo
(a) The big question here concerns the choice of "separation" factors. You might be tempted to borrow factors from Austria or the textbook, but it is not clear why those values should apply to South Africa.
Fortunately, it is only the very young ages that matter. I will use 2.5 everywhere, and will "make up" values for the first two age groups using the Coale-Demeny formula for West female model life tables, just as in the handout.
. gen a = 2.5
. replace a = cond(m[1] >= 0.107, .350, .053 + 2.800 * m[1]) in 1
(1 real change made)
. replace a = cond(m[1] >= 0.107, 1.361, 1.522 - 1.518 * m[1]) in 2
(1 real change made)
. list age a in 1/3
+------------------+
| ageg a |
|------------------|
1. | 0 .2003725 |
2. | 1 1.4421031 |
3. | 5 2.5 |
+------------------+
The blackboard site for last year's class has a calculated life table that used 1.5 for ages 1-4 and 0.159 for ages 0-1. I don't know the origin of the 0.159, but considering that South Africa has pretty high infant mortality it might be a better estimate than 0.200.
The next step is to set the width of the age groups, estimate nqx, and compute the remaining life table functions. The last (open-ended) age group requires special treatment, as usual.
. gen n = age[_n+1]-age // width of age intervals
(1 missing value generated)
. gen q = n * m/(1+(n-a)*m)
(1 missing value generated)
. replace q = 1 in -1
(1 real change made)
. gen p = 1-q
. gen lx = 100000 in 1
(17 missing values generated)
. quietly replace lx = lx[_n-1] * p[_n-1] in 2/-1
. gen d = lx - lx[_n+1]
(1 missing value generated)
. replace d = lx in -1
(1 real change made)
. gen L = n * lx[_n+1] + a*d
(1 missing value generated)
. replace L = lx/m in -1
(1 real change made)
. quietly summarize L
. gen T = r(sum) - sum(L) + L
. gen e = T/lx
For printing I will use the same formats as in the handout
. format %6.3f a e
. format %8.6f m q
. format %9.0fc lx d L T
. list age m a q lx d L T e
+--------------------------------------------------------------------------------------+
| ageg m a q lx d L T e |
|--------------------------------------------------------------------------------------|
1. | 0 0.052633 0.200 0.050507 100,000 5,051 95,961 5,677,861 56.779 |
2. | 1 0.014319 1.442 0.055251 94,949 5,246 366,378 5,581,900 58.788 |
3. | 5 0.001508 2.500 0.007514 89,703 674 446,831 5,215,522 58.142 |
4. | 10 0.000735 2.500 0.003670 89,029 327 444,329 4,768,691 53.563 |
5. | 15 0.002350 2.500 0.011683 88,702 1,036 440,921 4,324,362 48.751 |
|--------------------------------------------------------------------------------------|
6. | 20 0.006380 2.500 0.031401 87,666 2,753 431,448 3,883,441 44.298 |
7. | 25 0.011084 2.500 0.053927 84,913 4,579 413,119 3,451,992 40.653 |
8. | 30 0.013967 2.500 0.067478 80,334 5,421 388,119 3,038,874 37.828 |
9. | 35 0.014992 2.500 0.072250 74,913 5,412 361,036 2,650,755 35.384 |
10. | 40 0.013555 2.500 0.065552 69,501 4,556 336,115 2,289,719 32.945 |
|--------------------------------------------------------------------------------------|
11. | 45 0.010716 2.500 0.052180 64,945 3,389 316,253 1,953,605 30.081 |
12. | 50 0.010066 2.500 0.049093 61,556 3,022 300,226 1,637,352 26.599 |
13. | 55 0.011780 2.500 0.057215 58,534 3,349 284,298 1,337,126 22.844 |
14. | 60 0.016379 2.500 0.078673 55,185 4,342 265,072 1,052,828 19.078 |
15. | 65 0.025031 2.500 0.117783 50,844 5,988 239,247 787,756 15.494 |
|--------------------------------------------------------------------------------------|
16. | 70 0.045207 2.500 0.203082 44,855 9,109 201,502 548,510 12.228 |
17. | 75 0.070327 2.500 0.299054 35,746 10,690 152,004 347,008 9.708 |
18. | 80 0.128490 2.500 1.000000 25,056 25,056 195,003 195,003 7.783 |
+--------------------------------------------------------------------------------------+
We get a life expectancy of only 56.8 years. We will look further into this result in the next problem set.
(b) If we assume that the hazard is constant in each age group we can estimate it with the observed rate and convert to a probability using nqx = 1-exp(-n nmx):
. gen q2 = 1 - exp(-n*m)
(1 missing value generated)
To see how much difference this makes I compute absolute and relative differences, summarize them, and list noteworthy ages
. gen absd = abs(q-q2) if q < 1
(1 missing value generated)
. gen reld = absd/q
(1 missing value generated)
. sum absd reld
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
absd | 17 .0002864 .0006474 4.13e-09 .0025926
reld | 17 .0023034 .0042379 1.12e-06 .0151376
. list age q q2 absd reld if absd > 0.0005 | reld > .005
+-----------------------------------------------------+
| ageg q q2 absd reld |
|-----------------------------------------------------|
1. | 0 0.050507 .0512719 .00076456 .0151376 |
2. | 1 0.055251 .05566552 .00041429 .00749823 |
16. | 70 0.203082 .20230828 .00077322 .00380746 |
17. | 75 0.299054 .29646151 .00259258 .00866928 |
18. | 80 1.000000 . . . |
+-----------------------------------------------------+
The method only makes noticeable differences at the extremes, in particular age 0-1, where the relative difference is 1.5%, and to a lesser extent 75-79, where it is almost one percent.
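To see the two conversions side by side for a single age group, the sketch below applies both to ages 75-79, using the rate as printed in the life table (rounded to six decimals, so the results agree with the listing only to about five decimals). The scalar names are illustrative.

```stata
* Two ways of converting a rate into a probability, ages 75-79 (sketch)
scalar nx = 5           // width of the age group
scalar ax = 2.5         // assumed mean years lived in the interval by those who die
scalar mx = 0.070327    // observed death rate, as printed in the life table
scalar q1 = nx*mx/(1 + (nx - ax)*mx)   // standard formula based on ax
scalar q2 = 1 - exp(-nx*mx)            // constant hazard within the age group
display %8.6f q1 "  " %8.6f q2 "  " %8.6f (q1 - q2)/q1
```

The constant-hazard version gives the smaller probability here, and the relative difference is a little under one percent.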
(c) The infant mortality "rate" is a bit of a misnomer because demographers usually mean 1q0, which we estimated to be 50.5 per thousand, so roughly five percent of South African girls born around 1998 died before their first birthday. Note that there is some uncertainty on this estimate due to the fact that we don't know the average age at death for infant deaths.
(The other common definition of infant mortality rate is deaths under age one in a calendar year divided by births in that year. This is also trying to approximate the probability of dying between birth and age one.)
(d) The probability of surviving from ages 20 to 40 is easily obtained from rows 6 and 10 in the survival function:
. di l[10]/l[6]
.79279105
The probability is 79.3%, which is remarkably low. At current mortality levels, one in five 20-year olds will die before age 40.
(e) The expectation of life at age one can be higher than at birth when infant mortality is very high relative to mortality after age one. It is easy to see how this can happen in a heterogeneous population, where the frail are more likely to die early, but it can also occur in a homogeneous population. The key is to remember that e1 is a mean conditional on having survived to age one.
Life expectancy at birth is the same as at age one (so children don't "age" in the first year) when the infant mortality rate equals the proportion of person-years spent at age zero. To prove this result, recall that ex = Tx/lx. Setting e0 = e1 gives T0/l0 = T1/l1, or l1/l0 = T1/T0, so the probability of surviving to age one must equal the proportion of time spent after age one to balance things out. Subtracting both sides from one gives 1q0 = 1 - T1/T0 = L0/T0.
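The condition can be checked numerically with the South African life table above: e1 exceeds e0 exactly when 1q0 exceeds L0/T0, the share of person-years lived at age zero. The sketch below uses the rounded values printed in rows 1 and 2 of that table; the scalar names are illustrative.

```stata
* When does life expectancy rise between birth and age one? (sketch)
scalar l0 = 100000
scalar l1 = 94949
scalar T0 = 5677861
scalar T1 = 5581900
scalar q0    = 1 - l1/l0    // probability of dying before age one
scalar share = 1 - T1/T0    // L0/T0: share of person-years lived at age zero
display "1q0 = " %6.4f q0 "   L0/T0 = " %6.4f share
display "e0  = " %6.3f T0/l0 "   e1    = " %6.3f T1/l1
```

Here 1q0 is about 5.1% while only about 1.7% of person-years are lived at age zero, so e1 is higher than e0 by about two years.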
(a) The latest U.S. life table is for 2002 and is available on the NCHS website. The table is constructed using data on births and deaths by calendar year, combined with separation factors that attempt to distinguish infant deaths by cohort. See the NCHS documentation for further information. Our main interest is to discover the implied nax factors, so we can compare them with the Coale-Demeny estimate. The key quantities needed are
. clear
. input age lx deaths L
age lx deaths L
1. 0 100000 697 99389
2. 1 99303 122 396922
3. 5 99180 . .
4. end
We can then see the implied time lived by deaths in 0-1 and 1-5:
. di (L[1] - lx[2])/d[1]
.12338594
. di (L[2] - 4*lx[3])/d[2]
1.6557377
The Coale-Demeny factors for males can be estimated as
. scalar m = d[1]/L[1]
. di cond(m >= 0.107, .330, .045 + 2.684 * m)
.06382249
. di cond(m >= 0.107, 1.352, 1.651 - 2.816 * m)
1.6312518
As you can see, the values are quite different for age 0 but very similar for age 1-4.
(b) Social Security. The U.S. life tables actually used in this debate are the 2002 life tables from part (a), which are available by ethnicity. The relevant numbers are shown below. (I added lx to verify the life expectancies, but it is otherwise not needed.)
. clear
. set type double
. input age lw Tw lb Tb
age lw Tw lb Tb
1. 0 100000 7510719 100000 6876522
2. 20 98605 5528046 97368 4914598
3. 65 79874 1327634 65695 960629
4. end
. gen ew = Tw/lw
. gen eb = Tb/lb
. list age ew eb
+-----------------------------+
| age ew eb |
|-----------------------------|
1. | 0 75.10719 68.76522 |
2. | 20 56.062532 50.474468 |
3. | 65 16.621604 14.622559 |
+-----------------------------+
So all the statistics quoted in the debate are in fact correct. In particular, whites live only 13.7% longer than blacks conditional on reaching age 65, although unconditionally they spend on average 38% more years after age 65 (13.3 versus 9.6 years per person born).
Let us calculate the person-years spent in retirement (collecting) over those spent working (paying)
. display Tw[3]/(Tw[2]-Tw[3])
.31607233
. display Tb[3]/(Tb[2]-Tb[3])
.24295309
So whites, as a cohort, spend 0.316 years (or almost four months) in retirement for every year they spend working. The comparable ratio for blacks is 0.243 years (or less than three months). So, relative to the time spent working, whites spend 30% longer in retirement.
(c) To see how much of the difference in life expectancy is due to infant and child mortality we follow the procedure outlined in Preston et al., page 65. We need the little l, big L, and T columns for ages 0, 1, and 5:
. clear
. set type double
. input age lw Lw Tw lb Lb Tb
age lw Lw Tw lb Lb Tb
1. 0 100000 99439 7510719 100000 98650 6876522
2. 1 99358 397142 7411080 98461 393347 6777872
3. 5 99234 495971 7013938 98249 490958 6384525
4. end
. gen delta = (Lw/lw - Lb/lb) * lb/100000 + ///
> (lb/lw - lb[_n+1]/lw[_n+1]) * Tw[_n+1]/100000
(1 missing value generated)
. scalar diff = (Tw[1]-Tb[1])/100000
. gen pc = delta/diff
(1 missing value generated)
. list age delta pc in 1/2
+-----------------------------+
| age delta pc |
|-----------------------------|
1. | 0 .6769593 .10674275 |
2. | 1 .06508653 .01026283 |
+-----------------------------+
So 10.7% of the difference in life expectancy between blacks and whites can be attributed to infant mortality, and another one percent is due to mortality between ages 1 and 5, for a total of 11.7% due to mortality under age 5.

It's interesting to do this for all ages, using either single-year or abridged life tables. The contributions from adult ages are larger than I would have guessed. (This graph, of course, was not required.)
Copyright © 2006, Germán Rodríguez, Office of Population Research, Princeton University