Rates and Standardization (Revised)
We will work through the example in Preston et al, sections 2.2 and 2.3. (This revised version of the handout de-emphasizes programming Stata and focuses on canned procedures.)
Sample Data
I copied the counts of mid-year population and deaths by Age for Sweden and Kazakhstan from Table 2.1 into a text file which is available in the course website. The file is in "long" format and can be read into Stata using
. infile str10 country str5 ageg pop deaths /// > using http://data.princeton.edu/eco572/datasets/preston21long.dat (38 observations read)
The first thing we do is calculate the age-specific rates, diviing deaths by population and multiplying by 1000:
. gen rates = 1000 * deaths / pop
Crude death rates are just a weighted average of age-specific rates
using the population in each age group as the weight. We can easily
compute them in Stata using the tabstat command:
. tabstat rates [fw=pop], by(country)
Summary for variables: rates
by categories of: country
country | mean
-----------+----------
Kazakhstan | 7.423042
Sweden | 10.54756
-----------+----------
Total | 8.470285
----------------------
The interesting result here is that mortality appears to be lower in Kazakhstan than in Sweden.
Standardized Rates
Following Preston et al., we will standardize the rates using the
unweighted average of the two population compositions as the standard.
To do this we first compute the percent distribution for each country
using egen, and then compute the average percent in each
age:
. egen pcpop = pc(pop), by(country) . egen avgcomp = mean(pcpop), by(ageg)
You may want to list the data to verify that avg has the
same values for the two countries.
Now we can compute the standardized rate in one line:
. tabstat rates [aw=avgcomp], by(country)
Summary for variables: rates
by categories of: country
country | mean
-----------+----------
Kazakhstan | 11.882
Sweden | 7.374094
-----------+----------
Total | 9.628045
----------------------
The only difference is that I specified aw, an "analytic"
weight, instead of fw a "frequency" weight, so Stata wouldn't
complaint about non-integer weights. Both compute means the same way:
multiply each observation by the weight, sum, and divide by the sum of
the weights.
Indirect Standardization
Frequently we don't have age-specific rates but can easily obtain the age distribution. We can still do a form of standardization applying the rates of one country (or any other standard) to the two age distributions.
Let us first create a variable that has the Swedish rates for both countries. We do this sorting by age and then country, and for each age we pick the rate of the second country:
. bysort ageg (country): gen swrates = rates[2]
The by command is a very powerful feature of Stata that
can repeat a command for subgroups. The data must be sorted, but if
you specify bysort Stata will take care of that. In this
case the computation is done by age, but putting
country in parentheses tells Stata to also sort by country
(so Sweden is always the second observation). We can now average the
Swedish rates using each country's population
. tabstat swrates [fw=pop], by(country)
Summary for variables: swrates
by categories of: country
country | mean
-----------+----------
Kazakhstan | 4.200627
Sweden | 10.54756
-----------+----------
Total | 6.327926
----------------------
So if Kazakhstan had Swedish rates the CDR would be 4.2. The ratio of this number to the actual CDR is known as the standardized mortality ratio (SMR), and turns out to be
. di 7.42/4.20 1.7666667
Once again we see that mortality is in fact higher in Kazakhstan than in Sweden.
Decomposition of Differences Between Rates
Preston et al. show how to decompose a difference between two rates into a part due to compositional differences and a part due to differences in rates. We are half-way there because we already computed (standardized) rates using the average composition for both countries. To complete the calculations we now compute the average rates
. egen avgrates = mean(rates), by(ageg)
and average those using the actual population counts:
. tabstat avgrates [fw=pop], by(country)
Summary for variables: avgrates
by categories of: country
country | mean
-----------+----------
Kazakhstan | 5.811834
Sweden | 13.44426
-----------+----------
Total | 8.36999
----------------------
We see that with the same rates, the older composition of Sweden would result in much higher mortality as measured by a crude rate. You can now verify the decomposition:
. display "Difference = " 10.55-7.42 /// > "; Composition = " 13.44-5.81 /// > "; Rates = " 7.37-11.88 Difference = 3.13; Composition = 7.63; Rates = -4.51
I wrote a little program that can do all these calculations
but it requires data in "wide" format. See ddrate
in the course Stata site. (In net-aware Stata type
net from http://data.princeton.edu/eco572/stata.)
Age-Patterns of Mortality
So far we have focused on attempts to summarize mortality
in a single number, but why not look at the complete set
of rates? I would like to plot the rates using the mid-points
of the ages groups (with 87.5 for 85+). To compute these I
first convert age from a string to numbers
(1 to 19) and then figure out the midpoint
. encode ageg, gen(agei) // "05-9" becomes 3 . gen agem = 5*agei - 7.5 // 3 becomes 7.5 . replace agem = 0.5 if ageg == "0" (2 real changes made) . replace agem = 3 if ageg == "01-4" (2 real changes made)
We now plot the rates using a log-scale (otherwise we don't see much)
. twoway (line rate agem if country=="Sweden") ///
> (line rate agem if country=="Kazakhstan") ///
> , yscale(log) ylabel(50 100) ///
> title("Age-Specific Mortality Rates") ///
> xtitle("Age") ytitle("log(ASMR)") ///
> legend(order(1 "Sweden" 2 "Kazakhstan") ring(0) pos(5))
. graph export swkzasmr.png, replace
(note: file swkzasmr.png not found)
(file swkzasmr.png written in PNG format)

The Stata Standardization Commands
As noted earlier Stata has commands for direct and
indirect standardization. By default, dstdize
uses a weighted average of the two populations as
the standard. Unfortunately it is hard to trick it into
using an unweighted average as we did here, because it
insists on having integer population counts. So we will
use a weighted average.
. dstdize deaths pop ageg, by(country)
----------------------------------------------------------
-> country= Kazakhstan
-----Unadjusted----- Std.
Pop. Stratum Pop.
Stratum Pop. Cases Dist. Rate[s] Dst[P] s*P
----------------------------------------------------------
0 174078 3720 0.020 0.0214 0.018 0.0004
01-4 754758 1220 0.087 0.0016 0.075 0.0001
05-9 879129 396 0.101 0.0005 0.086 0.0000
10-14 808510 298 0.093 0.0004 0.080 0.0000
15-19 720161 561 0.083 0.0008 0.075 0.0001
20-24 622988 673 0.072 0.0011 0.070 0.0001
25-29 733057 752 0.084 0.0010 0.080 0.0001
30-34 732312 965 0.084 0.0013 0.077 0.0001
35-39 612825 1113 0.070 0.0018 0.069 0.0001
40-44 487996 1405 0.056 0.0029 0.061 0.0002
45-49 284799 1226 0.033 0.0043 0.046 0.0002
50-54 503608 2878 0.058 0.0057 0.057 0.0003
55-59 301879 3266 0.035 0.0108 0.039 0.0004
60-64 374317 5212 0.043 0.0139 0.045 0.0006
65-69 256247 6866 0.029 0.0268 0.037 0.0010
70-74 154623 6182 0.018 0.0400 0.029 0.0012
75-79 149917 8199 0.017 0.0547 0.026 0.0014
80-84 88716 9013 0.010 0.1016 0.018 0.0018
85+ 58940 10627 0.007 0.1803 0.013 0.0023
----------------------------------------------------------
Totals: 8698860 64572 Adjusted Cases: 90573.0
Crude Rate: 0.0074
Adjusted Rate: 0.0104
95% Conf. Interval: [0.0103, 0.0105]
----------------------------------------------------------
-> country= Sweden
-----Unadjusted----- Std.
Pop. Stratum Pop.
Stratum Pop. Cases Dist. Rate[s] Dst[P] s*P
----------------------------------------------------------
0 59727 279 0.014 0.0047 0.018 0.0001
01-4 229775 42 0.052 0.0002 0.075 0.0000
05-9 245172 31 0.056 0.0001 0.086 0.0000
10-14 240110 33 0.055 0.0001 0.080 0.0000
15-19 264957 61 0.060 0.0002 0.075 0.0000
20-24 287176 87 0.065 0.0003 0.070 0.0000
25-29 311111 98 0.071 0.0003 0.080 0.0000
30-34 280991 140 0.064 0.0005 0.077 0.0000
35-39 286899 197 0.065 0.0007 0.069 0.0000
40-44 308238 362 0.070 0.0012 0.061 0.0001
45-49 320172 643 0.073 0.0020 0.046 0.0001
50-54 242230 738 0.055 0.0030 0.057 0.0002
55-59 210785 972 0.048 0.0046 0.039 0.0002
60-64 216058 1640 0.049 0.0076 0.045 0.0003
65-69 224479 2752 0.051 0.0123 0.037 0.0005
70-74 222578 4509 0.051 0.0203 0.029 0.0006
75-79 184102 6745 0.042 0.0366 0.026 0.0009
80-84 140667 9587 0.032 0.0682 0.018 0.0012
85+ 110242 17340 0.025 0.1573 0.013 0.0020
----------------------------------------------------------
Totals: 4385469 46256 Adjusted Cases: 27750.9
Crude Rate: 0.0105
Adjusted Rate: 0.0063
95% Conf. Interval: [0.0063, 0.0064]
Summary of Study Populations:
country N Crude Adj_Rate Confidence Interval
--------------------------------------------------------------------------
Kazakhsta 8698860 0.007423 0.010412 [ 0.010333, 0.010491]
Sweden 4385469 0.010548 0.006328 [ 0.006270, 0.006385]
The output is quite detailed, but you should pay attention
to the bottom line:
it confirms the crude rates of 7.42 and 10.5,
and gets new standardized rates of 10.4 and 6.3.
(You should make sure you can reproduce these using
tabstat.)
The Modern Regression Approach
The aim of standardization is to control for a compositional variable. This, of course, is also one of the main aims of regression analysis. I now show how one could analyze the data using Poisson regression, a method appropriate for count data.
This regression technique assumes that the number of deaths has mean (and variance) given by the product of a rate and exposure time. The log of the mean is then the log of the rate (which is modelled using a linear predictor just as in linear regression) times the log of exposure (which is defined as an offset, or fixed part of the linear predictor). That's why the first thing we do below is compute the log of our mid-year population counts.
We also want the rates to depend on age and country,
so we define appropriate dummy variables and run the model.
(You could use tab ageg, gen(age) to
generate dummy variables for age but they are called
age1, age2,....
You could also use the xi: prefix and
specify i.ageg to have Stata generate dummies
"on the fly", but the generated names are even worse.
With a bit of work we can get more meaningful names.)
. gen logexpo = log(pop)
. gen kz = country == "Kazakhstan"
. gen age1to4 = ageg == "01-4"
. gen age5to9 = ageg == "05-9"
. forvalues bot=10(5)80 {
2. local top = `bot' + 4
3. gen age`bot'to`top' = ageg == "`bot'-`top'"
4. }
. gen age85p = ageg == "85+"
. poisson deaths age1to4-age85p kz, offset(logexpo) nolog
Poisson regression Number of obs = 38
LR chi2(19) = 315966.62
Prob > chi2 = 0.0000
Log likelihood = -1274.853 Pseudo R2 = 0.9920
------------------------------------------------------------------------------
deaths | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age1to4 | -2.600295 .0322872 -80.54 0.000 -2.663577 -2.537013
age5to9 | -3.823097 .0509118 -75.09 0.000 -3.922882 -3.723311
age10to14 | -4.003534 .0571946 -70.00 0.000 -4.115633 -3.891434
age15to19 | -3.29342 .043102 -76.41 0.000 -3.377899 -3.208942
age20to24 | -2.99395 .0395719 -75.66 0.000 -3.071509 -2.91639
age25to29 | -3.02696 .03777 -80.14 0.000 -3.100988 -2.952932
age30to34 | -2.743422 .033986 -80.72 0.000 -2.810034 -2.676811
age35to39 | -2.436497 .0318358 -76.53 0.000 -2.498894 -2.3741
age40to44 | -1.985014 .0285733 -69.47 0.000 -2.041017 -1.929012
age45to49 | -1.588591 .0280586 -56.62 0.000 -1.643584 -1.533597
age50to54 | -1.231015 .0229505 -53.64 0.000 -1.275997 -1.186032
age55to59 | -.6591381 .0220602 -29.88 0.000 -.7023753 -.6159008
age60to64 | -.3400267 .0199076 -17.08 0.000 -.3790448 -.3010086
age65to69 | .2502514 .0188483 13.28 0.000 .2133094 .2871935
age70to74 | .6572624 .0186292 35.28 0.000 .6207498 .693775
age75to79 | 1.094818 .0178766 61.24 0.000 1.05978 1.129855
age80to84 | 1.719851 .017546 98.02 0.000 1.685462 1.754241
age85p | 2.451387 .0170575 143.71 0.000 2.417955 2.484819
kz | .4794218 .0062738 76.42 0.000 .4671254 .4917181
_cons | -4.445508 .0166385 -267.18 0.000 -4.478119 -4.412897
logexpo | (offset)
------------------------------------------------------------------------------
. di exp( _b[kz] )
1.609837
The coefficient of kz tells us that the log
of the rates is 0.479 points higher in Kazahstan than in
Sweden. Exponentiating we find that rates in Kazahstan
are 61.5 percent higher than in Sweden.
Direct standardization found them 61.2 percent higher
(11.88/7.37 = 1.612), so the two approaches give very
similar results for this dataset.
The age coefficients show an age pattern of mortality
which is not unlike the average rates we computed earlier.
An advantage of the regression approach is that one can easily control for multiple confounders. Also, we can test for the presence of an interaction, which would question the validity of the additive model underlying direct standardization. In this case adding the interaction would make the model saturated for this data, so we just test goodness of fit:
. estat gof
Goodness-of-fit chi2 = 2218.456
Prob > chi2(18) = 0.0000
The chi-squared of more than two thousand on 18 d.f. is highly significant, so the deaths rates are not proportional. With data based on national counts, however, even small differences can be significant. This interaction, while significant, represents less than five percent of the combined effect of age and country:
. di r(chi2)/(r(chi2) + 315966.6) .04453829
I could have typed in 2218.5, but the test stores the
chi2 value in r(chi2), and using that is
more precise and less error prone.
If the discussion in this last section doesn't make much sense to you don't worry. Just make a note to come back and re-read this section after you take wws509.
