## Rates and Standardization (Revised)

We will work through the example in Preston et al, sections 2.2 and 2.3. (This revised version of the handout de-emphasizes programming Stata and focuses on canned procedures.)

#### Sample Data

I copied the counts of mid-year population and deaths by Age for Sweden and Kazakhstan from Table 2.1 into a text file which is available in the course website. The file is in "long" format and can be read into Stata using

. infile str10 country str5 ageg pop deaths /// > using http://data.princeton.edu/eco572/datasets/preston21long.dat (38 observations read)

The first thing we do is calculate the age-specific rates, dividing deaths by population and multiplying by 1000:

. gen rates = 1000 * deaths / pop

Crude death rates are just a weighted average of age-specific rates
using the population in each age group as the weight. We can easily
compute them in Stata using the `tabstat`

command:

. tabstat rates [fw=pop], by(country) Summary for variables: rates by categories of: country country | mean -----------+---------- Kazakhstan | 7.423042 Sweden | 10.54756 -----------+---------- Total | 8.470285 ----------------------

The interesting result here is that mortality appears to be
*lower* in Kazakhstan than in Sweden.

#### Standardized Rates

Following Preston et al., we will standardize the rates using the
unweighted average of the two population compositions as the standard.
To do this we first compute the percent distribution for each country
using `egen`

, and then compute the average percent in each
age:

. egen pcpop = pc(pop), by(country) . egen avgcomp = mean(pcpop), by(ageg)

You may want to list the data to verify that `avg`

has the
same values for the two countries.
Now we can compute the standardized rate in one line:

. tabstat rates [aw=avgcomp], by(country) Summary for variables: rates by categories of: country country | mean -----------+---------- Kazakhstan | 11.882 Sweden | 7.374094 -----------+---------- Total | 9.628045 ----------------------

The only difference is that I specified `aw`

, an "analytic"
weight, instead of `fw`

a "frequency" weight, so Stata wouldn't
complaint about non-integer weights. Both compute means the same way:
multiply each observation by the weight, sum, and divide by the sum of
the weights.

#### Indirect Standardization

Frequently we don't have age-specific rates but can easily obtain the age distribution. We can still do a form of standardization applying the rates of one country (or any other standard) to the two age distributions.

Let us first create a variable that has the Swedish rates for both countries. We do this sorting by age and then country, and for each age we pick the rate of the second country:

. bysort ageg (country): gen swrates = rates[2]

The `by`

command is a very powerful feature of Stata that
can repeat a command for subgroups. The data must be sorted, but if
you specify `bysort`

Stata will take care of that. In this
case the computation is done by `age`

, but putting
`country`

in parentheses tells Stata to also sort by country
(so Sweden is always the second observation). We can now average the
Swedish rates using each country's population

. tabstat swrates [fw=pop], by(country) Summary for variables: swrates by categories of: country country | mean -----------+---------- Kazakhstan | 4.200627 Sweden | 10.54756 -----------+---------- Total | 6.327926 ----------------------

So if Kazakhstan had Swedish rates the CDR would be 4.2. The
ratio of this number to the actual CDR is known as the
*standardized mortality ratio (SMR)*, and turns out to be

. di 7.42/4.20 1.7666667

Once again we see that mortality is in fact higher in Kazakhstan than in Sweden.

#### Decomposition of Differences Between Rates

Preston et al. show how to decompose a difference between two rates into a part due to compositional differences and a part due to differences in rates. We are half-way there because we already computed (standardized) rates using the average composition for both countries. To complete the calculations we now compute the average rates

. egen avgrates = mean(rates), by(ageg)

and average those using the actual population counts:

. tabstat avgrates [fw=pop], by(country) Summary for variables: avgrates by categories of: country country | mean -----------+---------- Kazakhstan | 5.811834 Sweden | 13.44426 -----------+---------- Total | 8.36999 ----------------------

We see that with the same rates, the older composition of Sweden would result in much higher mortality as measured by a crude rate. You can now verify the decomposition:

. display "Difference = " 10.55-7.42 /// > "; Composition = " 13.44-5.81 /// > "; Rates = " 7.37-11.88 Difference = 3.13; Composition = 7.63; Rates = -4.51

I wrote a little program that can do all these calculations
but it requires data in "wide" format. See `ddrate`

in the course Stata site. (In net-aware Stata type
`net from http://data.princeton.edu/eco572/stata`

.)

#### Age-Patterns of Mortality

So far we have focused on attempts to summarize mortality
in a single number, but why not look at the complete set
of rates? I would like to plot the rates using the mid-points
of the ages groups (with 87.5 for 85+). To compute these I
first convert `age`

from a string to numbers
(1 to 19) and then figure out the midpoint

. encode ageg, gen(agei) // "05-9" becomes 3 . gen agem = 5*agei - 7.5 // 3 becomes 7.5 . replace agem = 0.5 if ageg == "0" (2 real changes made) . replace agem = 3 if ageg == "01-4" (2 real changes made)

We now plot the rates using a log-scale (otherwise we don't see much)

. twoway (line rate agem if country=="Sweden") /// > (line rate agem if country=="Kazakhstan") /// > , yscale(log) ylabel(50 100) /// > title("Age-Specific Mortality Rates") /// > xtitle("Age") ytitle("log(ASMR)") /// > legend(order(1 "Sweden" 2 "Kazakhstan") ring(0) pos(5)) . graph export swkzasmr.png, replace (note: file swkzasmr.png not found) (file swkzasmr.png written in PNG format)

*every*age. The lines are not exactly parallel, as the difference seems to become smaller at older ages. (We will discuss later how heterogeneity can produce this type of convergence.)

#### The Stata Standardization Commands

As noted earlier Stata has commands for direct and
indirect standardization. By default, ` dstdize`

uses a *weighted* average of the two populations as
the standard. Unfortunately it is hard to trick it into
using an unweighted average as we did here, because it
insists on having integer population counts. So we will
use a weighted average.

. dstdize deaths pop ageg, by(country) ---------------------------------------------------------- -> country= Kazakhstan -----Unadjusted----- Std. Pop. Stratum Pop. Stratum Pop. Cases Dist. Rate[s] Dst[P] s*P ---------------------------------------------------------- 0 174078 3720 0.020 0.0214 0.018 0.0004 01-4 754758 1220 0.087 0.0016 0.075 0.0001 05-9 879129 396 0.101 0.0005 0.086 0.0000 10-14 808510 298 0.093 0.0004 0.080 0.0000 15-19 720161 561 0.083 0.0008 0.075 0.0001 20-24 622988 673 0.072 0.0011 0.070 0.0001 25-29 733057 752 0.084 0.0010 0.080 0.0001 30-34 732312 965 0.084 0.0013 0.077 0.0001 35-39 612825 1113 0.070 0.0018 0.069 0.0001 40-44 487996 1405 0.056 0.0029 0.061 0.0002 45-49 284799 1226 0.033 0.0043 0.046 0.0002 50-54 503608 2878 0.058 0.0057 0.057 0.0003 55-59 301879 3266 0.035 0.0108 0.039 0.0004 60-64 374317 5212 0.043 0.0139 0.045 0.0006 65-69 256247 6866 0.029 0.0268 0.037 0.0010 70-74 154623 6182 0.018 0.0400 0.029 0.0012 75-79 149917 8199 0.017 0.0547 0.026 0.0014 80-84 88716 9013 0.010 0.1016 0.018 0.0018 85+ 58940 10627 0.007 0.1803 0.013 0.0023 ---------------------------------------------------------- Totals: 8698860 64572 Adjusted Cases: 90573.0 Crude Rate: 0.0074 Adjusted Rate: 0.0104 95% Conf. Interval: [0.0103, 0.0105] ---------------------------------------------------------- -> country= Sweden -----Unadjusted----- Std. Pop. Stratum Pop. Stratum Pop. Cases Dist. Rate[s] Dst[P] s*P ---------------------------------------------------------- 0 59727 279 0.014 0.0047 0.018 0.0001 01-4 229775 42 0.052 0.0002 0.075 0.0000 05-9 245172 31 0.056 0.0001 0.086 0.0000 10-14 240110 33 0.055 0.0001 0.080 0.0000 15-19 264957 61 0.060 0.0002 0.075 0.0000 20-24 287176 87 0.065 0.0003 0.070 0.0000 25-29 311111 98 0.071 0.0003 0.080 0.0000 30-34 280991 140 0.064 0.0005 0.077 0.0000 35-39 286899 197 0.065 0.0007 0.069 0.0000 40-44 308238 362 0.070 0.0012 0.061 0.0001 45-49 320172 643 0.073 0.0020 0.046 0.0001 50-54 242230 738 0.055 0.0030 0.057 0.0002 55-59 210785 972 0.048 0.0046 0.039 0.0002 60-64 216058 1640 0.049 0.0076 0.045 0.0003 65-69 224479 2752 0.051 0.0123 0.037 0.0005 70-74 222578 4509 0.051 0.0203 0.029 0.0006 75-79 184102 6745 0.042 0.0366 0.026 0.0009 80-84 140667 9587 0.032 0.0682 0.018 0.0012 85+ 110242 17340 0.025 0.1573 0.013 0.0020 ---------------------------------------------------------- Totals: 4385469 46256 Adjusted Cases: 27750.9 Crude Rate: 0.0105 Adjusted Rate: 0.0063 95% Conf. Interval: [0.0063, 0.0064] Summary of Study Populations: country N Crude Adj_Rate Confidence Interval -------------------------------------------------------------------------- Kazakhsta 8698860 0.007423 0.010412 [ 0.010333, 0.010491] Sweden 4385469 0.010548 0.006328 [ 0.006270, 0.006385]

The output is quite detailed, but you should pay attention
to the bottom line:
it confirms the crude rates of 7.42 and 10.5,
and gets new standardized rates of 10.4 and 6.3.
(You should make sure you can reproduce these using
`tabstat`

.)

#### The Modern Regression Approach

The aim of standardization is to control for a compositional variable. This, of course, is also one of the main aims of regression analysis. I now show how one could analyze the data using Poisson regression, a method appropriate for count data.

This regression technique assumes that the number of deaths has mean (and variance) given by the product of a rate and exposure time. The log of the mean is then the log of the rate (which is modelled using a linear predictor just as in linear regression) times the log of exposure (which is defined as an offset, or fixed part of the linear predictor). That's why the first thing we do below is compute the log of our mid-year population counts.

We also want the rates to depend on age and country,
so we define appropriate dummy variables and run the model.
(You could use `tab ageg, gen(age)`

to
generate dummy variables for age but they are called
`age1, age2,...`

.
You could also use the `xi:`

prefix and
specify `i.ageg`

to have Stata generate dummies
"on the fly", but the generated names are even worse.
With a bit of work we can get more meaningful names.)

. gen logexpo = log(pop) . gen kz = country == "Kazakhstan" . gen age1to4 = ageg == "01-4" . gen age5to9 = ageg == "05-9" . forvalues bot=10(5)80 { 2. local top = `bot' + 4 3. gen age`bot'to`top' = ageg == "`bot'-`top'" 4. } . gen age85p = ageg == "85+" . poisson deaths age1to4-age85p kz, offset(logexpo) nolog Poisson regression Number of obs = 38 LR chi2(19) = 315966.62 Prob > chi2 = 0.0000 Log likelihood = -1274.853 Pseudo R2 = 0.9920 ------------------------------------------------------------------------------ deaths | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- age1to4 | -2.600295 .0322872 -80.54 0.000 -2.663577 -2.537013 age5to9 | -3.823097 .0509118 -75.09 0.000 -3.922882 -3.723311 age10to14 | -4.003534 .0571946 -70.00 0.000 -4.115633 -3.891434 age15to19 | -3.29342 .043102 -76.41 0.000 -3.377899 -3.208942 age20to24 | -2.99395 .0395719 -75.66 0.000 -3.071509 -2.91639 age25to29 | -3.02696 .03777 -80.14 0.000 -3.100988 -2.952932 age30to34 | -2.743422 .033986 -80.72 0.000 -2.810034 -2.676811 age35to39 | -2.436497 .0318358 -76.53 0.000 -2.498894 -2.3741 age40to44 | -1.985014 .0285733 -69.47 0.000 -2.041017 -1.929012 age45to49 | -1.588591 .0280586 -56.62 0.000 -1.643584 -1.533597 age50to54 | -1.231015 .0229505 -53.64 0.000 -1.275997 -1.186032 age55to59 | -.6591381 .0220602 -29.88 0.000 -.7023753 -.6159008 age60to64 | -.3400267 .0199076 -17.08 0.000 -.3790448 -.3010086 age65to69 | .2502514 .0188483 13.28 0.000 .2133094 .2871935 age70to74 | .6572624 .0186292 35.28 0.000 .6207498 .693775 age75to79 | 1.094818 .0178766 61.24 0.000 1.05978 1.129855 age80to84 | 1.719851 .017546 98.02 0.000 1.685462 1.754241 age85p | 2.451387 .0170575 143.71 0.000 2.417955 2.484819 kz | .4794218 .0062738 76.42 0.000 .4671254 .4917181 _cons | -4.445508 .0166385 -267.18 0.000 -4.478119 -4.412897 logexpo | (offset) ------------------------------------------------------------------------------ . di exp( _b[kz] ) 1.609837

The coefficient of `kz`

tells us that the log
of the rates is 0.479 points higher in Kazahstan than in
Sweden. Exponentiating we find that rates in Kazahstan
are 61.5 percent higher than in Sweden.
Direct standardization found them 61.2 percent higher
(11.88/7.37 = 1.612), so the two approaches give very
similar results for this dataset.
The age coefficients show an age pattern of mortality
which is not unlike the average rates we computed earlier.

An advantage of the regression approach is that one can easily control for multiple confounders. Also, we can test for the presence of an interaction, which would question the validity of the additive model underlying direct standardization. In this case adding the interaction would make the model saturated for this data, so we just test goodness of fit:

. estat gof Goodness-of-fit chi2 = 2218.456 Prob > chi2(18) = 0.0000

The chi-squared of more than two thousand on 18 d.f. is highly significant, so the deaths rates are not proportional. With data based on national counts, however, even small differences can be significant. This interaction, while significant, represents less than five percent of the combined effect of age and country:

. di r(chi2)/(r(chi2) + 315966.6) .04453829

I could have typed in 2218.5, but the test stores the
chi2 value in `r(chi2)`

, and using that is
more precise and less error prone.

If the discussion in this last section doesn't make much sense to you don't worry. Just make a note to come back and re-read this section after you take wws509.