Germán Rodríguez
Demographic Methods Princeton University

## Rates and Standardization (Revised)

We will work through the example in Preston et al, sections 2.2 and 2.3. (This revised version of the handout de-emphasizes programming Stata and focuses on canned procedures.)

#### Sample Data

I copied the counts of mid-year population and deaths by Age for Sweden and Kazakhstan from Table 2.1 into a text file which is available in the course website. The file is in "long" format and can be read into Stata using

```. infile str10 country str5 ageg pop deaths ///
>         using http://data.princeton.edu/eco572/datasets/preston21long.dat
```

The first thing we do is calculate the age-specific rates, dividing deaths by population and multiplying by 1000:

```. gen rates = 1000 * deaths / pop
```

Crude death rates are just a weighted average of age-specific rates using the population in each age group as the weight. We can easily compute them in Stata using the `tabstat` command:

```. tabstat rates [fw=pop], by(country)

Summary for variables: rates
by categories of: country

country |      mean
-----------+----------
Kazakhstan |  7.423042
Sweden |  10.54756
-----------+----------
Total |  8.470285
----------------------
```

The interesting result here is that mortality appears to be lower in Kazakhstan than in Sweden.

#### Standardized Rates

Following Preston et al., we will standardize the rates using the unweighted average of the two population compositions as the standard. To do this we first compute the percent distribution for each country using `egen`, and then compute the average percent in each age:

```. egen pcpop = pc(pop), by(country)

. egen avgcomp = mean(pcpop), by(ageg)
```

You may want to list the data to verify that `avg` has the same values for the two countries. Now we can compute the standardized rate in one line:

```. tabstat rates [aw=avgcomp], by(country)

Summary for variables: rates
by categories of: country

country |      mean
-----------+----------
Kazakhstan |    11.882
Sweden |  7.374094
-----------+----------
Total |  9.628045
----------------------
```

The only difference is that I specified `aw`, an "analytic" weight, instead of `fw` a "frequency" weight, so Stata wouldn't complaint about non-integer weights. Both compute means the same way: multiply each observation by the weight, sum, and divide by the sum of the weights.

#### Indirect Standardization

Frequently we don't have age-specific rates but can easily obtain the age distribution. We can still do a form of standardization applying the rates of one country (or any other standard) to the two age distributions.

Let us first create a variable that has the Swedish rates for both countries. We do this sorting by age and then country, and for each age we pick the rate of the second country:

```. bysort ageg (country): gen swrates = rates[2]
```

The `by` command is a very powerful feature of Stata that can repeat a command for subgroups. The data must be sorted, but if you specify `bysort` Stata will take care of that. In this case the computation is done by `age`, but putting `country` in parentheses tells Stata to also sort by country (so Sweden is always the second observation). We can now average the Swedish rates using each country's population

```. tabstat swrates [fw=pop], by(country)

Summary for variables: swrates
by categories of: country

country |      mean
-----------+----------
Kazakhstan |  4.200627
Sweden |  10.54756
-----------+----------
Total |  6.327926
----------------------
```

So if Kazakhstan had Swedish rates the CDR would be 4.2. The ratio of this number to the actual CDR is known as the standardized mortality ratio (SMR), and turns out to be

```. di 7.42/4.20

1.7666667
```

Once again we see that mortality is in fact higher in Kazakhstan than in Sweden.

#### Decomposition of Differences Between Rates

Preston et al. show how to decompose a difference between two rates into a part due to compositional differences and a part due to differences in rates. We are half-way there because we already computed (standardized) rates using the average composition for both countries. To complete the calculations we now compute the average rates

```. egen avgrates = mean(rates), by(ageg)
```

and average those using the actual population counts:

```. tabstat avgrates [fw=pop], by(country)

Summary for variables: avgrates
by categories of: country

country |      mean
-----------+----------
Kazakhstan |  5.811834
Sweden |  13.44426
-----------+----------
Total |   8.36999
----------------------
```

We see that with the same rates, the older composition of Sweden would result in much higher mortality as measured by a crude rate. You can now verify the decomposition:

```. display "Difference = " 10.55-7.42 ///
>          "; Composition = " 13.44-5.81 ///
>          "; Rates = " 7.37-11.88
Difference = 3.13; Composition = 7.63; Rates = -4.51
```

I wrote a little program that can do all these calculations but it requires data in "wide" format. See `ddrate` in the course Stata site. (In net-aware Stata type `net from http://data.princeton.edu/eco572/stata`.)

#### Age-Patterns of Mortality

So far we have focused on attempts to summarize mortality in a single number, but why not look at the complete set of rates? I would like to plot the rates using the mid-points of the ages groups (with 87.5 for 85+). To compute these I first convert `age` from a string to numbers (1 to 19) and then figure out the midpoint

```. encode ageg, gen(agei)  // "05-9" becomes 3

. gen agem = 5*agei - 7.5  // 3 becomes 7.5

. replace agem = 0.5 if ageg == "0"

. replace agem = 3   if ageg == "01-4"
```

We now plot the rates using a log-scale (otherwise we don't see much)

```. twoway (line rate agem if country=="Sweden")      ///
>            (line rate agem if country=="Kazakhstan")  ///
>          , yscale(log) ylabel(50 100)               ///
>          title("Age-Specific Mortality Rates")      ///
>          xtitle("Age") ytitle("log(ASMR)")          ///
>          legend(order(1 "Sweden" 2 "Kazakhstan") ring(0) pos(5))

. graph export swkzasmr.png, replace
(file swkzasmr.png written in PNG format)
```

We see clearly that mortality is higher in Kazakhstan than in Sweden at every age. The lines are not exactly parallel, as the difference seems to become smaller at older ages. (We will discuss later how heterogeneity can produce this type of convergence.)

#### The Stata Standardization Commands

As noted earlier Stata has commands for direct and indirect standardization. By default, ` dstdize` uses a weighted average of the two populations as the standard. Unfortunately it is hard to trick it into using an unweighted average as we did here, because it insists on having integer population counts. So we will use a weighted average.

```. dstdize deaths pop ageg, by(country)

----------------------------------------------------------
-> country= Kazakhstan
Pop.  Stratum  Pop.
Stratum       Pop.     Cases  Dist. Rate[s] Dst[P]  s*P
----------------------------------------------------------
0     174078      3720  0.020 0.0214  0.018 0.0004
01-4     754758      1220  0.087 0.0016  0.075 0.0001
05-9     879129       396  0.101 0.0005  0.086 0.0000
10-14     808510       298  0.093 0.0004  0.080 0.0000
15-19     720161       561  0.083 0.0008  0.075 0.0001
20-24     622988       673  0.072 0.0011  0.070 0.0001
25-29     733057       752  0.084 0.0010  0.080 0.0001
30-34     732312       965  0.084 0.0013  0.077 0.0001
35-39     612825      1113  0.070 0.0018  0.069 0.0001
40-44     487996      1405  0.056 0.0029  0.061 0.0002
45-49     284799      1226  0.033 0.0043  0.046 0.0002
50-54     503608      2878  0.058 0.0057  0.057 0.0003
55-59     301879      3266  0.035 0.0108  0.039 0.0004
60-64     374317      5212  0.043 0.0139  0.045 0.0006
65-69     256247      6866  0.029 0.0268  0.037 0.0010
70-74     154623      6182  0.018 0.0400  0.029 0.0012
75-79     149917      8199  0.017 0.0547  0.026 0.0014
80-84      88716      9013  0.010 0.1016  0.018 0.0018
85+      58940     10627  0.007 0.1803  0.013 0.0023
----------------------------------------------------------
Totals:      8698860     64572    Adjusted Cases:  90573.0
Crude Rate:   0.0074
95% Conf. Interval: [0.0103, 0.0105]

----------------------------------------------------------
-> country= Sweden
Pop.  Stratum  Pop.
Stratum       Pop.     Cases  Dist. Rate[s] Dst[P]  s*P
----------------------------------------------------------
0      59727       279  0.014 0.0047  0.018 0.0001
01-4     229775        42  0.052 0.0002  0.075 0.0000
05-9     245172        31  0.056 0.0001  0.086 0.0000
10-14     240110        33  0.055 0.0001  0.080 0.0000
15-19     264957        61  0.060 0.0002  0.075 0.0000
20-24     287176        87  0.065 0.0003  0.070 0.0000
25-29     311111        98  0.071 0.0003  0.080 0.0000
30-34     280991       140  0.064 0.0005  0.077 0.0000
35-39     286899       197  0.065 0.0007  0.069 0.0000
40-44     308238       362  0.070 0.0012  0.061 0.0001
45-49     320172       643  0.073 0.0020  0.046 0.0001
50-54     242230       738  0.055 0.0030  0.057 0.0002
55-59     210785       972  0.048 0.0046  0.039 0.0002
60-64     216058      1640  0.049 0.0076  0.045 0.0003
65-69     224479      2752  0.051 0.0123  0.037 0.0005
70-74     222578      4509  0.051 0.0203  0.029 0.0006
75-79     184102      6745  0.042 0.0366  0.026 0.0009
80-84     140667      9587  0.032 0.0682  0.018 0.0012
85+     110242     17340  0.025 0.1573  0.013 0.0020
----------------------------------------------------------
Totals:      4385469     46256    Adjusted Cases:  27750.9
Crude Rate:   0.0105
95% Conf. Interval: [0.0063, 0.0064]

Summary of Study Populations:
country             N      Crude     Adj_Rate       Confidence Interval
--------------------------------------------------------------------------
Kazakhsta       8698860   0.007423     0.010412    [  0.010333,    0.010491]
Sweden       4385469   0.010548     0.006328    [  0.006270,    0.006385]
```

The output is quite detailed, but you should pay attention to the bottom line: it confirms the crude rates of 7.42 and 10.5, and gets new standardized rates of 10.4 and 6.3. (You should make sure you can reproduce these using `tabstat`.)

#### The Modern Regression Approach

The aim of standardization is to control for a compositional variable. This, of course, is also one of the main aims of regression analysis. I now show how one could analyze the data using Poisson regression, a method appropriate for count data.

This regression technique assumes that the number of deaths has mean (and variance) given by the product of a rate and exposure time. The log of the mean is then the log of the rate (which is modelled using a linear predictor just as in linear regression) times the log of exposure (which is defined as an offset, or fixed part of the linear predictor). That's why the first thing we do below is compute the log of our mid-year population counts.

We also want the rates to depend on age and country, so we define appropriate dummy variables and run the model. (You could use `tab ageg, gen(age)` to generate dummy variables for age but they are called `age1, age2,...`. You could also use the `xi:` prefix and specify `i.ageg` to have Stata generate dummies "on the fly", but the generated names are even worse. With a bit of work we can get more meaningful names.)

```. gen logexpo = log(pop)

. gen kz = country == "Kazakhstan"

. gen age1to4 = ageg == "01-4"

. gen age5to9 = ageg == "05-9"

. forvalues bot=10(5)80 {
2.   local top = `bot' + 4
3.   gen age`bot'to`top' = ageg == "`bot'-`top'"
4. }

. gen age85p = ageg == "85+"

. poisson deaths age1to4-age85p kz, offset(logexpo) nolog

Poisson regression                                Number of obs   =         38
LR chi2(19)     =  315966.62
Prob > chi2     =     0.0000
Log likelihood =  -1274.853                       Pseudo R2       =     0.9920

------------------------------------------------------------------------------
deaths |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
age1to4 |  -2.600295   .0322872   -80.54   0.000    -2.663577   -2.537013
age5to9 |  -3.823097   .0509118   -75.09   0.000    -3.922882   -3.723311
age10to14 |  -4.003534   .0571946   -70.00   0.000    -4.115633   -3.891434
age15to19 |   -3.29342    .043102   -76.41   0.000    -3.377899   -3.208942
age20to24 |   -2.99395   .0395719   -75.66   0.000    -3.071509    -2.91639
age25to29 |   -3.02696     .03777   -80.14   0.000    -3.100988   -2.952932
age30to34 |  -2.743422    .033986   -80.72   0.000    -2.810034   -2.676811
age35to39 |  -2.436497   .0318358   -76.53   0.000    -2.498894     -2.3741
age40to44 |  -1.985014   .0285733   -69.47   0.000    -2.041017   -1.929012
age45to49 |  -1.588591   .0280586   -56.62   0.000    -1.643584   -1.533597
age50to54 |  -1.231015   .0229505   -53.64   0.000    -1.275997   -1.186032
age55to59 |  -.6591381   .0220602   -29.88   0.000    -.7023753   -.6159008
age60to64 |  -.3400267   .0199076   -17.08   0.000    -.3790448   -.3010086
age65to69 |   .2502514   .0188483    13.28   0.000     .2133094    .2871935
age70to74 |   .6572624   .0186292    35.28   0.000     .6207498     .693775
age75to79 |   1.094818   .0178766    61.24   0.000      1.05978    1.129855
age80to84 |   1.719851    .017546    98.02   0.000     1.685462    1.754241
age85p |   2.451387   .0170575   143.71   0.000     2.417955    2.484819
kz |   .4794218   .0062738    76.42   0.000     .4671254    .4917181
_cons |  -4.445508   .0166385  -267.18   0.000    -4.478119   -4.412897
logexpo |   (offset)
------------------------------------------------------------------------------

. di exp( _b[kz] )
1.609837
```

The coefficient of `kz` tells us that the log of the rates is 0.479 points higher in Kazahstan than in Sweden. Exponentiating we find that rates in Kazahstan are 61.5 percent higher than in Sweden. Direct standardization found them 61.2 percent higher (11.88/7.37 = 1.612), so the two approaches give very similar results for this dataset. The age coefficients show an age pattern of mortality which is not unlike the average rates we computed earlier.

An advantage of the regression approach is that one can easily control for multiple confounders. Also, we can test for the presence of an interaction, which would question the validity of the additive model underlying direct standardization. In this case adding the interaction would make the model saturated for this data, so we just test goodness of fit:

```. estat gof
Goodness-of-fit chi2  =  2218.456
Prob > chi2(18)       =    0.0000
```

The chi-squared of more than two thousand on 18 d.f. is highly significant, so the deaths rates are not proportional. With data based on national counts, however, even small differences can be significant. This interaction, while significant, represents less than five percent of the combined effect of age and country:

```. di r(chi2)/(r(chi2) + 315966.6)

.04453829
```

I could have typed in 2218.5, but the test stores the chi2 value in `r(chi2)`, and using that is more precise and less error prone.

If the discussion in this last section doesn't make much sense to you don't worry. Just make a note to come back and re-read this section after you take wws509.