Home | GLMs | Multilevel | Survival | Demography | Stata | R

Rates and Standardization (Revised)

We will work through the example in Preston et al, sections 2.2 and 2.3. (This revised version of the handout de-emphasizes programming Stata and focuses on canned procedures.)

Sample Data

I copied the counts of mid-year population and deaths by Age for Sweden and Kazakhstan from Table 2.1 into a text file which is available in the course website. The file is in "long" format and can be read into Stata using

. infile str10 country str5 ageg pop deaths ///
>         using http://data.princeton.edu/eco572/datasets/preston21long.dat
(38 observations read)

The first thing we do is calculate the age-specific rates, dividing deaths by population and multiplying by 1000:

. gen rates = 1000 * deaths / pop

Crude death rates are just a weighted average of age-specific rates using the population in each age group as the weight. We can easily compute them in Stata using the tabstat command:

. tabstat rates [fw=pop], by(country)

Summary for variables: rates
     by categories of: country 

   country |      mean
-----------+----------
Kazakhstan |  7.423042
    Sweden |  10.54756
-----------+----------
     Total |  8.470285
----------------------

The interesting result here is that mortality appears to be lower in Kazakhstan than in Sweden.

Standardized Rates

Following Preston et al., we will standardize the rates using the unweighted average of the two population compositions as the standard. To do this we first compute the percent distribution for each country using egen, and then compute the average percent in each age:

. egen pcpop = pc(pop), by(country)

. egen avgcomp = mean(pcpop), by(ageg)

You may want to list the data to verify that avg has the same values for the two countries. Now we can compute the standardized rate in one line:

. tabstat rates [aw=avgcomp], by(country)

Summary for variables: rates
     by categories of: country 

   country |      mean
-----------+----------
Kazakhstan |    11.882
    Sweden |  7.374094
-----------+----------
     Total |  9.628045
----------------------

The only difference is that I specified aw, an "analytic" weight, instead of fw a "frequency" weight, so Stata wouldn't complaint about non-integer weights. Both compute means the same way: multiply each observation by the weight, sum, and divide by the sum of the weights.

Indirect Standardization

Frequently we don't have age-specific rates but can easily obtain the age distribution. We can still do a form of standardization applying the rates of one country (or any other standard) to the two age distributions.

Let us first create a variable that has the Swedish rates for both countries. We do this sorting by age and then country, and for each age we pick the rate of the second country:

. bysort ageg (country): gen swrates = rates[2]

The by command is a very powerful feature of Stata that can repeat a command for subgroups. The data must be sorted, but if you specify bysort Stata will take care of that. In this case the computation is done by age, but putting country in parentheses tells Stata to also sort by country (so Sweden is always the second observation). We can now average the Swedish rates using each country's population

. tabstat swrates [fw=pop], by(country)

Summary for variables: swrates
     by categories of: country 

   country |      mean
-----------+----------
Kazakhstan |  4.200627
    Sweden |  10.54756
-----------+----------
     Total |  6.327926
----------------------

So if Kazakhstan had Swedish rates the CDR would be 4.2. The ratio of this number to the actual CDR is known as the standardized mortality ratio (SMR), and turns out to be

. di 7.42/4.20

1.7666667

Once again we see that mortality is in fact higher in Kazakhstan than in Sweden.

Decomposition of Differences Between Rates

Preston et al. show how to decompose a difference between two rates into a part due to compositional differences and a part due to differences in rates. We are half-way there because we already computed (standardized) rates using the average composition for both countries. To complete the calculations we now compute the average rates

. egen avgrates = mean(rates), by(ageg)

and average those using the actual population counts:

. tabstat avgrates [fw=pop], by(country)

Summary for variables: avgrates
     by categories of: country 

   country |      mean
-----------+----------
Kazakhstan |  5.811834
    Sweden |  13.44426
-----------+----------
     Total |   8.36999
----------------------

We see that with the same rates, the older composition of Sweden would result in much higher mortality as measured by a crude rate. You can now verify the decomposition:

. display "Difference = " 10.55-7.42 ///
>          "; Composition = " 13.44-5.81 ///
>          "; Rates = " 7.37-11.88
Difference = 3.13; Composition = 7.63; Rates = -4.51

I wrote a little program that can do all these calculations but it requires data in "wide" format. See ddrate in the course Stata site. (In net-aware Stata type net from http://data.princeton.edu/eco572/stata.)

Age-Patterns of Mortality

So far we have focused on attempts to summarize mortality in a single number, but why not look at the complete set of rates? I would like to plot the rates using the mid-points of the ages groups (with 87.5 for 85+). To compute these I first convert age from a string to numbers (1 to 19) and then figure out the midpoint

. encode ageg, gen(agei)  // "05-9" becomes 3

. gen agem = 5*agei - 7.5  // 3 becomes 7.5

. replace agem = 0.5 if ageg == "0"
(2 real changes made)

. replace agem = 3   if ageg == "01-4"
(2 real changes made)

We now plot the rates using a log-scale (otherwise we don't see much)

. twoway (line rate agem if country=="Sweden")      ///
>            (line rate agem if country=="Kazakhstan")  ///
>          , yscale(log) ylabel(50 100)               ///
>          title("Age-Specific Mortality Rates")      ///
>          xtitle("Age") ytitle("log(ASMR)")          ///
>          legend(order(1 "Sweden" 2 "Kazakhstan") ring(0) pos(5))

. graph export swkzasmr.png, replace
(note: file swkzasmr.png not found)
(file swkzasmr.png written in PNG format)

We see clearly that mortality is higher in Kazakhstan than in Sweden at every age. The lines are not exactly parallel, as the difference seems to become smaller at older ages. (We will discuss later how heterogeneity can produce this type of convergence.)

The Stata Standardization Commands

As noted earlier Stata has commands for direct and indirect standardization. By default, dstdize uses a weighted average of the two populations as the standard. Unfortunately it is hard to trick it into using an unweighted average as we did here, because it insists on having integer population counts. So we will use a weighted average.

. dstdize deaths pop ageg, by(country)

----------------------------------------------------------
-> country= Kazakhstan 
                         -----Unadjusted-----  Std.
                                Pop.  Stratum  Pop.  
  Stratum       Pop.     Cases  Dist. Rate[s] Dst[P]  s*P
----------------------------------------------------------
        0     174078      3720  0.020 0.0214  0.018 0.0004
     01-4     754758      1220  0.087 0.0016  0.075 0.0001
     05-9     879129       396  0.101 0.0005  0.086 0.0000
    10-14     808510       298  0.093 0.0004  0.080 0.0000
    15-19     720161       561  0.083 0.0008  0.075 0.0001
    20-24     622988       673  0.072 0.0011  0.070 0.0001
    25-29     733057       752  0.084 0.0010  0.080 0.0001
    30-34     732312       965  0.084 0.0013  0.077 0.0001
    35-39     612825      1113  0.070 0.0018  0.069 0.0001
    40-44     487996      1405  0.056 0.0029  0.061 0.0002
    45-49     284799      1226  0.033 0.0043  0.046 0.0002
    50-54     503608      2878  0.058 0.0057  0.057 0.0003
    55-59     301879      3266  0.035 0.0108  0.039 0.0004
    60-64     374317      5212  0.043 0.0139  0.045 0.0006
    65-69     256247      6866  0.029 0.0268  0.037 0.0010
    70-74     154623      6182  0.018 0.0400  0.029 0.0012
    75-79     149917      8199  0.017 0.0547  0.026 0.0014
    80-84      88716      9013  0.010 0.1016  0.018 0.0018
      85+      58940     10627  0.007 0.1803  0.013 0.0023
----------------------------------------------------------
Totals:      8698860     64572    Adjusted Cases:  90573.0
                                      Crude Rate:   0.0074
                                   Adjusted Rate:   0.0104
                       95% Conf. Interval: [0.0103, 0.0105]

----------------------------------------------------------
-> country= Sweden 
                         -----Unadjusted-----  Std.
                                Pop.  Stratum  Pop.  
  Stratum       Pop.     Cases  Dist. Rate[s] Dst[P]  s*P
----------------------------------------------------------
        0      59727       279  0.014 0.0047  0.018 0.0001
     01-4     229775        42  0.052 0.0002  0.075 0.0000
     05-9     245172        31  0.056 0.0001  0.086 0.0000
    10-14     240110        33  0.055 0.0001  0.080 0.0000
    15-19     264957        61  0.060 0.0002  0.075 0.0000
    20-24     287176        87  0.065 0.0003  0.070 0.0000
    25-29     311111        98  0.071 0.0003  0.080 0.0000
    30-34     280991       140  0.064 0.0005  0.077 0.0000
    35-39     286899       197  0.065 0.0007  0.069 0.0000
    40-44     308238       362  0.070 0.0012  0.061 0.0001
    45-49     320172       643  0.073 0.0020  0.046 0.0001
    50-54     242230       738  0.055 0.0030  0.057 0.0002
    55-59     210785       972  0.048 0.0046  0.039 0.0002
    60-64     216058      1640  0.049 0.0076  0.045 0.0003
    65-69     224479      2752  0.051 0.0123  0.037 0.0005
    70-74     222578      4509  0.051 0.0203  0.029 0.0006
    75-79     184102      6745  0.042 0.0366  0.026 0.0009
    80-84     140667      9587  0.032 0.0682  0.018 0.0012
      85+     110242     17340  0.025 0.1573  0.013 0.0020
----------------------------------------------------------
Totals:      4385469     46256    Adjusted Cases:  27750.9
                                      Crude Rate:   0.0105
                                   Adjusted Rate:   0.0063
                       95% Conf. Interval: [0.0063, 0.0064]

Summary of Study Populations:
  country             N      Crude     Adj_Rate       Confidence Interval
 --------------------------------------------------------------------------
Kazakhsta       8698860   0.007423     0.010412    [  0.010333,    0.010491]
   Sweden       4385469   0.010548     0.006328    [  0.006270,    0.006385]

The output is quite detailed, but you should pay attention to the bottom line: it confirms the crude rates of 7.42 and 10.5, and gets new standardized rates of 10.4 and 6.3. (You should make sure you can reproduce these using tabstat.)

The Modern Regression Approach

The aim of standardization is to control for a compositional variable. This, of course, is also one of the main aims of regression analysis. I now show how one could analyze the data using Poisson regression, a method appropriate for count data.

This regression technique assumes that the number of deaths has mean (and variance) given by the product of a rate and exposure time. The log of the mean is then the log of the rate (which is modelled using a linear predictor just as in linear regression) times the log of exposure (which is defined as an offset, or fixed part of the linear predictor). That's why the first thing we do below is compute the log of our mid-year population counts.

We also want the rates to depend on age and country, so we define appropriate dummy variables and run the model. (You could use tab ageg, gen(age) to generate dummy variables for age but they are called age1, age2,.... You could also use the xi: prefix and specify i.ageg to have Stata generate dummies "on the fly", but the generated names are even worse. With a bit of work we can get more meaningful names.)

. gen logexpo = log(pop)

. gen kz = country == "Kazakhstan"

. gen age1to4 = ageg == "01-4"

. gen age5to9 = ageg == "05-9"

. forvalues bot=10(5)80 {
  2.   local top = `bot' + 4
  3.   gen age`bot'to`top' = ageg == "`bot'-`top'" 
  4. }

. gen age85p = ageg == "85+"

. poisson deaths age1to4-age85p kz, offset(logexpo) nolog

Poisson regression                                Number of obs   =         38
                                                  LR chi2(19)     =  315966.62
                                                  Prob > chi2     =     0.0000
Log likelihood =  -1274.853                       Pseudo R2       =     0.9920

------------------------------------------------------------------------------
      deaths |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     age1to4 |  -2.600295   .0322872   -80.54   0.000    -2.663577   -2.537013
     age5to9 |  -3.823097   .0509118   -75.09   0.000    -3.922882   -3.723311
   age10to14 |  -4.003534   .0571946   -70.00   0.000    -4.115633   -3.891434
   age15to19 |   -3.29342    .043102   -76.41   0.000    -3.377899   -3.208942
   age20to24 |   -2.99395   .0395719   -75.66   0.000    -3.071509    -2.91639
   age25to29 |   -3.02696     .03777   -80.14   0.000    -3.100988   -2.952932
   age30to34 |  -2.743422    .033986   -80.72   0.000    -2.810034   -2.676811
   age35to39 |  -2.436497   .0318358   -76.53   0.000    -2.498894     -2.3741
   age40to44 |  -1.985014   .0285733   -69.47   0.000    -2.041017   -1.929012
   age45to49 |  -1.588591   .0280586   -56.62   0.000    -1.643584   -1.533597
   age50to54 |  -1.231015   .0229505   -53.64   0.000    -1.275997   -1.186032
   age55to59 |  -.6591381   .0220602   -29.88   0.000    -.7023753   -.6159008
   age60to64 |  -.3400267   .0199076   -17.08   0.000    -.3790448   -.3010086
   age65to69 |   .2502514   .0188483    13.28   0.000     .2133094    .2871935
   age70to74 |   .6572624   .0186292    35.28   0.000     .6207498     .693775
   age75to79 |   1.094818   .0178766    61.24   0.000      1.05978    1.129855
   age80to84 |   1.719851    .017546    98.02   0.000     1.685462    1.754241
      age85p |   2.451387   .0170575   143.71   0.000     2.417955    2.484819
          kz |   .4794218   .0062738    76.42   0.000     .4671254    .4917181
       _cons |  -4.445508   .0166385  -267.18   0.000    -4.478119   -4.412897
     logexpo |   (offset)
------------------------------------------------------------------------------

. di exp( _b[kz] )
1.609837

The coefficient of kz tells us that the log of the rates is 0.479 points higher in Kazahstan than in Sweden. Exponentiating we find that rates in Kazahstan are 61.5 percent higher than in Sweden. Direct standardization found them 61.2 percent higher (11.88/7.37 = 1.612), so the two approaches give very similar results for this dataset. The age coefficients show an age pattern of mortality which is not unlike the average rates we computed earlier.

An advantage of the regression approach is that one can easily control for multiple confounders. Also, we can test for the presence of an interaction, which would question the validity of the additive model underlying direct standardization. In this case adding the interaction would make the model saturated for this data, so we just test goodness of fit:

. estat gof
         Goodness-of-fit chi2  =  2218.456
         Prob > chi2(18)       =    0.0000

The chi-squared of more than two thousand on 18 d.f. is highly significant, so the deaths rates are not proportional. With data based on national counts, however, even small differences can be significant. This interaction, while significant, represents less than five percent of the combined effect of age and country:

. di r(chi2)/(r(chi2) + 315966.6)

.04453829

I could have typed in 2218.5, but the test stores the chi2 value in r(chi2), and using that is more precise and less error prone.

If the discussion in this last section doesn't make much sense to you don't worry. Just make a note to come back and re-read this section after you take wws509.