Germán Rodríguez
Demographic Methods Princeton University

Rates and Standardization

We will work through the example in Preston et al., sections 2.2 and 2.3.

Sample Data

I copied the counts of mid-year population and deaths by Age for Sweden and Kazakhstan from Table 2.1 into a text file which is available in the course website. The file is in "long" format. As usual I show the analysis in R and Stata.

. infile str10 country str5 ageg pop deaths ///
>   using http://data.princeton.edu/eco572/datasets/preston21long.dat
(38 observations read)

. list in 1/3

     +-------------------------------------+
     |    country   ageg      pop   deaths |
     |-------------------------------------|
  1. | Kazakhstan      0   174078     3720 |
  2. | Kazakhstan   01-4   754758     1220 |
  3. | Kazakhstan   05-9   879129      396 |
     +-------------------------------------+
> p21 <- read.table("http://data.princeton.edu/eco572/datasets/preston21long.dat",
+   header=FALSE, col.names = c("country", "ageg", "pop", "deaths"))

> head(p21, 3)
     country ageg    pop deaths
1 Kazakhstan    0 174078   3720
2 Kazakhstan 01-4 754758   1220
3 Kazakhstan 05-9 879129    396

The first thing to do is to calculate the age-specific rates, dividing deaths by population and multiplying by 1000:

. gen rates = 1000 * deaths / pop
> p21 <- mutate(p21, rates = 1000 * deaths / pop)

Crude death rates are just a weighted average of age-specific rates using the population in each age group as the weight. We can easily compute them like this:

. tabstat rates [fw=pop], by(country)

Summary for variables: rates
     by categories of: country 

   country |      mean
-----------+----------
Kazakhstan |  7.423042
    Sweden |  10.54756
-----------+----------
     Total |  8.470285
----------------------
> group_by(p21, country) %>%
+   summarize(crude = weighted.mean(rates, pop))
Source: local data frame [2 x 2]

     country     crude
      (fctr)     (dbl)
1 Kazakhstan  7.423042
2     Sweden 10.547561

The interesting result here is that mortality appears to be lower in Kazakhstan than in Sweden.

Standardized Rates

Following Preston et al., we will standardize the rates using the weighted average of the two population compositions as the standard. To do this we calculate first the percent distribution in each country and then average. While we are at it we will also calculate the average rates.

> p21 <- group_by(p21, country) %>%
+          mutate(pct = pop/sum(pop))

> avg <- group_by(p21, ageg) %>%
+          summarize(pop = mean(pct), rates = mean(rates))
. egen pcpop = pc(pop), by(country)

. egen avgcomp = mean(pcpop), by(ageg)

Now we can easily compute the standardized rates using the average composition as the standard. Because we assigned the grouped data to p21 we can skip the grouping step.

> summarize(p21, std = weighted.mean(rates, avg$pop))
Source: local data frame [2 x 2]

     country       std
      (fctr)     (dbl)
1 Kazakhstan 11.881997
2     Sweden  7.374094
. tabstat rates [aw=avgcomp], by(country)

Summary for variables: rates
     by categories of: country 

   country |      mean
-----------+----------
Kazakhstan |    11.882
    Sweden |  7.374094
-----------+----------
     Total |  9.628045
----------------------

We see that once we adjust for age composition the mortality rate is in fact higher in Kzakhstan than in Sweden, in fact 61.1% higher.

Indirect Standarization

Frequently we don't have age-specific rates, but can easily obtain the age distribution. We can still do a form of standardization applying the rates of one country (or any other standard) to the two age distributions. Let us use the rates for Sweden and apply them to both countries. To do this we sort by age and country, and for each age pick the second value

. bysort age (country): gen swrates = rates[2]

. tabstat swrates [fw=pop], by(country)

Summary for variables: swrates
     by categories of: country 

   country |      mean
-----------+----------
Kazakhstan |  4.200627
    Sweden |  10.54756
-----------+----------
     Total |  6.327926
----------------------
> sweden <- filter(p21, country == "Sweden")

> summarize(p21, istd = weighted.mean(sweden$rates, pop))
Source: local data frame [2 x 2]

     country      istd
      (fctr)     (dbl)
1 Kazakhstan  4.200627
2     Sweden 10.547561

So if Kazakhstan has Swedish rates the CDR would be 4.2. The ratio of this number to the actual CDR is known as the standardized mortality ratio and turns out to be

. quietly sum rates [fw=pop] if country == "Kazakhstan"

. scalar kzcmr = r(mean)

. quietly sum swrates [fw=pop] if country == "Kazakhstan"

. di "SMR = ", kzcmr/r(mean)
SMR =  1.7671271
> filter(p21, country == "Kazakhstan") %>%
+   summarize(istd = weighted.mean(sweden$rates,pop), 
+   crude = weighted.mean(rates,pop), smr = crude/istd)
Source: local data frame [1 x 4]

     country     istd    crude      smr
      (fctr)    (dbl)    (dbl)    (dbl)
1 Kazakhstan 4.200627 7.423042 1.767127

Mortality is in fact 77% higher than it would be at Swedish rates with the observed composition.

Decomposition of Differences Between Rates

Preston et al. show how to decompose a difference between two rates into a part due to compositional differences and a part due to differences in rates. The first part is obtained applying the average composition to the observed rates. To assess the second component we apply the observed composition to the average rates.

. egen avgrates = mean(rates), by(ageg)

. tabstat avgrates [fw=pop], by(country)

Summary for variables: avgrates
     by categories of: country 

   country |      mean
-----------+----------
Kazakhstan |  5.811834
    Sweden |  13.44426
-----------+----------
     Total |   8.36999
----------------------
> dbr <- summarize(p21,
+    crude = weighted.mean(rates,pop),
+    comp = weighted.mean(rates, avg$pop), 
+    rates = weighted.mean(avg$rates, pop)); dbr
Source: local data frame [2 x 4]

     country     crude      comp     rates
      (fctr)     (dbl)     (dbl)     (dbl)
1 Kazakhstan  7.423042 11.881997  5.811834
2     Sweden 10.547561  7.374094 13.444256

We see that with the same rates the older composition of Sweden would result in much higher mortality than observed. We can now verify the decomposition:

. display "Difference = " 10.55-7.42 ///
>   "; Composition = " 13.44-5.81 ///
>      "; Rates = " 7.37-11.88    
Difference = 3.13; Composition = 7.63; Rates = -4.51
> c(-1, 1) %*% as.matrix(data.frame(dbr[,-1]))
        crude      comp    rates
[1,] 3.124519 -4.507903 7.632422

So a difference in 3.13 points in the CBR between Sweden and Kazakhstan results from a compositional effect of +7.63 and a difference in rates of -4.51. I wrote a Stata program that can do these calculations for data in "wide" format. In net-aware Stata type net from http://data.princeton.edu/eco572/stata and check out ddrate.

Age-Patterns of Mortality

So far we have focused on attempts to summarize mortality in a single number, but why not look at the complete set of rates? I would like to plot the rates using the mid-points of the ages groups (with 87.5 for 85+). To compute these I first convert age from a string to numbers (1 to 19) and then figure out the midpoint of each group.

. encode ageg, gen(agei)  // "05-9" becomes 3

. gen agem = 5*agei - 7.5  // 3 becomes 7.5

. replace agem = 0.5 if ageg == "0"
(2 real changes made)

. replace agem = 3   if ageg == "01-4"
(2 real changes made)
> p21 <- mutate(p21, 
+   agei = as.numeric(ageg), 
+   age = ifelse(agei > 2, 5*agei - 7.5, ifelse(agei == 1, 0, 3)))

We then plot the rates using a log-scale (otherwise we wouldn't see much)

. twoway (line rate agem if country=="Sweden")      ///
>         (line rate agem if country=="Kazakhstan")  ///
>       , yscale(log) ylabel(50 100)               ///
>       title("Age-Specific Mortality Rates")      ///
>       xtitle("Age") ytitle("log(ASMR)")          ///
>       legend(order(2 "Kazakhstan" 1 "Sweden") ring(0) pos(5) cols(1))

. graph export swkzasmr.png, replace width(500) height(400)
(file swkzasmr.png written in PNG format)

ASMR

> ggplot(p21, aes(x=age, y=log(rates), color=country)) + geom_line()

> ggsave("swkzasmrr.png", width=500/72, height=400/72, dpi=72) 

ASMR We see clearly that mortality is higher in Kazakhstan than in Sweden at every age. The lines are not exactly parallel, in fact the difference seems to become smaller at older ages. (We will discuss later how heterogeneity can produce this type of convergence.)

The Modern Regression Approach

The aim of standardization is to control for a compositional variable. This, of course, is also one of the aims of regression analysis. I now show how one could analyze the data using Poisson regression, a method appropriate for count data.

This regression technique assumes that the number of deaths has mean (and variance) given by the product of a rate and exposure time. The log of the mean is then the log of the rate (which is modelled using a linear predictor, just as in linear regression) times the log of exposure (which is defined as an offset, or a fixed part of the linear predictor). That's why the code below computes the log of the mid-year population counts. We also represent country using a dummy variable for Kazakhstan.

. gen logexpo = log(pop)

. gen kz  = country == "Kazakhstan"

. poisson deaths i.agei kz, offset(logexpo)

Iteration 0:   log likelihood = -49623.369  
Iteration 1:   log likelihood = -10315.221  
Iteration 2:   log likelihood = -1482.9267  
Iteration 3:   log likelihood = -1275.2697  
Iteration 4:   log likelihood =  -1274.853  
Iteration 5:   log likelihood =  -1274.853  

Poisson regression                              Number of obs     =         38
                                                LR chi2(19)       =  315966.62
                                                Prob > chi2       =     0.0000
Log likelihood =  -1274.853                     Pseudo R2         =     0.9920

------------------------------------------------------------------------------
      deaths |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        agei |
       01-4  |  -2.600295   .0322872   -80.54   0.000    -2.663577   -2.537013
       05-9  |  -3.823097   .0509118   -75.09   0.000    -3.922882   -3.723311
      10-14  |  -4.003534   .0571946   -70.00   0.000    -4.115633   -3.891434
      15-19  |   -3.29342    .043102   -76.41   0.000    -3.377899   -3.208942
      20-24  |   -2.99395   .0395719   -75.66   0.000    -3.071509    -2.91639
      25-29  |   -3.02696     .03777   -80.14   0.000    -3.100988   -2.952932
      30-34  |  -2.743422    .033986   -80.72   0.000    -2.810034   -2.676811
      35-39  |  -2.436497   .0318358   -76.53   0.000    -2.498894     -2.3741
      40-44  |  -1.985014   .0285733   -69.47   0.000    -2.041017   -1.929012
      45-49  |  -1.588591   .0280586   -56.62   0.000    -1.643584   -1.533597
      50-54  |  -1.231015   .0229505   -53.64   0.000    -1.275997   -1.186032
      55-59  |  -.6591381   .0220602   -29.88   0.000    -.7023753   -.6159008
      60-64  |  -.3400267   .0199076   -17.08   0.000    -.3790448   -.3010086
      65-69  |   .2502514   .0188483    13.28   0.000     .2133094    .2871935
      70-74  |   .6572624   .0186292    35.28   0.000     .6207498     .693775
      75-79  |   1.094818   .0178766    61.24   0.000      1.05978    1.129855
      80-84  |   1.719851    .017546    98.02   0.000     1.685462    1.754241
        85+  |   2.451387   .0170575   143.71   0.000     2.417955    2.484819
             |
          kz |   .4794218   .0062738    76.42   0.000     .4671254    .4917181
       _cons |  -4.445508   .0166385  -267.18   0.000    -4.478119   -4.412897
     logexpo |          1  (offset)
------------------------------------------------------------------------------

. di exp(_b[kz])
1.6151402
> p21 <- mutate(p21, logexpo = log(pop), kz = as.numeric(country=="Kazakhstan"))

> fit <- glm(deaths ~ ageg + kz + offset(log(pop)), data = p21, family = poisson)

> coef(summary(fit))
              Estimate Std. Error    z value      Pr(>|z|)
(Intercept) -4.4455076 0.01663845 -267.18275  0.000000e+00
ageg01-4    -2.6002957 0.03228717  -80.53649  0.000000e+00
ageg05-9    -3.8230970 0.05091182  -75.09253  0.000000e+00
ageg10-14   -4.0035340 0.05719463  -69.99842  0.000000e+00
ageg15-19   -3.2934201 0.04310201  -76.40989  0.000000e+00
ageg20-24   -2.9939502 0.03957193  -75.65843  0.000000e+00
ageg25-29   -3.0269600 0.03777003  -80.14185  0.000000e+00
ageg30-34   -2.7434226 0.03398604  -80.72205  0.000000e+00
ageg35-39   -2.4364968 0.03183579  -76.53326  0.000000e+00
ageg40-44   -1.9850146 0.02857332  -69.47090  0.000000e+00
ageg45-49   -1.5885907 0.02805860  -56.61690  0.000000e+00
ageg50-54   -1.2310144 0.02295053  -53.63773  0.000000e+00
ageg55-59   -0.6591382 0.02206023  -29.87903 3.685603e-196
ageg60-64   -0.3400269 0.01990755  -17.08030  2.080363e-65
ageg65-69    0.2502514 0.01884833   13.27711  3.142725e-40
ageg70-74    0.6572621 0.01862924   35.28122 1.139967e-272
ageg75-79    1.0948175 0.01787660   61.24306  0.000000e+00
ageg80-84    1.7198508 0.01754596   98.01978  0.000000e+00
ageg85+      2.4513869 0.01705754  143.71277  0.000000e+00
kz           0.4794215 0.00627376   76.41694  0.000000e+00

> exp(coef(fit)["kz"])
     kz 
1.61514 

Exponentiating the coefficient of Kazakhstan we see that on average age-specific mortality is 61.5% higher than in Sweden, a result very similar to the 61.2% that we obtained using direct standardization.

An advantage of the regression approach is that one can easily control for multiple confounders. Also, we can test for the presence of an interaction, which would question the validity of the additive model underlying direct standardization. In this case, adding the interaction would make the model saturated for this data, so we just test goodness of fit.

. estat gof 

         Deviance goodness-of-fit =  2218.456
         Prob > chi2(18)          =    0.0000

         Pearson goodness-of-fit  =  2026.899
         Prob > chi2(18)          =    0.0000
> data.frame(deviance(fit), df=fit$df.residual)
  deviance.fit. df
1      2218.549 18

The deviance of 2218 on 18 d.f. is highly significant, so we conclude that the rates are not proportional. With data based on national counts, however, even small differences can be significant. This interaction, while significant, represents less than one percent of the combined effect of age and country

. di r(chi2_d)/e(chi2)
.00702117
> deviance(fit)/(fit$null.deviance - deviance(fit))
[1] 0.007021467

If the discussion in the last section doesn't make much sense to you don't worry. Just make a note to come back and re-read this part after you take WWS509/POP507, the generalized linear models course.

Updated 22-Jan-2016