Germán Rodríguez
Demographic Methods Princeton University

## Rates and Standardization

We will work through the example in Preston et al., sections 2.2 and 2.3.

### Sample Data

I copied the counts of mid-year population and deaths by Age for Sweden and Kazakhstan from Table 2.1 into a text file which is available in the course website. The file is in "long" format. As usual I show the analysis in R and Stata.

```. infile str10 country str5 ageg pop deaths ///
>   using http://data.princeton.edu/eco572/datasets/preston21long.dat

. list in 1/3

+-------------------------------------+
|    country   ageg      pop   deaths |
|-------------------------------------|
1. | Kazakhstan      0   174078     3720 |
2. | Kazakhstan   01-4   754758     1220 |
3. | Kazakhstan   05-9   879129      396 |
+-------------------------------------+
```
```> p21 <- read.table("http://data.princeton.edu/eco572/datasets/preston21long.dat",
+   header=FALSE, col.names = c("country", "ageg", "pop", "deaths"))

country ageg    pop deaths
1 Kazakhstan    0 174078   3720
2 Kazakhstan 01-4 754758   1220
3 Kazakhstan 05-9 879129    396
```

The first thing to do is to calculate the age-specific rates, dividing deaths by population and multiplying by 1000:

```. gen rates = 1000 * deaths / pop
```
```> p21 <- mutate(p21, rates = 1000 * deaths / pop)
```

Crude death rates are just a weighted average of age-specific rates using the population in each age group as the weight. We can easily compute them like this:

```. tabstat rates [fw=pop], by(country)

Summary for variables: rates
by categories of: country

country |      mean
-----------+----------
Kazakhstan |  7.423042
Sweden |  10.54756
-----------+----------
Total |  8.470285
----------------------
```
```> group_by(p21, country) %>%
+   summarize(crude = weighted.mean(rates, pop))
Source: local data frame [2 x 2]

country     crude
(fctr)     (dbl)
1 Kazakhstan  7.423042
2     Sweden 10.547561
```

The interesting result here is that mortality appears to be lower in Kazakhstan than in Sweden.

### Standardized Rates

Following Preston et al., we will standardize the rates using the weighted average of the two population compositions as the standard. To do this we calculate first the percent distribution in each country and then average. While we are at it we will also calculate the average rates.

```> p21 <- group_by(p21, country) %>%
+          mutate(pct = pop/sum(pop))

> avg <- group_by(p21, ageg) %>%
+          summarize(pop = mean(pct), rates = mean(rates))
```
```. egen pcpop = pc(pop), by(country)

. egen avgcomp = mean(pcpop), by(ageg)
```

Now we can easily compute the standardized rates using the average composition as the standard. Because we assigned the grouped data to `p21` we can skip the grouping step.

```> summarize(p21, std = weighted.mean(rates, avg\$pop))
Source: local data frame [2 x 2]

country       std
(fctr)     (dbl)
1 Kazakhstan 11.881997
2     Sweden  7.374094
```
```. tabstat rates [aw=avgcomp], by(country)

Summary for variables: rates
by categories of: country

country |      mean
-----------+----------
Kazakhstan |    11.882
Sweden |  7.374094
-----------+----------
Total |  9.628045
----------------------
```

We see that once we adjust for age composition the mortality rate is in fact higher in Kzakhstan than in Sweden, in fact 61.1% higher.

### Indirect Standarization

Frequently we don't have age-specific rates, but can easily obtain the age distribution. We can still do a form of standardization applying the rates of one country (or any other standard) to the two age distributions. Let us use the rates for Sweden and apply them to both countries. To do this we sort by age and country, and for each age pick the second value

```. bysort age (country): gen swrates = rates[2]

. tabstat swrates [fw=pop], by(country)

Summary for variables: swrates
by categories of: country

country |      mean
-----------+----------
Kazakhstan |  4.200627
Sweden |  10.54756
-----------+----------
Total |  6.327926
----------------------
```
```> sweden <- filter(p21, country == "Sweden")

> summarize(p21, istd = weighted.mean(sweden\$rates, pop))
Source: local data frame [2 x 2]

country      istd
(fctr)     (dbl)
1 Kazakhstan  4.200627
2     Sweden 10.547561
```

So if Kazakhstan has Swedish rates the CDR would be 4.2. The ratio of the actual CDR to the expected rate under the standard age structure is known as the standardized mortality ratio and turns out to be

```. quietly sum rates [fw=pop] if country == "Kazakhstan"

. scalar kzcmr = r(mean)

. quietly sum swrates [fw=pop] if country == "Kazakhstan"

. di "SMR = ", kzcmr/r(mean)
SMR =  1.7671271
```
```> filter(p21, country == "Kazakhstan") %>%
+   summarize(istd = weighted.mean(sweden\$rates,pop),
+   crude = weighted.mean(rates,pop), smr = crude/istd)
Source: local data frame [1 x 4]

country     istd    crude      smr
(fctr)    (dbl)    (dbl)    (dbl)
1 Kazakhstan 4.200627 7.423042 1.767127
```

Mortality is in fact 77% higher than it would be at Swedish rates with the observed composition.

### Decomposition of Differences Between Rates

Preston et al. show how to decompose a difference between two rates into a part due to compositional differences and a part due to differences in rates. The first part is obtained applying the average composition to the observed rates. To assess the second component we apply the observed composition to the average rates.

```. egen avgrates = mean(rates), by(ageg)

. tabstat avgrates [fw=pop], by(country)

Summary for variables: avgrates
by categories of: country

country |      mean
-----------+----------
Kazakhstan |  5.811834
Sweden |  13.44426
-----------+----------
Total |   8.36999
----------------------
```
```> dbr <- summarize(p21,
+    crude = weighted.mean(rates,pop),
+    comp = weighted.mean(rates, avg\$pop),
+    rates = weighted.mean(avg\$rates, pop)); dbr
Source: local data frame [2 x 4]

country     crude      comp     rates
(fctr)     (dbl)     (dbl)     (dbl)
1 Kazakhstan  7.423042 11.881997  5.811834
2     Sweden 10.547561  7.374094 13.444256
```

We see that with the same rates the older composition of Sweden would result in much higher mortality than observed. We can now verify the decomposition:

```. display "Difference = " 10.55-7.42 ///
>   "; Composition = " 13.44-5.81 ///
>      "; Rates = " 7.37-11.88
Difference = 3.13; Composition = 7.63; Rates = -4.51
```
```> c(-1, 1) %*% as.matrix(data.frame(dbr[,-1]))
crude      comp    rates
[1,] 3.124519 -4.507903 7.632422
```

So a difference in 3.13 points in the CBR between Sweden and Kazakhstan results from a compositional effect of +7.63 and a difference in rates of -4.51. I wrote a Stata program that can do these calculations for data in "wide" format. In net-aware Stata type `net from http://data.princeton.edu/eco572/stata` and check out `ddrate`.

### Age-Patterns of Mortality

So far we have focused on attempts to summarize mortality in a single number, but why not look at the complete set of rates? I would like to plot the rates using the mid-points of the ages groups (with 87.5 for 85+). To compute these I first convert age from a string to numbers (1 to 19) and then figure out the midpoint of each group.

```. encode ageg, gen(agei)  // "05-9" becomes 3

. gen agem = 5*agei - 7.5  // 3 becomes 7.5

. replace agem = 0.5 if ageg == "0"

. replace agem = 3   if ageg == "01-4"
```
```> p21 <- mutate(p21,
+   agei = as.numeric(ageg),
+   age = ifelse(agei > 2, 5*agei - 7.5, ifelse(agei == 1, 0, 3)))
```

We then plot the rates using a log-scale (otherwise we wouldn't see much)

```. twoway (line rate agem if country=="Sweden")      ///
>         (line rate agem if country=="Kazakhstan")  ///
>       , yscale(log) ylabel(50 100)               ///
>       title("Age-Specific Mortality Rates")      ///
>       xtitle("Age") ytitle("log(ASMR)")          ///
>       legend(order(2 "Kazakhstan" 1 "Sweden") ring(0) pos(5) cols(1))

. graph export swkzasmr.png, replace width(500) height(400)
(file swkzasmr.png written in PNG format)
```

```> ggplot(p21, aes(x=age, y=log(rates), color=country)) + geom_line()

> ggsave("swkzasmrr.png", width=500/72, height=400/72, dpi=72)
```

We see clearly that mortality is higher in Kazakhstan than in Sweden at every age. The lines are not exactly parallel, in fact the difference seems to become smaller at older ages. (We will discuss later how heterogeneity can produce this type of convergence.)

### The Modern Regression Approach

The aim of standardization is to control for a compositional variable. This, of course, is also one of the aims of regression analysis. I now show how one could analyze the data using Poisson regression, a method appropriate for count data.

This regression technique assumes that the number of deaths has mean (and variance) given by the product of a rate and exposure time. The log of the mean is then the log of the rate (which is modelled using a linear predictor, just as in linear regression) times the log of exposure (which is defined as an offset, or a fixed part of the linear predictor). That's why the code below computes the log of the mid-year population counts. We also represent country using a dummy variable for Kazakhstan.

```. gen logexpo = log(pop)

. gen kz  = country == "Kazakhstan"

. poisson deaths i.agei kz, offset(logexpo)

Iteration 0:   log likelihood = -49623.369
Iteration 1:   log likelihood = -10315.221
Iteration 2:   log likelihood = -1482.9267
Iteration 3:   log likelihood = -1275.2697
Iteration 4:   log likelihood =  -1274.853
Iteration 5:   log likelihood =  -1274.853

Poisson regression                              Number of obs     =         38
LR chi2(19)       =  315966.62
Prob > chi2       =     0.0000
Log likelihood =  -1274.853                     Pseudo R2         =     0.9920

------------------------------------------------------------------------------
deaths |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
agei |
01-4  |  -2.600295   .0322872   -80.54   0.000    -2.663577   -2.537013
05-9  |  -3.823097   .0509118   -75.09   0.000    -3.922882   -3.723311
10-14  |  -4.003534   .0571946   -70.00   0.000    -4.115633   -3.891434
15-19  |   -3.29342    .043102   -76.41   0.000    -3.377899   -3.208942
20-24  |   -2.99395   .0395719   -75.66   0.000    -3.071509    -2.91639
25-29  |   -3.02696     .03777   -80.14   0.000    -3.100988   -2.952932
30-34  |  -2.743422    .033986   -80.72   0.000    -2.810034   -2.676811
35-39  |  -2.436497   .0318358   -76.53   0.000    -2.498894     -2.3741
40-44  |  -1.985014   .0285733   -69.47   0.000    -2.041017   -1.929012
45-49  |  -1.588591   .0280586   -56.62   0.000    -1.643584   -1.533597
50-54  |  -1.231015   .0229505   -53.64   0.000    -1.275997   -1.186032
55-59  |  -.6591381   .0220602   -29.88   0.000    -.7023753   -.6159008
60-64  |  -.3400267   .0199076   -17.08   0.000    -.3790448   -.3010086
65-69  |   .2502514   .0188483    13.28   0.000     .2133094    .2871935
70-74  |   .6572624   .0186292    35.28   0.000     .6207498     .693775
75-79  |   1.094818   .0178766    61.24   0.000      1.05978    1.129855
80-84  |   1.719851    .017546    98.02   0.000     1.685462    1.754241
85+  |   2.451387   .0170575   143.71   0.000     2.417955    2.484819
|
kz |   .4794218   .0062738    76.42   0.000     .4671254    .4917181
_cons |  -4.445508   .0166385  -267.18   0.000    -4.478119   -4.412897
logexpo |          1  (offset)
------------------------------------------------------------------------------

. di exp(_b[kz])
1.6151402
```
```> p21 <- mutate(p21, logexpo = log(pop), kz = as.numeric(country=="Kazakhstan"))

> fit <- glm(deaths ~ ageg + kz + offset(log(pop)), data = p21, family = poisson)

> coef(summary(fit))
Estimate Std. Error    z value      Pr(>|z|)
(Intercept) -4.4455076 0.01663845 -267.18275  0.000000e+00
ageg01-4    -2.6002957 0.03228717  -80.53649  0.000000e+00
ageg05-9    -3.8230970 0.05091182  -75.09253  0.000000e+00
ageg10-14   -4.0035340 0.05719463  -69.99842  0.000000e+00
ageg15-19   -3.2934201 0.04310201  -76.40989  0.000000e+00
ageg20-24   -2.9939502 0.03957193  -75.65843  0.000000e+00
ageg25-29   -3.0269600 0.03777003  -80.14185  0.000000e+00
ageg30-34   -2.7434226 0.03398604  -80.72205  0.000000e+00
ageg35-39   -2.4364968 0.03183579  -76.53326  0.000000e+00
ageg40-44   -1.9850146 0.02857332  -69.47090  0.000000e+00
ageg45-49   -1.5885907 0.02805860  -56.61690  0.000000e+00
ageg50-54   -1.2310144 0.02295053  -53.63773  0.000000e+00
ageg55-59   -0.6591382 0.02206023  -29.87903 3.685603e-196
ageg60-64   -0.3400269 0.01990755  -17.08030  2.080363e-65
ageg65-69    0.2502514 0.01884833   13.27711  3.142725e-40
ageg70-74    0.6572621 0.01862924   35.28122 1.139967e-272
ageg75-79    1.0948175 0.01787660   61.24306  0.000000e+00
ageg80-84    1.7198508 0.01754596   98.01978  0.000000e+00
ageg85+      2.4513869 0.01705754  143.71277  0.000000e+00
kz           0.4794215 0.00627376   76.41694  0.000000e+00

> exp(coef(fit)["kz"])
kz
1.61514
```

Exponentiating the coefficient of Kazakhstan we see that on average age-specific mortality is 61.5% higher than in Sweden, a result very similar to the 61.2% that we obtained using direct standardization.

An advantage of the regression approach is that one can easily control for multiple confounders. Also, we can test for the presence of an interaction, which would question the validity of the additive model underlying direct standardization. In this case, adding the interaction would make the model saturated for this data, so we just test goodness of fit.

```. estat gof

Deviance goodness-of-fit =  2218.456
Prob > chi2(18)          =    0.0000

Pearson goodness-of-fit  =  2026.899
Prob > chi2(18)          =    0.0000
```
```> data.frame(deviance(fit), df=fit\$df.residual)
deviance.fit. df
1      2218.549 18
```

The deviance of 2218 on 18 d.f. is highly significant, so we conclude that the rates are not proportional. With data based on national counts, however, even small differences can be significant. This interaction, while significant, represents less than one percent of the combined effect of age and country

```. di r(chi2_d)/e(chi2)
.00702117
```
```> deviance(fit)/(fit\$null.deviance - deviance(fit))
[1] 0.007021467
```

If the discussion in the last section doesn't make much sense to you don't worry. Just make a note to come back and re-read this part after you take WWS509/POP507, the generalized linear models course.

Updated 22-Jan-2016