## 4. Poisson Models in Stata

This unit illustrates the use of Poisson regression for modeling
count data. We will be using the `poisson`

command,
often followed by `estat gof`

to compute the model's
deviance, which we can use as a goodness of fit test
with both individual and grouped data.

An alternative way to fit these models is to use the `glm`

command to fit generalized linear models in the Poisson family with
link log. An advantage of that command is that it reports the deviance
and Pearson's chi-squared statistics. There's also an option to
adjust standard errors for extra-Poisson variation. We will
illustrate its use in the context of models for overdispersed count data.

### 4.3 A Poisson Model for Counts

We will use the data from Fiji on children ever born that appear on Table 4.1 of the lecture notes. The data are available on our datasets page at http://data.princeton.edu/wws509/datasets in both plain text and Stata formats. We will read the Stata file:

. use http://data.princeton.edu/wws509/datasets/ceb, clear (Children Ever Born Data, Fiji, 1976)

The file has 70 observations, one for each cell in the table. Each observation has a sequence number, numeric codes for marriage duration, residence and education, the mean and variance of children ever born, and the number of women in the cell.

#### Mean-Variance Relationship

We start by doing Figure 4.1, plotting the cell variances versus the cell means using a log-log-scale for cell with at least 20 cases. Because Stata has an option to use log scales we don't need to take logs ourselves:

. twoway (scatter var mean if n > 20) /// > (function y=x, range (.7 7.8)) /// > , xscale(log) yscale(log) legend(off) /// > xtitle(Mean CEB) ytitle(Variance of CEB) /// > title("Figure 4.1. Mean-Variance Relationship") /// > subtitle("Children Ever Born in Fiji, 1976") . graph export c4fig1.png, width(500) replace (file c4fig1.png written in PNG format)

Clearly the variance increases with the mean. Most of the points lie below the 45 degree line, indicating that the variance is not exactly equal to the mean. Still, the assumption of proportionality brings as much closer to the data than the assumption of constant variance.

#### Response, Offset and Predictors

The dataset does not have information about the number of children
ever born (CEB) to each woman, but it turns out we can still model
the mean by working with the cell totals and introducing the log
of the number of women in the cell as an *offset*.

If the number of CEB to __one__ woman in a given cell is a
Poisson random variable with mean (and variance) μ,
then the number born to all *n* women in that cell is a
Poisson r.v. with mean (and variance) *n*μ.
The log of the expected sum is log(n)+log(μ), and consists of
a known offset and the quantity we are interested in modeling.
See the notes for further details

We therefore start by computing the outcome, the total CEB in each cell, and the offset:

. gen y = round( mean * n, 1) . gen os = log(n)

We also need dummy variables for the predictors. I will use local macros to store the names of the dummies that go with each predictor:

. // duration (reference is 0-4) . gen dur0509 = dur == 2 . gen dur1014 = dur == 3 . gen dur1519 = dur == 4 . gen dur2024 = dur == 5 . gen dur2529 = dur == 6 . local dur dur0509 dur1014 dur1519 dur2024 dur2529 . // residence (reference is Fiji) . gen urban = res == 2 . gen rural = res == 3 . local res urban rural . // education (reference is None) . gen lowerPri = educ == 2 . gen upperPri = educ == 3 . gen secPlus = educ == 4 . local educ lowerPri upperPri secPlus

#### The Null Model

We are ready to fit the null model, which has an offset but no predictors.

. poisson y, offset(os) Iteration 0: log likelihood = -2080.664 Iteration 1: log likelihood = -2080.664 Poisson regression Number of obs = 70 LR chi2(0) = -0.00 Prob > chi2 = . Log likelihood = -2080.664 Pseudo R2 = -0.0000 ------------------------------------------------------------------------------ y | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- _cons | 1.376346 .0097119 141.72 0.000 1.357311 1.395381 os | (offset) ------------------------------------------------------------------------------ . di exp(_b[_cons]) 3.9604033 . quietly sum mean [fw=n] . di r(mean) 3.9604968 . estat gof Goodness-of-fit chi2 = 3731.851 Prob > chi2(69) = 0.0000

The constant is the log of the mean number of children ever born. Exponentiating we see that the estimated mean is almost four children per woman. The estimate coincides with the sample mean, as we verified by averaging the cell means with the number of women as a frequency weight.

The deviance of 3,732 on 69 d.f. gives a clear indication that the model doesn't fit the data. The hypothesis that the expected number of CEB is the same for all women regardless of marriage duration, residence and education, is soundly rejected,

In rate models the offset usually represent the log of exposure,
and Stata lets us specify it directly using the `offset()`

option with the name of the variable representing the offset,
or using the `exposure()`

option with the name of the
variable representing exposure, in which case Stata takes the log.

#### One-Factor Models

Next we fit the three one-factor models, starting with residence:

. poisson y `res', offset(os) Iteration 0: log likelihood = -2051.3779 Iteration 1: log likelihood = -2044.3868 Iteration 2: log likelihood = -2044.3778 Iteration 3: log likelihood = -2044.3778 Poisson regression Number of obs = 70 LR chi2(2) = 72.57 Prob > chi2 = 0.0000 Log likelihood = -2044.3778 Pseudo R2 = 0.0174 ------------------------------------------------------------------------------ y | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- urban | .1442896 .032448 4.45 0.000 .0806926 .2078866 rural | .2280596 .0278321 8.19 0.000 .1735097 .2826095 _cons | 1.204598 .0249922 48.20 0.000 1.155614 1.253581 os | (offset) ------------------------------------------------------------------------------ . di exp(_b[urban]), exp(_b[rural]) 1.1552186 1.2561602 . estat gof Goodness-of-fit chi2 = 3659.279 Prob > chi2(67) = 0.0000

The estimates show that women in urban and rural areas have on average 16 and 26% more children than women in Suva. The model chi-squared of 73 on 2 d.f. tells us that this model is a significant improvement over the null. The deviance, still in the thousands, tells us that this model is far from fitting the data.

Now for education

. poisson y `educ', offset(os) Iteration 0: log likelihood = -1588.3352 Iteration 1: log likelihood = -1545.4751 Iteration 2: log likelihood = -1545.2371 Iteration 3: log likelihood = -1545.2371 Poisson regression Number of obs = 70 LR chi2(3) = 1070.85 Prob > chi2 = 0.0000 Log likelihood = -1545.2371 Pseudo R2 = 0.2573 ------------------------------------------------------------------------------ y | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- lowerPri | -.2117869 .0216769 -9.77 0.000 -.2542729 -.1693008 upperPri | -.6160532 .0288581 -21.35 0.000 -.6726141 -.5594922 secPlus | -1.224676 .0514108 -23.82 0.000 -1.32544 -1.123913 _cons | 1.647278 .0146932 112.11 0.000 1.61848 1.676076 os | (offset) ------------------------------------------------------------------------------ . mata exp(st_matrix("e(b)")) 1 2 3 4 +---------------------------------------------------------+ 1 | .8091371376 .5400718104 .2938527957 5.192824803 | +---------------------------------------------------------+ . estat gof Goodness-of-fit chi2 = 2660.998 Prob > chi2(66) = 0.0000

The estimates show that the number of CEB declines substantially with education. Women with secondary education or more have 71% fewer children than women with no education (or only 29% as many). The educational differential is highly significant, but this model doesn't fit the data.

Finally, here's duration:

. poisson y `dur', offset(os) Iteration 0: log likelihood = -315.2481 Iteration 1: log likelihood = -297.80021 Iteration 2: log likelihood = -297.77426 Iteration 3: log likelihood = -297.77426 Poisson regression Number of obs = 70 LR chi2(5) = 3565.78 Prob > chi2 = 0.0000 Log likelihood = -297.77426 Pseudo R2 = 0.8569 ------------------------------------------------------------------------------ y | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- dur0509 | 1.044886 .0523975 19.94 0.000 .9421893 1.147584 dur1014 | 1.444947 .0502397 28.76 0.000 1.346479 1.543416 dur1519 | 1.706756 .0497474 34.31 0.000 1.609253 1.80426 dur2024 | 1.877474 .0496492 37.81 0.000 1.780164 1.974785 dur2529 | 2.078855 .047507 43.76 0.000 1.985743 2.171967 _cons | -.1036046 .0441511 -2.35 0.019 -.1901391 -.01707 os | (offset) ------------------------------------------------------------------------------ . estat gof Goodness-of-fit chi2 = 166.072 Prob > chi2(64) = 0.0000

Not surprisingly the number of CEB is much higher for women who have been married longer. This is by far the most important predictor of CEB, with a chi-squared of 3,566 on just 5 d.f. In fact, a demographer wouldn't even have looked at models that did not include a control for duration of marriage. It's nice to see that Poisson regression can uncover the obvious :) Note that this model still doesn't fit the data.

The deviances given in this section are pretty close to the deviances in Table 4.3 of the notes. You will notice small differences due to the use of different rounding procedures. In the notes we multiplied the mean CEB by the number of women and retained a few decimals. Here we rounded the total number of CEB to the nearest integer. If you omit the rounding you will reproduce the results in the notes exactly.

#### Two-Factor Models

We now consider models that take two of the three factors into account. Following the notes we consider only models that include duration of marriage, an essential control when we study cumulative fertility. This leaves two models with main effects of two factors, and another two models that add one interaction.

Because we are only interested in deviances I will run the estimation commands quietly. I will also use factor variables because it simplifies specifying the models. You can always build the dummies from first principles.

So here are the additive models

. quietly poisson y i.dur i.res , offset(os) . estat gof Goodness-of-fit chi2 = 120.6806 Prob > chi2(62) = 0.0000 . quietly poisson y i.dur i.educ, offset(os) . estat gof Goodness-of-fit chi2 = 100.1919 Prob > chi2(61) = 0.0012

And here are the models with one interaction

. quietly poisson y i.dur#i.res , offset(os) . estat gof Goodness-of-fit chi2 = 108.8968 Prob > chi2(52) = 0.0000 . quietly poisson y i.dur#i.educ, offset(os) . estat gof Goodness-of-fit chi2 = 84.53072 Prob > chi2(46) = 0.0005

The best fit so far is the model that includes duration and education, but it exhibits significant lack of fit with a chi-squared of 84.5 on 46 d.f.

#### The Three-Factor Additive Model

We are now ready to look at models that include all three factors. We start with the additive model. I will use dummy variables because it makes the output a bit clearer, but we can get exactly the same result with factor variables.

. poisson y `dur' `res' `educ', offset(os) Iteration 0: log likelihood = -623.59688 Iteration 1: log likelihood = -252.64903 Iteration 2: log likelihood = -250.07248 Iteration 3: log likelihood = -250.07108 Iteration 4: log likelihood = -250.07108 Poisson regression Number of obs = 70 LR chi2(10) = 3661.19 Prob > chi2 = 0.0000 Log likelihood = -250.07108 Pseudo R2 = 0.8798 ------------------------------------------------------------------------------ y | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- dur0509 | .9969348 .0527437 18.90 0.000 .8935591 1.100311 dur1014 | 1.369395 .0510688 26.81 0.000 1.269302 1.469488 dur1519 | 1.613757 .0511949 31.52 0.000 1.513417 1.714097 dur2024 | 1.784911 .0512138 34.85 0.000 1.684534 1.885288 dur2529 | 1.976405 .0500341 39.50 0.000 1.87834 2.07447 urban | .1124186 .0324963 3.46 0.001 .048727 .1761102 rural | .1516602 .0283292 5.35 0.000 .096136 .2071845 lowerPri | .0229728 .0226563 1.01 0.311 -.0214327 .0673783 upperPri | -.1012738 .0309871 -3.27 0.001 -.1620073 -.0405402 secPlus | -.3101495 .0552107 -5.62 0.000 -.4183605 -.2019386 _cons | -.1170972 .0549118 -2.13 0.033 -.2247222 -.0094721 os | (offset) ------------------------------------------------------------------------------ . estat gof Goodness-of-fit chi2 = 70.66559 Prob > chi2(59) = 0.1421

This model passes the goodness of fit hurdle, with a deviance of 70.67 on 59 d.f. and a corresponding P-value of 0.14, so we have no evidence against this model.

To exponentiate the parameter estimates we can reissue with
`poisson`

command with the `irr`

option,
which is short for __i__ncidence-__r__ate __r__atios.

. poisson, irr Poisson regression Number of obs = 70 LR chi2(10) = 3661.19 Prob > chi2 = 0.0000 Log likelihood = -250.07108 Pseudo R2 = 0.8798 ------------------------------------------------------------------------------ y | IRR Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- dur0509 | 2.709963 .1429334 18.90 0.000 2.443812 3.005099 dur1014 | 3.932972 .2008521 26.81 0.000 3.558369 4.34701 dur1519 | 5.021644 .2570824 31.52 0.000 4.542226 5.551663 dur2024 | 5.95905 .3051855 34.85 0.000 5.389938 6.588254 dur2529 | 7.216753 .3610835 39.50 0.000 6.542636 7.960327 urban | 1.118981 .0363628 3.46 0.001 1.049934 1.192569 rural | 1.163765 .0329685 5.35 0.000 1.100909 1.230209 lowerPri | 1.023239 .0231828 1.01 0.311 .9787954 1.0697 upperPri | .9036856 .0280026 -3.27 0.001 .850435 .9602706 secPlus | .7333373 .040488 -5.62 0.000 .6581249 .8171451 os | (offset) ------------------------------------------------------------------------------

Briefly the estimates indicate that the number of CEB increases rapidly with marital duration, in each category of residence and education women married 15-19 years have five times as many children as those married less than five years. Women who live in urban and rural areas have 12% and 16% more children than women who live in Suva and have the same marriage duration and education. Finally, more educated women have fewer children, with women with secondary or more education reporting 27% fewer children than women with no education who live in the same type of place of residence and have been married just as long.

#### Three-Factor Models with Interactions

We now put the additive model to some "stress tests" by considering
all possible interactions. I will use factor variables for
simplicity and `quietly`

to save space.

. quietly poisson y i.dur i.educ#i.res, offset(os) . estat gof Goodness-of-fit chi2 = 59.92104 Prob > chi2(53) = 0.2391 . quietly poisson y i.dur#i.res i.educ, offset(os) . estat gof Goodness-of-fit chi2 = 57.13525 Prob > chi2(49) = 0.1986 . quietly poisson y i.dur#i.educ i.res, offset(os) . estat gof Goodness-of-fit chi2 = 54.80171 Prob > chi2(44) = 0.1274 . quietly poisson y (i.dur i.res)#i.educ, offset(os) . estat gof Goodness-of-fit chi2 = 44.52355 Prob > chi2(38) = 0.2163 . quietly poisson y (i.dur i.educ)#i.res, offset(os) . estat gof Goodness-of-fit chi2 = 44.31134 Prob > chi2(43) = 0.4161 . quietly poisson y i.dur#(i.res i.educ), offset(os) . estat gof Goodness-of-fit chi2 = 42.65186 Prob > chi2(34) = 0.1467 . quietly poisson y i.dur i.educ i.res /// > i.dur#i.educ i.dur#i.res i.res#i.educ, offset(os) . estat gof Goodness-of-fit chi2 = 30.85619 Prob > chi2(28) = 0.3235

These calculations complete Table 4.3 in the notes. I reported the deviances for consistency with the notes, but could just as well have reported likelihood ratio tests comparing each of these models to the additive model. Make sure you know how to use the output to test, for example, whether we need to add a duration by education interaction. It should be clear from the list of deviances that we don't need to add any of these terms. We conclude that the additive model does a fine job indeed.

it's important to note that the need for interactions depends
exactly on what's being modeled. Here we used the log link,
so all effects are *relative*. In this scale no interactions
are needed. If we used the identity link we would be modeling
the actual number of children ever born and all effects would
be *absolute*. In that scale we would need, at the very
least, interactions with duration of marriage.
See the notes for further discussion.

Note 1: some of these models may fail in older versions of Stata,
which by default allowed up to 40 parameters per model.
The solution is to increase the maximum using the command
`set matsize 60`

.
Stata 9 increased the default to 200, which is more than we
need for all of these models.

Note 2: If you are using the `xi`

prefix and look at
the detailed output you will see that Stata drops some
variables due to multicollinearity. This is usually no cause
for alarm. The `xi`

prefix is not terribly smart
in handling factors involved in more than one interaction and
will try to include the main effects twice. When this happens
the variables dropped are the copies and the originals are
already included in the model, as you can verify by inspecting
the listing.