Extra-Poisson Variation

We use data from Long (1990) on the number of publications produced by Ph.D. biochemists.

The variables in the dataset are

These data have also been analyzed by Long and Freese (2001), and are available from the Stata website:

. use http://www.stata-press.com/data/lf2/couart2

(Academic Biochemists / S Long)

. summarize art

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         art |       915    1.692896    1.926069          0         19

The mean number of articles is 1.69 and the variance is 3.71, a bit more than twice the mean. The data are over-dispersed, but of course we haven't considered any covariates yet.

A Poisson Model

Let us fit the model used by Long and Freese(2001), a simple additive model using all five predictors. We could use poisson to obtain the estimates and then estat goff to obtain the deviance.

. poisson art fem mar kid5 phd ment

Iteration 0:   log likelihood = -1651.4574  
Iteration 1:   log likelihood = -1651.0567  
Iteration 2:   log likelihood = -1651.0563  
Iteration 3:   log likelihood = -1651.0563  

Poisson regression                                Number of obs   =        915
                                                  LR chi2(5)      =     183.03
                                                  Prob > chi2     =     0.0000
Log likelihood = -1651.0563                       Pseudo R2       =     0.0525

------------------------------------------------------------------------------
         art |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         fem |  -.2245942   .0546138    -4.11   0.000    -.3316352   -.1175532
         mar |   .1552434   .0613747     2.53   0.011     .0349512    .2755356
        kid5 |  -.1848827   .0401272    -4.61   0.000    -.2635305   -.1062349
         phd |   .0128226   .0263972     0.49   0.627     -.038915    .0645601
        ment |   .0255427   .0020061    12.73   0.000     .0216109    .0294746
       _cons |   .3046168   .1029822     2.96   0.003     .1027755    .5064581
------------------------------------------------------------------------------

. estat gof

         Goodness-of-fit chi2  =  1634.371
         Prob > chi2(909)      =    0.0000

Alternatively, we can use glm, which gives us not only the estimates and the deviance but Pearson's chi-squared statistic as well

. glm art fem mar kid5 phd ment, family(poisson)

Iteration 0:   log likelihood = -1670.3221  
Iteration 1:   log likelihood = -1651.1048  
Iteration 2:   log likelihood = -1651.0563  
Iteration 3:   log likelihood = -1651.0563  

Generalized linear models                          No. of obs      =       915
Optimization     : ML                              Residual df     =       909
                                                   Scale parameter =         1
Deviance         =  1634.370984                    (1/df) Deviance =  1.797988
Pearson          =   1662.54655                    (1/df) Pearson  =  1.828984

Variance function: V(u) = u                        [Poisson]
Link function    : g(u) = ln(u)                    [Log]

                                                   AIC             =  3.621981
Log likelihood   = -1651.056316                    BIC             = -4564.031

------------------------------------------------------------------------------
             |                 OIM
         art |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         fem |  -.2245942   .0546138    -4.11   0.000    -.3316352   -.1175532
         mar |   .1552434   .0613747     2.53   0.011     .0349512    .2755356
        kid5 |  -.1848827   .0401272    -4.61   0.000    -.2635305   -.1062349
         phd |   .0128226   .0263972     0.49   0.627     -.038915    .0645601
        ment |   .0255427   .0020061    12.73   0.000     .0216109    .0294746
       _cons |   .3046168   .1029822     2.96   0.003     .1027755    .5064581
------------------------------------------------------------------------------

We see that the model obviously doesn't fit the data. The five-percent critical value for a chi-squared with 909 d.f. is

. display invchi2tail(909,0.05)

980.25178

and both the deviance and Pearson's chi-squared are in the 1600's.

Before we proceed further we will store the coefficients and their variances in matrices for later use:

. mat bp = e(b)

. mat vp = vecdiag(e(V))

Over-Dispersed Poisson

We now assume that the variance is proportional to the mean and estimate the scale parameter by dividing Pearson's chi-squared by its d.f. Instead of copying numbers from the output we use the stored results

. di e(deviance_p)/e(df)

1.8289841

The variance is about 83% larger than the mean. We can adjust the standard errors multiplying by 1.35, the square root of the scale parameter. We can do this with the scale option of the glm command, which takes either a numeric value (in our case 1.828984) or simply x2 to indicate that the adjustment should be based on Pearson's chi-squared.

. glm art fem mar kid5 phd ment, family(poisson) scale(x2)

Iteration 0:   log likelihood = -1670.3221  
Iteration 1:   log likelihood = -1651.1048  
Iteration 2:   log likelihood = -1651.0563  
Iteration 3:   log likelihood = -1651.0563  

Generalized linear models                          No. of obs      =       915
Optimization     : ML                              Residual df     =       909
                                                   Scale parameter =         1
Deviance         =  1634.370984                    (1/df) Deviance =  1.797988
Pearson          =   1662.54655                    (1/df) Pearson  =  1.828984

Variance function: V(u) = u                        [Poisson]
Link function    : g(u) = ln(u)                    [Log]

                                                   AIC             =  3.621981
Log likelihood   = -1651.056316                    BIC             = -4564.031

------------------------------------------------------------------------------
             |                 OIM
         art |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         fem |  -.2245942   .0738596    -3.04   0.002    -.3693564    -.079832
         mar |   .1552434   .0830031     1.87   0.061    -.0074397    .3179265
        kid5 |  -.1848827    .054268    -3.41   0.001     -.291246   -.0785194
         phd |   .0128226   .0356995     0.36   0.719    -.0571472    .0827924
        ment |   .0255427    .002713     9.41   0.000     .0202253    .0308602
       _cons |   .3046168    .139273     2.19   0.029     .0316468    .5775869
------------------------------------------------------------------------------
(Standard errors scaled using square root of Pearson X2-based dispersion)

You can verify that these standard errors are about 35% larger than before. We will save the variances in a matrix for later use

. mat vo = vecdiag(e(V))

Negative Binomial Regression

Let us now fit a negative binomial model with the same predictors:

. nbreg art fem mar kid5 phd ment

Fitting Poisson model:

Iteration 0:   log likelihood = -1651.4574  
Iteration 1:   log likelihood = -1651.0567  
Iteration 2:   log likelihood = -1651.0563  
Iteration 3:   log likelihood = -1651.0563  

Fitting constant-only model:

Iteration 0:   log likelihood = -1625.4242  
Iteration 1:   log likelihood = -1609.9746  
Iteration 2:   log likelihood = -1609.9368  
Iteration 3:   log likelihood = -1609.9367  

Fitting full model:

Iteration 0:   log likelihood = -1565.6652  
Iteration 1:   log likelihood = -1561.0095  
Iteration 2:   log likelihood = -1560.9583  
Iteration 3:   log likelihood = -1560.9583  

Negative binomial regression                      Number of obs   =        915
                                                  LR chi2(5)      =      97.96
Dispersion     = mean                             Prob > chi2     =     0.0000
Log likelihood = -1560.9583                       Pseudo R2       =     0.0304

------------------------------------------------------------------------------
         art |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         fem |  -.2164184   .0726724    -2.98   0.003    -.3588537   -.0739832
         mar |   .1504895   .0821063     1.83   0.067    -.0104359    .3114148
        kid5 |  -.1764152   .0530598    -3.32   0.001    -.2804105     -.07242
         phd |   .0152712   .0360396     0.42   0.672    -.0553652    .0859075
        ment |   .0290823   .0034701     8.38   0.000     .0222811    .0358836
       _cons |    .256144   .1385604     1.85   0.065    -.0154294    .5277174
-------------+----------------------------------------------------------------
    /lnalpha |  -.8173044   .1199372                     -1.052377   -.5822318
-------------+----------------------------------------------------------------
       alpha |   .4416205   .0529667                      .3491069    .5586502
------------------------------------------------------------------------------
Likelihood-ratio test of alpha=0:  chibar2(01) =  180.20 Prob>=chibar2 = 0.000

Stata's alpha is the variance of the unobservable; it is estimated to be 0.44 and is highly significant (non zero). Twice the difference in log-likelihoods between this model and the (nested) Poisson model gives a likelihood ratio statistic of 180.2. Stata reports this as chibar2 because it doesn't have a standard chi-squared distribution with one d.f. Treating it as a chi-squared one anyway gives a conservative test. We have overwhelming evidence of overdispersion.

Note that the parameter estimates are not very different from those based on a Poisson regression model. Let us compare them side by side.

. mat coef = e(b)

. mat bnb = coef[1,1..6]

. mat b = bp' , bnb'

. mat list b

b[6,2]
                   y1          y1
  art:fem  -.22459423  -.21641842
  art:mar   .15524338   .15048945
 art:kid5   -.1848827  -.17641524
  art:phd   .01282258   .01527116
 art:ment   .02554275   .02908234
art:_cons   .30461683   .25614402

Both sets of estimates would lead to the same conclusions.

We now compare the variances of the estimated coefficients based on the Poisson, over-dispersed Poisson, and negative binomial models:

. mat d = vecdiag(e(V))

. mat vnb = d[1,1..6]

. mat v = vp' , vo' , vnb'

. mat list v

v[6,3]
                  r1         r1         r1
  art:fem  .00298266  .00545524  .00528127
  art:mar  .00376685  .00688951  .00674144
 art:kid5  .00161019  .00294501  .00281534
  art:phd  .00069681  .00127446  .00129885
 art:ment  4.024e-06  7.360e-06  .00001204
art:_cons  .01060532  .01939697  .01919898

Note that the Poisson model underestimates standard errors, and that both approaches to over-dispersion lead to very similar estimates.

One way to compute the deviance of the negative binomial model is to feed the estimate of the variance into glm, which can estimate these models for a fixed value of the scale parameter

. glm art fem mar kid5 phd ment, family(nb .4416205)

Iteration 0:   log likelihood = -1563.5534  
Iteration 1:   log likelihood = -1560.9585  
Iteration 2:   log likelihood = -1560.9583  
Iteration 3:   log likelihood = -1560.9583  

Generalized linear models                          No. of obs      =       915
Optimization     : ML                              Residual df     =       909
                                                   Scale parameter =         1
Deviance         =  1004.281481                    (1/df) Deviance =   1.10482
Pearson          =  944.5494396                    (1/df) Pearson  =  1.039108

Variance function: V(u) = u+(.4416205)u^2          [Neg. Binomial]
Link function    : g(u) = ln(u)                    [Log]

                                                   AIC             =  3.425046
Log likelihood   = -1560.958338                    BIC             =  -5194.12

------------------------------------------------------------------------------
             |                 OIM
         art |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         fem |  -.2164184   .0726706    -2.98   0.003    -.3588501   -.0739867
         mar |   .1504895   .0821062     1.83   0.067    -.0104358    .3114147
        kid5 |  -.1764152   .0530587    -3.32   0.001    -.2804084   -.0724221
         phd |   .0152712   .0360382     0.42   0.672    -.0553624    .0859047
        ment |   .0290823   .0034657     8.39   0.000     .0222896    .0358751
       _cons |    .256144   .1385256     1.85   0.064    -.0153613    .5276493
------------------------------------------------------------------------------

We see that the negative binomial model fits much better than the Poisson, but still has a deviance (just) above the five percent critical value.

Exercise: Overdispersion may be due to specification errors in the linear predictor. Check to see if we are missing a significant interaction term. Would including that term be enough to get the deviance below the five-percent critical value?