4.3. A Model for Heteroscedastic Counts Table of Contents 5.2. Models for Three-Dimensional Tables

Chapter 5
Log-Linear Models for Contingency Tables

In this chapter we study the application of Poisson regression models to the analysis of contingency tables. This is perhaps one of the most popular applications of log-linear models, and is based on the existence of a very close relationship between the multinomial and Poisson distributions.

5.1  Models for Two-dimensional Tables

We start by considering the simplest possible contingency table: a two-by-two table. However, the concepts to be introduced apply equally well to more general two-way tables where we study the joint distribution of two categorical variables.

5.1.1  The Heart Disease Data

Table was taken from the Framingham longitudinal study of coronary heart disease (Cornfield, 1962; see also Fienberg, 1977). It shows 1329 patients cross-classified by the level or their serum cholesterol (below or above 260) and the presence or absence of heart disease.

Table 5.1: Serum Cholesterol and Heart Disease

Serum
Cholesterol
Heart DiseaseTotal
PresentAbsent
< 260519921043
260+41245286
Total9212371329

There are various sampling schemes that could have led to these data, with consequences for the probability model one would use, the types of questions one would ask, and the analytic techniques that would be employed. Yet, all schemes lead to equivalent analyses. We now explore several approaches to the analysis of these data.

5.1.2  The Multinomial Model

Our first approach will assume that the data were collected by sampling 1329 patients who were then classified according to cholesterol and heart disease. We view these variables as two responses, and we are interested in their joint distribution. In this approach the total sample size is assumed fixed, and all other quantities are considered random.

We will develop the random structure of the data in terms of the row and column variables, and then note what this implies for the counts themselves. Let C denote serum cholesterol and D denote heart disease, both discrete factors with two levels. More generally, we can imagine a row factor with I levels indexed by i and a column factor with J levels indexed by j, forming an I ×J table. In our example I = J = 2.

To describe the joint distribution of these two variables we let pij denote the probability that an observation falls in row i and column j of the table. In our example words, pij is the probability that serum cholesterol C takes the value i and heart disease D takes the value j. In symbols,

pij = Pr{ C = i, D = j},
(5.1)
for i = 1,2,,I and j = 1,2,,J. These probabilities completely describe the joint distribution of the two variables.

We can also consider the marginal distribution of each variable. Let pi. denote the probability that the row variable takes the value i, and let p.j denote the probability that the column variable takes the value j. In our example pi. and p.j represent the marginal distributions of serum cholesterol and heart disease. In symbols,

pi. = Pr{C = i}        and        p.j = Pr{D = j}.
(5.2)
Note that we use a dot as a placeholder for the omitted subscript.

The main hypothesis of interest with two responses is whether they are independent. By definition, two variables are independent if (and only if) their joint distribution is the product of the marginals. Thus, we can write the hypothesis of independence as

H0: pij = pi. p.j
(5.3)
for all i = 1,,I and j = 1,,J. The question now is how to estimate the parameters and how to test the hypothesis of independence.

The traditional approach to testing this hypothesis calculates expected counts under independence and compares observed and expected counts using Pearson's chi-squared statistic. We adopt a more formal approach that relies on maximum likelihood estimation and likelihood ratio tests. In order to implement this approach we consider the distribution of the counts in the table.

Suppose each of n observations is classified independently in one of the IJ cells in the table, and suppose the probability that an observation falls in the (i,j)-th cell is pij. Let Yij denote a random variable representing the number of observations in row i and column j of the table, and let yij denote its observed value. The joint distribution of the counts is then the multinomial distribution, with

Pr{Y = y} = n!
y11! y12! y21! y22!
p11y11 p12y12p21y21p22y22,
(5.4)
where Y is a random vector collecting all four counts and y is a vector of observed values. The term to the right of the fraction represents the probability of obtaining y11 observations in cell (1,1), y12 in cell (1,2), and so on. The fraction itself is a combinatorial term representing the number of ways of obtaining y11 observations in cell (1,1), y12 in cell (1,2), and so on, out of a total of n. The multinomial distribution is a direct extension of the binomial distribution to more than two response categories. In the present case we have four categories, which happen to represent a two-by-two structure. In the special case of only two categories the multinomial distribution reduces to the familiar binomial.

Taking logs and ignoring the combinatorial term, which does not depend on the parameters, we obtain the multinomial log-likelihood function, which for a general I×J table has the form

logL = I

i = 1 
J

j = 1 
yij log( pij ).
(5.5)
To estimate the parameters we need to take derivatives of the log-likelihood function with respect to the probabilities, but in doing so we must take into account the fact that the probabilities add up to one over the entire table. This restriction may be imposed by adding a Lagrange multiplier, or more simply by writing the last probability as the complement of all others. In either case, we find the unrestricted maximum likelihood estimate to be the sample proportion:
^
p
 

ij 
= yij
n
.
Substituting these estimates into the log-likelihood function gives its unrestricted maximum.

Under the hypothesis of independence in Equation 3, the joint probabilities depend on the margins. Taking derivatives with respect to pi. and p.j, and noting that these are also constrained to add up to one over the rows and columns, respectively, we find the m.l.e.'s

^
p
 

i. 
= yi.
n
       and        ^
p
 

.j 
= y.j
n
,
where yi. = j yij denotes the row totals and y.j denotes the column totals. Combining these estimates and multiplying by n to obtain expected counts gives
^
m
 

ij 
= yi. y.j
n
,
which is the familiar result from introductory statistics. In our example, the expected frequencies are
^
m
 

ij 
=



72.2
9970.8
19.8
266.2




.
Substituting these estimates into the log-likelihood function gives its maximum under the restrictions implied by the hypothesis of independence. To test this hypothesis, we calculate twice the difference between the unrestricted and restricted maxima of the log-likelihood function, to obtain the deviance or likelihood ratio test statistic
D = 2

i 


j 
yij log( yij
^
m

ij
).
(5.6)
Note that the numerator and denominator inside the log can be written in terms of estimated probabilities or counts, because the sample size n cancels out. Under the hypothesis of independence, this statistic has approximately in large samples a chi-squared distribution with (I-1)(J-1) d.f.

Going through these calculations for our example we obtain a deviance of 26.43 with one d.f. Comparison of observed and fitted counts in terms of Pearson's chi-squared statistic gives 31.08 with one d.f. Clearly, we reject the hypothesis of independence, concluding that heart disease and serum cholesterol level are associated.

5.1.3  The Poisson Model

An alternative model for the data in Table 5.1 is to treat the four counts as realizations of independent Poisson random variables. A possible physical model is to imagine that there are four groups of people, one for each cell in the table, and that members from each group arrive randomly at a hospital or medical center over a period of time, say for a health check. In this model the total sample size is not fixed in advance, and all counts are therefore random.

Under the assumption that the observations are independent, the joint distribution of the four counts is a product of Poisson distributions

Pr{Y = y} =

i 


j 
mijyij e-mij
yij!
.
(5.7)
Taking logs we obtain the usual Poisson log-likelihood from Chapter 4.

In terms of the systematic structure of the model, we could consider three log-linear models for the expected counts: the null model, the additive model and the saturated model. The null model would assume that all four kinds of patients arrive at the hospital or health center in the same numbers. The additive model would postulate that the arrival rates depend on the level of cholesterol and the presence or absence of heart disease, but not on the combination of the two. The saturated model would say that each group has its own rate or expected number of arrivals.

At this point you may try fitting the Poisson additive model to the four counts in Table 5.1, treating cholesterol and heart disease as factors or discrete predictors. You will discover that the deviance is 26.43 on one d.f. (four observations minus three parameters, the constant and the coefficients of two dummies representing cholesterol and heart disease). If you print the fitted values you will discover that they are exactly the same as in the previous subsection.

This result, of course, is not a coincidence. Testing the hypothesis of independence in the multinomial model is exactly equivalent to testing the goodness of fit of the Poisson additive model. A rigorous proof of this result is beyond the scope of these notes, but we can provide enough information to show that the result is intuitively reasonable and to understand when it can be used.

First, note that if the four counts have independent Poisson distributions, their sum is distributed Poisson with mean equal to the sum of the means. In symbols, if Yij ~ P(mij) then the total Y.. = ij Yij is distributed Poisson with mean m.. = i j mij. Further, the conditional distribution of the four counts given their total is multinomial with probabilities

pij = mij / n,
where we have used n for the observed total y.. = i,j yij. This result follows directly from the fact that the conditional distribution of the counts Y given their total Y.. can be obtained as the ratio of the joint distribution of the counts and the total (which is the same as the joint distribution of the counts, which imply the total) to the marginal distribution of the total. Dividing the joint distribution given in Equation 7 by the marginal, which is Poisson with mean m.., leads directly to the multinomial distribution in Equation 4.

Second, note that the systematic structure of the two models is the same. In the model of independence the joint probability is the product of the marginals, so taking logs we obtain

logpij = logpi. + logp.j,
which is exactly the structure of the additive Poisson model
logmij = h+ ai + bj.
In both cases the log of the expected count depends on the row and the column but not the combination of the two. In fact, it is only the constraints that differ between the two models. The multinomial model restricts the joint and marginal probabilities to add to one. The Poisson model uses the reference cell method and sets a1 = b1 = 0.

If the systematic and random structure of the two models are the same, then it should come as no surprise that they produce the same fitted values and lead to the same tests of hypotheses. There is only one aspect that we glossed over: the equivalence of the two distributions holds conditional on n, but in the Poisson analysis the total n is random and we have not conditioned on its value. Recall, however, that the Poisson model, by including the constant, reproduces exactly the sample total. It turns out that we don't need to condition on n because the model reproduces its exact value anyway.

The morale of this long-winded story is that we do not need to bother with multinomial models and can always resort to the equivalent Poisson model. While the gain is trivial in the case of a two-by-two table, it can be very significant as we move to cross-classifications involving three or more variables, particularly as we don't have to worry about maximizing the multinomial likelihood under constraints. The only trick we need to learn is how to translate the questions of independence that arise in the multinomial context into the corresponding log-linear models in the Poisson context.

5.1.4  The Product Binomial*

(On first reading you may wish to skip this subsection and the next and proceed directly to the discussion of three-dimensional tables in Section 5.2.)

There is a third sampling scheme that may lead to data such as Table 5.1. Suppose that a decision had been made to draw a sample of 1043 patients with low serum cholesterol and an independent sample of 286 patients with high serum cholesterol, and then examine the presence or absence of heart disease in each group.

Interest would then focus on the conditional distribution of heart disease given serum cholesterol level. Let pi denote the probability of heart disease at level i of serum cholesterol. In the notation of the previous subsections,

pi = Pr{ D = 1 | C = i } = pi1
pi.
,
where we have used the fact that the conditional probability of falling in column one given that you are in row i is the ratio of the joint probability pi1 of being in cell (i,1) to the marginal probability pi. of being in row i.

Under this scheme the row margin would be fixed in advance, so we would have n1 observations with low cholesterol and n2 with high. The number of cases with heart disease in category y of cholesterol, denoted Yi1, would then have a binomial distribution with parameters pi and ni independently for i = 1,2. The likelihood function would then be a product of two binomials:

Pr{Y = y} = n1!
y11! y12!
p1y11 (1-p1)y12 n2!
y21! y22!
p2y21 (1-p2)y22,
(5.8)
where we have retained double subscripts and written yi1 and yi2 instead of the more familiar yi and ni-yi to facilitate comparison with Equations 4 and 7.

The main hypothesis of interest would be the hypothesis of homogeneity, where the probability of heart disease is the same in the two groups:

Ho: p1 = p2.
To test this hypothesis you might consider fitting logistic regression models to the data, treating heart disease as the response and serum cholesterol as the predictor, and working with two observations representing the two groups. If you try this, you will discover that the deviance for the null model, which can be interpreted as a likelihood ratio test of the hypothesis of homogeneity, is 26.43 with one d.f., and coincides with the multinomial and Poisson deviances of the previous two subsections.

Again, this is no coincidence, because the random and systematic components of the models are equivalent. The product binomial distribution in Equation 8 can be obtained starting from the assumption that the four counts Yij are independent Poisson with means mij, and then conditioning on the totals Yi. = j Yij, which are Poisson with means mi. = j mij, for i = 1,2. Taking the ratio of the joint distribution of the counts to the marginal distribution of the two totals leads to the product binomial in Equation 8 with pi = mi1/mi..

Similarly, the hypothesis of homogeneity turns out to be equivalent to the hypothesis of independence and hence the additive log-linear model. To see this point note that if two variables are independent, then the conditional distribution of one given the other is the same as its marginal distribution. In symbols, if pij = pi. p.j then the conditional probability, which in general is pj|i = pij/pi., simplifies to pj|i = p.j, which does not depend on i. In terms of our example, under independence or homogeneity the conditional probability of heart disease is the same for the two cholesterol groups.

Again, note that the binomial and Poisson models are equivalent conditioning on the row margin, but in fitting the additive log-linear model we did not impose any conditions. Recall, however, that the Poisson model, by treating serum cholesterol as a factor, reproduces exactly the row margin of the table. Thus, it does not matter that we do not condition on the margin because the model reproduces its exact value anyway.

The importance of this result is that the results of our analyses are in fact independent of the sampling scheme.

Reassuringly, the results will be identical in all three cases, both in terms of fitted counts and in terms of the likelihood ratio statistic.

Note that if the total is fixed and the sampling scheme is multinomial we can always condition on a margin and use binomial models, the choice being up to the investigator. This choice will usually depend on whether one wishes to treat the two variables symmetrically, assuming they are both responses and studying their correlation, or asymmetrically, treating one as a predictor and the other as a response in a regression framework.

If the row margin is fixed and the sampling scheme is binomial then we must use the product binomial model, because we can not estimate the joint distribution of the two variables without further information.

5.1.5  The Hypergeometric Distribution*

There is a fourth distribution that could apply to the data in Table 5.1, namely the hypergeometric distribution. This distribution arises from treating both the row and column margins as fixed. I find it hard to imagine a sampling scheme that would lead to fixed margins, but one could use the following conditioning argument.

Suppose that the central purpose of the enquiry is the possible association between cholesterol and heart disease, as measured, for example, by the odds ratio. Clearly, the total sample size has no information about the odds ratio, so it would make sense to condition on it. Perhaps less obviously, the row and column margins carry very little information about the association between cholesterol and heart disease as measured by the odds ratio. It can therefore be argued that it makes good statistical sense to condition on both margins.

If we start from the assumption that the four counts are independent Poisson with means mij, and then condition on the margins Yi. and Y.j as well as the total Y.. (being careful to use Y1.,Y.1 and Y.. to maintain independence) we obtain the hypergeometric distribution, where

Pr{Y = y} = y.1!
y11! y21!
y.2!
y21! y22!
/ n!
y1.! y2.!
.
In small samples this distribution is the basis of the so-called Fisher's exact test for the two-by-two table. McCullagh and Nelder (1989, Sections 7.3-7.4) discuss a conditional likelihood ratio test that differs from the unconditional one. The question of whether one should use conditional or unconditional tests is still a matter of controversy, see for example Yates (1934, 1984). We will not consider the hypergeometric distribution further.


Continue with 5.2. Models for Three-Dimensional Tables
Copyright © Germán Rodríguez, 1993-2000. Please send feedback to grodri@princeton.edu
Conversion from LaTeX was done using TTH, version 2.34.