![]() |
|
![]() | |||
|
|
|||||
In this chapter we study the application of Poisson regression models to the analysis of contingency tables. This is perhaps one of the most popular applications of log-linear models, and is based on the existence of a very close relationship between the multinomial and Poisson distributions.
We start by considering the simplest possible contingency table: a two-by-two table. However, the concepts to be introduced apply equally well to more general two-way tables where we study the joint distribution of two categorical variables.
Table was taken from the Framingham longitudinal study of coronary heart disease (Cornfield, 1962; see also Fienberg, 1977). It shows 1329 patients cross-classified by the level or their serum cholesterol (below or above 260) and the presence or absence of heart disease.
| Serum Cholesterol | Heart Disease | Total | |
| Present | Absent | ||
| < 260 | 51 | 992 | 1043 |
| 260+ | 41 | 245 | 286 |
| Total | 92 | 1237 | 1329 |
There are various sampling schemes that could have led to these data, with consequences for the probability model one would use, the types of questions one would ask, and the analytic techniques that would be employed. Yet, all schemes lead to equivalent analyses. We now explore several approaches to the analysis of these data.
Our first approach will assume that the data were collected by sampling 1329 patients who were then classified according to cholesterol and heart disease. We view these variables as two responses, and we are interested in their joint distribution. In this approach the total sample size is assumed fixed, and all other quantities are considered random.
We will develop the random structure of the data in terms of the row and column variables, and then note what this implies for the counts themselves. Let C denote serum cholesterol and D denote heart disease, both discrete factors with two levels. More generally, we can imagine a row factor with I levels indexed by i and a column factor with J levels indexed by j, forming an I ×J table. In our example I = J = 2.
To describe the joint distribution of these two variables we let pij denote the probability that an observation falls in row i and column j of the table. In our example words, pij is the probability that serum cholesterol C takes the value i and heart disease D takes the value j. In symbols,
| (5.1) |
We can also consider the marginal distribution of each variable. Let pi. denote the probability that the row variable takes the value i, and let p.j denote the probability that the column variable takes the value j. In our example pi. and p.j represent the marginal distributions of serum cholesterol and heart disease. In symbols,
| (5.2) |
The main hypothesis of interest with two responses is whether they are independent. By definition, two variables are independent if (and only if) their joint distribution is the product of the marginals. Thus, we can write the hypothesis of independence as
| (5.3) |
The traditional approach to testing this hypothesis calculates expected counts under independence and compares observed and expected counts using Pearson's chi-squared statistic. We adopt a more formal approach that relies on maximum likelihood estimation and likelihood ratio tests. In order to implement this approach we consider the distribution of the counts in the table.
Suppose each of n observations is classified independently in one of the IJ cells in the table, and suppose the probability that an observation falls in the (i,j)-th cell is pij. Let Yij denote a random variable representing the number of observations in row i and column j of the table, and let yij denote its observed value. The joint distribution of the counts is then the multinomial distribution, with
| (5.4) |
Taking logs and ignoring the combinatorial term, which does not depend on the parameters, we obtain the multinomial log-likelihood function, which for a general I×J table has the form
| (5.5) |
|
Under the hypothesis of independence in Equation 3, the joint probabilities depend on the margins. Taking derivatives with respect to pi. and p.j, and noting that these are also constrained to add up to one over the rows and columns, respectively, we find the m.l.e.'s
|
|
|
| (5.6) |
Going through these calculations for our example we obtain a deviance of 26.43 with one d.f. Comparison of observed and fitted counts in terms of Pearson's chi-squared statistic gives 31.08 with one d.f. Clearly, we reject the hypothesis of independence, concluding that heart disease and serum cholesterol level are associated.
An alternative model for the data in Table 5.1 is to treat the four counts as realizations of independent Poisson random variables. A possible physical model is to imagine that there are four groups of people, one for each cell in the table, and that members from each group arrive randomly at a hospital or medical center over a period of time, say for a health check. In this model the total sample size is not fixed in advance, and all counts are therefore random.
Under the assumption that the observations are independent, the joint distribution of the four counts is a product of Poisson distributions
| (5.7) |
In terms of the systematic structure of the model, we could consider three log-linear models for the expected counts: the null model, the additive model and the saturated model. The null model would assume that all four kinds of patients arrive at the hospital or health center in the same numbers. The additive model would postulate that the arrival rates depend on the level of cholesterol and the presence or absence of heart disease, but not on the combination of the two. The saturated model would say that each group has its own rate or expected number of arrivals.
At this point you may try fitting the Poisson additive model to the four counts in Table 5.1, treating cholesterol and heart disease as factors or discrete predictors. You will discover that the deviance is 26.43 on one d.f. (four observations minus three parameters, the constant and the coefficients of two dummies representing cholesterol and heart disease). If you print the fitted values you will discover that they are exactly the same as in the previous subsection.
This result, of course, is not a coincidence. Testing the hypothesis of independence in the multinomial model is exactly equivalent to testing the goodness of fit of the Poisson additive model. A rigorous proof of this result is beyond the scope of these notes, but we can provide enough information to show that the result is intuitively reasonable and to understand when it can be used.
First, note that if the four counts have independent Poisson distributions, their sum is distributed Poisson with mean equal to the sum of the means. In symbols, if Yij ~ P(mij) then the total Y.. = ij Yij is distributed Poisson with mean m.. = i j mij. Further, the conditional distribution of the four counts given their total is multinomial with probabilities
|
Second, note that the systematic structure of the two models is the same. In the model of independence the joint probability is the product of the marginals, so taking logs we obtain
|
|
If the systematic and random structure of the two models are the same, then it should come as no surprise that they produce the same fitted values and lead to the same tests of hypotheses. There is only one aspect that we glossed over: the equivalence of the two distributions holds conditional on n, but in the Poisson analysis the total n is random and we have not conditioned on its value. Recall, however, that the Poisson model, by including the constant, reproduces exactly the sample total. It turns out that we don't need to condition on n because the model reproduces its exact value anyway.
The morale of this long-winded story is that we do not need to bother with multinomial models and can always resort to the equivalent Poisson model. While the gain is trivial in the case of a two-by-two table, it can be very significant as we move to cross-classifications involving three or more variables, particularly as we don't have to worry about maximizing the multinomial likelihood under constraints. The only trick we need to learn is how to translate the questions of independence that arise in the multinomial context into the corresponding log-linear models in the Poisson context.
(On first reading you may wish to skip this subsection and the next and proceed directly to the discussion of three-dimensional tables in Section 5.2.)
There is a third sampling scheme that may lead to data such as Table 5.1. Suppose that a decision had been made to draw a sample of 1043 patients with low serum cholesterol and an independent sample of 286 patients with high serum cholesterol, and then examine the presence or absence of heart disease in each group.
Interest would then focus on the conditional distribution of heart disease given serum cholesterol level. Let pi denote the probability of heart disease at level i of serum cholesterol. In the notation of the previous subsections,
|
Under this scheme the row margin would be fixed in advance, so we would have n1 observations with low cholesterol and n2 with high. The number of cases with heart disease in category y of cholesterol, denoted Yi1, would then have a binomial distribution with parameters pi and ni independently for i = 1,2. The likelihood function would then be a product of two binomials:
| (5.8) |
The main hypothesis of interest would be the hypothesis of homogeneity, where the probability of heart disease is the same in the two groups:
|
Again, this is no coincidence, because the random and systematic components of the models are equivalent. The product binomial distribution in Equation 8 can be obtained starting from the assumption that the four counts Yij are independent Poisson with means mij, and then conditioning on the totals Yi. = j Yij, which are Poisson with means mi. = j mij, for i = 1,2. Taking the ratio of the joint distribution of the counts to the marginal distribution of the two totals leads to the product binomial in Equation 8 with pi = mi1/mi..
Similarly, the hypothesis of homogeneity turns out to be equivalent to the hypothesis of independence and hence the additive log-linear model. To see this point note that if two variables are independent, then the conditional distribution of one given the other is the same as its marginal distribution. In symbols, if pij = pi. p.j then the conditional probability, which in general is pj|i = pij/pi., simplifies to pj|i = p.j, which does not depend on i. In terms of our example, under independence or homogeneity the conditional probability of heart disease is the same for the two cholesterol groups.
Again, note that the binomial and Poisson models are equivalent conditioning on the row margin, but in fitting the additive log-linear model we did not impose any conditions. Recall, however, that the Poisson model, by treating serum cholesterol as a factor, reproduces exactly the row margin of the table. Thus, it does not matter that we do not condition on the margin because the model reproduces its exact value anyway.
The importance of this result is that the results of our analyses are in fact independent of the sampling scheme.
If the total is fixed in advance we can treat both the row and column factors as responses and test the hypothesis of independence using the multinomial likelihood.
Or we can treat all counts as random and fit an additive log-linear model using the Poisson likelihood.
Reassuringly, the results will be identical in all three cases, both in terms of fitted counts and in terms of the likelihood ratio statistic.
Note that if the total is fixed and the sampling scheme is multinomial we can always condition on a margin and use binomial models, the choice being up to the investigator. This choice will usually depend on whether one wishes to treat the two variables symmetrically, assuming they are both responses and studying their correlation, or asymmetrically, treating one as a predictor and the other as a response in a regression framework.
If the row margin is fixed and the sampling scheme is binomial then we must use the product binomial model, because we can not estimate the joint distribution of the two variables without further information.
There is a fourth distribution that could apply to the data in Table 5.1, namely the hypergeometric distribution. This distribution arises from treating both the row and column margins as fixed. I find it hard to imagine a sampling scheme that would lead to fixed margins, but one could use the following conditioning argument.
Suppose that the central purpose of the enquiry is the possible association between cholesterol and heart disease, as measured, for example, by the odds ratio. Clearly, the total sample size has no information about the odds ratio, so it would make sense to condition on it. Perhaps less obviously, the row and column margins carry very little information about the association between cholesterol and heart disease as measured by the odds ratio. It can therefore be argued that it makes good statistical sense to condition on both margins.
If we start from the assumption that the four counts are independent Poisson with means mij, and then condition on the margins Yi. and Y.j as well as the total Y.. (being careful to use Y1.,Y.1 and Y.. to maintain independence) we obtain the hypergeometric distribution, where
|
Continue with 5.2. Models for Three-Dimensional Tables
Copyright © Germán Rodríguez, 1993-2000.
Please send feedback to grodri@princeton.edu
Conversion from LaTeX was done using TTH, version 2.34.