
6.2  The Multinomial Logit Model

We now consider models for the probabilities $\pi_{ij}$. In particular, we would like to consider models where these probabilities depend on a vector $x_i$ of covariates associated with the i-th individual or group. In terms of our example, we would like to model how the probabilities of being sterilized, using another method, or using no method at all depend on the woman's age.

6.2.1  Multinomial Logits

Perhaps the simplest approach to multinomial data is to nominate one of the response categories as a baseline or reference cell, calculate log-odds for all other categories relative to the baseline, and then let the log-odds be a linear function of the predictors.

Typically we pick the last category as a baseline and calculate the odds that a member of group $i$ falls in category $j$ as opposed to the baseline as $\pi_{ij}/\pi_{iJ}$. In our example we could look at the odds of being sterilized rather than using no method, and the odds of using another method rather than no method. For women aged 45-49 these odds are 91:183 (or roughly 1 to 2) and 10:183 (or roughly 1 to 18).
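As a quick check of this arithmetic, the following sketch computes the odds and log-odds for women aged 45-49 from the three counts quoted above (91 sterilized, 10 using another method, 183 using no method). The variable names are illustrative only.

```python
import math

# Counts for women aged 45-49, as quoted in the text
sterilized, other, none = 91, 10, 183

# Odds of each method relative to the baseline category (no method)
odds_ster = sterilized / none    # close to 0.5, i.e. roughly 1 to 2
odds_other = other / none        # roughly 1 to 18

# Empirical log-odds (logits) relative to the baseline
logit_ster = math.log(odds_ster)
logit_other = math.log(odds_other)

print(f"odds:     {odds_ster:.3f}  {odds_other:.3f}")
print(f"log-odds: {logit_ster:.3f}  {logit_other:.3f}")
```

Both log-odds are negative because in this age group each method is less common than using no method.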

Figure 6.1: Log-Odds of Sterilization vs. No Method and
Other Method vs. No Method, by Age

Figure 6.1 shows the empirical log-odds of sterilization and other method (using no method as the reference category) plotted against the mid-points of the age groups. (Ignore for now the solid lines.) Note how the log-odds of sterilization increase rapidly with age to reach a maximum at 30-34 and then decline slightly. The log-odds of using other methods rise gently up to age 25-29 and then decline rapidly.

6.2.2  Modeling the Logits

In the multinomial logit model we assume that the log-odds of each response follow a linear model

$$\eta_{ij} = \log\frac{\pi_{ij}}{\pi_{iJ}} = \alpha_j + x_i'\beta_j, \tag{6.3}$$

where $\alpha_j$ is a constant and $\beta_j$ is a vector of regression coefficients, for $j = 1, 2, \ldots, J-1$. Note that we have written the constant explicitly, so we will assume henceforth that the model matrix $X$ does not include a column of ones.

This model is analogous to a logistic regression model, except that the probability distribution of the response is multinomial instead of binomial and we have $J-1$ equations instead of one. The $J-1$ multinomial logit equations contrast each of categories $1, 2, \ldots, J-1$ with category $J$, whereas the single logistic regression equation is a contrast between successes and failures. If $J = 2$ the multinomial logit model reduces to the usual logistic regression model.

Note that we need only $J-1$ equations to describe a variable with $J$ response categories and that it really makes no difference which category we pick as the reference cell, because we can always convert from one formulation to another. In our example with $J = 3$ categories we contrast categories 1 versus 3 and 2 versus 3. The missing contrast between categories 1 and 2 can easily be obtained in terms of the other two, since $\log(\pi_{i1}/\pi_{i2}) = \log(\pi_{i1}/\pi_{i3}) - \log(\pi_{i2}/\pi_{i3})$.
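The conversion between baselines can be checked numerically. The sketch below uses the empirical counts for women aged 45-49 quoted earlier (91, 10 and 183) in place of model probabilities. The same identity holds for fitted logits, since the denominators cancel either way.

```python
import math

# Counts for women aged 45-49: sterilized, other method, no method
y1, y2, y3 = 91, 10, 183

# Logits with category 3 (no method) as the baseline
logit_1_vs_3 = math.log(y1 / y3)
logit_2_vs_3 = math.log(y2 / y3)

# The contrast between categories 1 and 2, obtained two ways
direct = math.log(y1 / y2)
from_baseline = logit_1_vs_3 - logit_2_vs_3

print(direct, from_baseline)
```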

Looking at Figure 6.1, it would appear that the logits are a quadratic function of age. We will therefore entertain the model

$$\eta_{ij} = \alpha_j + \beta_j a_i + \gamma_j a_i^2, \tag{6.4}$$

where $a_i$ is the midpoint of the i-th age group and $j = 1, 2$ for sterilization and other method, respectively.

6.2.3  Modeling the Probabilities

The multinomial logit model may also be written in terms of the original probabilities $\pi_{ij}$ rather than the log-odds. Starting from Equation 6.3 and adopting the convention that $\eta_{iJ} = 0$, we can write

$$\pi_{ij} = \frac{\exp\{\eta_{ij}\}}{\sum_{k=1}^{J} \exp\{\eta_{ik}\}} \tag{6.5}$$

for $j = 1, \ldots, J$. To verify this result, exponentiate Equation 6.3 to obtain $\pi_{ij} = \pi_{iJ} \exp\{\eta_{ij}\}$, and note that the convention $\eta_{iJ} = 0$ makes this formula valid for all $j$. Next sum over $j$ and use the fact that $\sum_j \pi_{ij} = 1$ to obtain $\pi_{iJ} = 1/\sum_j \exp\{\eta_{ij}\}$. Finally, use this result on the formula for $\pi_{ij}$.

Note that Equation 6.5 will automatically yield probabilities that add up to one for each i.
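A minimal sketch of this conversion from logits to probabilities follows. The function name `multinomial_probs` and the logit values passed to it are hypothetical and are not estimates from these notes. Subtracting the largest logit before exponentiating is a common numerical safeguard and does not change the result.

```python
import math

def multinomial_probs(eta):
    """Turn J-1 logits (relative to the last category) into J probabilities.

    `eta` holds the logits for categories 1, ..., J-1; the baseline logit,
    fixed at zero by convention, is appended here.
    """
    eta_full = list(eta) + [0.0]
    # Subtracting the maximum leaves all ratios unchanged and avoids overflow
    m = max(eta_full)
    expo = [math.exp(e - m) for e in eta_full]
    total = sum(expo)
    return [e / total for e in expo]

# Hypothetical logits for one group (illustrative values only)
p = multinomial_probs([-0.7, -2.9])
print(p, sum(p))
```

Taking `math.log(p[0] / p[2])` recovers the first logit, which is one way to confirm that the two formulations carry the same information.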

6.2.4  Maximum Likelihood Estimation

Estimation of the parameters of this model by maximum likelihood proceeds by maximization of the multinomial likelihood (6.2) with the probabilities $\pi_{ij}$ viewed as functions of the $\alpha_j$ and $\beta_j$ parameters in Equation 6.3. This usually requires numerical procedures, and Fisher scoring or Newton-Raphson often work rather well. Most statistical packages include a multinomial logit procedure.
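To give a feel for the iteration, here is a deliberately small sketch: Newton-Raphson for a model with constants only and J = 3 categories, applied to the counts for women aged 45-49 quoted earlier (91, 10, 183). In this special case the maximum likelihood estimates have the closed form log(y_j/y_J), so the iteration is shown only for illustration. With covariates the same steps apply, with larger score vectors and information matrices. The function names are hypothetical.

```python
import math

def probs(alpha):
    """Probabilities for J = 3 categories, with the baseline logit fixed at 0."""
    expo = [math.exp(a) for a in alpha] + [1.0]
    total = sum(expo)
    return [e / total for e in expo]

def newton_raphson(y, iterations=25):
    """Maximize the multinomial log-likelihood over the two constants."""
    n = sum(y)
    alpha = [0.0, 0.0]
    for _ in range(iterations):
        p = probs(alpha)
        # Score vector: observed minus expected counts
        u1, u2 = y[0] - n * p[0], y[1] - n * p[1]
        # Information matrix n * (diag(p) - p p'), restricted to j = 1, 2
        a = n * p[0] * (1 - p[0])
        b = -n * p[0] * p[1]
        d = n * p[1] * (1 - p[1])
        det = a * d - b * b
        # Newton step: solve the 2 x 2 system using the explicit inverse
        alpha[0] += (d * u1 - b * u2) / det
        alpha[1] += (a * u2 - b * u1) / det
    return alpha

y = [91, 10, 183]
alpha_hat = newton_raphson(y)
print(alpha_hat)
print([math.log(y[0] / y[2]), math.log(y[1] / y[2])])  # closed form
```

Because the logit is the canonical link here, the observed and expected information coincide, so Newton-Raphson and Fisher scoring take identical steps in this sketch.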

In terms of our example, fitting the quadratic multinomial logit model of Equation 6.4 leads to a deviance of 20.5 on 8 d.f. The associated P-value is 0.009, so we have significant lack of fit.

The quadratic age effect has an associated likelihood-ratio $\chi^2$ of 500.6 on four d.f. (521.1 - 20.5 = 500.6 and 12 - 8 = 4), and is highly significant. Note that we have accounted for 96% of the association between age and method choice (500.6/521.1 = 0.96) using only four parameters.
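The figures quoted in the last two paragraphs can be reproduced with a few lines of arithmetic. The sketch below uses the closed-form upper tail of the chi-squared distribution, which is valid only for even degrees of freedom. The helper name `chi2_sf_even_df` is hypothetical. The deviance of 521.1 on 12 d.f. is the one implied by the subtraction in the text.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper tail P(X > x) for a chi-squared variable with even d.f.

    Uses exp(-x/2) * sum_{k < df/2} (x/2)^k / k!, a closed form that
    holds only when df is a positive even integer.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires positive even d.f.")
    half = x / 2.0
    terms = (half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * sum(terms)

deviance_quadratic, df_quadratic = 20.5, 8
deviance_null, df_null = 521.1, 12

# Goodness of fit of the quadratic model
p_lack_of_fit = chi2_sf_even_df(deviance_quadratic, df_quadratic)

# Likelihood-ratio test of the quadratic age effect
lr_chi2 = deviance_null - deviance_quadratic
lr_df = df_null - df_quadratic
share_explained = lr_chi2 / deviance_null

print(round(p_lack_of_fit, 3), round(lr_chi2, 1), lr_df, round(share_explained, 2))
```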

Table 6.2: Parameter Estimates for Multinomial Logit Model
Fitted to Contraceptive Use Data

Ster. vs. None    Other vs. None

Table 6.2 shows the parameter estimates for the two multinomial logit equations. I used these values to calculate fitted logits for each age from 17.5 to 47.5, and plotted these together with the empirical logits in Figure 6.1. The figure suggests that the lack of fit, though significant, is not a serious problem, except possibly for the 15-19 age group, where we overestimate the probability of sterilization.

Under these circumstances, I would probably stick with the quadratic model because it does a reasonable job using very few parameters. However, I urge you to go the extra mile and try a cubic term. The model should pass the goodness of fit test. Are the fitted values reasonable?

6.2.5  The Equivalent Log-Linear Model*

Multinomial logit models may also be fit by maximum likelihood working with an equivalent log-linear model and the Poisson likelihood. (This section will only be of interest to readers interested in the equivalence between these models and may be omitted at first reading.)

Specifically, we treat the random counts $Y_{ij}$ as Poisson random variables with means $\mu_{ij}$ satisfying the following log-linear model

$$\log\mu_{ij} = \eta + \theta_i + \alpha^*_j + x_i'\beta^*_j, \tag{6.6}$$

where the parameters satisfy the usual constraints for identifiability. There are three important features of this model:

First, the model includes a separate parameter $\theta_i$ for each multinomial observation, i.e. each individual or group. This assures exact reproduction of the multinomial denominators $n_i$. Note that these denominators are fixed known quantities in the multinomial likelihood, but are treated as random in the Poisson likelihood. Making sure we get them right makes the issue of conditioning moot.

Second, the model includes a separate parameter $\alpha^*_j$ for each response category. This allows the counts to vary by response category, permitting non-uniform margins.

Third, the model uses interaction terms $x_i'\beta^*_j$ to represent the effects of the covariates $x_i$ on the log-odds of response $j$. Once again we have a 'step-up' situation, where main effects in a logistic model become interactions in the equivalent log-linear model.

The log-odds that observation $i$ will fall in response category $j$ relative to the last response category $J$ can be calculated from Equation 6.6 as

$$\log(\mu_{ij}/\mu_{iJ}) = (\alpha^*_j - \alpha^*_J) + x_i'(\beta^*_j - \beta^*_J).$$

This equation is identical to the multinomial logit Equation 6.3 with $\alpha_j = \alpha^*_j - \alpha^*_J$ and $\beta_j = \beta^*_j - \beta^*_J$. Thus, the parameters in the multinomial logit model may be obtained as differences between the parameters in the corresponding log-linear model. Note that the $\theta_i$ cancel out, and the restrictions needed for identification, namely $\eta_{iJ} = 0$, are satisfied automatically.
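The cancellation can be illustrated numerically. In the sketch below all parameter values are hypothetical, chosen only for the demonstration, and a single covariate stands in for the vector of covariates. The log-odds computed from the Poisson means do not change when the observation-specific parameter changes, and they agree with the logits implied by the differenced parameters.

```python
import math

# Hypothetical log-linear parameters for J = 3 categories and one covariate
eta0 = 1.0
alpha_star = [0.4, -0.9, 0.2]
beta_star = [0.05, -0.02, 0.01]

def poisson_means(theta_i, x_i):
    """Poisson means for one observation under the log-linear model."""
    return [math.exp(eta0 + theta_i + a + x_i * b)
            for a, b in zip(alpha_star, beta_star)]

def logits_vs_last(mu):
    """Log-odds of each category relative to the last one."""
    return [math.log(m / mu[-1]) for m in mu[:-1]]

x_i = 30.0
logits_a = logits_vs_last(poisson_means(theta_i=0.0, x_i=x_i))
logits_b = logits_vs_last(poisson_means(theta_i=2.5, x_i=x_i))

# Multinomial logit parameters as differences from the last category
alpha = [a - alpha_star[-1] for a in alpha_star[:-1]]
beta = [b - beta_star[-1] for b in beta_star[:-1]]
implied = [a + x_i * b for a, b in zip(alpha, beta)]

print(logits_a, logits_b, implied)
```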

In terms of our example, we can treat the counts in the original 7 × 3 table as 21 independent Poisson observations, and fit a log-linear model including the main effect of age (treated as a factor), the main effect of contraceptive use (treated as a factor) and the interactions between contraceptive use (a factor) and the linear and quadratic components of age:

$$\log\mu_{ij} = \eta + \theta_i + \alpha^*_j + \beta^*_j a_i + \gamma^*_j a_i^2.$$

In practical terms this requires including six dummy variables representing the age groups, two dummy variables representing the method choice categories, and a total of four interaction terms, obtained as the products of the method choice dummies by the mid-point $a_i$ and the square of the mid-point $a_i^2$ of each age group. Details are left as an exercise. (But see the Stata notes.)
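One way to lay out the model matrix for this log-linear fit is sketched below. It builds only the 21 rows of predictors, not the counts, and assumes that the first age group and the "no method" category serve as reference cells. The choice of reference age group is an assumption made here and does not affect the fit. The function name `design_row` is hypothetical.

```python
# Midpoints of the seven five-year age groups, 17.5 to 47.5
midpoints = [17.5 + 5 * i for i in range(7)]
methods = ["sterilization", "other", "none"]  # "none" is the reference cell

def design_row(i, j):
    """Predictors for age group i (0-6) and method j (0-2)."""
    a = midpoints[i]
    row = [1.0]                                            # constant
    row += [1.0 if i == g else 0.0 for g in range(1, 7)]   # six age dummies
    method_dummies = [1.0 if j == m else 0.0 for m in range(2)]
    row += method_dummies                                  # two method dummies
    row += [d * a for d in method_dummies]                 # method by age
    row += [d * a * a for d in method_dummies]             # method by age squared
    return row

X = [design_row(i, j) for i in range(7) for j in range(len(methods))]
n_obs, n_par = len(X), len(X[0])
residual_df = n_obs - n_par

print(n_obs, n_par, residual_df)
```

With 21 observations and 13 parameters the residual degrees of freedom come to 8, the same as the 8 d.f. quoted for the deviance of the quadratic multinomial logit model.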
