## 6.2 The Multinomial Logit Model

We now consider models for the probabilities p_{ij}.
In particular, we would like to consider models where
these probabilities depend on a vector **x**_{i} of covariates
associated with the i-th individual or group.
In terms of our example, we would like to model how the
probabilities of being sterilized, using another method or
using no method at all depend on the woman's age.

### 6.2.1 Multinomial Logits

Perhaps the simplest approach to multinomial data is to nominate one of the response categories as a baseline or reference cell, calculate log-odds for all other categories relative to the baseline, and then let the log-odds be a linear function of the predictors.

Typically we pick the *last* category as a baseline
and calculate the odds that a member of group i falls in
category j as opposed to the baseline as
p_{ij}/p_{iJ}. In our example we could look
at the odds of being sterilized rather than using no method,
and the odds of using another method rather than no method.
For women aged 45-49 these odds are 91:183 (or roughly 1 to 2)
and 10:183 (or 1 to 18).
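The arithmetic behind these odds can be checked with a few lines of Python. This is only an illustrative sketch: the three counts are the ones quoted above for women aged 45-49, and the variable names are my own.

```python
import math

# Counts for women aged 45-49, as quoted in the text.
sterilized, other, none = 91, 10, 183

# Odds of each category relative to the baseline (no method).
odds_ster = sterilized / none
odds_other = other / none

# Log-odds relative to the baseline, the quantities plotted as empirical logits.
logit_ster = math.log(odds_ster)
logit_other = math.log(odds_other)

print(f"ster. vs. none: odds {odds_ster:.3f} (about 1 to {1 / odds_ster:.0f}), "
      f"log-odds {logit_ster:.3f}")
print(f"other vs. none: odds {odds_other:.3f} (about 1 to {1 / odds_other:.0f}), "
      f"log-odds {logit_other:.3f}")
```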

Figure 6.1 shows the empirical log-odds of sterilization and other method (using no method as the reference category) plotted against the mid-points of the age groups. (Ignore for now the solid lines.) Note how the log-odds of sterilization increase rapidly with age to reach a maximum at 30-34 and then decline slightly. The log-odds of using other methods rise gently up to age 25-29 and then decline rapidly.

### 6.2.2 Modeling the Logits

In the multinomial logit model we assume that the log-odds of each response follow a linear model

$$ h_{ij} = \log\frac{p_{ij}}{p_{iJ}} = \alpha_j + \mathbf{x}_i'\boldsymbol{\beta}_j, \tag{6.3} $$

where α_{j} is a constant and β_{j} is a vector of regression coefficients, for j = 1, 2, ..., J-1. Note that we have written the constant explicitly, so we will assume henceforth that the model matrix **X** does not include a column of ones.

This model is analogous to a logistic regression model, except that the probability distribution of the response is multinomial instead of binomial and we have J-1 equations instead of one. The J-1 multinomial logit equations contrast each of categories 1, 2, ..., J-1 with category J, whereas the single logistic regression equation is a contrast between successes and failures. If J = 2 the multinomial logit model reduces to the usual logistic regression model.

Note that we need only J-1 equations to describe a variable
with J response categories and that it really makes no difference
which category we pick as the reference cell, because we can
always convert from one formulation to another.
In our example with J = 3 categories we contrast categories 1 versus 3
and 2 versus 3. The missing contrast between categories 1 and 2
can easily be obtained in terms of the other two, since
log(p_{i1}/p_{i2}) = log(p_{i1}/p_{i3}) - log(p_{i2}/p_{i3}).
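As a quick numerical check of this identity, the sketch below uses the observed proportions for women aged 45-49 (counts 91, 10 and 183 from the earlier example) in place of the model probabilities. The variable names are illustrative only.

```python
import math

# Observed proportions for women aged 45-49 (sterilized, other, none).
counts = [91, 10, 183]
n = sum(counts)
p1, p2, p3 = (c / n for c in counts)

# The two contrasts against the baseline category 3.
logit_13 = math.log(p1 / p3)
logit_23 = math.log(p2 / p3)

# The "missing" contrast, category 1 versus 2, recovered by subtraction.
logit_12_from_baseline = logit_13 - logit_23
logit_12_direct = math.log(p1 / p2)

print(logit_12_from_baseline, logit_12_direct)
```

The same subtraction works for any choice of reference cell, which is why the choice of baseline is a matter of convenience.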

Looking at Figure 6.1, it would appear that the logits are a quadratic function of age. We will therefore entertain the model

$$ h_{ij} = \log\frac{p_{ij}}{p_{iJ}} = \alpha_j + \beta_j a_i + \gamma_j a_i^2, \tag{6.4} $$

where a_{i} is the midpoint of the i-th age group and j = 1, 2 for sterilization and other method, respectively.

### 6.2.3 Modeling the Probabilities

The multinomial logit model may also be written in terms of the
original probabilities p_{ij} rather than the log-odds.
Starting from Equation 6.3
and adopting the convention that h_{iJ} = 0, we can write

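$$ p_{ij} = \frac{\exp\{h_{ij}\}}{\sum_{k=1}^{J} \exp\{h_{ik}\}} \tag{6.5} $$

for j = 1, ..., J. To verify this result, exponentiate Equation 6.3 to obtain p_{ij} = p_{iJ} exp{h_{ij}}, and note that the convention h_{iJ} = 0 makes this formula valid for all j. Next sum over j and use the fact that Σ_{j} p_{ij} = 1 to obtain p_{iJ} = 1/Σ_{j} exp{h_{ij}}. Finally, use this result on the formula for p_{ij}.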

Note that Equation 6.5 will automatically yield probabilities that add up to one for each i.
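A minimal sketch of this transformation, assuming the logits are supplied as a list of J-1 values relative to the last category. The function name `logits_to_probs` is hypothetical, and the logits used as input are the empirical ones for women aged 45-49 (counts 91, 10, 183), so the function should return the observed proportions.

```python
import math

def logits_to_probs(h):
    """Map J-1 logits (relative to the last category) to J probabilities."""
    full = list(h) + [0.0]                   # convention: baseline logit is 0
    m = max(full)                            # subtracting the max leaves the ratios
    expd = [math.exp(v - m) for v in full]   # unchanged but avoids overflow
    total = sum(expd)
    return [e / total for e in expd]

# Empirical logits for women aged 45-49, baseline = no method.
h = [math.log(91 / 183), math.log(10 / 183)]
probs = logits_to_probs(h)

print(probs)        # proportions sterilized, other method, no method
print(sum(probs))   # adds up to one, up to rounding error
```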

### 6.2.4 Maximum Likelihood Estimation

Estimation of the parameters of this model by maximum likelihood
proceeds by maximization of the multinomial likelihood (6.2)
with the probabilities p_{ij} viewed as functions of the
α_{j} and β_{j} parameters in Equation 6.3.
This usually requires numerical procedures,
and Fisher scoring or Newton-Raphson often work rather well.
Most statistical packages include a multinomial logit procedure.
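To give a flavor of the numerical procedure, here is a toy Newton-Raphson iteration for the simplest possible case: a single multinomial observation with three categories and constants only (no covariates). This is a sketch under those assumptions, not the procedure used by any particular package. In this special case the maximum likelihood estimates are known in closed form (they are the empirical logits), which gives us something to check against. The function name `fit_intercepts` is my own.

```python
import math

def fit_intercepts(counts, iters=25, tol=1e-10):
    """Newton-Raphson for a 3-category, constants-only multinomial logit.

    counts = (y1, y2, y3), with the last category as the baseline.
    Returns the estimates (alpha1, alpha2).
    """
    y1, y2, y3 = counts
    n = y1 + y2 + y3
    a1 = a2 = 0.0                      # start from equal probabilities
    for _ in range(iters):
        e1, e2 = math.exp(a1), math.exp(a2)
        denom = 1.0 + e1 + e2
        p1, p2 = e1 / denom, e2 / denom
        # Score vector: observed minus expected counts.
        g1, g2 = y1 - n * p1, y2 - n * p2
        # Information matrix n * (diag(p) - p p') for the two free categories.
        i11 = n * p1 * (1.0 - p1)
        i22 = n * p2 * (1.0 - p2)
        i12 = -n * p1 * p2
        det = i11 * i22 - i12 * i12
        # Newton step: solve the 2x2 system (information) * step = score.
        s1 = (i22 * g1 - i12 * g2) / det
        s2 = (i11 * g2 - i12 * g1) / det
        a1, a2 = a1 + s1, a2 + s2
        if max(abs(s1), abs(s2)) < tol:
            break
    return a1, a2

# Counts for women aged 45-49: sterilized, other method, no method.
alpha1, alpha2 = fit_intercepts((91, 10, 183))
print(alpha1, math.log(91 / 183))
print(alpha2, math.log(10 / 183))
```

With covariates the same idea applies, but the score and information matrix involve the model matrix and the system to solve at each step is larger.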

In terms of our example, fitting the quadratic multinomial logit model of Equation 6.4 leads to a deviance of 20.5 on 8 d.f. The associated P-value is 0.009, so we have significant lack of fit.

The quadratic age effect has an associated likelihood-ratio
χ^{2} of 500.6 on four d.f. (521.1 - 20.5 = 500.6 and 12 - 8 = 4),
and is highly significant. Note that we have accounted for
96% of the association between age and method choice
(500.6/521.1 = 0.96) using only four parameters.
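These figures are easy to reproduce. The sketch below uses only the standard library, relying on the closed form for the upper tail of a chi-squared distribution with an even number of degrees of freedom (both 8 and 4 are even). The helper `chi2_sf_even` is my own, and I am assuming, as the subtraction above implies, that 521.1 on 12 d.f. is the deviance of the model with no age effect.

```python
import math

def chi2_sf_even(x, df):
    """Upper-tail probability of a chi-squared variate with even d.f.

    Uses exp(-x/2) * sum_{k < df/2} (x/2)^k / k!, a closed form that
    holds only when df is an even positive integer.
    """
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs even, positive d.f.")
    half = x / 2.0
    terms = (half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * sum(terms)

deviance_quadratic, df_quadratic = 20.5, 8
deviance_no_age, df_no_age = 521.1, 12     # assumed: model without age

# Goodness of fit of the quadratic model.
p_lack_of_fit = chi2_sf_even(deviance_quadratic, df_quadratic)

# Likelihood-ratio test for the quadratic age effect.
lr_chi2 = deviance_no_age - deviance_quadratic
lr_df = df_no_age - df_quadratic
p_age_effect = chi2_sf_even(lr_chi2, lr_df)

# Share of the age-method association captured by the four parameters.
share_explained = lr_chi2 / deviance_no_age

print(f"lack of fit: P = {p_lack_of_fit:.4f}")
print(f"age effect:  chi2 = {lr_chi2:.1f} on {lr_df} d.f., P = {p_age_effect:.3g}")
print(f"share of association explained: {share_explained:.2f}")
```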

| Parameter | Contrast: Ster. vs. None | Contrast: Other vs. None |
|-----------|--------------------------|--------------------------|
| Constant  | -12.62                   | -4.552                   |
| Linear    | 0.7097                   | 0.2641                   |
| Quadratic | -0.009733                | -0.004758                |

Table 6.2 shows the parameter estimates for the two multinomial logit equations. I used these values to calculate fitted logits for each age from 17.5 to 47.5, and plotted these together with the empirical logits in Figure 6.1. The figure suggests that the lack of fit, though significant, is not a serious problem, except possibly for the 15-19 age group, where we overestimate the probability of sterilization.
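The fitted logits can be reproduced from the estimates in Table 6.2. The sketch below evaluates both equations at the mid-points of the seven age groups (rather than at every age) and converts the logits to fitted probabilities. The dictionary keys and function name are my own labels.

```python
import math

# Estimates from Table 6.2: (constant, linear, quadratic) for each contrast.
coef = {
    "ster_vs_none": (-12.62, 0.7097, -0.009733),
    "other_vs_none": (-4.552, 0.2641, -0.004758),
}

def fitted_logit(age, params):
    """Quadratic-in-age logit for one contrast."""
    const, linear, quadratic = params
    return const + linear * age + quadratic * age ** 2

midpoints = [17.5 + 5 * k for k in range(7)]   # 17.5, 22.5, ..., 47.5

fitted = {}
for age in midpoints:
    h1 = fitted_logit(age, coef["ster_vs_none"])
    h2 = fitted_logit(age, coef["other_vs_none"])
    denom = 1.0 + math.exp(h1) + math.exp(h2)
    # Fitted probabilities: sterilized, other method, no method.
    fitted[age] = (h1, h2, math.exp(h1) / denom, math.exp(h2) / denom, 1.0 / denom)
    print(f"{age:5.1f}  logits {h1:7.3f} {h2:7.3f}  "
          f"probs {fitted[age][2]:.3f} {fitted[age][3]:.3f} {fitted[age][4]:.3f}")
```

For the 45-49 group the fitted logits can be compared with the empirical values log(91/183) and log(10/183) computed from the counts quoted earlier.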

Under these circumstances, I would probably stick with the quadratic model because it does a reasonable job using very few parameters. However, I urge you to go the extra mile and try a cubic term. The model should pass the goodness of fit test. Are the fitted values reasonable?

### 6.2.5 The Equivalent Log-Linear Model*

Multinomial logit models may also be fit by maximum likelihood working with an equivalent log-linear model and the Poisson likelihood. (This section will only be of interest to readers interested in the equivalence between these models and may be omitted at first reading.)

Specifically, we treat the random counts Y_{ij}
as Poisson random variables with means μ_{ij}
satisfying the following log-linear model

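$$ \log \mu_{ij} = q_i + \alpha^*_j + \mathbf{x}_i'\boldsymbol{\beta}^*_j. \tag{6.6} $$

This model has three important features.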

First, the model includes a separate parameter q_{i}
for each multinomial observation, i.e. each individual or
group. This assures exact reproduction of the multinomial
denominators n_{i}. Note that these denominators are
fixed known quantities in the multinomial likelihood,
but are treated as random in the Poisson likelihood.
Making sure we get them right makes the issue
of conditioning moot.

Second, the model includes a separate parameter α^{*}_{j}
for each response category. This allows the counts to vary
by response category, permitting non-uniform margins.

Third, the model uses interaction terms **x**_{i}'β^{*}_{j} to
represent the effects of the covariates **x**_{i} on the
log-odds of response j.
Once again we have a 'step-up' situation,
where main effects in a logistic model become interactions
in the equivalent log-linear model.

The log-odds that observation i will fall in response category j relative to the last response category J can be calculated from Equation 6.6 as

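$$ \log\frac{\mu_{ij}}{\mu_{iJ}} = (\alpha^*_j - \alpha^*_J) + \mathbf{x}_i'(\boldsymbol{\beta}^*_j - \boldsymbol{\beta}^*_J). \tag{6.7} $$

This equation has the same form as the multinomial logit Equation 6.3, with α_{j} = α^{*}_{j} - α^{*}_{J} and β_{j} = β^{*}_{j} - β^{*}_{J}. Thus, the parameters in the multinomial logit model may be obtained as differences between the parameters in the corresponding log-linear model. Note that the q_{i} cancel out, and the restrictions needed for identification, namely h_{iJ} = 0, are satisfied automatically.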
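The cancellation is easy to see numerically. In the sketch below every parameter value is hypothetical, chosen only for illustration, and the covariate is a single number per observation. The code builds the Poisson log-means from log-linear parameters and checks that the implied log-odds depend only on differences of the starred parameters, not on the q_{i}.

```python
import math

# Hypothetical log-linear parameters for J = 3 response categories.
alpha_star = [0.4, -0.3, 0.1]
beta_star = [0.05, -0.02, 0.01]
J = len(alpha_star)

# Hypothetical observations: one nuisance parameter q_i and one covariate x_i each.
q = {1: 2.0, 2: 3.5}
x = {1: 1.0, 2: 4.0}

def log_mu(i, j):
    """Log of the Poisson mean for observation i and category j (0-based j)."""
    return q[i] + alpha_star[j] + x[i] * beta_star[j]

checks = []
for i in q:
    for j in range(J - 1):
        # Log-odds of category j versus the last category, from the Poisson means.
        from_poisson = log_mu(i, j) - log_mu(i, J - 1)
        # Multinomial logit parameters as differences of log-linear parameters.
        alpha_j = alpha_star[j] - alpha_star[J - 1]
        beta_j = beta_star[j] - beta_star[J - 1]
        from_logit = alpha_j + x[i] * beta_j
        checks.append(math.isclose(from_poisson, from_logit, abs_tol=1e-12))

print(all(checks))
```

Changing the values in `q` alters the expected counts but leaves every log-odds untouched, which is the sense in which these parameters only reproduce the multinomial denominators.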

In terms of our example, we can treat the counts in the original 7 × 3 table as 21 independent Poisson observations, and fit a log-linear model including the main effect of age (treated as a factor), the main effect of contraceptive use (treated as a factor) and the interactions between contraceptive use (a factor) and the linear and quadratic components of age:

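$$ \log \mu_{ij} = q_i + \alpha^*_j + \beta^*_j a_i + \gamma^*_j a_i^2. \tag{6.8} $$

In practice the interaction terms are formed as products of indicator variables for the method-choice categories with the mid-point a_{i} and the square of the mid-point a_{i}^{2} of each age group. Details are left as an exercise. (But see the Stata notes.)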
