3.7 Other Choices of Link
All the models considered so far use the logit transformation
of the probabilities, but other choices are possible.
In fact, any transformation that maps probabilities into the
real line could be used to produce a generalized linear model,
as long as the transformation is one-to-one, continuous and differentiable.
In particular, suppose F(.) is the cumulative distribution
function (c.d.f.) of a random variable defined on the real line,
and write
$$\pi_i = F(\eta_i)$$
for -∞ < ηi < ∞.
Then we could use the inverse transformation
$$\eta_i = F^{-1}(\pi_i)$$
for 0 < πi < 1 as the link function.
Popular choices of c.d.f.'s in this context are the
normal, logistic and extreme value distributions.
In this section we motivate this general approach by
introducing models for binary data in terms of latent variables.
3.7.1 A Latent Variable Formulation
Let Yi denote a random variable representing
a binary response coded zero and one, as usual.
We will call Yi the manifest response.
Suppose that there is an unobservable continuous
random variable Y*i which can take any value in the real line,
and such that Yi takes the value one if and only if
Y*i exceeds a certain threshold θ.
We will call Y*i the latent response.
Figure 3.6 shows the relationship between
the latent variable and the response when the
threshold is zero.

Figure 3.6: Latent Variable and Manifest Response
The interpretation of Yi and Y*i depends on the context.
An economist, for example, may view Yi as a binary choice,
such as purchasing or renting a home,
and Y*i as the difference in the utilities of
purchasing and renting.
A psychologist may view Yi as a response to an item in an
attitude scale, such as agreeing or disagreeing with
school vouchers, and Y*i as the underlying attitude.
Biometricians often view Y*i as a dose and Yi as
a response, hence the name dose-response models.
Since a positive outcome occurs only when
the latent response exceeds the threshold, we can write the
probability πi of a positive outcome as
$$\pi_i = \Pr\{Y_i = 1\} = \Pr\{Y_i^* > \theta\}.$$
As often happens with latent variables, the location and
scale of Y*i are arbitrary. We can add a constant a
to both Y*i and the threshold θ, or multiply
both by a positive constant c, without changing the probability
of a positive outcome. To identify the model we take the
threshold to be zero, and standardize Y*i to have
standard deviation one (or any other fixed value).
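The invariance to location and scale can be checked numerically. The sketch below assumes, purely for illustration, that the latent variable is normally distributed; the function name `prob_positive` and all numerical values are hypothetical.

```python
from statistics import NormalDist

def prob_positive(mean, sd, threshold):
    """Pr{Y* > threshold} for a latent Y* ~ N(mean, sd^2) (illustrative assumption)."""
    return 1.0 - NormalDist(mean, sd).cdf(threshold)

# Hypothetical latent distribution and threshold.
mean, sd, theta = 0.4, 2.0, 1.0
a, c = 3.0, 0.5  # shift a and positive scale factor c

p_original = prob_positive(mean, sd, theta)
p_shifted = prob_positive(mean + a, sd, theta + a)      # add a to both
p_scaled = prob_positive(c * mean, c * sd, c * theta)   # multiply both by c
# Identified version: threshold zero and standard deviation one.
p_identified = prob_positive((mean - theta) / sd, 1.0, 0.0)

print(p_original, p_shifted, p_scaled, p_identified)
```

All four probabilities coincide up to rounding error, which is why fixing the threshold at zero and the standard deviation at one costs nothing.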
Suppose now that the outcome depends on a vector of covariates x.
To model this dependence we use an ordinary linear model
for the latent variable, writing
$$Y_i^* = x_i'\beta + U_i,$$
where β is a vector of coefficients of
the covariates xi and Ui is the error term,
assumed to have a distribution with c.d.f. F(u),
not necessarily the normal distribution.
Under this model, the probability πi of observing a positive
outcome is
$$\pi_i = \Pr\{Y_i^* > 0\} = \Pr\{U_i > -\eta_i\} = 1 - F(-\eta_i),$$
where ηi = xi'β is the linear predictor.
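A small simulation may help make the latent variable formulation concrete. The sketch below assumes standard logistic errors and a hypothetical linear predictor of 0.8; it generates latent responses, records how often they exceed zero, and compares that share with 1 - F(-ηi), the probability that Ui exceeds -ηi. The helper names are illustrative only.

```python
import math
import random

def logistic_cdf(u):
    return 1.0 / (1.0 + math.exp(-u))

def logistic_draw(rng):
    """Draw from the standard logistic distribution by inverting its c.d.f."""
    p = rng.random()
    while p == 0.0:  # random() can return exactly 0.0, which would give log(0)
        p = rng.random()
    return math.log(p / (1.0 - p))

rng = random.Random(2024)  # fixed seed so the run is reproducible
eta = 0.8                  # hypothetical value of the linear predictor
n = 100_000

# Manifest response: Y = 1 when the latent Y* = eta + U exceeds zero.
positives = sum(1 for _ in range(n) if eta + logistic_draw(rng) > 0)

simulated = positives / n
theoretical = 1.0 - logistic_cdf(-eta)  # 1 - F(-eta)
print(round(simulated, 3), round(theoretical, 3))
```

The simulated share should agree with the theoretical probability to about two decimal places at this sample size.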
If the distribution of the error term Ui is symmetric about zero,
so F(u) = 1-F(-u), we can write
$$\pi_i = F(\eta_i).$$
This expression defines a generalized linear model with
Bernoulli response and link
$$\eta_i = F^{-1}(\pi_i).$$
In the more general case where the distribution of the error
term is not necessarily symmetric, we still have a generalized
linear model with link
$$\eta_i = -F^{-1}(1 - \pi_i). \tag{3.17}$$
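The general link ηi = -F^{-1}(1 - πi) and its inverse πi = 1 - F(-ηi) can be built from any error distribution once its c.d.f. and quantile function are available. The following is a minimal sketch; `make_link` and `make_inverse_link` are hypothetical helper names, and the standard logistic distribution is used only as an example of a symmetric error.

```python
import math

def make_link(quantile):
    """Link eta = -F^{-1}(1 - pi), built from the error quantile function F^{-1}."""
    return lambda p: -quantile(1.0 - p)

def make_inverse_link(cdf):
    """Inverse link pi = 1 - F(-eta), built from the error c.d.f. F."""
    return lambda eta: 1.0 - cdf(-eta)

# Example error distribution: standard logistic (symmetric about zero).
def logistic_cdf(u):
    return 1.0 / (1.0 + math.exp(-u))

def logistic_quantile(p):
    return math.log(p / (1.0 - p))

link = make_link(logistic_quantile)
inverse_link = make_inverse_link(logistic_cdf)

p = 0.3
eta = link(p)
# For a symmetric distribution the general link reduces to F^{-1}(pi).
print(eta, logistic_quantile(p), inverse_link(eta))
```

With a symmetric error distribution the first two printed values agree, and the third recovers the original probability.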
We now consider some specific distributions.
3.7.2 Probit Analysis
The obvious choice of an error distribution is the normal.
Assuming that the error term has a standard normal distribution
Ui ~ N(0,1),
the results of the previous section lead to
$$\pi_i = \Phi(\eta_i),$$
where Φ is the standard normal c.d.f. The inverse
transformation, which gives the linear predictor as
a function of the probability,
$$\eta_i = \Phi^{-1}(\pi_i),$$
is called the probit.
It is instructive to consider the more general case where the error
term Ui ~ N(0,σ²) has a normal distribution with
variance σ².
Following the same steps as before we find that
$$
\begin{aligned}
\pi_i = \Pr\{Y_i^* > 0\} &= \Pr\{U_i > -x_i'\beta\} = \Pr\{U_i/\sigma > -x_i'\beta/\sigma\} \\
&= 1 - \Phi(-x_i'\beta/\sigma) = \Phi(x_i'\beta/\sigma),
\end{aligned}
$$
where we have divided by σ to obtain a standard normal variate,
and used the symmetry of the normal distribution to obtain the
last result.
This development shows that we cannot identify β and σ
separately, because the probability depends on them only
through their ratio β/σ. This is another way of
saying that the scale of the latent variable is not
identified. We therefore take σ = 1, or equivalently
interpret the β's in units of standard deviation of
the latent variable.
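The lack of identification is easy to verify numerically. In the sketch below, which uses hypothetical covariates and coefficients and an illustrative function name `probit_prob`, doubling both β and σ leaves the probability of a positive outcome unchanged.

```python
from statistics import NormalDist

def probit_prob(x, beta, sigma):
    """Pr{Y = 1} = Phi(x'beta / sigma) for latent errors U ~ N(0, sigma^2)."""
    eta = sum(xj * bj for xj, bj in zip(x, beta))
    return NormalDist().cdf(eta / sigma)

x = [1.0, 2.5]      # hypothetical covariates (a constant and one predictor)
beta = [-0.5, 0.3]  # hypothetical coefficients

p1 = probit_prob(x, beta, sigma=1.0)
# Doubling both beta and sigma leaves the ratio beta/sigma unchanged.
p2 = probit_prob(x, [2 * b for b in beta], sigma=2.0)
print(p1, p2)
```

Because the data carry information only about the ratio, any pair (cβ, cσ) with c > 0 fits equally well.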
As a simple example, consider fitting a probit model to the
contraceptive use data by age and desire for more children.
In view of the results in Section 3.5, we introduce a main
effect of wanting no more children, a linear effect of age,
and a linear age by desire interaction. Fitting this model
gives a deviance of 8.91 on four d.f. Estimates of the parameters
and standard errors appear in Table 3.16.
Table 3.16: Estimates for Probit Model of Contraceptive Use
With a Linear Age by Desire Interaction
| Parameter | Symbol | Estimate | Std. Error | z-ratio |
| Constant | α1 | -0.7297 | 0.0460 | -15.85 |
| Age | β1 | 0.0129 | 0.0061 | 2.13 |
| Desire | α2-α1 | 0.4572 | 0.0731 | 6.26 |
| Age × Desire | β2-β1 | 0.0305 | 0.0092 | 3.32 |
To interpret these results we imagine a latent continuous
variable representing the woman's motivation to use contraception
(or the utility of using contraception, compared to not using).
At the average age of 30.6, not wanting
more children increases the motivation to use contraception
by almost half a standard deviation. Each year of age
is associated with an increase in motivation of 0.01
standard deviations if she wants more children and 0.03
standard deviations more (for a total of 0.04) if she does not. In the next
section we compare these results with logit estimates.
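The estimates in Table 3.16 can also be translated into fitted probabilities. The sketch below assumes that age enters the model centered at the average age of 30.6, which is what the interpretation of the desire effect "at the average age" suggests; the function name `prob_use` is hypothetical.

```python
from statistics import NormalDist

# Estimates copied from Table 3.16.
constant, age_slope, desire, age_by_desire = -0.7297, 0.0129, 0.4572, 0.0305
MEAN_AGE = 30.6  # assumption: age is centered at the average age

def prob_use(age, wants_no_more):
    """Fitted probability of using contraception under the probit model."""
    d = 1.0 if wants_no_more else 0.0
    centered = age - MEAN_AGE
    eta = constant + age_slope * centered + d * (desire + age_by_desire * centered)
    return NormalDist().cdf(eta)

p_more = prob_use(30.6, wants_no_more=False)
p_no_more = prob_use(30.6, wants_no_more=True)
# Age slope on the latent scale for women who want no more children.
slope_no_more = age_slope + age_by_desire
print(round(p_more, 3), round(p_no_more, 3), round(slope_no_more, 4))
```

Under these assumptions the fitted probabilities at the average age come out at roughly 0.23 for women who want more children and 0.39 for those who do not, and the combined age slope is 0.0434, the "total of 0.04" mentioned in the text.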
A slight disadvantage of using the normal distribution as
a link for binary response models is that
the c.d.f. does not have a closed form, although excellent
numerical approximations and computer algorithms are available
for computing both the normal probability integral and its
inverse, the probit.
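As an illustration of how readily available these algorithms are, the Python standard library provides both the normal probability integral and its inverse through `statistics.NormalDist`. The wrapper names `probit` and `inverse_probit` below are ours, introduced only for this sketch.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def probit(p):
    """Probit link: the inverse of the standard normal c.d.f., for 0 < p < 1."""
    return std_normal.inv_cdf(p)

def inverse_probit(eta):
    """Standard normal c.d.f., mapping a linear predictor to a probability."""
    return std_normal.cdf(eta)

eta = probit(0.975)
print(round(eta, 2), round(inverse_probit(eta), 3))
```

The familiar value 1.96 appears as the probit of 0.975, and applying the c.d.f. recovers the probability.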
3.7.3 Logistic Regression
An alternative to the normal distribution is the standard
logistic distribution, whose shape is remarkably similar to the
normal distribution but has the advantage of a closed form
expression
$$\pi_i = F(\eta_i) = \frac{e^{\eta_i}}{1 + e^{\eta_i}},$$
for -∞ < ηi < ∞. The standard logistic distribution
is symmetric, has mean zero, and has variance π²/3.
The shape is very close to the normal, except that it has heavier
tails. The inverse transformation, which can be obtained by solving
for ηi in the expression above, is
$$\eta_i = F^{-1}(\pi_i) = \log\frac{\pi_i}{1 - \pi_i},$$
our good old friend, the logit.
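For readers who want the intermediate steps, solving the logistic c.d.f. for the linear predictor can be written out as follows, using the same symbols as above.

```latex
\begin{aligned}
\pi_i &= \frac{e^{\eta_i}}{1 + e^{\eta_i}} \\
\pi_i \,(1 + e^{\eta_i}) &= e^{\eta_i} \\
\pi_i &= e^{\eta_i}\,(1 - \pi_i) \\
e^{\eta_i} &= \frac{\pi_i}{1 - \pi_i} \\
\eta_i &= \log\frac{\pi_i}{1 - \pi_i}.
\end{aligned}
```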
Thus, coefficients in a logit regression model can be
interpreted not only in terms of log-odds, but also as
effects of the covariates on a latent variable that follows
a linear model with logistic errors.
The logit and probit transformations are almost linear functions
of each other for values of πi in the range from 0.1 to 0.9,
and therefore tend to give very similar results.
Comparison of probit and logit coefficients should take into account
the fact that the standard normal and the standard logistic
distributions have different variances.
Recall that with binary data we can only estimate the ratio β/σ.
In probit analysis we have implicitly set σ = 1.
In a logit model, by using a standard logistic error term,
we have effectively set σ = π/√3.
Thus, coefficients in a logit model should be standardized
by dividing by π/√3 before comparing them with
probit coefficients.

Figure 3.7: The Standardized Probit, Logit and C-Log-Log Links
Figure 3.7 compares the logit and probit links
(and a third link discussed below) after standardizing the
logits to unit variance. The solid line is the probit and the dotted
line is the logit divided by π/√3. As you can see,
they are barely distinguishable.
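The comparison in the figure can be reproduced numerically. The sketch below evaluates the probit and the logit divided by π/√3 on a grid of probabilities between 0.1 and 0.9 and reports the largest gap; the grid and the function names are illustrative choices.

```python
import math
from statistics import NormalDist

LOGISTIC_SD = math.pi / math.sqrt(3)  # standard deviation of the standard logistic

def probit(p):
    return NormalDist().inv_cdf(p)

def standardized_logit(p):
    """Logit rescaled so the underlying logistic error has unit variance."""
    return math.log(p / (1.0 - p)) / LOGISTIC_SD

grid = [i / 100 for i in range(10, 91)]  # probabilities 0.10, 0.11, ..., 0.90
max_gap = max(abs(probit(p) - standardized_logit(p)) for p in grid)

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, round(probit(p), 3), round(standardized_logit(p), 3))
print("largest gap on the grid:", round(max_gap, 3))
```

On this grid the two standardized links never differ by as much as 0.1, on a scale that runs from about -1.28 to 1.28.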
To illustrate the similarity of these links in practice, consider
our models of contraceptive use by age and desire for more
children in Tables 3.10 and 3.16. The
deviance of 9.14 for the logit model is very similar to the
deviance of 8.91 for the probit model, indicating an acceptable fit.
The Wald tests of individual coefficients are also very similar;
for example, the test for the effect of wanting no more children
at age 30.6 is 6.22 in the logit model and
6.26 in the probit model. The coefficients themselves look
somewhat different, but of course they are not standardized.
The effect of wanting no more children at the average age is
0.758 in the logit scale. Dividing by π/√3,
the standard deviation of the underlying logistic distribution,
we find this effect equivalent to an increase in the latent
variable of 0.417 standard deviations.
The probit analysis estimates the effect as 0.457 standard deviations.
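The standardization is a one-line calculation, sketched here with the rounded estimates quoted above; the variable names are ours.

```python
import math

logit_effect = 0.758                       # effect at the average age, logit scale
logistic_sd = math.pi / math.sqrt(3)       # about 1.814
standardized = logit_effect / logistic_sd  # effect in latent standard deviations
probit_effect = 0.457                      # probit estimate quoted in the text

print(round(logistic_sd, 3), round(standardized, 3), probit_effect)
```

Starting from the rounded coefficient 0.758 the result is about 0.418; the 0.417 reported in the text presumably comes from an unrounded estimate. Either way the standardized logit effect is close to the probit estimate of 0.457.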
3.7.4 The Complementary Log-Log Transformation
A third choice of link is the complementary log-log
transformation
$$\eta_i = \log(-\log(1 - \pi_i)),$$
which is the inverse of the c.d.f. of the extreme value
(or log-Weibull) distribution, with c.d.f.
$$F(\eta_i) = 1 - \exp\{-\exp\{\eta_i\}\}.$$
For small values of πi the complementary log-log
transformation is close to the logit. As the probability
increases, the transformation approaches infinity more
slowly than either the probit or logit.
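A short numerical comparison shows both features: the agreement with the logit at small probabilities and the slower growth at large ones. This is a minimal sketch; the function names are illustrative, and `log1p` and `expm1` are used only to keep the arithmetic accurate near zero.

```python
import math

def cloglog(p):
    """Complementary log-log link: log(-log(1 - p)), for 0 < p < 1."""
    return math.log(-math.log1p(-p))

def inverse_cloglog(eta):
    """Extreme value c.d.f.: 1 - exp(-exp(eta))."""
    return -math.expm1(-math.exp(eta))

def logit(p):
    return math.log(p / (1.0 - p))

for p in (0.001, 0.01, 0.1, 0.5, 0.9, 0.99):
    print(p, round(cloglog(p), 3), round(logit(p), 3))
```

At a probability of 0.01 the two transformations differ only in the third decimal place, whereas at 0.99 the logit is about three times as large as the complementary log-log.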
This particular choice of link function can also be obtained
from our general latent variable formulation if we
assume that -Ui (note the minus sign) has a standard
extreme value distribution, so the error term itself
has a reverse extreme value distribution, with c.d.f.
$$F(u) = \exp\{-\exp\{-u\}\}.$$
The reverse extreme value distribution is asymmetric,
with a long tail to the right.
It has mean equal to Euler's constant 0.577 and variance
π²/6 = 1.645. The median is -log log 2 = 0.367 and the
quartiles are -0.327 and 1.246.
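The quoted median and quartiles follow from the quantile function of the reverse extreme value distribution, which is obtained by solving p = exp{-exp{-u}} for u. A minimal check, with an illustrative function name, is:

```python
import math

def rev_extreme_value_quantile(p):
    """Quantile function of F(u) = exp(-exp(-u)), namely -log(-log p)."""
    return -math.log(-math.log(p))

median = rev_extreme_value_quantile(0.5)
lower_q = rev_extreme_value_quantile(0.25)
upper_q = rev_extreme_value_quantile(0.75)
variance = math.pi ** 2 / 6

print(round(median, 3), round(lower_q, 3), round(upper_q, 3), round(variance, 3))
```

The upper quartile lies farther from the median than the lower quartile does, which reflects the long right tail.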
Inverting the reverse extreme value c.d.f. and
applying Equation 3.17,
which is valid for both symmetric and asymmetric distributions,
we find that the link corresponding to this error
distribution is the complementary log-log.
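The inversion can be spelled out in two steps. Writing p for a value of the reverse extreme value c.d.f., we first solve for the quantile function and then apply the general link for asymmetric errors, ηi = -F^{-1}(1 - πi):

```latex
\begin{aligned}
p = \exp\{-\exp\{-u\}\}
  \;&\Longrightarrow\; -\log p = \exp\{-u\}
  \;\Longrightarrow\; u = F^{-1}(p) = -\log(-\log p), \\
\eta_i = -F^{-1}(1 - \pi_i) &= \log(-\log(1 - \pi_i)).
\end{aligned}
```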
Thus, coefficients in a generalized linear model with binary
response and a complementary log-log link can be interpreted
as effects of the covariates on a latent variable which
follows a linear model with reverse extreme value errors.
To compare these coefficients with estimates based on a
probit analysis we should standardize them,
dividing by π/√6.
To compare them with logit coefficients we can divide the logit
coefficients by √2, the ratio of the two standard deviations,
or standardize both c-log-log and logit coefficients.
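The scale factors involved are simple to compute. The sketch below, with variable names of our own choosing, evaluates the standard deviations of the logistic and reverse extreme value errors and their ratio.

```python
import math

logistic_sd = math.pi / math.sqrt(3)           # scale implicit in the logit link
rev_extreme_value_sd = math.pi / math.sqrt(6)  # scale implicit in the c-log-log link
ratio = logistic_sd / rev_extreme_value_sd     # equals the square root of 2

print(round(rev_extreme_value_sd, 3), round(ratio, 3))
```

Since the logistic error has the larger standard deviation, a logit coefficient divided by √2 (or, equivalently, a c-log-log coefficient multiplied by √2) puts the two sets of estimates on a common latent scale.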
Figure 3.7 compares the c-log-log link with the
probit and logit after standardizing it to have mean zero
and variance one. Although the c-log-log link differs from
the other two, one would need extremely large sample sizes
to be able to discriminate empirically between these links.
The complementary log-log transformation has a direct
interpretation in terms of hazard ratios, and thus has
practical applications in terms of hazard models,
as we shall see in the sequel.