3.2 Estimation and Hypothesis Testing
The logistic regression model just developed is a
generalized linear model with binomial errors and logit link.
We can therefore rely on the general theory developed in
Appendix B to obtain estimates of the parameters and to test
hypotheses. In this section we summarize the most important
results needed in the applications.
3.2.1 Maximum Likelihood Estimation
Although you will probably use a statistical package to compute
the estimates, here is a brief description of the underlying
procedure.
The likelihood function for n independent binomial observations
is a product of densities given by Equation 3.3.
Taking logs we find that, except for a constant involving
the combinatorial terms, the log-likelihood function is

\log L(\beta) = \sum_i \{ y_i \log \pi_i + (n_i - y_i) \log(1 - \pi_i) \},

where π_i depends on the covariates x_i and a vector of p
parameters β through the logit transformation of Equation 3.9.
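As a quick concrete check (a sketch of our own, not part of the original derivation), the log-likelihood above can be evaluated directly; the function name and array-based interface below are illustrative:

```python
import numpy as np

def binomial_loglik(y, n, pi):
    """Binomial log-likelihood, omitting the combinatorial constant.

    y  : successes per group
    n  : binomial denominators
    pi : probabilities pi_i, one per group
    """
    y, n, pi = (np.asarray(a, float) for a in (y, n, pi))
    return np.sum(y * np.log(pi) + (n - y) * np.log(1 - pi))
```

Maximizing this function over β (with π_i tied to the covariates through the logit link) yields the maximum likelihood estimate.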
At this point we could take first and expected second derivatives
to obtain the score and information matrix and develop a
Fisher scoring procedure for maximizing the log-likelihood.
As shown in Appendix B, the procedure is equivalent to
iteratively reweighted least squares (IRLS).
Given a current estimate \hat\beta of the parameters,
we calculate the linear predictor \hat\eta_i = x_i'\hat\beta
and the fitted values \hat\mu_i = n_i \operatorname{logit}^{-1}(\hat\eta_i).
With these values we calculate the working dependent variable
z, which has elements

z_i = \hat\eta_i + (y_i - \hat\mu_i) \frac{n_i}{\hat\mu_i (n_i - \hat\mu_i)},

where n_i are the binomial denominators.
We then regress z on the covariates, calculating the weighted
least-squares estimate

\hat\beta = (X'WX)^{-1} X'Wz,

where W is a diagonal matrix of weights with entries

w_{ii} = \hat\mu_i (n_i - \hat\mu_i)/n_i.

(You may be interested to know that the weight is inversely proportional
to the estimated variance of the working dependent variable.)
The resulting estimate of β is used to obtain
improved fitted values and the procedure is iterated to convergence.
Suitable initial values can be obtained by applying the link
to the data. To avoid problems with counts of 0 or n_i (which
is always the case with individual zero-one data), we calculate
empirical logits adding 1/2 to both the numerator and
denominator, i.e. we calculate

z_i = \log \frac{y_i + 1/2}{n_i - y_i + 1/2},

and then regress this quantity on x_i to obtain an initial
estimate of β.
The resulting estimate is consistent and its large-sample
variance is given by

\operatorname{var}(\hat\beta) = (X'WX)^{-1},   (3.12)

where W is the matrix of weights evaluated in the last iteration.
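The whole IRLS procedure just described, including the empirical-logit starting values, can be sketched in a few lines (an illustrative implementation with hypothetical names; in practice one would rely on a statistical package):

```python
import numpy as np

def logit_irls(X, y, n, tol=1e-8, max_iter=25):
    """Fit a logistic regression by IRLS, as described in the text.

    X : (obs, p) model matrix, including a column of ones for the constant
    y : successes; n : binomial denominators
    Returns the estimate beta and its large-sample covariance (X'WX)^{-1}.
    """
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    n = np.asarray(n, float)
    # initial values: regress empirical logits (with the 1/2 correction) on X
    z = np.log((y + 0.5) / (n - y + 0.5))
    beta = np.linalg.lstsq(X, z, rcond=None)[0]
    for _ in range(max_iter):
        eta = X @ beta                              # linear predictor
        mu = n / (1.0 + np.exp(-eta))               # fitted values n_i * logit^{-1}(eta_i)
        w = mu * (n - mu) / n                       # diagonal entries of W
        z = eta + (y - mu) * n / (mu * (n - mu))    # working dependent variable
        XtW = X.T * w                               # X'W without forming the diagonal matrix
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    cov = np.linalg.inv((X.T * w) @ X)              # (X'WX)^{-1}
    return beta, cov
```

For an intercept-only model the procedure converges to the logit of the pooled proportion, which provides a simple sanity check.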
Alternatives to maximum likelihood estimation include
weighted least squares, which can be used with grouped data,
and a method that minimizes Pearson's chi-squared statistic,
which can be used with both grouped and individual data.
We will not consider these alternatives further.
3.2.2 Goodness of Fit Statistics
Suppose we have just fitted a model and want to assess how well
it fits the data.
A measure of discrepancy between observed and fitted values
is the deviance statistic, which is given by
D = 2 \sum_i \left\{ y_i \log\frac{y_i}{\hat\mu_i} + (n_i - y_i) \log\frac{n_i - y_i}{n_i - \hat\mu_i} \right\},   (3.13)

where y_i is the observed and \hat\mu_i is the fitted
value for the i-th observation. Note that this statistic
is twice a sum of `observed times log of observed over expected',
where the sum is over both successes and failures (i.e. we
compare both y_i and n_i - y_i with their expected values).
In a perfect fit the ratio of observed to expected is one and its
logarithm is zero, so the deviance is zero.
In Appendix B we show that this statistic may be constructed
as a likelihood ratio test that compares the model of interest
with a saturated model that has one parameter for each
observation.
With grouped data, the distribution of the deviance statistic
converges, as the group sizes n_i → ∞ for all i,
to a chi-squared distribution with n − p d.f.,
where n is the number of groups and p is the number
of parameters in the model, including the constant.
Thus, for reasonably large groups, the deviance provides a
goodness of fit test for the model.
With individual data the distribution of the deviance does
not converge to a chi-squared (or any other known) distribution,
so the deviance cannot be used as a goodness of fit test. We will, however,
consider other diagnostic tools that can be used with individual
data.
An alternative measure of goodness of fit is
Pearson's chi-squared statistic, which for binomial
data can be written as

\chi_P^2 = \sum_i \frac{(y_i - \hat\mu_i)^2}{\hat\mu_i (n_i - \hat\mu_i)/n_i}.

Note that each term in the sum is the squared difference
between the observed and fitted values y_i and \hat\mu_i,
divided by the variance of y_i, which is μ_i(n_i − μ_i)/n_i,
estimated using \hat\mu_i for μ_i.
This statistic can also be derived as a sum of `observed
minus expected squared over expected', where the sum
is over both successes and failures.
With grouped data Pearson's statistic has approximately, in
large samples, a chi-squared distribution with n − p d.f.,
and is asymptotically equivalent to the deviance or likelihood-ratio
chi-squared statistic. The statistic cannot be used as a goodness
of fit test with individual data, but provides a basis for
calculating residuals, as we shall see when we discuss
logistic regression diagnostics.
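Both statistics are easy to compute for grouped data. The sketch below (our own illustration; function names are hypothetical) follows Equation 3.13 and the Pearson formula above, using the convention 0 log 0 = 0 so that observed counts of zero contribute nothing to the deviance:

```python
import numpy as np

def _term(o, e):
    """o * log(o / e), with the convention 0 * log 0 = 0."""
    o = np.asarray(o, float)
    e = np.asarray(e, float)
    safe = np.where(o > 0, o, 1.0)          # avoid log(0) warnings
    return np.where(o > 0, o * np.log(safe / e), 0.0)

def deviance(y, n, mu):
    """Deviance D of Equation 3.13, summing over successes and failures."""
    y, n, mu = (np.asarray(a, float) for a in (y, n, mu))
    return 2.0 * np.sum(_term(y, mu) + _term(n - y, n - mu))

def pearson_chi2(y, n, mu):
    """Pearson's chi-squared: (y - mu)^2 over the estimated variance of y."""
    y, n, mu = (np.asarray(a, float) for a in (y, n, mu))
    return np.sum((y - mu) ** 2 / (mu * (n - mu) / n))
```

With a perfect fit the deviance is zero, and for moderate discrepancies the two statistics take similar values, reflecting their asymptotic equivalence.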
3.2.3 Tests of Hypotheses
Let us consider the problem of testing hypotheses in logit
models.
As usual, we can calculate Wald tests based on the large-sample
distribution of the m.l.e., which is approximately normal with
mean β and variance-covariance matrix as given in
Equation 3.12.
In particular, we can test the hypothesis H_0: β_j = 0
concerning the significance of a single coefficient by
calculating the ratio of the estimate to its standard error,

z = \hat\beta_j / \operatorname{se}(\hat\beta_j).

This statistic has approximately a standard normal distribution
in large samples.
Alternatively, we can treat the square of this statistic
as approximately a chi-squared with one d.f.
The Wald test can be used to calculate a confidence interval
for β_j. We can assert with 100(1 − α)% confidence
that the true parameter lies in the interval with boundaries

\hat\beta_j \pm z_{1-\alpha/2} \operatorname{se}(\hat\beta_j),

where z_{1−α/2} is the normal critical value for a two-sided
test of size α. Confidence intervals for effects in the
logit scale can be translated into confidence intervals for
odds ratios by exponentiating the boundaries.
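A small sketch (hypothetical function name, defaulting to a 95% interval) showing the Wald ratio, the confidence limits, and the exponentiated odds-ratio limits:

```python
import numpy as np

def wald_summary(beta_j, se_j, z_crit=1.959964):
    """Wald z ratio and 100(1 - alpha)% confidence limits for one
    coefficient, plus the corresponding odds-ratio limits.

    z_crit is the normal critical value z_{1 - alpha/2}; the default
    corresponds to alpha = 0.05 (a 95% interval).
    """
    z = beta_j / se_j
    lower = beta_j - z_crit * se_j
    upper = beta_j + z_crit * se_j
    return z, (lower, upper), (np.exp(lower), np.exp(upper))
```

Note that exponentiating the endpoints gives an interval for the odds ratio that is generally asymmetric around the point estimate.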
The Wald test can be applied to test hypotheses concerning
several coefficients by calculating the usual quadratic form.
This test can also be inverted to obtain confidence regions for
vector-valued parameters, but we will not consider this extension.
For more general problems we consider the likelihood ratio
test. The key to constructing these tests is the deviance statistic
introduced in the previous subsection. In a nutshell, the
likelihood ratio test to compare two nested models is based
on the difference between their deviances.
To fix ideas, consider partitioning the model matrix
and the vector of coefficients into two components

X = (X_1, X_2) \quad \text{and} \quad \beta = (\beta_1', \beta_2')',

with p_1 and p_2 elements, respectively. Consider
testing the hypothesis

H_0: \beta_2 = 0,

that the variables in X_2 have no effect on the response,
i.e. the joint significance of the coefficients in β_2.
Let D(X_{1}) denote the deviance of a model that includes
only the variables in X_{1} and let D(X_{1}+X_{2}) denote
the deviance of a model that includes all variables in X.
Then the difference

\chi^2 = D(X_1) - D(X_1 + X_2)

has approximately in large samples a chi-squared distribution
with p_2 d.f. Note that p_2 is the difference in the number
of parameters between the two models being compared.
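As a numerical illustration (the deviances below are made up, not taken from the text), the test simply differences the two deviances and refers the result to a chi-squared table with p_2 d.f.:

```python
# Hypothetical deviances from two nested logit fits (illustrative only):
D_reduced = 27.3   # D(X1): model with the variables in X1 only
D_full = 12.1      # D(X1 + X2): model with all variables

chi2 = D_reduced - D_full   # likelihood-ratio statistic

# Under H0 this is approximately chi-squared with p2 = 3 d.f.; the
# 5%-level critical value for 3 d.f. is about 7.815:
reject_h0 = chi2 > 7.815
```

Here the statistic of 15.2 exceeds the critical value, so in this made-up example the variables in X_2 would be judged jointly significant.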
The deviance plays a role similar to the residual sum of squares.
In fact, in Appendix B we show that in models for normally
distributed data the deviance is the residual sum of squares.
Likelihood ratio tests in generalized linear models are based on
scaled deviances, obtained by dividing the deviance by a scale factor.
In linear models the scale factor was σ^{2}, and
we had to divide the RSS's (or their difference) by an estimate
of σ^{2} in order to calculate the test criterion.
With binomial data the scale factor is one, and there is no need
to scale the deviances.
The Pearson chi-squared statistic in the previous subsection,
while providing an alternative goodness of fit test for grouped
data, cannot be used in general to compare nested models.
In particular, differences in deviances have chi-squared
distributions but differences in Pearson chi-squared statistics
do not. This is the main reason why in statistical modelling
we use the deviance or likelihood-ratio chi-squared statistic
rather than the more traditional Pearson chi-squared of
elementary statistics.