
3.2  Estimation and Hypothesis Testing

The logistic regression model just developed is a generalized linear model with binomial errors and link logit. We can therefore rely on the general theory developed in Appendix B to obtain estimates of the parameters and to test hypotheses. In this section we summarize the most important results needed in the applications.

3.2.1  Maximum Likelihood Estimation

Although you will probably use a statistical package to compute the estimates, here is a brief description of the underlying procedure. The likelihood function for n independent binomial observations is a product of densities given by Equation 3.3. Taking logs we find that, except for a constant involving the combinatorial terms, the log-likelihood function is

$$\log L(\beta) = \sum_i \left\{\, y_i \log(\pi_i) + (n_i - y_i)\log(1-\pi_i) \,\right\},$$

where $\pi_i$ depends on the covariates $x_i$ and a vector of $p$ parameters $\beta$ through the logit transformation of Equation 3.9.

At this point we could take first and expected second derivatives to obtain the score and information matrix and develop a Fisher scoring procedure for maximizing the log-likelihood. As shown in Appendix B, the procedure is equivalent to iteratively re-weighted least squares (IRLS). Given a current estimate $\hat{\beta}$ of the parameters, we calculate the linear predictor $\hat{\eta}_i = x_i'\hat{\beta}$ and the fitted values $\hat{\mu}_i = n_i\operatorname{logit}^{-1}(\hat{\eta}_i)$. With these values we calculate the working dependent variable $z$, which has elements

$$z_i = \hat{\eta}_i + \frac{(y_i - \hat{\mu}_i)\, n_i}{\hat{\mu}_i\,(n_i - \hat{\mu}_i)},$$
where $n_i$ are the binomial denominators. We then regress $z$ on the covariates, calculating the weighted least-squares estimate

$$\hat{\beta} = (X'WX)^{-1}X'Wz,$$

where $W$ is a diagonal matrix of weights with entries

$$w_{ii} = \hat{\mu}_i\,(n_i - \hat{\mu}_i)/n_i.$$
(You may be interested to know that the weight is inversely proportional to the estimated variance of the working dependent variable.) The resulting estimate of β is used to obtain improved fitted values and the procedure is iterated to convergence.

Suitable initial values can be obtained by applying the link to the data. To avoid problems with counts of $0$ or $n_i$ (which is always the case with individual zero-one data), we calculate empirical logits adding $1/2$ to both the numerator and denominator, i.e. we calculate

$$z_i = \log\frac{y_i + 1/2}{n_i - y_i + 1/2},$$

and then regress this quantity on $x_i$ to obtain an initial estimate of $\beta$.
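The IRLS procedure just described can be condensed into a short sketch. This is an illustration only, not the notes' implementation: the function name `fit_logit_irls` and its interface are hypothetical, and in practice one would use a packaged routine (e.g. `glm` in R or Stata's `logit`/`blogit`).

```python
import numpy as np

def fit_logit_irls(X, y, n, tol=1e-8, max_iter=25):
    """Fit a binomial logit model by IRLS (Fisher scoring).

    X: (obs x p) model matrix; y: successes; n: binomial denominators.
    Returns the estimates and their large-sample covariance (X'WX)^{-1}.
    """
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    n = np.asarray(n, float)
    # Initial linear predictor: empirical logits with 1/2 added to
    # both numerator and denominator, as in Section 3.2.1.
    eta = np.log((y + 0.5) / (n - y + 0.5))
    for _ in range(max_iter):
        mu = n / (1.0 + np.exp(-eta))             # fitted values n_i * logit^{-1}(eta_i)
        w = mu * (n - mu) / n                     # weights w_ii = mu(n - mu)/n
        z = eta + (y - mu) * n / (mu * (n - mu))  # working dependent variable
        XtW = X.T * w                             # X'W, exploiting diagonal W
        beta = np.linalg.solve(XtW @ X, XtW @ z)  # weighted least-squares step
        eta_new = X @ beta
        if np.max(np.abs(eta_new - eta)) < tol:
            eta = eta_new
            break
        eta = eta_new
    mu = n / (1.0 + np.exp(-eta))
    w = mu * (n - mu) / n
    cov = np.linalg.inv((X.T * w) @ X)            # var(beta-hat) = (X'WX)^{-1}
    return beta, cov
```

Note that the diagonal weight matrix never needs to be formed explicitly: multiplying the columns of $X'$ elementwise by the weights is equivalent.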

The resulting estimate is consistent and its large-sample variance is given by

$$\operatorname{var}(\hat{\beta}) = (X'WX)^{-1}, \tag{3.12}$$

where $W$ is the matrix of weights evaluated in the last iteration.

Alternatives to maximum likelihood estimation include weighted least squares, which can be used with grouped data, and a method that minimizes Pearson's chi-squared statistic, which can be used with both grouped and individual data. We will not consider these alternatives further.

3.2.2  Goodness of Fit Statistics

Suppose we have just fitted a model and want to assess how well it fits the data. A measure of discrepancy between observed and fitted values is the deviance statistic, which is given by

$$D = 2\sum_i \left\{\, y_i \log\left(\frac{y_i}{\hat{\mu}_i}\right) + (n_i - y_i)\log\left(\frac{n_i - y_i}{n_i - \hat{\mu}_i}\right) \right\}, \tag{3.13}$$

where $y_i$ is the observed and $\hat{\mu}_i$ the fitted value for the $i$-th observation. Note that this statistic is twice a sum of `observed times log of observed over expected', where the sum is over both successes and failures (i.e. we compare both $y_i$ and $n_i - y_i$ with their expected values). In a perfect fit the ratio observed over expected is one and its logarithm is zero, so the deviance is zero.

In Appendix B we show that this statistic may be constructed as a likelihood ratio test that compares the model of interest with a saturated model that has one parameter for each observation.

With grouped data, the distribution of the deviance statistic converges to a chi-squared distribution with $n-p$ d.f. as the group sizes $n_i \to \infty$ for all $i$, where $n$ is the number of groups and $p$ is the number of parameters in the model, including the constant. Thus, for reasonably large groups, the deviance provides a goodness-of-fit test for the model. With individual data the distribution of the deviance does not converge to a chi-squared (or any other known) distribution, and cannot be used as a goodness-of-fit test. We will, however, consider other diagnostic tools that can be used with individual data.
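As an illustration, the deviance of Equation 3.13 can be computed directly from observed and fitted values. The function below is a sketch of my own, not code from the notes; terms with $y_i = 0$ or $y_i = n_i$ contribute zero, by the usual convention $0 \log 0 = 0$.

```python
import numpy as np

def deviance(y, n, mu):
    """Binomial deviance (Equation 3.13): twice the sum, over both
    successes and failures, of observed * log(observed / expected).
    Terms with y = 0 or y = n contribute zero (0 log 0 = 0)."""
    y = np.asarray(y, float)
    n = np.asarray(n, float)
    mu = np.asarray(mu, float)
    with np.errstate(divide="ignore", invalid="ignore"):
        succ = np.where(y > 0, y * np.log(y / mu), 0.0)
        fail = np.where(n - y > 0, (n - y) * np.log((n - y) / (n - mu)), 0.0)
    return 2.0 * np.sum(succ + fail)
```

For grouped data the result would be referred to a chi-squared distribution with $n-p$ d.f. (e.g. via `scipy.stats.chi2.sf`); with individual data, as noted above, it cannot serve as a goodness-of-fit test.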

An alternative measure of goodness of fit is Pearson's chi-squared statistic, which for binomial data can be written as

$$\chi^2_P = \sum_i \frac{n_i\,(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i\,(n_i - \hat{\mu}_i)}. \tag{3.14}$$

Note that each term in the sum is the squared difference between observed and fitted values $y_i$ and $\hat{\mu}_i$, divided by the variance of $y_i$, which is $\mu_i(n_i-\mu_i)/n_i$, estimated using $\hat{\mu}_i$ for $\mu_i$. This statistic can also be derived as a sum of `observed minus expected squared over expected', where the sum is over both successes and failures.

With grouped data Pearson's statistic has, in large samples, an approximate chi-squared distribution with $n-p$ d.f., and is asymptotically equivalent to the deviance or likelihood-ratio chi-squared statistic. The statistic cannot be used as a goodness-of-fit test with individual data, but it provides a basis for calculating residuals, as we shall see when we discuss logistic regression diagnostics.
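A direct translation of Equation 3.14, again as a hypothetical sketch rather than code from the notes:

```python
import numpy as np

def pearson_x2(y, n, mu):
    """Pearson's chi-squared for binomial data (Equation 3.14):
    sum over groups of n_i (y_i - mu_i)^2 / (mu_i (n_i - mu_i))."""
    y = np.asarray(y, float)
    n = np.asarray(n, float)
    mu = np.asarray(mu, float)
    return np.sum(n * (y - mu) ** 2 / (mu * (n - mu)))
```

The equivalence with the sum of `observed minus expected squared over expected' over successes and failures follows because $1/\mu + 1/(n-\mu) = n/(\mu(n-\mu))$.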

3.2.3  Tests of Hypotheses

Let us consider the problem of testing hypotheses in logit models. As usual, we can calculate Wald tests based on the large-sample distribution of the m.l.e., which is approximately normal with mean β and variance-covariance matrix as given in Equation 3.12.

In particular, we can test the hypothesis

$$H_0: \beta_j = 0$$

concerning the significance of a single coefficient by calculating the ratio of the estimate to its standard error,

$$z = \frac{\hat{\beta}_j}{\sqrt{\widehat{\operatorname{var}}(\hat{\beta}_j)}}.$$
This statistic has approximately a standard normal distribution in large samples. Alternatively, we can treat the square of this statistic as approximately a chi-squared with one d.f.

The Wald test can be used to calculate a confidence interval for $\beta_j$. We can assert with $100(1-\alpha)\%$ confidence that the true parameter lies in the interval with boundaries

$$\hat{\beta}_j \pm z_{1-\alpha/2}\,\sqrt{\widehat{\operatorname{var}}(\hat{\beta}_j)},$$

where $z_{1-\alpha/2}$ is the normal critical value for a two-sided test of size $\alpha$. Confidence intervals for effects in the logit scale can be translated into confidence intervals for odds ratios by exponentiating the boundaries.
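The Wald statistic, the confidence interval, and the exponentiated odds-ratio interval can be sketched together; the helper `wald_ci` is hypothetical and uses only the Python standard library.

```python
import math
from statistics import NormalDist

def wald_ci(beta_j, se_j, alpha=0.05):
    """Wald z statistic, 100(1-alpha)% confidence interval on the
    logit scale, and the corresponding odds-ratio interval."""
    z = beta_j / se_j
    crit = NormalDist().inv_cdf(1 - alpha / 2)   # z_{1-alpha/2}
    lo, hi = beta_j - crit * se_j, beta_j + crit * se_j
    # Exponentiating the boundaries gives a CI for the odds ratio.
    return z, (lo, hi), (math.exp(lo), math.exp(hi))
```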

The Wald test can be applied to test hypotheses concerning several coefficients by calculating the usual quadratic form. This test can also be inverted to obtain confidence regions for vector-valued parameters, but we will not consider this extension.

For more general problems we consider the likelihood ratio test. The key to constructing these tests is the deviance statistic introduced in the previous subsection. In a nutshell, the likelihood ratio test to compare two nested models is based on the difference between their deviances.

To fix ideas, consider partitioning the model matrix and the vector of coefficients into two components

$$X = (X_1, X_2) \quad\text{and}\quad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix},$$

with $p_1$ and $p_2$ elements, respectively. Consider testing the hypothesis

$$H_0: \beta_2 = 0,$$

that the variables in $X_2$ have no effect on the response, i.e. the joint significance of the coefficients in $\beta_2$.

Let $D(X_1)$ denote the deviance of a model that includes only the variables in $X_1$ and let $D(X_1+X_2)$ denote the deviance of a model that includes all variables in $X$. Then the difference

$$\chi^2 = D(X_1) - D(X_1+X_2)$$

has, in large samples, an approximate chi-squared distribution with $p_2$ d.f. Note that $p_2$ is the difference in the number of parameters between the two models being compared.

The deviance plays a role similar to the residual sum of squares. In fact, in Appendix B we show that in models for normally distributed data the deviance is the residual sum of squares. Likelihood ratio tests in generalized linear models are based on scaled deviances, obtained by dividing the deviance by a scale factor. In linear models the scale factor was $\sigma^2$, and we had to divide the RSS's (or their difference) by an estimate of $\sigma^2$ in order to calculate the test criterion. With binomial data the scale factor is one, and there is no need to scale the deviances.

The Pearson chi-squared statistic in the previous subsection, while providing an alternative goodness of fit test for grouped data, cannot be used in general to compare nested models. In particular, differences in deviances have chi-squared distributions but differences in Pearson chi-squared statistics do not. This is the main reason why in statistical modelling we use the deviance or likelihood ratio chi-squared statistic rather than the more traditional Pearson chi-squared of elementary statistics.
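A minimal sketch of the likelihood-ratio comparison, with hypothetical deviance values. To stay within the standard library it uses the closed-form chi-squared tail probability, which holds only for even degrees of freedom; odd d.f. would need the incomplete gamma function (e.g. `scipy.stats.chi2.sf`).

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-squared variable with *even* df,
    via the closed form exp(-x/2) * sum_{j < df/2} (x/2)^j / j!."""
    assert df > 0 and df % 2 == 0
    return math.exp(-x / 2) * sum((x / 2) ** j / math.factorial(j)
                                  for j in range(df // 2))

def lr_test(dev_reduced, dev_full, p2):
    """Likelihood-ratio chi-squared comparing nested logit models:
    the drop in deviance D(X1) - D(X1+X2), referred to p2 d.f."""
    x2 = dev_reduced - dev_full
    return x2, chi2_sf_even_df(x2, p2)
```

For example, if a reduced model had deviance 20.5 and the full model 15.1 with two extra parameters (made-up numbers), the test criterion would be 5.4 on 2 d.f.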


Continue with 3.3. The Comparison of Two Groups