
2.6  One-Way Analysis of Variance

We now consider models where the predictors are categorical variables or factors with a discrete number of levels. To illustrate the use of these models we will group the index of social setting (and later the index of family planning effort) into discrete categories.

2.6.1  The One-Way Layout

Table 2.10 shows the percent decline in the CBR for the 20 countries in our illustrative dataset, classified according to the index of social setting in three categories: low (under 70 points), medium (70-79) and high (80 or more).

Table 2.10: CBR Decline by Levels of Social Setting

Setting   Percent decline in CBR
Low       1, 0, 7, 21, 13, 4, 7
Medium    10, 6, 2, 0, 25
High      9, 11, 29, 29, 40, 21, 22, 29

It will be convenient to modify our notation to reflect the one-way layout of the data explicitly. Let $k$ denote the number of groups or levels of the factor, let $n_i$ denote the number of observations in group $i$, and let $y_{ij}$ denote the response for the $j$-th unit in the $i$-th group, for $j = 1, \ldots, n_i$ and $i = 1, \ldots, k$. In our example $k = 3$ and $y_{ij}$ is the CBR decline in the $j$-th country in the $i$-th category of social setting, with $i = 1, 2, 3$; $j = 1, \ldots, n_i$; $n_1 = 7$, $n_2 = 5$ and $n_3 = 8$.
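As a small illustration of this notation, the data of Table 2.10 can be held in a mapping from level to responses. This is only a sketch: the names `cbr_decline`, `n` and `y` are choices made here, not part of the original notes.

```python
# CBR decline by social setting, transcribed from Table 2.10.
cbr_decline = {
    "low": [1, 0, 7, 21, 13, 4, 7],
    "medium": [10, 6, 2, 0, 25],
    "high": [9, 11, 29, 29, 40, 21, 22, 29],
}

k = len(cbr_decline)                                        # number of levels
n = {level: len(ys) for level, ys in cbr_decline.items()}   # n_i per group


def y(i, j):
    """Response y_ij for the j-th unit in the i-th group (both 1-based)."""
    level = list(cbr_decline)[i - 1]
    return cbr_decline[level][j - 1]


print(k, n, sum(n.values()))
```

Here `y(3, 5)` picks out the fifth country in the high-setting group.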

2.6.2  The One-Factor Model

As usual, we treat $y_{ij}$ as a realization of a random variable $Y_{ij} \sim N(m_{ij}, \sigma^2)$, where the variance is the same for all observations. In terms of the systematic structure of the model, we assume that

\[ m_{ij} = m + a_i, \qquad (2.18) \]

where $m$ plays the role of the constant and $a_i$ represents the effect of level $i$ of the factor.

Before we proceed further, it is important to note that the model as written is not identified. We have essentially k groups but have introduced k+1 linear parameters. The solution is to introduce a constraint, and there are several ways in which we could proceed.
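One way to see the problem is to shift the constant by an arbitrary amount and subtract the same amount from every effect. The symbol $c$ below is introduced here only for illustration:

```latex
\[
  m_{ij} \;=\; m + a_i \;=\; (m + c) + (a_i - c)
  \qquad \text{for any constant } c .
\]
```

The data can therefore determine the $k$ sums $m + a_i$, but not $m$ and the $a_i$ separately. Each constraint discussed next amounts to fixing one particular value of $c$.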

One approach is to set $m = 0$ (or simply drop $m$). If we do this, the $a_i$'s become cell means, with $a_i$ representing the expected response in group $i$. While simple and attractive, this approach does not generalize well to models with more than one factor.

Our preferred alternative is to set one of the $a_i$'s to zero. Conventionally we set $a_1 = 0$, but any of the groups could be chosen as the reference cell or level. In this approach $m$ becomes the expected response in the reference cell, and $a_i$ becomes the effect of level $i$ of the factor, compared to the reference level.

A third alternative is to require the group effects to add up to zero, so $\sum_i a_i = 0$. In this case $m$ represents an overall expected response (the unweighted average of the group means), and $a_i$ measures the extent to which responses at level $i$ of the factor deviate from the overall mean. Some statistics texts refer to this constraint as the 'usual' restrictions, but I think the reference cell method is now used more widely in social research.

A variant of the 'usual' restrictions is to require a weighted sum of the effects to add up to zero, so $\sum_i w_i a_i = 0$. The weights are often taken to be the number of observations in each group, so $w_i = n_i$. In this case $m$ is a weighted average of the group means representing the overall expected response, and $a_i$ is, as before, the extent to which responses at level $i$ of the factor deviate from the overall mean.

Each of these parameterizations can easily be translated into one of the others, so the choice can rest on practical considerations. The reference cell method is easy to implement in a regression context and the resulting parameters have a clear interpretation.
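The translation between parameterizations can be sketched numerically. In every case the effects are the cell means minus whatever value is chosen for the constant. The snippet below is an illustration using the data of Table 2.10; the helper `effects_from` and the variable names are assumptions made for this example.

```python
# CBR decline by social setting, transcribed from Table 2.10.
groups = {
    "low": [1, 0, 7, 21, 13, 4, 7],
    "medium": [10, 6, 2, 0, 25],
    "high": [9, 11, 29, 29, 40, 21, 22, 29],
}
sizes = {g: len(ys) for g, ys in groups.items()}
cell_means = {g: sum(ys) / len(ys) for g, ys in groups.items()}


def effects_from(constant):
    """Given a choice of m, the effects are a_i = (cell mean i) - m."""
    return {g: mean - constant for g, mean in cell_means.items()}


# Cell means: m = 0, so each a_i is the group mean itself.
cell = (0.0, effects_from(0.0))

# Reference cell: m is the mean of the reference group, so a_1 = 0.
reference = (cell_means["low"], effects_from(cell_means["low"]))

# Sum-to-zero: m is the unweighted average of the group means.
m_unweighted = sum(cell_means.values()) / len(cell_means)
sum_to_zero = (m_unweighted, effects_from(m_unweighted))

# Weighted sum-to-zero with w_i = n_i: m is the grand mean of all responses.
m_weighted = sum(sizes[g] * cell_means[g] for g in groups) / sum(sizes.values())
weighted = (m_weighted, effects_from(m_weighted))

for name, (m, a) in [("cell means", cell), ("reference cell", reference),
                     ("sum to zero", sum_to_zero), ("weighted", weighted)]:
    print(f"{name:15s} m = {m:7.3f}  a = "
          + ", ".join(f"{g}: {v:7.3f}" for g, v in a.items()))
```

All four rows describe the same fitted means, since $m + a_i$ equals the cell mean in each; only the split between constant and effects changes.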

2.6.3  Estimates and Standard Errors

The model in Equation 2.18 is a special case of the generalized linear model, where the design matrix $X$ has $k+1$ columns: a column of ones representing the constant, and $k$ columns of indicator variables, say $x_1, \ldots, x_k$, where $x_i$ takes the value one for observations at level $i$ of the factor and the value zero otherwise.

Note that the model matrix as defined so far is rank deficient, because the first column is the sum of the last k. Hence the need for constraints. The cell means approach is equivalent to dropping the constant, and the reference cell method is equivalent to dropping one of the indicator or dummy variables representing the levels of the factor. Both approaches are easily implemented. The other two approaches, which set to zero either the unweighted or weighted sum of the effects, are best implemented using Lagrange multipliers and will not be considered here.
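The rank deficiency can be checked directly. The sketch below builds the full model matrix for group sizes 7, 5 and 8, and compares its rank with the two reduced versions. The `rank` helper is a hypothetical utility written for this example, using exact rational arithmetic; it is not taken from the notes.

```python
from fractions import Fraction


def rank(matrix):
    """Rank by Gaussian elimination in exact rational arithmetic."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    for c in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


sizes = [7, 5, 8]   # n_1, n_2, n_3 from the example
k = len(sizes)

# One row per observation: a constant followed by k indicator columns.
X_full = [[1] + [1 if level == i else 0 for i in range(k)]
          for level, n_i in enumerate(sizes) for _ in range(n_i)]

# The constant equals the sum of the indicators in every row.
constant_is_sum = all(row[0] == sum(row[1:]) for row in X_full)

X_cell_means = [row[1:] for row in X_full]             # drop the constant
X_reference = [[row[0]] + row[2:] for row in X_full]   # drop indicator x_1

print(constant_is_sum, rank(X_full), rank(X_cell_means), rank(X_reference))
```

The full matrix has four columns but rank three, while each reduced matrix has three columns and full column rank.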

Parameter estimates, standard errors and t ratios can then be obtained from the general results of Sections 2.2 and 2.3. You may be interested to know that the estimates of the regression coefficients in the one-way layout are simple functions of the cell means. Using the reference cell method,

\[ \hat{m} = \bar{y}_1 \quad \text{and} \quad \hat{a}_i = \bar{y}_i - \bar{y}_1 \quad \text{for } i > 1, \]

where $\bar{y}_i$ is the average of the responses at level $i$ of the factor.

Table 2.11 shows the estimates for our sample data. We expect a CBR decline of almost 8% in countries with low social setting (the reference cell). Increasing social setting to medium or high is associated with additional declines of one and 16 percentage points, respectively, compared to low setting.

Table 2.11: Estimates for One-Way Anova Model of
CBR Decline by Levels of Social Setting

Parameter          Symbol   Estimate   Std. Error   t-ratio
Low                $m$         7.571      3.498       2.16
Medium (vs. low)   $a_2$       1.029      5.420       0.19
High (vs. low)     $a_3$      16.179      4.790       3.38
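The entries of Table 2.11 can be reproduced from the cell means alone. The sketch below assumes the standard results that the mean of the reference cell has variance $\sigma^2/n_1$ and that a difference of two independent cell means has variance $\sigma^2(1/n_i + 1/n_1)$, with $\sigma^2$ estimated by the residual sum of squares divided by its degrees of freedom. Variable names are chosen for this example.

```python
from math import sqrt

# CBR decline by social setting, transcribed from Table 2.10.
groups = {
    "low": [1, 0, 7, 21, 13, 4, 7],
    "medium": [10, 6, 2, 0, 25],
    "high": [9, 11, 29, 29, 40, 21, 22, 29],
}
reference = "low"

n = {g: len(ys) for g, ys in groups.items()}
mean = {g: sum(ys) / len(ys) for g, ys in groups.items()}

# Residual sum of squares around the cell means, and the error variance.
rss = sum((y - mean[g]) ** 2 for g, ys in groups.items() for y in ys)
df_resid = sum(n.values()) - len(groups)
sigma2 = rss / df_resid

rows = []
# Constant: the mean of the reference cell, with variance sigma^2 / n_1.
est = mean[reference]
se = sqrt(sigma2 / n[reference])
rows.append((reference, est, se, est / se))
# Effects: differences of means, with variance sigma^2 (1/n_i + 1/n_1).
for g in groups:
    if g == reference:
        continue
    est = mean[g] - mean[reference]
    se = sqrt(sigma2 * (1 / n[g] + 1 / n[reference]))
    rows.append((f"{g} vs. {reference}", est, se, est / se))

for label, est, se, t in rows:
    print(f"{label:15s} {est:8.3f} {se:8.3f} {t:6.2f}")
```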

Looking at the t-ratios we see that the difference between medium and low setting is not significant, so we do not reject $H_0$: $a_2 = 0$, whereas the difference between high and low setting, with a t-ratio of 3.38 on 17 d.f. and a two-sided P-value of 0.004, is highly significant, so we reject $H_0$: $a_3 = 0$. These t-ratios test the significance of two particular contrasts: medium vs. low and high vs. low. In the next subsection we consider an overall test of the significance of social setting.

2.6.4  The One-Way Anova Table

Fitting the model with social setting treated as a factor reduces the RSS from 2650.2 (for the null model) to 1456.4, a gain of 1193.8 at the expense of two degrees of freedom (the two effects $a_2$ and $a_3$). We can contrast this gain with the remaining RSS of 1456.4 on 17 d.f. The calculations are laid out in Table 2.12, and lead to an F-ratio of 6.97 on 2 and 17 d.f., which has a P-value of 0.006. We therefore reject the hypothesis $H_0$: $a_2 = a_3 = 0$ of no setting effects, and conclude that the expected response depends on social setting.

Table 2.12: Analysis of Variance for One-Factor Model
of CBR Decline by Levels of Social Setting

Source of    Sum of     Degrees of   Mean      F-
variation    squares    freedom      square    ratio
Setting      1193.8      2           596.9     6.97
Residual     1456.4     17            85.7
Total        2650.2     19
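The analysis of variance can be recomputed from the raw data of Table 2.10, as in the sketch below. The P-value line relies on a closed form for the upper tail of the F distribution that holds only when the numerator has exactly two degrees of freedom, as it does here; with other degrees of freedom a statistical library would be needed. Variable names are chosen for this example.

```python
# CBR decline by social setting, transcribed from Table 2.10.
groups = {
    "low": [1, 0, 7, 21, 13, 4, 7],
    "medium": [10, 6, 2, 0, 25],
    "high": [9, 11, 29, 29, 40, 21, 22, 29],
}
all_y = [y for ys in groups.values() for y in ys]
grand_mean = sum(all_y) / len(all_y)
mean = {g: sum(ys) / len(ys) for g, ys in groups.items()}

# Null model RSS (total) and RSS after fitting the factor (residual).
tss = sum((y - grand_mean) ** 2 for y in all_y)
rss = sum((y - mean[g]) ** 2 for g, ys in groups.items() for y in ys)
model_ss = tss - rss

df_model = len(groups) - 1
df_resid = len(all_y) - len(groups)
f_ratio = (model_ss / df_model) / (rss / df_resid)

# Upper tail of F(2, d): (1 + 2 f / d) ** (-d / 2). Valid only for 2 numerator d.f.
if df_model != 2:
    raise ValueError("closed-form P-value requires two numerator d.f.")
p_value = (1 + 2 * f_ratio / df_resid) ** (-df_resid / 2)

print(f"Setting   {model_ss:8.1f} {df_model:3d} {model_ss / df_model:8.1f} {f_ratio:6.2f}")
print(f"Residual  {rss:8.1f} {df_resid:3d} {rss / df_resid:8.1f}")
print(f"Total     {tss:8.1f} {df_model + df_resid:3d}")
print(f"P-value   {p_value:.3f}")
```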

Having established that social setting has an effect on CBR decline, we can inspect the parameter estimates and t-ratios to learn more about the nature of the effect. As noted earlier, the difference between high and low settings is significant, while that between medium and low is not.

It is instructive to calculate the Wald test for this example. Let $a = (a_2, a_3)'$ denote the two setting effects. The estimate and its variance-covariance matrix, calculated using the general results of Section 2.2, are

\[ \hat{a} = \begin{pmatrix} 1.029 \\ 16.179 \end{pmatrix} \quad \text{and} \quad \widehat{\mathrm{var}}(\hat{a}) = \begin{pmatrix} 29.373 & 12.239 \\ 12.239 & 22.948 \end{pmatrix}. \]

The Wald statistic is

\[ W = \hat{a}' \, \widehat{\mathrm{var}}^{-1}(\hat{a}) \, \hat{a} = 13.94, \]

and has approximately a chi-squared distribution with two d.f. Under the assumption of normality, however, we can divide by two to obtain F = 6.97, which has an F distribution with two and 17 d.f., and coincides with the test based on the reduction in the residual sum of squares, as shown in Table 2.12.
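The quadratic form is easy to evaluate by hand for a 2 by 2 matrix. The sketch below plugs in the estimates and variances as printed above, so the result agrees with the reported 13.94 only up to rounding error in those three-decimal inputs. Variable names are chosen for this example.

```python
# Estimates and variance-covariance matrix as printed in the text.
a_hat = [1.029, 16.179]
v = [[29.373, 12.239],
     [12.239, 22.948]]

# Inverse of a 2x2 matrix via its determinant.
det = v[0][0] * v[1][1] - v[0][1] * v[1][0]
v_inv = [[v[1][1] / det, -v[0][1] / det],
         [-v[1][0] / det, v[0][0] / det]]

# Wald statistic: a' V^{-1} a.
wald = sum(a_hat[i] * v_inv[i][j] * a_hat[j]
           for i in range(2) for j in range(2))

# Dividing by the number of restrictions gives the F-ratio.
f_ratio = wald / len(a_hat)

print(f"W = {wald:.3f}, F = {f_ratio:.3f}")
```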

2.6.5  The Correlation Ratio

Note from Table 2.12 that the model treating social setting as a factor with three levels has reduced the RSS by 1193.8 out of 2650.2, thereby explaining 45.0% of the variation. The square root of the proportion of variance explained by a discrete factor is called the correlation ratio, and is often denoted $\eta$. In our example $\hat{\eta} = 0.671$.
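As a check on the arithmetic, the correlation ratio can be computed from the raw data of Table 2.10. In the sketch below the proportion explained comes out at roughly 45% and its square root at about 0.67. Variable names are chosen for this example.

```python
from math import sqrt

# CBR decline by social setting, transcribed from Table 2.10.
groups = {
    "low": [1, 0, 7, 21, 13, 4, 7],
    "medium": [10, 6, 2, 0, 25],
    "high": [9, 11, 29, 29, 40, 21, 22, 29],
}
all_y = [y for ys in groups.values() for y in ys]
grand_mean = sum(all_y) / len(all_y)
mean = {g: sum(ys) / len(ys) for g, ys in groups.items()}

tss = sum((y - grand_mean) ** 2 for y in all_y)
rss = sum((y - mean[g]) ** 2 for g, ys in groups.items() for y in ys)

# Proportion of variation explained by the factor, and its square root.
eta_squared = (tss - rss) / tss
eta = sqrt(eta_squared)

print(f"proportion explained = {eta_squared:.4f}, correlation ratio = {eta:.4f}")
```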

If the factor has only two categories the resulting coefficient is called the point-biserial correlation, a measure often used in psychometrics to correlate a test score (a continuous variable) with the answer to a dichotomous item (correct or incorrect). Note that both measures are identical in construction to Pearson's correlation coefficient. The difference in terminology reflects whether the predictor is a continuous variable with a linear effect or a discrete variable with two or more than two categories.
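The link with Pearson's coefficient can be illustrated for a two-level factor. The sketch below keeps only the low and high settings from Table 2.10, a subset chosen here purely for illustration and not analyzed in the notes. It compares the correlation ratio with Pearson's correlation between the response and a 0/1 indicator for the high group.

```python
from math import sqrt

# Illustrative two-group subset of Table 2.10 (low vs. high setting only).
low = [1, 0, 7, 21, 13, 4, 7]
high = [9, 11, 29, 29, 40, 21, 22, 29]

y = low + high
x = [0] * len(low) + [1] * len(high)   # indicator: 1 for the high group


def pearson(u, v):
    """Pearson's correlation coefficient between two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)


# Correlation ratio: square root of the proportion of variation explained.
grand_mean = sum(y) / len(y)
tss = sum((v - grand_mean) ** 2 for v in y)
rss = (sum((v - sum(low) / len(low)) ** 2 for v in low)
       + sum((v - sum(high) / len(high)) ** 2 for v in high))
eta = sqrt((tss - rss) / tss)

r = pearson(x, y)
print(f"eta = {eta:.4f}, point-biserial r = {r:.4f}")
```

The two values agree in absolute value. The sign of `r` depends only on which group is coded as 1.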


Copyright © Germán Rodríguez, 1993-2000. Please send feedback to grodri@princeton.edu