Let us now turn to applications, modelling the dependence of a continuous response y on a single linear predictor x. In terms of our example, we will study fertility decline as a function of social setting. One can often obtain useful insight into the form of this dependence by plotting the data, as we did in Figure 2.1.
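Perhaps the simplest way to express this dependence is to assume that the expected response μi is a linear function of the predictor:

$$\mu_i = \alpha + \beta x_i$$

This equation defines a straight line. The parameter α is called the constant or intercept, and represents the expected response when xi = 0. (This quantity may not be of direct interest if zero is not in the range of the data.) The parameter β is called the slope, and represents the expected increment in the response per unit change in xi.
You have probably seen the simple linear regression model written with an explicit error term as

$$y_i = \alpha + \beta x_i + \epsilon_i,$$

where the errors εi have mean zero.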
The simple linear regression model can be obtained as a special case of the general linear model of Section 2.1 by letting the model matrix X consist of two columns: a column of ones representing the constant and a column with the values of x representing the predictor. Estimates of the parameters, standard errors, and tests of hypotheses can then be obtained from the general results of Sections 2.2 and 2.3.
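To make the model-matrix formulation concrete, here is a minimal sketch that builds the two-column X implicitly (a column of ones and a column of x) and solves the normal equations by inverting the 2×2 matrix X'X by hand. The data are a hypothetical toy example invented for illustration, not the family planning data, and the function name `fit_simple_ols` is likewise an assumption.

```python
# Hypothetical toy data, invented only to illustrate the mechanics.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]


def fit_simple_ols(x, y):
    """Solve the normal equations (X'X) b = X'y for X = [ones, x]."""
    n = len(x)
    # Entries of X'X when X has a column of ones and a column of x.
    sx = sum(x)
    sxx = sum(v * v for v in x)
    # Entries of X'y.
    sy = sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    # Explicit inverse of the 2x2 matrix [[n, sx], [sx, sxx]].
    det = n * sxx - sx * sx
    alpha = (sxx * sy - sx * sxy) / det
    beta = (n * sxy - sx * sy) / det
    return alpha, beta


alpha_hat, beta_hat = fit_simple_ols(x, y)

# Residual sum of squares around the fitted line.
fitted = [alpha_hat + beta_hat * v for v in x]
rss = sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))

# Cross-check against the closed form: slope = covariance / variance of x,
# and a fitted line that passes through the point of means.
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
beta_closed = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
               / sum((a - x_bar) ** 2 for a in x))
alpha_closed = y_bar - beta_closed * x_bar

print(alpha_hat, beta_hat, rss)
```

With only two parameters the matrix inverse can be written out directly; for larger models one would normally rely on a numerical linear algebra routine instead.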
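It may be of interest to note that in simple linear regression the estimates of the constant and slope are given by

$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x} \quad\text{and}\quad \hat{\beta} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}.$$

The first equation shows that the fitted line passes through the means of the predictor and the response; the second shows that the estimated slope is the ratio of the covariance of x and y to the variance of x.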
Fitting this model to the family planning effort data with CBR decline as the response and the index of social setting as a predictor gives a residual sum of squares of 1449.1 on 18 d.f. (20 observations minus two parameters: the constant and slope).
Table 2.3 shows the estimates of the parameters, their standard errors and the corresponding t-ratios.
We find that, on the average, each additional point in the social setting scale is associated with an additional half a percentage point of CBR decline, measured from a baseline of an expected 22% increase in CBR when social setting is zero. (Since the social setting scores range from 35 to 91, the constant is not particularly meaningful in this example.)
The estimated standard error of the slope is 0.13, and the corresponding t-ratio of 3.86 on 18 d.f. is highly significant. With 95% confidence we estimate that the slope lies between 0.23 and 0.78.
Figure 2.3 shows the results in graphical form, plotting observed and fitted values of CBR decline versus social setting. The fitted values are calculated for any value of the predictor x as $\hat{y} = \hat{\alpha} + \hat{\beta} x$ and lie, of course, on a straight line.
You should verify that the analogous model with family planning effort as a single predictor gives a residual sum of squares of 950.6 on 18 d.f., with constant 2.336 (2.662) and slope 1.253 (0.2208). Make sure you know how to interpret these estimates.
Instead of using a test based on the distribution of the OLS estimator, we could test the significance of the slope by comparing the simple linear regression model with the null model. Note that these models are nested, because we can obtain the null model by setting β = 0 in the simple linear regression model.
Fitting the null model to the family planning data gives a residual sum of squares of 2650.2 on 19 d.f. Adding a linear effect of social setting reduces the RSS by 1201.1 at the expense of one d.f. This gain can be contrasted with the remaining RSS of 1449.1 on 18 d.f. by constructing an F-test. The calculations are set out in Table 2.4, and lead to an F-statistic of 14.9 on one and 18 d.f.
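Table 2.4: Analysis of variance for the simple regression of CBR decline on social setting

| Source of variation | Degrees of freedom | Sum of squares | Mean square | F-ratio |
|---------------------|--------------------|----------------|-------------|---------|
| Setting             | 1                  | 1201.1         | 1201.1      | 14.9    |
| Residual            | 18                 | 1449.1         | 80.5        |         |
| Total               | 19                 | 2650.2         |             |         |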
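The F-test calculation can be reproduced from the two residual sums of squares and their degrees of freedom quoted in the text. The following is a small sketch; the function name `f_statistic` is an assumption introduced here for illustration.

```python
def f_statistic(rss_null, df_null, rss_model, df_model):
    """F-ratio comparing two nested models from their RSS and residual d.f."""
    gain = rss_null - rss_model        # reduction in RSS from the extra term(s)
    df_gain = df_null - df_model       # d.f. spent to achieve that reduction
    mean_sq_gain = gain / df_gain
    mean_sq_resid = rss_model / df_model
    return mean_sq_gain / mean_sq_resid


# Null model: RSS 2650.2 on 19 d.f.; social setting model: RSS 1449.1 on 18 d.f.
f_setting = f_statistic(2650.2, 19, 1449.1, 18)
print(round(f_setting, 1))  # 14.9
```

The numerator is the mean square gained per degree of freedom and the denominator is the residual mean square of the larger model, so the same function would apply to any pair of nested linear models.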
These results can be used to verify the equivalence of t and F test statistics and critical values. Squaring the observed t-statistic of 3.86 gives the observed F-ratio of 14.9. Squaring the 95% two-sided critical value of the Student's t distribution with 18 d.f., which is 2.1, gives the 95% critical value of the F distribution with one and 18 d.f., which is 4.4.
You should verify that the t and F tests for the model with a linear effect of family planning effort are t = 5.67 and F = 32.2.
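One way to carry out this check, sketched here using only the figures reported in this section for the family planning effort model (slope 1.253 with standard error 0.2208, RSS 950.6 on 18 d.f., and a null RSS of 2650.2). The variable names are assumptions chosen for readability.

```python
# Figures quoted in the text for the model with family planning effort.
slope, se = 1.253, 0.2208
rss_null, rss_effort, df_resid = 2650.2, 950.6, 18

# t-ratio: estimate divided by its standard error.
t_ratio = slope / se

# F-ratio: RSS reduction on one d.f. over the residual mean square.
f_ratio = (rss_null - rss_effort) / (rss_effort / df_resid)

print(round(t_ratio, 2), round(f_ratio, 1), round(t_ratio ** 2, 1))
```

Because the inputs are rounded, the squared t-ratio and the F-ratio agree only to about one decimal place rather than exactly.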
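A simple summary of the strength of the relationship between the predictor and the response can be obtained by calculating a proportionate reduction in the residual sum of squares as we move from the null model to the model with x. The quantity

$$R^2 = 1 - \frac{\mathrm{RSS}_x}{\mathrm{RSS}_0},$$

where $\mathrm{RSS}_0$ is the residual sum of squares of the null model and $\mathrm{RSS}_x$ that of the model with x, is often called the coefficient of determination, or the proportion of variance explained by the model. In our example the RSS falls from 2650.2 to 1449.1, so a linear effect of social setting explains 1201.1/2650.2, or 45.3%, of the variation in CBR decline.
The square root of the proportion of variance explained in a simple linear regression model, with the same sign as the regression coefficient, is Pearson's linear correlation coefficient. This measure ranges between -1 and 1, taking these values for perfect inverse and direct relationships, respectively. For the model with CBR decline as a linear function of social setting, Pearson's r = 0.673. This coefficient can be calculated directly from the covariance of x and y and their variances, as

$$r = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}}.$$

Equivalently, r is the slope obtained when y is regressed on x after both variables have been standardized to have mean zero and standard deviation one.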
In our example, each standard deviation of increase in social setting is associated with an additional decline in the CBR of 0.673 standard deviations. While the regression coefficient expresses the association in the original units of x and y, Pearson's r expresses the association in units of standard deviation.
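The interpretation of r as a slope in standard deviation units can be checked numerically. The sketch below uses a hypothetical toy data set (invented for illustration, not the family planning data): it standardizes x and y, regresses the standardized response on the standardized predictor through the origin, and compares the resulting slope with Pearson's r computed from the covariance and variances. The helper names `standardize` and `pearson_r` are assumptions.

```python
import math

# Hypothetical toy data, invented only to illustrate the identity.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def pearson_r(x, y):
    """Pearson's r from sums of squares and cross-products about the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


zx, zy = standardize(x), standardize(y)

# Slope of the regression of standardized y on standardized x through the origin.
standardized_slope = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)

r = pearson_r(x, y)
print(standardized_slope, r)
```

The two printed values coincide up to floating-point error, which is the sense in which r measures the association in units of standard deviation.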
You should verify that a linear effect of family planning effort accounts for 64.1% of the variation in CBR decline, so Pearson's r = 0.801. Clearly CBR decline is associated more strongly with family planning effort than with social setting.
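The proportions of variance explained and the corresponding correlations follow directly from the residual sums of squares quoted in this section. A brief sketch, with the function name `r_squared` introduced here as an assumption:

```python
import math


def r_squared(rss_null, rss_model):
    """Proportionate reduction in RSS relative to the null model."""
    return 1 - rss_model / rss_null


rss_null = 2650.2  # null model, as reported in the text

r2_setting = r_squared(rss_null, 1449.1)  # social setting model
r2_effort = r_squared(rss_null, 950.6)    # family planning effort model

# Both estimated slopes are positive, so r takes the positive square root.
r_setting = math.sqrt(r2_setting)
r_effort = math.sqrt(r2_effort)

print(round(r2_setting, 3), round(r_setting, 3))  # 0.453 0.673
print(round(r2_effort, 3), round(r_effort, 3))    # 0.641 0.801
```

When the slope is negative, the square root must be given a negative sign to obtain Pearson's r.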