Linear models are used to study how a quantitative variable depends on one or more predictors or explanatory variables. The predictors themselves may be quantitative or qualitative.
We will illustrate the use of linear models for continuous data using a small dataset extracted from Mauldin and Berelson (1978) and reproduced in Table 2.1. The data include an index of social setting, an index of family planning effort, and the percent decline in the crude birth rate (CBR), defined as the number of births per thousand population, between 1965 and 1975, for 20 countries in Latin America and the Caribbean.
| Country | Setting | Effort | CBR Decline |
| Bolivia | 46 | 0 | 1 |
| Brazil | 74 | 0 | 10 |
| Chile | 89 | 16 | 29 |
| Colombia | 77 | 16 | 25 |
| Costa Rica | 84 | 21 | 29 |
| Cuba | 89 | 15 | 40 |
| Dominican Rep | 68 | 14 | 21 |
| Ecuador | 70 | 6 | 0 |
| El Salvador | 60 | 13 | 13 |
| Guatemala | 55 | 9 | 4 |
| Haiti | 35 | 3 | 0 |
| Honduras | 51 | 7 | 7 |
| Jamaica | 87 | 23 | 21 |
| Mexico | 83 | 4 | 9 |
| Nicaragua | 68 | 0 | 7 |
| Panama | 84 | 19 | 22 |
| Paraguay | 74 | 3 | 6 |
| Peru | 73 | 0 | 2 |
| Trinidad-Tobago | 84 | 15 | 29 |
| Venezuela | 91 | 7 | 11 |
The index of social setting combines seven social indicators, namely literacy, school enrollment, life expectancy, infant mortality, percent of males aged 15-64 in the non-agricultural labor force, gross national product per capita and percent of population living in urban areas. Higher scores represent higher socio-economic levels.
The index of family planning effort combines 15 different program indicators, including such aspects as the existence of an official family planning policy, the availability of contraceptive methods, and the structure of the family planning program. An index of 0 denotes the absence of a program, 1-9 indicates weak programs, 10-19 represents moderate efforts and 20 or more denotes fairly strong programs.
Figure 2.1 shows scatterplots for all pairs of variables. Note that CBR decline is positively associated with both social setting and family planning effort. Note also that countries with higher socio-economic levels tend to have stronger family planning programs.
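The associations described above can also be checked numerically. The following is a minimal sketch, assuming NumPy is available; the array name `data` and the use of Pearson correlations as a summary of each scatterplot are our own choices, not part of the original analysis.

```python
import numpy as np

# Table 2.1, one row per country in table order:
# columns are social setting, family planning effort, CBR decline.
data = np.array([
    [46, 0, 1], [74, 0, 10], [89, 16, 29], [77, 16, 25], [84, 21, 29],
    [89, 15, 40], [68, 14, 21], [70, 6, 0], [60, 13, 13], [55, 9, 4],
    [35, 3, 0], [51, 7, 7], [87, 23, 21], [83, 4, 9], [68, 0, 7],
    [84, 19, 22], [74, 3, 6], [73, 0, 2], [84, 15, 29], [91, 7, 11],
], dtype=float)

# Pairwise Pearson correlations; rowvar=False treats columns as variables.
corr = np.corrcoef(data, rowvar=False)

labels = ["setting", "effort", "decline"]
for i in range(3):
    for j in range(i + 1, 3):
        print(f"corr({labels[i]}, {labels[j]}) = {corr[i, j]:.3f}")
```

A positive entry in `corr` corresponds to an upward-sloping cloud of points in the matching panel of the scatterplot matrix.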

In our analysis of these data we will treat the percent decline in the CBR as a continuous response and the indices of social setting and family planning effort as predictors. In a first approach to the data we will treat the predictors as continuous covariates with linear effects. Later we will group them into categories and treat them as discrete factors.
The first issue we must deal with is that the response will vary even among units with identical values of the covariates. To model this fact we will treat each response $y_i$ as a realization of a random variable $Y_i$. Conceptually, we view the observed response as only one out of many possible outcomes that we could have observed under identical circumstances, and we describe the possible values in terms of a probability distribution.
For the models in this chapter we will assume that the random variable $Y_i$ has a normal distribution with mean $\mu_i$ and variance $\sigma^2$, in symbols:
$$Y_i \sim N(\mu_i, \sigma^2).$$
Note that the expected value may vary from unit to unit, but the variance is the same for all. In terms of our example, we may expect a larger fertility decline in Cuba than in Haiti, but we don't anticipate that our expectation will be closer to the truth for one country than for the other.
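The normal or Gaussian distribution (after the mathematician Carl Friedrich Gauss) has probability density function
$$f(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2} \frac{(y_i - \mu_i)^2}{\sigma^2} \right\}. \tag{2.1}$$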

Most of the probability mass in the normal distribution (in fact, 99.7%) lies within three standard deviations of the mean. In terms of our example, we would be very surprised if fertility in a country declined $3\sigma$ more than expected. Of course, we don't know yet what to expect, nor what $\sigma$ is.
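The three-standard-deviation figure can be verified directly. The sketch below uses only the Python standard library and assumes the usual normal density with mean $\mu$ and standard deviation $\sigma$; the function names `normal_pdf` and `prob_within` are ours.

```python
import math

def normal_pdf(y, mu, sigma):
    """Normal density with mean mu and standard deviation sigma, evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi * sigma ** 2)

def prob_within(k):
    """P(|Y - mu| <= k * sigma) for a normal Y; free of mu and sigma."""
    return math.erf(k / math.sqrt(2))

# Probability mass within one, two and three standard deviations of the mean.
for k in (1, 2, 3):
    print(f"within {k} sd: {prob_within(k):.4f}")
```

Because the probability depends only on the number of standard deviations $k$, the same figure applies to every country regardless of its mean.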
So far we have considered the distribution of one observation. At this point we add the important assumption that the observations are mutually independent. This assumption allows us to obtain the joint distribution of the data as a simple product of the individual probability distributions, and underlies the construction of the likelihood function that will be used for estimation and testing. When the observations are independent they are also uncorrelated and their covariance is zero, so $\mathrm{cov}(Y_i, Y_j) = 0$ for $i \neq j$.
It will be convenient to collect the $n$ responses in a column vector $y$, which we view as a realization of a random vector $Y$ with mean $E(Y) = \mu$ and variance-covariance matrix $\mathrm{var}(Y) = \sigma^2 I$, where $I$ is the identity matrix. The diagonal elements of $\mathrm{var}(Y)$ are all $\sigma^2$ and the off-diagonal elements are all zero, so the $n$ observations are uncorrelated and have the same variance. Under the assumption of normality, $Y$ has a multivariate normal distribution
$$Y \sim N_n(\mu, \sigma^2 I). \tag{2.2}$$
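To see how independence turns the joint distribution into a product, we can compare the sum of the individual log-densities with the log-density of a multivariate normal whose covariance matrix is $\sigma^2 I$. This is a minimal sketch, assuming NumPy; the means in `mu` and the variance `sigma2` are hypothetical values chosen only for illustration, and the function names are ours.

```python
import numpy as np

def log_normal_pdf(y, mu, sigma2):
    """Elementwise log-density of independent N(mu_i, sigma2) variables."""
    return -0.5 * np.log(2 * np.pi * sigma2) - 0.5 * (y - mu) ** 2 / sigma2

def log_mvn_pdf(y, mu, cov):
    """Log-density of a multivariate normal with mean mu and covariance cov."""
    n = y.size
    resid = y - mu
    _, logdet = np.linalg.slogdet(cov)          # log of the determinant of cov
    quad = resid @ np.linalg.solve(cov, resid)  # resid' cov^{-1} resid
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

y = np.array([1.0, 10.0, 29.0])    # first three CBR declines in Table 2.1
mu = np.array([5.0, 12.0, 25.0])   # hypothetical means, for illustration only
sigma2 = 36.0                      # hypothetical common variance

sum_of_logs = log_normal_pdf(y, mu, sigma2).sum()
joint_log = log_mvn_pdf(y, mu, sigma2 * np.eye(y.size))
print(sum_of_logs, joint_log)      # the two values agree
```

The agreement holds for any choice of means and common variance, because a diagonal covariance matrix makes the joint density factor into the individual densities.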
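Let us now turn our attention to the systematic part of the model. Suppose that we have data on $p$ predictors $x_1, \ldots, x_p$ which take values $x_{i1}, \ldots, x_{ip}$ for the $i$-th unit. We will assume that the expected response depends on these predictors. Specifically, we will assume that $\mu_i$ is a linear function of the predictors
$$\mu_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}$$
for some unknown coefficients $\beta_1, \beta_2, \ldots, \beta_p$.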
This equation may be written more compactly using matrix notation as
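$$\mu_i = x_i'\beta, \tag{2.3}$$
where $x_i'$ is the row vector of predictor values for the $i$-th unit and $\beta$ is the column vector of the $p$ coefficients. Stacking the $n$ expected responses into a column vector $\mu$ gives
$$\mu = X\beta, \tag{2.4}$$
where $X$ is the $n \times p$ matrix whose $i$-th row is $x_i'$.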
The expression $X\beta$ is called the linear predictor, and includes many special cases of interest. Later in this chapter we will show how it includes simple and multiple linear regression models, analysis of variance models and analysis of covariance models.
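A small numerical sketch may help fix ideas about the linear predictor. It assumes NumPy, uses the setting and effort scores of the first three countries in Table 2.1, and includes a column of ones so that the first coefficient acts as a constant. The coefficient values in `beta` are hypothetical and are not estimates from the data.

```python
import numpy as np

# Setting and effort scores for Bolivia, Brazil and Chile (Table 2.1).
setting = np.array([46.0, 74.0, 89.0])
effort = np.array([0.0, 0.0, 16.0])

# Matrix X with one row per unit: a constant, setting and effort (p = 3).
X = np.column_stack([np.ones_like(setting), setting, effort])

# Hypothetical coefficients, chosen only to illustrate the arithmetic.
beta = np.array([-14.0, 0.25, 1.0])

# Linear predictor in matrix form, mu = X beta.
mu = X @ beta

# The same quantity written out term by term for each unit.
mu_by_hand = beta[0] + beta[1] * setting + beta[2] * effort

print(mu)  # expected responses implied by the hypothetical coefficients
```

The matrix product and the term-by-term sum give identical results; the matrix form simply handles all units at once.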
The simplest possible linear model assumes that every unit has the same expected value, so that $\mu_i = \mu$ for all $i$. This model is often called the null model, because it postulates no systematic differences between the units. The null model can be obtained as a special case of Equation 2.3 by setting $p = 1$ and $x_i = 1$ for all $i$. In terms of our example, this model would expect fertility to decline by the same amount in all countries, and would attribute all observed differences between countries to random variation.
At the other extreme we have a model where every unit has its own expected value $\mu_i$. This model is called the saturated model because it has as many parameters in the linear predictor (or linear parameters, for short) as it has observations. The saturated model can be obtained as a special case of Equation 2.3 by setting $p = n$ and letting the $i$-th predictor take the value 1 for unit $i$ and 0 for every other unit. In this model the $x$'s are indicator variables for the different units, and there is no random variation left. All observed differences between countries are attributed to their own idiosyncrasies.
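The two extremes can be written down explicitly as choices of the predictor matrix. The sketch below assumes NumPy and uses the first four CBR declines in Table 2.1. Using the sample mean as the common value in the null model is an illustrative choice on our part, since estimation has not yet been discussed.

```python
import numpy as np

y = np.array([1.0, 10.0, 29.0, 25.0])  # first four CBR declines in Table 2.1
n = y.size

# Null model: p = 1 and a single predictor equal to 1 for every unit.
X_null = np.ones((n, 1))
mu_null = X_null @ np.array([y.mean()])  # same expected value for all units

# Saturated model: p = n and one indicator variable per unit.
X_sat = np.eye(n)
mu_sat = X_sat @ y                       # each unit gets its own expected value

# With one parameter per observation nothing is left for the error term.
residual_sat = y - mu_sat

print(mu_null)       # a constant vector
print(residual_sat)  # all zeros
```

In the null model every difference between countries ends up in `y - mu_null`, whereas in the saturated model the residual vector is identically zero.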
Obviously the null and saturated models are not very useful by themselves. Most statistical models of interest lie somewhere in between, and most of this chapter will be devoted to an exploration of the middle ground. Our aim is to capture systematic sources of variation in the linear predictor, and let the error term account for unstructured or random variation.
Copyright © Germán Rodríguez, 1993-2000.