|
|
|
|
|
|
||
In class we explored in more detail the relationship between fertility change and social setting. Let us start by loading the data
. use http://data.princeton.edu/wws509/datasets/fpe, clear (Family Planning Effort Data)
One way to help the eye discern trends in a scatterplot is to use a smoother, or non-parametric estimate of the regression function. The simplest methods are running means (or moving averages) and running lines. The lowess method was first proposed by Cleveland in 1979. For each distinct value of x it produces a fitted value y by running a regression in a local neighborhood of x, giving more weight to points closer to x. The size of the neighborhood is called the bandwidth, and represents a trade off between smoothness and goodness of fit: large neighborhoods produce smooth curves that may not fit well, and small neighborhoods lead to curves that fit better but look wiggly.
Stata's lowess command uses a default bandwidth of 80%.
Let us try if for our data:
. lowess change setting . graph export splfig1.png, replace (file splfig1.png written in PNG format)

By default the command produces a scatter plot and superimposes the lowess
fit. The fit can be saved to a variable using the gen(varname)
option. The option bwidth(number) controls the bandwidth,
which must be a number strictly between 0 and 1. For our data there is a
hint that the relationship may be a bit steeper at higher settings.
One way to account for a curvilinear relationship is to add a quadratic term. I strongly recommending centering a variable by subtracting the mean, or a convenient round value near the mean, before squaring. This reduces collinearity (thus leading to more stable estimates) and produces numbers that are easier to interpret. For our data I decided to center setting on 70:
. gen settingsqc = (setting-70)^2
. regress change setting settingsqc
Source | SS df MS Number of obs = 20
-------------+------------------------------ F( 2, 17) = 8.22
Model | 1302.66136 2 651.330679 Prob > F = 0.0032
Residual | 1347.53864 17 79.2669789 R-squared = 0.4915
-------------+------------------------------ Adj R-squared = 0.4317
Total | 2650.2 19 139.484211 Root MSE = 8.9032
------------------------------------------------------------------------------
change | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
setting | .5808057 .1459596 3.98 0.001 .2728578 .8887536
settingsqc | .009076 .0080173 1.13 0.273 -.007839 .025991
_cons | -29.75161 11.70088 -2.54 0.021 -54.43831 -5.064914
------------------------------------------------------------------------------
The linear term reflects the effect of setting at 70 (since the quadratic term is then zero). Thus, at that level of development we see fertility decline an average of 0.58 percentage points more per point of setting. Moreover, at that point the difference is increasing by roughly 0.02 percentage points per point of setting. (See the note at the end of this subsection to see where this number comes from.)
The best way to appreciate a quadratic fit, however, is to plot it.
We could use Stata's predict command to get the fitted values
and superimpose a line on the scatterplot.
Or we can let Stata do all the work using qfit:
. twoway (scatter change setting) (qfit change setting) /// > , title(A quadratic fit) legend(off) . graph export splfig2.png, replace (file splfig2.png written in PNG format)

Try superimposing the lowess and the quadratic fits to compare them more directly.
A note on derivatives and unit changes. In a linear model where E(Y)=a+b x, b is the derivative w.r.t. x, and we usually interpret it as as the average difference in Y per unit difference in X, which happens to be exactly right. In a quadratic model where E(Y)=a+b x + g(x-c)2 we have to be a bit more careful. The derivative w.r.t. x is b + 2 g(x-c) and it depends on x, with a multiplier that is twice g. This can still be interpreted as a unit change at x, as you can verify by predicting at x+1/2 and x-1/2 and computing the difference. (Working with x+1 and x estimates the unit change around x+1/2.)
Another way to account for the shape of the relationship is to
use a straight line that changes slope somewhere along the way.
The technical term for the resulting curve is a linear spline,
the simplest member of a family of piecewise polynomials called
splines.
We can easily obtain a regression of y on x that changes
slope at a point x by defining two variables:
| x1 = min(x, x), and |
| x2 = max(x-x,0)/td> |
The second variable is often written as (x-x)+, the positive part of x-x. We then fit the model
Work out the values of the linear predictor for x < x and for x > x to satisfy yourself that you get lines with slopes b1 and b2 that have the same value at x.
Here's how to do this for our data, picking 70 as the change over point:
. gen s1 = setting
. replace s1 = 70 if setting > 70
(12 real changes made)
. gen s2 = setting - 70
. replace s2 = 0 if setting < 70
(7 real changes made)
. regress change s1 s2
Source | SS df MS Number of obs = 20
-------------+------------------------------ F( 2, 17) = 9.16
Model | 1374.90927 2 687.454633 Prob > F = 0.0020
Residual | 1275.29073 17 75.017102 R-squared = 0.5188
-------------+------------------------------ Adj R-squared = 0.4622
Total | 2650.2 19 139.484211 Root MSE = 8.6612
------------------------------------------------------------------------------
change | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
s1 | .2080681 .2324725 0.90 0.383 -.282406 .6985422
s2 | .9120976 .2956165 3.09 0.007 .2884014 1.535794
_cons | -5.946731 14.12723 -0.42 0.679 -35.75257 23.85911
------------------------------------------------------------------------------
We see that the average difference in fertility declines per point of social setting is 0.21 percentage points for low settings (< 70) and 0.91 for not so low settings (> 70). To visualize the fit we plot the data and the linear spline
. predict ls
(option xb assumed; fitted values)
. graph twoway (scatter change setting) (line ls setting, sort) ///
> , legend(off) title("A Linear Spline Fit") ///
> xtitle(Social setting) ytitle(Fertility decline) note(Knot at 70)
. graph export splfig3.png, replace
(file splfig3.png written in PNG format)

In this example the linear spline fits a little bit better than the quadratic. It may also be easier to explain than a quadratic. None of these models, however, fits significantly better than the simple linear regression. Here is a test of equality of the two slopes:
. test s1=s2
( 1) s1 - s2 = 0
F( 1, 17) = 2.32
Prob > F = 0.1463
Stata has a command mkspline that automates
the construction of linear splines with any number of knots.
Type help mkspline to learn mode.