Smoothers, Polynomials, and Linear Splines

In class we explored in more detail the relationship between fertility change and social setting. Let us start by loading the data

. use http://data.princeton.edu/wws509/datasets/fpe, clear
(Family Planning Effort Data)

Lowess Regression

One way to help the eye discern trends in a scatterplot is to use a smoother, or non-parametric estimate of the regression function. The simplest methods are running means (or moving averages) and running lines. The lowess method was first proposed by Cleveland in 1979. For each distinct value of x it produces a fitted value y by running a regression in a local neighborhood of x, giving more weight to points closer to x. The size of the neighborhood is called the bandwidth, and represents a trade off between smoothness and goodness of fit: large neighborhoods produce smooth curves that may not fit well, and small neighborhoods lead to curves that fit better but look wiggly.

Stata's lowess command uses a default bandwidth of 80%. Let us try if for our data:

. lowess change setting

. graph export splfig1.png, replace
(file splfig1.png written in PNG format)

By default the command produces a scatter plot and superimposes the lowess fit. The fit can be saved to a variable using the gen(varname) option. The option bwidth(number) controls the bandwidth, which must be a number strictly between 0 and 1. For our data there is a hint that the relationship may be a bit steeper at higher settings.

A Quadratic Term

One way to account for a curvilinear relationship is to add a quadratic term. I strongly recommending centering a variable by subtracting the mean, or a convenient round value near the mean, before squaring. This reduces collinearity (thus leading to more stable estimates) and produces numbers that are easier to interpret. For our data I decided to center setting on 70:

. gen settingsqc = (setting-70)^2

. regress change setting settingsqc

      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  2,    17) =    8.22
       Model |  1302.66136     2  651.330679           Prob > F      =  0.0032
    Residual |  1347.53864    17  79.2669789           R-squared     =  0.4915
-------------+------------------------------           Adj R-squared =  0.4317
       Total |      2650.2    19  139.484211           Root MSE      =  8.9032

------------------------------------------------------------------------------
      change |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     setting |   .5808057   .1459596     3.98   0.001     .2728578    .8887536
  settingsqc |    .009076   .0080173     1.13   0.273     -.007839     .025991
       _cons |  -29.75161   11.70088    -2.54   0.021    -54.43831   -5.064914
------------------------------------------------------------------------------

The linear term reflects the effect of setting at 70 (since the quadratic term is then zero). Thus, at that level of development we see fertility decline an average of 0.58 percentage points more per point of setting. Moreover, at that point the difference is increasing by roughly 0.02 percentage points per point of setting. (See the note at the end of this subsection to see where this number comes from.)

The best way to appreciate a quadratic fit, however, is to plot it. We could use Stata's predict command to get the fitted values and superimpose a line on the scatterplot. Or we can let Stata do all the work using qfit:

. twoway (scatter change setting) (qfit change setting) ///
>       , title(A quadratic fit) legend(off)

. graph export splfig2.png, replace
(file splfig2.png written in PNG format)

Try superimposing the lowess and the quadratic fits to compare them more directly.

A note on derivatives and unit changes. In a linear model where E(Y)=a+b x, b is the derivative w.r.t. x, and we usually interpret it as as the average difference in Y per unit difference in X, which happens to be exactly right. In a quadratic model where E(Y)=a+b x + g(x-c)2 we have to be a bit more careful. The derivative w.r.t. x is b + 2 g(x-c) and it depends on x, with a multiplier that is twice g. This can still be interpreted as a unit change at x, as you can verify by predicting at x+1/2 and x-1/2 and computing the difference. (Working with x+1 and x estimates the unit change around x+1/2.)

A Linear Spline

Another way to account for the shape of the relationship is to use a straight line that changes slope somewhere along the way. The technical term for the resulting curve is a linear spline, the simplest member of a family of piecewise polynomials called splines.

We can easily obtain a regression of y on x that changes slope at a point x by defining two variables:

x1 = min(x, x), and
x2 = max(x-x,0)

The second variable is often written as (x-x)+, the positive part of x-x. We then fit the model

E(Y)= b0 + b1 x1 + b2 x2

Work out the values of the linear predictor for x < x and for x > x to satisfy yourself that you get lines with slopes b1 and b2 that have the same value at x.

Here's how to do this for our data, picking 70 as the change over point:

. gen s1 = setting

. replace s1 = 70 if setting > 70
(12 real changes made)

. gen s2 = setting - 70

. replace s2 = 0 if setting < 70
(7 real changes made)

. regress change s1 s2

      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  2,    17) =    9.16
       Model |  1374.90927     2  687.454633           Prob > F      =  0.0020
    Residual |  1275.29073    17   75.017102           R-squared     =  0.5188
-------------+------------------------------           Adj R-squared =  0.4622
       Total |      2650.2    19  139.484211           Root MSE      =  8.6612

------------------------------------------------------------------------------
      change |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          s1 |   .2080681   .2324725     0.90   0.383     -.282406    .6985422
          s2 |   .9120976   .2956165     3.09   0.007     .2884014    1.535794
       _cons |  -5.946731   14.12723    -0.42   0.679    -35.75257    23.85911
------------------------------------------------------------------------------

We see that the average difference in fertility declines per point of social setting is 0.21 percentage points for low settings (< 70) and 0.91 for not so low settings (> 70). To visualize the fit we plot the data and the linear spline

. predict ls
(option xb assumed; fitted values)

. graph twoway (scatter change setting) (line ls setting, sort) ///
>  , legend(off) title("A Linear Spline Fit")                   ///
>    xtitle(Social setting) ytitle(Fertility decline) note(Knot at 70)

. graph export splfig3.png, replace
(file splfig3.png written in PNG format)

In this example the linear spline fits a little bit better than the quadratic. It may also be easier to explain than a quadratic. None of these models, however, fits significantly better than the simple linear regression. Here is a test of equality of the two slopes:

. test s1=s2

 ( 1)  s1 - s2 = 0

       F(  1,    17) =    2.32
            Prob > F =    0.1463

Stata has a command mkspline that automates the construction of linear splines with any number of knots. Type help mkspline to learn mode.