Home | GLMs | Multilevel | Survival | Demography | Stata | R

Smoothing: Splines

This continues our work on smoothing. Let's start by making sure we have loaded the data.

. infile age pop using ///
>         http://data.princeton.edu/eco572/datasets/cohhpop.dat
(99 observations read)

Our application will focus on regression splines, because they are the easiest ones to use in Stata, but we will mention briefly natural regression splines and smoothing splines.

Regression Splines

A cubic spline S(x) with knots t1 ... tk has linear, quadratic and cubic terms on x, and one term of the form (x-tj)3+ for each knot. These terms can easily be computed and entered in a regression, although one should be careful about multi-collinearity when there are lots of knots.

A better solution is to use b-splines, a well-conditioned basis for splines. Stata does not have built-in b-splines, but Roger Newson has contributed a command that you can install from the depository at SSC (the Statistical Software Components archive) or from his own website. Type findit bspline to learn more.

Let us use spline regression to smooth the Colombian data. We will use a cubic spline with 3 internal knots at ages 25, 50 and 75. For comparison we will fit a second spline with 4 internal knots at ages 20, 40, 60 and 80. Newson's command requires you to specify the extremes as well, so we add 0 and 100 in both cases, but these don't count as real knots in our terminology. Don't forget to specify p(3) to get cubic splines, otherwise you get piecewise linear splines.

. bspline, xvar(age) knots(0 25 50 75 100) p(3) gen(_bs3k)

With 3 internal knots a cubic spline has 7 parameters (including the constant). The gen option told the command to save these as variables whose names start with _bs3k, so we should have _bs3k1 to _bs3k7.

We now regress the population counts on the b-splines, omitting the constant. (Alternative, you could omit one of the b-splines, which you need to do in multiple regression if there are several terms represented by b-splines.) We then use predict to evaluate the spline and plot it.

. quietly regress pop _bs3k*, noconstant

. predict bs3k
(option xb assumed; fitted values)

. twoway (scatter pop age)(line bs3k age) , legend(off) name(rs3) ///
>         note(knots 25 50 75) title(3 internal knots)

(I supressed the regression to save space but you shouldn't. Always make sure you can verify that things run as expected.) Let us do exactly the same thing with four knots:

. bspline, xvar(age) knots(0 20 40 60 80 100) p(3) gen(_bs4k)

. quietly regress pop _bs4k*, noconstant

. predict bs4k
(option xb assumed; fitted values)

. twoway (scatter pop age)(line bs4k age) , legend(off) name(rs4) ///
>         note(knots 20 40 60 80) title(4 internal knots)

. graph combine rs3 rs4, xsize(6) ysize(3) ///
>         title(Regression Spline Smoothers)

. graph export cohhrs.png
(file cohhrs.png written in PNG format)

. drop _bs* // don't need these vars anymore

. graph drop rs3 rs4 //nor these graphs

As you can see, both regression splines do an excellent job of smoothing the data. There really isn't much to choose between the versions with 3 or 4 knots.

In general, however, the location of the knots is important. We have chosen to space then equally throughout the range. A popular alternative is to set them at selected quantiles.

The Power Series

Just to convince yourself that there is nothing magic about b-splines, we will reproduce the first fit 'by hand', using the so-called power series basis described in the handout.

. gen age2 = age^2

. gen age3 = age^3

. gen k25 = cond(age > 25, (age-25)^3, 0)

. gen k50 = cond(age > 50, (age-50)^3, 0)

. gen k75 = cond(age > 75, (age-75)^3, 0)

. quietly regress pop age age2 age3 k25 k50 k75

. predict myspline
(option xb assumed; fitted values)

. summarize bs3k myspline

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        bs3k |        99    563.1313    556.2422  -.7697959   1657.724
    myspline |        99    563.1313    556.2422  -.7688386   1657.724

. corr bs3k myspline
(obs=99)

             |     bs3k myspline
-------------+------------------
        bs3k |   1.0000
    myspline |   1.0000   1.0000

As you can see, we get the same results except for a tiny difference in the minimum. This is a tribute to Stata's numerical prowess in the presence of high multicollinearity. (The correlation between age and age^2 is .9676, between age^2 and age^3 is .9859, and between age^3 and the first knot term (age-25)^3+, it is .9867. The highest correlation between any two terms in the b-spline basis is .8206 and most are substantially less.)

Natural Splines

Sometimes we have little information at the extremes of the range. Natural cubic splines, which are constrained to be linear outside the range of the data, provide a useful tool in those circumstances. Note that requiring linearity outside the range of the data imposes additional smoothness constraints inside the range, for example the polynomials used at the ends must 'die' with no curvature.

Stata does not have a natural cubic spline function. Mata has a spline3 function which does natural cubic spline interpolation, but that's not quite the same thing. In research projects we have used R to compute a natural cubic spline basis and merge that into our Stata datasets.

Smoothing Splines

A smoothing spline has a knot at each data point, but introduces a penalty for lack of smoothness. If the penalty is zero you get a function that interpolates the data points. If the penalty is infinite, you get an OLS fit to the data. Usually a nice compromise can be found somewhere in between. We usually focus on splines of odd degree, and particulary on cubic splines, which have some nice properties as noted in the handout.

Stata does not have a function for computing smoothing splines, but it is not too difficult to improvise one using the results given on page 7 of the handout.