Smoothing: Splines
This continues our work on smoothing. Let's start by making sure we have loaded the data.
. infile age pop using /// > http://data.princeton.edu/eco572/datasets/cohhpop.dat (99 observations read)
Our application will focus on regression splines, because they are the easiest ones to use in Stata, but we will mention briefly natural regression splines and smoothing splines.
Regression Splines
A cubic spline S(x) with knots t1 ... tk has linear, quadratic and cubic terms on x, and one term of the form (x-tj)3+ for each knot. These terms can easily be computed and entered in a regression, although one should be careful about multi-collinearity when there are lots of knots.
A better solution is to use b-splines, a well-conditioned
basis for splines. Stata does not have built-in b-splines,
but Roger Newson has contributed a command that you can install
from the depository at SSC (the Statistical Software Components
archive) or from his own website.
Type findit bspline to learn more.
Let us use spline regression to smooth the Colombian data. We
will use a cubic spline with 3 internal knots at ages 25, 50 and 75.
For comparison we will fit a second spline with 4 internal knots
at ages 20, 40, 60 and 80.
Newson's command requires you to specify the extremes as well,
so we add 0 and 100 in both cases, but these don't count as
real knots in our terminology.
Don't forget to specify p(3) to get
cubic splines, otherwise you get piecewise linear splines.
. bspline, xvar(age) knots(0 25 50 75 100) p(3) gen(_bs3k)
With 3 internal knots a cubic spline has 7 parameters (including
the constant). The gen option told the command to
save these as variables whose names start with _bs3k,
so we should have _bs3k1 to _bs3k7.
We now regress the population counts on the b-splines, omitting
the constant. (Alternative, you could omit one of the b-splines,
which you need to do in multiple regression if there are several
terms represented by b-splines.) We then use predict
to evaluate the spline and plot it.
. quietly regress pop _bs3k*, noconstant . predict bs3k (option xb assumed; fitted values) . twoway (scatter pop age)(line bs3k age) , legend(off) name(rs3) /// > note(knots 25 50 75) title(3 internal knots)
(I supressed the regression to save space but you shouldn't. Always make sure you can verify that things run as expected.) Let us do exactly the same thing with four knots:
. bspline, xvar(age) knots(0 20 40 60 80 100) p(3) gen(_bs4k) . quietly regress pop _bs4k*, noconstant . predict bs4k (option xb assumed; fitted values) . twoway (scatter pop age)(line bs4k age) , legend(off) name(rs4) /// > note(knots 20 40 60 80) title(4 internal knots) . graph combine rs3 rs4, xsize(6) ysize(3) /// > title(Regression Spline Smoothers) . graph export cohhrs.png (file cohhrs.png written in PNG format) . drop _bs* // don't need these vars anymore . graph drop rs3 rs4 //nor these graphs

As you can see, both regression splines do an excellent job of smoothing the data. There really isn't much to choose between the versions with 3 or 4 knots.
In general, however, the location of the knots is important. We have chosen to space then equally throughout the range. A popular alternative is to set them at selected quantiles.
The Power Series
Just to convince yourself that there is nothing magic about b-splines, we will reproduce the first fit 'by hand', using the so-called power series basis described in the handout.
. gen age2 = age^2
. gen age3 = age^3
. gen k25 = cond(age > 25, (age-25)^3, 0)
. gen k50 = cond(age > 50, (age-50)^3, 0)
. gen k75 = cond(age > 75, (age-75)^3, 0)
. quietly regress pop age age2 age3 k25 k50 k75
. predict myspline
(option xb assumed; fitted values)
. summarize bs3k myspline
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
bs3k | 99 563.1313 556.2422 -.7697959 1657.724
myspline | 99 563.1313 556.2422 -.7688386 1657.724
. corr bs3k myspline
(obs=99)
| bs3k myspline
-------------+------------------
bs3k | 1.0000
myspline | 1.0000 1.0000
As you can see, we get the same results except for a tiny difference in the minimum. This is a tribute to Stata's numerical prowess in the presence of high multicollinearity. (The correlation between age and age^2 is .9676, between age^2 and age^3 is .9859, and between age^3 and the first knot term (age-25)^3+, it is .9867. The highest correlation between any two terms in the b-spline basis is .8206 and most are substantially less.)
Natural Splines
Sometimes we have little information at the extremes of the range. Natural cubic splines, which are constrained to be linear outside the range of the data, provide a useful tool in those circumstances. Note that requiring linearity outside the range of the data imposes additional smoothness constraints inside the range, for example the polynomials used at the ends must 'die' with no curvature.
Stata does not have a natural cubic spline function.
Mata has a spline3 function which does natural
cubic spline interpolation, but that's not quite the same thing.
In research projects we have used R to compute a natural cubic
spline basis and merge that into our Stata datasets.
Smoothing Splines
A smoothing spline has a knot at each data point, but introduces a penalty for lack of smoothness. If the penalty is zero you get a function that interpolates the data points. If the penalty is infinite, you get an OLS fit to the data. Usually a nice compromise can be found somewhere in between. We usually focus on splines of odd degree, and particulary on cubic splines, which have some nice properties as noted in the handout.
Stata does not have a function for computing smoothing splines, but it is not too difficult to improvise one using the results given on page 7 of the handout.
