Home | GLMs | Multilevel | Survival | Demography | Stata | R

Smoothing: Running Means and Lines

We will work with data from the Colombia WFS Household Survey, conducted in 1975-76. I tabulated the age distribution of all household members and saved it in an ascci file, which we now read and plot:

. infile age pop using ///
>         http://data.princeton.edu/eco572/datasets/cohhpop.dat
(99 observations read)

. line pop age , ///
>         title(Colombia 1975-76) subtitle(WFS Household Survey) ///
>         ytitle(population)

. graph export cohhpop.png, replace
(file cohhpop.png written in PNG format)

As you can see, the distribution looks somewhat less smooth than the data from the Philippines we studied earlier. Can you compute the Myers index for this distribution?

Running Means

The simplest way to smooth a scatterplot is to use moving averages, also know as running means. We can easily compute a moving average with a window of three, where for each observation we consider the neighbors immediately above and below:

. gen ma3 = ( pop[_n-1] + pop[_n] + pop[_n+1] )/3
(2 missing values generated)

. replace ma3 = (pop[1]+pop[2])/2 in 1
(1 real change made)

. replace ma3 = (pop[_N-1] + pop[_N])/2 in -1
(1 real change made)

The first and last observations have only one neighbor each, and Stata rightly returns a missing value, which we replace by the average of just two.

There is, of course, an easier way. The lowess command can compute running means if you specify the options mean and noweight. The window is specified as a proportion of the data via an option called bwidth, which is short for bandwidth. To reproduce our result we need to use 0.03 or 3% of the data:

. lowess pop age, mean noweight bwidth(0.03) ///
>         gen(rm3) name(rm3) title(" ")

First let us list a couple of cases to verify that lowess and us are doing the same thing:

. list age ma3 rm3 in 1/4

     +-------------------+
     | age    ma3    rm3 |
     |-------------------|
  1. |   0   1529   1529 |
  2. |   1   1473   1473 |
  3. |   2   1447   1447 |
  4. |   3   1462   1462 |
     +-------------------+

Looking at the plot (below) we see that a moving average of just three observations is too "wiggly"; it follows the data too closely.

The solution is to use more bandwidth. Let us try Stata's default, which happens to be 0.8 or 80% of the data:

. lowess pop age, mean noweight name(rm80)  title(" ")

. graph combine rm3 rm80, xsize(6) ysize(3) ///
>         title(Running Mean Smoothers)    

. graph drop rm3 rm80 // not needed anymore

. graph export cohhrm.png, replace
(file cohhrm.png written in PNG format)

Now we over smoothed. Try different bandwidths. You should find that to get a reasonable fit you need to use a fairly narrow bandwidth, but then the curve doesn't look very smooth.

Weighted Means

A common problem with running means is bias. A solution is to use weights that give more importance to near than far neighbors. Here is an example of two running means, both using a bandwidth of 25 percent of the data, one unweighted and one weighted (which you specify using the option weight, or just omitting as it is the default).

. lowess pop age, mean nowei bw(.25) name(rm25) title(Unweighted)

. lowess pop age, mean bw(.25)       name(rmw25) title(Weighted)

. graph combine rm25 rmw25, xsize(6) ysize(3) ///
>         title(Running Mean Smoothers)

. graph drop rm25 rmw25

. graph export cohhrmw.png,replace
(file cohhrmw.png written in PNG format)

The weighted running mean does a better job fitting the data at young ages while maintining smoothness. Experiment by varying the bandwidth.

Running Lines

A better solution to the problem with running means is to use running lines. We define a neighborhood of each point, typically the k nearest neighbors on each side, fit a regression line to the points in the neighborhood, and then use it to predict a smoother value for the index observation.

This sounds like a lot of work, with one regression for each point, but it can be done reasonably fast using regression updating formulas. Running lines is the default option of the lowess command, so all you do in Stata is not specify means. We do, however, specify noweight. Here are running lines using bandwidths of 10% and 40%.

. lowess pop age, nowei bw(.1) name(rl10) title(" ")

. lowess pop age, nowei bw(.4) name(rl40) title(" ")

. graph combine rl10 rl40, xsize(6) ysize(3) ///
>         title(Running Line Smoothers)

. graph drop rl10 rl40

. graph export cohhrl.png,replace
(file cohhrl.png written in PNG format)

Obviously 10% of the data is not enough to smooth and 40% is a bit too much. You might want to try 25%, as we did with the running means.

Lowess Smoother

It is, of course, possible to combine the two solutions, using running lines and weights, and this is what the lowess smoother does by default, so all you do is not specify means or weights. Usually you will want to specify the bandwidth. Let us try 10% and 40%, as we did for the unweighted lines

. lowess pop age, bw(.1) title(" ") name(lo10)

. lowess pop age, bw(.4) title(" ") name(lo40)

. graph combine lo10 lo40, xsize(6) ysize(3) ///
>         title(Lowess Smoothers)

. graph drop lo10 lo40

. graph export cohhrlw.png, replace
(file cohhrlw.png written in PNG format)

This is clearly the best technique in the family. Try a bandwidth somewhere in the above range.

Digit Preference Revisited

Smoothing the age distribution provides a better way to assess digit preference that Myers' blending. Let us compute the last digit of age and tabulate it over the entire range using the observed frequencies and a lowess smooth.

. gen lastdigit = mod(age,10)

. tab lastdigit [fw=pop], matcell(obs)

  lastdigit |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      7,797       13.99       13.99
          1 |      4,847        8.69       22.68
          2 |      5,946       10.67       33.35
          3 |      5,388        9.66       43.01
          4 |      5,281        9.47       52.48
          5 |      6,577       11.80       64.28
          6 |      5,290        9.49       73.77
          7 |      4,879        8.75       82.52
          8 |      5,476        9.82       92.34
          9 |      4,269        7.66      100.00
------------+-----------------------------------
      Total |     55,750      100.00

We see evidence of preference for ages ending in 0 and 5, which is very common, and probably 2 as well. The option matcell(obs) saves the frequencies in a matrix call obs for later use. Next we smooth the data using a bandwidth of 25%, and save the fitted values in a variable called smooth, which we then use as weights in our tabulation

. lowess pop age, bwidth(0.25) gen(smooth) nograph

. tab lastdigit [aw=smooth], matcell(fit)

  lastdigit |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 | 11.0366512       11.15       11.15
          1 | 10.8233452       10.93       22.08
          2 |10.59322023       10.70       32.78
          3 | 10.3362373       10.44       43.22
          4 | 10.0601082       10.16       53.38
          5 | 9.77877815        9.88       63.26
          6 | 9.50162757        9.60       72.86
          7 | 9.22961745        9.32       82.18
          8 | 8.95876398        9.05       91.23
          9 | 8.68165074        8.77      100.00
------------+-----------------------------------
      Total |         99      100.00

Note how the expect fewer people at higher digits even in a smooth distribution, we have 11.2% ending in 0 but only 8.8% ending in 9. (Stata complains if you try to use non-integer frequency weights, so I used analytic weights--which are standardized to sum to 99, the number of observations--but otherwise produce the same results in terms of the percent distribution.)

We are now ready to compute our own index of digit preference, defined as half the sum of absolute differences between the observed and the smooth frequencies for each digit. We divide each set of frequencies by the sum of the weights, which is 55750 and 99, respectively. Because we can't use the absolute value function with Stata matrices we use Mata:

. mat diff = obs/55750 - fit/99

. mata: sum( abs( st_matrix("diff") ) )/2
  .0553043845

So, we would need to reshuffle 5.5% of the observations to eliminate digit preference. (You did compute the value of Myers index, right? How do they compare?)