Smoothing: Running Means and Lines
We will work with data from the Colombia WFS Household Survey, conducted in 1975-76. I tabulated the age distribution of all household members and saved it in an ascci file, which we now read and plot:
. infile age pop using /// > http://data.princeton.edu/eco572/datasets/cohhpop.dat (99 observations read) . line pop age , /// > title(Colombia 1975-76) subtitle(WFS Household Survey) /// > ytitle(population) . graph export cohhpop.png, replace (file cohhpop.png written in PNG format)

As you can see, the distribution looks somewhat less smooth than the data from the Philippines we studied earlier. Can you compute the Myers index for this distribution?
Running Means
The simplest way to smooth a scatterplot is to use moving averages, also know as running means. We can easily compute a moving average with a window of three, where for each observation we consider the neighbors immediately above and below:
. gen ma3 = ( pop[_n-1] + pop[_n] + pop[_n+1] )/3 (2 missing values generated) . replace ma3 = (pop[1]+pop[2])/2 in 1 (1 real change made) . replace ma3 = (pop[_N-1] + pop[_N])/2 in -1 (1 real change made)
The first and last observations have only one neighbor each, and Stata rightly returns a missing value, which we replace by the average of just two.
There is, of course, an easier way. The lowess
command can compute running means if you specify the options
mean and noweight. The window is
specified as a proportion of the data via an option called
bwidth, which is short for bandwidth.
To reproduce our result we need to use 0.03
or 3% of the data:
. lowess pop age, mean noweight bwidth(0.03) ///
> gen(rm3) name(rm3) title(" ")
First let us list a couple of cases to verify that
lowess and us are doing the same thing:
. list age ma3 rm3 in 1/4
+-------------------+
| age ma3 rm3 |
|-------------------|
1. | 0 1529 1529 |
2. | 1 1473 1473 |
3. | 2 1447 1447 |
4. | 3 1462 1462 |
+-------------------+
Looking at the plot (below) we see that a moving average of just three observations is too "wiggly"; it follows the data too closely.

The solution is to use more bandwidth. Let us try Stata's default, which happens to be 0.8 or 80% of the data:
. lowess pop age, mean noweight name(rm80) title(" ")
. graph combine rm3 rm80, xsize(6) ysize(3) ///
> title(Running Mean Smoothers)
. graph drop rm3 rm80 // not needed anymore
. graph export cohhrm.png, replace
(file cohhrm.png written in PNG format)
Now we over smoothed. Try different bandwidths. You should find that to get a reasonable fit you need to use a fairly narrow bandwidth, but then the curve doesn't look very smooth.
Weighted Means
A common problem with running means is bias. A solution is to
use weights that give more importance to near than far neighbors.
Here is an example of two running means, both using a bandwidth
of 25 percent of the data, one unweighted and one weighted
(which you specify using the option weight, or
just omitting as it is the default).
. lowess pop age, mean nowei bw(.25) name(rm25) title(Unweighted) . lowess pop age, mean bw(.25) name(rmw25) title(Weighted) . graph combine rm25 rmw25, xsize(6) ysize(3) /// > title(Running Mean Smoothers) . graph drop rm25 rmw25 . graph export cohhrmw.png,replace (file cohhrmw.png written in PNG format)

The weighted running mean does a better job fitting the data at young ages while maintining smoothness. Experiment by varying the bandwidth.
Running Lines
A better solution to the problem with running means is to use running lines. We define a neighborhood of each point, typically the k nearest neighbors on each side, fit a regression line to the points in the neighborhood, and then use it to predict a smoother value for the index observation.
This sounds like a lot of work, with one regression for each point,
but it can be done reasonably fast using regression updating formulas.
Running lines is the default option of the lowess command,
so all you do in Stata is not specify means.
We do, however, specify noweight.
Here are running lines using bandwidths of 10% and 40%.
. lowess pop age, nowei bw(.1) name(rl10) title(" ")
. lowess pop age, nowei bw(.4) name(rl40) title(" ")
. graph combine rl10 rl40, xsize(6) ysize(3) ///
> title(Running Line Smoothers)
. graph drop rl10 rl40
. graph export cohhrl.png,replace
(file cohhrl.png written in PNG format)

Obviously 10% of the data is not enough to smooth and 40% is a bit too much. You might want to try 25%, as we did with the running means.
Lowess Smoother
It is, of course, possible to combine the two solutions, using
running lines and weights, and this is what the
lowess smoother does by default, so all you do
is not specify means or weights.
Usually you will want to specify the bandwidth.
Let us try 10% and 40%, as we did for the unweighted lines
. lowess pop age, bw(.1) title(" ") name(lo10)
. lowess pop age, bw(.4) title(" ") name(lo40)
. graph combine lo10 lo40, xsize(6) ysize(3) ///
> title(Lowess Smoothers)
. graph drop lo10 lo40
. graph export cohhrlw.png, replace
(file cohhrlw.png written in PNG format)

This is clearly the best technique in the family. Try a bandwidth somewhere in the above range.
Digit Preference Revisited
Smoothing the age distribution provides a better way to assess digit preference that Myers' blending. Let us compute the last digit of age and tabulate it over the entire range using the observed frequencies and a lowess smooth.
. gen lastdigit = mod(age,10)
. tab lastdigit [fw=pop], matcell(obs)
lastdigit | Freq. Percent Cum.
------------+-----------------------------------
0 | 7,797 13.99 13.99
1 | 4,847 8.69 22.68
2 | 5,946 10.67 33.35
3 | 5,388 9.66 43.01
4 | 5,281 9.47 52.48
5 | 6,577 11.80 64.28
6 | 5,290 9.49 73.77
7 | 4,879 8.75 82.52
8 | 5,476 9.82 92.34
9 | 4,269 7.66 100.00
------------+-----------------------------------
Total | 55,750 100.00
We see evidence of preference for ages ending in 0 and 5,
which is very common, and probably 2 as well.
The option matcell(obs) saves the frequencies
in a matrix call obs for later use.
Next we smooth the data using a bandwidth of 25%, and save
the fitted values in a variable called smooth,
which we then use as weights in our tabulation
. lowess pop age, bwidth(0.25) gen(smooth) nograph
. tab lastdigit [aw=smooth], matcell(fit)
lastdigit | Freq. Percent Cum.
------------+-----------------------------------
0 | 11.0366512 11.15 11.15
1 | 10.8233452 10.93 22.08
2 |10.59322023 10.70 32.78
3 | 10.3362373 10.44 43.22
4 | 10.0601082 10.16 53.38
5 | 9.77877815 9.88 63.26
6 | 9.50162757 9.60 72.86
7 | 9.22961745 9.32 82.18
8 | 8.95876398 9.05 91.23
9 | 8.68165074 8.77 100.00
------------+-----------------------------------
Total | 99 100.00
Note how the expect fewer people at higher digits even in a smooth distribution, we have 11.2% ending in 0 but only 8.8% ending in 9. (Stata complains if you try to use non-integer frequency weights, so I used analytic weights--which are standardized to sum to 99, the number of observations--but otherwise produce the same results in terms of the percent distribution.)
We are now ready to compute our own index of digit preference, defined as half the sum of absolute differences between the observed and the smooth frequencies for each digit. We divide each set of frequencies by the sum of the weights, which is 55750 and 99, respectively. Because we can't use the absolute value function with Stata matrices we use Mata:
. mat diff = obs/55750 - fit/99
. mata: sum( abs( st_matrix("diff") ) )/2
.0553043845
So, we would need to reshuffle 5.5% of the observations to eliminate digit preference. (You did compute the value of Myers index, right? How do they compare?)
