Home | GLMs | Multilevel | Survival | Demography | Stata | R

Digit Preference

The datasets section has the age distribution of the Philippines in 1990 from ages 0 to 99, which appears as table 7.7 in Siegel and Swanson (2004), p. 103. (Thanks to Tom Pullum for providing an electronic version as well as a do file to analyze it.)

. infile age pop using ///
>     http://data.princeton.edu/eco572/datasets/phpop1990.dat
(100 observations read)

. gen pm = pop/1000000

. line pm age, ytitle("Population (millions)") ///
>         title(Philippines 1990)

. graph export phpop1990.png, replace
(file phpop1990.png written in PNG format)

The spikes are likely to represent some form of age misreporting. To investigate a preference for terminal digits such as 0 or 5 we could tabulate the last digit of age

. gen lastdigit = mod(age,10)

. tab lastdigit [fw=pop]

  lastdigit |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |  7,611,712       12.57       12.57
          1 |  6,374,857       10.53       23.10
          2 |  6,424,913       10.61       33.71
          3 |  6,042,858        9.98       43.69
          4 |  6,003,475        9.91       53.60
          5 |  6,140,517       10.14       63.74
          6 |  5,546,993        9.16       72.90
          7 |  5,664,798        9.35       82.25
          8 |  5,344,036        8.82       91.08
          9 |  5,401,935        8.92      100.00
------------+-----------------------------------
      Total | 60,556,094      100.00

We see strong preference for 0, and some for 1, 2 and 5. Note, however, that because of mortality we should expect more people at age 0 than 5 or 9, and more at 10 than 15 or 19, and so forth, with more at 80 than 85 or 89.

Myers' adjustment is to tabulate the last digit several times starting at different values. In our example we could tabulate it only for ages 0 to 89, and then 1 to 90, until we do 9 to 98, and sum the tabulations, obtaining a "blended" population which helps control for a trend.

This is equivalent to counting all ages 10 times, except ages 0 to 8 (which are counted 1 to 9 times, respectively) and 90 to 98 (which are counted 9 to 1 times, respectively). So we can obtained the result more easily using a weight:

. gen mw = 10

. replace mw = age+1 if age < 9
(9 real changes made)

. replace mw = 99-age if age > 89
(10 real changes made)

. replace mw = 0 if age > 98 
(0 real changes made)

. gen combow = pop*mw

. tab lastdigit [fw=combow]

  lastdigit |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 | 59,752,360       11.28       11.28
          1 | 50,629,836        9.56       20.84
          2 | 52,212,367        9.86       30.70
          3 | 50,395,096        9.51       40.21
          4 | 51,921,770        9.80       50.01
          5 | 54,969,894       10.38       60.39
          6 | 50,600,297        9.55       69.94
          7 | 53,367,794       10.07       80.02
          8 | 51,854,354        9.79       89.81
          9 | 54,002,900       10.19      100.00
------------+-----------------------------------
      Total |529,706,668      100.00

Now that we have the counts we need to divide by the sum and see how far we are from 10% in each digit. This is best done with mata:

. mata:

------------------------------------------------- mata (type end to exit) 

:         b = st_matrix("blended")

:         f = 100 * b :/ sum(b)           

:       sum( abs(f :-10 ) )/2 
   1.927535

: end
-----------------------------------------------------------------------------

So we would need to reclasify almost 2 percent of the cases. The blended frequencies show some preference for 0, but no marked preference for 5.

A Stata Command

I have wrapped these calculations in a Stata command called myers. To install it in net-aware Stata type net from http://data.princeton.edu/eco572/stata, and follow the instructions. (This is the standard way of loading course commands.) The command has online help, which you should consult to learn about the options, which include range(bot top), to specify the (initial) range of ages to be tabulated, and months, for studying heaping in duration data. Here's the command run on our example:

. myers age[fw=pop], range(0 89)

 Last digit |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 | 59,752,360       11.28       11.28
          1 | 50,629,836        9.56       20.84
          2 | 52,212,367        9.86       30.70
          3 | 50,395,096        9.51       40.21
          4 | 51,921,770        9.80       50.01
          5 | 54,969,894       10.38       60.39
          6 | 50,600,297        9.55       69.94
          7 | 53,367,794       10.07       80.02
          8 | 51,854,354        9.79       89.81
          9 | 54,002,900       10.19      100.00
------------+-----------------------------------
      Total |529,706,668      100.00

Myers' Blended Index = 1.9275349

These data are used as an example in Siegel and Swanson (2004). They state that they cover the range 10-89, but in fact they continue the sum all the way to 99 (excluding 100, which is left out of our dataset). This is equivalent to assuming that the frequencies above 99 are all zero. We can reproduce their results exactly using

. myers age [fw=pop], range(10 99)

 Last digit |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 | 43,095,176       11.52       11.52
          1 | 35,421,604        9.47       20.99
          2 | 36,523,195        9.77       30.76
          3 | 35,262,494        9.43       40.19
          4 | 36,780,695        9.83       50.02
          5 | 39,840,158       10.65       60.67
          6 | 35,354,160        9.45       70.13
          7 | 37,572,482       10.05       80.17
          8 | 36,349,561        9.72       89.89
          9 | 37,802,270       10.11      100.00
------------+-----------------------------------
      Total |374,001,795      100.00

Myers' Blended Index = 2.3286968
(Assuming zero frequencies for values 100 to 108)

Breastfeeding Duration

Here are reported durations of breastfeeding in the last closed interval (the interval between the next-to-last and last child) from the Bangladesh WFS:

. clear

. infile duration n using ///
>         http://data.princeton.edu/eco572/datasets/bdblci.dat
(56 observations read)

. twoway (line n duration) , title(Bangladesh 1975-76) ///
>         xtitle("duration of breastfeeding (LCI)") ///
>         xlabel( 0(12)72 ) xmtick(6(12)66) legend(off)

. graph export bdblci.png, replace
(file bdblci.png written in PNG format)

It seems quite clear that there is a marked preference for multiples of a year as well as multiples of 6 months. It is possible, of course, that women will aim to wean the child at age one, two, or three. Later on we will see current status evidence that this is not likely to be the case.

In the meantime, we can calculate Myers' index, using the months option, and specifying a range of 0 to 59, which spans 60 months, an exact multiple of 12, and leaves (at least) 11 more months of data for blending.

. myers duration [fw=n], range(0,59) months

 Last digit |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |     31,104       62.59       62.59
          1 |        454        0.91       63.50
          2 |        648        1.30       64.81
          3 |        800        1.61       66.42
          4 |        932        1.88       68.29
          5 |        504        1.01       69.31
          6 |     12,253       24.66       93.97
          7 |        284        0.57       94.54
          8 |        942        1.90       96.43
          9 |        566        1.14       97.57
         10 |        751        1.51       99.08
         11 |        456        0.92      100.00
------------+-----------------------------------
      Total |     49,694      100.00

Myers' Blended Index = 70.581291

So, 63% of the cases report an exact multiple of 12 and another 25% report a multiple of 6 (but not 12). We would need to reclassify 71% of the cases to obtain a uniform distribution by month.

This dataset does not exhibit a marked trend through the range the way an age distribution may do, so you would expect very similar results by just dividing duration by 12 and tabulating the remainder. We just need to make sure we give each month an equal chance by going up to 59 or 71. Try

. gen dmod12 = mod(dur,12)

. tab dmod12 [fw=n] if dur < 60 // or < 72

     dmod12 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      2,746       60.03       60.03
          1 |        112        2.45       62.48
          2 |         87        1.90       64.39
          3 |        130        2.84       67.23
          4 |        101        2.21       69.44
          5 |         60        1.31       70.75
          6 |      1,059       23.15       93.90
          7 |         31        0.68       94.58
          8 |         92        2.01       96.59
          9 |         52        1.14       97.73
         10 |         66        1.44       99.17
         11 |         38        0.83      100.00
------------+-----------------------------------
      Total |      4,574      100.00

We get very similar results.