Germán Rodríguez
Generalized Linear Models
Princeton University

Solutions for Problem Set 2:
Determinants of Hourly Wages

We start by reading the data

. use http://data.princeton.edu/wws509/datasets/wages, clear
(Hourly wages in 1985 CPS extract)
> library(foreign)

> wages <- read.dta("http://data.princeton.edu/wws509/datasets/wages.dta")

> names(wages)
 [1] "education"   "south"       "female"      "workexp"     "unionmember"
 [6] "wages"       "age"         "ethnicity"   "occupation"  "sector"     
[11] "married"    

[1] Working with Wages

(a) Fit a linear model to explore how hourly wages depend on education, work experience, union membership, region, occupation and sex. (For simplicity we leave out the other predictors throughout this problem set.)

. regress wages educ workexp union south i.occupation female

      Source │       SS           df       MS      Number of obs   =       534
─────────────┼──────────────────────────────────   F(10, 523)      =     24.39
       Model │  4477.32498        10  447.732498   Prob > F        =    0.0000
    Residual │  9599.37357       523  18.3544428   R-squared       =    0.3181
─────────────┼──────────────────────────────────   Adj R-squared   =    0.3050
       Total │  14076.6985       533  26.4103162   Root MSE        =    4.2842

──────────────┬────────────────────────────────────────────────────────────────
        wages │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
──────────────┼────────────────────────────────────────────────────────────────
    education │   .6722878   .0990429     6.79   0.000     .4777169    .8668586
      workexp │   .0936955   .0165623     5.66   0.000     .0611587    .1262322
  unionmember │   1.517377   .5083565     2.98   0.003     .5187055    2.516049
        south │  -.6885843   .4150423    -1.66   0.098    -1.503939    .1267705
              │
   occupation │
       Sales  │  -3.975443   .9141984    -4.35   0.000    -5.771395   -2.179491
    Clerical  │  -3.347118   .7600163    -4.40   0.000    -4.840178   -1.854058
     Service  │  -4.148184   .8053377    -5.15   0.000    -5.730278    -2.56609
Professional  │  -1.267909   .7270285    -1.74   0.082    -2.696164    .1603461
       Other  │  -2.799024    .756551    -3.70   0.000    -4.285276   -1.312772
              │
       female │  -1.845267   .4152299    -4.44   0.000     -2.66099   -1.029544
        _cons │   1.979516   1.710527     1.16   0.248    -1.380831    5.339863
──────────────┴────────────────────────────────────────────────────────────────

. estimates store wages
> m1 <- lm(wages ~ education + workexp + unionmember + south + occupation + female,
+ 	data = wages)

> summary(m1)

Call:
lm(formula = wages ~ education + workexp + unionmember + south + 
    occupation + female, data = wages)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.296  -2.420  -0.552   1.995  34.860 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)             1.97952    1.71053   1.157 0.247696    
education               0.67229    0.09904   6.788 3.10e-11 ***
workexp                 0.09370    0.01656   5.657 2.54e-08 ***
unionmember             1.51738    0.50836   2.985 0.002970 ** 
south                  -0.68858    0.41504  -1.659 0.097701 .  
occupationSales        -3.97544    0.91420  -4.349 1.65e-05 ***
occupationClerical     -3.34712    0.76002  -4.404 1.29e-05 ***
occupationService      -4.14818    0.80534  -5.151 3.68e-07 ***
occupationProfessional -1.26791    0.72703  -1.744 0.081754 .  
occupationOther        -2.79902    0.75655  -3.700 0.000239 ***
female                 -1.84527    0.41523  -4.444 1.08e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.284 on 523 degrees of freedom
Multiple R-squared:  0.3181,	Adjusted R-squared:  0.305 
F-statistic: 24.39 on 10 and 523 DF,  p-value: < 2.2e-16

(b) Describe the net effects of education, work experience, and union membership on wages.

The hourly wage is 67 cents higher per year of education for workers with the same observed values of the other variables. Work experience is associated with an increase of 9 cents in the hourly wage per year of experience, everything else being the same. Union members earn $1.52 more per hour than non-union members with the same characteristics.

(c) Describe the gender gap after adjusting for the effects of the other variables in the model, and test its significance.

Female workers earn on average $1.85 per hour less than males with the same education, work experience, union membership, region of residence and occupation. The hypothesis that women make the same as comparable men is rejected with a t statistic of -4.4 (P-value of zero to three decimal places.)

(d) Calculate and plot the jack-knifed residuals versus the fitted values. What does the plot indicate? Any outliers?

. predict jack, rstu

. predict fv
(option xb assumed; fitted values)

. scatter jack fv

. graph export ps2fig1.png, width(500) replace
(file ps2fig1.png written in PNG format)
> library(dplyr)

> wages <- mutate(wages, jack = rstudent(m1), fv = fitted(m1))

> library(ggplot2)

> ggplot(wages, aes(fv, jack)) + geom_point()

> ggsave("ps2fig1r.png", width=500/72, height=400/72, dpi=72)
[Figure: Residual Plot]

The plot exhibits the typical megaphone pattern that indicates heteroscedasticity: the variance of the residuals is clearly larger at higher predicted hourly wages.
We also see a very clear outlier: a worker whose wages are more than 8 standard deviations higher than expected.

(e) Compute robust standard errors and comment on whether the gender gap is still significant. (R users should make sure to use the "HC1" method.)

. quietly regress wages educ work union south i.occupation female, vce(robust)

. estimates store robust

. estimates table wages robust, se

─────────────┬──────────────────────────
    Variable │   wages        robust    
─────────────┼──────────────────────────
   education │  .67228777    .67228777  
             │  .09904295    .10155895  
     workexp │  .09369546    .09369546  
             │  .01656226    .01934914  
 unionmember │  1.5173771    1.5173771  
             │  .50835653     .5257942  
       south │ -.68858431   -.68858431  
             │  .41504228    .38976722  
             │
  occupation │
      Sales  │ -3.9754434   -3.9754434  
             │  .91419841    1.1030045  
   Clerical  │ -3.3471181   -3.3471181  
             │  .76001633    1.0812816  
    Service  │ -4.1481844   -4.1481844  
             │  .80533767    1.1210233  
Professio~l  │ -1.2679088   -1.2679088  
             │  .72702849    1.1014587  
      Other  │ -2.7990237   -2.7990237  
             │  .75655098    1.0819253  
             │
      female │ -1.8452669   -1.8452669  
             │  .41522986    .44311694  
       _cons │  1.9795165    1.9795165  
             │  1.7105266    2.0648923  
─────────────┴──────────────────────────
                            legend: b/se
> library(lmtest)

> library(sandwich)

> robust <- coeftest(m1, vcov = vcovHC(m1, type = "HC1"))

> cbind(coeftest(m1)[,1:3], robust[,2:3])
                          Estimate Std. Error   t value Std. Error    t value
(Intercept)             1.97951646 1.71052663  1.157256 2.06489233  0.9586536
education               0.67228777 0.09904295  6.787841 0.10155895  6.6196803
workexp                 0.09369546 0.01656226  5.657167 0.01934914  4.8423578
unionmember             1.51737708 0.50835653  2.984868 0.52579420  2.8858764
south                  -0.68858431 0.41504228 -1.659070 0.38976722 -1.7666553
occupationSales        -3.97544338 0.91419841 -4.348556 1.10300451 -3.6041950
occupationClerical     -3.34711813 0.76001633 -4.404008 1.08128163 -3.0955100
occupationService      -4.14818440 0.80533767 -5.150863 1.12102328 -3.7003553
occupationProfessional -1.26790876 0.72702849 -1.743960 1.10145868 -1.1511179
occupationOther        -2.79902370 0.75655098 -3.699716 1.08192532 -2.5870766
female                 -1.84526688 0.41522986 -4.443965 0.44311694 -4.1642887

The robust standard errors are somewhat larger than the model-based estimates, but the gender gap is still significant, with a t ratio in excess of 4.
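The t ratios in the comparison can be recovered directly from the table; a quick check in plain Python (an illustration outside the document's Stata/R workflow, with numbers copied from the output above):

```python
# Gender-gap t ratio under model-based and robust (HC1) standard errors,
# using the coefficient and the two standard errors from the table above.
coef = -1.8452669
for label, se in [("model-based", 0.41522986), ("robust", 0.44311694)]:
    print(f"{label}: t = {coef / se:.2f}")
```

Both ratios exceed 4 in absolute value, which is why the conclusion is unchanged.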

[2] Working with Log-Wages

(a) Fit the model of part 1.a working with the natural logarithm of hourly wages.

. gen logwages = log(wages)

. regress logwages educ union south i.occupation workexp female

      Source │       SS           df       MS      Number of obs   =       534
─────────────┼──────────────────────────────────   F(10, 523)      =     27.97
       Model │  51.7302552        10  5.17302552   Prob > F        =    0.0000
    Residual │  96.7165671       523  .184926514   R-squared       =    0.3485
─────────────┼──────────────────────────────────   Adj R-squared   =    0.3360
       Total │  148.446822       533  .278511862   Root MSE        =    .43003

──────────────┬────────────────────────────────────────────────────────────────
     logwages │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
──────────────┼────────────────────────────────────────────────────────────────
    education │   .0694744   .0099415     6.99   0.000     .0499442    .0890046
  unionmember │    .205853   .0510267     4.03   0.000     .1056106    .3060955
        south │  -.1060423   .0416602    -2.55   0.011    -.1878842   -.0242005
              │
   occupation │
       Sales  │  -.3504642   .0917634    -3.82   0.000    -.5307343   -.1701941
    Clerical  │  -.2176053   .0762872    -2.85   0.005    -.3674724   -.0677382
     Service  │  -.4042984   .0808364    -5.00   0.000    -.5631024   -.2454945
Professional  │  -.0432898   .0729761    -0.59   0.553     -.186652    .1000724
       Other  │  -.2047711   .0759394    -2.70   0.007    -.3539548   -.0555873
              │
      workexp │   .0105903   .0016624     6.37   0.000     .0073244    .0138562
       female │  -.2080284    .041679    -4.99   0.000    -.2899072   -.1261495
        _cons │   1.251034   .1716955     7.29   0.000     .9137368    1.588332
──────────────┴────────────────────────────────────────────────────────────────
> m2 <- lm(log(wages) ~ education + workexp + unionmember + south + occupation + female,
+   data = wages)

> summary(m2)

Call:
lm(formula = log(wages) ~ education + workexp + unionmember + 
    south + occupation + female, data = wages)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.33889 -0.27657  0.00404  0.29798  1.76925 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)             1.251034   0.171695   7.286 1.18e-12 ***
education               0.069474   0.009942   6.988 8.51e-12 ***
workexp                 0.010590   0.001662   6.370 4.14e-10 ***
unionmember             0.205853   0.051027   4.034 6.30e-05 ***
south                  -0.106042   0.041660  -2.545  0.01120 *  
occupationSales        -0.350464   0.091763  -3.819  0.00015 ***
occupationClerical     -0.217605   0.076287  -2.852  0.00451 ** 
occupationService      -0.404298   0.080836  -5.001 7.78e-07 ***
occupationProfessional -0.043290   0.072976  -0.593  0.55330    
occupationOther        -0.204771   0.075939  -2.697  0.00723 ** 
female                 -0.208028   0.041679  -4.991 8.19e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.43 on 523 degrees of freedom
Multiple R-squared:  0.3485,	Adjusted R-squared:  0.336 
F-statistic: 27.97 on 10 and 523 DF,  p-value: < 2.2e-16

(b) Describe the coefficients of education, work experience and union membership in terms of the effects of these variables on wages (not on log wages).

. mata b = st_matrix("e(b)")

. mata b[1..3] \ exp(b[1..3])
                  1              2              3
    ┌──────────────────────────────────────────────┐
  1 │   .0694743579    .2058530126   -.1060423388  │
  2 │   1.071944574    1.228572606    .8993865683  │
    └──────────────────────────────────────────────┘
> vars = c("education","workexp","unionmember")

> exp(coef(m2)[vars]) - 1
  education     workexp unionmember 
 0.07194457  0.01064661  0.22857261 

We estimate the returns to education as 7.2% higher wages per year of education, everything else being equal. Workers with more experience make on average about one percent more per year of experience, adjusting for everything else. Union members make on average 23% more per hour than non-union workers with the same observed characteristics. The first two coefficients are small enough to interpret directly as percentages, but the approximation is less accurate for the union coefficient. You can see all coefficients in exponentiated form using reg, eform(e(b)).
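The exponentiation step can be verified with a few lines of plain Python (an illustrative check outside the document's Stata/R workflow; coefficients copied from the output above):

```python
from math import exp

# Log-wage coefficients from the model above; exp(b) - 1 is the exact
# percent effect, and b itself is a good approximation when b is small.
coefs = {"education": 0.0694744, "workexp": 0.0105903, "unionmember": 0.2058530}
for name, b in coefs.items():
    print(f"{name:12s} approx {b:6.1%}  exact {exp(b) - 1:6.1%}")
```

The gap between 20.6% and 22.9% for union membership is exactly why the direct percentage reading is less accurate for that coefficient.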

(c) Describe the gender gap as estimated in this model and test its significance.

The coefficient of -0.208 corresponds to a relative effect of 0.812, indicating that female workers make on average 81 cents on the dollar, compared to males with the same education, work experience, union membership, region of residence and occupation. The t test of -4.99 on 523 d.f. is highly significant, indicating that the difference is too large to be due to chance.

(d) Check whether the returns to work experience are the same for males and females.

We add an interaction between the indicator for females and work experience and test its significance.

. gen workXfem = workexp * female

. regress logwages educ workexp union south i.occupation female workXfem

      Source │       SS           df       MS      Number of obs   =       534
─────────────┼──────────────────────────────────   F(11, 522)      =     26.11
       Model │  52.6813635        11  4.78921486   Prob > F        =    0.0000
    Residual │  95.7654588       522  .183458733   R-squared       =    0.3549
─────────────┼──────────────────────────────────   Adj R-squared   =    0.3413
       Total │  148.446822       533  .278511862   Root MSE        =    .42832

──────────────┬────────────────────────────────────────────────────────────────
     logwages │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
──────────────┼────────────────────────────────────────────────────────────────
    education │   .0714044   .0099382     7.18   0.000     .0518806    .0909282
      workexp │   .0140507   .0022475     6.25   0.000     .0096353     .018466
  unionmember │   .1964052   .0509929     3.85   0.000     .0962287    .2965817
        south │   -.110057    .041532    -2.65   0.008    -.1916474   -.0284666
              │
   occupation │
       Sales  │   -.344788   .0914325    -3.77   0.000    -.5244089   -.1651672
    Clerical  │  -.2024009   .0762767    -2.65   0.008     -.352248   -.0525538
     Service  │  -.3819969   .0811085    -4.71   0.000    -.5413362   -.2226577
Professional  │  -.0367114   .0727433    -0.50   0.614    -.1796169    .1061942
       Other  │  -.1890991     .07595    -2.49   0.013    -.3383042    -.039894
              │
       female │  -.0858201   .0678538    -1.26   0.207      -.21912    .0474799
     workXfem │  -.0069357   .0030461    -2.28   0.023    -.0129198   -.0009516
        _cons │   1.158463   .1757792     6.59   0.000     .8131414    1.503784
──────────────┴────────────────────────────────────────────────────────────────
> m3 <- update(m2, ~ . + workexp:female)

> summary(m3)

Call:
lm(formula = log(wages) ~ education + workexp + unionmember + 
    south + occupation + female + workexp:female, data = wages)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.35253 -0.28561 -0.00226  0.29440  1.71607 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)             1.158463   0.175779   6.590 1.07e-10 ***
education               0.071404   0.009938   7.185 2.34e-12 ***
workexp                 0.014051   0.002248   6.252 8.45e-10 ***
unionmember             0.196405   0.050993   3.852 0.000132 ***
south                  -0.110057   0.041532  -2.650 0.008295 ** 
occupationSales        -0.344788   0.091432  -3.771 0.000181 ***
occupationClerical     -0.202401   0.076277  -2.654 0.008209 ** 
occupationService      -0.381997   0.081109  -4.710 3.18e-06 ***
occupationProfessional -0.036711   0.072743  -0.505 0.614003    
occupationOther        -0.189099   0.075950  -2.490 0.013092 *  
female                 -0.085820   0.067854  -1.265 0.206515    
workexp:female         -0.006936   0.003046  -2.277 0.023195 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4283 on 522 degrees of freedom
Multiple R-squared:  0.3549,	Adjusted R-squared:  0.3413 
F-statistic: 26.11 on 11 and 522 DF,  p-value: < 2.2e-16

The interaction term is significant at the conventional five percent level. The results show that hourly wages are 1.4% higher per year of experience for males, but only 0.7% higher per year of experience for females with the same characteristics. In other words, the estimated returns to experience for women are half those of men. (I kept work experience uncentered because 0 work experience is meaningful and interesting. Had I centered it I would have obtained the gender gap at mean experience as the "main" effect of female.)

(e) How does the result of part (d) affect the conclusions of part (c)? Do we still have a gender gap?

. sum workexp

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
     workexp │        534     17.8221    12.37971          0         55

. di exp(_b[female]) - 1, exp(_b[female] + r(mean) * _b[workXfem]) - 1
-.08224064 -.18895215
> mwe <- mean(wages$workexp)

> b <- coef(m3)

>  c(exp(b["female"]),  exp(b["female"] + mwe * b["workexp:female"])) - 1
     female      female 
-0.08224064 -0.18895215 

Yes, this result affects our conclusions. In part (c) we estimated an average gender gap across the entire range of experience. In fact the gap is smaller among new workers, but increases over time. Among new workers women make 8% less than comparable men, but the gap increases by 0.7 percentage points each year, so at the mean work experience of 17.8 years it has become 19%.
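The same arithmetic traces the gap at any experience level; a small sketch in Python (coefficients of female and of the interaction copied from the model above):

```python
from math import exp

# Coefficients of female and workexp:female from the interaction model.
B_FEMALE, B_INTERACTION = -0.0858201, -0.0069357

def gender_gap(years):
    """Relative female-male wage difference at a given work experience."""
    return exp(B_FEMALE + years * B_INTERACTION) - 1

for years in (0, 10, 17.8221, 30):
    print(f"{years:5.1f} years: {gender_gap(years):+.1%}")
```

At 0 and 17.8221 years this reproduces the -8.2% and -18.9% figures computed in the logs above.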

[3] Regression Diagnostics

(a) Calculate and plot the jack-knifed residuals versus fitted values for the final model using log wages. What does the plot tell us? Any outliers?

. predict ljack, rstu

. predict lfv
(option xb assumed; fitted values)

. scatter ljack lfv

. graph export ps2fig2.png, width(500) replace
(file ps2fig2.png written in PNG format)
> wages <- mutate(wages, ljack = rstudent(m3), lfv = fitted(m3))

> ggplot(wages, aes(lfv, ljack)) + geom_point()

> ggsave("ps2fig2r.png", width=500/72, height=400/72, dpi=72)
[Figure: Residual Plot for Log Wages]

The graph is much improved, although there is still a hint of lower variance at very low wages. The outlier we noticed before has moved in a bit, but now we have a worker whose relative wages are almost six standard deviations lower than predicted.

(b) Run a quick Q-Q plot to check the normality of the residuals. How does the graph look? (No formal test required.)

To use ggplot you need to compute the quantiles "by hand" as shown in the R logs. We'll do it the easy way with qqnorm().

. qnorm ljack

. graph export ps2fig3.png, width(500) replace
(file ps2fig3.png written in PNG format)
> png("ps2fig3.png", width=500, height = 400, units = "px") 

> qqnorm(wages$ljack)

> dev.off()
windows 
      2 
[Figure: Normality Plot]

We see that except for the two outliers just noted, the observations are not far from normality. These two, however, are so far out that almost any test would reject the assumption of normal residuals. (You may try swilk or sktest as tests of normality.)
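The "by hand" quantile computation mentioned above just pairs sorted residuals with standard-normal quantiles at plotting positions; a minimal sketch in Python using only the standard library (the (i - 0.5)/n positions are one common convention; qqnorm uses a slightly different rule for very small samples):

```python
from statistics import NormalDist

def qq_points(residuals):
    """Pair each sorted residual with a theoretical standard-normal
    quantile at plotting position (i - 0.5) / n."""
    n = len(residuals)
    nd = NormalDist()
    positions = [(i - 0.5) / n for i in range(1, n + 1)]
    return [(nd.inv_cdf(p), r) for p, r in zip(positions, sorted(residuals))]
```

Plotting these pairs reproduces the Q-Q plot; points that drift off the 45-degree line in the tails are the outliers noted above.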

(c) Calculate the leverages. Do we have any observation with high leverage? Identify the observation with the highest leverage. Can you tell why it has potential influence?

. predict lev, lev

. gen n = _n

. scatter lev n

. graph export ps2fig4.png, width(500) replace    
(file ps2fig4.png written in PNG format)
> wages <- mutate(wages, lev = hatvalues(m3), n = row_number())

> ggplot(wages, aes(n, lev)) + geom_point()

> ggsave("ps2fig4r.png", width=500/72, height=400/72, dpi=72)
[Figure: Leverages]

We see three observations with comparatively large leverages, but none of these is large enough to cause concern. Let us list the three largest ones:

. list n wages educ workexp female lev if lev > .06

     ┌──────────────────────────────────────────────────────┐
     │   n   wages   educat~n   workexp   female        lev │
     ├──────────────────────────────────────────────────────┤
 63. │  63       7          3        55        0   .0621579 │
347. │ 347       6          4        54        0   .0672973 │
351. │ 351    3.75          2        16        0   .0718959 │
     └──────────────────────────────────────────────────────┘

. sum educ

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
   education │        534    13.01873    2.615373          2         18
> wages %>% filter(lev > .06) %>% 
+     select(n, wages, education, workexp, female, lev)
    n wages education workexp female        lev
1  63  7.00         3      55      0 0.06215793
2 347  6.00         4      54      0 0.06729734
3 351  3.75         2      16      0 0.07189593

> mean(wages$education)
[1] 13.01873

The highest leverage is a male worker with only two years of education. He has leverage because his education is substantially below the mean, and gains additional leverage because this is combined with comparatively low work experience. (You may find it instructive to plot education versus work experience.)
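The link between leverage and distance from the mean is easiest to see in the one-predictor case, where the hat values have a closed form; a small illustration in Python (the education values here are hypothetical, not the actual data):

```python
def leverages(x):
    """Hat values for a regression on a single predictor:
    h_i = 1/n + (x_i - xbar)^2 / sum((x_j - xbar)^2)."""
    n = len(x)
    xbar = sum(x) / n
    ssx = sum((xi - xbar) ** 2 for xi in x)
    return [1 / n + (xi - xbar) ** 2 / ssx for xi in x]

# A worker far below mean education dominates the hat values.
education = [2, 12, 12, 13, 14, 14, 16, 18]
h = leverages(education)
```

The hat values always sum to the number of parameters (here 2), so one extreme observation soaks up a large share of the total leverage.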

(d) Calculate the Cook's distances. Do we have any observations that stand out? What do you think would happen to the gender gap if we refit the model omitting the observation with the highest actual influence? (No need to do the fit.)

. predict cook, cook

. scatter cook n

. graph export ps2fig5.png, width(500) replace
(file ps2fig5.png written in PNG format)
> wages <- mutate(wages, cook = cooks.distance(m3))

> ggplot(wages, aes(n, cook)) + geom_point()

> ggsave("ps2fig5r.png", width=500/72, height=400/72, dpi=72)
[Figure: Cook's Distances]

None of the Cook's distances is close to one, and many exceed the 4/(n - p) rule of thumb, but two stand out as much larger than all the others:

. gen pwages = exp(lfv)

. list n wages pwages educ occup workexp female cook if cook > .04

     ┌──────────────────────────────────────────────────────────────────────────────┐
     │   n   wages     pwages   educat~n   occupation   workexp   female       cook │
     ├──────────────────────────────────────────────────────────────────────────────┤
171. │ 171    44.5   7.999821         14   Management         1        1   .0437665 │
200. │ 200       1   10.51214         12   Management        24        0   .0634495 │
     └──────────────────────────────────────────────────────────────────────────────┘
> wages <- mutate(wages, pwages = exp(lfv))

> wages %>% filter(cook > 0.04) %>%
+ 	select(n, wages, pwages, education, workexp, female, cook)
    n wages    pwages education workexp female       cook
1 171  44.5  7.999821        14       1      1 0.04376649
2 200   1.0 10.512145        12      24      0 0.06344948

The largest one is a male manager with 24 years of experience who makes only $1 an hour, when we expect $10.51. Were we to omit this observation we would expect the gender gap to increase. The second one is a female manager with almost no experience who makes $44.50. Omitting her would also increase the gender gap. They would, however, have opposite effects on the interaction term. Interestingly, none of the high-leverage cases ended up having actual influence on the fit.
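The 4/(n - p) rule of thumb is easy to make concrete; with n = 534 observations and p = 12 parameters in the interaction model, the cutoff and the two large distances compare as follows (a quick Python check, numbers copied from the output above):

```python
n, p = 534, 12              # observations and parameters (incl. intercept)
threshold = 4 / (n - p)     # rule-of-thumb cutoff for Cook's distance

# The two influential cases identified above dwarf the cutoff.
for d in (0.0437665, 0.0634495):
    print(f"D = {d:.4f}, {d / threshold:.1f} times the cutoff of {threshold:.4f}")
```
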

(e) Did we need to transform the data? Was the natural logarithm a good transformation? Explore these questions in a Box-Cox framework by plotting the profile log-likelihood for parameters in the range (-2,2) and formally testing for the identity and log transformations.

It turns out that boxcox doesn't allow factor variables, so we compute the occupation dummies and the interaction term ourselves. The code below uses value labels to name the occupation dummies.

. forvalues cat = 2/6 {
  2.     local name : label (occupation) `cat'
  3.     gen occ_`name' = occupation == `cat'
  4. }    

We now follow the procedure outlined in the Stata logs to plot the profile log-likelihood.

. gen lambda = .
(534 missing values generated)

. gen ll = .
(534 missing values generated)

. local xvars educ workexp union south occ_* female workXfem

. local i = 0

. forvalues p = -2(0.5)2 {
  2.     quietly boxcox wages `xvars', iterate(0) from(`p',copy)
  3.     local i = `i' + 1
  4.     quietly replace lambda = `p' in `i'
  5.     quietly replace ll = e(ll) in `i'
  6. }

. graph twoway mspline ll lambda in 1/9, bands(9) ytitle(Profile log-lik)

. graph export ps2fig6.png, width(500) replace
(file ps2fig6.png written in PNG format)

We use the MASS library as shown in the R logs.

> require(MASS)

> m4 <- update(m1, ~ . + workexp:female)

> png("ps2fig6r.png", width=500, height=400, unit="px")

> bc <- boxcox(m4)

> dev.off()
[Figure: Box-Cox Profile Log-Likelihood]

You may want to run the boxcox command to see the tests in the summary table at the end. I'll run it quietly to get the estimate and log-likelihood and then compute the test from the output I already have:

. quietly boxcox wages `xvars'

. di _b[theta:_cons], e(ll)
-.01530901 -1398.4514

. quietly gen chisq = -2*(ll - e(ll))

. list lambda ll chisq if inlist(lambda, -1, 0, 1)

     ┌───────────────────────────────┐
     │ lambda          ll      chisq │
     ├───────────────────────────────┤
  3. │     -1   -1543.741   290.5793 │
  5. │      0   -1398.482   .0612271 │
  7. │      1   -1526.194   255.4846 │
     └───────────────────────────────┘

Obviously the maximum is not far from the log transformation, but boxcox doesn't tell us exactly where. We can find it using optimize().

> f <- function(lam) -boxcox(m4, lambda=lam, plotit=FALSE)$y

> opt <- optimize(f, interval = c(-1,1))

> opt
$minimum
[1] -0.01530976

$objective
[1] 1217.997

So the best transformation is -0.015. Because f() returns the negative of the profile log-likelihood, the objective of 1217.997 corresponds to a log-likelihood of -1217.997 on MASS's scale (which differs from Stata's by an additive constant, as the matching chi-squared values below confirm). The main use of this information is for testing selected transformations.

> tests <- data.frame(lambda = c(-1, 0, 1))

> g <- function(lam) 2 * (f(lam) - opt$objective)

> tests <- mutate(tests, chisq = g(lambda))	

> tests
  lambda        chisq
1     -1 290.57935348
2      0   0.06123391
3      1 255.48446078

The results show that we would soundly reject the hypothesis that no transformation is needed, with a chi-squared of 255.48 on one d.f. On the other hand we can't reject the log transformation, with a chi-squared of 0.06 on one d.f.; the log is also very close to the optimal transformation, which corresponds to the power -0.015.
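The likelihood-ratio tests in both the Stata and R runs reduce to the same arithmetic; a compact recomputation in Python using the profile log-likelihoods reported in the Stata log above:

```python
# Profile log-likelihoods from the Stata boxcox run above.
LL_MAX = -1398.4514                      # at the optimal lambda of -0.0153
LL = {-1: -1543.741, 0: -1398.482, 1: -1526.194}

CRITICAL = 3.84                          # chi-squared(1) at the 5% level
for lam, ll in LL.items():
    chisq = 2 * (LL_MAX - ll)            # likelihood-ratio statistic
    verdict = "reject" if chisq > CRITICAL else "retain"
    print(f"lambda = {lam:2d}: chi-squared = {chisq:8.2f} -> {verdict}")
```

This reproduces the chi-squared column of the listing: the identity and reciprocal transformations are rejected, the log is retained.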