﻿ 2.10 Transforming the Data

## 2.10 Transforming the Data

The final section in this chapter deals with Box-Cox transformations To avoid problems with negative values of the response variable, we add 1/2 to all observations:

```. gen y = change + 0.5
```

Stata has a powerful `boxcox` command that can fit models where both the response and optionally (a subset of) the predictors are transformed. (Earlier versions could transform only the outcome, but in exchange provided a few additional options, including a plot that we will now do 'by hand'.)

#### The Box-Cox Transformation

We will determine the optimal transformation for the analysis of covariance model of Section 2.8. If you are running this in a different session you will need to redefine the local macro with the predictors:

```. local predictors setting effort_mod effort_str
```

We are interested in transforming the outcome or 'left-hand-side' only. I will specify the option `model(lhs)` to make this clear, although it is the default and can be omitted. I will also specify `nolog` to suppress the iteration log:

```. boxcox y `predictors', model(lhs) nolog
Fitting comparison model

Fitting full model

Number of obs   =         20
LR chi2(3)      =      29.29
Log likelihood = -59.245917                       Prob > chi2     =      0.000

------------------------------------------------------------------------------
y |      Coef.   Std. Err.      Z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
/theta |   .6686079    .167689     3.99   0.000     .3399435    .9972724
------------------------------------------------------------------------------

Estimates of scale-variant parameters
----------------------------
|      Coef.
-------------+--------------
Notrans      |
setting |   .0945824
effort_mod |   1.957216
effort_str |   7.661322
_cons |  -3.247024
-------------+--------------
/sigma |   2.282724
----------------------------

---------------------------------------------------------
Test         Restricted     LR statistic      P-value
H0:       log likelihood       chi2       Prob > chi2
---------------------------------------------------------
theta = -1      -100.56379        82.64           0.000
theta =  0       -67.15625        15.82           0.000
theta =  1      -61.068635         3.65           0.056
---------------------------------------------------------

. scalar maxlogL = e(ll)
```

As you can see, Stata suggests a power of 0.6686, which is in good agreement with what one would expect from Figure 2.8 in the notes. We'll see how to do this figure below. For now, note that we saved the maximized log-likelihood, which was available as `e(ll)`, in a scalar called `maxlogL`. (To see a list of all quantities available for extraction after an estimation command type `ereturn list`.)

Stata also fits the model using the optimal transformation and shows the resulting coefficients, but not the standard errors. The latter are supressed because they do not account for the fact that estimating the transformation itself introduces additional uncertainty. To test the significance of a coefficient you can compare the models with and without the corresponding variable using a likelihood ratio test. Note, however, that dropping a variable may change the transformation being used.

My preferred approach is to use the Box-Cox procedure as general guidance on whether a transformation is needed and, if so, which value in the 'ladder of powers' would do a good job. Having settled on something like taking square roots, logs, or reciprocals, one can then proceed conditionally on the chosen transformation. Stata can help implement this approach in two ways.

First, Stata shows likelihood ratio tests for the hypotheses that the Box-Cox parameter is -1, 0 and 1, which correspond to the reciprocal, the log, and no transformation at all. The last possibility cannot be rejected at the conventional five percent level, indicating that there is no evidence that we need to transform the response. The log and reciprocal transformations are both soundly rejected. If one insisted on transforming the data, taking square roots would probably be best.

Second, we can plot a profile likelihood showing the relative merit of various transformations. Stata 6 used to do a graph similar to what we need as an option of the `boxcox` command, but the option is not available in later versions. This provides us an opportunity to do a little programming exercise. (We could, of course, type `version 6` and have Stata behave as it did back then. One disadvantage of this approach is that we have no control over the range of transformations plotted. Also, version 6 used to omit a constant from the log-likelihood, so the reported values need to be adjusted for comparison with later versions.)

#### The Profile Log-Likelihood

It turns out that we can compute the Box-Cox log-likelihood for any value of the parameter using two options of the `boxcox` command which deal with the maximization procedure. We specify the transformation as a starting value with the option `from(value, copy)`, and set the maximum number of iterations to zero with `iterate(0)`, so Stata simply computes the log-likelihood, which we can then retrieve from `e(ll)`. A hack, really, but it beats having to program your own function.

Next we write a short loop to compute the log-likelihood for exponent values between -1 and 2 in steps of 0.5. We also create two new variables, `p` to store the exponents, and `logL` to store the log-likelihoods. (If you want to learn more about Stata macros and loops see part 4 of my Stata Tutorial.)

```. gen p = .
(20 missing values generated)

. gen logL = .
(20 missing values generated)

. local I = 1

. forvalues p = -1(0.5)2  {
2.         quietly boxcox y `predictors', from(`p',copy) iterate(0)
3.         quietly replace p = `p' in `I'
4.         quietly replace logL = e(ll) in `I'
5.         local I = `I' + 1
6. }
```

The graph that follows uses a spline to join the points using a smooth curve. We also draw a horizontal line to identify powers that are not significantly different from the best. This occurs when twice the difference in log-likelihoods is less than 3.84, the 95% critical value for a chi-squared with one d.f. In the scale of logL this makes the line 3.84/2 units below the highest point of the curve.

```. gen cb =  maxlogL - 3.84/2  if p > -0.5 & p < 2
(16 missing values generated)

. graph twoway (mspline logL p, bands(7)) (line cb p) ,   ///
>    title("Figure 2.8: Box-Cox Profile Log-Likelihood")  ///
>    xtitle("lambda") ytitle("log-likelihood") legend(off)

. graph export fig28.png, width(500) replace
(file fig28.png written in PNG format)
```

#### Atkinson's Score Test

Our final calculation involves Atkinson's score test, which requires fitting the auxiliary variable given in Equation 2.31 in the notes. We compute the geometric mean, store it in a scalar called `gmean`, use this to compute the auxiliary variable `atkinson`, and then fit the extended model:

```. gen logy = ln(y)

. quietly summarize logy

. scalar gmean = exp(r(mean))

. gen atkinson = y * (ln(y/gmean) - 1 )

. regress change `predictors' atkinson

Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  4,    15) =   23.67
Model |  2287.80568     4  571.951421           Prob > F      =  0.0000
Residual |  362.394315    15   24.159621           R-squared     =  0.8633
Total |      2650.2    19  139.484211           Root MSE      =  4.9152

------------------------------------------------------------------------------
change |      Coef.   Std. Err.      T    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
setting |   .1969659   .0911353     2.16   0.047     .0027155    .3912163
effort_mod |   3.785032   2.739944     1.38   0.187     -2.05502    9.625084
effort_str |   11.66637   4.380003     2.66   0.018     2.330614    21.00212
atkinson |   .5916301   .2275638     2.60   0.020     .1065895    1.076671
_cons |  -3.858157   6.197538    -0.62   0.543     -17.0679    9.351583
------------------------------------------------------------------------------
```

The coefficient of the auxiliary variable is 0.59, so the optimal power is approximately 1-0.59 = 0.41, suggesting again that something like a square root transformation might be indicated. The associated t-statistic is significant at the two percent level, but the more accurate likelihood ratio test statistic calculated earlier was only borderline. Thus, we do not have strong evidence against keeping the response in the original scale.

Exercise 1: Try the Box-Tidwell procedure of Equation 2.32 in the notes to see if a transformation of social setting would be indicated.

Exercise 2: Run `boxcox` to estimate optimal (and possibly different) transformations of change and setting, but obviously not of the two dummies representing levels of effort.

Continue with 3 Logit Models in Stata