Germán Rodríguez
Introducing R Princeton University

6 Next Steps

These notes have hardly scratched the surface of R, which has many more statistical functions. These include functions to calculate the density, cdf, and inverse cdf of distributions such as chi-squared, t, F, lognormal, logistic and others.
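As a brief illustration, the distribution functions in base R share a naming scheme: the prefix `d` gives the density, `p` the cdf, and `q` the inverse cdf (quantile function). The sketch below uses arbitrary values chosen only for the example.

```r
# Density, cdf and inverse cdf for a few of the distributions mentioned above
d0     <- dnorm(0)                  # standard normal density at zero
p_chi  <- pchisq(3.84, df = 1)      # chi-squared cdf with 1 d.f., close to 0.95
q_chi  <- qchisq(0.95, df = 1)      # inverse cdf: 95th percentile, about 3.84
t_crit <- qt(0.975, df = 20)        # two-sided 5% critical value of t with 20 d.f.
f_tail <- pf(4, df1 = 2, df2 = 30, lower.tail = FALSE)  # upper tail area of F
ln_med <- qlnorm(0.5, meanlog = 0, sdlog = 1)           # lognormal median, exp(0) = 1
lg_mid <- plogis(0)                 # logistic cdf at zero is 0.5

c(d0 = d0, p_chi = p_chi, q_chi = q_chi, t_crit = t_crit)
```

Replacing the prefix with `r`, as in `rnorm()` or `rchisq()`, generates random draws from the same distributions.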

The survival package includes methods for the estimation of survival curves, tests of differences between survival curves, and Cox proportional hazards models. The lme4 package includes code for fitting generalized linear mixed-effects models, including multilevel models. Many new statistical procedures are first made available to the research community in the form of R functions.
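Here is a minimal sketch of the three survival tasks just listed. It assumes the survival package is installed (it ships with most R distributions) and uses the `lung` data that come with it. The choice of data and predictors is only for illustration.

```r
library(survival)

# Kaplan-Meier estimates of the survival curve for each sex
km <- survfit(Surv(time, status) ~ sex, data = lung)
summary(km, times = c(180, 365))

# Log-rank test of the difference between the two survival curves
lr <- survdiff(Surv(time, status) ~ sex, data = lung)
lr

# Cox proportional hazards model with age and sex as predictors
cox <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(cox)
```

Calling `plot(km)` draws the estimated curves.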

To produce really nice graphs, consider installing the ggplot2 package. To draw a plot you specify a data frame, aesthetics that map variables to aspects of the graph, and geometries that specify whether to use points, lines, or other primitives. You will find more information at https://ggplot2.tidyverse.org/.
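The three ingredients can be seen in a short sketch, which assumes ggplot2 is installed and uses the built-in `mtcars` data. The variables and labels are arbitrary choices for the example.

```r
library(ggplot2)

# Data frame: mtcars
# Aesthetics: weight on the x-axis, fuel efficiency on the y-axis,
#             and color determined by the number of cylinders
# Geometries: points, plus a least-squares line for each group
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders")

print(p)
```

Swapping `geom_point()` for another geometry changes the type of plot while the data and aesthetic mappings stay the same.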

For data management I recommend that you install the dplyr package, which includes tools for adding new variables, selecting cases or variables (rows or columns), as well as summarizing and re-arranging your data. Check the overview at https://dplyr.tidyverse.org/.
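The sketch below chains the main dplyr verbs on the built-in `mtcars` data, assuming dplyr is installed. The derived variable `kpl` and the cutoff of 100 horsepower are made up for the example.

```r
library(dplyr)

fuel <- mtcars %>%
  mutate(kpl = mpg * 0.425) %>%                   # add a new variable (km per liter)
  filter(hp > 100) %>%                            # select cases (rows)
  select(cyl, hp, kpl) %>%                        # select variables (columns)
  group_by(cyl) %>%
  summarize(n = n(), mean_kpl = mean(kpl)) %>%    # summarize within groups
  arrange(desc(mean_kpl))                         # re-arrange the rows

fuel
```

Each verb takes a data frame and returns a data frame, which is what allows the steps to be chained with the pipe.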

You can also run install.packages("tidyverse") to install all the packages in the tidyverse, including ggplot2 and dplyr, as well as tidyr (for help tidying data), readr (for reading rectangular data like csv files), purrr (for an alternative to loops), tibble (for tidy data frames), stringr (for working with strings) and forcats (for working with factors). Learn more at https://www.tidyverse.org/packages/.

In addition, R is a full-fledged programming language, with a rich complement of mathematical functions, matrix operations and control structures. It is very easy to write your own functions. To learn more about programming in R, I recommend Advanced R by Wickham (2019).
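To give a flavor of these features, here is a small sketch with a user-defined function and a matrix calculation. The function name `cv` is hypothetical, chosen just for this example.

```r
# A user-defined function: the coefficient of variation,
# with an option to drop missing values
cv <- function(x, na.rm = FALSE) {
  if (!is.numeric(x)) stop("x must be numeric")
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

cv(c(2, 4, 6))                      # sd is 2 and mean is 4, so this returns 0.5
cv(c(2, 4, 6, NA), na.rm = TRUE)    # same data with a missing value dropped

# Matrix operations: least-squares coefficients from the normal equations
X <- cbind(1, mtcars$wt)            # design matrix with a constant and weight
y <- mtcars$mpg
b <- solve(t(X) %*% X, t(X) %*% y)  # solves (X'X) b = X'y
b
```

The coefficients in `b` agree with those from `lm(mpg ~ wt, data = mtcars)`, which uses a more stable QR decomposition.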

R is an interpreted language but it is reasonably fast, particularly if you take advantage of the fact that operations are vectorized, and try to avoid looping. Where efficiency is crucial you can always write a function in a compiled language such as C or Fortran and then call it from R. Some of my work on multilevel generalized linear models used this approach.
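A simple way to see the point about vectorization is to compute a sum of squares with an explicit loop and with a vectorized expression. The function names below are hypothetical, and timings will vary by machine and R version, so none are quoted here.

```r
set.seed(1)
x <- rnorm(1e6)

# Explicit loop over the elements
ss_loop <- function(x) {
  total <- 0
  for (xi in x) total <- total + xi^2
  total
}

# Vectorized: square every element at once, then add them up
ss_vec <- function(x) sum(x^2)

all.equal(ss_loop(x), ss_vec(x))    # the two versions agree
system.time(ss_loop(x))
system.time(ss_vec(x))
```

The vectorized version is also shorter and easier to read, which is often as good a reason to prefer it as speed.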

Last, but most certainly not least, you will want to learn about dynamic documents using R Markdown. The basic idea here is to combine a narrative written in Markdown with R code, an approach that has excellent support in RStudio. The definitive book on the subject is Xie, Allaire, and Grolemund (2019).
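As a rough sketch of the idea, the code below writes a minimal R Markdown file to a temporary location and renders it when the rmarkdown package and pandoc are available. The title and contents of the file are invented for this example.

```r
# Build a minimal R Markdown document: a header, a sentence of narrative,
# and one chunk of R code
fence <- strrep("`", 3)             # the three backticks that delimit a code chunk
rmd <- c(
  "---",
  "title: \"A Minimal Dynamic Document\"",
  "output: html_document",
  "---",
  "",
  "The mean fuel efficiency in the mtcars data is computed below.",
  "",
  paste0(fence, "{r}"),
  "mean(mtcars$mpg)",
  fence
)
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(rmd, rmd_file)

# Render the document only if rmarkdown and pandoc are available
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(rmd_file, quiet = TRUE)
  print(out_file)                   # path to the generated HTML file
}
```

In practice you would type the document directly in an editor such as RStudio and click Knit. Changing `html_document` to `pdf_document` in the header produces a PDF instead, provided a LaTeX installation is available.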

This tutorial has been written in R Markdown. You can download the source code and the bibliography file. To reproduce the PDF document you also need tweaks.tex. To generate an HTML document change the output specification near the top of the script.

References

Becker, Richard A., and John M. Chambers. 1984. S: An Interactive Environment for Data Analysis and Graphics. Belmont, CA: Wadsworth.

Becker, Richard A., John M. Chambers, and Allan R. Wilks. 1988. The New S Language. Pacific Grove, CA: Wadsworth.

Braun, W. John, and Duncan J. Murdoch. 2016. A First Course in Statistical Programming with R. Second Edition. Cambridge University Press.

Chambers, John M. 1998. Programming with Data. New York: Springer.

———. 2008. Software for Data Analysis: Programming with R. New York: Springer.

———. 2016. Extending R. Boca Raton, FL: Chapman & Hall/CRC.

Chambers, John M., and Trevor J. Hastie, eds. 1992. Statistical Models in S. Pacific Grove, CA: Wadsworth.

Dalgaard, Peter. 2008. Introductory Statistics with R. Second Edition. New York: Springer.

Fox, John. 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks, CA: SAGE.

Hothorn, Torsten, and Brian S. Everitt. 2014. A Handbook of Statistical Analyses Using R. Third Edition. Boca Raton, FL: CRC Press.

Krause, Andreas, and Melvin Olson. 1997. The Basics of S and S-Plus. New York: Springer.

Murrell, Paul. 2006. R Graphics. Boca Raton, FL: Chapman & Hall/CRC.

Pinheiro, José C., and Douglas M. Bates. 2000. Mixed-Effects Models in S and S-Plus. New York: Springer.

Therneau, Terry M., and Patricia M. Grambsch. 2000. Modeling Survival Data: Extending the Cox Model. New York: Springer.

Venables, W. N., and B. D. Ripley. 2000. S Programming. New York: Springer.

———. 2002. Modern Applied Statistics with S. Fourth Edition. New York: Springer.

Wickham, Hadley. 2016. ggplot2: Elegant Graphics for Data Analysis. Second Edition. New York: Springer. https://ggplot2-book.org/.

———. 2019. Advanced R. Second Edition. Boca Raton, FL: CRC Press. https://adv-r.hadley.nz/.

Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science. Sebastopol, CA: O’Reilly. https://r4ds.had.co.nz/.

Wilkinson, Leland. 2005. The Grammar of Graphics. Second Edition. New York: Springer.

Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2019. R Markdown: The Definitive Guide. Boca Raton, FL: Chapman & Hall/CRC. https://bookdown.org/yihui/rmarkdown/.