
The purpose of these notes, an update of my 1992 handout Introducing S-Plus, is to provide a quick introduction to R, particularly as a tool for fitting linear and generalized linear models.

Contents

1 Introduction
  1.1 The R Language and Environment
  1.2 Bibliographic Remarks

2 Getting Started
  2.1 The R Console
  2.2 Expressions and Assignments
  2.3 Vectors and Matrices
  2.4 Simple Graphs

3 Reading Data
  3.1 Lists and Data Frames
  3.2 Free-Format Input
  3.3 Fixed-Format Input
  3.4 Printing Data and Summaries
  3.5 Plotting Data

4 Linear Models
  4.1 Fitting a Model
  4.2 Examining a Fit
  4.3 Extracting Results
  4.4 Factors and Covariates
  4.5 Regression Splines
  4.6 Other Options

5 Generalized Linear Models
  5.1 Variance and Link Families
  5.2 Logistic Regression
  5.3 Updating Models
  5.4 Model Selection

6 Conclusion

References

A printer-friendly PDF version is available here.

1 Introduction

R is a powerful environment for statistical computing that runs on several platforms. These notes are written specifically for users running the Windows version, but most of the material applies to the Mac and Linux versions as well.

1.1 The R Language and Environment

R was first written as a research project by Ross Ihaka and Robert Gentleman, and is now under active development by a group of statisticians called 'the R core team', with a home page at www.r-project.org.

R was designed to be 'not unlike' the S language developed by John Chambers and others at Bell Labs. A commercial version of S with additional features was developed and marketed as S-Plus by Statistical Sciences, which later became Insightful and is now TIBCO Spotfire. R and S-Plus can best be viewed as two implementations of the S language.

R is available free of charge and is distributed under the terms of the Free Software Foundation's GNU General Public License. You can download the program from the Comprehensive R Archive Network (CRAN). Ready-to-run 'binaries' are available for Windows, Mac OS X, and Linux. The source code is also available for download and can be compiled for other platforms.

These notes are organized in several sections, as shown in the table of contents. I have tried to introduce key features of R as they are needed by students in my statistics classes. As a result, I often postpone (or altogether omit) discussion of some of the more powerful features of R as a programming language.

Notes of local interest, such as where to find R at Princeton University, appear in framed boxes and are labeled as such. Permission is hereby given to reproduce these pages freely and host them on your own server if you wish. You may add, edit or delete material in the local notes as long as the rest of the text is left unchanged and due credit is given. Obviously, I welcome corrections and suggestions for enhancement.

1.2 Bibliographic Remarks

S was first introduced by Becker and Chambers (1984) in what is known as the 'brown' book. The new S language was described by Becker, Chambers and Wilks (1988) in the 'blue' book. Chambers and Hastie (1992) edited a book discussing statistical modeling in S, called the 'white' book. The latest version of the S language is described by Chambers (1998) in the 'green' book, but R is largely an implementation of the versions documented in the blue and white books. Chambers's (2008) latest book, Software for Data Analysis, focuses on programming with R.

Venables and Ripley (1994, 1997, 1999, 2002) have written an excellent book on Modern Applied Statistics with S-PLUS that is now in its fourth edition. The latest edition is particularly useful to R users because the main text explains differences between S-Plus and R where relevant. A companion volume called S Programming appeared in 2000 and applies to both S-Plus and R. These authors have also made available on their website an extensive collection of complements to their books; follow the links at MASS 4.

There is now an extensive and rapidly growing literature on R. Good introductions include the books by Krause and Olson (1997), Dalgaard (2002), and Braun and Murdoch (2007). Beginners will probably benefit from working through the examples in Everitt and Hothorn's (2006) A Handbook of Statistical Analyses Using R or Fox's (2002) companion to applied regression. Among more specialized books, my favorites include Murrell (2005), an essential reference on R graphics, Pinheiro and Bates (2000), a book on mixed models, and Therneau and Grambsch's (2000) Modeling Survival Data, which includes many analyses using S-Plus as well as SAS. (Therneau wrote the survival analysis code used in S-Plus and R.) For additional references see the annotated list at R Books.

The official R manuals are available as PDF files that come with the R distribution. These include An Introduction to R (a nice 100-page introduction), a manual on R Data Import/Export describing facilities for transferring data to and from other packages, and useful notes on R Installation and Administration. More specialized documents include a draft of the R Language Definition, a guide to Writing R Extensions, documentation on R Internals including coding standards, and finally the massive R Reference Index (~3000 pages). The online help facility is excellent. When you install R you get a choice of various help formats. I recommend compiled HTML help because you get a nice tree view of the contents, an index, a pretty decent search engine, and nicely formatted help pages. (On Unix you should probably choose HTML help.)

