3 Reading and Examining Data
R can handle several types of data, including numbers, character strings, vectors and matrices, as well as more complex data structures. In this section I describe data frames, the preferred way to organize data for statistical analysis, explain how to read data from an external file into a data frame, and show how to examine the data using simple descriptive statistics and informative plots.
3.1 Lists and Data Frames
An important data structure that we have not discussed so far is the list. A list is a set of objects that are usually named and can be anything: numbers, character strings, matrices or even lists.
Unlike a vector, whose elements must all be of the same type
(all numeric, or all character), the elements of a list may
have different types. Here's a list with two components created using the
> person = list(name="Jane", age=24)
Typing the name of the list prints all elements.
You can extract a component of a list using the extract operator
For example we can list just the name or age of this person:
> person$name  "Jane" > person$age  24
Individual elements of a list can also be accessed using their indices or
their names as subscripts. For example we can get the name using
(You can use single or double square brackets depending on whether
you want a list with the name, which is what we did, or just the name,
which would require double brackets as in
The distinction is not important at this point.)
A data frame is essentially a rectangular array containing the values of one or more variables for a set of units. The frame also contains the names of the variables, the names of the observations, and information about the nature of the variables, including whether they are numerical or categorical.
Internally, a data frame is a special kind of list, where each element is a vector of observations on a variable. Data frames look like matrices, but can have columns of different types. This makes them ideally suited for representing datasets, where some variables can be numeric and others can be categorical.
Data frames (like matrices) can also accommodate missing values,
which are coded using the special symbol
Most statistical procedures, however, omit all missing values.
Data frames can be created from vectors, matrices or lists using the function
data.frame, but more often than not one will read data from an
external file, as shown in the next two sections.
3.2 Free-Format Input
Free-format data are text files containing numbers or character strings separated by spaces. Optionally the file may have a header containing variable names. Here's an example of a data file containing information on three variables for 20 countries in Latin America:
setting effort change Bolivia 46 0 1 Brazil 74 0 10 Chile 89 16 29 Colombia 77 16 25 CostaRica 84 21 29 Cuba 89 15 40 DominicanRep 68 14 21 Ecuador 70 6 0 ElSalvador 60 13 13 Guatemala 55 9 4 Haiti 35 3 0 Honduras 51 7 7 Jamaica 87 23 21 Mexico 83 4 9 Nicaragua 68 0 7 Panama 84 19 22 Paraguay 74 3 6 Peru 73 0 2 TrinidadTobago 84 15 29 Venezuela 91 7 11
This small dataset includes an index of social setting, an index of family
planning effort, and the percent decline in the crude birth rate between 1965
The data are available at
in a file called
effort.dat which includes a header with the variable names.
R can read the data directly from the web:
> fpe <- read.table("http://data.princeton.edu/wws509/datasets/effort.dat")
The function used to read data frames is called
The argument is a character string giving the name of the file containing the data,
but here we have given it a fully qualified url (uniform resource locator),
and that's all it takes.
Alternatively, you could download the data and save them in a local file,
or just cut and paste the data from the browser to an editor such as Notepad,
and then save them. Make sure the file ends up in R's working directory,
which you can find out by typing
If that is not the case you can use a fully qualified path name or
change R's working directory by calling
setwd with a string argument.
Remember to double up your backward slashes (or use forward slashes instead) when
The special symbol
<-is R's assignment operator, which we have
encountered already. Here we assigned the data to an object named
To print the data simply type the name of the object.
> fpe setting effort change Bolivia 46 0 1 Brazil 74 0 10 ... output edited ... Venezuela 91 7 11
In this example R detected correctly that the first line in our file was a
header with the variable names. It also inferred correctly that the first column
had the observation names. (Well, it did so with a little help; I made sure the
row names did not have embedded spaces, hence
Alternatively, I could have used
"Costa Rica" in quotes as a row name.)
You can always tell R explicitly whether or not you have a header by specifying
the optional argument
header=FALSE to the
This is important if you have a header but lack row names, because R's guess is
based on the fact that the header line has one less entry than the next row,
as it did in our example.
If your file does not have a header line, R will use the default variable names V1,
V2, ..., etc. To override this default use
read.table's optional argument
col.names to assign variable names.
This argument takes a vector of names. So, if our file
did not have a header we could have used the command
> fpe = read.table("noheader.dat", + col.names=c("setting","effort","change"))
Incidentally this is the first time that our command did not fit in a line.
R code can be continued automatically
in a new line simply by making it obvious that we are not done, for example
ending the line with a comma, or having an unclosed left parenthesis. R
responds by prompting for more with the continuation symbol
instead of the usual prompt
If your file does not have observation names, R will simply number the
observations from 1 to n. You can specify row names using read.table's
row.names, which works just like
?data.frame for more information.
There are two closely related functions that can be used to get or set
variable and observation names at a later time. These are called
names (for the variable names), and
(for the observation names).
Thus, if our file did not have a header we could have read the
data and then changed the default variable names using the
> fpe = read.table("noheader.dat") > names(fpe) = c("setting","effort","change")
Technical Note: If you have a background in other programming languages you may be surprised to see a function call on the left hand side of an assignment. These are special 'replacement' functions in R. They extract an element of an object and then replace its value.
In our example all three-variables were numeric. R will handle string variables with no problem. If one of our variables was sex, coded M for males and F for females, R would have created a factor, which is basically a categorical variable that takes one of a finite set of values called levels. In Section 5 we will use a data frame with categorical variables to illustrate logistic regression. Another way to generate factors is by grouping a numeric covariate. An example appears in Section 4 below.
Exercise: Use a text editor to create a small file with the following three lines:
a b c 1 2 3 4 5 6
Read this file into R so the variable names are a, b and c. Now delete the first row and read the file again so the variable names are still a, b and c.
3.3 Fixed-Format Input*
Suppose the family planning effort data had been stored in a file containing only the actual data (no country names or variable names) in a fixed format, with social setting in character positions (often called columns) 1-2, family planning effort in positions 3-4 and fertility change in positions 5-6. This is a fairly common way to organize large datasets.
The following call will read the data into a data frame and name the variables:
> fpe = read.table("fixedformat.dat", + col.names = c("setting", "effort", "change"), + sep=c(1, 3, 5))
Here I assume that the file in question is called fixedformat.dat.
I assign column names just as before, using the
The novelty lies in the next argument, called
which is used to indicate how the variables are separated.
The default is white space, which is appropriate when the
variables are separated by one or more blanks or tabs.
If the data are separated by commas, a common format with spreadsheets,
you can specify
Here we created a vector with the numbers 1, 3 and 5 to specify the
character position (or column) where each variable starts.
?read.table for more details.
3.4 Printing Data and Summaries
You can refer to any variable in the
fpe data frame using the extract
For example to look at the values of the fertility change variable, type
and R will list a vector with the values of change for the 20 countries. You can also define fpe as your default dataset by "attaching" it to your session:
If you now type the name
effort by itself, R will now look for it
fpe data frame.
If you are done with a data frame you can detach it using
To obtain simple descriptive statistics on these variables try the
> summary(fpe) setting effort change Min. :35.0 Min. : 0.00 Min. : 0.00 1st Qu.:66.0 1st Qu.: 3.00 1st Qu.: 5.50 Median :74.0 Median : 8.00 Median :10.50 Mean :72.1 Mean : 9.55 Mean :14.30 3rd Qu.:84.0 3rd Qu.:15.25 3rd Qu.:22.75 Max. :91.0 Max. :23.00 Max. :40.00
As you can see, you get the min and max, 1st and 3rd quartiles, median and
mean. For categorical variables you get a table of counts. Alternatively, you
may ask for a summary of a specific variable.
Or use the functions
for the mean and variance of a variable,
cor for the correlation between two variables, as shown below:
> mean(effort)  9.55 > cor(effort,change)  0.80083
Elements of data frames can be addressed using the subscript notation introduced in Section 2.3 for vectors and matrices. For example to list the countries that had a family planning effort score of zero we can use
> fpe[effort == 0,] setting effort change Bolivia 46 0 1 Brazil 74 0 10 Nicaragua 68 0 7 Peru 73 0 2
This works because the expression
effort == 0 selects the rows
(countries) where the effort score is zero, while
leaving the column subscript blank selects all columns (variables).
The fact that the rows are named allows yet another way to select elements: by name. Here's how to print the data for Chile:
> fpe["Chile",] setting effort change Chile 89 16 29
Exercise: Can you list the countries where social setting is high
(say above 80) but effort is low (say below 10)?
Hint: recall the element-by-element logical operator
3.5 Plotting Data
Probably the best way to examine the data is by using graphs. Here's a boxplot of setting. Inspired by a demo included in the R distribution, I used custom colors for the box ("lavender", specifyied using a name R recognizes) and the title (#3366CC, which specifies the red, green and blue components of the color in hexadecimal notation; this particular choice matches the headings on this web page).
> boxplot(setting, col="lavender") > title("Boxplot of Setting", col.main="#3366CC")
As noted earlier, R can save a plot as a png or jpeg file, so that it can be included directly on a web page. Other formats available are postscript for printing and windows metafile for embedding in other applications. Note also that you can cut and paste a graph to insert it in another document.
Here's a scatterplot of change by effort, so you can see what a correlation of 0.80 looks like:
> plot(effort, change, pch=21, bg="gold") > title("Scatterplot of Change by Effort", col.main="#3366CC")
I used two optional arguments that work well together:
pch=21 selects a special plotting symbol, in this case a circle, that
can be colored and filled; and
bf="gold" selects the fill color for the symbol.
I left the perimeter black, but you can change this color with the
To identify points in a scatterplot use the
identify function. Try
the following (assuming the scatterplot is still the active graph):
> identify(effort, change, row.names(fpe), ps=9)
The first three parameters to this function are the x and y coordinates of the
points and the character strings to be used in labeling them.
ps optional argument specifies the
size of the text in points; here I picked 9-point labels.
Now click within a quarter of an inch of a point and the name of the country should appear in the graph. Which country had the most effort but only moderate change? Which one had the most change?
To quit identifying points right click on the graph and select Stop from the pop-up menu. The function returns the indices of the units selected. (Click on the RConsole to make it the focused window before you type more commands.)
Another interesting plot to try is
pairs, which draws a scatterplot
matrix. In our example try
and you will see a 3 by 3 matrix of scatterplots with the variable names down the diagonal and a plot of each variable against every other one.
Before you quit this session consider saving the
To do this use the
> save(fpe, file="fpe.Rdata") > load("fpe.rdata")
The first argument specifies the object to be saved, and the file argument provides the name of a file, which will be in the working directory unless a full path is given. (Remember to double-up your backslashes, or use forward slashes instead.)
By default R saves objects using a compact binary format which is portable
across all R platforms.
There is an optional argument
ascii that can be set to TRUE
to save the object as ASCII text. This option was handy to transfer R objects
across platforms but is no longer needed.
The menu item File | Save Image and its companion File | Load Image can be used to save and load an image of the entire workspace, including all objects that have been created (and not removed) in the session.
Exercise: Use R to create a scatterplot of change by setting, cut and paste the graph into a document in your favorite word processor, and try resizing and printing it. I recommend that you use the windows metafile format for the cut and paste operation.
Continue with Linear Models