2. Data Management
In this section we describe Stata data files, discuss how to read raw data into Stata in free and fixed formats, how to create new variables, how to document a dataset labeling the variables and their values, and how to manage Stata system files.
Stata 11 has a new variables manager that allows editing variable names, labels, types, formats, and notes, as well as value labels, using an intuitive graphical user interface available under Data|Variables Manager in the menu system. While the manager is certainly convenient, we still prefer writing all commands in a do file to ensure research reproducibility. A nice feature of the manager, however, is that it generates the Stata commands needed to accomplish the changes, so it can be used as a learning tool and, as long as you are logging the session, leaves a record behind.
2.1 Stata Files
Stata datasets are rectangular arrays with n
observations on m variables. Unlike packages that read
one observation at a time, Stata keeps all data in memory, which is
one reason why it is so fast.
There's a limit of 2,047 variables in Stata/IC, 32,767 in Stata/SE.
You can have as many observations as your computer's memory will allow,
provided you don't go too far above 2 billion cases. (To find these
limits type help limits.)
2.1.1 Variable Names
Variable names can have up to 32 characters, but many commands
print only 12, and shorter names are easier to type. Stata
names are case sensitive, Age and
age are different variables! It pays to develop a
convention for naming variables and sticking to it. I prefer short
lowercase names and tend to use single words or abbreviations
rather than multi-word names, for example I prefer
effort or fpe to
family_planning_effort or
familyPlanningEffort, although all four names are
legal. Note the use of underscores or camel casing to
separate words.
2.1.2 Variable Types
Variables can contain numbers or strings. Numeric variables can
be stored as integers (bytes, integers, or longs) or floating point
(float or double). These types differ in the range or precision of
the values they can hold, type help datatype for
details.
You usually don't need to be concerned about the storage mode;
Stata does all calculations using doubles, and the
compress command will find the most economical way to
store each variable in your dataset, type help
compress to learn more.
You do have to be careful with logical comparisons involving floating point types. If you store 0.1 in a float called x you may be surprised to learn that x == 0.1 is never true. The reason is that 0.1 is "rounded" to different binary numbers when stored as a float (x) or as a double (the constant 0.1). This problem does not occur with integers or strings.
String variables can have varying lengths up to 244 characters (earlier versions of Intercooled Stata allowed only 80 characters). Strings are ideally suited for id variables because they can be compared without problems.
Sometimes you may need to convert between numeric and string
variables. If a variable has been read as a string but really
contains numbers you will want to use the command destring
or the function real().
Otherwise, you can use encode to convert string data into a numeric variable
or decode to convert numeric variables to strings.
These commands rely on value labels, which are described below.
2.1.3 Missing Values
Like other statistical packages, Stata distinguishes
missing values. The basic missing value for numeric
variables is represented by a dot . Starting with
version 8 there are 26 additional missing-value codes denoted
by .a to .z. These values are
represented internally as very large numbers, so
valid_numbers< . < .a < ... <
.z.
To check for missing you need to write var >= .
(not var == .). Stata has a function that can do this
comparison, missing(varname) and I recommend it because it
leads to more readable code, e.g.
I prefer list id if missing(age) to
list id if age >= .
Missing values for string variables are denoted by "", the empty string; not to be confused with a string that is all blanks, such as " ".
Demographic survey data often use codes such as 88 for not applicable and 99 for not ascertained. For example age at marriage may be coded 88 for single women and 99 for women who are known to be married but did not report their age at marriage. You will often want to distinguish these two cases using different kinds of missing value codes. If you wanted to recode 88's to .n (for "na" or notapplicable) and 99's to .m (for "missing") you could use the code
. replace ageAtMar = .n if ageAtMar == 88 . replace ageAtMar = .m if ageAtMar == 99Sometimes you want to tabulate a variable including missing values but excluding not applicable cases. If you will be doing this often you may prefer to leave 99 as a regular code and define only 88 as missing. Just be careful if you then run a regression!
Stata ships with a number of small datasets, type sysuse
dir to get a list. You can use any of these by typing
sysuse name. The Stata website is also a repository
for datasets used in the Stata manuals and in a number of
statistical books.
2.2 Reading Data Into Stata
In this section we discuss how to read raw data
files. If your data come from another statistical package, such as
SAS or SPSS, consider using a tool such as Stat/Transfer (www.stattransfer.com) or
DBMSCopy (www.dataflux.com).
Stata can read SAS transport files with the fdause command
(so-named because this is the format required by the Food and Drug
Administration), type help fdause. Stata can also read
data from relational databases, type help odbc for an
introduction.
2.2.1 Free Format
If your data are in free format, with
variables separated by blanks, commas, or tabs, you can use
the infile command.
For an example of a free format file see the family planning effort data available on the web at http://data.princeton.edu/wws509/datasets (read the description and click on effort.raw). This is essentially a text file with four columns, one with country names and three with numeric variables, separated by white space. We can read the data into Stata using the command
. infile str14 country setting effort change using /// > http://data.princeton.edu/wws509/datasets/effort.raw
The infile command is followed by the
names of the variables. Because the country name is a string rather
than a numeric variable we precede the name with
str14, which sets the type of the variable as a string
of up to 14 characters. All other variables are numeric, which is
the default type.
The keyword using is followed by the name of the
file, which can be a file on your computer, a local network, or the
internet. In this example we are reading the file directly off the
internet. And that's all there is to it. For more information on
this command type help infile1.
To see what we got we can list
a few cases
. list in 1/3
Spreadsheet packages such as Excel often export data separated
by tabs or commas with one observation per line. Sometimes the
first line has the names of the variables. If your data are in this
format you can read them using the insheet
command. This command is a bit simpler than
infile, but unfortunately doesn't work with
blank-separated values. Type help insheet to learn
more.
2.2.2 Fixed Format
Survey data often come in fixed format, with one or more records per case and each variable in a fixed position in each record.
The simplest way to read fixed-format data is using the
infix command to specify the columns where each
variable is located. As it happens, the effort data are
neatly lined up in columns, so we could read them as follows:
infix str country 4-17 setting 23-24 effort 31-32 change 40-41 using /// > http://data.princeton.edu/wws509/datasets/effort.raw, clear
This says to read the country name from columns 4-17, setting
from columns 23-24, and so on. It is, of course, essential to read
the correct columns. We specified that country was a string
variable but didn't have to specify the width, which was clear from
the fact that the data are in columns 4-17. The clear
option is used to overwrite the existing dataset in memory.
If
you have a large number of variables you should consider typing the
names and locations on a separate file, called a dictionary,
which you can then call from the infix command. Try
typing the following dictionary into a file called
effort.dct:
infix dictionary using http://data.princeton.edu/wws509/datasets/effort.raw {
str country 4-17
setting 23-24
effort 31-32
change 40-41
}
Dictionaries accept only * comments and these must appear after the first line. After you save this file you can read the data using the command
. infix using effort.dct, clear
Note that you now 'use' the dictionary, which in turn
'uses' the data file. Instead of specifying the name of the
data file in the dictionary you could specify it as an option to
the infix command, using the form infix using dictionaryfile,
using(datafile). The first 'using' specifies
the dictionary and the second 'using' is an option
specifying the data file. This is particularly useful if you
want to use one dictionary to read several data files stored
in the same format.
If your observations span multiple records or lines you can
still read them using infix as long as all
observations have the same number of records (not necessarily all
of the same width). For more information see help
infix.
The infile command can also be used with
fixed-format data and a dictionary. This is a very powerful
command that gives you a number of options not available with
infix; for example it lets you define variable labels
right in the dictionary, but the syntax is a bit more complicated.
See help infile2.
In most cases you will find that you can read free-format data
using infile and fixed-format data using
infix. For more information on the trade-offs see
help infiling.
Data can also be typed directly into Stata using the
input command, see help input, or using
the built-in Stata data editor available through Data|Data
editor on the menu system.
2.3 Data Documentation
After you read your data into Stata it is important to prepare some documentation. In this section we will see how to create dataset, variable, and value labels, and how to create notes for the data or variables.
2.3.1 Data Label and Notes
Stata lets you label your dataset using the label
data command followed by a label of up to 80 characters (244
in Stata SE). You can also add notes of up to ~64K characters
each using the notes command followed
by a colon and then the text:
. label data "Family Planning Effort Data" . notes: Source P.W. Mauldin and B. Berelson (1978). /// > Conditions of fertility decline in developing countries, 1965-75. /// > Studies in Family Planning, 9:89-147
Users of the data can type notes to see your
annotation. Documenting your data carefully always pays off.
2.3.2 Variable Labels and Notes
You can (and should) label your variables using the label
variable command followed by the name of the variable and a
label of up to 80 characters enclosed in quotes. With the
infile command you can add these labels to the
dictionary, which is a natural home for them. Otherwise you
should prepare a do file with all the labels.
Here's how to define labels for the three variables in our
dataset:
. label variable setting "Social Setting" . label variable effort "Family Planning Effort" . label variable change "Fertility Change"
Stata also lets you add notes to specific variables using the
command notes varname: text.
Note that the command is followed by a variable name and
then a colon:
. notes change: Percent decline in the crude birth rate (CBR) -the number of /// > births per thousand population- between 1965 and 1975.
Type describe and then notes to check
our work so far.
2.3.3 Value Labels
You can also label the values of categorical variables. Our
dataset doesn't have any categorical variables but let's create
one. We will make a copy of the family planning effort variable and
then group it into three categories, 0-4, 5-15 and 15+, which
represent weak, moderate and strong programs (the generate
and recode used in the first two lines are described in
the next section, where we also show how to accomplish all these
steps with just one command):
. generate effortg = effort
. recode effortg 0/4=1 5/14=2 15/max=3
. label define effortg 1 "Weak" 2 "Moderate" 3 "Strong"
. label values effortg effortg
. label variable effortg "Family Planning Effort (Grouped)"
Stata has a two-step approach to defining labels. First you
define a named label set which associates integer codes
with labels of up to 80 characters (244 in Stata SE), using the
label define command. Then you
associate the set of labels with a variable, using the label
values command. Often you use the same name for the label
set and the variable, as we did in our example.
One advantage of this approach is that you can use the same
set of labels for several variables. The canonical example
is label define yesno 1 "yes" 0 "no", which can
then be associated with all 0-1 variables in your dataset, using a
command of the form
label values variablename yesno for each one.
When defining labels you can omit the quotes if the label is a
single word, but I prefer to use them always for clarity.
Label sets can be modified using the options add or
modify, listed using
label dir (lists only names) or
label list (lists names and labels), and saved to a do
file using label save. Type
help label to learn more about these options and
commands. You can also have labels in different languages as
explained below.
2.3.4 Multilingual Labels*
(This sub-section can be skipped without loss of continuity.) A Stata file can store labels in several languages and you can move freely from one set to another. We'll illustrate the idea by creating Spanish labels for our dataset. Following Stata recommendations we will use the ISO standard two-letter language codes, en for English and es for Spanish.
First we use label language to rename the
current language to en, and to create a new language
set es:
. label language en, rename . label language es, new
If you type desc now you will discover that our
variables have no labels! We could have copied the English ones by
using the option copy, but that wouldn't save us any
work in this case. Here are Spanish versions of the data and
variable labels:
. label data "Datos de Mauldin y Berelson sobre Planificacion Familiar"
. label variable country "Pais"
. label variable setting "Indice de Desarrollo Social"
. label variable effort "Esfuerzo en Planificacion Familiar"
. label variable effortg "Esfuerzo en Planificacion Familiar (Agrupado)"
. label variable change "Cambio en la Tasa Bruta de Natalidad (%)"
These definitions do not overwrite the corresponding English
labels, but coexist with them in a parallel Spanish universe. With
value labels you have to be a bit more careful, however; you can't
just redefine the label set called effortg
because it is only the association between a variable and a
set of labels, not the labels themselves, that is stored in a
language set. What you need to do is define a new label set, wel'll
call it effortg_es,
combining the old name and the new language code, and then
associate it with the variable effortg:
. label define effortg_es 1 "Debil" 2 "Moderado" 3 "Fuerte" . label values effortg effortg_es
You may want to try the describe command now. Try
tabulating effort:
. table effortg
Next we change the language back to English and run the table again:
. label language en . table effortg
One limitation of multi-language language support in
Stata is that labels are restricted to 7-bit ascii
characters, so you cannot include letters with diacritical marks
such as accents. For more information type help
label_language.
2.4 Creating New Variables
The most important Stata commands for creating new
variables are generate/replace
and recode, and they are often used together.
2.4.1 Generate and Replace
The generate command creates a new
variable using an expression that may combine constants, variables,
functions, and arithmetic and logical operators. Let's start
with a simple example: here is how to create setting squared:
. gen settingsq = setting^2.
If you are going to use this term in a regression you know
that linear and quadratic terms are highly correlated. It may
be a good idea to center the variable (by subtracting the mean)
before squaring it. Here we run summarize using
quietly to suppress the output and retrieve the mean
from the stored result r(mean):
. quietly summarize setting
. gen settingcsq = (setting - r(mean))^2
Note that I used a different name for this variable. Stata will
not let you overwrite an existing variable using generate. If
you really mean to replace the values of the old variable use
replace instead. You can also use drop var_names to drop
one or more variables from the dataset.
2.4.2 Operators and Expressions
The following table shows the standard arithmetic, logical and relational operators you may use in expressions:
| Arithmetic | Logical | Relational |
| + add | ! not (also ~) | == equal |
| - subtract | | or | != not equal (also ~=) |
| * multiply | & and | < less than |
| / divide | <= less than or equal | |
| ^ raise to power | > greater than | |
| + string concatenation | >= greater than or equal |
Here's how to create an indicator variable for countries with high-effort programs:
. generate hieffort = effort > 14
This is a common Stata idiom, taking advantage of the fact that logical expressions take the value 1 if true and 0 if false. A common alternative is to write
. generate hieffort2 = 0
. replace hieffort2 = 1 if effort > 14
The two strategies yield exactly the same answer. Both will be wrong if there are missing values, which will be coded as high effort because missing value codes are very large values, as noted in Section 2.1 above. You should develop a good habit of avoiding open ended comparisons. My preferred approach is to use
. generate hieffort3 = effort > 14 if !missing(effort)
which gives true for effort above 14, false for effort less than or equal to 14, and missing when effort is missing. Logical expressions may be combined using & for "and" or | for "or". Here's how to create an indicator variable for effort between 5 and 14:
. gen effort5to14 = (effort >=5 & effort <= 14)
Here we don't need to worry about missing values, they are
excluded by the clause effort <= 14.
2.4.3 Functions
Stata has a large number of functions, here are a few
frequently-used mathematical functions, type help
mathfun to see a complete list:
| abs(x) | the absolute value of x |
| exp(x) | the exponential function of x |
| int(x) | the integer obtained by truncating x towards zero |
| ln(x) or log(x) | the natural logarithm of x if x>0 |
| log10(x) | the log base 10 of x (for x>0) |
| logit(x) | the log of the odds for probability x: logit(x) = ln(x/(1-x)) |
| max(x1,x2,...,xn) | the maximum of x1, x2, ..., xn, ignoring missing values |
| min(x1,x2,...,xn) | the minimum of x1, x2, ..., xn, ignoring missing values |
| round(x) | x rounded to the nearest whole number |
| sqrt(x) | the square root of x if x >= 0 |
These functions are automatically applied to all observations when the argument is a variable in your dataset.
Stata also has a function to generate random numbers (useful in
simulation), namely uniform().
It also has an extensive set of functions to
compute probability distributions (needed for p-values)
and their inverses (needed for critical values),
including normal() for the normal cdf and
invnormal() for its inverse,
see help density functions for more information.
To simulate normally distributed observations you can use
.rnormal() // or invnormal(uniform())
There are also some specialized functions for working with strings,
see help string functions, and with dates,
see help date functions.
2.4.4 Recoding Variables
The recode command is used to group a numeric variable into categories. Suppose for example a fertility survey has age in single years for women aged 15 to 49, and you would like to code it into 5-year age groups. You could, of course, use something like
. gen age5= int((age-15)/5)+1 if !missing(age)
but this only works for regularly spaced intervals (and is a bit cryptic). The same result can be obtained using
. recode age (15/19=1) (20/24=2) (25/29=3) (30/34=4) ///
(35/39=5) (40/44=6) (45/49=7), gen(age5)
Each expression in parenthesis is a recoding rule, and consist of a
list or range of values, followed by an equal sign and a new value. A
range, specified using a slash, includes the two boundaries, so 15/19 is 15 to
19, which could also be specified as 15 16 17 18 19 or even 15 16 17/19.
You can use min to refer to the smallest value and
max to refer to the largest value, as in min/19
and 44/max.
The parentheses can be omitted when the rule has the form
range=value, but they usually help make the command more readable.
Values are assigned to the first category where they fall. Values that are
never assigned to a category are kept as they are.
You can use else (or *) as the last clause to refer
to any value not yet assigned.
Alternatively, you can use missing and nonmissing
to refer to unassigned missing and nonmissing values; these must be the
last two clauses and cannot be combined with else.
In our example we also used the gen() option to generate a new
variable, in this case age5; the default is to replace the values of the existing variable.
I strongly recommend that you always use the gen option or make
a copy of the original variable before recoding it.
You can also specify value labels in each recoding rule. This is simpler and
less error prone that creating the labels in a separate statement.
The option label(label_name) lets you assign a name to the labels created
(the default is the same as the variable name).
Here's an example showing how to recode and label family planning effort in one step
(compare with the four commands used in Section 2.4.2 above).
. recode effort (0/4=1 Weak) (5/14=2 Moderate) (15/max=3 Strong) /// > , generate(efffortg) label(effortg)
It is often a good idea to cross-tabulate original and recoded variables to check that the transformation has worked as intended. (Of course this can only be done if you have generated a new variable!)
2.5 Managing Stata Files
Once you have created a Stata system file you will want to save it on disk
using save filename, replace, where the replace option, as usual,
is needed only if the file already exists. To load a Stata file you have saved
in a previous session you issue the command use filename.
If there are temporary variables you do not need in the saved file you can drop them
(before saving) using drop varnames. Alternatively,
you may specify the variables you want to keep, using keep varnames.
With large files you may want to compress them before saving;
this command looks at the data and stores each variable in the smallest possible
data type that will not result in loss of precision.
It is possible to add variables or observations to a Stata file. To add
variables you use the merge commmand, which requires two (or more)
Stata files, usually with a common id so observations can be paired correctly.
A typical application is to add household information to an individual data file.
Type help merge to learn more.
To add observations to a file you use the append command,
which requires the data to be appended to be on a Stata file, usually containing
the same variables as the dataset in memory. You may, for example, have data for
patients in one clinic and may want to append similar data from another clinic.
Type help append to learn more.
A related but more specialized command is joinby,
which forms all pairwise combinations of observations in memory with
observations in an external dataset (see also cross).
Continue with 3. Stata Graphics

