Stata is a powerful statistical package with smart data-management facilities, a wide array of up-to-date statistical techniques, and a brand-new system for producing publication-quality graphs. Stata is fast and easy to use. In this tutorial we start with a quick introduction and overview and then discuss data management, statistics, graphs, and programming. (The tutorial has been updated for version 9, but most of the discussion applies to version 8 as well.)
Stata is available for Windows, Unix, and Mac computers. This tutorial focuses on the Windows version, but most of the content applies to the other platforms as well. The standard version is called Intercooled Stata and can handle up to 2,047 variables. There is a special edition called Stata SE that can handle up to 32,766 variables (and also allows longer string variables and larger matrices). Both versions can read each other's files within their size limits. The number of observations is limited by your computer's memory, as long as it doesn't exceed about two billion. There's also a small version of Stata that is limited to about 1,000 observations on 99 variables.
At OPR you can access Stata SE on Windows by running the network version on your own workstation;
just create a shortcut to \\opr\shares\applications\stata9-se\wsestata.exe.
Intercooled Stata is also available. For computationally intensive jobs you may want to
log in to our Windows server Coale via remote desktop and run Stata there. If you prefer Unix
systems, log on to our Unix server Lotka via X-Windows and leave your job running there.
When Stata starts up you see four windows, initially arranged as shown below:
The window labeled Command is where you type your commands. Stata then shows the results in the larger window labeled, appropriately enough, Results. Your command is added to a list in the window labeled Review, so you can keep track of the commands you have used. The last window, labeled Variables, lists the variables in your dataset. You can resize and rearrange these windows and save your preferences using the menu Prefs|Save Window Preferences. You can also choose the font used in each window: just right-click and select Font (in the Results window you can also click on the small box to the left of the window title). I like 8-point Lucida Console. (There are other windows that we will discuss as needed, namely the Graph, Viewer, Data Editor, and Do-file Editor.)
Starting with version 8, Stata's graphical user interface (GUI) allows selecting commands and options from a menu and dialog system. However, I strongly recommend using the command language as a way to ensure reproducibility of your results. In fact, I recommend that you type your commands in a separate file, called a do file, as explained in Section 1.2 below, but for now we will just type in the command window. The GUI can be helpful to beginners learning Stata, particularly because after you point and click on the menus and dialogs Stata types the corresponding command for you.
Stata can work as a calculator using the display command.
Try typing the following
(excluding the dot at the start of a line, which is how Stata marks the lines you type):
. display 2+2
4
. display 2 * ttail(20,2.1)
.04861759
Stata commands are case-sensitive: display is not the same as Display,
and the latter will not work.
Commands can also be abbreviated; the documentation and online help underline the shortest legal
abbreviation of each command and we will do the same here.
The second command shows the use of a built-in function to compute a p-value,
in this case twice the probability that a Student's t with 20 d.f. exceeds 2.1. This result would just make the 5% cutoff.
To find the two-tailed 5% critical value try display invttail(20, 0.025).
We list a few other functions you can use in Section 2.
If you issue a command and discover that it doesn't work, press the Page Up key to recall it (you can cycle through your command history using the Page Up and Page Down keys) and then edit it using the arrow, insert and delete keys, which work exactly as you would expect. For example, the arrow keys advance a character at a time and Ctrl-Arrows advance a word at a time. Shift-Arrows select a character at a time and Shift-Ctrl-Arrows select a word at a time, which you can then delete or replace. A command can be as long as needed (up to some 64k characters); in an interactive session you just keep on typing and the command window will wrap and scroll as needed.
Stata has excellent online help.
To obtain help on a command (or function) type help command_name,
which displays the help in a separate window called the Viewer.
(You can also type chelp command_name,
which shows the help in the Results window, but this is not recommended.)
Or just select Help|Command on the menu system.
Try help ttail. (Unfortunately, version 9 opens a new viewer
each time you type help, and before you know it you have dozens of windows
cluttering your desktop. To avoid this problem you can type the help command on
the viewer itself, or you can modify the help command to reuse the same window
as explained here.)
If you don't know the name of the command you need, you can search for it.
Stata has a search command with a few options (type help search
to learn more),
but I prefer findit, which searches the Internet as well as your local
machine and shows results in the Viewer.
Try findit Student's t.
This will list all Stata commands and functions related to the t distribution.
The second entry is functions, which leads to a table that includes probability
functions; that link in turn opens a long list of functions, which includes ttail().
Along the way you see that Stata can also compute tail probabilities
for the normal, chi-squared and F distributions, among others.
To learn more about the help system type help help.
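As a rough sketch of how these probability functions are used, the lines below compute a few tail probabilities with display. The dot prompt is omitted so the lines can be pasted into the Command window one at a time. The function name normal() assumes version 9; earlier versions used a different name, so check help functions if it is not recognized. The particular test values are only illustrations.

```stata
* Two-sided p-value for t = 2.1 with 20 d.f. (the example used earlier)
display 2 * ttail(20, 2.1)
* Two-sided p-value for a standard normal z = 1.96
display 2 * (1 - normal(1.96))
* Upper-tail probability for a chi-squared with 1 d.f.
display chi2tail(1, 3.84)
* Upper-tail probability for an F with 1 and 20 d.f.
display Ftail(1, 20, 4.41)
```

Because the square of a Student's t with 20 d.f. is an F with 1 and 20 d.f., and 2.1 squared is 4.41, the last line should reproduce the two-sided p-value from the first.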
Stata comes with a few sample data files.
You will learn how to read your own data into Stata in Section 2,
but for now we will load one of the sample files, namely lifeexp.dta,
which has data on life expectancy and gross national product (GNP) per capita in 1998
for 68 countries.
To see a list of the files shipped with Stata type sysuse dir.
To load the file we want type sysuse lifeexp (the file extension is optional).
To see what's in the file type describe.
(This command can be abbreviated to a single letter but I prefer desc.)
. sysuse lifeexp
(Life expectancy, 1998)
. desc
Contains data from C:\Stata8\ado\base/l/lifeexp.dta
  obs:            68                          Life expectancy, 1998
 vars:             6                          26 Sep 2002 14:08
 size:         2,924 (99.7% of memory free)   (_dta has notes)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
region          byte   %12.0g      region     Region
country         str28  %28s                   Country
popgrowth       float  %9.0g                * Avg. annual % growth
lexp            byte   %9.0g                * Life expectancy at birth
gnppc           float  %9.0g                * GNP per capita
safewater       byte   %9.0g                *
* indicated variables have notes
-------------------------------------------------------------------------------
Sorted by:
We see that we have six variables. The dataset has notes that you can see by
typing notes. Four of the variables have annotations that you can see by typing
notes varname. You'll learn how to add notes in Section 2.
Let us run simple descriptive statistics for the two variables we are interested in,
using the summarize command followed by the names of the variables
(which can be omitted to summarize everything):
. summarize lexp gnppc
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        lexp |        68    72.27941    4.715315         54         79
       gnppc |        63    8674.857    10634.68        370      39980
We see that life expectancy averages 72.3 years and GNP per capita ranges from $370 to $39,980
with an average of $8,675.
We also see that Stata reports only 63 observations on GNP per capita,
so we must have some missing values.
Let us list the countries for which we are missing GNP per capita:
. list country gnppc if missing(gnppc)
     +--------------------------------------+
     |                      country   gnppc |
     |--------------------------------------|
  7. |       Bosnia and Herzegovina       . |
 40. |                 Turkmenistan       . |
 44. | Yugoslavia, FR (Serb./Mont.)       . |
 46. |                         Cuba       . |
 56. |                  Puerto Rico       . |
     +--------------------------------------+
We see that we have indeed five missing values.
This example illustrates a powerful feature of Stata:
the action of any command can be restricted to a subset of the data.
If we had typed list country gnppc we would have listed these variables for all 68 countries.
Adding the condition if missing(gnppc)
restricts the list to cases where gnppc is missing.
Note that Stata lists missing values using a dot. We'll learn more about missing values in Section 2.
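The if qualifier works the same way with other commands. The following is a minimal sketch, assuming lifeexp is still in memory; the cutoff of 60 years is an arbitrary value chosen only for illustration.

```stata
* Count the observations with missing GNP per capita
count if missing(gnppc)
* Summarize life expectancy only where GNP per capita is available
summarize lexp if !missing(gnppc)
* Combine conditions with & (and) or | (or)
list country lexp gnppc if lexp < 60 & !missing(gnppc)
```

The explicit !missing(gnppc) test is a cautious habit: Stata treats a missing value as larger than any number, so a condition such as gnppc > 1000 would also pick up the countries with missing GNP per capita.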
To see how life expectancy varies with GNP per capita we will draw a scatter plot
using the graph command, which has a myriad of subcommands and options,
some of which we describe in Section 4.
. graph twoway scatter lexp gnppc

The plot shows a curvilinear relationship between GNP per capita and life expectancy. We will see if the relationship can be linearized by taking the log of GNP per capita.
We compute a new variable using the generate command
with a new variable name and an arithmetic expression. Choosing good variable names is important.
When computing logs I usually just prefix the old variable name with 'log' or
'l',
but compound names can easily become cryptic and hard to read.
Some programmers separate words using an underscore, as in log_gnp_pc,
and others prefer the camel-casing convention, which capitalizes each word after the first: logGnpPc.
I suggest you develop a consistent style and stick to it.
Variable labels can also help, as described in Section 2.
To compute natural logs we use the built-in function log:
. gen loggnppc = log(gnppc)
(5 missing values generated)
Stata says it has generated five missing values. These correspond to the five countries for which we were missing GNP per capita. Try to confirm this statement using the list command. We will learn more about generating new variables in Section 2.
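One possible way to run that check is sketched below, assuming loggnppc has just been generated as above. The label text is only an illustration; labels are covered properly in Section 2.

```stata
* List the observations where the new variable is missing
list country gnppc loggnppc if missing(loggnppc)
* Optionally attach a descriptive label to the new variable
label variable loggnppc "Log of GNP per capita"
```

The listing should show the same five countries that were missing gnppc, since the log of a missing value is itself missing.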
We are now ready to run a linear regression of life expectancy on log GNP per capita.
We will use the regress command, which lists the outcome followed by the predictors
(here just one, loggnppc):
. regress lexp loggnppc
      Source |       SS       df       MS              Number of obs =      63
-------------+------------------------------           F(  1,    61) =   97.09
       Model |  873.264865     1  873.264865           Prob > F      =  0.0000
    Residual |  548.671643    61  8.99461709           R-squared     =  0.6141
-------------+------------------------------           Adj R-squared =  0.6078
       Total |  1421.93651    62  22.9344598           Root MSE      =  2.9991
------------------------------------------------------------------------------
        lexp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    loggnppc |   2.768349   .2809566     9.85   0.000     2.206542    3.330157
       _cons |   49.41502   2.348494    21.04   0.000     44.71892    54.11113
------------------------------------------------------------------------------
Note that the regression is based on only 63 observations. Stata omits observations that are missing the outcome or one of the predictors. The log of GNP per capita "explains" 61% of the variation in life expectancy in these countries. We also see that a one percent increase in GNP per capita is associated with an increase of about 0.0277 years in life expectancy. (To see this point, note that if GNP increases by one percent its log increases by about 0.01, and 2.768 × 0.01 ≈ 0.0277.)
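The one-percent interpretation can be checked with display, using the coefficient that Stata keeps in memory after estimation. This is a small sketch that assumes the regression above was the last estimation command run.

```stata
* Redisplay the most recent regression results
regress
* Approximate effect of a one percent increase: slope times 0.01
display _b[loggnppc] * 0.01
* Exact change in log GNP for a one percent increase is log(1.01)
display _b[loggnppc] * log(1.01)
```

Both lines come to roughly 0.028 years; the small difference between them reflects the fact that log(1.01) is slightly less than 0.01.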
Following a regression (or in fact any estimation command) you can retype the command with
no arguments to see the results again. Try typing reg.
Stata has a number of post-estimation commands that build on the results of a model fit.
A useful command is predict, which can be used to generate fitted values or
residuals following a regression. The command
. predict plexp
(option xb assumed; fitted values)
(5 missing values generated)
generates a new variable, plexp, that has the life expectancy predicted from
our regression equation.
No predictions are made for the five countries without GNP per capita.
(If life expectancy were missing for a country, that country would be excluded from the regression,
but a prediction would still be made for it. This technique can be used to fill in missing values.)
A common task is to superimpose a regression line on a scatter plot to inspect the quality of the fit.
We could do this using the predictions we stored in plexp, but Stata's graph command
knows how to do linear fits on the fly using the lfit plot type,
and can superimpose different types of twoway plots,
as explained in more detail in Section 4.
Try the command
. graph twoway (scatter lexp loggnppc) (lfit lexp loggnppc)

In this command each expression in parentheses is a separate two-way plot to be overlaid in the same graph. The fit looks reasonably good, except for a possible outlier.
It's hard not to notice the country on the bottom left of the graph, which has much lower life expectancy than one would expect, even given its low GNP per capita. To find which country it is we list the (names of the) countries where life expectancy is less than 55:
. list country lexp plexp if lexp < 55, clean
country lexp plexp
50. Haiti 54 66.06985
We find that the outlier is Haiti, with a life expectancy 12 years less than one would expect
given its GNP per capita. (The keyword clean after the comma is an option
which omits the borders on the listing. Many Stata commands have options, and these are always
specified after a comma.)
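Another way to look for outliers is through the residuals, which predict can also generate. The sketch below assumes the regression of lexp on loggnppc is still the most recent estimation command; the variable name rlexp and the cutoff of -6 (about twice the Root MSE of 3.0) are choices made here for illustration only.

```stata
* Residuals are observed minus fitted values
predict rlexp, residuals
* List countries whose life expectancy falls well below the fitted line
list country lexp plexp rlexp if rlexp < -6, clean
```

Countries without GNP per capita have a missing residual. Since Stata treats missing values as larger than any number, they do not satisfy rlexp < -6 and are left out of the listing.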
If you are curious where the United States is try
. list gnppc loggnppc lexp plexp if country == "United States", clean
gnppc loggnppc lexp plexp
58. 29240 10.28329 77 77.88277
Here we restricted the listing to cases where the value of the variable
country was "United States".
Note the use of a double equal sign in a logical expression.
In Stata x = 2 assigns the value 2 to the variable x,
whereas x == 2 checks to see if the value of x is 2.
To exit Stata you use the exit command (or select File|Exit in the menu, or
press Alt-F4, as in most Windows programs). If you have been following along with
this tutorial by typing the commands and try to exit, Stata will refuse, saying "no;
data in memory would be lost". This happens because we have added new variables
that are not part of the original dataset, and they haven't been saved. As you
can see, Stata is very careful to ensure we don't lose our work.
If you don't care about saving anything you can type exit, clear, which tells
Stata to quit no matter what. Alternatively, you can save the data in memory
using the save filename command, and then exit. A cautious programmer
will always save a modified file using a new name.
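A cautious way to end the session might look like the following sketch. The file name lifeexp2 is hypothetical, and the file is written to Stata's current working directory.

```stata
* Save the modified data under a new name, leaving the original untouched
save lifeexp2
* Now that the data in memory have been saved, exit works without complaint
exit
```

If a file with that name already exists, save will refuse to overwrite it unless you add the replace option, as in save lifeexp2, replace.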