Stata Tutorial

1. Introduction

Stata is a powerful statistical package with smart data-management facilities, a wide array of up-to-date statistical techniques, and a brand-new system for producing publication-quality graphs. Stata is fast and easy to use. In this tutorial we start with a quick introduction and overview and then discuss data management, statistics, graphs, and programming. (The tutorial has been updated for version 9, but most of the discussion applies to version 8 as well.)

1.1 A Quick Tour of Stata

Stata is available for Windows, Unix, and Mac computers. This tutorial focuses on the Windows version, but most of the content applies to the other platforms as well. The standard version is called Intercooled Stata and can handle up to 2,047 variables. There is a special edition called Stata SE that can handle up to 32,766 variables (and also allows longer string variables and larger matrices). Both versions can read each other's files within their size limits. The number of observations is limited by your computer's memory, as long as it doesn't exceed about two billion. There's also a small version of Stata that is limited to about 1,000 observations on 99 variables.

At OPR you can access Stata SE on Windows by running the network version on your own workstation; just create a shortcut to \\opr\shares\applications\stata9-se\wsestata.exe. Intercooled Stata is also available. For computationally intensive jobs you may want to log in to our Windows server Coale via remote desktop and run Stata there. If you prefer Unix systems, log on to our Unix server Lotka via X-Windows and leave your job running there.

1.1.1 The Stata Interface

When Stata starts up you see four windows, initially arranged as shown below:

The window labeled Command is where you type your commands. Stata then shows the results in the larger window labeled, appropriately enough, Results. Your command is added to a list in the window labeled Review, so you can keep track of the commands you have used. The last window, labeled Variables, lists the variables in your dataset. You can resize and rearrange these windows and save your preferences using the menu Prefs|Save Window Preferences. You can also choose the font used in each window; just right-click and select Font (on the Results window you can also click on the small box to the left of the window title). I like 8-point Lucida Console. (There are other windows that we will discuss as needed, namely the Graph, Viewer, Data Editor, and Do-file Editor.)

Starting with version 8, Stata's graphical user interface (GUI) allows selecting commands and options from a menu and dialog system. However, I strongly recommend using the command language as a way to ensure reproducibility of your results. In fact, I recommend that you type your commands in a separate file, called a do file, as explained in Section 1.2 below, but for now we will just type in the command window. The GUI can be helpful to beginners learning Stata, particularly because after you point and click on the menus and dialogs Stata types the corresponding command for you.

1.1.2 Typing Commands

Stata can work as a calculator using the display command.  Try typing the following (excluding the dot at the start of a line, which is how Stata marks the lines you type):

. display 2+2
4
. display 2 * ttail(20,2.1)
.04861759

Stata commands are case-sensitive: display is not the same as Display, and the latter will not work. Commands can also be abbreviated; the documentation and online help underline the shortest legal abbreviation of each command, and we will do the same here.

The second command shows the use of a built-in function to compute a p-value, in this case twice the probability that a Student's t with 20 d.f. exceeds 2.1.  This result would just make the 5% cutoff. To find the two-tailed 5% critical value try display invttail(20, 0.025). We list a few other functions you can use in Section 2.

If you issue a command and discover that it doesn't work, press the Page Up key to recall it (you can cycle through your command history using the Page Up and Page Down keys) and then edit it using the arrow, insert and delete keys, which work exactly as you would expect. For example, the arrow keys move one character at a time and Ctrl-Arrows move one word at a time. Shift-Arrows select a character at a time and Shift-Ctrl-Arrows select a word at a time, which you can then delete or replace. A command can be as long as needed (up to some 64k characters); in an interactive session you just keep on typing and the command window will wrap and scroll as needed.

1.1.3 Getting Help

Stata has excellent online help. To obtain help on a command (or function) type help command_name, which displays the help in a separate window called the Viewer. (You can also type chelp command_name, which shows the help in the Results window, but this is not recommended.) Or just select Help|Command on the menu system. Try help ttail. (Unfortunately, version 9 opens a new Viewer each time you type help, and before you know it you have dozens of windows cluttering your desktop. To avoid this problem you can type the help command in the Viewer itself, or you can modify the help command to reuse the same window as explained here.)

If you don't know the name of the command you need, you can search for it. Stata has a search command with a few options (type help search to learn more), but I prefer findit, which searches the Internet as well as your local machine and shows results in the Viewer. Try findit Student's t. This will list all Stata commands and functions related to the t distribution. The second entry is functions, which takes you to a table that includes probability functions, which takes you to a long list of functions, which includes ttail(). Along the way you see that Stata can also compute tail probabilities for the normal, chi-squared and F distributions, among others.
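
As a small sketch of those other tail functions, the commands below compute a two-sided normal p-value, a chi-squared tail and an F tail. The test statistics and degrees of freedom are made-up values chosen only for illustration, and the example assumes Stata 9 or later, where the cumulative normal is called normal() (earlier versions used a different name).

. * two-sided p-value for a standard normal statistic of 1.96 (illustrative value)
. display 2 * (1 - normal(1.96))
. * upper tail of a chi-squared with 1 d.f. beyond 3.84 (illustrative value)
. display chi2tail(1, 3.84)
. * upper tail of an F with 2 and 30 d.f. beyond 3.32 (illustrative value)
. display Ftail(2, 30, 3.32)

All three results should come out close to 0.05, because the statistics were picked near the conventional 5% critical values. Lines starting with an asterisk are comments that Stata ignores.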

To learn more about the help system type  help help.

1.1.4 Loading a Sample Data File

Stata comes with a few sample data files. You will learn how to read your own data into Stata in Section 2, but for now we will load one of the sample files, namely lifeexp.dta, which has data on life expectancy and gross national product (GNP) per capita in 1998 for 68 countries. To see a list of the files shipped with Stata type sysuse dir. To load the file we want type sysuse lifeexp (the file extension is optional). To see what's in the file type describe. (This command can be abbreviated to a single letter but I prefer desc.)

. sysuse lifeexp
(Life expectancy, 1998)

. desc

Contains data from C:\Stata8\ado\base/l/lifeexp.dta
  obs:            68                          Life expectancy, 1998
 vars:             6                          26 Sep 2002 14:08
 size:         2,924 (99.7% of memory free)   (_dta has notes)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
region          byte   %12.0g      region     Region
country         str28  %28s                   Country
popgrowth       float  %9.0g                * Avg. annual % growth
lexp            byte   %9.0g                * Life expectancy at birth
gnppc           float  %9.0g                * GNP per capita
safewater       byte   %9.0g                * 
                                            * indicated variables have notes
-------------------------------------------------------------------------------
Sorted by: 

We see that we have six variables. The dataset has notes that you can see by typing notes. Four of the variables have annotations that you can see by typing notes varname. You'll learn how to add notes in Section 2.

1.1.5 Descriptive Statistics 

Let us run simple descriptive statistics for the two variables we are interested in, using the summarize command followed by the names of the variables (which can be omitted to summarize everything):

. summarize lexp gnppc

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        lexp |        68    72.27941    4.715315         54         79
       gnppc |        63    8674.857    10634.68        370      39980

We see that life expectancy averages 72.3 years and GNP per capita ranges from $370 to $39,980 with an average of $8,675. We also see that Stata reports only 63 observations on GNP per capita, so we must have some missing values. Let us list the countries for which we are missing GNP per capita:

. list country gnppc if missing(gnppc)

     +--------------------------------------+
     |                      country   gnppc |
     |--------------------------------------|
  7. |       Bosnia and Herzegovina       . |
 40. |                 Turkmenistan       . |
 44. | Yugoslavia, FR (Serb./Mont.)       . |
 46. |                         Cuba       . |
 56. |                  Puerto Rico       . |
     +--------------------------------------+

We see that we do indeed have five missing values. This example illustrates a powerful feature of Stata: the action of any command can be restricted to a subset of the data. If we had typed list country gnppc we would have listed these variables for all 68 countries. Adding the condition if missing(gnppc) restricts the list to cases where gnppc is missing. Note that Stata lists missing values using a dot. We'll learn more about missing values in Section 2.
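
The same if qualifier works with other commands. The lines below are an illustrative sketch using only variables from lifeexp.dta; the cutoff of 60 years is an arbitrary choice for the example.

. * count the observations with missing GNP per capita
. count if missing(gnppc)
. * summarize life expectancy only where GNP per capita is available
. summarize lexp if !missing(gnppc)
. * list countries with life expectancy below 60 (arbitrary cutoff)
. list country lexp if lexp < 60

The count should agree with the five countries in the listing above, and the exclamation mark in !missing(gnppc) negates the condition.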

1.1.6 Drawing a Scatterplot

To see how life expectancy varies with GNP per capita we will draw a scatter plot using the graph command, which has a myriad of subcommands and options, some of which we describe in Section 4.

. graph twoway scatter lexp gnppc

The plot shows a curvilinear relationship between GNP per capita and life expectancy. We will see if the relationship can be linearized by taking the log of GNP per capita.

1.1.7 Computing New Variables

We compute a new variable using the generate command with a new variable name and an arithmetic expression. Choosing good variable names is important. When computing logs I usually just prefix the old variable name with 'log' or 'l', but compound names can easily become cryptic and hard to read. Some programmers separate words using an underscore, as in log_gnp_pc, and others prefer the camel-casing convention, which capitalizes each word after the first: logGnpPc. I suggest you develop a consistent style and stick to it. Variable labels can also help, as described in Section 2.

To compute natural logs we use the built-in function log:

. gen loggnppc = log(gnppc)
(5 missing values generated)

Stata says it has generated five missing values. These correspond to the five countries for which we were missing GNP per capita. Try to confirm this statement using the list command.  We will learn more about generating new variables in Section 2.
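
One possible way to check this, offered as a sketch rather than the only approach, is to list the cases where the log is missing and then count any cases where the log is missing although GNP per capita is not. The label text in the last line is our own wording, not part of the dataset.

. * show the observations where the new variable is missing
. list country gnppc loggnppc if missing(loggnppc)
. * count cases where the log is missing but the original value is present
. count if missing(loggnppc) & !missing(gnppc)
. * attach a descriptive label (the wording is an assumption)
. label variable loggnppc "Log of GNP per capita"

The count should be zero here, because every observed value of gnppc is positive (the minimum is 370) and therefore has a well-defined log.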

1.1.8 Simple Linear Regression

We are now ready to run a linear regression of life expectancy on log GNP per capita. We will use the regress command, which lists the outcome followed by the predictors (here just one, loggnppc):

. regress lexp loggnppc

      Source |       SS       df       MS              Number of obs =      63
-------------+------------------------------           F(  1,    61) =   97.09
       Model |  873.264865     1  873.264865           Prob > F      =  0.0000
    Residual |  548.671643    61  8.99461709           R-squared     =  0.6141
-------------+------------------------------           Adj R-squared =  0.6078
       Total |  1421.93651    62  22.9344598           Root MSE      =  2.9991

------------------------------------------------------------------------------
        lexp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    loggnppc |   2.768349   .2809566     9.85   0.000     2.206542    3.330157
       _cons |   49.41502   2.348494    21.04   0.000     44.71892    54.11113
------------------------------------------------------------------------------

Note that the regression is based on only 63 observations. Stata omits observations that are missing the outcome or one of the predictors. The log of GNP per capita "explains" 61% of the variation in life expectancy in these countries. We also see that a one percent increase in GNP per capita is associated with an increase of about 0.028 years in life expectancy. (To see this point note that if GNP increases by one percent its log increases by about 0.01, so life expectancy increases by about 2.768 times 0.01.)
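
You can let Stata do this arithmetic using the stored coefficient, which is available as _b[loggnppc] right after the regression. This is a sketch that assumes the regress command above was the last estimation command run; the doubling example is our own addition.

. * effect of a one percent increase in GNP per capita
. display _b[loggnppc] * log(1.01)
. * effect of doubling GNP per capita
. display _b[loggnppc] * log(2)

The first result should be a little under 0.028 years, because log(1.01) is slightly less than 0.01, and the second should be about 1.9 years.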

Following a regression (or in fact any estimation command) you can retype the command with no arguments to see the results again. Try typing reg.

1.1.9 Post-Estimation Commands

Stata has a number of post-estimation commands that build on the results of a model fit. A useful command is predict, which can be used to generate fitted values or residuals following a regression. The command

. predict plexp
(option xb assumed; fitted values)
(5 missing values generated)

generates a new variable, plexp, that has the life expectancy predicted from our regression equation. No predictions are made for the five countries without GNP per capita. (If life expectancy were missing for a country it would be excluded from the regression, but a prediction would still be made for it as long as GNP per capita was available. This technique can be used to fill in missing values.)
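
Residuals can be obtained in the same way by adding the residuals option. In this sketch the variable name rlexp and the cutoff of 5 years are arbitrary choices made for illustration.

. * store residuals in a new variable (rlexp is a hypothetical name)
. predict rlexp, residuals
. * list countries more than 5 years away from the fitted line (arbitrary cutoff)
. list country lexp plexp rlexp if abs(rlexp) > 5 & !missing(rlexp)

The condition !missing(rlexp) matters because Stata treats a missing value as larger than any number, so abs(rlexp) > 5 alone would also select the countries with no residual.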

1.1.10 Plotting the Data and a Linear Fit 

A common task is to superimpose a regression line on a scatter plot to inspect the quality of the fit. We could do this using the predictions we stored in plexp, but Stata's graph command knows how to do linear fits on the fly using the lfit plot type, and can superimpose different types of twoway plots, as explained in more detail in Section 4. Try the command

. graph twoway (scatter lexp loggnppc) (lfit lexp loggnppc)

In this command each expression in parentheses is a separate two-way plot to be overlaid in the same graph. The fit looks reasonably good, except for a possible outlier.
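
For comparison, here is a sketch of the alternative mentioned above, drawing the line from the predictions stored in plexp. It assumes you have already run the predict command of the previous section.

. * overlay the stored predictions as a line; sort orders the points by loggnppc
. graph twoway (scatter lexp loggnppc) (line plexp loggnppc, sort)

The sort option matters because a line plot connects points in the order they appear in the data, and without it the segments would zigzag back and forth along the fitted line.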

1.1.11 Listing Selected Observations

It's hard not to notice the country on the bottom left of the graph, which has much lower life expectancy than one would expect, even given its low GNP per capita. To find which country it is we list the (names of the) countries where life expectancy is less than 55:

. list country lexp plexp if lexp < 55, clean

       country   lexp      plexp  
 50.     Haiti     54   66.06985  

We find that the outlier is Haiti, with a life expectancy 12 years less than one would expect given its GNP per capita. (The keyword clean after the comma is an option which omits the borders on the listing. Many Stata commands have options, and these are always specified after a comma.) If you are curious where the United States is try

. list gnppc loggnppc lexp plexp if country == "United States", clean

       gnppc   loggnppc   lexp      plexp  
 58.   29240   10.28329     77   77.88277  

Here we restricted the listing to cases where the value of the variable country was "United States". Note the use of a double equal sign in a logical expression. In Stata x = 2 assigns the value 2 to the variable x, whereas x == 2 checks to see if the value of x is 2.
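
A few more comparisons along the same lines, as an illustrative sketch using the count command and values already seen in the listings above:

. * string comparisons need quotes and an exact match
. count if country == "Haiti"
. * string comparisons are case-sensitive, so this finds no matches
. count if country == "haiti"
. * numeric comparisons take no quotes
. count if lexp == 54

The first and third commands should each find one observation, namely Haiti, and the second should find none.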

1.1.12 Saving your Work and Exiting Stata

To exit Stata you use the exit command (or select File|Exit in the menu, or press Alt-F4, as in most Windows programs). If you have been following along with this tutorial by typing the commands and try to exit, Stata will refuse, saying "no; data in memory would be lost". This happens because we have added new variables that are not part of the original dataset, and they haven't been saved. As you can see, Stata is very careful to ensure we don't lose our work.

If you don't care about saving anything you can type exit, clear, which tells Stata to quit no matter what. Alternatively, you can save the data in memory using the save filename command, and then exit. A cautious programmer will always save a modified file using a new name.
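
As a sketch, the two commands below save the modified data under a new name and then quit. The file name lifeexp2 is a hypothetical choice, and the file is written to the current working directory, which you can check by typing pwd.

. * save the data in memory under a new name (lifeexp2 is a hypothetical name)
. save lifeexp2
. exit

If a file with that name already exists, Stata will refuse to overwrite it unless you add the replace option, as in save lifeexp2, replace.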

