This is a collection of small datasets used in the course. To get a copy, click on the dataset's link, or right-click (on most browsers) to save a copy to your local machine.

Table of Contents

Here is a list of datasets classified by the type of statistical technique illustrated. Some datasets appear in more than one category.

Data Formats

All datasets are available as ASCII files in two formats:

The Program Effort Data

Here are the famous program effort data from Mauldin and Berelson. This extract consists of observations on an index of social setting, an index of family planning effort, and the percent decline in the crude birth rate (CBR) between 1965 and 1975, for 20 countries in Latin America.

                 setting  effort   change
   Bolivia            46       0        1
   Brazil             74       0       10
   Chile              89      16       29
   Colombia           77      16       25
   CostaRica          84      21       29
   Cuba               89      15       40
   DominicanRep       68      14       21
   Ecuador            70       6        0
   ElSalvador         60      13       13
   Guatemala          55       9        4
   Haiti              35       3        0
   Honduras           51       7        7
   Jamaica            87      23       21
   Mexico             83       4        9
   Nicaragua          68       0        7
   Panama             84      19       22
   Paraguay           74       3        6
   Peru               73       0        2
   TrinidadTobago     84      15       29
   Venezuela          91       7       11

Source: P. W. Mauldin and B. Berelson (1978). Conditions of fertility decline in developing countries, 1965-75. Studies in Family Planning, 9:89-147.

To get this dataset click on effort.dat. If you use Stata you may prefer effort.raw, which omits the header line with the variable names.
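As a minimal sketch of reading this layout outside Stata, the Python snippet below parses the first four rows of the listing above. It assumes effort.dat looks exactly like that listing: a header line naming the three variables, followed by rows that start with a country label. The function name parse_dat and the inlined sample text are illustrative and not part of the course materials.

```python
# Sketch: parse the effort.dat layout (header line + row-labelled records).
# Assumption: the file matches the listing above; the text is inlined here
# (first four rows only) so the example runs without downloading anything.
SAMPLE = """\
                 setting  effort   change
   Bolivia            46       0        1
   Brazil             74       0       10
   Chile              89      16       29
   Colombia           77      16       25
"""


def parse_dat(text):
    """Return {row_label: {variable: int}} for a header-plus-rows table."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    names = lines[0].split()          # the header has no entry for the label
    table = {}
    for ln in lines[1:]:
        label, *values = ln.split()   # first field is the country name
        table[label] = dict(zip(names, map(int, values)))
    return table


effort = parse_dat(SAMPLE)
print(effort["Chile"])  # {'setting': 89, 'effort': 16, 'change': 29}
```

The header has one field fewer than the data rows, which is why the country label is split off before the values are paired with the variable names.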

Discrimination in Salaries

These are the salary data used in Weisberg's book, and include observations on six variables for 52 tenure-track professors in a small college.

The file is available in the usual formats as salary.dat and salary.raw. Variables in the 'raw' file are coded as follows:

Source: S. Weisberg (1985). Applied Linear Regression, Second Edition. New York: John Wiley and Sons. Page 194.

Births in Philadelphia

These are data based on a 5% sample of all births occurring in Philadelphia in 1990. The sample has 1115 observations (after deleting 32 cases with incomplete information) on five variables:

The data are available in the files phbirths.raw and phbirths.dat.

Source: I. T. Elo, G. Rodríguez and H. Lee (2001). Racial and Neighborhood Disparities in Birthweight in Philadelphia. Paper presented at the Annual Meeting of the Population Association of America, Washington, DC 2001.

The Contraceptive Use Data

Here are the contraceptive use data from page 46 of the lecture notes (and from the Stata and S-Plus handouts), showing the distribution of 1607 currently married and fecund women interviewed in the Fiji Fertility Survey, according to age, education, desire for more children and current use of contraception.

    age education wantsMore notUsing using 
    <25       low       yes       53     6
    <25       low        no       10     4
    <25      high       yes      212    52
    <25      high        no       50    10
  25-29       low       yes       60    14
  25-29       low        no       19    10
  25-29      high       yes      155    54
  25-29      high        no       65    27
  30-39       low       yes      112    33
  30-39       low        no       77    80
  30-39      high       yes      118    46
  30-39      high        no       68    78
  40-49       low       yes       35     6
  40-49       low        no       46    48
  40-49      high       yes        8     8
  40-49      high        no       12    31

To get a copy of the data in the format shown above click on cuse.dat.
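As a quick consistency check, the following Python sketch adds up the 16 rows of the table above and confirms that they account for the 1607 women mentioned in the description. The values are copied from the listing; the variable names are illustrative.

```python
# Sketch: verify the sample size and the overall share of users
# from the 16 rows listed above (values copied from the table).
ROWS = [
    # age, education, wantsMore, notUsing, using
    ("<25", "low", "yes", 53, 6),
    ("<25", "low", "no", 10, 4),
    ("<25", "high", "yes", 212, 52),
    ("<25", "high", "no", 50, 10),
    ("25-29", "low", "yes", 60, 14),
    ("25-29", "low", "no", 19, 10),
    ("25-29", "high", "yes", 155, 54),
    ("25-29", "high", "no", 65, 27),
    ("30-39", "low", "yes", 112, 33),
    ("30-39", "low", "no", 77, 80),
    ("30-39", "high", "yes", 118, 46),
    ("30-39", "high", "no", 68, 78),
    ("40-49", "low", "yes", 35, 6),
    ("40-49", "low", "no", 46, 48),
    ("40-49", "high", "yes", 8, 8),
    ("40-49", "high", "no", 12, 31),
]

not_using = sum(row[3] for row in ROWS)
using = sum(row[4] for row in ROWS)
total = not_using + using

print(total)                    # 1607
print(round(using / total, 3))  # 0.315
```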

The dataset is also available in the format used in the Stata handout. This format has 32 rows corresponding to all possible covariate and response patterns, and includes a weight indicating the frequency of each combination. The file has 5 columns with numeric codes:

To get the data in this alternative format click on cuse.raw.
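The relationship between the two layouts can be sketched in Python as follows: each grouped row, with its counts of non-users and users, becomes two weighted records, one per response. This is only an illustration of the reshaping, using the first two rows of the table above. It keeps the string labels and does not attempt to reproduce the numeric codes used in cuse.raw; the function name expand is an assumption.

```python
# Sketch: reshape grouped rows (notUsing, using) into one weighted record
# per covariate-and-response pattern. String labels are kept on purpose;
# the numeric codes of cuse.raw are not reproduced here.
GROUPED = [
    # age, education, wantsMore, notUsing, using
    ("<25", "low", "yes", 53, 6),
    ("<25", "low", "no", 10, 4),
]


def expand(rows):
    """Yield (age, education, wantsMore, using, freq) records."""
    for age, educ, wants, n_not, n_yes in rows:
        yield (age, educ, wants, "no", n_not)   # women not using
        yield (age, educ, wants, "yes", n_yes)  # women using


long_rows = list(expand(GROUPED))
print(len(long_rows))  # 4
```

Applied to all 16 grouped rows, the same step yields the 32 weighted rows described above.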

The Children Ever Born Data

These are the data from Fiji on children ever born, from page 84 of the lecture notes (and the Stata handout).

The dataset has 70 rows representing grouped individual data. Each row has entries for:

This file is available in the usual two formats: ceb.dat has a header and uses character labels for the factors, and ceb.raw uses numeric codes, as described above.

Smoking and Lung Cancer

This dataset has information on lung cancer deaths by age and smoking status. The file in "raw" format has four columns:

The file is also available in "dat" format with variable names, row names and string labels for age and smoking status.

The Ship Damage Data

These are the data from McCullagh and Nelder (1989, p. 204). The file has 34 rows corresponding to the observed combinations of type of ship, year of construction and period of operation. Each row has information on five variables as follows:

Note that there are no ships of type E built in 1960-64, and that ships built in 1975-79 could not have operated in 1960-74. These combinations are omitted from the data file.

You can get the data in the usual versions: ships.dat has a header and codes the factors using strings, and ship.raw uses numeric codes.

The Housing Data

These are the data from Wilner, Walkley and Cook (1955) on the effect of racial attitudes on segregation and integration of public housing:

                                     Sentiment
Proximity  Contact     Norms         fav unfav
close      frequent    favorable     77    32
                       unfavorable   30    36
           infrequent  favorable     14    19
                       unfavorable   15    27
distant    frequent    favorable     43    20
                       unfavorable   36    37
           infrequent  favorable     27    36
                       unfavorable   41   118
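The table above leaves a factor blank when it repeats the value from the row before. A minimal Python sketch of turning that nested layout into one complete record per row is shown below. It assumes blanks only ever occur on the left, as in the listing; the function name fill_down is illustrative.

```python
# Sketch: carry repeated factor levels down the nested table so that
# every row has proximity, contact and norms filled in.
# Assumption: omitted labels are always the leftmost ones, as above.
TABLE = """\
close      frequent    favorable     77    32
                       unfavorable   30    36
           infrequent  favorable     14    19
                       unfavorable   15    27
distant    frequent    favorable     43    20
                       unfavorable   36    37
           infrequent  favorable     27    36
                       unfavorable   41   118
"""


def fill_down(text):
    """Return (proximity, contact, norms, fav, unfav) tuples."""
    current = [None, None, None]
    records = []
    for line in text.splitlines():
        *factors, fav, unfav = line.split()
        # Overwrite only the rightmost factors that the row supplies.
        current[3 - len(factors):] = factors
        records.append((*current, int(fav), int(unfav)))
    return records


records = fill_down(TABLE)
print(records[1])                           # ('close', 'frequent', 'unfavorable', 30, 36)
print(sum(r[3] + r[4] for r in records))    # 608
```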

You can get a file in the usual character and numeric formats from housing.dat or housing.raw, respectively. The latter codes the factor levels in order of appearance as follows:

Housing Conditions in Copenhagen

These are the Madsen data used in the revised lecture notes. This is a four-way table classifying 1681 residents of twelve areas in Copenhagen in terms of

The data are available in the usual character and numeric formats from copen.dat or copen.raw, respectively.

The Cancer Data

These are the data from Bishop, Fienberg and Holland (1975) on the three-year survival status of breast-cancer patients by age and malignancy of tumor:

                    survive?
      age malignant yes no 
1 under50        no  77 10
2 under50       yes  51 13
3   50-69        no  51 11
4   50-69       yes  38 20
5     70+        no   7  3
6     70+       yes   6  3

You can get a file in the usual character and numeric formats from cancer.dat or cancer.raw.
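As a small illustration of working with this table, the Python sketch below computes the proportion surviving in each of the six cells. The counts are copied from the listing above; the variable names are illustrative.

```python
# Sketch: proportion surviving three years in each age-by-malignancy cell.
# Counts are copied from the table above.
CANCER = [
    # age, malignant, survived_yes, survived_no
    ("under50", "no", 77, 10),
    ("under50", "yes", 51, 13),
    ("50-69", "no", 51, 11),
    ("50-69", "yes", 38, 20),
    ("70+", "no", 7, 3),
    ("70+", "yes", 6, 3),
]

survival = {
    (age, malignant): yes / (yes + no)
    for age, malignant, yes, no in CANCER
}

for (age, malignant), p in survival.items():
    print(f"{age:>8} malignant={malignant:<3} {p:.3f}")
```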

The Method Choice Data

The method choice data from Brazil are available in a file containing three columns:

As usual, the file is available in two formats: brazil.dat codes the factors using character labels, and brazil.raw uses numeric codes (the age groups are coded 1-6 and the methods are coded 1=not_using, 2=inefficient, 3=efficient, 4=sterilization).

You can read the file with character labels (brazil.dat) into Stata using the command
infile str6 age str14 method freq using brazil.dat
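For readers not using Stata, a rough Python counterpart of the command above is sketched here. It assumes each record holds three whitespace-separated fields (age label, method label, frequency) with no header line. The two sample rows are placeholders invented for this example, not values from brazil.dat, and the names Row and read_rows are hypothetical.

```python
# Sketch: read "age method freq" records, mirroring the Stata infile call.
# Assumptions: whitespace-separated fields and no header line.
from typing import NamedTuple


class Row(NamedTuple):
    age: str      # age-group label (str6 in the Stata command)
    method: str   # method label (str14 in the Stata command)
    freq: int     # frequency


def read_rows(lines):
    """Parse an iterable of text lines into Row records, skipping blanks."""
    rows = []
    for line in lines:
        if not line.strip():
            continue
        age, method, freq = line.split()
        rows.append(Row(age, method, int(freq)))
    return rows


# Placeholder records (NOT real data), only to show the call.
PLACEHOLDER = ["age1 not_using 10", "age1 efficient 5"]
rows = read_rows(PLACEHOLDER)
print(rows[0].method, rows[0].freq)  # not_using 10
```

To read an actual file, pass an open file object to read_rows in place of the placeholder list.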

Health Care Utilization in Guatemala

This dataset comes from the Guatemalan Survey of Family Health, a survey of rural women that contains detailed data on care received during pregnancy and delivery along with extensive background information. We have tabulated data on 3334 pregnancies. The outcome is the type of provider seen during pregnancy and there are three predictors. The raw data file has five columns, as follows:

The data are available using numeric codes as healthCare.raw and using string codes as well as row and column labels as healthCare.dat.

For more information on this study, see Glei, D. A. and Goldman, N. (2000), Understanding Ethnic Variation in Pregnancy-related Care in Rural Guatemala, Ethnicity and Health, 5:5-22.

The Social Mobility Data

The Social Mobility Data are available in a file containing five columns:

The file is available as mobility.dat.

Time to Ph.D.

The Time to Ph.D. data are available in a file containing five columns:

The file has 73 rows and is called phd.dat.

The Gehan Survival Data

The Gehan data show the length of remission in weeks for two groups of leukemia patients, treated and control, and were analyzed by Cox in his original proportional hazards paper. The data are available in a file containing three columns:

Thus, the third and fourth observations, 6 and 6+, corresponding to a death and a censored observation at six weeks, are coded 6, 1 and 6, 0, respectively.

The data are available in the usual two formats in gehan.dat and gehan.raw (group codes are 1=control, 2=treated).
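The coding rule described above, where a trailing + marks a censored time, can be sketched in Python as follows. The function name code_observation is illustrative; only the two observations mentioned in the text are used.

```python
# Sketch: turn remission times written as "6" or "6+" into
# (weeks, status) pairs, with status 1 = death and 0 = censored.
def code_observation(value):
    """Return (weeks, status) for a time such as '6' or '6+'."""
    censored = value.endswith("+")
    weeks = int(value.rstrip("+"))
    return (weeks, 0 if censored else 1)


print([code_observation(v) for v in ["6", "6+"]])  # [(6, 1), (6, 0)]
```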

The Somoza Dataset

These are Somoza's (1980) data on infant and child survival in Colombia, used in the notes.

The file has 48 lines, corresponding to the 48 combinations of sex, cohort and age in Table 3, and six columns:

To get a copy of this file choose somoza.dat, which uses character labels for sex, cohort and age, or somoza.raw, which uses numeric codes for all variables.

In order to analyze these data using piece-wise exponential models you first have to calculate events and exposure by sex, cohort and age. Although this can be done using a statistical package such as Stata or S-Plus, you may find that a spreadsheet package is better suited for the task.
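One way to carry out this calculation is sketched below in Python for a single sex-and-cohort group. It is an illustration under stated assumptions, not the procedure used in the notes: each age interval is described by its width, the number of deaths in it and the number of children censored in it, and those who die or are censored within an interval are assumed to contribute half of its width. The function name and all numbers are made up for the example.

```python
# Sketch: events and exposure per age interval for one group.
# Assumptions (not taken from the data file): deaths and censored cases
# are exposed for half the interval in which they leave observation.
def events_and_exposure(widths, deaths, censored):
    """Return a list of (events, exposure) pairs, one per interval."""
    at_risk = sum(deaths) + sum(censored)   # everyone starts under observation
    result = []
    for width, d, c in zip(widths, deaths, censored):
        exposure = width * (at_risk - (d + c) / 2)
        result.append((d, exposure))
        at_risk -= d + c                     # survivors enter the next interval
    return result


# Hypothetical counts: two intervals of width 1 and 2 (in any time unit).
table = events_and_exposure(widths=[1, 2], deaths=[10, 5], censored=[20, 65])
print(table)  # [(10, 85.0), (5, 70.0)]
```

The resulting events and exposures are the inputs a piece-wise exponential model needs, with the exposure typically entering as an offset on the log scale.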

Marriage Dissolution in the U.S.

This dataset is based on a longitudinal survey conducted in the U.S. The unit of observation is the couple and the event of interest is divorce, with interview or widowhood treated as censoring events. We have three fixed covariates: education of the husband and two indicators of ethnicity of the couple. The file has data from 3371 couples, with six variables coded as follows:

The file is available in "raw" format as divorce.raw.

This dataset is adapted from an example in Lillard and Panis (2000), aML Multilevel Multiprocess Statistical Software, Release 1.0, EconWare, LA, California.