
This is a collection of small datasets used in the course. The datasets are now available in Stata format and can be read directly from Stata by typing use http://data.princeton.edu/wws509/datasets/DatasetName. They are also available in two plain text formats, as explained below.

List of Datasets

Here is a list of datasets classified by the type of statistical technique that may be used to analyze them. A couple of datasets appear in more than one category.

Linear Regression
 
 
 

Logistic Regression
 
 
 

Poisson Regression
 
 
 
 

Log-Linear Models for Contingency Tables and Multinomial Response Models
 
 
 
 
 

Survival Data
 
 
 
 

Data Formats

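All datasets are available as plain-text ASCII files, usually in two formats: a 'dat' file, which has a header line with the variable names and usually codes factors using character labels, and a 'raw' file, which omits the header and usually uses numeric codes.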

To download any of these files using your browser I recommend that you right-click and choose 'save as...'. If you left-click what happens next depends on how your browser is configured to handle these file types, and will often require an extra step.

The datasets are also available as Stata system files with extension .dta, and can be read directly from net-aware Stata versions 10 or higher using the command given at the top of this page. This is the easiest method for Stata users. You can also right-click on the links to save a local copy. R users can read the Stata files using Tom Lumley's read.dta function in the foreign package.

The Program Effort Data

Here are the famous program effort data from Mauldin and Berelson. This extract consists of observations on an index of social setting, an index of family planning effort, and the percent decline in the crude birth rate (CBR) between 1965 and 1975, for 20 countries in Latin America.

                 setting  effort   change
   Bolivia            46       0        1
   Brazil             74       0       10
   Chile              89      16       29
   Colombia           77      16       25
   CostaRica          84      21       29
   Cuba               89      15       40
   DominicanRep       68      14       21
   Ecuador            70       6        0
   ElSalvador         60      13       13
   Guatemala          55       9        4
   Haiti              35       3        0
   Honduras           51       7        7
   Jamaica            87      23       21
   Mexico             83       4        9
   Nicaragua          68       0        7
   Panama             84      19       22
   Paraguay           74       3        6
   Peru               73       0        2
   TrinidadTobago     84      15       29
   Venezuela          91       7       11

The data are available as plain text files effort.dat, which has a header line with the variable names, and effort.raw, which omits it; otherwise both files look like the listing above. The data are also available in Stata format as effort.dta.
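As a small illustration of how the listing above can be used, the following Stata sketch types the 20 observations in directly with input and regresses the decline in the CBR on setting and effort. The variable name country is an assumption made for this example (the listing has no header for the first column), and the linear regression is shown purely as an illustration, not as a recommended model.

```stata
* Enter the program effort data by hand; 'country' is a name chosen for this sketch
clear
input str14 country setting effort change
Bolivia            46       0        1
Brazil             74       0       10
Chile              89      16       29
Colombia           77      16       25
CostaRica          84      21       29
Cuba               89      15       40
DominicanRep       68      14       21
Ecuador            70       6        0
ElSalvador         60      13       13
Guatemala          55       9        4
Haiti              35       3        0
Honduras           51       7        7
Jamaica            87      23       21
Mexico             83       4        9
Nicaragua          68       0        7
Panama             84      19       22
Paraguay           74       3        6
Peru               73       0        2
TrinidadTobago     84      15       29
Venezuela          91       7       11
end

* Quick check of what was read
list in 1/5
summarize setting effort change

* Illustrative linear regression of CBR decline on setting and effort
regress change setting effort
```

Typing the data in this way is only practical for very small files; for the larger datasets below, reading the Stata file directly is simpler.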

Reference: P.W. Mauldin and B. Berelson (1978). Conditions of fertility decline in developing countries, 1965-75. Studies in Family Planning, 9:89-147. JSTOR: http://www.jstor.org/stable/1965523.

Discrimination in Salaries

These are the salary data used in Weisberg's book, consisting of observations on six variables for 52 tenure-track professors in a small college. The variables are:

  • sx = Sex, coded 1 for female and 0 for male
  • rk = Rank, coded
    • 1 for assistant professor,
    • 2 for associate professor, and
    • 3 for full professor
  • yr = Number of years in current rank
  • dg = Highest degree, coded 1 if doctorate, 0 if masters
  • yd = Number of years since highest degree was earned
  • sl = Academic year salary, in dollars.

The file is available in the usual plain text formats as salary.dat using character codes and salary.raw using numeric codes, and in Stata format as salary.dta.

Reference: S. Weisberg (1985). Applied Linear Regression, Second Edition. New York: John Wiley and Sons. Page 194.

Births in Philadelphia

These are data based on a 5% sample of all births occurring in Philadelphia in 1990. The sample has 1115 observations (after deleting 32 cases with incomplete information) on five variables:

  • black = Mother is black (1=yes, 0=no),
  • educ = Mother's years of education (0 to 17),
  • smoke = Whether mother smoked during pregnancy (1=yes, 0=no),
  • gestate = Gestational age in weeks, and
  • grams = Birth weight in grams.

The data are available in plain text format in the files phbirths.raw and phbirths.dat, and in Stata format as phbirths.dta.

The 'dat' file codes black and smoke using TRUE or FALSE, whereas the 'raw' file uses 1 and 0.

Reference: I. T. Elo, G. Rodríguez and H. Lee (2001). Racial and Neighborhood Disparities in Birthweight in Philadelphia. Paper presented at the Annual Meeting of the Population Association of America, Washington, DC 2001.

The Contraceptive Use Data

Here are the contraceptive use data from page 46 of the lecture notes (and from the Stata handout), showing the distribution of 1607 currently married and fecund women interviewed in the Fiji Fertility Survey, according to age, education, desire for more children and current use of contraception.

    age education wantsMore notUsing using 
    <25       low       yes       53     6
    <25       low        no       10     4
    <25      high       yes      212    52
    <25      high        no       50    10
  25-29       low       yes       60    14
  25-29       low        no       19    10
  25-29      high       yes      155    54
  25-29      high        no       65    27
  30-39       low       yes      112    33
  30-39       low        no       77    80
  30-39      high       yes      118    46
  30-39      high        no       68    78
  40-49       low       yes       35     6
  40-49       low        no       46    48
  40-49      high       yes        8     8
  40-49      high        no       12    31

The data are available in the format shown above as cuse.dat.

The dataset is also available in the format used in the Stata handout. This version has 32 rows corresponding to all possible covariate and response patterns, and includes a weight indicating the frequency of each combination. The file has 5 columns with numeric codes:

  • age (four groups, 1=<25, 2=25-29, 3=30-39 and 4=40-49),
  • education (0=none, 1=some),
  • desire for more children (0=more, 1=no more),
  • contraceptive use (0=no, 1=yes), and
  • frequency (number of cases in this category).

The data in this alternative format are available in plain text as cuse.raw and in Stata format as cuse.dta.
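The relationship between the two layouts can be sketched in Stata as follows. The code enters the 16 rows of the listing using numeric codes, reshapes them into the 32-row layout with a frequency column, and fits a logit model with frequency weights. The variable names (educ, nomore, cuse, freq) and the coding of education as 0=low, 1=high are assumptions made for this sketch and may not match the names in cuse.raw or cuse.dta. Note that using is a reserved word in Stata, so it cannot serve as a variable name.

```stata
* One row per covariate pattern: freq0 = not using, freq1 = using
* Assumed codes: age 1=<25, 2=25-29, 3=30-39, 4=40-49; educ 0=low, 1=high;
* nomore 0=wants more children, 1=wants no more
clear
input age educ nomore freq0 freq1
1 0 0  53  6
1 0 1  10  4
1 1 0 212 52
1 1 1  50 10
2 0 0  60 14
2 0 1  19 10
2 1 0 155 54
2 1 1  65 27
3 0 0 112 33
3 0 1  77 80
3 1 0 118 46
3 1 1  68 78
4 0 0  35  6
4 0 1  46 48
4 1 0   8  8
4 1 1  12 31
end

* Go from 16 rows to 32: one row per covariate-and-response pattern
reshape long freq, i(age educ nomore) j(cuse)

* The frequencies should add up to the 1607 women in the table
quietly summarize freq
display "total women = " r(sum)

* Illustrative additive logit model using frequency weights
logit cuse i.age educ nomore [fweight=freq]
```

The additive model is only an example; any model fitted to the 16-row layout as grouped binomial data should give the same estimates as the weighted fit on 32 rows.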

Reference: Little, R. J. A. (1978). Generalized Linear Models for Cross-Classified Data from the WFS. World Fertility Survey Technical Bulletins, Number 5.

The Children Ever Born Data

These are the data from Fiji on children ever born, from page 84 of the lecture notes (and the Stata handout).

The dataset has 70 rows representing grouped individual data. Each row has entries for:

  • The cell number (1 to 71, cell 68 has no observations),
  • marriage duration (1=0-4, 2=5-9, 3=10-14, 4=15-19, 5=20-24, 6=25-29),
  • residence (1=Suva, 2=Urban, 3=Rural),
  • education (1=none, 2=lower primary, 3=upper primary, 4=secondary+),
  • mean number of children ever born (e.g. 0.50),
  • variance of children ever born (e.g. 1.14), and
  • number of women in the cell (e.g. 8).

This file is available in the usual two formats: ceb.dat has a header and uses character labels for the factors, and ceb.raw uses numeric codes, as described above.

Reference: Little, R. J. A. (1978). Generalized Linear Models for Cross-Classified Data from the WFS. World Fertility Survey Technical Bulletins, Number 5.

Smoking and Lung Cancer

This dataset has information on lung cancer deaths by age and smoking status.

The file in "raw" format, smoking.raw, has four columns:

  • age: in five-year age groups coded 1 to 9 for 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80+.
  • smoking status: coded 1 = doesn't smoke, 2 = smokes cigars or pipe only, 3 = smokes cigarettes and cigar or pipe, and 4 = smokes cigarettes only,
  • population: in hundreds of thousands, and
  • deaths: number of lung cancer deaths in a year.

The file is also available in "dat" format as smoking.dat, with variable names, row names and string labels for age and smoking status.

The Ship Damage Data

These are the data from McCullagh and Nelder. The file has 34 rows corresponding to the observed combinations of type of ship, year of construction and period of operation. Each row has information on five variables as follows:

  • ship type, coded 1-5 for A, B, C, D and E,
  • year of construction (1=1960-64, 2=1965-69, 3=1970-74, 4=1975-79),
  • period of operation (1=1960-74, 2=1975-79)
  • months of service, ranging from 63 to 20,370, and
  • damage incidents, ranging from 0 to 53.

Note that there are no ships of type E built in 1960-64, and that ships built in 1975-79 could not have operated in 1960-74. These combinations are omitted from the data file.

You can get the data in the usual versions: ships.dat has a header and codes the factors using strings, and ships.raw uses the numeric codes shown above.

Reference: McCullagh, P. and Nelder, J. (1989) Generalized Linear Models, 2nd Edition. Chapman and Hall, London. Page 204.

The Housing Data

These are the data from Wilner, Walkley and Cook on the effect of racial attitudes on segregation and integration of public housing. The data can be viewed as a 2x2x2x2 contingency table:

                                     Sentiment
Proximity  Contact     Norms         fav unfav
close      frequent    favorable     77    32
                       unfavorable   30    36
           infrequent  favorable     14    19
                       unfavorable   15    27
distant    frequent    favorable     43    20
                       unfavorable   36    37
           infrequent  favorable     27    36
                       unfavorable   41   118

You can get a file in the usual character and numeric formats from housing.dat or housing.raw, respectively, and in Stata format from housing.dta.

The "raw" data file codes the factor levels in order of appearance as follows:

  • Proximity: 1 = close, 2=distant
  • Contact: 1 = frequent, 2=infrequent
  • Norms: 1=favorable, 2=unfavorable

For regression analysis it would have been better to code these variables using 1 and 0 instead of 1 and 2, and rename them to something like proximClose, contactFreq, and normsFav. I haven't done this because it might break existing code, but the new variables can easily be added.
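Here is a minimal sketch of how such indicators could be added in Stata. It assumes the three factors are stored under the hypothetical names proximity, contact and norms with the 1/2 codes listed above; the actual variable names and layout of housing.raw and housing.dta may differ. The counts are typed in from the table, with one row per combination of the three factors.

```stata
* Counts from the table above; variable names are assumptions for this sketch
clear
input proximity contact norms fav unfav
1 1 1 77  32
1 1 2 30  36
1 2 1 14  19
1 2 2 15  27
2 1 1 43  20
2 1 2 36  37
2 2 1 27  36
2 2 2 41 118
end

* Add 0/1 indicators alongside the original 1/2 codes, which are left untouched
generate proximClose = (proximity == 1)
generate contactFreq = (contact == 1)
generate normsFav    = (norms == 1)

* Check the new indicators against the old codes
tabulate proximity proximClose
```

Because the original variables are kept, code written for the 1/2 coding continues to work.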

Reference: Wilner, D., Walkley, R.R. and Cook, S.W. (1955). Human relations in interracial housing: A study of the contact hypothesis. University of Minnesota Press.

Housing Conditions in Copenhagen

These are the Madsen data used in the revised lecture notes. This is a four-way table classifying 1681 residents of twelve areas in Copenhagen in terms of:

  • the type of housing they had (1=tower blocks, 2=apartments, 3=atrium houses and 4=terraced houses),
  • their feeling of influence on apartment management (1=low, 2=medium, 3=high),
  • their degree of contact with neighbors (1=low, 2=high), and
  • their satisfaction with housing conditions (1=low, 2=medium, 3=high).

The data file contains 72 rows, one for each combination of values of the four variables, and has six columns: a row number, the four variables, and the number of cases in the category. The file is available in the usual character and numeric formats as copen.dat or copen.raw, respectively, and in Stata format as copen.dta.

Reference: Madsen, M. (1976). Statistical Analysis of Multiple Contingency Tables. Two Examples. Scand. J. Statist., 3:97-106. JSTOR: http://www.jstor.org/stable/4615621

The Cancer Data

These are the data from Bishop, Fienberg and Holland on the three-year survival status of breast-cancer patients by age and malignancy of tumor:

                    survive?
      age malignant yes no 
1 under50        no  77 10
2 under50       yes  51 13
3   50-69        no  51 11
4   50-69       yes  38 20
5     70+        no   7  3
6     70+       yes   6  3

You can get a file in the usual character and numeric formats from cancer.dat or cancer.raw, and in Stata format from cancer.dta.
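Because the table is so small, it can also be typed in directly. The sketch below does this in Stata and fits a binomial logit model to the grouped counts. The variable names (alive, dead, n, psurv) and the numeric age codes are assumptions for this example rather than the names used in cancer.raw or cancer.dta, and the additive model is shown only as an illustration.

```stata
* Three-year survival counts from the table above (names are assumptions)
clear
input age malignant alive dead
1 0 77 10
1 1 51 13
2 0 51 11
2 1 38 20
3 0 7 3
3 1 6 3
end

* Attach the age labels used in the listing
label define agelbl 1 "under50" 2 "50-69" 3 "70+"
label values age agelbl

* Group size and observed proportion surviving in each row
generate n = alive + dead
generate psurv = alive / n
list age malignant n psurv

* Illustrative binomial logit model for grouped data
glm alive i.age malignant, family(binomial n)
```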

Reference: Bishop, Y. M. M.; Fienberg, S. E. and Holland, P. W. (1975). Discrete Multivariate Analysis: Theory and Practice. MIT Press, Cambridge.

The Method Choice Data

The method choice data from Brazil are available in a file containing three columns:

  • Age group: 15-19, 20-24, 25-29, 30-34, 35-39 or 40-44
  • Method: sterilization, efficient, inefficient, or not_using
  • Frequency: the number of women in each age/method combination.

As usual, the file is available in two formats: brazil.dat codes the factors using character labels, and brazil.raw uses numeric codes (the age groups are coded 1-6 and the methods are coded 1=not_using, 2=inefficient, 3=efficient, 4=sterilization).

You can read the file with character labels (brazil.dat) into Stata using the command

infile str6 age str14 method freq ///
  using brazil.dat

but of course we now provide a Stata file as brazil.dta.

Health Care Utilization in Guatemala

This dataset comes from the Guatemalan Survey of Family Health, a survey of rural women that contains detailed data on care received during pregnancy and delivery along with extensive background information.

We have tabulated data on 3334 pregnancies. The outcome is the type of provider seen during pregnancy and there are three predictors. The raw data file has five columns, as follows:

  • eth = Ethnicity/Language, coded 1=Indigenous, non-Spanish speaker, 2=Indigenous, Spanish speaker, and 3=Ladino.
  • migr = Migration, whether the community has frequent migration abroad, coded 1=yes, 0=no.
  • avail = Availability of modern health services within one hour of the community, coded 1=yes, 0=no.
  • type = Provider type, coded 1=none, 2=midwife, 3=health post and 4=doctor. For simplicity, women seeing multiple provider types during their pregnancy were coded using the most modern type; for example women seeing both a midwife and a doctor were coded under doctor.
  • n = Count of the number of women in each category defined by the previous four columns.

The data are available using numeric codes as healthCare.raw and using string codes as well as row and column labels as healthCare.dat.

Reference: Glei, D. A. and Goldman, N. (2000), Understanding Ethnic Variation in Pregnancy-related Care in Rural Guatemala, Ethnicity and Health, 5:5-22.

The Social Mobility Data

The Social Mobility Data are available in a file containing five columns:

  • father's occupation: 1=farm, 2=unskilled, 3=skilled, 4=professional.
  • son's occupation: same categories as the father.
  • race: coded 1 for blacks, 0 for others.
  • disruption: coded 1 for non-intact family background, 0 otherwise.
  • number of cases

The file is available as mobility.dat, and also in Stata format.

This is a simplified version of a dataset from StatLib which may be found at http://lib.stat.cmu.edu/datasets/socmob. I rounded the counts for son's current occupation to the nearest integer, and grouped both father's and son's occupation into just four categories, treating 1-2 as farm, 3-6 as unskilled, 7-11 as skilled and 12-17 as professional/managerial.

If you use the data in a publication please acknowledge StatLib and the original authors, David L. Featherman and Robert M. Hauser (1978). Opportunity and Change. New York: Academic Press. The data were also analyzed by Timothy J. Biblarz and Adrian E. Raftery (1993). "The Effects of Family Disruption on Social Mobility", American Sociological Review, 58(1):97-109.

Time to Ph.D.

The Time to Ph.D. data are available in a file containing five columns:

  • year: coded 1 to 14, representing years of graduate school.
  • university: coded 1 for Berkeley, 2 for Columbia, 3 for Princeton.
  • residence: coded 1 for permanent residents, 2 for temporary residents.
  • events: number of students graduating in this category.
  • exposure: number of person-years of exposure to graduation in this category.

The file has 73 rows and is called phd.dat.

Reference: Espenshade, T.J. and Rodríguez, G. (1997). Completing the Ph.D.: Comparative Performances of U.S. and Foreign Students. Social Science Quarterly, 78:593-605.

The Gehan-Freireich Survival Data

The data show the length of remission in weeks for two groups of leukemia patients, treated and control, and were analyzed by Cox in his original proportional hazards paper. The data are available in a file containing three columns:

  • Treatment: coded Treated (drug) or Control (placebo),
  • Time: weeks of remission,
  • Failure: coded 1 if a failure (relapse), 0 if censored

Thus, the third and fourth observations, 6 and 6+, corresponding to a relapse and a censored observation at six weeks, are coded 6, 1 and 6, 0, respectively.

The data are available in the usual two plain-text formats in gehan.dat and gehan.raw (group codes are 1=control, 2=treated), and as a Stata file in gehan.dta.
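To illustrate the coding, here is a Stata sketch that enters the remission times for the treated group, with a trailing + in the published listing translated into a failure indicator of 0. The 21 values are reproduced from the commonly published version of these data and should be checked against gehan.dat; the variable names weeks and relapse are assumptions for this example.

```stata
* Treated group: 6, 6, 6, 6+, 7, 9+, 10, 10+, 11+, 13, 16, 17+, 19+, 20+,
* 22, 23, 25+, 32+, 32+, 34+, 35+  (a '+' marks a censored time)
clear
input weeks relapse
 6 1
 6 1
 6 1
 6 0
 7 1
 9 0
10 1
10 0
11 0
13 1
16 1
17 0
19 0
20 0
22 1
23 1
25 0
32 0
32 0
34 0
35 0
end

* Declare the data as survival-time data: time in weeks, failure = relapse
stset weeks, failure(relapse)

* Kaplan-Meier estimate of the survival function
sts list
```

After stset, Stata stores the failure indicator in _d, so the 6 and 6+ observations differ only in that variable.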

These data actually come from a matched-pairs design, where patients were paired according to remission status (partial or complete) and then randomly assigned to the treated or control group, but most analyses have ignored this fact. See Andersen et al (1993), pages 22-23, which has references to several papers using this dataset.

Reference: Andersen, P. K.; Borgan, O.; Gill, R. D. and Keiding, N. (1993). Statistical Models Based on Counting Processes, Springer-Verlag, New York.

The Somoza Dataset

These are Somoza's data on infant and child survival in Colombia, used in the notes. The dataset comes from the World Fertility Survey, which was fielded in Colombia in 1976. Women in the reproductive ages were asked about their children, and these were tabulated by sex, year of birth (cohort), survival status and age at death or at interview; see Table 3 in the notes.

The file has 48 lines, corresponding to the 48 combinations of sex, cohort and age, and six columns:

  • sex: 1=Male or 2=Female,
  • cohort: 1=1941-59, 2=1960-67 or 3=1968-76,
  • age: 0-1/12, 1/12-3/12, 3/12-6/12, 1/2-1, 1-2, 2-5, 5-10 or 10+, coded 1 to 8 in this order
  • dead: number dead in this category
  • alive: number alive at interview

To get a copy of this file in plain text format choose somoza.dat, which uses character labels for sex, cohort and age, or somoza.raw, which uses numeric codes for all variables. The file is also available in Stata format as somoza.dta.

In order to analyze these data using piece-wise exponential models you first have to calculate events and exposure by sex, cohort and age. This calculation is often a non-trivial step in preparing the data for survival analysis, but our Stata log shows all the steps needed. The final step of that process, a file with events and exposure by cohort and age (collapsing over sex) is available in Stata format as somoza2.dta.
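The general idea of the calculation can be sketched as follows. This is not the procedure from the Stata log; it is a simplified illustration on made-up numbers for a single group, assuming that children who die or are interviewed in an age interval contribute on average half the interval's width in exposure. The variable names, interval widths and counts are all hypothetical.

```stata
* Hypothetical counts for one group: width is the interval length in years
clear
input age width dead alive
1 1 10  5
2 1  8  7
3 2  4 66
end

* Children leaving observation in each interval, by death or by interview
generate nexit = dead + alive

* Children entering an interval: all those who exit in it or in a later one
gsort -age
generate nenter = sum(nexit)
sort age

* Exposure: full width for those still under observation at the end of the
* interval, half the width (an assumption) for those who die or are censored
generate exposure = width * (nenter - nexit/2)
generate events = dead

list age nenter events exposure
```

With the real data the same steps would be carried out within each combination of sex and cohort, for example using the by prefix.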

Reference: Somoza, J. (1980). Illustrative Analysis: Infant and Child Mortality in Colombia. World Fertility Survey Scientific Reports, Number 10.

Marriage Dissolution in the U.S.

This dataset, adapted from an example in the software package aML, is based on a longitudinal survey conducted in the U.S.

The unit of observation is the couple and the event of interest is divorce, with interview and widowhood treated as censoring events. We have three fixed covariates: education of the husband and two indicators of the couple's ethnicity: whether the husband is black and whether the couple is mixed.

The variables are

  • id: a couple number.
  • heduc: education of the husband, coded
    • 0 = less than 12 years,
    • 1 = 12 to 15 years, and
    • 2 = 16 or more years.
  • heblack: coded 1 if the husband is black and 0 otherwise
  • mixed: coded 1 if the husband and wife have different ethnicity (defined as black or other), 0 otherwise.
  • years: duration of marriage, from the date of wedding to divorce or censoring (due to widowhood or interview).
  • div: the failure indicator, coded 1 for divorce and 0 for censoring.

The dataset has 3771 couples and is available in "raw" format as divorce.raw and in "dat" format as divorce.dat. The file is also available in Stata format as divorce.dta.

Reference: Lillard and Panis (2000), aML Multilevel Multiprocess Statistical Software, Release 1.0, EconWare, LA, California.