This is a collection of small datasets used in the course.
The datasets are now available in Stata format and can be read
directly from Stata by typing
use http://data.princeton.edu/wws509/datasets/DatasetName.
They are also available in two plain text formats, as
explained below.
List of Datasets
Here is a list of datasets classified by the type of statistical technique that may be used to analyze them. A couple of datasets appear in more than one category.
Linear Regression
Logistic Regression
Poisson Regression
Log-Linear Models for Contingency Tables
and Multinomial Response Models
Survival Data
Data Formats
All datasets are available as plain-text ASCII files, usually in two formats:
- The copy with extension .dat has a header line with the variable names, and codes categorical variables using character strings. This version is best for users of S-Plus or R, who can use read.table. Some files do not have row names; in these cases use header=T.
- The copy with extension .raw omits the header line and codes all variables using numeric codes. This version is better for users of Stata or other packages that prefer numerical codes. (However, Stata can read the character version if you specify the string width using str.)
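For readers working outside R, the header convention of the .dat files can also be handled by hand. The following is a minimal Python sketch, assuming whitespace-separated fields and the same rule read.table applies: when the header line has one fewer field than the data lines, the first field of each data line is treated as a row name. The function name read_dat is hypothetical, and the two sample rows are taken from the program effort listing further down this page.

```python
def read_dat(text):
    """Parse whitespace-separated text that starts with a header line.

    Returns (row_names, records). row_names is None when the file has
    no row-name column. All values are returned as strings.
    """
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    # A header one field shorter than the data rows signals row names.
    has_row_names = bool(rows) and len(rows[0]) == len(header) + 1
    row_names = [r[0] for r in rows] if has_row_names else None
    records = []
    for r in rows:
        values = r[1:] if has_row_names else r
        records.append(dict(zip(header, values)))
    return row_names, records


sample = """setting effort change
Bolivia 46 0 1
Brazil 74 0 10
"""
names, records = read_dat(sample)
print(names)
print(records[0])
```

Values come back as strings, so numeric columns still need an explicit conversion such as int() or float().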
The datasets are also available as
Stata system files
with extension .dta, and can be read
directly from net-aware Stata versions 10 or higher
using the command given at the top of this page.
This is the easiest method for Stata users.
You can also right click on the links to save a local copy.
R users can read the Stata files using Tom Lumley's
read.dta function in the foreign package.
The Program Effort Data
Here are the famous program effort data from Mauldin and Berelson. This extract consists of observations on an index of social setting, an index of family planning effort, and the percent decline in the crude birth rate (CBR) between 1965 and 1975, for 20 countries in Latin America.
setting effort change
Bolivia 46 0 1
Brazil 74 0 10
Chile 89 16 29
Colombia 77 16 25
CostaRica 84 21 29
Cuba 89 15 40
DominicanRep 68 14 21
Ecuador 70 6 0
ElSalvador 60 13 13
Guatemala 55 9 4
Haiti 35 3 0
Honduras 51 7 7
Jamaica 87 23 21
Mexico 83 4 9
Nicaragua 68 0 7
Panama 84 19 22
Paraguay 74 3 6
Peru 73 0 2
TrinidadTobago 84 15 29
Venezuela 91 7 11
The data are available as plain text files effort.dat, which has a header line with the variable names, and effort.raw, which omits it; otherwise both files look like the listing above. The data are also available in Stata format as effort.dta.
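Since these data are listed under linear regression, a small worked example may help. The sketch below, in Python, copies the 20 rows from the listing above and fits a simple least-squares line of CBR decline on social setting using the closed-form formulas. The helper simple_ols is a hypothetical name, and this is only an illustration, not the analysis used in the course.

```python
# (country, setting, effort, change), copied from the listing above
effort_data = [
    ("Bolivia", 46, 0, 1), ("Brazil", 74, 0, 10), ("Chile", 89, 16, 29),
    ("Colombia", 77, 16, 25), ("CostaRica", 84, 21, 29), ("Cuba", 89, 15, 40),
    ("DominicanRep", 68, 14, 21), ("Ecuador", 70, 6, 0),
    ("ElSalvador", 60, 13, 13), ("Guatemala", 55, 9, 4), ("Haiti", 35, 3, 0),
    ("Honduras", 51, 7, 7), ("Jamaica", 87, 23, 21), ("Mexico", 83, 4, 9),
    ("Nicaragua", 68, 0, 7), ("Panama", 84, 19, 22), ("Paraguay", 74, 3, 6),
    ("Peru", 73, 0, 2), ("TrinidadTobago", 84, 15, 29), ("Venezuela", 91, 7, 11),
]


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


setting = [row[1] for row in effort_data]
change = [row[3] for row in effort_data]
intercept, slope = simple_ols(setting, change)
print(f"change = {intercept:.2f} + {slope:.4f} * setting")
```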
Reference: P.W. Mauldin and B. Berelson (1978). Conditions of fertility decline in developing countries, 1965-75. Studies in Family Planning, 9:89-147. JSTOR: http://www.jstor.org/stable/1965523.
Discrimination in Salaries
These are the salary data used in Weisberg's book, consisting of observations on six variables for 52 tenure-track professors in a small college. The variables are:
- sx = Sex, coded 1 for female and 0 for male
- rk = Rank, coded
  - 1 for assistant professor,
  - 2 for associate professor, and
  - 3 for full professor
- yr = Number of years in current rank
- dg = Highest degree, coded 1 if doctorate, 0 if masters
- yd = Number of years since highest degree was earned
- sl = Academic year salary, in dollars.
The file is available in the usual plain text formats as salary.dat using character codes and salary.raw using numeric codes, and in Stata format as salary.dta.
Reference: S. Weisberg (1985). Applied Linear Regression, Second Edition. New York: John Wiley and Sons. Page 194.
Births in Philadelphia
These are data based on a 5% sample of all births occurring in Philadelphia in 1990. The sample has 1115 observations (after deleting 32 cases with incomplete information) on five variables:
- black = Mother is black (1=yes, 0=no),
- educ = Mother's years of education (0 to 17),
- smoke = Whether mother smoked during pregnancy (1=yes, 0=no),
- gestate = Gestational age in weeks, and
- grams = Birth weight in grams.
The data are available in plain text format in the files phbirths.raw and phbirths.dat, and in Stata format as phbirths.dta.
The 'dat' file codes black and smoke using TRUE or FALSE, whereas the 'raw' file uses 1 and 0.
Reference: I. T. Elo, G. Rodríguez and H. Lee (2001). Racial and Neighborhood Disparities in Birthweight in Philadelphia. Paper presented at the Annual Meeting of the Population Association of America, Washington, DC 2001.
The Contraceptive Use Data
Here are the contraceptive use data from page 46 of the lecture notes (and from the Stata handout), showing the distribution of 1607 currently married and fecund women interviewed in the Fiji Fertility Survey, according to age, education, desire for more children and current use of contraception.
age education wantsMore notUsing using
<25 low yes 53 6
<25 low no 10 4
<25 high yes 212 52
<25 high no 50 10
25-29 low yes 60 14
25-29 low no 19 10
25-29 high yes 155 54
25-29 high no 65 27
30-39 low yes 112 33
30-39 low no 77 80
30-39 high yes 118 46
30-39 high no 68 78
40-49 low yes 35 6
40-49 low no 46 48
40-49 high yes 8 8
40-49 high no 12 31
The data are available in the format shown above as cuse.dat.
The dataset is also available in the format used in the Stata handout. This version has 32 rows corresponding to all possible covariate and response patterns, and includes a weight indicating the frequency of each combination. The file has 5 columns with numeric codes:
- age (four groups, 1=<25, 2=25-29, 3=30-39 and 4=40-49),
- education (0=none, 1=some),
- desire for more children (0=more, 1=no more),
- contraceptive use (0=no, 1=yes), and
- frequency (number of cases in this category).
The data in this alternative format are available in plain text as cuse.raw and in Stata format as cuse.dta.
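The relationship between the two layouts can be sketched in Python. The code below starts from the 16-row listing shown above and expands it into 32 rows of (age, education, desire, use, frequency) using the numeric codes just described. It assumes that "low" and "high" education in the listing correspond to codes 0 and 1, and that wantsMore "yes" corresponds to desire code 0; the row order in cuse.raw itself may differ. The function name to_long is hypothetical.

```python
cuse_wide = """age education wantsMore notUsing using
<25 low yes 53 6
<25 low no 10 4
<25 high yes 212 52
<25 high no 50 10
25-29 low yes 60 14
25-29 low no 19 10
25-29 high yes 155 54
25-29 high no 65 27
30-39 low yes 112 33
30-39 low no 77 80
30-39 high yes 118 46
30-39 high no 68 78
40-49 low yes 35 6
40-49 low no 46 48
40-49 high yes 8 8
40-49 high no 12 31
"""

AGE = {"<25": 1, "25-29": 2, "30-39": 3, "40-49": 4}
EDUC = {"low": 0, "high": 1}   # assumed mapping to the 0/1 education code
NO_MORE = {"yes": 0, "no": 1}  # wants more -> 0, wants no more -> 1


def to_long(text):
    """Expand each wide row into two weighted rows, one per response."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header line
        age, educ, wants, not_using, using = line.split()
        base = (AGE[age], EDUC[educ], NO_MORE[wants])
        rows.append(base + (0, int(not_using)))  # use = 0
        rows.append(base + (1, int(using)))      # use = 1
    return rows


cuse_long = to_long(cuse_wide)
print(len(cuse_long), sum(row[4] for row in cuse_long))
```

The frequencies in the long layout add up to the 1607 women in the survey extract, which is a quick check that nothing was lost in the conversion.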
Reference: Little, R. J. A. (1978). Generalized Linear Models for Cross-Classified Data from the WFS. World Fertility Survey Technical Bulletins, Number 5.
The Children Ever Born Data
These are the data from Fiji on children ever born, from page 84 of the lecture notes (and the Stata handout).
The dataset has 70 rows representing grouped individual data. Each row has entries for:
- The cell number (1 to 71, cell 68 has no observations),
- marriage duration (1=0-4, 2=5-9, 3=10-14, 4=15-19, 5=20-24, 6=25-29),
- residence (1=Suva, 2=Urban, 3=Rural),
- education (1=none, 2=lower primary, 3=upper primary, 4=secondary+),
- mean number of children ever born (e.g. 0.50),
- variance of children ever born (e.g. 1.14), and
- number of women in the cell (e.g. 8).
This file is available in the usual two formats: ceb.dat has a header and uses character labels for the factors, and ceb.raw uses numeric codes, as described above.
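Because each row summarizes a group of women, an analysis usually needs the cell total rather than the cell mean. One common way to set up a Poisson model for grouped data of this kind, shown here as a Python sketch and not necessarily the approach in the lecture notes, is to recover the total number of children as mean times number of women and to use the log of the number of women as an offset. The function name poisson_inputs is hypothetical; the values 0.50 and 8 are the example values quoted in the list above.

```python
import math


def poisson_inputs(mean_ceb, n_women):
    """Return (total children ever born, log-exposure offset) for one cell."""
    total = mean_ceb * n_women   # group total implied by the cell mean
    offset = math.log(n_women)   # log of the number of women in the cell
    return total, offset


total, offset = poisson_inputs(0.50, 8)
print(total, round(offset, 4))
```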
Reference: Little, R. J. A. (1978). Generalized Linear Models for Cross-Classified Data from the WFS. World Fertility Survey Technical Bulletins, Number 5.
Smoking and Lung Cancer
This dataset has information on lung cancer deaths by age and smoking status.
The file in "raw" format, smoking.raw, has four columns:
- age: in five-year age groups coded 1 to 9 for 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80+,
- smoking status: coded 1 = doesn't smoke, 2 = smokes cigars or pipe only, 3 = smokes cigarettes and cigar or pipe, and 4 = smokes cigarettes only,
- population: in hundreds of thousands, and
- deaths: number of lung cancer deaths in a year.
The file is also available in "dat" format as smoking.dat, with variable names, row names and string labels for age and smoking status.
The Ship Damage Data
These are the data from McCullagh and Nelder. The file has 34 rows corresponding to the observed combinations of type of ship, year of construction and period of operation. Each row has information on five variables as follows:
- ship type, coded 1-5 for A, B, C, D and E,
- year of construction (1=1960-64, 2=1965-69, 3=1970-74, 4=1975-79),
- period of operation (1=1960-74, 2=1975-79),
- months of service, ranging from 63 to 20,370, and
- damage incidents, ranging from 0 to 53.
Note that there are no ships of type E built in 1960-64, and that ships built in 1975-79 could not have operated in 1960-74. These combinations are omitted from the data file.
You can get the data in the usual versions: ships.dat has a header and codes the factors using strings, and ship.raw uses the numeric codes shown above.
Reference: McCullagh, P. and Nelder, J. (1989) Generalized Linear Models, 2nd Edition. Chapman and Hall, London. Page 204.
The Housing Data
These are the data from Wilner, Walkley and Cook on the effect of racial attitudes on segregation and integration of public housing. The data can be viewed as a 2x2x2x2 contingency table:
                                      Sentiment
Proximity  Contact     Norms          fav   unfav
close      frequent    favorable       77      32
close      frequent    unfavorable     30      36
close      infrequent  favorable       14      19
close      infrequent  unfavorable     15      27
distant    frequent    favorable       43      20
distant    frequent    unfavorable     36      37
distant    infrequent  favorable       27      36
distant    infrequent  unfavorable     41     118
You can get a file in the usual character and numeric formats from housing.dat or housing.raw, respectively, and in Stata format from housing.dta.
The "raw" data file codes the factor levels in order of appearance as follows:
- Proximity: 1=close, 2=distant
- Contact: 1=frequent, 2=infrequent
- Norms: 1=favorable, 2=unfavorable
For regression analysis it would have been better to code these variables using 1 and 0 instead of 1 and 2, and rename them to something like proximClose, contactFreq, and normsFav. I haven't done this because it might break existing code, but the new variables can easily be added.
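As an illustration of how such indicators could be added, here is a short Python sketch. It assumes a layout with one row per combination of the three factors and the two sentiment counts side by side, which may not match the actual column layout of housing.raw; the counts are copied from the table above. Since the factors are coded 1 and 2, each indicator is simply 2 minus the code. The function name add_indicators is hypothetical.

```python
# (proximity, contact, norms, favorable, unfavorable); assumed layout
housing_rows = [
    (1, 1, 1, 77, 32), (1, 1, 2, 30, 36),
    (1, 2, 1, 14, 19), (1, 2, 2, 15, 27),
    (2, 1, 1, 43, 20), (2, 1, 2, 36, 37),
    (2, 2, 1, 27, 36), (2, 2, 2, 41, 118),
]


def add_indicators(row):
    """Return a dict with 1/0 indicators derived from the 1/2 codes."""
    proximity, contact, norms, fav, unfav = row
    return {
        "proximClose": 2 - proximity,  # 1 = close, 0 = distant
        "contactFreq": 2 - contact,    # 1 = frequent, 0 = infrequent
        "normsFav": 2 - norms,         # 1 = favorable, 0 = unfavorable
        "fav": fav,
        "unfav": unfav,
    }


recoded = [add_indicators(row) for row in housing_rows]
print(recoded[0])
```

Keeping the original 1 and 2 codes alongside the new indicators leaves existing code that depends on them unaffected.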
Reference: Wilner, D., Walkley, R.R. and Cook, S.W. (1955). Human relations in interracial housing: A study of the contact hypothesis. University of Minnesota Press
Housing Conditions in Copenhagen
These are the Madsen data used in the revised lecture notes. This is a four-way table classifying 1681 residents of twelve areas in Copenhagen in terms of:
- the type of housing they had (1=tower blocks, 2=apartments, 3=atrium houses and 4=terraced houses),
- their feeling of influence on apartment management (1=low, 2=medium, 3=high),
- their degree of contact with neighbors (1=low, 2=high), and
- their satisfaction with housing conditions (1=low, 2=medium, 3=high).
The data file contains 72 rows, one for each combination of values of the four variables, and has six columns: a row number, the four variables, and the number of cases in the category. The file is available in the usual character and numeric formats: copen.dat or copen.raw, respectively, and in Stata format as copen.dta.
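The count of 72 rows follows directly from the number of levels of each factor (4 x 3 x 2 x 3). A brief Python sketch of the enumeration is shown below; the order in which the combinations are generated here is an assumption and need not match the row order of the data file.

```python
from itertools import product

housing = {1: "tower blocks", 2: "apartments",
           3: "atrium houses", 4: "terraced houses"}
influence = {1: "low", 2: "medium", 3: "high"}
contact = {1: "low", 2: "high"}
satisfaction = {1: "low", 2: "medium", 3: "high"}

# Every combination of the four numeric codes, one per cell of the table.
cells = list(product(housing, influence, contact, satisfaction))
print(len(cells))
print(cells[0], cells[-1])
```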
Reference: Madsen, M. (1976). Statistical Analysis of Multiple Contingency Tables. Two Examples. Scand. J. Statist. 3:97-106. JSTOR: http://www.jstor.org/stable/4615621
The Cancer Data
These are the data from Bishop, Fienberg and Holland on the three-year survival status of breast-cancer patients by age and malignancy of tumor:
survive?
age malignant yes no
1 under50 no 77 10
2 under50 yes 51 13
3 50-69 no 51 11
4 50-69 yes 38 20
5 70+ no 7 3
6 70+ yes 6 3
You can get a file in the usual character and numeric formats from cancer.dat or cancer.raw, and in Stata format from cancer.dta.
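As a quick descriptive look at the table, the Python sketch below computes the observed proportion surviving three years in each of the six age-by-malignancy groups. The counts are copied from the table above; the variable names are illustrative only.

```python
# (age, malignant, survived, died), copied from the table above
cancer = [
    ("under50", "no", 77, 10), ("under50", "yes", 51, 13),
    ("50-69", "no", 51, 11), ("50-69", "yes", 38, 20),
    ("70+", "no", 7, 3), ("70+", "yes", 6, 3),
]

# Observed proportion surviving in each age-by-malignancy group.
survival = {
    (age, malignant): yes / (yes + no)
    for age, malignant, yes, no in cancer
}

for key, prop in survival.items():
    print(key, round(prop, 3))
```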
Reference: Bishop, Y. M. M.; Fienberg, S. E. and Holland, P. W. (1975) Discrete Multivariate Analysis: Theory and Practice. MIT Press, Cambridge.
The Method Choice Data
The method choice data from Brazil are available in a file containing three columns:
- Age group: 15-19, 20-24, 25-29, 30-34, 35-39 or 40-44
- Method: sterilization, efficient, inefficient, or not_using
- Frequency: the number of women in each age/method combination.
As usual, the file is available in two formats: brazil.dat codes the factors using character labels, and brazil.raw uses numeric codes (the age groups are coded 1-6 and the methods are coded 1=not_using, 2=inefficient, 3=efficient, 4=sterilization).
You can read the file with character labels (brazil.dat) into Stata using the command
infile str6 age str14 method freq ///
    using brazil.dat
but of course we now provide a Stata file as brazil.dta.
Health Care Utilization in Guatemala
This dataset comes from the Guatemalan Survey of Family Health, a survey of rural women that contains detailed data on care received during pregnancy and delivery along with extensive background information.
We have tabulated data on 3334 pregnancies. The outcome is the type of provider seen during pregnancy and there are three predictors. The raw data file has five columns, as follows:
- eth = Ethnicity/Language, coded 1=Indigenous, non-Spanish speaker, 2=Indigenous, Spanish speaker, and 3=Ladino.
- migr = Migration, whether the community has frequent migration abroad, coded 1=yes, 0=no.
- avail = Availability of modern health services within one hour of the community, coded 1=yes, 0=no.
- type = Provider type, coded 1=none, 2=midwife, 3=health post and 4=doctor. For simplicity, women seeing multiple provider types during their pregnancy were coded using the most modern type; for example women seeing both a midwife and a doctor were coded under doctor.
- n = Count of the number of women in each category defined by the previous four columns.
The data are available using numeric codes as healthCare.raw and using string codes as well as row and column labels as healthCare.dat.
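The rule used for women who saw several provider types can be written down in a few lines. The Python sketch below assumes that the numeric codes 1 to 4 also rank the providers from least to most modern, which is consistent with the midwife and doctor example above but is not stated explicitly. The function name most_modern is hypothetical.

```python
PROVIDER_CODES = {"none": 1, "midwife": 2, "health post": 3, "doctor": 4}


def most_modern(providers_seen):
    """Return the provider type with the highest code, or "none" if empty."""
    if not providers_seen:
        return "none"
    return max(providers_seen, key=PROVIDER_CODES.__getitem__)


print(most_modern(["midwife", "doctor"]))
print(PROVIDER_CODES[most_modern(["midwife", "health post"])])
```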
Reference: Glei, D. A. and Goldman, N. (2000), Understanding Ethnic Variation in Pregnancy-related Care in Rural Guatemala, Ethnicity and Health, 5:5-22.
The Social Mobility Data
The Social Mobility Data are available in a file containing five columns:
- father's occupation: 1=farm, 2=unskilled, 3=skilled, 4=professional.
- son's occupation: same categories as the father.
- race: coded 1 for blacks, 0 for others.
- disruption: coded 1 for non-intact family background, 0 otherwise.
- number of cases
The file is available as mobility.dat.
Time to Ph.D.
The Time to Ph.D. data are available in a file containing five columns:
- year: coded 1 to 14, representing years of graduate school.
- university: coded 1 for Berkeley, 2 for Columbia, 3 for Princeton.
- residence: coded 1 for permanent residents, 2 for temporary residents.
- events: number of students graduating in this category.
- exposure: number of person-years of exposure to graduation in this category.
The file has 73 rows and is called phd.dat.
Reference: Espenshade, T.J. and Rodríguez, G. (1997). Completing the Ph.D.: Comparative Performances of U.S. and Foreign Students. Social Science Quarterly, 78:593-605.
The Gehan-Freireich Survival Data
The data show the length of remission in weeks for two groups of leukemia patients, treated and control, and were analyzed by Cox in his original proportional hazards paper. The data are available in a file containing three columns:
- Treatment: coded Treated (drug) or Control (placebo),
- Time: weeks of remission,
- Failure: coded 1 if a failure (relapse), 0 if censored
Thus, the third and fourth observations, 6 and 6+, corresponding to a death and a censored observation at six weeks, are coded 6, 1 and 6, 0, respectively.
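The translation from the "6+" notation to a time and failure indicator can be sketched in Python as follows. The function name parse_remission is hypothetical; it only illustrates the coding rule described above.

```python
def parse_remission(token):
    """Convert "6" to (6, 1) and "6+" to (6, 0).

    A trailing "+" marks a censored observation, so the failure
    indicator is 0; otherwise the observation is a failure (1).
    """
    token = token.strip()
    if token.endswith("+"):
        return int(token[:-1]), 0
    return int(token), 1


print(parse_remission("6"), parse_remission("6+"))
```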
The data are available in the usual two plain-text formats in gehan.dat and gehan.raw (group codes are 1=control, 2=treated), and as a Stata file in gehan.dta.
These data actually come from a matched-pairs design, where patients were paired according to remission status (partial or complete) and then randomly assigned to the treated or control group, but most analyses have ignored this fact. See Andersen et al (1993), pages 22-23, which has references to several papers using this dataset.
Reference: Andersen, P. K.; Borgan, O.; Gill, R. D. and Keiding, N. (1993). Statistical Models Based on Counting Processes, Springer-Verlag, New York.
The Somoza Dataset
These are Somoza's data on infant and child survival in Colombia, used in the notes. The dataset comes from the World Fertility Survey, which was fielded in Colombia in 1976. Women in the reproductive ages were asked about their children and these were tabulated by sex, year of birth (cohort), survival status and age at death or at interview, see Table 3 in the notes.
The file has 48 lines, corresponding to the 48 combinations of sex, cohort and age, and six columns:
- sex: 1=Male or 2=Female,
- cohort: 1=1941-59, 2=1960-67 or 3=1968-76,
- age: 0-1/12, 1/12-3/12, 3/12-6/12, 1/2-1, 1-2, 2-5, 5-10 or 10+, coded 1 to 8 in this order
- dead: number dead in this category
- alive: number alive at interview
To get a copy of this file in plain text format choose somoza.dat, which uses character labels for sex, cohort and age, or somoza.raw, which uses numeric codes for all variables. The file is also available in Stata format as somoza.dta.
In order to analyze these data using piece-wise exponential models you first have to calculate events and exposure by sex, cohort and age. This calculation is often a non-trivial step in preparing the data for survival analysis, but our Stata log shows all the steps needed. The final step of that process, a file with events and exposure by cohort and age (collapsing over sex) is available in Stata format as somoza2.dta.
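To give a rough idea of what this calculation involves, here is a Python sketch under stated assumptions; it is not a transcription of the Stata log. It assumes that children who die or are censored within an age interval contribute half the interval width on average, that children who exit in a later interval contribute the full width, and that the open-ended 10+ group is used only to count children who lived through the earlier intervals. The interval widths in years are derived from the age groups listed above; the function name and the small example counts are hypothetical.

```python
# Widths in years of the seven closed age intervals listed above:
# 0-1/12, 1/12-3/12, 3/12-6/12, 1/2-1, 1-2, 2-5, 5-10
SOMOZA_WIDTHS = [1 / 12, 2 / 12, 3 / 12, 6 / 12, 1, 3, 5]


def events_and_exposure(dead, alive, widths):
    """Return a list of (events, exposure) pairs, one per closed interval.

    dead[j] and alive[j] are the numbers dying in, or still alive at
    interview in, age interval j. The lists may have one more entry than
    widths to hold an open-ended final age group.
    """
    results = []
    for j, width in enumerate(widths):
        exiting = dead[j] + alive[j]                    # exit in interval j
        later = sum(dead[j + 1:]) + sum(alive[j + 1:])  # exit afterwards
        exposure = width * (later + 0.5 * exiting)
        results.append((dead[j], exposure))
    return results


# Hypothetical counts for two one-year intervals plus an open-ended group.
dead = [10, 5, 0]
alive = [0, 20, 100]
table = events_and_exposure(dead, alive, [1.0, 1.0])
print(table)
```

Dividing events by exposure in each row gives the observed death rate for that interval, which is the quantity a piece-wise exponential model describes.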
Reference: Somoza, J. (1980). Illustrative Analysis: Infant and Child Mortality in Colombia. World Fertility Survey Scientific Reports, Number 10.
Marriage Dissolution in the U.S.
This dataset, adapted from an example in the software package aML, is based on a longitudinal survey conducted in the U.S.
The unit of observation is the couple and the event of interest is divorce, with interview and widowhood treated as censoring events. We have three fixed covariates: education of the husband and two indicators of the couple's ethnicity: whether the husband is black and whether the couple is mixed.
The variables are
- id: a couple number.
- heduc: education of the husband, coded
  - 0 = less than 12 years,
  - 1 = 12 to 15 years, and
  - 2 = 16 or more years.
- heblack: coded 1 if the husband is black and 0 otherwise
- mixed: coded 1 if the husband and wife have different ethnicity (defined as black or other), 0 otherwise.
- years: duration of marriage, from the date of wedding to divorce or censoring (due to widowhood or interview).
- div: the failure indicator, coded 1 for divorce and 0 for censoring.
The dataset has 3771 couples and is available in "raw" format as divorce.raw and in "dat" format as divorce.dat. The file is also available in Stata format as divorce.dta.
Reference: Lillard and Panis (2000), aML Multilevel Multiprocess Statistical Software, Release 1.0, EconWare, LA, California.
