This is a collection of small datasets used in the course. To get a copy, click on the dataset's link, or right-click (in most browsers) to save a copy to your local machine.
Here is a list of datasets classified by the type of statistical technique illustrated. Some datasets appear in more than one category.
All datasets are available as ASCII files in two formats:
.dat
has a header line with the variable names, and codes
categorical variables using character strings.
This version is best for users of S-Plus or R,
who can use read.table. Note that some files
do not have row names; for those, specify header=T.
.raw
omits the header line and codes all variables using
numeric codes. This version is best for users of
Stata or other packages that prefer numerical codes.
(However, Stata can read the character version if you
specify the string width using str,
as shown in this example.)
Here are the famous program effort data from Mauldin and Berelson. This extract consists of observations on an index of social setting, an index of family planning effort, and the percent decline in the crude birth rate (CBR) between 1965 and 1975, for 20 countries in Latin America.
setting effort change
Bolivia 46 0 1
Brazil 74 0 10
Chile 89 16 29
Colombia 77 16 25
CostaRica 84 21 29
Cuba 89 15 40
DominicanRep 68 14 21
Ecuador 70 6 0
ElSalvador 60 13 13
Guatemala 55 9 4
Haiti 35 3 0
Honduras 51 7 7
Jamaica 87 23 21
Mexico 83 4 9
Nicaragua 68 0 7
Panama 84 19 22
Paraguay 74 3 6
Peru 73 0 2
TrinidadTobago 84 15 29
Venezuela 91 7 11
Source: P.W. Mauldin and B. Berelson (1978). Conditions of
fertility decline in developing countries, 1965-75.
Studies in Family Planning,
To get this dataset click on effort.dat. If you use Stata you may prefer effort.raw, which omits the header line with the variable names.
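For readers working outside S-Plus, R or Stata, the layout shown above can also be parsed in a few lines of Python. The following is a minimal sketch and not part of the course materials: the function names `read_dat` and `_convert` are hypothetical, and the sample text is just the header and first three rows of the listing above. As with read.table, a header with one fewer field than the data rows is taken to mean that the first field of each row is a row name.

```python
def _convert(field):
    """Return the field as an int or float when possible, else as a string."""
    for cast in (int, float):
        try:
            return cast(field)
        except ValueError:
            pass
    return field


def read_dat(text):
    """Parse a whitespace-delimited .dat table into {row_name: {variable: value}}.

    Assumption: when the header has one fewer field than a data row, the
    first field of that row is a row name; otherwise rows are numbered from 1.
    """
    lines = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    table = {}
    for number, fields in enumerate(rows, start=1):
        if len(fields) == len(header) + 1:
            name, values = fields[0], fields[1:]
        elif len(fields) == len(header):
            name, values = str(number), fields
        else:
            raise ValueError(f"row {number} has {len(fields)} fields")
        table[name] = {var: _convert(value) for var, value in zip(header, values)}
    return table


# Header and first three rows of the effort data, copied from the listing above.
SAMPLE = """\
setting effort change
Bolivia 46 0 1
Brazil 74 0 10
Chile 89 16 29
"""

table = read_dat(SAMPLE)
print(table["Chile"])
```

With a local copy of the file, the same function could be applied to its contents, for example `read_dat(open("effort.dat").read())`, assuming the saved file matches the listing.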
These are the salary data used in Weisberg's book, and include observations on six variables for 52 tenure-track professors in a small college.
The file is available in the usual formats as salary.dat and salary.raw. Variables in the 'raw' file are coded as follows:
Source: S. Weisberg (1985). Applied Linear Regression, Second Edition. New York: John Wiley and Sons. Page 194.
These are data based on a 5% sample of all births occurring in Philadelphia in 1990. The sample has 1115 observations (after deleting 32 cases with incomplete information) on five variables:
Source: I. T. Elo, G. Rodríguez and H. Lee (2001). Racial and Neighborhood Disparities in Birthweight in Philadelphia. Paper presented at the Annual Meeting of the Population Association of America, Washington, DC 2001.
Here are the contraceptive use data from page 46 of the lecture notes (and from the Stata and S-Plus handouts), showing the distribution of 1607 currently married and fecund women interviewed in the Fiji Fertility Survey, according to age, education, desire for more children and current use of contraception.
age education wantsMore notUsing using
<25 low yes 53 6
<25 low no 10 4
<25 high yes 212 52
<25 high no 50 10
25-29 low yes 60 14
25-29 low no 19 10
25-29 high yes 155 54
25-29 high no 65 27
30-39 low yes 112 33
30-39 low no 77 80
30-39 high yes 118 46
30-39 high no 68 78
40-49 low yes 35 6
40-49 low no 46 48
40-49 high yes 8 8
40-49 high no 12 31
To get a copy of the data in the format shown above click on cuse.dat.
The dataset is also available in the format used in the Stata handout. This format has 32 rows corresponding to all possible covariate and response patterns, and includes a weight indicating the frequency of each combination. The file has 5 columns with numeric codes:
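The relationship between the two layouts can be sketched in Python. This is an illustration only, not the procedure used to build the Stata file: it expands the 16 grouped rows of the table above into 32 rows, one per combination of covariates and response, each carrying a frequency weight. It keeps the string labels rather than numeric codes, and the field order and the function name `expand` are assumptions made for this example.

```python
# The 16 grouped rows of the contraceptive use table, copied from above:
# age, education, wantsMore, notUsing, using
GROUPED = """\
<25 low yes 53 6
<25 low no 10 4
<25 high yes 212 52
<25 high no 50 10
25-29 low yes 60 14
25-29 low no 19 10
25-29 high yes 155 54
25-29 high no 65 27
30-39 low yes 112 33
30-39 low no 77 80
30-39 high yes 118 46
30-39 high no 68 78
40-49 low yes 35 6
40-49 low no 46 48
40-49 high yes 8 8
40-49 high no 12 31
"""


def expand(text):
    """Turn each grouped row into two weighted rows, one per response value.

    Each output row is (age, education, wantsMore, using, weight), where
    `using` is "no" or "yes" and `weight` is the number of women.
    """
    rows = []
    for line in text.strip().splitlines():
        age, education, wants_more, not_using, using = line.split()
        rows.append((age, education, wants_more, "no", int(not_using)))
        rows.append((age, education, wants_more, "yes", int(using)))
    return rows


weighted = expand(GROUPED)
total = sum(row[-1] for row in weighted)
print(len(weighted), total)
```

The weights add up to the 1607 women in the sample, which is a quick check that no row was dropped in the conversion.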
These are the data from Fiji on children ever born, from page 84 of the lecture notes (and the Stata handout).
The dataset has 70 rows representing grouped individual data. Each row has entries for:
This dataset has information on lung cancer deaths by age and smoking status. The file in "raw" format has four columns:
The file is also available in "dat" format with variable names, row names and string labels for age and smoking status.
These are the data from McCullagh and Nelder (1989, p. 204). The file has 34 rows corresponding to the observed combinations of type of ship, year of construction and period of operation. Each row has information on five variables as follows:
Note that there are no ships of type E built in 1960-64, and that ships built in 1970-74 could not have operated in 1960-74. These combinations are omitted from the data file.
You can get the data in the usual versions: ships.dat has a header and codes the factors using strings, and ships.raw uses numeric codes.
These are the data from Wilner, Walkley and Cook (1955) on the effect of racial attitudes on segregation and integration of public housing:
Sentiment
Proximity Contact Norms fav unfav
close frequent favorable 77 32
close frequent unfavorable 30 36
close infrequent favorable 14 19
close infrequent unfavorable 15 27
distant frequent favorable 43 20
distant frequent unfavorable 36 37
distant infrequent favorable 27 36
distant infrequent unfavorable 41 118
You can get a file in the usual character and numeric formats from housing.dat or housing.raw, respectively. The latter codes the factor levels in order of appearance as follows:
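Coding factor levels in order of appearance can be sketched in Python as follows. This is an illustration, not the procedure actually used to build housing.raw: the helper name `code_levels` is hypothetical, and starting the codes at 1 is an assumption.

```python
def code_levels(values):
    """Map each distinct value to an integer code in order of first appearance.

    Returns the list of codes and the value-to-code mapping.
    Assumption: codes start at 1.
    """
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return [mapping[value] for value in values], mapping


# Factor columns of the housing table above, one entry per row.
proximity = ["close"] * 4 + ["distant"] * 4
contact = (["frequent"] * 2 + ["infrequent"] * 2) * 2
norms = ["favorable", "unfavorable"] * 4

proximity_codes, proximity_map = code_levels(proximity)
contact_codes, contact_map = code_levels(contact)
norms_codes, norms_map = code_levels(norms)

for row in zip(proximity_codes, contact_codes, norms_codes):
    print(*row)
```

Because the mapping depends only on the order in which levels first appear, sorting the rows differently before coding would change the codes.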
These are the Madsen data used in the revised lecture notes. This is a four-way table classifying 1681 residents of twelve areas in Copenhagen in terms of
These are the data from Bishop, Fienberg and Holland (1975) on the three-year survival status of breast-cancer patients by age and malignancy of tumor:
survive?
age malignant yes no
1 under50 no 77 10
2 under50 yes 51 13
3 50-69 no 51 11
4 50-69 yes 38 20
5 70+ no 7 3
6 70+ yes 6 3
You can get a file in the usual character and numeric formats from cancer.dat or cancer.raw.
The method choice data from Brazil are available in a file containing three columns:
You can read the file with character labels (brazil.dat) into Stata using the command
infile str6 age str14 method freq using brazil.dat
This dataset comes from the Guatemalan Survey of Family Health, a survey of rural women that contains detailed data on care received during pregnancy and delivery along with extensive background information. We have tabulated data on 3334 pregnancies. The outcome is the type of provider seen during pregnancy and there are three predictors. The raw data file has five columns, as follows:
The data are available using numeric codes as healthCare.raw and using string codes as well as row and column labels as healthCare.dat.
For more information on this study, see Glei, D. A. and Goldman, N. (2000), Understanding Ethnic Variation in Pregnancy-related Care in Rural Guatemala, Ethnicity and Health, 5:5-22.
The Social Mobility Data are available in a file containing five columns:
The file is available as mobility.dat.
The Time to Ph.D. data are available in a file containing five columns:
The file has 73 rows and is called phd.dat.
The Gehan data show the length of remission in weeks for two groups of leukemia patients, treated and control, and were analyzed by Cox in his original proportional hazards paper. The data are available in a file containing three columns:
The data are available in the usual two formats in gehan.dat and gehan.raw (group codes are 1=control, 2=treated).
These are Somoza's (1980) data on infant and child survival in Colombia, used in the notes.
The file has 48 lines, corresponding to the 48 combinations of sex, cohort and age in Table 3, and six columns:
To analyze these data using piece-wise exponential models you first have to calculate events and exposure by sex, cohort and age. Although this can be done using a statistical package such as Stata or S-Plus, you may find that a spreadsheet package is better suited for the task.
This dataset is based on a longitudinal survey conducted in the U.S. The unit of observation is the couple and the event of interest is divorce, with interview or widowhood treated as censoring events. We have three fixed covariates: education of the husband and two indicators of ethnicity of the couple. The file has data from 3371 couples, with six variables coded as follows:
The file is available in "raw" format as divorce.raw.
This dataset is adapted from an example in Lillard and Panis (2000), aML Multilevel Multiprocess Statistical Software, Release 1.0, EconWare, LA, California.