The Program Effort Data
Here are the famous program effort data from Mauldin and Berelson.
This extract consists of observations on
an index of social setting,
an index of family planning effort, and
the percent decline in the crude birth rate (CBR) between 1965 and 1975,
for 20 countries in Latin America.
setting effort change
Bolivia 46 0 1
Brazil 74 0 10
Chile 89 16 29
Colombia 77 16 25
CostaRica 84 21 29
Cuba 89 15 40
DominicanRep 68 14 21
Ecuador 70 6 0
ElSalvador 60 13 13
Guatemala 55 9 4
Haiti 35 3 0
Honduras 51 7 7
Jamaica 87 23 21
Mexico 83 4 9
Nicaragua 68 0 7
Panama 84 19 22
Paraguay 74 3 6
Peru 73 0 2
TrinidadTobago 84 15 29
Venezuela 91 7 11
The data are available as plain text files
effort.dat,
which has a header line with the variable names, and
effort.raw, which omits it;
otherwise both files look like the listing above.
The data are also available in Stata format as effort.dta.
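For readers not using Stata, here is a minimal Python sketch of how the "dat" layout can be read. It assumes the file looks exactly like the listing above (a header with three variable names, then rows that start with a country name). The text is embedded rather than downloaded, and the helper name read_dat is only illustrative.

```python
# Sketch: parse a ".dat"-style listing whose header has one fewer field
# than the data rows (the first field of each data row is a row name).
LISTING = """\
setting effort change
Bolivia 46 0 1
Brazil 74 0 10
Chile 89 16 29
Colombia 77 16 25
CostaRica 84 21 29
Cuba 89 15 40
DominicanRep 68 14 21
Ecuador 70 6 0
ElSalvador 60 13 13
Guatemala 55 9 4
Haiti 35 3 0
Honduras 51 7 7
Jamaica 87 23 21
Mexico 83 4 9
Nicaragua 68 0 7
Panama 84 19 22
Paraguay 74 3 6
Peru 73 0 2
TrinidadTobago 84 15 29
Venezuela 91 7 11
"""

def read_dat(text):
    """Return {row_name: {variable: int}} from a whitespace-delimited listing."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return {row[0]: dict(zip(header, map(int, row[1:]))) for row in rows}

effort = read_dat(LISTING)
# Average percent decline in the CBR across the countries listed
mean_change = sum(c["change"] for c in effort.values()) / len(effort)
print(len(effort), mean_change)
```

The same function should work for any of the "dat" files on this page whose values are all integers; files with string labels would need a different conversion step.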
Reference: P.W. Mauldin and B. Berelson (1978). Conditions of
fertility decline in developing countries, 1965-75.
Studies in Family Planning, 9:89-147.
JSTOR: http://www.jstor.org/stable/1965523.
Discrimination in Salaries
These are the salary data used in Weisberg's book, consisting
of observations on six variables for 52 tenure-track
professors in a small college. The variables are:
- sx = Sex, coded 1 for female and 0 for male
- rk = Rank, coded
- 1 for assistant professor,
- 2 for associate professor, and
- 3 for full professor
- yr = Number of years in current rank
- dg = Highest degree, coded 1 if doctorate, 0 if masters
- yd = Number of years since highest degree was earned
- sl = Academic year salary, in dollars.
The file is available in the usual plain text formats as
salary.dat using character codes
and
salary.raw using numeric codes,
and in Stata format
as salary.dta.
Here's an excerpt of the "dat" file:
sx rk yr dg yd sl
male full 25 doctorate 35 36350
male full 13 doctorate 22 35350
male full 10 doctorate 23 28200
female full 7 doctorate 27 26775
male full 19 masters 30 33696
male full 16 doctorate 21 28516
...
female assistant 1 doctorate 1 16686
female assistant 1 doctorate 1 15000
female assistant 0 doctorate 2 20300
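The relation between the character codes and the numeric codes listed above can be sketched in Python as follows. This is only an illustration: it assumes the columns of the "raw" file are in the same order as the header of the "dat" file, and the function name to_numeric is made up for this example.

```python
# Sketch: convert the character codes shown in the "dat" excerpt to the
# numeric codes documented in the variable list.
SEX = {"male": 0, "female": 1}
RANK = {"assistant": 1, "associate": 2, "full": 3}
DEGREE = {"masters": 0, "doctorate": 1}

def to_numeric(line):
    """Turn one 'dat' line into a tuple (sx, rk, yr, dg, yd, sl) of integers."""
    sx, rk, yr, dg, yd, sl = line.split()
    return (SEX[sx], RANK[rk], int(yr), DEGREE[dg], int(yd), int(sl))

# A few lines copied from the excerpt above
excerpt = [
    "male full 25 doctorate 35 36350",
    "male full 19 masters 30 33696",
    "female assistant 0 doctorate 2 20300",
]
for line in excerpt:
    print(to_numeric(line))
```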
Reference: S. Weisberg (1985). Applied Linear Regression,
Second Edition. New York: John Wiley and Sons. Page 194.
Births in Philadelphia
These are data based on a 5% sample of all births occurring in
Philadelphia in 1990. The sample has 1115 observations (after deleting
32 cases with incomplete information) on five variables:
- black = Mother is black (1=yes, 0=no),
- educ = Mother's years of education (0 to 17),
- smoke = Whether mother smoked during pregnancy (1=yes, 0=no),
- gestate = Gestational age in weeks, and
- grams = Birth weight in grams.
The data are available in plain text format in the files
phbirths.raw and
phbirths.dat, and in Stata format as
phbirths.dta.
The 'dat' file codes black and smoke using TRUE or FALSE,
whereas the 'raw' file uses 1 and 0.
Reference: I. T. Elo, G. Rodríguez and H. Lee (2001).
Racial and Neighborhood Disparities in Birthweight in Philadelphia.
Paper presented at the Annual Meeting of the Population Association of
America, Washington, DC 2001.
The Contraceptive Use Data (W)
Here are the contraceptive use data from page 46 of the
lecture notes (and from the Stata handout), showing the
distribution of 1607 currently married and fecund women
interviewed in the Fiji Fertility Survey, according
to age, education, desire for more children and
current use of contraception.
age education wantsMore notUsing using
<25 low yes 53 6
<25 low no 10 4
<25 high yes 212 52
<25 high no 50 10
25-29 low yes 60 14
25-29 low no 19 10
25-29 high yes 155 54
25-29 high no 65 27
30-39 low yes 112 33
30-39 low no 77 80
30-39 high yes 118 46
30-39 high no 68 78
40-49 low yes 35 6
40-49 low no 46 48
40-49 high yes 8 8
40-49 high no 12 31
The data are available in the format shown above as
cuse.dat, and also as a
Stata system file cusew.dta
using numeric codes and labels for all variables.
These files represent binomial data with 16 groups.
The dataset is also available in a
long format
simulating individual data and using weights to represent
the frequencies.
Reference: Little, R. J. A. (1978).
Generalized Linear Models for Cross-Classified Data from the WFS.
World Fertility Survey Technical Bulletins, Number 5.
The Contraceptive Use Data (L)
This is the alternative version of the contraceptive use data,
showing the distribution of 1607 currently married and fecund women
interviewed in the Fiji Fertility Survey, according
to age, education, desire for more children and
current use of contraception.
This version has 32 rows corresponding to
all possible covariate and response patterns,
and includes a weight indicating the frequency of each combination.
The file has 5 columns with numeric codes:
- age (four groups, 1=<25, 2=25-29, 3=30-39 and 4=40-49),
- education (0=none, 1=some),
- desire for more children (0=more, 1=no more),
- contraceptive use (0=no, 1=yes), and
- frequency (number of cases in this category).
The data in this alternative format are available in
plain text as cuse.raw
and in Stata format as cuse.dta.
An excerpt of the "raw" file is shown below:
1 0 0 0 53
1 0 0 1 6
1 0 1 0 10
1 0 1 1 4
1 1 0 0 212
1 1 0 1 52
...
4 1 0 1 8
4 1 1 0 12
4 1 1 1 31
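The long format can be derived from the 16-group table of the previous section. Here is a minimal Python sketch of that reshaping, using the numeric codes listed above. The table is embedded as text, and the function name wide_to_long is only illustrative.

```python
# Sketch: reshape the 16-row table (notUsing, using) into 32 weighted rows
# of the form (age, education, noMore, using, frequency).
WIDE = """\
<25 low yes 53 6
<25 low no 10 4
<25 high yes 212 52
<25 high no 50 10
25-29 low yes 60 14
25-29 low no 19 10
25-29 high yes 155 54
25-29 high no 65 27
30-39 low yes 112 33
30-39 low no 77 80
30-39 high yes 118 46
30-39 high no 68 78
40-49 low yes 35 6
40-49 low no 46 48
40-49 high yes 8 8
40-49 high no 12 31
"""
AGE = {"<25": 1, "25-29": 2, "30-39": 3, "40-49": 4}
EDUC = {"low": 0, "high": 1}
NO_MORE = {"yes": 0, "no": 1}  # wantsMore=yes is coded 0, wantsMore=no is coded 1

def wide_to_long(text):
    rows = []
    for line in text.strip().splitlines():
        age, educ, wants, not_using, using = line.split()
        key = (AGE[age], EDUC[educ], NO_MORE[wants])
        rows.append(key + (0, int(not_using)))  # not using contraception
        rows.append(key + (1, int(using)))      # using contraception
    return rows

long_rows = wide_to_long(WIDE)
print(len(long_rows), sum(row[-1] for row in long_rows))
```

The frequencies add up to the 1607 women in the sample, which is a quick check that nothing was lost in the reshaping.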
Reference: Little, R. J. A. (1978).
Generalized Linear Models for Cross-Classified Data from the WFS.
World Fertility Survey Technical Bulletins, Number 5.
The Children Ever Born Data
These are the data from Fiji on children ever born, from
page 84 of the lecture notes (and the Stata handout).
The dataset has 70 rows representing grouped individual data.
Each row has entries for:
- The cell number (1 to 71, cell 68 has no observations),
- marriage duration (1=0-4, 2=5-9, 3=10-14, 4=15-19,
5=20-24, 6=25-29),
- residence (1=Suva, 2=Urban, 3=Rural),
- education (1=none, 2=lower primary, 3=upper primary, 4=secondary+),
- mean number of children ever born (e.g. 0.50),
- variance of children ever born (e.g. 1.14), and
- number of women in the cell (e.g. 8).
This file is available in the usual two formats:
ceb.dat
has a header and uses character labels for the factors, and
ceb.raw
uses numeric codes, as described above. Here's an
excerpt of the dat file:
dur res educ mean var n y
1 0-4 Suva none 0.50 1.14 8 4.00
2 0-4 Suva lower 1.14 0.73 21 23.94
3 0-4 Suva upper 0.90 0.67 42 37.80
4 0-4 Suva sec+ 0.73 0.48 51 37.23
5 0-4 urban none 1.17 1.06 12 14.04
6 0-4 urban lower 0.85 1.59 27 22.95
...
69 25-29 rural none 7.48 11.34 195 1458.60
70 25-29 rural lower 7.81 7.57 59 460.79
71 25-29 rural upper 5.80 7.07 10 58.00
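The last column of the excerpt, y, is not described in the list above. In the lines shown it matches the mean times the number of women, so it appears to be the total number of children ever born in the cell. The short Python check below illustrates this for the excerpt only; I have not verified it for the rest of the file.

```python
# Sketch: check that y equals mean * n (to two decimals) in the excerpt.
rows = [
    # (cell, mean, n, y) copied from the excerpt above
    (1, 0.50, 8, 4.00),
    (2, 1.14, 21, 23.94),
    (3, 0.90, 42, 37.80),
    (4, 0.73, 51, 37.23),
    (5, 1.17, 12, 14.04),
    (6, 0.85, 27, 22.95),
    (69, 7.48, 195, 1458.60),
    (70, 7.81, 59, 460.79),
    (71, 5.80, 10, 58.00),
]

def total_children(mean, n):
    """Total children ever born in a cell, rounded to two decimals."""
    return round(mean * n, 2)

# Cells where the computed total differs from the listed y
mismatches = [cell for cell, mean, n, y in rows
              if abs(total_children(mean, n) - y) > 0.005]
print(mismatches)
```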
Reference: Little, R. J. A. (1978).
Generalized Linear Models for Cross-Classified Data from the WFS.
World Fertility Survey Technical Bulletins, Number 5.
Smoking and Lung Cancer
This dataset has information on lung cancer deaths by age and
smoking status.
The file in "raw" format, smoking.raw,
has four columns:
- age: in five-year age groups coded 1 to 9 for
40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80+.
- smoking status: coded
1 = doesn't smoke,
2 = smokes cigars or pipe only,
3 = smokes cigarettes and cigar or pipe, and
4 = smokes cigarettes only,
- population: in hundreds of thousands, and
- deaths: number of lung cancer deaths in a year.
The file is also available in "dat" format as
smoking.dat,
with variable names, row names and string labels for age and
smoking status. An excerpt appears below:
age smoke pop dead
1 40-44 no 656 18
2 45-59 no 359 22
3 50-54 no 249 19
4 55-59 no 632 55
5 60-64 no 1067 117
6 65-69 no 897 170
....
32 60-64 cigarretteOnly 3791 778
33 65-69 cigarretteOnly 2421 689
34 70-74 cigarretteOnly 1195 432
35 75-79 cigarretteOnly 436 214
36 80+ cigarretteOnly 113 63
The origin of this dataset has been lost in the mists of time.
If you can provide a reference please
contact me.
The Ship Damage Data
These are the data from McCullagh and Nelder.
The file has 34 rows corresponding to the observed combinations of
type of ship, year of construction and period of operation.
Each row has information on five variables as follows:
- ship type, coded 1-5 for A, B, C, D and E,
- year of construction (1=1960-64, 2=1965-69, 3=1970-74, 4=1975-79),
- period of operation (1=1960-74, 2=1975-79),
- months of service, ranging from 63 to 20,370, and
- damage incidents, ranging from 0 to 53.
Note that there are no ships of type E built in 1960-64, and that
ships built in 1975-79 could not have operated in 1960-74.
These combinations are omitted from the data file.
You can get the data in the usual versions:
ships.dat
has a header and codes the factors using strings, and
ships.raw
uses the numeric codes shown above.
Here's an excerpt of the dat file:
type construction operation months damage
1 A 1960-64 1960-74 127 0
2 A 1960-64 1975-79 63 0
3 A 1965-69 1960-74 1095 3
4 A 1965-69 1975-79 1095 4
5 A 1970-74 1960-74 1512 6
6 A 1970-74 1975-79 3353 18
...
32 E 1970-74 1960-74 1157 5
33 E 1970-74 1975-79 2161 12
34 E 1975-79 1975-79 542 1
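Because the rows differ widely in months of service, damage counts are usually compared as rates. Here is a small Python sketch that computes incidents per 1000 months of service for a few lines of the excerpt. The scaling by 1000 and the function name are my own choices for illustration, not part of the data file.

```python
# Sketch: damage incidents per 1000 months of service.
rows = [
    # (type, construction, operation, months, damage) from the excerpt above
    ("A", "1960-64", "1960-74", 127, 0),
    ("A", "1965-69", "1960-74", 1095, 3),
    ("A", "1970-74", "1975-79", 3353, 18),
    ("E", "1975-79", "1975-79", 542, 1),
]

def rate_per_1000(damage, months):
    """Incidents per 1000 months of service (months must be positive)."""
    return 1000 * damage / months

for ship_type, built, period, months, damage in rows:
    print(ship_type, built, period, round(rate_per_1000(damage, months), 2))
```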
Reference: McCullagh, P. and Nelder, J. (1989)
Generalized Linear Models, 2nd Edition.
Chapman and Hall, London. Page 204.
The Housing Data
These are the data from Wilner, Walkley and Cook on the
effect of racial attitudes on segregation and integration of public
housing. The data can be viewed as a 2x2x2x2 contingency table:
Sentiment
Proximity Contact Norms fav unfav
close frequent favorable 77 32
unfavorable 30 36
infrequent favorable 14 19
unfavorable 15 27
distant frequent favorable 43 20
unfavorable 36 37
infrequent favorable 27 36
unfavorable 41 118
You can get a file in the usual character and numeric formats from
housing.dat
or
housing.raw,
respectively, and in Stata format from
housing.dta.
The "raw" data file codes the factor levels in order of
appearance as follows:
- Proximity: 1 = close, 2=distant
- Contact: 1 = frequent, 2=infrequent
- Norms: 1=favorable, 2=unfavorable
For regression analysis it would have been better to code these variables
using 1 and 0 instead of 1 and 2, and rename them to something like proximClose,
contactFreq, and normsFav. I haven't done this because it might break existing
code, but the new variables can easily be added.
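Here is a minimal Python sketch of that recoding. The rows are typed in from the table above as (proximity, contact, norms, favorable, unfavorable). That layout is an assumption made for the example and may not match the column layout of housing.raw.

```python
# Sketch: recode the 1/2 factor codes as 1/0 indicators.
def to_indicator(code):
    """Map the file's 1/2 coding to 1/0: 1 stays 1, 2 becomes 0."""
    if code not in (1, 2):
        raise ValueError(f"unexpected code: {code}")
    return 1 if code == 1 else 0

# (proximity, contact, norms, favorable, unfavorable): hypothetical layout
TABLE = [
    (1, 1, 1, 77, 32), (1, 1, 2, 30, 36),
    (1, 2, 1, 14, 19), (1, 2, 2, 15, 27),
    (2, 1, 1, 43, 20), (2, 1, 2, 36, 37),
    (2, 2, 1, 27, 36), (2, 2, 2, 41, 118),
]

recoded = [
    {"proximClose": to_indicator(p), "contactFreq": to_indicator(c),
     "normsFav": to_indicator(m), "fav": fav, "unfav": unfav}
    for p, c, m, fav, unfav in TABLE
]
print(recoded[0])
```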
Reference: Wilner, D., Walkley, R.R. and Cook, S.W. (1955).
Human relations in interracial housing: A study of the contact hypothesis.
University of Minnesota Press.
Housing Conditions in Copenhagen
These are the Madsen data used in the revised lecture notes.
This is a four-way table classifying 1681 residents of twelve areas
in Copenhagen in terms of:
- the type of housing they had (1=tower
blocks, 2=apartments, 3=atrium houses and 4=terraced houses),
- their feeling of influence on apartment management (1=low, 2=medium,3=high),
- their degree of contact with neighbors (1=low, 2=high), and
- their satisfaction with housing conditions (1=low, 2=medium, 3=high).
The data file contains 72 rows, one for each combination of values of the four
variables, and has six columns, a row number, the four variables, and the
number of cases in the category.
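The 72 rows are simply the 4 x 3 x 2 x 3 combinations of the four factors. A short Python sketch of that enumeration follows. It assumes housing type varies slowest and satisfaction fastest, and the labels "apartments" and "atrium" are my guesses at the strings used in the character-format file.

```python
# Sketch: enumerate the 72 cells of the four-way table in row order.
from itertools import product

HOUSING = ["tower", "apartments", "atrium", "terraced"]  # middle labels assumed
INFLUENCE = ["low", "medium", "high"]
CONTACT = ["low", "high"]
SATISFACTION = ["low", "medium", "high"]

# product() varies the last factor fastest, so satisfaction cycles first
cells = list(product(HOUSING, INFLUENCE, CONTACT, SATISFACTION))

for row_number, cell in enumerate(cells[:6], start=1):
    print(row_number, *cell)
```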
The file is available in the usual character and numeric formats:
copen.dat
or
copen.raw,
respectively,
and in Stata format as copen.dta.
Here's an excerpt of the "dat" file:
housing influence contact satisfaction n
1 tower low low low 21
2 tower low low medium 21
3 tower low low high 28
4 tower low high low 14
5 tower low high medium 19
6 tower low high high 37
...
70 terraced high high low 5
71 terraced high high medium 6
72 terraced high high high 13
Reference: Madsen, M. (1976). Statistical Analysis of
Multiple Contingency Tables. Two Examples.
Scand. J. Statist. 3:97-106.
JSTOR:
http://www.jstor.org/stable/4615621
The Cancer Data
These are the data from Bishop, Fienberg and Holland on
the three-year survival status of breast-cancer patients by age
and malignancy of tumor:
survive?
age malignant yes no
1 under50 no 77 10
2 under50 yes 51 13
3 50-69 no 51 11
4 50-69 yes 38 20
5 70+ no 7 3
6 70+ yes 6 3
You can get a file in the usual character and numeric formats from
cancer.dat
or
cancer.raw,
and in Stata format from
cancer.dta.
Reference: Bishop, Y. M. M. ; Fienberg, S. E. and Holland, P. W. (1975)
Discrete Multivariate Analysis: Theory and Practice.
MIT Press, Cambridge.
The Method Choice Data
The method choice data from Brazil
are available in a file containing three columns:
- Age group: 15-19, 20-24, 25-29, 30-34, 35-39 or 40-44
- Method: sterilization, efficient, inefficient, or not_using
- Frequency: the number of women in each age/method combination.
As usual, the file is available in two formats:
brazil.dat
codes the factors using character labels, and
brazil.raw
uses numeric codes (the age groups are coded 1-6 and the methods
are coded 1=not_using, 2=inefficient, 3=efficient, 4=sterilization).
Here's an excerpt of the dat file:
15-19 sterilization 2
15-19 efficient 75
15-19 inefficient 6
15-19 not_using 90
20-24 sterilization 32
20-24 efficient 223
...
40-44 efficient 71
40-44 inefficient 69
40-44 not_using 17
You can read the file with character labels (brazil.dat)
into Stata using the command
infile str6 age str14 method freq ///
using brazil.dat
but of course we now provide a Stata file as
brazil.dta.
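For readers working outside Stata, here is a rough Python equivalent that parses lines in the brazil.dat layout and maps the labels to the numeric codes described above. The lines are embedded rather than read from the file, and the function names are only illustrative.

```python
# Sketch: parse "age method freq" lines and map labels to numeric codes.
AGE_CODES = {label: code for code, label in enumerate(
    ["15-19", "20-24", "25-29", "30-34", "35-39", "40-44"], start=1)}
METHOD_CODES = {"not_using": 1, "inefficient": 2,
                "efficient": 3, "sterilization": 4}

def parse_line(line):
    """Split one line into (age label, method label, integer frequency)."""
    age, method, freq = line.split()
    return age, method, int(freq)

def to_codes(record):
    """Replace the two labels with the numeric codes."""
    age, method, freq = record
    return AGE_CODES[age], METHOD_CODES[method], freq

# Lines copied from the excerpt above
lines = ["15-19 sterilization 2", "15-19 efficient 75", "40-44 not_using 17"]
for line in lines:
    print(to_codes(parse_line(line)))
```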
Health Care Utilization in Guatemala
This dataset comes from the Guatemalan Survey of Family Health, a survey of rural
women that contains detailed data on care received during pregnancy and delivery
along with extensive background information.
We have tabulated data on 3334
pregnancies. The outcome is the type of provider seen during pregnancy and
there are three predictors. The raw data file has five columns, as follows:
- eth = Ethnicity/Language, coded 1=Indigenous, non-Spanish speaker,
2=Indigenous, Spanish speaker, and 3=Ladino.
- migr = Migration, whether the community has frequent migration abroad,
coded 1=yes, 0=no.
- avail = Availability of modern health services within one hour of the community,
coded 1=yes, 0=no.
- type = Provider type, coded 1=none, 2=midwife, 3=health post and 4=doctor.
For simplicity, women seeing multiple provider types during their
pregnancy were coded using the most modern type; for example women
seeing both a midwife and a doctor were coded under doctor.
- n = Count of the number of women in each category defined by
the previous four columns.
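The rule for women who saw more than one type of provider can be sketched in Python as follows. The sketch assumes that "most modern" corresponds to the highest numeric code, which agrees with the midwife and doctor example above. The function name and the string labels are chosen for illustration.

```python
# Sketch: code a pregnancy by the most modern provider seen.
PROVIDER = {"none": 1, "midwife": 2, "healthPost": 3, "doctor": 4}
LABEL = {code: name for name, code in PROVIDER.items()}

def most_modern(providers_seen):
    """Return the provider label with the highest code, or 'none' if empty."""
    if not providers_seen:
        return "none"
    return LABEL[max(PROVIDER[p] for p in providers_seen)]

print(most_modern(["midwife", "doctor"]))
```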
The data are available using numeric codes as healthCare.raw
and using string codes as well as row and column labels as healthCare.dat. Here are a few lines from the latter:
eth migr avail provider n
1 indNoSpa no no none 7
2 indNoSpa no no midwife 93
3 indNoSpa no no healthPost 6
...
34 ladino yes yes doctor 83
Reference: Glei, D. A. and Goldman, N. (2000),
Understanding Ethnic Variation in Pregnancy-related Care in Rural Guatemala,
Ethnicity and Health, 5:5-22.
The Social Mobility Data
The Social Mobility Data are available in a file containing five columns:
- father's occupation: 1=farm, 2=unskilled, 3=skilled, 4=professional.
- son's occupation: same categories as the father.
- race: coded 1 for blacks, 0 for others.
- disruption: coded 1 for non-intact family background, 0 otherwise.
- number of cases
The file is available as mobility.dat,
and also in Stata format. Here's an excerpt of the dat file:
fatherOccup sonOccup black nonintact n
1 farm farm no no 592
2 farm farm no yes 55
3 farm farm yes no 41
4 farm farm yes yes 15
5 farm unskilled no no 1005
6 farm unskilled no yes 134
...
61 professional professional no yes 317
62 professional professional yes no 52
63 professional professional yes yes 19
This is a simplified version of a dataset from StatLib which may be found at
http://lib.stat.cmu.edu/datasets/socmob.
I rounded the counts for son's current occupation to the nearest integer, and
grouped both father's and son's occupation into just four categories, treating
1-2 as farm, 3-6 as unskilled, 7-11 as skilled and 12-17 as professional/managerial.
If you use the data in a publication please acknowledge Statlib and the original
authors, David L. Featherman and Robert M. Hauser (1978).
Opportunity and Change. New York: Academic Press.
The data were also analyzed by Timothy J. Biblarz and Adrian E. Raftery (1993).
"The Effects of Family Disruption on Social Mobility", American Sociological Review,
58(1):97-109.
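The grouping of the 17 StatLib occupation codes into four categories, as described above, can be written as a small Python function. The function name is made up for this example.

```python
# Sketch: collapse the 17 original occupation codes into four groups.
def group_occupation(code):
    """Map an original code (1-17) to one of the four broad categories."""
    if 1 <= code <= 2:
        return "farm"
    if 3 <= code <= 6:
        return "unskilled"
    if 7 <= code <= 11:
        return "skilled"
    if 12 <= code <= 17:
        return "professional"
    raise ValueError(f"occupation code out of range: {code}")

print([group_occupation(c) for c in (1, 3, 7, 12, 17)])
```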
Time to Ph.D.
The Time to Ph.D. data are available in a file containing five columns:
- year: coded 1 to 14, representing years of graduate school.
- university: coded 1 for Berkeley, 2 for Columbia, 3 for Princeton.
- residence: coded 1 for permanent residents, 2 for temporary residents.
- events: number of students graduating in this category.
- exposure: number of person-years of exposure to graduation in this category.
The file has 73 rows and is called phd.dat.
A brief excerpt is shown below:
1 1 1 31 7422
2 1 1 177 7166
3 1 1 393 6759
4 1 1 484 6138
5 1 1 500 5506
6 1 1 399 4824
...
6 3 2 8 85
7 3 2 2 72
12 3 2 2 37
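With events and exposure in hand, a crude graduation rate for each row is just the ratio of the two columns. Here is a minimal Python sketch using a few lines from the excerpt. It is a descriptive calculation only, not the model fitted in the reference.

```python
# Sketch: crude graduation rates per person-year, events / exposure.
rows = [
    # (year, university, residence, events, exposure) from the excerpt above
    (1, 1, 1, 31, 7422),
    (2, 1, 1, 177, 7166),
    (3, 1, 1, 393, 6759),
    (6, 3, 2, 8, 85),
]

def rate(events, exposure):
    """Graduations per person-year of exposure."""
    return events / exposure

for year, university, residence, events, exposure in rows:
    print(year, university, residence, round(rate(events, exposure), 4))
```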
Reference: Espenshade, T.J. and Rodríguez, G. (1997).
Completing the Ph.D.: Comparative Performances of U.S. and Foreign Students.
Social Science Quarterly, 78:593-605.
The Gehan-Freireich Survival Data
The data show the length of remission in weeks for two
groups of leukemia patients, treated and control, and were
analyzed by Cox in his original proportional hazards paper.
The data are available in a file containing three columns:
- Treatment: coded Treated (drug) or Control (placebo),
- Time: weeks of remission,
- Failure: coded 1 if a failure (relapse), 0 if censored
Thus, the third and fourth observations, 6 and 6+,
corresponding to a relapse and a censored observation at six weeks,
are coded 6, 1 and 6, 0, respectively.
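The conversion from the traditional "6+" notation to a (time, failure) pair can be sketched in a few lines of Python. The function name is only illustrative.

```python
# Sketch: convert remission times written as "6" or "6+" to (time, failure).
def parse_remission(value):
    """A trailing '+' marks a censored time; failure is 1 for relapse, 0 if censored."""
    value = value.strip()
    if value.endswith("+"):
        return int(value[:-1]), 0
    return int(value), 1

print(parse_remission("6"), parse_remission("6+"))
```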
The data are available in the usual two plain-text formats in
gehan.dat
and
gehan.raw
(group codes are 1=control, 2=treated),
and as a Stata file in
gehan.dta.
Here's an excerpt of the dat file:
treatment time failure
1 treated 6 TRUE
2 treated 6 TRUE
3 treated 6 TRUE
4 treated 6 FALSE
5 treated 7 TRUE
6 treated 9 FALSE
...
40 control 17 TRUE
41 control 22 TRUE
42 control 23 TRUE
These data actually come from a matched-pairs design,
where patients were paired according to remission status
(partial or complete) and then randomly assigned to the treated
or control group, but most analyses have ignored this fact.
See Andersen et al (1993), pages 22-23, which has references
to several papers using this dataset.
Reference: Andersen, P. K.; Borgan, O.; Gill, R. D. and
Keiding, N. (1993). Statistical Models Based on Counting Processes,
Springer-Verlag, New York.
The Somoza Dataset
These are Somoza's data on infant and child survival in
Colombia, used in the notes (Table 3). The dataset comes from the World
Fertility Survey, which was fielded in Colombia in 1976.
Women in the reproductive ages were asked about their children
and these were tabulated by sex, year of birth (cohort),
survival status and age at death or at interview.
The file has 48 lines, corresponding to the 48 combinations of
sex, cohort and age, and five columns:
- sex: 1=Male or 2=Female,
- cohort: 1=1941-59, 2=1960-67 or 3=1968-76,
- age: 0-1/12, 1/12-3/12, 3/12-6/12, 1/2-1, 1-2, 2-5, 5-10 or 10+,
coded 1 to 8 in this order
- dead: number dead in this category
- alive: number alive at interview
The data are available in plain text format as
somoza.dat,
which uses character labels for sex, cohort and age,
and somoza.raw,
which uses numeric codes for all variables,
and in Stata format as somoza.dta.
A brief excerpt of the dat file is shown below.
sex cohort age dead alive
Male 1941-59 0-1/12 99 0
Male 1941-59 1/12-3/12 35 0
...
Female 1968-76 10+ 0 0
In order to analyze these data using piece-wise exponential models
you first have to calculate events and exposure by sex, cohort and age.
The details of this calculation are shown in our Stata logs.
The final step of that process, a file with events and exposure by
cohort and age (collapsing over sex) is available in Stata format as
somoza2.dta.
Reference: Somoza, J. (1980).
Illustrative Analysis: Infant and Child Mortality in Colombia.
World Fertility Survey Scientific Reports, Number 10.
Marriage Dissolution in the U.S.
This dataset, adapted from an example in the software package aML,
is based on a longitudinal survey conducted in the U.S.
The unit of observation is the couple and the event of interest is
divorce, with interview and widowhood treated as censoring events.
We have three fixed covariates: education of the husband and
two indicators of the couple's ethnicity: whether the husband is
black and whether the couple is mixed. The variables are:
- id: a couple number.
- heduc: education of the husband, coded
- 0 = less than 12 years,
- 1 = 12 to 15 years, and
- 2 = 16 or more years.
- heblack: coded 1 if the husband is black and 0 otherwise
- mixed: coded 1 if the husband and wife have different
ethnicity (defined as black or other), 0 otherwise.
- years: duration of marriage, from the date of wedding to
divorce or censoring (due to widowhood or interview).
- div: the failure indicator, coded 1 for divorce and 0 for censoring.
The dataset has 3771 couples and is available in "raw" format
as divorce.raw and in "dat" format
as divorce.dat, see excerpt below.
The file is also available in Stata format as
divorce.dta.
id heduc heblack mixed years div
9 12-15 years No No 10.546 No
11 < 12 years No No 34.943 No
13 < 12 years No No 2.834 Yes
15 < 12 years No No 17.532 Yes
33 12-15 years No No 1.418 No
36 < 12 years No No 48.033 No
...
17294 12-15 years Yes No 7.269 No
17302 12-15 years No Yes 18.73 No
Reference: Lillard and Panis (2000),
aML Multilevel Multiprocess Statistical Software, Release 1.0,
EconWare, LA, California.