Survival Analysis Project:
Marriage Dissolution in the U.S.
Our class project will analyze data on marriage dissolution in the U.S. based on a longitudinal survey. We will conduct the analysis in three parts, starting with a basic proportional hazards model, adding time-varying covariates, and then considering multiple-spell data.
The dataset comes from Lillard and Panis (2003), aML Multilevel Multiprocess Statistical Software, Version 2.0. EconWare, Los Angeles, California. I have created a Stata system file available as http://data.princeton.edu/pop509/divorce3.dta which includes the following variables:
| Name | Description |
|---|---|
| id | Unique respondent's id |
| spans | See explanation below |
| weight | Sampling weight (will ignore) |
| marnum | Marriage number (1..6) |
| censor | Censoring indicator (1=censored,0=divorced) |
| lower | Lower bound for time of event |
| upper | Upper bound for time of event |
| hiseduc | Husband's education (in years of schooling) |
| hereduc | Wife's education (in years of schooling) |
| heblack | Indicator for whether the husband is African American |
| sheblack | Indicator for whether the wife is African American |
| age | Age of husband (at marriage) |
| agediff | Age difference between husband and wife |
| The following two variables are repeated for n = 1, 2, ..., 17: | |
| timen | Time at which this span ends |
| numkidn | Number of children during this span |
An unusual feature of this dataset is that it has lower and upper bounds rather than a single time variable. Censored cases are marriages that ended in widowhood or were intact as of the date of last interview; the time is known exactly for these cases, so the lower and upper bounds are equal. For marriages that ended in divorce, the dataset records a range of possible dates such that the divorce is known to have occurred at some time between the lower and upper bounds. The aML software can handle this uncertainty but Stata cannot. For simplicity we will assume that divorces occurred at the mid-point of the interval.
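The mid-point rule and the recoding of the censoring indicator can be sketched as follows. This is a minimal Python illustration, not the Stata code you will actually use; the function name `event_time` and the divorce bounds of 4 and 6 years are made up for the example.

```python
def event_time(lower, upper, censor):
    """Return (duration, event) for one marriage.

    censor follows the dataset coding: 1 = censored, 0 = divorced.
    For censored cases lower == upper, so the mid-point is the exact time.
    """
    event = 1 - censor                 # 1 = divorced, 0 = censored
    duration = (lower + upper) / 2.0   # mid-point assumption for divorces
    return duration, event


# A censored marriage observed for 10.546 years (bounds are equal)
print(event_time(10.546, 10.546, 1))
# A hypothetical divorce known to fall between 4 and 6 years
print(event_time(4.0, 6.0, 0))
```

Note that survival software usually expects a failure indicator, which is the reverse of `censor` as coded here.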
Note also that we have multiple marriages per respondent. The dataset has
3,371 respondents with a total of 4,238 marriages. The respondent with the
most marriages had six. The variable marnum is the marriage number
for the survey respondent (which may be the husband or the wife).
For the analyses in parts 1 and 2 we will focus on first marriages only,
but in part 3 we will consider all marriages together in a multiple spell model.
Another interesting feature of the dataset is that we have data on fertility, recording the dates of birth of each child within each marriage. This will allow us to explore the extent to which the divorce rate varies with number of children. The data are available in wide format with a provision for up to 17 children (the maximum in the sample). For example, the respondent with id 9 has been married only once, had a child at duration 3.734 years and was interviewed at duration 10.546. This respondent has two fertility spans: (from 0) up to 3.734 with 0 children and (from 3.734) up to 10.546 with one child. Below I give some hints on how to process this information.
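To make the wide layout concrete, here is a small Python sketch that turns one wide record into a list of spans. It assumes that `spans` holds the number of spans actually used by the respondent and that the variables are named `time1`, `numkid1`, `time2`, and so on; both are my reading of the variable list, so check them against the data.

```python
def fertility_spans(record):
    """Turn one wide record into a list of (start, stop, numkids) spans.

    Assumes record["spans"] is the number of spans in use and that
    span n is described by record["time<n>"] and record["numkid<n>"].
    """
    spans = []
    start = 0.0
    for n in range(1, record["spans"] + 1):
        stop = record[f"time{n}"]
        spans.append((start, stop, record[f"numkid{n}"]))
        start = stop  # the next span begins where this one ends
    return spans


# Respondent 9 from the text, laid out in the assumed wide format
resp9 = {"id": 9, "spans": 2,
         "time1": 3.734, "numkid1": 0,
         "time2": 10.546, "numkid2": 1}
print(fertility_spans(resp9))
```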
[1] A Proportional Hazards Model
(a) Explore the determinants of divorce using a proportional hazards model
for first marriages only. I recommend that you use the same specification
as Lillard and Panis, but feel free to explore alternatives.
They model ethnicity effects using two dummy variables:
heblack and mixedrace, which is defined as
heblack != sheblack.
They consider only husband's education using two dummy variables,
dropout for hiseduc < 12 and
college for hiseduc ≥ 16.
They also consider the age difference between spouses using dummies for
heolder when agediff > 10 and
sheolder when agediff < -10.
Make sure in your exploration you describe the effects of husband's education,
couple's ethnicity and age difference, and test their significance
(net of other variables in the model) using (partial) likelihood ratio tests.
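The dummy variables in this specification can be written out as below. This is a Python sketch for illustration only; the function name is hypothetical, and it assumes that `agediff` is the husband's age minus the wife's (as the definition of heolder suggests) and that no inputs are missing.

```python
def derive_covariates(heblack, sheblack, hiseduc, agediff):
    """Build the Lillard and Panis style dummies for one couple.

    Assumes agediff = husband's age - wife's age and no missing values.
    """
    return {
        "heblack": int(heblack == 1),
        "mixedrace": int(heblack != sheblack),  # spouses differ in ethnicity
        "dropout": int(hiseduc < 12),           # less than high school
        "college": int(hiseduc >= 16),          # college or more
        "heolder": int(agediff > 10),           # husband older by more than 10 years
        "sheolder": int(agediff < -10),         # wife older by more than 10 years
    }


print(derive_covariates(heblack=1, sheblack=0, hiseduc=12, agediff=11))
```

When you translate this to Stata, remember that missing numeric values compare as larger than any number, so an expression such as `hiseduc >= 16` is true for missing education unless you exclude those cases explicitly.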
(b) Use Schoenfeld residuals to explore whether the effects of the
significant predictors can be considered proportional.
Explore further the proportionality of the effect of heblack
by introducing an interaction with a linear term on marriage duration.
Find the quartiles of marriage duration using a Kaplan-Meier estimator for
the total sample
and split the dataset so you can interact heblack with dummy
variables representing durations in Q1-Q2, Q2-Q3, and >Q3 (with <Q1
as the reference cell).
Rejoin the dataset (after dropping variables that vary by quartile category)
and test an interaction between heblack and an indicator for
duration greater than Q3.
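For reference, the quartiles can be read off a Kaplan-Meier curve as sketched below. This is a plain Python illustration of the estimator, not a replacement for Stata's survival commands; the function names and the toy data are made up. The quantile function returns `None` when the estimated survival curve never drops far enough, which can happen when many marriages are censored.

```python
def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct event time (events: 1 = divorce)."""
    data = sorted(zip(durations, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        failures = 0
        removed = 0
        # gather everyone whose spell ends at time t
        while i < len(data) and data[i][0] == t:
            failures += data[i][1]
            removed += 1
            i += 1
        if failures > 0:
            surv *= 1.0 - failures / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve


def km_quantile(curve, p):
    """Smallest t with S(t) <= 1 - p, or None if the curve never gets there."""
    for t, s in curve:
        if s <= 1.0 - p:
            return t
    return None


# Toy data: six marriages, three divorces and three censored spells
curve = kaplan_meier([1, 2, 2, 3, 4, 5], [1, 0, 1, 1, 0, 0])
print(curve)
print([km_quantile(curve, p) for p in (0.25, 0.50, 0.75)])
```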
(c) Estimate the survival function for white couples where the husband has high school education. Note what proportion of marriages survives 20 years, and what proportion eventually dissolve. Compare these results with appropriate estimates for black couples with the same education. The idea here is to translate hazard ratios into more familiar probabilities. For this analysis make sure you use the last model of part (b), including an additional effect for black husbands at long marriage durations.
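One way to write the quantities needed here, as a sketch in my own notation rather than a prescribed method: let $H_0(t)$ be the baseline cumulative hazard, $x'\beta$ the linear predictor for the time-fixed covariates, $\tau$ the third quartile Q3, and $\gamma$ the additional coefficient for black husbands at durations beyond $\tau$.

```latex
\begin{aligned}
H(t \mid x) &= e^{x'\beta}\Big[\, H_0(\min(t,\tau))
   + e^{\gamma\,\mathrm{heblack}}\,\big(H_0(t) - H_0(\min(t,\tau))\big) \Big], \\
S(t \mid x) &= \exp\{-H(t \mid x)\}.
\end{aligned}
```

For white couples (heblack = 0) this reduces to the familiar $S(t \mid x) = S_0(t)^{\exp(x'\beta)}$. The proportion surviving 20 years is $S(20 \mid x)$, and the proportion that eventually dissolve can be approximated by $1 - S(t_{\max} \mid x)$, where $t_{\max}$ is the longest duration at which the baseline is estimated.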
[2] A Time-Varying Covariate
(a) Examine the effect of children on marriage stability by introducing a linear term on number of children as a time-varying covariate, interpret the estimate and test its significance. You should find that the hazard of divorce is lower among couples with (more) children. Your interpretation should consider the potential endogeneity of number of children, which could result for example if couples with marital problems tend to postpone childbearing. One solution to this problem requires joint modeling of fertility and marital disruption using simultaneous hazard models, as in Lillard (1993).
(b) Estimate the survival function for white couples where the husband has high school education assuming they never have children. How would you translate the effect of having children into survival probabilities? The idea again is to translate hazard ratios into more familiar probabilities, but this is a bit trickier with time-varying covariates.
Hints: To do this analysis you need to create a separate record
each time a couple has a child. This is not hard to do with reshape.
Just be careful to keep track of marriage duration and to code correctly the
event/censoring indicator. As a sanity check I suggest you rerun the model
of part 1-a on the expanded dataset. You should get exactly the same estimates
as before. Then add the number of children as a time-varying covariate.
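The bookkeeping described in these hints can be sketched as follows. This Python illustration is not the reshape-based Stata solution; the function name and the second example are hypothetical. It assumes that the analysis duration (the mid-point for divorces) and the birth durations are already available for each marriage.

```python
def expand_marriage(duration, event, birth_times):
    """Split one marriage into records (start, stop, numkids, event).

    A new record starts at each birth that occurs before `duration`.
    Only the last record can carry the event; earlier records end
    because a child was born, so they are coded as censored (0).
    """
    births = sorted(t for t in birth_times if 0 < t < duration)
    records = []
    start = 0.0
    for stop in sorted(set(births)):            # twins create one cut, not two
        kids = sum(1 for b in births if b < stop)
        records.append((start, stop, kids, 0))
        start = stop
    records.append((start, duration, len(births), event))
    return records


# Respondent 9: one child at 3.734, censored at 10.546
print(expand_marriage(10.546, 0, [3.734]))
# A hypothetical divorce at 8 years with births at 2 and 5 years
print(expand_marriage(8.0, 1, [2.0, 5.0]))
```

Two quick checks on the expanded data are that the number of events is unchanged and that the total exposure (the sum of stop minus start) equals the sum of the original durations.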
[3] A Multiple-Spell Model
(a) To take advantage of Stata's ability to fit shared frailty models we need to
provide a parametric specification for the baseline hazard. A flexible way to do
this is to use a piece-wise exponential model where the hazard is assumed to be
constant in well-chosen categories.
This model can be fit using streg with an exponential distribution
on a dataset that has been stsplit into duration categories,
as we did in the analysis of child survival in Guatemala.
Previous work suggests that cutpoints at 2, 5, 10 and 20 years (plus 0 and infinity)
do a reasonable job capturing the baseline hazard of divorce.
Prepare the dataset and fit the model of part 2-a to first marriages only as a check.
(You may find the results helpful as an alternative
way of doing part 2-b as well.)
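The effect of splitting a record at the suggested cutpoints can be illustrated as follows. This is a Python sketch of the idea behind stsplit, not its implementation; the function name is made up. Each piece records the duration category, the exposure in that category, and whether the event occurred there.

```python
CUTS = [0, 2, 5, 10, 20, float("inf")]


def split_episode(start, stop, event, cuts=CUTS):
    """Split the episode (start, stop] at the cutpoints.

    Returns a list of (interval_index, exposure, event) pieces, where
    interval j covers (cuts[j], cuts[j + 1]].
    """
    pieces = []
    for j in range(len(cuts) - 1):
        lo = max(start, cuts[j])
        hi = min(stop, cuts[j + 1])
        if lo < hi:
            # the event belongs to the piece in which the episode ends
            occurred = event if hi == stop else 0
            pieces.append((j, hi - lo, occurred))
    return pieces


# A marriage that ends in divorce after 7 years
print(split_episode(0, 7, 1))
# The second fertility span of respondent 9 (censored)
print(split_episode(3.734, 10.546, 0))
```

With constant hazards within categories, events and exposure by category are all the model needs, which is why the piece-wise exponential model can also be fit as a Poisson regression with log exposure as an offset.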
(b) Fit the model to all marriages adding dummy variables to represent second marriages and third or higher order marriages, ignoring for now the fact that some respondents contribute multiple observations. Interpret and test the significance of the dummy variables for marriage order.
(c) Fit a shared frailty model using a random effect to account for possible correlation between the durations of a respondent's marriages. Test the significance of the random effect and interpret the estimate of its variance in terms of a correlation coefficient and in terms of a regression coefficient. Note any changes in the coefficients of marriage order after introducing the shared frailty term.
(d) How does the risk of divorce for second marriages compare with the risk for first marriages? In your answer try to distinguish the risk for average first and second marriages and the risk for the average couple. It would be useful to translate your answers into survival probabilities.
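The distinction between the average couple and the average marriage can be made concrete with the sketch below. It assumes gamma-distributed frailty with mean one and variance theta, which is only one of the frailty distributions you might choose, and the function names and numeric inputs are hypothetical. Under that assumption, Kendall's tau, theta/(theta + 2), is one common way to express the frailty variance as a rank correlation between the durations of a respondent's marriages.

```python
import math


def conditional_survival(cum_hazard, frailty=1.0):
    """Survival for a couple with a given frailty (1.0 = the average couple)."""
    return math.exp(-frailty * cum_hazard)


def marginal_survival(cum_hazard, theta):
    """Population-average survival under gamma frailty (mean 1, variance theta)."""
    if theta == 0:
        return math.exp(-cum_hazard)   # no heterogeneity
    return (1.0 + theta * cum_hazard) ** (-1.0 / theta)


def kendall_tau(theta):
    """Kendall's tau implied by shared gamma frailty with variance theta."""
    return theta / (theta + 2.0)


# Hypothetical inputs: cumulative hazard of 0.5 and frailty variance of 0.8
H, theta = 0.5, 0.8
print(conditional_survival(H), marginal_survival(H, theta), kendall_tau(theta))
```

To compare first and second marriages, evaluate the cumulative hazard for each marriage order from the fitted model and pass it to both functions; with theta greater than zero the marginal survival probability exceeds the conditional one for the average couple.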
