7.1 The Hazard and Survival Functions
Let T be a non-negative random variable representing the waiting time
until the occurrence of an event.
For simplicity we will adopt the terminology of survival analysis,
referring to the event of interest as 'death' and
to the waiting time as 'survival' time, but the techniques to be
studied have much wider applicability.
They can be used, for example, to study age at marriage,
the duration of marriage,
the intervals between successive births to a woman,
the duration of stay in a city (or in a job),
and the length of life.
The observant demographer will have noticed that these examples include
the fields of fertility, mortality and migration.
7.1.1 The Survival Function
We will assume for now that T is a continuous random variable with
probability density function (p.d.f.) f(t) and
cumulative distribution function (c.d.f.) F(t) = Pr{ T ≤ t },
giving the probability that the event has occurred by duration t.
It will often be convenient to work with the complement of the c.d.f.,
the survival function
$$
S(t) = \Pr\{ T > t \} = 1 - F(t) = \int_t^\infty f(x)\,dx, \tag{7.1}
$$
which gives the probability of being alive at duration t, or more generally,
the probability that the event of interest has not occurred by duration t.
7.1.2 The Hazard Function
An alternative characterization of the distribution of T is given by the
hazard function, or instantaneous rate of occurrence of the event,
defined as
|
l(t) = |
lim
dt0
|
|
Pr{t < T t + dt | T > t } dt
|
. |
| (7.2) |
The numerator of this expression is the conditional probability that the event
will occur in the interval (t,t+dt) given that it has not occurred before,
and the denominator is the width of the interval.
Dividing one by the other we obtain a rate of event occurrence per unit of
time. Taking the limit as the width of the interval goes down to zero,
we obtain an instantaneous rate of occurrence.
The conditional probability in the numerator may be written as the ratio of
the joint probability that T is in the interval (t,t+dt) and
T > t (which is, of course, the same as the probability that T is in
the interval), to the probability of the condition T > t.
The former may be written as f(t)dt for small dt,
while the latter is S(t) by definition.
Dividing by dt and passing to the limit gives the useful result
$$
\lambda(t) = \frac{f(t)}{S(t)}, \tag{7.3}
$$
which some authors give as a definition of the hazard function.
In words, the rate of occurrence of the event at duration t equals
the density of events at t, divided by the probability of surviving to
that duration without experiencing the event.
Note from Equation 7.1 that -f(t) is the derivative of S(t).
This suggests rewriting Equation 7.3 as
$$
\lambda(t) = -\frac{d}{dt} \log S(t).
$$
If we now integrate from 0 to t and introduce the boundary condition
S(0) = 1 (since the event is sure not to have occurred by duration 0),
we can solve the above expression to obtain a formula for the probability
of surviving to duration t as a function of the hazard at all durations
up to t:
$$
S(t) = \exp\left\{ -\int_0^t \lambda(x)\,dx \right\}. \tag{7.4}
$$
This expression should be familiar to demographers.
The integral in curly brackets in this equation is called the
cumulative hazard (or cumulative risk) and is denoted
$$
\Lambda(t) = \int_0^t \lambda(x)\,dx.
$$
You may think of
Λ(t) as the sum of the risks you face going from
duration 0 to t.
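The integration step behind Equation 7.4 can be written out explicitly. The following is a sketch that uses only the definitions above and assumes S(x) > 0 for all x in [0, t]. Because -f(x) is the derivative of S(x), the hazard f(x)/S(x) equals the negative derivative of log S(x), so

$$
\Lambda(t) = \int_0^t \lambda(x)\,dx
           = -\int_0^t \frac{d}{dx} \log S(x)\,dx
           = -\left[ \log S(t) - \log S(0) \right]
           = -\log S(t),
$$

where the last step uses S(0) = 1. Exponentiating both sides of log S(t) = -Λ(t) gives S(t) = exp{ -Λ(t) }, which is Equation 7.4.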
These results show that the survival and hazard functions
provide alternative but equivalent characterizations of the distribution of T.
Given the survival function, we can always differentiate to obtain the
density and then calculate the hazard using Equation 7.3.
Given the hazard, we can always integrate to obtain the cumulative hazard
and then exponentiate to obtain the survival function using Equation 7.4.
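The round trip between hazard and survival can be checked numerically. The sketch below is only an illustration: the linearly increasing hazard a + b t and the values a = 0.1, b = 0.05 are hypothetical choices, not taken from the text, and the integral and derivative are approximated with a trapezoidal rule and a central difference.

```python
import math

A, B = 0.1, 0.05  # hypothetical parameters, chosen only for illustration


def hazard(t):
    """Hypothetical hazard rising linearly with duration: a + b*t."""
    return A + B * t


def cumulative_hazard(t, steps=10_000):
    """Trapezoidal approximation to the integral of the hazard from 0 to t."""
    if t == 0:
        return 0.0
    h = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    for i in range(1, steps):
        total += hazard(i * h)
    return total * h


def survival(t):
    """Equation 7.4: exponentiate minus the cumulative hazard."""
    return math.exp(-cumulative_hazard(t))


def hazard_from_survival(t, eps=1e-4):
    """Equation 7.3: density over survival, with f(t) = -S'(t)
    approximated by a central difference."""
    density = -(survival(t + eps) - survival(t - eps)) / (2 * eps)
    return density / survival(t)


t = 3.0
# Closed form for this particular hazard: exp(-(a*t + b*t^2/2))
closed_form = math.exp(-(A * t + B * t**2 / 2))
print(survival(t), closed_form)            # hazard -> survival
print(hazard_from_survival(t), hazard(t))  # survival -> hazard
```

Each printed pair should agree to several decimal places, which is the numerical counterpart of saying that the two functions carry the same information.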
An example will help fix ideas.
Example: The simplest possible survival distribution is obtained
by assuming a constant risk over time, so the hazard is
$$
\lambda(t) = \lambda
$$
for all t. The corresponding survival function is
$$
S(t) = \exp\{ -\lambda t \}.
$$
This distribution is called the exponential distribution with parameter λ.
The density may be obtained by multiplying the survivor function by the hazard
to obtain
$$
f(t) = \lambda \exp\{ -\lambda t \}.
$$
The mean turns out to be 1/λ.
This distribution plays a central role in survival analysis, although it is
probably too simple to be useful in applications in its own right.
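The properties of the exponential distribution quoted in this example can be verified numerically. The sketch below assumes a hypothetical rate of 0.5 and truncates the integrals at duration 60, where the survival probability is negligible. Both choices are illustrative only.

```python
import math

rate = 0.5  # hypothetical constant hazard (lambda)


def survival(t):
    """Survival function under a constant hazard."""
    return math.exp(-rate * t)


def density(t):
    """Density obtained as hazard times survival."""
    return rate * survival(t)


def integrate(func, lower, upper, steps=200_000):
    """Trapezoidal rule on [lower, upper]."""
    h = (upper - lower) / steps
    total = 0.5 * (func(lower) + func(upper))
    for i in range(1, steps):
        total += func(lower + i * h)
    return total * h


upper = 60.0  # S(60) = exp(-30), so the truncated tail is negligible
total_mass = integrate(density, 0.0, upper)
mean = integrate(lambda t: t * density(t), 0.0, upper)

print(total_mass)      # the density should integrate to about one
print(mean, 1 / rate)  # the mean should be close to 1/lambda
```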
7.1.3 Expectation of Life
Let μ denote the mean or expected value of T. By definition,
one would calculate μ by multiplying t by the density f(t) and
integrating, so
$$
\mu = \int_0^\infty t f(t)\,dt.
$$
Integrating by parts, and making use of the fact that -f(t) is
the derivative of S(t), which has limits or boundary
conditions S(0) = 1 and S(∞) = 0, one can show that
$$
\mu = \int_0^\infty S(t)\,dt.
$$
In words, the mean is simply the integral of the survival function.
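The integration by parts can be sketched as follows, under the additional assumption (not stated explicitly above) that the mean is finite, so that t S(t) goes to zero as t goes to infinity. Taking u = t and dv = f(t) dt, so that v = -S(t),

$$
\mu = \int_0^\infty t f(t)\,dt
    = \Big[ -t\,S(t) \Big]_0^\infty + \int_0^\infty S(t)\,dt
    = \int_0^\infty S(t)\,dt.
$$

As a check, for the exponential distribution with S(t) = exp{ -λt } this gives

$$
\mu = \int_0^\infty \exp\{ -\lambda t \}\,dt = \frac{1}{\lambda},
$$

in agreement with the mean of the exponential distribution.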
7.1.4 A Note on Improper Random Variables*
So far we have assumed implicitly that the event of interest is bound to occur,
so that S(∞) = 0. In words, given enough time the proportion surviving
goes down to zero. This condition implies that the cumulative hazard must
diverge, i.e. we must have Λ(∞) = ∞.
Intuitively, the event will occur with certainty only if the cumulative risk
over a long period is sufficiently high.
There are, however, many events of possible interest that are not bound to
occur.
Some men and women remain forever single, some birth intervals never close,
and some people are happy enough at their jobs that they never leave.
What can we do in these cases? There are two approaches one can take.
One approach is to note that we can still calculate the hazard and survival
functions, which are well defined even if the event of interest is not
bound to occur.
For example we can study marriage in the entire population, which includes people
who will never marry, and calculate marriage rates and proportions single.
In this example S(t) would represent the proportion still single at age t
and S(∞) would represent the proportion who never marry.
One limitation of this approach is that if the event is not certain to
occur, then the waiting time T could be undefined (or infinite)
and thus not a proper random variable.
Its density, which could be calculated from the hazard and survival,
would be improper, i.e. it would fail to integrate to one.
Obviously, the mean waiting time would not be defined.
In terms of our example, we cannot calculate mean age
at marriage for the entire population, simply because not everyone marries.
But this limitation is of no great consequence if interest centers on the hazard
and survivor functions, rather than the waiting time.
In the marriage example we can even calculate a median age at marriage,
provided we define it as the age by which half the population has married.
The alternative approach is to condition the analysis on the event actually
occurring. In terms of our example, we could study marriage (perhaps
retrospectively) for people who eventually marry, since for this group
the actual waiting time T is always well defined.
In this case we can calculate not just the conditional hazard and survivor
functions, but also the mean. In our marriage example, we could calculate the
mean age at marriage for those who marry. We could even calculate a conventional
median, defined as the age by which half the people who will eventually marry
have done so.
It turns out that the conditional density, hazard and survivor
function for those who experience the event are related to the unconditional
density, hazard and survivor for the entire population. The conditional density
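is
$$
f^*(t) = \frac{f(t)}{1 - S(\infty)},
$$
and it integrates to one.
The conditional survivor function is
$$
S^*(t) = \frac{S(t) - S(\infty)}{1 - S(\infty)},
$$
and goes down to zero as t → ∞.
Dividing the density by the survivor function, we find the conditional hazard
to be
$$
\lambda^*(t) = \frac{f^*(t)}{S^*(t)} = \frac{f(t)}{S(t) - S(\infty)}.
$$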
Derivation of the mean waiting time for those who experience the event
is left as an exercise for the reader.
Whichever approach is adopted, care must be exercised to specify clearly
which hazard or survival is being used. For example, the conditional hazard
for those who eventually experience the event is always higher than the
unconditional hazard for the entire population. Note also that
in most cases all we observe is whether or not the event has occurred.
If the event has not occurred, we may be unable to determine
whether it will eventually occur.
In this context, only the unconditional hazard may be estimated from data,
but one can always translate the results into conditional expressions,
if so desired, using the results given above.
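The translation between unconditional and conditional quantities can be illustrated with a small numerical sketch. Everything in it is hypothetical: it assumes that a share of 0.2 of the population never experiences the event, so that S(∞) = 0.2, and that the rest face a constant hazard of 0.1. These values are chosen only to make the formulas concrete.

```python
import math

P_NEVER = 0.2  # hypothetical S(infinity): share who never experience the event
RATE = 0.1     # hypothetical constant hazard among those who eventually do


def survival(t):
    """Unconditional survivor function for the whole population."""
    return P_NEVER + (1 - P_NEVER) * math.exp(-RATE * t)


def density(t):
    """Unconditional (improper) density, f(t) = -S'(t)."""
    return (1 - P_NEVER) * RATE * math.exp(-RATE * t)


def hazard(t):
    """Unconditional hazard f(t)/S(t)."""
    return density(t) / survival(t)


def conditional_density(t):
    """f*(t) = f(t) / (1 - S(inf))."""
    return density(t) / (1 - P_NEVER)


def conditional_survival(t):
    """S*(t) = (S(t) - S(inf)) / (1 - S(inf))."""
    return (survival(t) - P_NEVER) / (1 - P_NEVER)


def conditional_hazard(t):
    """lambda*(t) = f(t) / (S(t) - S(inf))."""
    return density(t) / (survival(t) - P_NEVER)


# Median for the whole population: duration by which half have experienced
# the event, found by solving S(t) = 0.5 (possible here since S(inf) < 0.5).
population_median = -math.log((0.5 - P_NEVER) / (1 - P_NEVER)) / RATE
# Conventional median among those who eventually experience the event.
conditional_median = math.log(2) / RATE

for t in (0.0, 5.0, 20.0, 50.0):
    print(t, round(hazard(t), 4), round(conditional_hazard(t), 4))
print(population_median, conditional_median)
```

In this setup the conditional hazard stays at the assumed rate, while the unconditional hazard falls toward zero as the population still at risk comes to consist mostly of people who will never experience the event. The two medians also differ, which illustrates why it matters to state which definition is in use.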