A.1 Maximum Likelihood Estimation
Let $Y_1, \ldots, Y_n$ be $n$ independent random variables
with probability density functions (pdfs)
$f_i(y_i;\theta)$ depending on a vector-valued parameter
$\theta$.
A.1.1 The Log-likelihood Function
The joint density of $n$ independent observations
$y = (y_1, \ldots, y_n)$ is
$$ f(y;\theta) = \prod_{i=1}^{n} f_i(y_i;\theta) = L(\theta;y). \tag{A.1} $$
This expression, viewed as a function of the unknown parameter
$\theta$ given the data $y$, is called the
likelihood function.
Often we work with the natural logarithm of the likelihood
function, the so-called log-likelihood function:
$$ \log L(\theta;y) = \sum_{i=1}^{n} \log f_i(y_i;\theta). \tag{A.2} $$
A sensible way to estimate the parameter $\theta$ given the data
$y$ is to maximize the likelihood
(or equivalently the log-likelihood) function,
choosing the parameter value that makes the data actually observed
as likely as possible.
Formally, we define the maximum-likelihood estimator (mle)
as the value $\hat{\theta}$ such that
$$ \log L(\hat{\theta};y) \ge \log L(\theta;y) \quad \text{for all } \theta. \tag{A.3} $$
Example: The Log-Likelihood for the Geometric Distribution.
Consider a series of independent Bernoulli trials
with common probability of success $p$. The distribution of the
number of failures $Y_i$ before the first success has pdf
$$ \Pr(Y_i = y_i) = (1-p)^{y_i} p, \tag{A.4} $$
for $y_i = 0, 1, \ldots$.
Direct calculation shows that $E(Y_i) = (1-p)/p$.
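The log-likelihood function based on $n$ observations $y$
can be written as
\begin{align}
\log L(p;y) &= \sum_{i=1}^{n} \{ y_i \log(1-p) + \log p \} \tag{A.5} \\
            &= n ( \bar{y} \log(1-p) + \log p ), \tag{A.6}
\end{align}
where $\bar{y} = \sum y_i / n$ is the sample mean.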
The fact that the log-likelihood depends on the observations
only through the sample mean shows that $\bar{y}$ is a
sufficient statistic for the unknown probability $p$.
Figure A.1: The Geometric Log-Likelihood for $n = 20$ and $\bar{y} = 3$
Figure A.1 shows the log-likelihood function for a sample of
$n = 20$ observations from a geometric distribution when the observed
sample mean is $\bar{y} = 3$. □
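The shape described for Figure A.1 can be checked numerically. The following is a minimal sketch, assuming the log-likelihood takes the sample-mean form of (A.6); the function name `geometric_loglik` and the grid spacing of 0.001 are illustrative choices, not part of the original notes.

```python
import math

def geometric_loglik(p, n, ybar):
    """Log-likelihood for n geometric observations with sample mean ybar."""
    return n * (ybar * math.log(1.0 - p) + math.log(p))

# Evaluate on a grid of p values strictly inside (0, 1).
n, ybar = 20, 3.0
grid = [k / 1000 for k in range(1, 1000)]
values = [geometric_loglik(p, n, ybar) for p in grid]

# Locate the grid point with the largest log-likelihood.
p_best = grid[values.index(max(values))]
print(p_best, max(values))
```

On this grid the largest value occurs at p = 0.25, where the log-likelihood is roughly -44.99.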
A.1.2 The Score Vector
The first derivative of the log-likelihood function is called
Fisher's score function, and is denoted by
$$ u(\theta) = \frac{\partial \log L(\theta;y)}{\partial \theta}. \tag{A.7} $$
Note that the score is a vector of first partial derivatives,
one for each element of $\theta$.
If the log-likelihood is concave, one can find the maximum
likelihood estimator by setting the score to zero, i.e. by solving
the system of equations:
$$ u(\hat{\theta}) = 0. \tag{A.8} $$
Example: The Score Function for the Geometric Distribution.
The score function for $n$ observations from a geometric
distribution is
$$ u(p) = \frac{d \log L}{dp} = n \left( \frac{1}{p} - \frac{\bar{y}}{1-p} \right). \tag{A.9} $$
Setting this equation to zero and solving for $p$ leads to
the maximum likelihood estimator
$$ \hat{p} = \frac{1}{1+\bar{y}}. \tag{A.10} $$
Note that the mle of the probability of success is the reciprocal
of the average number of trials per success, $1 + \bar{y}$ (the
failures plus the success itself). This result is intuitively reasonable:
the longer it takes to get a success, the lower our estimate
of the probability of success would be.
Suppose now that in a sample of $n = 20$ observations we have
obtained a sample mean of $\bar{y} = 3$. The mle of the
probability of success would be
$\hat{p} = 1/(1+3) = 0.25$,
and it should be clear from Figure A.1 that this value maximizes
the log-likelihood. □
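A short numerical check of this example may help. The sketch below assumes the score has the form n(1/p - ȳ/(1-p)) and that its root is 1/(1+ȳ); the function names are hypothetical.

```python
def geometric_score(p, n, ybar):
    """Score: derivative of the geometric log-likelihood with respect to p."""
    return n * (1.0 / p - ybar / (1.0 - p))

def geometric_mle(ybar):
    """Closed-form root of the score: 1 / (1 + ybar)."""
    return 1.0 / (1.0 + ybar)

n, ybar = 20, 3.0
p_hat = geometric_mle(ybar)
print(p_hat)                               # 0.25
print(geometric_score(p_hat, n, ybar))     # 0.0: the score vanishes at the mle
print(geometric_score(0.20, n, ybar) > 0)  # True: log-likelihood still rising
print(geometric_score(0.30, n, ybar) < 0)  # True: log-likelihood now falling
```

The sign change of the score around 0.25 is what identifies the root as a maximum rather than a minimum.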
A.1.3 The Information Matrix
The score is a random vector with some interesting statistical
properties. In particular,
the score evaluated at the true parameter value $\theta$
has mean zero,
$$ E[u(\theta)] = 0, $$
and variance-covariance matrix given by the
information matrix:
$$ \mathrm{var}[u(\theta)] = E[u(\theta) u'(\theta)] = I(\theta). \tag{A.11} $$
Under mild regularity conditions, the information matrix
can also be obtained as minus the expected value
of the second derivatives of the log-likelihood:
$$ I(\theta) = -E\left[ \frac{\partial^2 \log L(\theta)}{\partial\theta\,\partial\theta'} \right]. \tag{A.12} $$
The matrix of negative observed second derivatives
is sometimes called the observed information matrix.
Note that the second derivative indicates the extent to which
the log-likelihood function is peaked rather than flat.
This makes the interpretation in terms of information
intuitively reasonable.
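These properties can be verified exactly for a single geometric observation by summing over its probability function. This is only a sketch: the value p = 0.15 and the truncation of the infinite sum at 2000 terms are assumptions made for illustration, and the helper names are hypothetical.

```python
def pmf(y, p):
    """Geometric probability of y failures before the first success."""
    return (1.0 - p) ** y * p

def score(y, p):
    """Score contribution of a single observation y."""
    return 1.0 / p - y / (1.0 - p)

def neg_second_derivative(y, p):
    """Minus the second derivative of the single-observation log-likelihood."""
    return 1.0 / p**2 + y / (1.0 - p) ** 2

p = 0.15
support = range(2000)  # assumed truncation point; the tail mass beyond it is negligible

mean_score = sum(score(y, p) * pmf(y, p) for y in support)
var_score = sum(score(y, p) ** 2 * pmf(y, p) for y in support) - mean_score**2
expected_info = sum(neg_second_derivative(y, p) * pmf(y, p) for y in support)

print(mean_score)     # approximately 0
print(var_score)      # approximately equal to expected_info
print(expected_info)
```

The variance of the score and minus the expected second derivative agree up to rounding error, which is the identity between (A.11) and (A.12) for this distribution.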
Example: Information for the Geometric Distribution.
Differentiating the score we find the observed information to be
$$ -\frac{d^2 \log L}{dp^2} = -\frac{du}{dp} = n \left( \frac{1}{p^2} + \frac{\bar{y}}{(1-p)^2} \right). \tag{A.13} $$
To find the expected information we use the fact that
the expected value of the sample mean $\bar{y}$ is
the population mean $(1-p)/p$, to obtain
(after some simplification)
$$ I(p) = \frac{n}{p^2 (1-p)}. \tag{A.14} $$
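The simplification can be spelled out. Taking expectations in (A.13) and replacing $E(\bar{y})$ by $(1-p)/p$ gives the following steps:

```latex
\begin{align*}
I(p) &= n \left( \frac{1}{p^2} + \frac{E(\bar{y})}{(1-p)^2} \right)
      = n \left( \frac{1}{p^2} + \frac{1}{p(1-p)} \right) \\
     &= n \, \frac{(1-p) + p}{p^2 (1-p)}
      = \frac{n}{p^2 (1-p)}.
\end{align*}
```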
Note that the information increases with the sample size $n$
and varies with $p$, increasing as $p$ moves away from
$2/3$ towards 0 or 1.
In a sample of size $n = 20$, if the true value of the parameter
was $p = 0.15$ the expected information would be $I(0.15) = 1045.8$.
If the sample mean turned out to be $\bar{y} = 3$,
the observed information would be 971.9.
Of course, we don't know the true value of $p$. Substituting
the mle $\hat{p} = 0.25$, we estimate the expected
and observed information
as 426.7. □
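The figures quoted in this example can be reproduced with a few lines of code. The sketch assumes the observed information of (A.13) and the expected information n / (p^2 (1 - p)) that results from replacing ȳ with (1 - p) / p; the function names are hypothetical.

```python
def observed_information(p, n, ybar):
    """Observed information: minus the second derivative of the log-likelihood."""
    return n * (1.0 / p**2 + ybar / (1.0 - p) ** 2)

def expected_information(p, n):
    """Expected information, obtained by replacing ybar with (1 - p) / p."""
    return n / (p**2 * (1.0 - p))

n, ybar = 20, 3.0
print(round(expected_information(0.15, n), 1))        # 1045.8
print(round(observed_information(0.15, n, ybar), 1))  # 971.9
print(round(expected_information(0.25, n), 1))        # 426.7
print(round(observed_information(0.25, n, ybar), 1))  # 426.7
```

Expected and observed information coincide at the mle here because the sample mean ȳ = 3 equals the fitted population mean (1 - 0.25) / 0.25.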
A.1.4 Newton-Raphson and Fisher Scoring
Calculation of the mle often requires iterative procedures.
Consider expanding the score function evaluated at the
mle $\hat{\theta}$ around a trial value $\theta_0$ using
a first-order Taylor series, so that
$$ u(\hat{\theta}) \approx u(\theta_0) + \frac{\partial u(\theta)}{\partial \theta} (\hat{\theta} - \theta_0). \tag{A.15} $$
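Let $H$ denote the Hessian,
or matrix of second derivatives of the log-likelihood function,
$$ H(\theta) = \frac{\partial^2 \log L}{\partial\theta\,\partial\theta'} = \frac{\partial u(\theta)}{\partial \theta}. \tag{A.16} $$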
Setting the left-hand side of Equation A.15 to zero
and solving for $\hat{\theta}$ gives the first-order approximation
$$ \hat{\theta} = \theta_0 - H^{-1}(\theta_0) \, u(\theta_0). \tag{A.17} $$
This result provides the basis for an iterative approach for
computing the mle known as the
Newton-Raphson technique.
Given a trial value, we use Equation A.17
to obtain an improved estimate and repeat the process until
differences between successive estimates are sufficiently
close to zero. (Or until the elements of the vector of first
derivatives are sufficiently close to zero.)
This procedure tends to converge quickly
if the log-likelihood is well-behaved (close to quadratic)
in a neighborhood of the maximum and if
the starting value is reasonably close to the mle.
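The iteration can be sketched for a scalar parameter as follows. This is an illustration under stated assumptions, not a general-purpose routine: the function name `newton_raphson`, the tolerance, the iteration cap and the stopping rule based on step size are choices made here, and the geometric score and Hessian with n = 20 and sample mean 3 serve only as test inputs.

```python
def newton_raphson(score, hessian, theta0, tol=1e-10, max_iter=50):
    """Iterate theta <- theta - u(theta) / H(theta) for a scalar parameter."""
    theta = theta0
    for iteration in range(1, max_iter + 1):
        step = score(theta) / hessian(theta)
        theta = theta - step
        # Stop when successive estimates differ by less than tol.
        if abs(step) < tol:
            return theta, iteration
    raise RuntimeError("Newton-Raphson did not converge")

# Geometric example: score and Hessian for n = 20, sample mean 3.
n, ybar = 20, 3.0
u = lambda p: n * (1.0 / p - ybar / (1.0 - p))
H = lambda p: -n * (1.0 / p**2 + ybar / (1.0 - p) ** 2)

p_hat, iterations = newton_raphson(u, H, theta0=0.1)
print(p_hat, iterations)
```

For a vector parameter the division would become the solution of a linear system in the Hessian matrix; the scalar form is kept here for brevity.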
An alternative procedure first suggested by Fisher
is to replace minus the Hessian by its expected value,
the information matrix. The resulting procedure
takes as our improved estimate
$$ \hat{\theta} = \theta_0 + I^{-1}(\theta_0) \, u(\theta_0), \tag{A.18} $$
and is known as
Fisher Scoring.
Example: Fisher Scoring in the Geometric Distribution.
In this case setting the score to zero leads to an explicit solution
for the mle and no iteration is needed. It is instructive, however,
to try the procedure anyway. Using the results we have obtained for the
score and information, the Fisher scoring procedure leads to the
updating formula
$$ \hat{p} = p_0 + (1 - p_0 - p_0 \bar{y}) \, p_0. \tag{A.19} $$
If the sample mean is $\bar{y} = 3$ and we start from $p_0 = 0.1$,
say, the procedure converges to the mle $\hat{p} = 0.25$ (to three
decimal places) in four iterations. □
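The iterations in this example are easy to trace by hand or in code. The sketch below assumes the update p + (1 - p - p ȳ) p of (A.19); the function name and the choice to run six updates are illustrative.

```python
def fisher_scoring_step(p, ybar):
    """One Fisher scoring update for the geometric model."""
    return p + (1.0 - p - p * ybar) * p

ybar = 3.0
p = 0.1
history = [p]
for _ in range(6):
    p = fisher_scoring_step(p, ybar)
    history.append(p)

# Print the iteration number and the current estimate.
for k, value in enumerate(history):
    print(k, round(value, 4))
```

Starting from 0.1, the successive estimates are 0.16, 0.2176, 0.2458 and 0.2499, so the fourth update already agrees with 0.25 to three decimal places. With a sample mean of 3, each update replaces the error 0.25 - p by four times its square, which explains the rapid convergence.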