#### Data and model

We consider a sample of *i* = 1,..., *N* unrelated individuals with failure or censoring time *t*_{i}. The indicator *d*_{i} denotes whether *t*_{i} is an event time (*d*_{i} = 1) or a censoring time (*d*_{i} = 0). Let *g*_{i} be a vector of *m* biallelic SNPs measured in individual *i*. With *m* biallelic SNPs there are *j* = 1,..., *Nhap* = 2^{m} different haplotypes possible, with population frequencies *p*_{1},..., *p*_{j},..., *p*_{Nhap}.

Suppose all haplotypes were observed in all patients; then these could be represented by the vector *x*_{i} of length *Nhap*, where *x*_{ij} equals 0, 1, or 2, depending on the number of haplotypes of type *j* observed in patient *i*. (Notice that ∑_{j} *x*_{ij} = 2, meaning that only *Nhap* - 1 contrasts are identifiable.) The conditional hazard function for failure at *t*_{i} given *x*_{i} can then be specified as

*ln*(*h*(*t*_{i}|*x*_{i})) = *ln*(*h*_{0}(*t*_{i})) + *β'x*_{i}, (1)

where *h*_{0}(*t*) is an unspecified baseline hazard function, and *β* a vector of regression parameters. The survival function *S*(*t*_{i}|*x*_{i}) equals

*S*(*t*_{i}|*x*_{i}) = *exp*(-*H*_{0}(*t*_{i})*exp*(*β'x*_{i})), (2)

where *H*_{0}(*t*) is the unspecified baseline cumulative hazard function.
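The relation between the linear predictor and the survival probability can be sketched as follows (a minimal illustration with made-up inputs; the function name is an assumption of this sketch):

```python
import math

def cox_survival(H0_t, lin_pred):
    """S(t|x) = exp(-H0(t) * exp(beta'x)), for a baseline cumulative hazard
    value H0(t) and a linear predictor beta'x (both illustrative inputs)."""
    return math.exp(-H0_t * math.exp(lin_pred))

# A zero linear predictor recovers the baseline survival exp(-H0(t)).
print(cox_survival(0.2, 0.0))  # exp(-0.2)
```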

In our case, haplotypes were not observed in all patients, and therefore we model the survival function conditional on the observed genotypes as a weighted mixture over all *z*_{i} possible haplotype pairs given genotypes *g*_{i}:

*S*(*t*_{i}|*g*_{i}) = ∑_{q} *w*_{iq}*S*_{q}(*t*_{i}|*x*_{iq}), (3)

where *S*_{q}(*t*_{i}|*x*_{iq}) is the survival function specified as in equation (2), *x*_{iq} is the *q*^{th} haplotype pair that is possible given genotypes *g*_{i}, and *w*_{iq} is the probability that individual *i* has haplotype pair *q* given genotypes *g*_{i}: *w*_{iq} = *Pr*(*X*_{i} = *x*_{iq}|*g*_{i}).
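A minimal sketch of this mixture (weights, linear predictors, and the cumulative-hazard value are made-up inputs, not the authors' code):

```python
import math

def mixture_survival(weights, lin_preds, H0_t):
    """S(t|g) = sum_q w_q * exp(-H0(t) * exp(beta'x_q)): a weighted mixture of
    Cox survival functions over the possible haplotype pairs q."""
    return sum(w * math.exp(-H0_t * math.exp(lp))
               for w, lp in zip(weights, lin_preds))

# Two equally likely haplotype pairs with different risk scores: the mixture
# survival lies between the two pair-specific survival curves.
s = mixture_survival([0.5, 0.5], [0.0, 1.0], 0.2)
```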

In most circumstances the population haplotype frequencies *p*_{1},..., *p*_{Nhap} are unknown, and must be estimated from the data at hand. But suppose these are known; then under the assumption of Hardy-Weinberg equilibrium the probability that an individual is carrying haplotype pair *q* = (*h*, *r*) is calculated as *p*_{h} * *p*_{r} (*h*, *r* = 1,..., *Nhap*). Consequently, when observing genotype vector *g*_{i} the probability that individual *i* has haplotype pair *q* = (*h*, *r*) equals:

*w*_{iq} = *d*_{hri}*p*_{h}*p*_{r}/∑_{(h',r')} *d*_{h'r'i}*p*_{h'}*p*_{r'}, (4)

where the summation takes place over all haplotype pairs that are compatible with genotypes *g*_{i}, and *d*_{hri} is an indicator function, which = 1 when the haplotype pair (*h*, *r*) is compatible with *g*_{i} and 0 otherwise.
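The compatibility check and the Hardy-Weinberg weighting can be sketched as follows. This assumes genotypes coded as per-locus allele counts (0/1/2) and haplotypes as 0/1 tuples, so that a pair (*h*, *r*) is compatible exactly when the alleles of *h* and *r* sum to the genotype vector; the function name and codings are illustrative:

```python
from itertools import product

def hwe_pair_weights(g, p):
    """Prior probabilities w_iq of each ordered haplotype pair (h, r) compatible
    with genotype vector g (allele counts 0/1/2 per locus), given haplotype
    frequencies p (dict keyed by 0/1 tuples), under Hardy-Weinberg equilibrium."""
    m = len(g)
    haps = list(product((0, 1), repeat=m))
    pairs = {}
    for h in haps:
        for r in haps:
            if all(a + b == c for a, b, c in zip(h, r, g)):  # d_hri = 1
                pairs[(h, r)] = p[h] * p[r]
    total = sum(pairs.values())  # normalizing constant of equation-(4) type
    return {pair: w / total for pair, w in pairs.items()}

# Double heterozygote at two loci: four ordered compatible pairs.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
w = hwe_pair_weights((1, 1), p)
```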

Given the weights *w*_{iq}, the full likelihood of the data given the model in equation (3) equals

*L* = ∏_{i}∑_{q} *w*_{iq}(*h*(*t*_{i}|*x*_{iq}))^{d_i}*S*(*t*_{i}|*x*_{iq}), (5)

and the regression parameters *β* and the baseline cumulative hazard function *H*_{0}(*t*) may be estimated by maximizing (5). If *H*_{0}(*t*) is a parametric function, maximization of (5) is uncomplicated. If *H*_{0}(*t*) is a nonparametric function, we follow a similar argument as Breslow [14]. If events occurred at times *τ*_{1},..., *τ*_{j},..., *τ*_{k}, we shift all censored observations in the interval [*τ*_{j-1}, *τ*_{j}) to *τ*_{j-1}, and assume (piecewise) constant hazard in the intervals. In that case the logarithm of the likelihood (5) reduces to

*lnL* = ∑_{i} *ln*∑_{q} *w*_{iq}(*λ*_{j(i)}*exp*(*β'x*_{iq}))^{d_i}*exp*(-*exp*(*β'x*_{iq})∑_{j} *λ*_{j}*I*(*t*_{i} ≥ *τ*_{j})), (6)

where *λ*_{j} is the jump in the cumulative baseline hazard function at time *τ*_{j}, *λ*_{j(i)} is the jump at the event time of individual *i*, and *I*(*t*_{i} ≥ *τ*_{j}) is an indicator function.
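Under this piecewise-constant assumption the cumulative baseline hazard is a step function built from the jumps; a minimal sketch (event-time grid and jump sizes are made-up inputs):

```python
def cum_hazard(t, taus, lams):
    """H0(t) = sum_j lambda_j * I(tau_j <= t): the cumulative baseline hazard
    accumulated from jumps lams at ordered event times taus."""
    return sum(lam for tau, lam in zip(taus, lams) if tau <= t)

# Jumps 0.1, 0.2, 0.3 at times 1, 2, 3: at t = 2.5 only the first two count.
print(cum_hazard(2.5, [1.0, 2.0, 3.0], [0.1, 0.2, 0.3]))  # 0.1 + 0.2
```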

Estimation equations for *H*_{0}(*t*) and *β* can be derived by equating to zero the first-order derivatives of the log-likelihood in (6). For *λ*_{j} we find

*λ*_{j} = ∑_{i} *d*_{i}*I*(*t*_{i} = *τ*_{j})/∑_{i} *I*(*t*_{i} ≥ *τ*_{j})∑_{q} *w̃*_{iq}*exp*(*β'x*_{iq}), (7)

and for *β*_{ℓ} (ℓ = 1,..., *Nhap*):

∑_{i}∑_{q} *w̃*_{iq}*x*_{iqℓ}(*d*_{i} - *exp*(*β'x*_{iq})∑_{j} *λ*_{j}*I*(*t*_{i} ≥ *τ*_{j})) = 0, (8)

where *w̃*_{iq} is the posterior probability that individual *i* carries haplotype pair *q*, given the data and the current parameter values. Notice that if *z*_{i} = 1 for all *i*, thus when all haplotypes were observed, then *w̃*_{iq} = 1, and equation (7) reduces to the usual Breslow estimator of *H*_{0}(*t*), and equation (8) to the usual Cox estimator of *β*.

Unfortunately, the weights *w̃*_{iq} depend on both *H*_{0}(*t*) and *β*.
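A sketch of a weighted Breslow-type jump update in this spirit, with the posterior weights supplied as fixed inputs (all names and the per-individual list layout are assumptions of this illustration):

```python
import math

def breslow_jump(tau_j, times, events, post_weights, lin_preds):
    """lambda_j = (#events at tau_j) / sum over at-risk individuals i of
    sum_q wtilde_iq * exp(beta'x_iq). post_weights[i] and lin_preds[i] hold,
    per possible haplotype pair q of individual i, the posterior weight and
    the linear predictor."""
    num = sum(1 for t, d in zip(times, events) if d == 1 and t == tau_j)
    den = sum(sum(w * math.exp(lp) for w, lp in zip(post_weights[i], lin_preds[i]))
              for i, t in enumerate(times) if t >= tau_j)
    return num / den

# One certain haplotype pair per individual and beta'x = 0 recovers the usual
# Breslow estimator: events / number at risk.
lam = breslow_jump(1.0, [1.0, 2.0, 3.0], [1, 0, 1], [[1.0]] * 3, [[0.0]] * 3)
print(lam)  # 1 event / 3 at risk
```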

#### Estimation algorithm

Since in practice the population haplotype frequencies *p* must be estimated together with *β* and *H*_{0}(*t*), and because *w̃*_{iq} depends on *H*_{0}(*t*) and *β*, direct maximization of (5) or (6) is complicated. Instead we propose to use an EM algorithm. For that we consider the covariate vector *x*_{i} as missing, and we maximize the expectation of the joint log-likelihood *L*((*t*_{i}, *d*_{i}), *x*_{i}|*g*_{i}) over the posterior probabilities of *x*_{i} given the observed genotypes *g*_{i}, the observed failure/censoring times (*t*_{i}, *d*_{i}), and current estimates of the parameters *β*, *H*_{0}(*t*), and *w*_{iq}:

*E*[*lnL*] = ∑_{i}∑_{q} *w̃*_{iq}^{(a)}*ln*(*w*_{iq}(*h*(*t*_{i}|*x*_{iq}))^{d_i}*S*(*t*_{i}|*x*_{iq})), (10)

where *w̃*_{iq}^{(a)} is the posterior probability that *X*_{i} = *x*_{iq} given ((*t*_{i}, *d*_{i}), *g*_{i}, *β*^{(a)}, *H*_{0}(*t*)^{(a)}, *w*_{iq}^{(a)}):

*w̃*_{iq}^{(a)} = *w*_{iq}^{(a)}(*h*^{(a)}(*t*_{i}|*x*_{iq}))^{d_i}*S*^{(a)}(*t*_{i}|*x*_{iq})/∑_{q'} *w*_{iq'}^{(a)}(*h*^{(a)}(*t*_{i}|*x*_{iq'}))^{d_i}*S*^{(a)}(*t*_{i}|*x*_{iq'}). (11)
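The E-step update, posterior weight proportional to prior weight times the individual's likelihood contribution *h*^{d}*S*, can be sketched as follows (the hazard and survival values are built from illustrative scalar inputs; names are assumptions of this sketch):

```python
import math

def posterior_weights(prior_w, lin_preds, H0_t, lam_t, d):
    """E-step sketch: wtilde_iq proportional to w_iq * h(t_i|x_iq)**d_i * S(t_i|x_iq),
    with h = lam_t * exp(beta'x_iq) and S = exp(-H0_t * exp(beta'x_iq)).
    prior_w and lin_preds list, per possible pair q, the prior weight and
    linear predictor; d is the event indicator d_i."""
    contribs = []
    for w, lp in zip(prior_w, lin_preds):
        h = lam_t * math.exp(lp)
        s = math.exp(-H0_t * math.exp(lp))
        contribs.append(w * (h ** d) * s)
    total = sum(contribs)
    return [c / total for c in contribs]

# For a censored individual with negligible cumulative hazard, the posterior
# weights stay equal to the prior weights.
pw = posterior_weights([0.7, 0.3], [0.0, 1.0], 0.0, 0.1, 0)
pe = posterior_weights([0.5, 0.5], [0.0, 1.0], 0.2, 0.1, 1)
```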

The EM algorithm then consists of iterating two steps. In the M-step of iteration *a* + 1, the parameters (*β*, *H*_{0}(*t*), and *w*_{iq} = *p*_{j}*p*_{l}/∑*p*_{r}*p*_{s}) are estimated by maximizing (10), with *w̃*_{iq} evaluated using (*β*^{(a)}, *H*_{0}(*t*)^{(a)}, *w*_{iq}^{(a)}); in the E-step, *w̃*_{iq} is re-estimated using (11) with (*β*^{(a+1)}, *H*_{0}(*t*)^{(a+1)}, *w*_{iq}^{(a+1)}). The estimation equations for *β* and *H*_{0}(*t*) are the same as in (8) and (7), but with *w̃*_{iq} replaced by *w̃*_{iq}^{(a)}, and these are solved iteratively. The haplotype relative frequencies *p*_{j} are estimated as

*p*_{j} = ∑_{i}∑_{q} *w̃*_{iq}*I*(*j*, *q*)/2*N*, (13)

where *I*(*j*, *q*) is an indicator function denoting whether haplotype *j* is part of haplotype combination *q*. Standard errors of *β* can be derived from the information matrix of the log-likelihood in equation (6):

*I*(*β*) = ∑_{i} *E*[-∂^{2}*lnL*_{i}^{c}/∂*β*∂*β'*] - ∑_{i} *E*[*U*_{i}^{c}*U*_{i}^{c'}] + ∑_{i} *U*_{i}*U*_{i}', (14)

where *lnL*_{i}^{c} and *U*_{i}^{c} denote the complete-data log-likelihood contribution and score of individual *i*, *U*_{i} is the observed-data score, and the expectations are taken over the posterior haplotype-pair probabilities *w̃*_{iq}.

Since we are mainly interested in the uncertainty of *β*, we only used the part of the Hessian that pertains to *β*. Notice that the first term of (14) equals the Hessian matrix that is used in the M-step of the EM algorithm.
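The M-step frequency update described above can be sketched as follows; this sketch counts both copies of a haplotype in a homozygous pair, and all names and the per-individual list layout are illustrative assumptions:

```python
def update_frequencies(post_weights, pairs, n_individuals):
    """M-step sketch: p_j = sum_i sum_q wtilde_iq * (#copies of j in pair q) / (2N).
    pairs[i][q] is the haplotype pair (h, r) of option q for individual i;
    post_weights[i][q] is the corresponding posterior weight."""
    freqs = {}
    for i in range(n_individuals):
        for w, (h, r) in zip(post_weights[i], pairs[i]):
            freqs[h] = freqs.get(h, 0.0) + w
            freqs[r] = freqs.get(r, 0.0) + w
    return {j: f / (2 * n_individuals) for j, f in freqs.items()}

# One individual carrying pair ((0,), (1,)) with certainty:
# each haplotype gets relative frequency 0.5.
f = update_frequencies([[1.0]], [[((0,), (1,))]], 1)
```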

#### Penalized log-likelihood

Mutations in general tend to be rare, and so are the haplotypes in which they are encompassed. Furthermore, when ten loci are considered there are 2^{10} = 1024 different haplotypes possible, many of which will have low frequency even in samples of thousands of individuals. If haplotypes have low frequency, their associated hazard ratio estimates will be unstable. We used a penalized log-likelihood method to obtain more stable parameter estimates. Basically, we optimized the penalized log-likelihood ℓ^{p}, defined as ℓ^{p} = *lnL* - *λPen*(*β*), where *Pen*(*β*) is the penalty function and *λ* the penalty parameter.

As penalty functions we considered the well-known ridge penalty function *Pen*(*β*) = ∑_{a} *β*_{a}^{2}, and a difference-penalty function *Pen*(*β*) = ∑_{a}∑_{b} *a*_{ab}(*β*_{a} - *β*_{b})^{2} (*a* > *b*), where *a*_{ab} is a fixed and known value representing the similarity of haplotypes *a* and *b*. We quantified the similarity between haplotypes (*a*_{ab}) by counting the number of shared alleles, which – with *m* loci – varies between zero and *m* - 1.
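Both penalties can be sketched directly (haplotypes again coded as 0/1 tuples; function names are illustrative assumptions):

```python
def ridge_penalty(beta):
    """Ridge penalty: sum_a beta_a**2."""
    return sum(b * b for b in beta)

def shared_alleles(hap_a, hap_b):
    """Similarity a_ab: number of loci at which two haplotypes carry the same allele."""
    return sum(1 for x, y in zip(hap_a, hap_b) if x == y)

def difference_penalty(beta, haps):
    """Difference penalty: sum over pairs a > b of a_ab * (beta_a - beta_b)**2,
    with a_ab the number of alleles shared by haplotypes a and b."""
    total = 0.0
    for a in range(len(beta)):
        for b in range(a):
            total += shared_alleles(haps[a], haps[b]) * (beta[a] - beta[b]) ** 2
    return total

# Three two-locus haplotypes: similar haplotypes are shrunk towards each other.
haps = [(0, 0), (0, 1), (1, 1)]
beta = [0.0, 1.0, 2.0]
print(ridge_penalty(beta), difference_penalty(beta, haps))  # 5.0 2.0
```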

The penalty parameter *λ* was found by optimizing the cross-validated log-likelihood (CVL) as described by Verweij and van Houwelingen [15]: *CVL*(*λ*) = *lnL*^{Breslow}(*β*^{λ}) - *c*(*λ*), where *lnL*^{Breslow}(*β*^{λ}) is the log-likelihood evaluated with the penalized log-likelihood estimate *β*^{λ}. The factor *c*(*λ*) is an approximation of the effective dimension of the model, which in our case depends on the log hazard ratios *β*, the baseline hazard function *H*_{0}(*t*), and the haplotype frequencies *p*. For convenience's sake, we approximated the effective dimension in the same manner as Verweij and van Houwelingen [15] as

*c*(*λ*) = trace[(*H*^{λ}(*β*^{λ}))^{-1}∑_{i} *U*_{i}(*β*^{λ})*U*_{i}(*β*^{λ})'], (15)

where *U*_{i}(*β*^{λ}) is the contribution of individual *i* to the first-order derivative of the unpenalized log-likelihood, and *H*^{λ}(*β*^{λ}) is the matrix of second-order derivatives of the penalized log-likelihood, evaluated at *β*^{λ}, *H*_{0}(*t*), and *p*. Notice that the last term of (15) is equal to the third term of (14).
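An effective-dimension approximation in this spirit can be sketched with a few lines of linear algebra; the score vectors and the penalized information-style matrix below are made-up inputs, not output of the actual model fit:

```python
import numpy as np

def effective_dimension(scores, hessian_pen):
    """c(lambda) = trace(H_pen^{-1} @ sum_i U_i U_i'): a Verweij-van Houwelingen
    style approximation. scores holds the per-individual score vectors U_i;
    hessian_pen is the penalized information-style matrix (minus the matrix of
    second derivatives of the penalized log-likelihood)."""
    outer_sum = sum(np.outer(u, u) for u in scores)
    return float(np.trace(np.linalg.solve(hessian_pen, outer_sum)))

# If H_pen equals sum_i U_i U_i' itself, the trace is the parameter dimension.
U = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
H = sum(np.outer(u, u) for u in U)
print(effective_dimension(U, H))  # 2.0
```

A heavier penalty inflates `hessian_pen` and therefore shrinks the effective dimension, which is what the CVL criterion trades off against the fitted log-likelihood.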

Although the penalized likelihood estimates of *β* are somewhat biased, and it is therefore somewhat unclear how to interpret standard errors, we nevertheless assessed the stability of the penalized estimates by a parametric bootstrap procedure. We took 200 bootstrap samples, estimated *λ* in each sample by optimizing CVL, and derived standard errors from the distribution of the associated penalized estimates of *β*, *p*, and *H*
_{0}(*t*). The number of bootstrap samples used was based on results from simulations, in which the number of bootstrap samples varied from 10 to 1000. These data showed that SE estimates were relatively stable after 100 to 200 bootstrap samples.

The EM algorithm presented in this paper was programmed in MATLAB^{®} 7.0 (The MathWorks, Natick, MA, USA) as well as in a set of R functions, and is freely available upon request from the corresponding author.