### Liability model and viability selection

Let us define a continuous variable

*y*
_{
j
} as the liability for individual

*j*,

where

*ε*
_{
j
} ~

*N*(0,1) is a residual error with a standardized normal distribution. Other model effects are defined as follows. There may be some effects not related to genetics, such as age, location and other systematic effects, and these effects are captured by

*β* and the design matrix

*X*. There are

*p* genetic loci each with an effect

*γ*
_{
k
} for

*k* = 1, ...,

*p*. The value of

*Z*
_{
jk
} is determined by the genotype of individual

*j* at locus

*k*. For example, an F

_{2} individual derived from the cross of two inbred lines can take one of three genotypes,

*A*
_{1}
*A*
_{1},

*A*
_{1}
*A*
_{2} and

*A*
_{2}
*A*
_{2}. Under the additive genetic model,

*Z*
_{
jk
} is defined as

and

*γ*
_{
k
} =

*a*
_{
k
} is the additive genetic effect for locus

*k*. Under the dominance effect model, the genetic effect for locus

*k* is a 2 × 1 vector

*γ*
_{
k
} = [

*a*
_{
k
}
*d*
_{
k
}]

^{
T
}, where

*d*
_{
k
} is called the dominance effect. The corresponding

*Z* variable is also a vector and defined as

where

*H*
_{
i
} is the

*i*-th row of matrix

*H*, as shown below,

The liability

*y*
_{
j
} is not observed but it determines the viability of individual

*j*. It is assumed that individual

*j* will survive if

*y*
_{
j
} > 0 and die otherwise. Since we can only observe the surviving individuals, all individuals in the sample have liabilities greater than zero. This will cause the selected population to deviate from the expected Mendelian segregation ratio for loci responsible for viability selection and all loci linked to the viability loci. Although all individuals have survived, some may have a high liability and some may have a low liability, but all have a liability greater than zero. We now use the concept of penetrance to describe the survivability of an individual. Let

be the expectation of the unobserved liability (a linear predictor). We use the normal or the logistic function to model the probability of survival for individual

*j*, i.e., Φ(

*η*
_{
j
}) or logistic(

*η*
_{
j
}) = exp(

*η*
_{
j
})/[1 + exp(

*η*
_{
j
})]. Conditional on the genotypes of all other loci, the penetrances for the three genotypes of locus

*k* are defined as

is the linear predictor excluding locus

*k*. This model was first introduced by Luo et al. (2005) for single locus analysis, which does not include

*η*
_{
j(-k)
} in equation (6). The data that allow us to estimate

*γ*
_{
k
} is the genotype array for all individuals at locus

*k*. Define

as a multivariate Bernoulli variable with three categories (i.e., a multinomial variable with sample size one). If individual

*j* has a genotype

*A*
_{1}
*A*
_{1}, then

*w*
_{j(11) }= 1 and

*w*
_{j(12) }=

*w*
_{j(22) }= 0. The probabilities of individual

*j* taking the three genotypes are derived from the Bayes' theorem,

is the mean of the three penetrances and

is the expected Mendelian ratio. In an F_{2} population, the expected Mendelian ratio is
. Note that if *γ*
_{
k
} = 0, vector *π*
_{
j
} = [*π*
_{j(11) }
*π*
_{j(12) }
*π*
_{j(22)}] will be equivalent to the expected Mendelian ratio for every individual at the locus.

If there is no factor to be considered other than the markers, the term

*X*
_{
j
}
*β* should disappear here. This is different from the usual linear regression analysis where an intercept should always appear in the model. With the liability selection model, there is no intercept. We now assume only one co-factor to consider. The

*X*
_{
j
} variable can be discrete or continuous, but the distribution in the unselected population must be known. In this study, we first assume that

*X*
_{
j
} is discrete, say gender, a variable indicating the gender of individual

*j* with

*X*
_{
j
} = 1 representing male and

*X*
_{
j
} = -1 representing female. In the unselected population, the sex ratio should be 1:1. If the population evaluated has a biased sex ratio, this means that the gender has an effect on the liability. We can estimate the gender effect

*β* on the liability. Let

be the expected sex ratio (prior to the selection). Define

*ξ*
_{j(1) }or

*ξ*
_{j(2) }as the posterior probability that individual

*j* is male or female, respectively. These posteriors are calculated using

is the mean penetrance of the two genders and

is the linear predictor excluding the gender effect.

We now assume that

*X*
_{
j
} is a continuous non-genetic effect, e.g., age. Let us assume that

*X*
_{
j
} follows a normal distribution in the unselected population, i.e.,

*p*(

*X*
_{
j
}) =

*N*(

*X*
_{
j
}|

*μ*,

*σ*
^{2}), where

*μ* and

*σ*
^{2} are known. Let

*β* be the effect of

*X*
_{
j
} on the liability. Define Φ(

*X*
_{
j
}
*β* +

*η*
_{j(-β)}) as the probability that individual

*j* has survived the selection. The posterior probability is defined as

Proof of this equation (16) is straightforward and thus given in the next paragraph.

Let

*f*(

*X*
_{
j
}) =

*N*(

*X*
_{
j
}|

*μ*,

*σ*
^{2}) be the normal density for variable

*X*
_{
j
} with known

*μ* and

*σ*
^{2}. The following Lemma [

27] is used to derive equation (

16).

Let us rewrite equation (

16) as

Comparing equation (

18) with equation (

17), we can see that

*ξ* = -

*η*
_{j(-β)}/

*β* and

*λ*
^{2} = 1/

*β*
^{2}. Substituting these into equation (

17), we get

This concludes the derivation of equation (16) presented in the previous paragraph.

### Likelihood, prior and posterior

It is difficult (if not impossible) to construct the joint likelihood for all loci, but conditional on the effects and the genotypes of other loci, the likelihood for locus

*k* can be derived based on the multivariate Bernoulli distribution, that is

The exact notation for this log likelihood should be

*L*(

*γ*
_{
k
}|

*η*
_{(-k)}) because it is conditioned on the gender effect and effects of other loci. We use the simplified notation to improve the readability. Let us assign a normal prior to

*γ*
_{
k
}, i.e.,

Furthermore, we assign a hierarchical prior to ∑

_{
k
},

where

*τ* is the prior degree of freedom and

*ω* is the prior scale matrix with the same dimension as ∑

_{
k
}. The reason for assigning these prior distributions is to handle a possible large number of loci involved in the model. Uniform prior for the gender effect is assumed. The log posterior (denoted by LogPost) is

where a constant has been ignored.

For the sex effect (discrete co-factor), the likelihood for

*β* conditional on

*η*
_{j(-β) }is

For the continuous co-factor, the log likelihood for parameter

*β* can be written as

Prior distribution for the non-genetic effect is assumed to be uniform (uninformative prior) and thus only the likelihood is needed to find the posterior mode estimate of *β*.

### Posterior mode estimation

Due to the possible large number of parameters, we take a sequential approach to estimating the posterior mode parameters with one locus at a time. This approach is also called the coordinate descent algorithm. Once the parameters of all loci are updated, the sequence is repeated until a certain criterion of convergence is reached.

Let us define the first step of the Newton-Raphson iteration as

and denote the variance of this updated parameter by

where the first and second partial derivatives are evaluated at

. The posterior mean and posterior variance matrix for

*γ*
_{
k
} at iteration

*t* are denoted by

and var(

*γ*
_{
k
}) =

*V*
_{
k
}, respectively. Since the posterior distribution of

*γ*
_{
k
} is approximately multivariate normal (asymptotical theory), the posterior mean is identical to the posterior mode. The posterior of ∑

_{
k
} remains scaled inverse Wishart due to the conjugate property of the prior. Therefore, the posterior mode of ∑

_{
k
} is

where τ + 1 is the degree of freedom for the inverse Wishart posterior and the number 2 represents the dimension of vector *γ*
_{
k
}.

The posterior mode estimation of

*β* conditional on the effects of all loci is

with an estimation error variance approximated by

The iteration process of the posterior mode estimation is summarized as follows.

Step 0: Initialize all parameters.

Step 1: Update the non-genetic effect using equation (29).

Step 2: Update effect of marker *k* for *k* = 1, ⋯, *p* using equation (26).

Step 3: Update ∑_{
k
} for *k* = 1, ⋯, *p* using equation (28).

Step 4: Repeat step 1 to step 3 until the iteration process converges.

### Genetic contribution from an individual locus

An obvious advantage of the liability model is that we are able to calculate the proportion of the liability variance contributed by each SDL, similar to the proportion of quantitative trait variance contributed by each QTL. Suppose that we have detected one SDL with both additive and dominance effects. The theoretical variances of the

*Z* variables in an F

_{2} population are 0.5 for the additive part and 1.0 for the dominance part. The reason is that the three genotypes are coded as +1, 0 and -1 for the additive

*Z* and -1, 1 and -1 for the dominance

*Z* [

28]. Let

*a*
_{
k
} and

*d*
_{
k
} be the additive and dominance effects of this SDL. The genetic variance explained by this locus is

The residual variance of the liability is set at unity and thus the variance of the liability is

The broad sense heritability is defined as

This is the proportion of the liability variance contributed by the

*k*th SDL. Assuming that the multiple SDL are not closely linked, the overall contribution from all SDL is approximated by

The liability model has unified QTL mapping and SDL mapping in the same framework of quantitative genetics.