On coding genotypes for genetic markers with multiple alleles in genetic association study of quantitative traits

Wang, Tao

doi:10.1186/1471-2156-12-82

Methodology article
Open access
Published: 21 September 2011

BMC Genetics volume 12, Article number: 82 (2011)

Abstract

Background

In genetic association studies of quantitative traits using F_∞ models, how the marker genotypes are coded and how the model parameters are interpreted matter for constructing hypothesis tests and making statistical inferences. To date, the coding of marker genotypes in F_∞ models has mainly focused on the biallelic case. A thorough treatment of genotype coding and parameter interpretation for F_∞ models is needed, especially for genetic markers with multiple alleles.

Results

In this study, we will formulate F_∞ genetic models under various regression model frameworks and introduce three genotype coding schemes for genetic markers with multiple alleles. Starting from an allele-based modeling strategy, we first describe a regression framework to model the expected genotypic values at given markers. Then, as extension from the biallelic case, we introduce three coding schemes for constructing fully parameterized one-locus F_∞ models and discuss the relationships between the model parameters and the expected genotypic values. Next, under a simplified modeling framework for the expected genotypic values, we consider several reduced one-locus F_∞ models from the three coding schemes on the estimability and interpretation of their model parameters. Finally, we explore some extensions of the one-locus F_∞ models to two loci. Several fully parameterized as well as reduced two-locus F_∞ models are addressed.

Conclusions

The genotype coding schemes provide different ways to construct F_∞ models for association testing of multi-allele genetic markers with quantitative traits. Which coding scheme to apply depends on how conveniently it supports statistical inference on the parameters of research interest. Based on these F_∞ models, standard regression model fitting tools can be used to estimate and test for various genetic effects through statistical contrasts, with adjustment for environmental factors.

Background

Genetic markers with multiple alleles are common in genetic studies. It is well known that the ABO blood types in humans are determined by three alleles at a genetic locus on chromosome 9. Molecular markers such as microsatellites often have multiple alleles. The major histocompatibility complex (MHC), a highly polymorphic genome region on human chromosome 6, encompasses multiple genes that encode many human leukocyte antigens (HLA) and play an important role in the regulation of immune responses. Depending on the resolution level of allele typing, each of the HLA-A, B, C, DR, DQ and DP gene loci could contain tens to hundreds of allele types. In addition, in the haplotype analysis of single-nucleotide polymorphisms (SNPs), the haplotypes from a set of SNPs can also be treated as different alleles of a 'super' marker locus that consists of the set of SNPs.

Presently, three types of genetic models are commonly used in the genetic analysis of quantitative traits. The first is Fisher's analysis of variance (ANOVA) models, which focus on a decomposition of the genotypic variance into genetic variance components contributed by various genetic effects at quantitative trait loci (QTL) [1–6]. The second is the F_∞ models, which concentrate on direct statistical modeling of the expected genotypic values at target genetic markers or QTL and on association testing of various genetic effects. The third is the so-called functional genetic models, which emphasize modeling the functional effects of genes [7]. Both Fisher's and F_∞ models can be referred to as statistical models, while the functional genetic models have fundamentally different objectives and estimation methods from the statistical models. The distinction between these types of genetic models has been discussed extensively [8–11].

The F_∞ models have been widely used in genetic association studies of quantitative traits. In building F_∞ models, how to code genotypes at a marker (or QTL) and interpret the model parameters are fundamental issues for constructing appropriate testing hypotheses and making correct statistical inferences. While the Fisher's ANOVA models can be directly applicable to genetic markers with multiple alleles, the F_∞ models by contrast have been mainly discussed in the biallelic case [1, 9, 12]. For haplotype analysis, Zaykin et al. in [13] proposed a simple coding which included only the additive effects of haplotypes but ignored their interactions. More recently, Yang et al. in [11] explored an extension of the biallelic F_∞ models to multi-allele models with a focus on the definition of various genetic effects and their relationships with the average genetic effects defined in the Fisher's models. A thorough work on coding of marker genotypes and interpretation of model parameters for F_∞ models has not been done in the past especially for genetic markers with multiple alleles.

In general, there are two different strategies for coding the marker or QTL genotypes. One is to treat each marker or QTL as a potential risk factor with its genotypes as the risk units. Then, similar to the handling of categorical covariates in classical regression models, at each locus we can create one dummy variable per genotype and include all but one (the reference) of these dummy variables in a model. But this genotype coding is often limited by the available sample sizes, especially when the number of alleles at the marker locus is large. Alternatively, as alleles are often supposed to be the basic genetic risk units that may contribute to disease phenotypes in genetic studies, we may want to treat alleles at each marker or QTL as the risk units and examine the effects of alleles. However, genetic data have some special features that need to be taken into account in order to build allele-based models. In the genome of diploid species such as humans, alleles normally appear in pairs to form a genotype at each marker locus or QTL, with one from the father and one from the mother, except for the sex chromosomes in males. That is, at each locus we have two within-locus risk factors that reside on a homologous pair of chromosomes. Unlike the classical two-way ANOVA model, in which the two risk factors have different risk units, the paternal and maternal risk factors at a locus often share the same set of alleles. Besides, the parental origins (i.e., the phase) of the two alleles at each locus are quite often unknown. These features could sometimes complicate the allele-based coding of marker genotypes and generate confusion in the interpretation of the model parameters.

In this study, we introduce three allele-based coding schemes for building F_∞ models, namely allele, F_∞ and allele-count codings. First, we formulate F_∞ models under a general regression framework to model the expected genotypic values at given markers or QTL. Then, under a standard ANOVA model setting, we present several fully parameterized one-locus models using the three allele-based coding schemes. Some potential collinearity relationships among the coding variables of the marker genotypes are clarified. Strategies to avoid the redundant model parameters are also proposed. After that, we examine the definition of model parameters under a reduced one-locus model framework. The impact of a linear relationship among the coding variables of marker genotypes on the estimability of the model parameters is fully explored based on the linear model theory. Finally, we consider extension of the one-locus models to two-locus situation. Several fully parameterized as well as reduced two-locus models are addressed. A focus of this study is to establish the relationships between the model parameters and the expected genotypic values at given marker loci or QTL for various F_∞ models from these three coding schemes under various different model frameworks, and explain how to estimate and test for various genetic effects through statistical contrasts. Relationships among different coding schemes and models are also illustrated through simulation.

Results

Fully parameterized one-locus models

In genetic studies, a quantitative trait Y is typically considered as a combination of a genetic component G and an environmental component E with perhaps the genetic by environmental interactions G × E, where G is the true genotypic value from a joint (unobservable) contribution of all the genetic factors to the quantitative trait Y. In practice, given a random sample of N individuals from a study population, let g_i be the observed genotypes at certain target marker loci or QTL and z_i be a vector of some environmental covariates that may contribute to the variation of the quantitative trait for individuals i = 1, ..., N. By ignoring the genetic by environmental interactions and assuming that the genotypic value G and environmental component E do not depend on the environmental covariates z_i and g_i , respectively, then the observed quantitative trait y_i of an individual i can be expressed through a regression model as

y_{i} = G (g_{i}) + z_{i} β + e_{i}, i = 1, \dots, N

(1)

where G(g_i ) = E(G|g_i ) is the expected genotypic value of G given the marker (or QTL) genotypes g_i , β denotes the effects of the environmental covariates, and e_i is the residual error of the model with E(e_i ) = 0. Similar to introducing dummy variables for the covariates z_i which allow us to assess various environmental effects β in the model, it is convenient to further represent G(g_i ) as G(g_i ) = x(g_i )α so that we can fit the regression model and assess the genetic effects α of the markers or QTL, where x(g_i ) is a coding function of the marker genotypes. When the marker locus is not associated with the phenotype, then G(g_i ) = E(G) is a constant which does not depend on g_i . In the rest of the paper, we will focus on the interpretation of the marker effects α in terms of the expected genotypic values G(g) = E(G|g) according to different coding schemes. When certain genetic by environmental interactions are included in the model, the interpretation of α could be modified accordingly. It has to be pointed out that QTL are generally assumed to be unknown genomic regions that may contribute to the variation of the quantitative traits with their genotypes unobserved. But the results (i.e., the coding schemes and the relationships between the model parameters and the expected genotypic values) are held for QTL as well, although the expected genotypic values at a target QTL can no longer be directly estimated via fitting the regression models.

Now, consider one target marker locus with multiple alleles A₁, ..., A_m , m ≥ 2. In general, there are m possible homozygous genotypes A_jA_j , j = 1, ..., m, and m(m - 1)/2 possible heterozygous genotypes A_jA_k , j ≠ k. Let G_jk = E(G|g = A_jA_k ) be the expected genotypic values, given the marker genotypes A_jA_k in a study population. Without knowing the parental origins of the alleles, we assume as usual that the parental origin of the alleles does not make a difference (i.e., no imprinting). We then have G_jk = G_kj for j, k = 1, ..., m, and there are in total m(m + 1)/2 possibly distinct expected genotypic values G_jk , j, k = 1, ..., m, which could be estimated through the means in the genotypic subgroups after adjustment for the environmental covariates. Here we assume no missing genotypes for the sampled individuals, and that all possible genotypes are represented in the random sample. How to handle missing genotypes will be discussed in the Discussion. To fully re-parameterize these expected genotypic values through a linear model, we then need in total m(m + 1)/2 parameters, including the intercept, in the model. By treating the paternal and maternal alleles as two independent risk factors and following the classical two-way ANOVA notation, we can represent the genotypic values G_jk as

G_{j k} = μ^{*} + α_{j}^{*} + α_{k}^{*} + δ_{j k}^{*}, j, k = 1, \dots, m

(2)

where $α_{j}^{*}$ and $δ_{j k}^{*}$ are the realized (but unobservable) additive effects of allele A_j and the allelic interaction between the two alleles A_j and A_k , respectively. The above model is different from the classical two-way ANOVA model in that here both the paternal and the maternal risk factors share the same set of alleles A₁, ..., A_m . As usual, with the unknown parental origins of alleles at the locus, we assume the paternal and maternal alleles have the same genetic effect. More precisely, the paternal allele A_j and maternal allele A_j have the same additive allelic effects $α_{j}^{*}$ for j = 1, ..., m. Besides, the allelic interaction between a paternal allele A_j and a maternal allele A_k is the same as that between the paternal allele A_k and the maternal allele A_j ; i.e., $δ_{j k}^{*} = δ_{k j}^{*}$ , for j, k = 1, ..., m. Still, with m additive allelic effects and m(m + 1)/2 allelic interactions plus the intercept, it is clear that model (2) is over-parameterized for modeling the m(m + 1)/2 expected genotypic values G_jk for j, k = 1, ..., m. As a result, the parameters μ*, $α_{j}^{*}$ and $δ_{j k}^{*}$ in model (2) are not all estimable in terms of the expected genotypic values G_jk (see [14, 15]).
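The counting argument above can be checked by enumeration. The following is a minimal sketch, assuming alleles are labelled by the integers 1, ..., m; the function names are hypothetical and chosen here only for illustration.

```python
from itertools import combinations_with_replacement

def count_genotypic_values(m):
    """Number of unordered genotypes A_j A_k (j <= k) at a locus with m alleles."""
    return len(list(combinations_with_replacement(range(1, m + 1), 2)))

def count_model2_parameters(m):
    """Intercept + m additive effects + m(m+1)/2 symmetric interactions in model (2)."""
    return 1 + m + m * (m + 1) // 2

for m in (2, 3, 4):
    n_values = count_genotypic_values(m)
    n_params = count_model2_parameters(m)
    # The excess n_params - n_values = m + 1 is the number of redundant parameters.
    print(m, n_values, n_params, n_params - n_values)
```

For every m the parameter count exceeds the number of expected genotypic values by m + 1, which is why constraints or dropped terms are needed before the model can be fitted.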

In order to avoid the inestimability issue, one way is to add constraints on the model parameters. However, those constraints, together with the symmetry property of $δ_{j k}^{*}$ , could make it difficult to fit the model using the standard software package such as SAS. Alternatively, we consider dropping certain redundant parameters in the model. Similar to the biallelic case [10], let us first introduce the following indicator variables to describe the transmission of alleles from parents to their offspring

z_{1 j} = \{\begin{matrix} 1, inherited A_{j} on paternal gamete, \\ 0, inherited other alleles on paternal gamete \end{matrix}

and

z_{2 j} = \{\begin{matrix} 1, inherited A_{j} on maternal gamete, \\ 0, inherited other alleles on maternal gamete \end{matrix}

for each allele type A_j , j = 1, ..., m. Then we define the following coding variables of the marker genotypes

\begin{gathered} w_{j} (g) = z_{1 j} + z_{2 j} = \{\begin{matrix} 2, if g = A_{j} A_{j} \\ 1, if g = A_{j} A_{j}^{c} \\ 0, if g = A_{j}^{c} A_{j}^{c} \end{matrix} \\ v_{j k} (g) = z_{1 j} z_{2 k} = \{\begin{matrix} 1, if g = A_{j} A_{k} \\ 0, otherwise \end{matrix} \end{gathered}

for j, k = 1, ..., m, where $A_{j}^{c}$ denotes any other allele type except A_j . Note that z_1j and z_2j are not observable because, without parental information, we do not know exactly which allele is inherited from the paternal or the maternal gamete for the sampled individuals. But this unknown phase problem does not affect the definitions of w_j and v_jk , since w_j only counts the number of copies of allele A_j in the genotype and the value of v_jk is 1 when the genotype is A_jA_k and 0 otherwise, regardless of where the two alleles come from. We refer to the above coding of marker genotypes as an allele coding scheme. Model (2) can then be re-written in a linear model form as

G (g_{i}) = μ^{*} + \sum_{j = 1}^{m} α_{j}^{*} w_{j} (g_{i}) + \sum_{j = 1}^{m} \sum_{k = j}^{m} δ_{j k}^{*} v_{j k} (g_{i})

(3)

for i = 1, ..., N. As each individual always carries two alleles at a marker locus with one from the father and the other from the mother, we have $\sum_{j = 1}^{m} z_{1 j} (g_{i}) = \sum_{k = 1}^{m} z_{2 k} (g_{i}) = 1$ , for any i = 1, ..., N. Therefore, given a particular j, $w_{j} = 2 - \sum_{k \neq j} w_{k}$ , which is a linear combination of the rest of {w_k , k ≠ j}. For v_jk , we also have $\sum_{j = 1}^{m} v_{j k} = z_{2 k}$ , or $v_{j k} = w_{k} / 2 - \sum_{l \neq j} v_{l k}$ . Hence, each of the v_jk , k = 1, ..., m, is also a linear combination of the coding variables {w_k , k ≠ j} and {v_lk , l, k ≠ j}. To avoid the redundancy of parameters due to these collinearity relationships among the coding variables in model (3), without loss of generality, we consider dropping w_m and {v_km , k = 1, ..., m} in (3). Then

G (g_{i}) = μ + \sum_{j = 1}^{m - 1} α_{j} w_{j} (g_{i}) + \sum_{j = 1}^{m - 1} \sum_{k = j}^{m - 1} δ_{j k} v_{j k} (g_{i})

(4)

for i = 1, ..., N. Model (4) now provides a full re-parameterization of the m(m + 1)/2 expected genotypic values G_jk for j, k = 1, ..., m, whose parameters α_j can be referred to as the additive allelic effects and δ_jk as the allelic interactions with respect to the reference allele A_m . Given a random sample, we can then incorporate model (4) into (1) and fit the regression model (1) using the standard least-squares approach. In terms of the expected genotypic values, it is easy to show that μ = G_mm , α_j = G_jm - G_mm and δ_jk = (G_jk - G_km ) - (G_jm - G_mm ), for j = 1, ..., m - 1 and k = j, ..., m - 1. Therefore, the additive allelic effect α_j can be interpreted as the substitution effect of replacing allele A_m by A_j when paired with another allele A_m to form the genotypes. Meanwhile, the allelic interaction δ_jk is the difference between the substitution effect of replacing allele A_m by A_j (or A_k ) when paired with allele A_k (or A_j ) and that when paired with allele A_m . Note that dropping w_j and {v_kj , k = 1, ..., m} for a particular j ≠ m instead of w_m and {v_km , k = 1, ..., m} can lead to similar interpretations of the model parameters with A_j being the reference allele. Using model (4), we can also estimate and test for various other genetic effects. For example, the so-called functional 'additive effects' $a_{j k}^{*} = (G_{j j} - G_{k k}) ∕ 2$ and the 'dominance effects' $d_{j k}^{*} = G_{j k} - (G_{j j} + G_{k k}) ∕ 2$ , j ≠ k defined in [11] can be expressed as $a_{j k}^{*} = (α_{j} - α_{k}) + (δ_{j j} - δ_{k k}) ∕ 2$ and $d_{j k}^{*} = δ_{j k} - (δ_{j j} + δ_{k k}) ∕ 2$ , j ≠ k, respectively, in terms of the above model parameters for j, k = 1, ..., m - 1. Here the intercept and the additive terms cancel in $d_{j k}^{*}$ because G_jk = μ + α_j + α_k + δ_jk and G_jj = μ + 2α_j + δ_jj . 
So we can estimate $a_{j k}^{*}$ , $d_{j k}^{*}$ using the fitted model parameters or test the hypothesis $H_{0} : a_{j k}^{*} = 0$ or $H_{0} : d_{j k}^{*} = 0$ through general linear contrasts [15] using standard software such as SAS. To test whether a particular allele A_j has an overall effect, the null hypothesis is H₀ : α_j = δ_jk = 0 for k = 1, ⋯, m - 1, which can be tested through either a general linear contrast or a likelihood ratio test, with m degrees of freedom for the test statistic. The association test for overall effects of the locus corresponds to the null hypothesis H₀ : α_j = δ_jk = 0 for all j, k = 1, ⋯, m - 1, with m(m + 1)/2 - 1 degrees of freedom for the test statistic.

Currently, the so-called F_∞ model has been widely used in genetic association studies. In the simple biallelic case with two alleles A and a, an F_∞ model gives [16–19]

G_{A A} = τ + a, G_{A a} = τ + d, G_{a a} = τ - a

where G_AA = E(G|AA), G_Aa = E(G|Aa) and G_aa = E(G|aa) are the three possible expected genotypic values at the marker. The parameters a, d are often referred to as the additive and dominance effects of the allele A over a, and in terms of the expected genotypic values we have a = (G_AA - G_aa )/2 and d = G_Aa - (G_AA + G_aa )/2. This F_∞ model can also be written in a linear model form as [10]

G (g_{i}) = τ + a f (g_{i}) + d h (g_{i}), i = 1, \dots, N

where f, h are two coding variables of the marker genotypes that are defined as

\begin{gathered} f (g) = \{\begin{matrix} 1, & if g = A A \\ 0, & if g = A a \\ - 1, & if g = a a \end{matrix} \\ h (g) = \{\begin{matrix} 1, & if g = A a \\ 0, & otherwise \end{matrix} \end{gathered}

We refer to the above coding of the marker genotypes as the F_∞ coding. As a straightforward extension of the F_∞ coding scheme to multiple alleles, we can define the following coding variables

\begin{gathered} f_{j} (g) = \{\begin{matrix} 1, & if g = A_{j} A_{j} \\ 0, & if g = A_{j} A_{j}^{c} \\ - 1, & if g = A_{j}^{c} A_{j}^{c} \end{matrix} \\ h_{j} (g) = \{\begin{matrix} 1, & if g = A_{j} A_{j}^{c} \\ 0, & otherwise \end{matrix} \end{gathered}

for each j = 1, ..., m. It is easy to see that f_j , h_j and the previous w_j , v_jk , j, k = 1, ..., m have the relationships: f_j (g) = w_j (g) - 1, h_j (g) = w_j (g) - 2v_jj (g), and v_jk (g) = h_j (g)h_k (g) as j ≠ k. Thus, for the same reason to avoid collinearity, we can exclude some redundant coding variables and write a fully parameterized one-locus model using the F_∞ coding as

\begin{gathered} G (g_{i}) = τ + \sum_{j = 1}^{m - 1} a_{j} f_{j} (g_{i}) + \sum_{j = 1}^{m - 1} d_{j j} h_{j} (g_{i}) \\ + \sum_{j = 1}^{m - 1} \sum_{k = j + 1}^{m - 1} d_{j k} h_{j} (g_{i}) h_{k} (g_{i}) \end{gathered}

(5)

for i = 1, ..., N. By requiring model (5) to be equivalent to (4), we can first build the relationships between the two sets of model parameters and then establish the relationships between the parameters of model (5) and the expected genotypic values as follows

\{\begin{matrix} τ = μ + \sum_{j = 1}^{m - 1} (α_{j} + \frac{δ_{j j}}{2}) \\ = G_{m m} + \frac{1}{2} \sum_{j = 1}^{m - 1} (G_{j j} - G_{m m}) \\ a_{j} = α_{j} + \frac{δ_{j j}}{2} = \frac{G_{j j} - G_{m m}}{2}, j = 1, \dots, m - 1 \\ d_{j j} = - \frac{δ_{j j}}{2} = G_{j m} - \frac{G_{j j} + G_{m m}}{2}, j = 1, \dots, m - 1 \\ d_{j k} = δ_{j k} = (G_{j k} - G_{j m}) - (G_{k m} - G_{m m}), j \neq k \end{matrix}

Therefore, a_j can be interpreted as a half of the difference between the two expected homozygous genotypic values G_jj and G_mm , which is the same as the additive effect $a_{j m}^{*}$ defined in [11]. Besides, d_jj is the difference between the expected heterozygous genotypic value G_jm and the averaged expected homozygous genotypic value (G_jj + G_mm )/2, which is the same as the dominance effect $d_{j m}^{*}$ defined in [11]. It is interesting to see that d_jk , j ≠ k, has the same interpretation as δ_jk in model (4), which is the difference between the substitution effects of replacing allele A_m by A_j when paired with alleles A_k and A_m . Note that d_jj can also be interpreted as the allelic interaction - the difference between the substitution effects of replacing allele A_j by A_m when paired with another A_j and A_m . In addition, based on model (5), the additive effects $a_{j k}^{*}$ and the dominance effects $d_{j k}^{*}$ proposed in [11] have the relationship with the model parameters: $a_{j k}^{*} = a_{j} - a_{k}$ , $d_{j k}^{*} = d_{j k} + (d_{j j} + d_{k k})$ , j ≠ k. The overall effect of a particular allele A_j can be tested through the composite hypothesis of H₀ : a_j = d_jk = 0 for k = 1, ⋯, m - 1, and the overall effects of the locus can be tested via the null hypothesis of H₀ : a_j = d_jk = 0 for any j, k = 1, ⋯, m - 1.
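The relationships between the parameters of model (5) and the expected genotypic values can be checked numerically. The sketch below is illustrative only: the genotypic values in `G` are made-up numbers for m = 3, alleles are labelled 1, ..., m with A_m as the reference, and the helper names (`Gv`, `f`, `h`, `model5`) are hypothetical.

```python
from fractions import Fraction

m = 3
# Hypothetical expected genotypic values G_jk, keyed by (j, k) with j <= k.
G = {(1, 1): 10, (1, 2): 7, (1, 3): 4, (2, 2): 9, (2, 3): 6, (3, 3): 2}
G = {g: Fraction(value) for g, value in G.items()}

def Gv(j, k):
    """Look up G_jk regardless of the order of j and k."""
    return G[(min(j, k), max(j, k))]

def f(j, g):
    """F-infinity additive code: number of copies of A_j minus one."""
    return (g[0] == j) + (g[1] == j) - 1

def h(j, g):
    """1 if genotype g carries exactly one copy of A_j, else 0."""
    return int((g[0] == j) != (g[1] == j))

ref = range(1, m)  # non-reference alleles 1, ..., m-1
tau = Gv(m, m) + sum(Gv(j, j) - Gv(m, m) for j in ref) / 2
a = {j: (Gv(j, j) - Gv(m, m)) / 2 for j in ref}
d = {(j, j): Gv(j, m) - (Gv(j, j) + Gv(m, m)) / 2 for j in ref}
d.update({(j, k): (Gv(j, k) - Gv(j, m)) - (Gv(k, m) - Gv(m, m))
          for j in ref for k in ref if j < k})

def model5(g):
    """Evaluate the right-hand side of model (5) at genotype g."""
    total = tau + sum(a[j] * f(j, g) + d[(j, j)] * h(j, g) for j in ref)
    total += sum(d[(j, k)] * h(j, g) * h(k, g) for (j, k) in d if j < k)
    return total

fitted = {g: model5(g) for g in G}
print(fitted == G)  # the six genotypic values are reproduced exactly
```

With these parameter values the model returns each of the six G_jk exactly, and the contrasts a_1 - a_2 and d_12 + d_11 + d_22 recover (G_11 - G_22)/2 and G_12 - (G_11 + G_22)/2, respectively.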

In addition to the allele and F_∞ codings, another way of coding the marker genotypes, which occasionally appears in practice, is to count the number of copies of each specific allele A_j in the marker genotype. As each individual can have 0, 1 or 2 copies of an allele A_j , by taking the genotypic group with 0 copies of allele A_j as the baseline, we can introduce the following two indicator (or dummy) variables for the genotypic groups with 1 and 2 copies of the allele A_j , respectively.

\begin{gathered} h_{1 j} (g) = \{\begin{matrix} 1, if g = A_{j} A_{j}^{c} \\ 0, otherwise \end{matrix} \\ h_{2 j} (g) = \{\begin{matrix} 1, if g = A_{j} A_{j} \\ 0, otherwise \end{matrix} \end{gathered}

for each j = 1, ..., m - 1. These coding variables of marker genotypes have relationships h_1j(g) = h_j (g) = w_j (g) - 2v_jj (g) and h_2j(g) = v_jj (g) with previous ones. We refer to this coding of marker genotypes as the allele-count coding. Similar to models (4) and (5), by excluding some redundant coding variables, the allele-count coding leads to another fully parameterized one-locus model as

\begin{gathered} G (g_{i}) = π_{0} + \sum_{j = 1}^{m - 1} π_{j} h_{1 j} (g_{i}) + \sum_{j = 1}^{m - 1} η_{j j} h_{2 j} (g_{i}) \\ + \sum_{j = 1}^{m - 1} \sum_{k = j + 1}^{m - 1} η_{j k} h_{1 j} (g_{i}) h_{1 k} (g_{i}) \end{gathered}

(6)

for i = 1, ..., N. Similarly, by having model (6) equivalent to (4), we can establish the following relationships

\{\begin{matrix} π_{0} = μ = G_{m m} \\ π_{j} = α_{j} = G_{j m} - G_{m m}, j = 1, \dots, m - 1 \\ η_{j j} = 2 α_{j} + δ_{j j} = G_{j j} - G_{m m}, j = 1, \dots, m - 1 \\ η_{j k} = δ_{j k} = (G_{j k} - G_{j m}) - (G_{k m} - G_{m m}), j \neq k \end{matrix}

Therefore, π_j in model (6) can still be interpreted as the substitution effect of replacing allele A_m by A_j when paired with allele A_m , or the difference between the genotypic values of the genotype group A_jA_m with one copy of A_j versus the genotype group A_mA_m (baseline). η_jj is the difference between the expected genotypic value G_jj in the homozygous genotypic group A_jA_j with two copies of A_j and G_mm in the baseline group A_mA_m . Besides, η_jk in model (6) has the same interpretation as δ_jk (or d_jk ) before. From model (6), the general additive effects are $a_{j k}^{*} = (η_{j j} - η_{k k}) ∕ 2$ and the dominance effects are $d_{j k}^{*} = (π_{j} + π_{k}) + η_{j k} - (η_{j j} + η_{k k}) ∕ 2$ , j ≠ k, which follow from G_jk = π₀ + π_j + π_k + η_jk and G_jj = π₀ + η_jj and can be tested either separately or jointly. The overall effect of a particular allele A_j can be tested through the composite hypothesis of H₀ : π_j = η_jk = 0 for k = 1, ⋯, m - 1. The overall effects of the locus can also be tested via the null hypothesis of H₀ : π_j = η_jk = 0 for all j, k = 1, ⋯, m - 1.
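A similar numerical check applies to the allele-count coding. The sketch below is a hedged illustration with made-up genotypic values for m = 3; the names `copies`, `h1`, `h2` and `model6` are hypothetical helpers, and A_m is taken as the reference allele.

```python
m = 3
# Hypothetical expected genotypic values G_jk, keyed by (j, k) with j <= k.
G = {(1, 1): 10, (1, 2): 7, (1, 3): 4, (2, 2): 9, (2, 3): 6, (3, 3): 2}

def copies(j, g):
    """Number of copies of allele A_j in genotype g."""
    return (g[0] == j) + (g[1] == j)

def h1(j, g):
    """Indicator of exactly one copy of A_j."""
    return int(copies(j, g) == 1)

def h2(j, g):
    """Indicator of two copies of A_j."""
    return int(copies(j, g) == 2)

ref = range(1, m)  # non-reference alleles 1, ..., m-1
pi0 = G[(m, m)]
pi = {j: G[(j, m)] - G[(m, m)] for j in ref}
eta = {(j, j): G[(j, j)] - G[(m, m)] for j in ref}
eta.update({(j, k): (G[(j, k)] - G[(j, m)]) - (G[(k, m)] - G[(m, m)])
            for j in ref for k in ref if j < k})

def model6(g):
    """Evaluate the right-hand side of model (6) at genotype g."""
    total = pi0 + sum(pi[j] * h1(j, g) + eta[(j, j)] * h2(j, g) for j in ref)
    total += sum(eta[(j, k)] * h1(j, g) * h1(k, g) for (j, k) in eta if j < k)
    return total

fitted = {g: model6(g) for g in G}
print(fitted == G)  # model (6) reproduces every genotypic value
```

Because the baseline group A_mA_m enters only through the intercept, each π_j and η_jj is a simple difference from G_mm, which is what makes this coding convenient when the baseline comparison is the quantity of interest.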

Each of the three models (4), (5) and (6) provides a full re-parameterization of the m(m + 1)/2 expected genotypic values under the same model framework (3). The relationships between their model parameters and the expected genotypic values are summarized in Table 1. It is interesting to see from Table 1 that the null hypothesis of α_j = δ_jj = 0 is equivalent to either a_j = d_jj = 0 or π_j = η_jj = 0, which implies G_jj = G_jm = G_mm . So the three models above should provide the same test statistics for testing α_j = δ_jj = 0, a_j = d_jj = 0 or π_j = η_jj = 0.
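The three coding schemes and the stated relationships among their coding variables can be written as small functions and checked over all genotypes. This is a minimal sketch, assuming unordered genotypes are represented as pairs of integer allele labels; `v` below uses the phase-free definition (1 when the genotype is A_jA_k in either order), and all function names are illustrative.

```python
from itertools import combinations_with_replacement

m = 4
alleles = range(1, m + 1)
genotypes = list(combinations_with_replacement(alleles, 2))  # unordered pairs

def w(j, g):
    """Allele coding: number of copies of A_j in genotype g."""
    return (g[0] == j) + (g[1] == j)

def v(j, k, g):
    """Allele coding: 1 if g is the genotype A_j A_k (either order), else 0."""
    return int(sorted(g) == sorted((j, k)))

def f(j, g):
    """F-infinity coding: 1, 0, -1 for two, one, zero copies of A_j."""
    return w(j, g) - 1

def h(j, g):
    """F-infinity coding: 1 if g carries exactly one copy of A_j, else 0."""
    return int(w(j, g) == 1)

def h1(j, g):
    """Allele-count coding: indicator of one copy of A_j."""
    return int(w(j, g) == 1)

def h2(j, g):
    """Allele-count coding: indicator of two copies of A_j."""
    return int(w(j, g) == 2)

relations_hold = all(
    sum(w(k, g) for k in alleles) == 2           # the w_j always sum to 2
    and h(j, g) == w(j, g) - 2 * v(j, j, g)      # h_j = w_j - 2 v_jj
    and h1(j, g) == h(j, g)                      # h_1j = h_j
    and h2(j, g) == v(j, j, g)                   # h_2j = v_jj
    and all(v(j, k, g) == h(j, g) * h(k, g) for k in alleles if k != j)
    for g in genotypes for j in alleles
)
print(relations_hold)
```

The first relation is the source of the collinearity that forces one allele to be dropped as the reference in each of the three models.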

Table 1 Parameterization of fully parameterized one-locus models (4), (5), (6).

For a biallelic locus with alleles A (or A₁) and a (or A₂), we have m = 2 with three possible genotypic values G_AA = E(G|AA), G_Aa = E(G|Aa) and G_aa = E(G|aa). If we adopt the allele coding, then w₂(g) = 2 - w₁(g), v₁₂(g) = w₁(g) - 2v₁₁(g), and v₂₂(g) = 1 - w₁(g) + v₁₁(g). For the F_∞ coding, we have f₂(g) = -f₁(g) and h₂(g) = h₁(g). So we can further drop d₂ in model (5). For the allele-count coding, we have h₁₂(g) = h₁₁(g) and h₂₂(g) = 1 - h₁₁(g) - h₂₁(g). The interpretation of the model parameters for these three biallelic QTL models is summarized in Table 2, which is a special case of Table 1.
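For m = 2 the general formulas given earlier for models (4), (5) and (6) reduce to simple expressions in G_AA, G_Aa and G_aa. The following sketch specializes them with allele a as the reference; the function name, the dictionary keys and the input values are hypothetical and serve only as an illustration.

```python
from fractions import Fraction

def biallelic_parameters(G_AA, G_Aa, G_aa):
    """Parameters of models (4), (5), (6) for m = 2 with allele a as reference."""
    G_AA, G_Aa, G_aa = Fraction(G_AA), Fraction(G_Aa), Fraction(G_aa)
    allele = {"mu": G_aa,
              "alpha": G_Aa - G_aa,
              "delta": G_AA - 2 * G_Aa + G_aa}
    f_inf = {"tau": (G_AA + G_aa) / 2,
             "a": (G_AA - G_aa) / 2,
             "d": G_Aa - (G_AA + G_aa) / 2}
    count = {"pi0": G_aa,
             "pi1": G_Aa - G_aa,
             "eta11": G_AA - G_aa}
    return allele, f_inf, count

allele, f_inf, count = biallelic_parameters(10, 7, 2)
# Cross-model relationships: a = alpha + delta/2, d = -delta/2, eta11 = 2*alpha + delta.
print(f_inf["a"] == allele["alpha"] + allele["delta"] / 2)
print(f_inf["d"] == -allele["delta"] / 2)
print(count["eta11"] == 2 * allele["alpha"] + allele["delta"])
```

All three parameter sets describe the same three genotypic values, so any one can be converted to another through these linear relationships.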

Table 2 Parameterization of one-locus models (4), (5), (6) when m = 2.

For a locus with three alleles A₁, A₂, A₃ (i.e., m = 3), we have six possibly distinct expected genotypic values G₁₁, G₂₂, G₃₃, G₁₂, G₁₃ and G₂₃. Each of the three fully parameterized models (4), (5) and (6) can provide a full re-parameterization of the six expected genotypic values. In a matrix form, from the allele coding model (4), we have

[\begin{matrix} G_{11} \\ G_{22} \\ G_{33} \\ G_{12} \\ G_{13} \\ G_{23} \end{matrix}] = [\begin{matrix} 1 & 2 & 0 & 1 & 0 & 0 \\ 1 & 0 & 2 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 1 \\ 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 & 0 \end{matrix}] [\begin{matrix} μ \\ α_{1} \\ α_{2} \\ δ_{11} \\ δ_{22} \\ δ_{12} \end{matrix}]

From the F_∞ coding model (5), we have

[\begin{matrix} G_{11} \\ G_{22} \\ G_{33} \\ G_{12} \\ G_{13} \\ G_{23} \end{matrix}] = [\begin{matrix} 1 & 1 & - 1 & 0 & 0 & 0 \\ 1 & - 1 & 1 & 0 & 0 & 0 \\ 1 & - 1 & - 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 1 & 1 \\ 1 & 0 & - 1 & 1 & 0 & 0 \\ 1 & - 1 & 0 & 0 & 1 & 0 \end{matrix}] [\begin{matrix} τ \\ a_{1} \\ a_{2} \\ d_{11} \\ d_{22} \\ d_{12} \end{matrix}]

And the allele-count coding model (6) gives

[\begin{matrix} G_{11} \\ G_{22} \\ G_{33} \\ G_{12} \\ G_{13} \\ G_{23} \end{matrix}] = [\begin{matrix} 1 & 0 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 1 \\ 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 & 0 \end{matrix}] [\begin{matrix} π_{0} \\ π_{1} \\ π_{2} \\ η_{11} \\ η_{22} \\ η_{12} \end{matrix}]

By multiplying both sides of each equation on the left by the inverse of its design matrix, we can show that the model parameters and the expected genotypic values have the relationships summarized in Table 3, which are consistent with those in Table 1.
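Solving these linear systems can be sketched with exact rational arithmetic. The example below uses the design matrices of models (4) and (5) for m = 3 exactly as written above; the genotypic values in `G` are made-up numbers, and `solve` is a small illustrative Gauss-Jordan routine rather than a library function.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination on Fractions."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(y)] for row, y in zip(A, b)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [row[n] for row in M]

# Rows: G11, G22, G33, G12, G13, G23.
X4 = [  # columns: mu, alpha1, alpha2, delta11, delta22, delta12
    [1, 2, 0, 1, 0, 0],
    [1, 0, 2, 0, 1, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
]
X5 = [  # columns: tau, a1, a2, d11, d22, d12
    [1, 1, -1, 0, 0, 0],
    [1, -1, 1, 0, 0, 0],
    [1, -1, -1, 0, 0, 0],
    [1, 0, 0, 1, 1, 1],
    [1, 0, -1, 1, 0, 0],
    [1, -1, 0, 0, 1, 0],
]
G = [10, 9, 2, 7, 4, 6]  # hypothetical G11, G22, G33, G12, G13, G23

params4 = solve(X4, G)
params5 = solve(X5, G)
print(params4)
print(params5)
```

For these inputs the solution of the first system gives μ = G₃₃ and α₁ = G₁₃ - G₃₃, and the second gives a₁ = (G₁₁ - G₃₃)/2, in line with the closed-form relationships for the general case.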

Table 3 Parameterization of one-locus models (4), (5), (6) when m = 3.

Reduced one-locus models

Due to limited available sample sizes in practice, it may not always be feasible to use the fully parameterized models. Quite often, one may want to check the main effects of alleles first before including all possible allelic interactions. Here we consider the case of including possible interactions between A_j and itself for the homozygous genotypes A_jA_j , j = 1, ..., m, but ignore other interactions between different alleles A_j and A_k (j ≠ k). Then we obtain a reduced case of model (2) as below

G_{j k} = μ^{*} + α_{j}^{*} + α_{k}^{*} + δ_{j}^{*} 1_{{j = k}}

(7)

for j, k = 1, ..., m. Similarly, using the allele coding, we can present this model in a linear model form as

G (g_{i}) = μ^{*} + \sum_{j = 1}^{m} α_{j}^{*} w_{j} (g_{i}) + \sum_{j = 1}^{m} δ_{j}^{*} v_{j} (g_{i})

(8)

for i = 1, ..., N, where v_j (g) = v_jj (g) for j = 1, ..., m, with v_jj (g) defined as before.

Model (8) contains only one redundant parameter in the α*'s due to the fact that $\sum_{j = 1}^{m} w_{j} (g_{i}) = 2$ for i = 1, ..., N. In this case, as shown in Appendix A, the parameters $δ_{1}^{*}, \dots, δ_{m}^{*}$ in model (8) are estimable but the parameters μ* and $α_{1}^{*}, \dots, α_{m}^{*}$ are not estimable. To overcome the redundant parameter problem, we can drop w_m from model (8) and consider

G (g_{i}) = μ + \sum_{j = 1}^{m - 1} α_{j} w_{j} (g_{i}) + \sum_{j = 1}^{m} δ_{j} v_{j} (g_{i})

(9)

for i = 1, ..., N. Note that $v_{m} = z_{1 m} z_{2 m} = 1 - \sum_{j = 1}^{m - 1} w_{j} + \sum_{j = 1}^{m - 1} \sum_{k = 1}^{m - 1} v_{j k}$ , which cannot be completely determined by {w_j , v_j , j = 1, ..., m - 1}. Therefore, dropping {δ_jk , j, k = 1, ..., m - 1, j < k} from model (4) does not directly lead to a model equivalent to (9), as the latter contains v_m . In fact, further dropping v_m from (9) leads to a more restricted model structure for the expected genotypic values, with an interpretation of its model parameters similar to that presented for model (4). It is also interesting to see that the haplotype coding proposed in [13] is a special case of model (9) in which all the allelic interactions are ignored and all the {v_j , j = 1, ..., m} are dropped from the model.
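The identity for v_m can be verified by enumerating phased genotypes. The sketch below is illustrative: it assumes integer allele labels and uses the phase-specific indicators z_1j and z_2j directly, which is possible here only because the paternal and maternal alleles are enumerated explicitly.

```python
from itertools import product

m = 4

def z(j, allele):
    """Indicator that the transmitted allele is A_j."""
    return int(allele == j)

identity_holds = []
for pat, mat in product(range(1, m + 1), repeat=2):  # (paternal, maternal) alleles
    w = {j: z(j, pat) + z(j, mat) for j in range(1, m)}
    v = {(j, k): z(j, pat) * z(k, mat)
         for j in range(1, m) for k in range(1, m)}
    v_m = z(m, pat) * z(m, mat)
    # v_m = 1 - sum_j w_j + sum_j sum_k v_jk over the non-reference alleles.
    identity_holds.append(v_m == 1 - sum(w.values()) + sum(v.values()))

print(all(identity_holds))
```

The double sum includes the heterozygous terms v_jk with j ≠ k, which is why v_m is not a linear function of the w_j and the homozygous indicators v_j alone.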

By definition, a reduced model can be derived from its original model by adding certain restrictions on the model parameters. Typically, the model parameters in a reduced model could be interpreted similarly as that in its original model when these restrictions are simple enough (e.g., by setting a subset of them being zero). When the restrictions on the original model parameters are complicated, however, the interpretation of the reduced model parameters could be different from that presented in the original model. For model (9), we can establish the relationship between its model parameters and the expected genotypic values using a classical matrix approach, as shown in Appendix B. An alternative way of building this relationship is to simply treat model (9) as a reduced form of model (8) by adding a restriction $α_{m}^{*} = 0$ and taking μ = μ*, $α_{j} = α_{j}^{*}$ for j = 1, ..., m - 1, and $δ_{j} = δ_{j}^{*}$ for j = 1, ..., m. Note that adding the restriction $α_{m}^{*} = 0$ on (8) does not change the modeling structure of the expected genotypic values because $α_{m}^{*}$ is a redundant parameter given the others. Therefore,

\begin{cases} μ = G_{m m} - δ_{m}^{*} = G_{j m} + G_{k m} - G_{j k}, & j \neq k \neq m \\ α_{j} = G_{j m} - μ^{*} = G_{j k} - G_{k m}, & k \neq j, m; \; j = 1, \dots, m - 1 \\ δ_{j} = G_{j j} - (μ^{*} + 2 α_{j}^{*}) = (G_{j j} - G_{j k}) - (G_{j l} - G_{k l}), & j \neq k \neq l; \; j = 1, \dots, m \end{cases}

Compared with the parameters in model (4), the interpretation of the parameters in model (9) has changed slightly. The intercept μ now becomes $(G_{m m} - δ_{m}^{*})$ instead of G_mm , the α_j is the substitution effect of replacing allele A_m by A_j when paired with any allele A_k (k ≠ j, m) instead of just A_m , while the δ_j is the difference between the substitution effect of replacing any allele A_k by A_j when paired with A_j itself and that when paired with another allele A_l (l ≠ j, k). If both α_j and δ_j are zero for a particular j < m, then G_jj = G_jm = μ and G_jk = G_km for any k ≠ j, m.
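As a numerical illustration of these relationships, the following minimal sketch solves model (9) for m = 3, where the six parameters re-parameterize the six genotypic values exactly. The genotypic values are those of the three-allele simulation example used later in the paper; the helper names `w` and `v` and the use of NumPy are assumptions made for illustration only.

```python
import numpy as np

# Expected genotypic values of a three-allele locus (the values used in the
# paper's simulation example), keyed by unordered allele pair (j, k), j <= k.
G = {(1, 1): 10.0, (2, 2): 50.0, (3, 3): 42.0,
     (1, 2): 30.0, (1, 3): 36.0, (2, 3): 46.0}
m = 3
genotypes = list(G)

def w(j, g):
    """Allele coding w_j: number of copies of allele A_j in genotype g."""
    return g.count(j)

def v(j, g):
    """Homozygote indicator v_j: 1 if g is A_j A_j, else 0."""
    return 1 if g == (j, j) else 0

# Design matrix of model (9): intercept, w_1..w_{m-1}, v_1..v_m.
X = np.array([[1] + [w(j, g) for j in range(1, m)]
                  + [v(j, g) for j in range(1, m + 1)]
              for g in genotypes], dtype=float)
y = np.array([G[g] for g in genotypes])

# With m = 3 the system is square (6 parameters, 6 genotypes) and invertible.
mu, alpha1, alpha2, delta1, delta2, delta3 = np.linalg.solve(X, y)

# Closed-form expressions from the text, for comparison.
mu_cf = G[1, 3] + G[2, 3] - G[1, 2]
alpha1_cf = G[1, 2] - G[2, 3]
delta1_cf = (G[1, 1] - G[1, 2]) - (G[1, 3] - G[2, 3])
print(mu, alpha1, alpha2, delta1, delta2, delta3)
```

For these values the intercept is 52 rather than G₃₃ = 42, which reflects the shift μ = G_mm - δ*_m described above.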

Under the same model framework (8), the F_∞ coding leads to the following model

G (g_{i}) = τ + \sum_{j = 1}^{m - 1} a_{j} f_{j} (g_{i}) + \sum_{j = 1}^{m} d_{j} h_{j} (g_{i})

(10)

for i = 1, ..., N. By applying the relationship f_j (g) = w_j (g) - 1 and h_j (g) = w_j (g) - 2v_j (g) for j = 1, ..., m, we can show that for models (10) and (8) to be equivalent their model parameters have the relationship

\{\begin{matrix} τ = μ^{*} + \sum_{j = 1}^{m} (α_{j}^{*} + \frac{δ_{j}^{*}}{2}) \\ a_{j} = α_{j}^{*} + \frac{δ_{j}^{*}}{2}, j = 1, \dots, m - 1 \\ d_{j} = - \frac{δ_{j}^{*}}{2}, j = 1, \dots, m \\ α_{m}^{*} + \frac{δ_{m}^{*}}{2} = 0 \end{matrix}

In other words, model (10) leads to a restriction $2 α_{m}^{*} + δ_{m}^{*} = 0$ on the parameters in model (8) which makes $μ^{*} = G_{m m} - (2 α_{m}^{*} + δ_{m}^{*}) = G_{m m}$ , $α_{m}^{*} = - δ_{m}^{*} / 2$ and $α_{j}^{*} = G_{j m} - (μ^{*} + α_{m}^{*}) = G_{j m} - G_{m m} + δ_{m}^{*} / 2$ , j = 1, ..., m - 1. Thus,

\begin{cases} τ = G_{m m} + \frac{1}{2} \sum_{j = 1}^{m - 1} (G_{j j} - G_{m m}) \\ a_{j} = \frac{G_{j j} - G_{m m}}{2}, & j = 1, \dots, m - 1 \\ d_{j} = - \frac{(G_{j j} - G_{j k}) - (G_{j l} - G_{k l})}{2}, & j \neq k \neq l; \; j = 1, \dots, m \end{cases}

Now d_j becomes a half of the difference between the substitution effect of replacing any allele A_k by A_j when paired with another A_j and that when paired with an allele A_l (l ≠ j, k), which can no longer be referred to as a dominance effect.
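The same kind of check can be sketched for the F_∞ coding model (10) with m = 3. The coding variables are built from the stated relationships f_j = w_j - 1 and h_j = w_j - 2v_j; the genotypic values are again those of the paper's three-allele example, and the function names are illustrative assumptions.

```python
import numpy as np

# Same three-allele genotypic values as in the paper's simulation example.
G = {(1, 1): 10.0, (2, 2): 50.0, (3, 3): 42.0,
     (1, 2): 30.0, (1, 3): 36.0, (2, 3): 46.0}
m = 3
genotypes = list(G)

def w(j, g):
    return g.count(j)                 # copies of allele A_j

def v(j, g):
    return 1 if g == (j, j) else 0    # homozygote indicator

def f(j, g):
    return w(j, g) - 1                # 1, 0, -1 for two, one, zero copies

def h(j, g):
    return w(j, g) - 2 * v(j, g)      # 1 only for heterozygotes carrying A_j

# Design matrix of model (10): intercept, f_1..f_{m-1}, h_1..h_m.
X = np.array([[1] + [f(j, g) for j in range(1, m)]
                  + [h(j, g) for j in range(1, m + 1)]
              for g in genotypes], dtype=float)
y = np.array([G[g] for g in genotypes])
tau, a1, a2, d1, d2, d3 = np.linalg.solve(X, y)

# Closed-form expressions from the text.
tau_cf = G[3, 3] + 0.5 * ((G[1, 1] - G[3, 3]) + (G[2, 2] - G[3, 3]))
a1_cf = (G[1, 1] - G[3, 3]) / 2
d1_cf = -((G[1, 1] - G[1, 2]) - (G[1, 3] - G[2, 3])) / 2
print(tau, a1, a2, d1, d2, d3)
```

Here a₁ depends only on the two homozygous values G₁₁ and G₃₃, while d₁ is a contrast of four genotypic values, in line with the remark that d_j is no longer a dominance effect.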

With the allele-count coding, we can actually construct two equivalent models in this case

G (g_{i}) = π_{0} + \sum_{j = 1}^{m - 1} π_{j} h_{1 j} (g_{i}) + \sum_{j = 1}^{m} η_{j} h_{2 j} (g_{i})

(11)

and

G (g_{i}) = π_{0}^{'} + \sum_{j = 1}^{m} {π^{'}}_{j} h_{1 j} (g_{i}) + \sum_{j = 1}^{m - 1} {η^{'}}_{j} h_{2 j} (g_{i})

(12)

for i = 1, ..., N. Similarly, we can show that model (11) can be treated as a reduced model by adding the restriction $α_{m}^{*} = 0$ on parameters in model (8) with the following relationships

\begin{cases} π_{0} = μ^{*} = G_{j m} + G_{k m} - G_{j k}, & j \neq k \neq m \\ π_{j} = α_{j}^{*} = G_{j k} - G_{k m}, & k \neq j, m; \; j = 1, \dots, m - 1 \\ η_{j} = 2 α_{j}^{*} + δ_{j}^{*} = (G_{j j} - G_{j m}) + (G_{j k} - G_{k m}), & k \neq j, m; \; j = 1, \dots, m - 1 \\ η_{m} = δ_{m}^{*} = (G_{m m} - G_{j m}) - (G_{k m} - G_{j k}), & j \neq k \neq m \end{cases}

On the other hand, model (12) can be treated as a reduced model by adding the restriction $2 α_{m}^{*} + δ_{m}^{*} = 0$ on parameters in model (8) with the following relationships

\begin{cases} {π^{'}}_{0} = μ^{*} = G_{m m} \\ {π^{'}}_{j} = α_{j}^{*} = \frac{(G_{j m} - G_{m m}) + (G_{j k} - G_{k m})}{2}, & k \neq j, m; \; j = 1, \dots, m - 1 \\ {π^{'}}_{m} = - \frac{δ_{m}^{*}}{2} = - \frac{(G_{m m} - G_{j m}) - (G_{k m} - G_{j k})}{2}, & j \neq k \neq m \\ {η^{'}}_{j} = 2 α_{j}^{*} + δ_{j}^{*} = G_{j j} - G_{m m}, & j = 1, \dots, m - 1 \end{cases}

While the effect η_jj in model (6) is the difference between the two expected homozygous genotypic values G_jj and G_mm , the effect η_j in model (11) becomes the sum of the substitution effects of replacing allele A_m by A_j when paired with A_j itself and when paired with another allele A_k (k ≠ j, m). It is also interesting to see that the definitions of the parameters in models (11) and (12) are quite different. A null hypothesis of $H_{0} : π_{j}^{'} = η_{j}^{'} = 0$ for a particular j < m in model (12) implies that G_jj = G_mm and G_mm - G_jm = G_jk - G_km for any k ≠ j, m, while the null hypothesis of H₀ : π_j = η_j = 0 for a j < m in model (11) implies that G_jj = G_jm and G_jk = G_km for any k ≠ j, m, which has nothing to do with G_mm .

Under the same model framework (8), each of the above four models (9), (10), (11) and (12) contains 2m non-redundant parameters (including the intercept) to model the m(m + 1)/2 expected genotypic values. When m > 3, we have m(m + 1)/2 > 2m. Therefore, the model framework (7) enforces certain constraints on the m(m + 1)/2 genotypic values. If m = 3, then each of the four models actually provides a full re-parameterization of the six expected genotypic values G₁₁, G₂₂, G₃₃, G₁₂, G₁₃ and G₂₃. The relationships between the parameters of the four models and the expected genotypic values are summarized in Table 4.
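The parameter count can be checked numerically. The sketch below builds the design matrix of the allele coding model (9) over all m(m + 1)/2 genotypes and compares the number of genotypes with the number of columns and the matrix rank; the function name `design_model9` is a hypothetical helper, not part of the original work.

```python
import numpy as np
from itertools import combinations_with_replacement

def design_model9(m):
    """Rows: all unordered genotypes; columns: 1, w_1..w_{m-1}, v_1..v_m."""
    genotypes = list(combinations_with_replacement(range(1, m + 1), 2))
    rows = []
    for g in genotypes:
        w = [g.count(j) for j in range(1, m)]                   # allele counts
        v = [1 if g == (j, j) else 0 for j in range(1, m + 1)]  # homozygotes
        rows.append([1] + w + v)
    return np.array(rows, dtype=float)

summary = {}
for m in (3, 4, 5):
    X = design_model9(m)
    n_genotypes, n_params = X.shape
    rank = int(np.linalg.matrix_rank(X))
    summary[m] = (n_genotypes, n_params, rank)
    print(f"m={m}: genotypes={n_genotypes}, parameters={n_params}, rank={rank}")
```

For m = 3 the matrix is square and of full rank, so the model is a re-parameterization; for m = 4 and m = 5 there are more genotypic values than parameters, which is where the constraints on the genotypic values come from.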

Table 4 Parameterization of one-locus models (9), (10), (11), (12) when m ≥ 3.


Comparing Table 4 with Table 1, we can see that the definition of model parameters depends not only on the coding schemes of marker genotypes but also on the underlying framework for the structure of the expected genotypic values. From Table 4, it is also interesting to see that the null hypothesis of H₀ : α_j = δ_j = 0 (j < m) in model (9) is equivalent to π_j = η_j = 0 in model (11), which implies $α_{j}^{*} = δ_{j}^{*} = 0$ in model (8) with restriction $α_{m}^{*} = 0$ , or G_jk = G_km for any k ≠ m. On the other hand, the null hypothesis of H₀ : a_j = d_j = 0 (j < m) in model (10) is equivalent to $π_{j}^{'} = η_{j}^{'} = 0$ in model (12), which implies $α_{j}^{*} = δ_{j}^{*} = 0$ in model (8) with a restriction $2 α_{m}^{*} + δ_{m}^{*} = 0$ , or G_jj = G_mm and G_jj - G_jm = G_jk - G_km for any k ≠ m. In general, the two null hypotheses of α_j = δ_j = 0 and a_j = d_j = 0 may not always be equivalent. For example, when m = 3, similar to the three-allele models discussed in the previous section, we can show that the parameters of the four models and the expected genotypic values have the relationships shown in Table 5, which is a special case of Table 4. We can see from Table 5 that α₁ = δ₁ = 0 is equivalent to π₁ = η₁ = 0, which implies G₁₂ = G₂₃ and G₁₁ = G₁₃; while a₁ = d₁ = 0 is equivalent to $π_{1}^{'} = η_{1}^{'} = 0$ , which implies G₁₁ = G₃₃ and G₁₂ + G₁₃ = G₁₁ + G₂₃. So, depending on the underlying true setting of the expected genotypic values, the null hypothesis of α₁ = δ₁ = 0 in model (9) could be different from that of a₁ = d₁ = 0 in model (10).

Table 5 Parameterization of one-locus models (9), (10), (11), (12) when m = 3.


Extension to two-locus models

In this section, we further explore some extensions of the previous one-locus models to two-locus models. Consider two marker loci with alleles $A_{11}, \dots, A_{1 m_{1}}$ at locus 1 and alleles $A_{21}, \dots, A_{2 m_{2}}$ at locus 2, respectively. Without distinguishing the parental origins of the alleles, there are totally m₁m₂(m₁ + 1)(m₂ + 1)/4 possible distinctive expected genotypic values: G_jkrs = E(G|A_1jA_1kA_2rA_2s) for j, k = 1, ..., m₁, j ≤ k; and r, s = 1, ..., m₂, r ≤ s. Using the allele coding, we introduce the following coding variables

\begin{gathered} w_{1 j} (g) = \{\begin{matrix} 2, & if g = A_{1 j} A_{1 j} \\ 1, & if g = A_{1 j} A_{1 j}^{c} \\ 0, & if g = A_{1 j}^{c} A_{1 j}^{c} \end{matrix} \\ v_{1 j k} (g) = \{\begin{matrix} 1, & if g = A_{1 j} A_{1 k} \\ 0, & otherwise \end{matrix} \end{gathered}

j, k = 1, ..., m₁, for marker genotypes at locus 1 and

w_{2 r} (g) = \{\begin{matrix} 2, & if g = A_{2 r} A_{2 r} \\ 1, & if g = A_{2 r} A_{2 r}^{c} \\ 0, & if g = A_{2 r}^{c} A_{2 r}^{c} \end{matrix}

v_{2 r s} (g) = \{\begin{matrix} 1, & if g = A_{2 r} A_{2 s} \\ 0, & otherwise \end{matrix}

r, s = 1, ..., m₂, for marker genotypes at locus 2, where $A_{1 j}^{c}$ (or $A_{2 r}^{c}$ ) denotes any allele other than A_1j (or A_2r ) at locus 1 (or 2). A fully parameterized two-locus model for G_jkrs can then be presented as

\begin{gathered} G (g_{i}) = μ + \sum_{j = 1}^{m_{1} - 1} α_{1 j} w_{1 j} + \sum_{j = 1}^{m_{1} - 1} \sum_{k = j}^{m_{1} - 1} δ_{1 j k} v_{1 j k} \\ + \sum_{r = 1}^{m_{2} - 1} α_{2 r} w_{2 r} + \sum_{r = 1}^{m_{2} - 1} \sum_{s = r}^{m_{2} - 1} δ_{2 r s} v_{2 r s} \\ + \sum_{j = 1}^{m_{1} - 1} \sum_{r = 1}^{m_{2} - 1} (α_{1 j} α_{2 r}) w_{1 j} w_{2 r} \\ + \sum_{j = 1}^{m_{1} - 1} \sum_{r = 1}^{m_{2} - 1} \sum_{s = r}^{m_{2} - 1} (α_{1 j} δ_{2 r s}) w_{1 j} v_{2 r s} \\ + \sum_{j = 1}^{m_{1} - 1} \sum_{k = j}^{m_{1} - 1} \sum_{r = 1}^{m_{2} - 1} (δ_{1 j k} α_{2 r}) v_{1 j k} w_{2 r} \\ + \sum_{j = 1}^{m_{1} - 1} \sum_{k = j}^{m_{1} - 1} \sum_{r = 1}^{m_{2} - 1} \sum_{s = r}^{m_{2} - 1} (δ_{1 j k} δ_{2 r s}) v_{1 j k} v_{2 r s} \end{gathered}

(13)

for i = 1, ..., N. Similar to the one-locus models, we can establish the relationship between the model parameters and the expected genotypic values as shown in (C.1) of Appendix C. A nice property of this allele coding model is that a higher order effect is simply the deviation of its corresponding expected genotypic value from an approximation of the other lower order effects. Here the corresponding expected genotypic value of a marker effect is determined by the position of alleles that differ from the two reference alleles $A_{1 m_{1}}$ and $A_{2 m_{2}}$ . So, beginning with the lowest-order parameter μ, it is straightforward to build the relationships between the model parameters and the expected genotypic values by working from the low-order effect parameters up to the high-order ones.

For the F_∞ coding, we can define the following coding variables for the genotypes at the two marker loci separately.

\begin{gathered} f_{1 j} (g) = \begin{cases} 1, & \text{if } g = A_{1 j} A_{1 j} \\ 0, & \text{if } g = A_{1 j} A_{1 j}^{c} \\ - 1, & \text{if } g = A_{1 j}^{c} A_{1 j}^{c} \end{cases} \\ h_{1 j} (g) = \begin{cases} 1, & \text{if } g = A_{1 j} A_{1 j}^{c} \\ 0, & \text{otherwise} \end{cases} \end{gathered}

for j = 1, ..., m₁, and

\begin{gathered} f_{2 r} (g) = \begin{cases} 1, & \text{if } g = A_{2 r} A_{2 r} \\ 0, & \text{if } g = A_{2 r} A_{2 r}^{c} \\ - 1, & \text{if } g = A_{2 r}^{c} A_{2 r}^{c} \end{cases} \\ h_{2 r} (g) = \begin{cases} 1, & \text{if } g = A_{2 r} A_{2 r}^{c} \\ 0, & \text{otherwise} \end{cases} \end{gathered}

for r = 1, ..., m₂. A fully parameterized two-locus model using this F_∞ coding is then

\begin{gathered} G (g_{i}) = τ + \sum_{j = 1}^{m_{1} - 1} a_{1 j} f_{1 j} (g_{i}) + \sum_{r = 1}^{m_{2} - 1} a_{2 r} f_{2 r} (g_{i}) \\ + \sum_{j = 1}^{m_{1} - 1} \sum_{k = j}^{m_{1} - 1} d_{1 j k} h_{1 j} (g_{i}) h_{1 k} (g_{i}) \\ + \sum_{r = 1}^{m_{2} - 1} \sum_{s = r}^{m_{2} - 1} d_{2 r s} h_{2 r} (g_{i}) h_{2 s} (g_{i}) \\ + \sum_{j = 1}^{m_{1} - 1} \sum_{r = 1}^{m_{2} - 1} (a_{1 j} a_{2 r}) f_{1 j} f_{2 r} \\ + \sum_{j = 1}^{m_{1} - 1} \sum_{r = 1}^{m_{2} - 1} \sum_{s = r}^{m_{2} - 1} (a_{1 j} d_{2 r s}) f_{1 j} h_{2 r} h_{2 s} \\ + \sum_{j = 1}^{m_{1} - 1} \sum_{k = j}^{m_{1} - 1} \sum_{r = 1}^{m_{2} - 1} (d_{1 j k} a_{2 r}) h_{1 j} h_{1 k} f_{2 r} \\ + \sum_{j = 1}^{m_{1} - 1} \sum_{k = j}^{m_{1} - 1} \sum_{r = 1}^{m_{2} - 1} \sum_{s = r}^{m_{2} - 1} (d_{1 j k} d_{2 r s}) h_{1 j} h_{1 k} h_{2 r} h_{2 s} \end{gathered}

(14)

for i = 1, ..., N. Still, using the relationships w_1j = 1 + f_1j , w_2r = 1 + f_2r , v_1jj = (1 + f_1j - h_1j )/2, v_2rr = (1 + f_2r - h_2r )/2, v_1jk = h_1j h_1k for j ≠ k, and v_2rs = h_2r h_2s for r ≠ s between the F_∞ coding variables and the allele coding variables, we can establish the relationships between the model parameters and the expected genotypic values as shown in (C.2) of Appendix C. We can easily verify that the biallelic two-locus effects $E_{F_{\infty} \cdot A B}$ in [9] are a special case of our results with m₁ = m₂ = 2. It is also interesting to see that the interpretation of model parameters in terms of the expected genotypic values becomes much more complicated than that in the previous allele coding model. When m₁, m₂ > 2, the low-order within-locus main effect a_1j is a weighted combination of the differences $(G_{j j r r} - G_{m_{1} m_{1} r r})$ , where r = 1, ..., m₂ refer to various homozygous genotypes A_2r A_2r at locus 2. The within-locus effect d_1jj is a weighted combination of the allelic interactions $(G_{j j r r} - 2 G_{j m_{1} r r} + G_{m_{1} m_{1} r r})$ , r = 1, ..., m₂, at locus 1 with reference A_2r A_2r at locus 2. Even the intercept τ of the model becomes a complex function of various homozygous genotypic values.

Applying the allele-count coding, we can define

\begin{gathered} h_{1 j}^{(1)} (g) = \begin{cases} 1, & \text{if } g = A_{1 j} A_{1 j}^{c} \\ 0, & \text{otherwise} \end{cases} \\ h_{2 j}^{(1)} (g) = \begin{cases} 1, & \text{if } g = A_{1 j} A_{1 j} \\ 0, & \text{otherwise} \end{cases} \end{gathered}

for j = 1, ..., m₁, and

\begin{gathered} h_{1 r}^{(2)} (g) = \begin{cases} 1, & \text{if } g = A_{2 r} A_{2 r}^{c} \\ 0, & \text{otherwise} \end{cases} \\ h_{2 r}^{(2)} (g) = \begin{cases} 1, & \text{if } g = A_{2 r} A_{2 r} \\ 0, & \text{otherwise} \end{cases} \end{gathered}

for r = 1, ..., m₂. Another fully parameterized two-locus model for G_jkrs can be written as

\begin{array}{l} G (g_{i}) = π_{0} + \sum_{j = 1}^{m_{1} - 1} (π_{1 j} h_{1 j}^{(1)} + η_{1 j j} h_{2 j}^{(1)}) \\ + \sum_{r = 1}^{m_{2} - 1} (π_{2 r} h_{1 r}^{(2)} + η_{2 r r} h_{2 r}^{(2)}) \\ + \sum_{j = 1}^{m_{1} - 1} \sum_{k = j + 1}^{m_{1} - 1} η_{1 j k} h_{1 j}^{(1)} h_{1 k}^{(1)} \\ + \sum_{r = 1}^{m_{2} - 1} \sum_{s = r + 1}^{m_{2} - 1} η_{2 r s} h_{1 r}^{(2)} h_{1 s}^{(2)} \\ + \sum_{j = 1}^{m_{1} - 1} \sum_{r = 1}^{m_{2} - 1} [(π_{1 j} π_{2 r}) h_{1 j}^{(1)} h_{1 r}^{(2)} \\ + (π_{1 j} η_{2 r r}) h_{1 j}^{(1)} h_{2 r}^{(2)} + (η_{1 j j} π_{2 r}) h_{2 j}^{(1)} h_{1 r}^{(2)} \\ + (η_{1 j j} η_{2 r r}) h_{2 j}^{(1)} h_{2 r}^{(2)}] \\ + \sum_{j = 1}^{m_{1} - 1} \sum_{r = 1}^{m_{2} - 1} \sum_{s = r + 1}^{m_{2} - 1} [(π_{1 j} η_{2 r s}) h_{1 j}^{(1)} h_{1 r}^{(2)} h_{1 s}^{(2)} \\ + (η_{1 j j} η_{2 r s}) h_{2 j}^{(1)} h_{1 r}^{(2)} h_{1 s}^{(2)}] \\ + \sum_{j = 1}^{m_{1} - 1} \sum_{k = j + 1}^{m_{1} - 1} \sum_{r = 1}^{m_{2} - 1} [(η_{1 j k} π_{2 r}) h_{1 j}^{(1)} h_{1 k}^{(1)} h_{1 r}^{(2)} \\ + (η_{1 j k} η_{2 r r}) h_{1 j}^{(1)} h_{1 k}^{(1)} h_{2 r}^{(2)}] \\ + \sum_{j = 1}^{m_{1} - 1} \sum_{k = j + 1}^{m_{1} - 1} \sum_{r = 1}^{m_{2} - 1} \sum_{s = r + 1}^{m_{2} - 1} (η_{1 j k} η_{2 r s}) \\ \cdot h_{1 j}^{(1)} h_{1 k}^{(1)} h_{1 r}^{(2)} h_{1 s}^{(2)} \end{array}

(15)

for i = 1, ..., N. In this case, the allele-count coding variables and the allele coding variables have the relationships $w_{1 j} = h_{1 j}^{(1)} + 2 h_{2 j}^{(1)}$ , $w_{2 r} = h_{1 r}^{(2)} + 2 h_{2 r}^{(2)}$ , $v_{1 j j} = h_{2 j}^{(1)}$ , $v_{2 r r} = h_{2 r}^{(2)}$ , $v_{1 j k} = h_{1 j}^{(1)} h_{1 k}^{(1)}$ for j ≠ k, and $v_{2 r s} = h_{1 r}^{(2)} h_{1 s}^{(2)}$ for r ≠ s. Through the equivalence of the two models (13) and (15), we can also construct relationships between the parameters in model (15) and the expected genotypic values as shown in (C.3) of Appendix C. We can see that the interpretation of the parameters in the allele-count coding model (15) is as simple as that in the allele coding model (13), with the same intercept $G_{m_{1} m_{1} m_{2} m_{2}}$ . Besides, it seems that some parameters such as (η_1jj η_2rr ), (η_1jk η_2rs ) and (η_1jk η_2rr ) have simpler relationships than the corresponding ones in the allele coding model (13).
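The identities between the allele-count and allele coding variables can be checked by brute force at a single locus. The sketch below enumerates all genotypes for a hypothetical locus with m = 4 alleles; the function names are chosen for illustration, and the same check applies to either locus of the two-locus model.

```python
from itertools import combinations_with_replacement

m = 4  # hypothetical number of alleles at the locus
genotypes = list(combinations_with_replacement(range(1, m + 1), 2))

def w(j, g):
    """Allele coding: copies of A_j in genotype g."""
    return g.count(j)

def v(j, k, g):
    """Allele coding: 1 if g is the unordered genotype A_j A_k."""
    return int(g == tuple(sorted((j, k))))

def h1(j, g):
    """Allele-count coding: 1 if g carries exactly one copy of A_j."""
    return int(g.count(j) == 1)

def h2(j, g):
    """Allele-count coding: 1 if g carries two copies of A_j."""
    return int(g.count(j) == 2)

all_hold = True
for g in genotypes:
    for j in range(1, m + 1):
        all_hold &= w(j, g) == h1(j, g) + 2 * h2(j, g)   # w_j = h_1j + 2 h_2j
        all_hold &= v(j, j, g) == h2(j, g)               # v_jj = h_2j
        for k in range(j + 1, m + 1):
            all_hold &= v(j, k, g) == h1(j, g) * h1(k, g)  # v_jk = h_1j h_1k
print(all_hold)
```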

Finally, let us consider some reduced cases of the two-locus models. By ignoring locus-by-locus interactions (i.e., epistases), we have the following simplified two-locus model framework

G_{j k r s} = μ^{*} + α_{1 j}^{*} + α_{1 k}^{*} + δ_{1 j k}^{*} + α_{2 r}^{*} + α_{2 s}^{*} + δ_{2 r s}^{*}

(16)

for j, k = 1, ..., m₁ and r, s = 1, ..., m₂. If we further ignore the within-locus allelic interactions between different alleles, then another reduced two-locus model framework is

\begin{gathered} G_{j k r s} = μ^{*} + α_{1 j}^{*} + α_{1 k}^{*} + δ_{1 j}^{*} 1_{{j = k}} \\ + α_{2 r}^{*} + α_{2 s}^{*} + δ_{2 r}^{*} 1_{{r = s}} \end{gathered}

(17)

Similar to the one-locus models, under each of the two reduced model frameworks we can construct the two-locus models from the three coding schemes. The relationships between the model parameters and the expected genotypic values under framework (16) are summarized in Table 6, which can be treated as an extension of Table 1 to the two-locus case. The relationships between the model parameters and the expected genotypic values under framework (17) are summarized in Table 7, which is a straightforward extension of Table 4. Further dropping $δ_{1 j}^{*}$ for j = 1, ..., m₁ and $δ_{2 r}^{*}$ for r = 1, ..., m₂ in (17) will lead to an additive model framework, whose parameters can be interpreted similarly to those in Table 6. From Tables 6 and 7, we can see that in both the allele and allele-count coding models the lower-order main effects keep an interpretation similar to that in the previous fully parameterized case with epistases, while in the F_∞ coding models the definition of the lower-order main effects varies depending on whether epistases are involved in the models.

Table 6 Parameterization of two-locus models under model framework (16).


Table 7 Parameterization of two-locus models under model framework (17) when m₁, m₂ ≥ 3.


As pointed out in [9], the genetic effects of a marker may have different interpretation depending upon whether the marker is fitted in a one-locus model or a two-locus model. From the linear model theory, the genetic effects of a marker in a one-locus model are defined based on the expected genotypic values of certain genotypes at this particular marker locus with genotypes at the other marker loci being averaged out based on the joint genotype distribution. For instance, marker 1 in the two-locus setting above has its effects defined in a one-locus model based on the one-locus genotypic values $E (G_{j k}) = E (G_{j k r s} | A_{1 j} A_{1 k}) = \sum_{r s} P (A_{2 r} A_{2 s} | A_{1 j} A_{1 k}) G_{j k r s}$ , which could depend on the LDs of alleles between the two loci. When the same marker is fitted in a two-locus model, its effects are usually functions of the expected genotypic values with their joint genotypes taking certain reference alleles or genotypes at the other marker loci. So, in general, even without locus-by-locus interactions, a single marker's effects could be different from the one defined in a multi-locus model when the alleles at different loci are in linkage disequilibrium (LD). Consider a 2-locus haploid model with alleles A, a at locus 1 and B, b at locus 2. If we ignore the locus-by-locus interaction, it is easy to show that the additive allelic effects are α₁ = G_AB - G_aB = G_Ab - G_ab and α₂ = G_AB - G_Ab = G_aB - G_ab at locus 1 and 2, respectively. In a one-locus model at locus 1, however, we can show that the locus has its additive allelic effect $α_{1}^{*} = α_{1} + D α_{2} ∕ (p_{A} p_{a})$ , where D = P_AB - p_Ap_B is the LD between the two loci.
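The haploid formula can be checked numerically. In the minimal sketch below, the allele frequencies, the LD value and the additive effects are hypothetical numbers chosen only for illustration; the marginal one-locus effect is computed directly as E(G|A) - E(G|a) and compared with α₁ + Dα₂/(p_A p_a).

```python
# Hypothetical allele frequencies, LD and additive effects (illustration only).
pA, pB, D = 0.3, 0.6, 0.05
mu, alpha1, alpha2 = 10.0, 2.0, 3.0
pa, pb = 1 - pA, 1 - pB

# Haplotype frequencies with D = P_AB - pA * pB.
P = {"AB": pA * pB + D, "Ab": pA * pb - D,
     "aB": pa * pB - D, "ab": pa * pb + D}

# Additive two-locus haploid genotypic values (no locus-by-locus interaction).
G = {"AB": mu + alpha1 + alpha2, "Ab": mu + alpha1,
     "aB": mu + alpha2, "ab": mu}

# One-locus expected values at locus 1, averaging over locus 2.
E_A = (P["AB"] * G["AB"] + P["Ab"] * G["Ab"]) / pA
E_a = (P["aB"] * G["aB"] + P["ab"] * G["ab"]) / pa

marginal_effect = E_A - E_a
formula = alpha1 + D * alpha2 / (pA * pa)
print(marginal_effect, formula)
```

Setting D = 0 in this sketch makes the marginal effect equal to α₁, which matches the statement that the two definitions agree when the loci are in linkage equilibrium.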

Simulation Examples

We use some numerical examples to illustrate properties of the models we have discussed. First, we consider the same example discussed in [11] of a three-allele locus with allele frequencies p₁ = 0.2 for A₁, p₂ = 0.3 for A₂, and p₃ = 0.5 for A₃. The six genotypic values are G₁₁ = 10, G₁₂ = 30, G₂₂ = 50, G₁₃ = 36, G₂₃ = 46 and G₃₃ = 42. We adopt a similar strategy to specify the genotype frequencies as: $P_{j j} = p_{j}^{2} - D$ for j = 1, 2, 3 and $P_{j k} = 2 p_{j} p_{k} + D$ for j ≠ k, where D is a measure of departure from Hardy-Weinberg equilibrium (HWE) for the three alleles at the locus and D⁻ ≤ D ≤ D⁺ with

D^{-} = - min_{j \neq k} {2 p_{j} p_{k}} = - 0.12

and

D^{+} = min_{j = 1, 2, 3} {p_{j}^{2}} = 0.04

We consider two cases: i) D = 0 for HWE, and ii) D = 0.02 for Hardy-Weinberg disequilibrium (HWD). The phenotypic value of an individual is simulated as the sum of its true genotypic value and an environmental noise from N(0, σ²), where σ² is chosen to be either 0 or 288, the latter corresponding to a 20% heritability level when D = 0. For each of the four configurations, we simulate 10,000 random samples with 1000 individuals each. For each random sample, we fit the three fully parameterized one-locus models (4), (5) and (6) under model framework (2) using the least squares approach and estimate the model parameters as well as the six genotypic values. The means and standard deviations (SD) of the least squares estimates (LSE) of the model parameters and the six genotypic values from the 10,000 random samples in fitting these three models are summarized in Table 8.
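The simulation settings can be reproduced with a few lines of arithmetic. The sketch below computes the bounds on D, the genotype frequencies under HWE and HWD, and the environmental variance that gives 20% heritability under HWE; heritability is taken here as V(G)/(V(G) + σ²), which is an assumption about the definition used, and the variable names are illustrative.

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])                      # allele frequencies A1, A2, A3
pairs = [(0, 0), (1, 1), (2, 2), (0, 1), (0, 2), (1, 2)]
G = np.array([10.0, 50.0, 42.0, 30.0, 36.0, 46.0])  # G11, G22, G33, G12, G13, G23

def genotype_freqs(D):
    """P_jj = p_j^2 - D and P_jk = 2 p_j p_k + D for j != k."""
    return np.array([p[j] ** 2 - D if j == k else 2 * p[j] * p[k] + D
                     for j, k in pairs])

# Admissible range of D keeps every genotype frequency non-negative.
D_minus = -min(2 * p[j] * p[k] for j, k in pairs if j != k)
D_plus = min(p ** 2)

P_hwe = genotype_freqs(0.0)
P_hwd = genotype_freqs(0.02)

# Genotypic mean and variance under HWE, then sigma^2 for h^2 = 0.2.
mean_G = P_hwe @ G
var_G = P_hwe @ (G - mean_G) ** 2
h2 = 0.2
sigma2 = var_G * (1 - h2) / h2
print(D_minus, D_plus, P_hwd, mean_G, var_G, sigma2)
```

Under these settings the genotypic variance is 72, so σ² = 288 gives 72/(72 + 288) = 0.2.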

Table 8 Means (SD) of LSE for three one-locus models (4), (5) and (6) when m = 3.


As each of the three models provides a re-parameterization of the six genotypic values, for each random sample the three models always give exactly the same estimates of the six genotypic values and the residual variance, as expected, even though their model parameters are defined in different ways. As a result, under each configuration, the three models have the same means and SD for the LSE of the six genotypic values and the residual variance. Without environmental variation, each model can accurately estimate its model parameters and the six genotypic values for each random sample regardless of whether there is HWE or HWD. When there is environmental variation on the phenotypes, it is known that the least squares estimators of the model parameters are unbiased under either HWE or HWD. However, the HWD may affect the variance of the least squares estimators of the model parameters and the six genotypic values. Note that the genotypic frequencies are P₁₁ = 0.04, P₂₂ = 0.09, P₃₃ = 0.25, P₁₂ = 0.12, P₁₃ = 0.20 and P₂₃ = 0.30 under HWE, while with D = 0.02 the genotypic frequencies become P₁₁ = 0.02, P₂₂ = 0.07, P₃₃ = 0.23, P₁₂ = 0.14, P₁₃ = 0.22 and P₂₃ = 0.32. So, under HWD, we tend to have more individuals carrying genotypes A₁A₂, A₁A₃, A₂A₃ but fewer individuals carrying genotypes A₁A₁, A₂A₂, A₃A₃ in the random samples than under HWE. When the true genotypic values are unknown, having more individuals with a given genotype in a random sample provides a better estimate of the corresponding genotypic value. This explains why under HWD the estimates of G₁₁, G₂₂ and G₃₃ have larger SD (or variances) than under HWE, and the estimates of G₁₂, G₁₃ and G₂₃ have smaller variances than under HWE.

As another example, let us consider the statistical modeling of two-locus genotypic values G_jkrs , where the first locus has three alleles A₁, A₂, A₃ and the second locus has two alleles B₁, B₂. Assume that the alleles at locus 1 have the same allele frequencies as those in the previous example; i.e., p₁ = 0.2 for A₁, p₂ = 0.3 for A₂, and p₃ = 0.5 for A₃, while the two alleles at locus 2 have frequencies q₁ = 0.2 for B₁ and q₂ = 0.8 for B₂. The two-locus genotypic values G₂ = (G_jkrs ), j, k = 1, 2, 3; r, s = 1, 2 are given by

\begin{gathered} G_{2} = [\begin{matrix} G_{1111} & G_{1112} & G_{1122} \\ G_{2211} & G_{2212} & G_{2222} \\ G_{3311} & G_{3312} & G_{3322} \\ G_{1211} & G_{1212} & G_{1222} \\ G_{1311} & G_{1312} & G_{1322} \\ G_{2311} & G_{2312} & G_{2322} \end{matrix}] \\ = [\begin{matrix} 10 & 10.9 & 9.6 \\ 50 & 50.3 & 49.9 \\ 42 & 42.6 & 41.2 \\ 30 & 30.5 & 29.6 \\ 36 & 36.8 & 35.4 \\ 46 & 46.7 & 45.2 \end{matrix}] \end{gathered}

which are modified values from the previous one-locus model in a way that G_jk11 = G_jk , G_jk12 = G_jk + e_1jk and G_jk22 = G_jk - e_2jk , with e_1jk and e_2jk being some small positive fluctuations according to the genotypes B₁B₂ and B₂B₂ at locus 2. We assume Hardy-Weinberg equilibria at both loci and specify their haplotype frequencies as: h₁₁ = p₁q₁ - D₁, h₁₂ = p₁q₂ + D₁, h₂₁ = p₂q₁ + D₂, h₂₂ = p₂q₂ - D₂, h₃₁ = p₃q₁ + (D₁ - D₂), h₃₂ = p₃q₂ - (D₁ - D₂), where D₁ (and D₂) are the linkage disequilibria (LD) between alleles A₁ and B₂ (and A₂ and B₁) at the two loci. We consider two scenarios: i) D₁ = D₂ = 0 for linkage equilibrium (LE); and ii) D₁ = 0, D₂ = 0.03 for LD. The phenotypic value of an individual is still simulated as the sum of its genotypic value and an environmental noise from N(0, σ²), where σ² is chosen to be either 0 or 286, the latter corresponding to a 20% heritability level when D₁ = D₂ = 0. For each of the four configurations, we simulate 10,000 random samples with 1000 individuals each. For each random sample, we consider fitting models under three model frameworks: i) one-locus models (4), (5) and (6) at locus 1 under model framework (2); ii) two-locus models without epistases from the three coding schemes under model framework (16); iii) fully parameterized two-locus models (13), (14) and (15) with epistases. Still, for each random sample, the three models from the different coding schemes under the same model framework give exactly the same estimates of the 18 genotypic values, as expected (results not shown here). As a result, under each model framework, the three models have the same means and SD for the LSE of the 18 genotypic values and the residual variance, although the means and SD for the LSE of their model parameters are different. To compare the LSE of model parameters for models from the same coding under different model frameworks, we summarize in Table 9 the means and SD of the LSE of the model parameters from the 10,000 random samples in fitting the three allele-coding models: the one-locus model (4), the two-locus model under model framework (16), and the fully parameterized two-locus model (13). Models from the other two coding schemes behave similarly.

Table 9 Means (SD) of LSE for three allele-coding models regarding the two-locus genotypic values


As we mentioned before, the one-locus models are actually modeling the expected genotypic values given the genotypes at locus 1. When D₁ = D₂ = 0, we can show that the expected genotypic values at locus 1 are G₁₁ = 10.03, G₂₂ = 50.03, G₃₃ = 41.68, G₁₂ = 29.90, G₁₃ = 35.87 and G₂₃ = 45.71, which correspond to μ = 41.68, α₁₁ = -5.81, α₁₂ = 4.03, δ₁₁₁ = -20.03, δ₁₂₂ = 0.29 and δ₁₁₂ = -10 as the true parameters in the allele coding one-locus model. When D₁ = 0, D₂ = 0.03, the expected genotypic values at locus 1 become G₁₁ = 10.03, G₂₂ = 50.08, G₃₃ = 41.55, G₁₂ = 29.97, G₁₃ = 35.81 and G₂₃ = 45.77, which correspond to μ = 41.55, α₁₁ = -5.74, α₁₂ = 4.21, δ₁₁₁ = -20.04, δ₁₂₂ = 0.09 and δ₁₁₂ = -10.06 as the true parameters in the allele coding one-locus model. In both cases, the least square estimators of the one-locus model parameters are unbiased estimators of the true parameters. Note that, unlike the one-locus model in the previous example, the LSE of the model parameters are no longer exactly the same as the true values even when no environmental noises are involved. The reason is that the expected genotypic values at locus 1 depend on not only the genotypic values but also the joint genotype frequencies in the sample, which may change slightly from sample to sample due to the sampling variation.
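The expected one-locus genotypic values quoted above can be recomputed from the two-locus values. The sketch below assumes that two-locus genotypes arise from random union of haplotypes and that the haplotype frequencies within each locus-1 allele sum to that allele's frequency (so h₂₁ = p₂q₁ + D₂ and h₂₂ = p₂q₂ - D₂); the allele coding parameters are then obtained as contrasts of the one-locus values with A₃ as the reference allele. Function and variable names are illustrative.

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])   # A1, A2, A3
q = np.array([0.2, 0.8])        # B1, B2
pairs = [(0, 0), (1, 1), (2, 2), (0, 1), (0, 2), (1, 2)]  # 11, 22, 33, 12, 13, 23
# Rows follow `pairs`; columns are B1B1, B1B2, B2B2.
G2 = np.array([[10, 10.9, 9.6], [50, 50.3, 49.9], [42, 42.6, 41.2],
               [30, 30.5, 29.6], [36, 36.8, 35.4], [46, 46.7, 45.2]])

def haplotype_freqs(D1, D2):
    """Haplotype frequencies h_jr (assumed form; rows sum to p_j)."""
    return np.array([[p[0] * q[0] - D1, p[0] * q[1] + D1],
                     [p[1] * q[0] + D2, p[1] * q[1] - D2],
                     [p[2] * q[0] + (D1 - D2), p[2] * q[1] - (D1 - D2)]])

def one_locus_values(D1, D2):
    """E(G_jk) = sum_rs P(B_r B_s | A_j A_k) G_jkrs under random haplotype union."""
    cond = haplotype_freqs(D1, D2) / p[:, None]      # P(B_r | A_j)
    values = []
    for (j, k), row in zip(pairs, G2):
        b, c = cond[j], cond[k]
        probs = np.array([b[0] * c[0], b[0] * c[1] + b[1] * c[0], b[1] * c[1]])
        values.append(probs @ row)
    return np.array(values)

def allele_coding_params(E):
    """mu, alpha_11, alpha_12, delta_111, delta_122, delta_112 (reference A3)."""
    G11, G22, G33, G12, G13, G23 = E
    return np.array([G33, G13 - G33, G23 - G33,
                     G11 - 2 * G13 + G33, G22 - 2 * G23 + G33,
                     G12 - G13 - G23 + G33])

E_le = one_locus_values(0.0, 0.0)
E_ld = one_locus_values(0.0, 0.03)
print(np.round(E_le, 2), np.round(allele_coding_params(E_le), 2))
print(np.round(E_ld, 2), np.round(allele_coding_params(E_ld), 2))
```

Under these assumptions the computed values agree with the figures quoted in the text to two decimal places for both the LE and the LD scenario.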

The two-locus model without epistases cannot provide unbiased estimators for all the genotypic values because of the model mis-specification. However, the LSE of its parameters associated with locus 1 are similar to the ones in the one-locus model at locus 1. In fact, as we know from the linear model theory, the true values of its parameters associated with locus 1 are the same as the ones defined in the one-locus model at locus 1 when the two loci are in LE. Under LD, the least squares estimators of its model parameters associated with locus 1 could be biased, and the bias depends on the LD setting.

The two-locus model with epistases gives a full re-parameterization of the 18 genotypic values. Therefore, when no environmental noises are involved, the LSE of its model parameters are exactly the same as their true values for each random sample regardless of the LD between the two loci. It has to be pointed out that this phenomenon holds only when the random sample contains all the 18 possible genotypes. In our simulation setting, the frequencies for certain genotypes such as A₁A₁B₁B₁, A₁A₃B₁B₁ and A₂A₂B₁B₁ are quite small. As a result, we occasionally (in about 22-23% of the 10,000 random samples) obtain a random sample that has no individuals carrying certain genotypes. In this case, the design matrix in the fully parameterized model becomes singular and the LSE of the model parameters are no longer unique. To keep our illustration of the model properties simple, we excluded those random samples in fitting the two-locus model with epistases (reduced models are less likely to have singular design matrices). Other techniques such as ridge regression could be applied to handle those skewed random samples. In the presence of environmental noises, it is also noted that the LSE for some of its model parameters such as δ₁₁₁, (δ₁₁₁α₂₁) and (δ₁₁₁δ₂₁₁) have much larger SD than the LSE of other parameters. This is due to the low frequencies of genotypes A₁A₁B₁B₁, A₁A₃B₁B₁ and A₂A₂B₁B₁. As a random sample has few individuals carrying these genotypes, it has reduced accuracy in estimation of their corresponding true genotypic values to which the model parameters δ₁₁₁, (δ₁₁₁α₂₁) and (δ₁₁₁δ₂₁₁) are related.
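The proportion of samples that miss at least one of the 18 genotypes can be approximated by a small Monte Carlo experiment. The sketch below covers only the LE scenario with HWE at both loci, so that the joint genotype frequencies are products of the one-locus frequencies; the seed and the number of replicates are arbitrary choices, not the settings of the original study.

```python
import numpy as np

# One-locus genotype frequencies under HWE.
locus1 = np.array([0.04, 0.09, 0.25, 0.12, 0.20, 0.30])  # 11, 22, 33, 12, 13, 23
locus2 = np.array([0.04, 0.32, 0.64])                    # B1B1, B1B2, B2B2

# Under LE the 18 joint genotype frequencies are simple products.
joint = np.outer(locus1, locus2).ravel()

rng = np.random.default_rng(0)       # arbitrary seed for reproducibility
n_individuals, n_samples = 1000, 20000
counts = rng.multinomial(n_individuals, joint, size=n_samples)

# A sample is "incomplete" if any of the 18 genotype counts is zero, in which
# case the fully parameterized design matrix would be singular.
frac_missing = np.mean((counts == 0).any(axis=1))
print(frac_missing)
```

Most incomplete samples are missing A₁A₁B₁B₁, whose frequency is only 0.04 × 0.04 = 0.0016 in this setting.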

Discussion

In this study, we introduced three genotype coding schemes to build F_∞ models for multi-allele markers. The relationships between the model parameters and the expected genotypic values were established in some fully parameterized as well as reduced one-locus and two-locus F_∞ models. Our results showed that the relationships between the model parameters and the expected genotypic values could become more intricate in the multi-allele case than in the biallelic case, even though the extension of the coding schemes from biallelic to multiple alleles appears straightforward. We built the relationships between different model parameters mainly through their coding variables of marker genotypes, which simplified the tedious derivation process compared with the classical matrix approach. The F_∞ models we proposed can be used directly for association testing of multi-allele markers and their possible interactions with quantitative traits using random unrelated samples. These F_∞ models could also be applied to test for the risk haplotypes and their interactions when incorporated with the likelihood approach (e.g., [20]), or to analyze family data by combining them with the likelihood to account for the transmission probability of alleles from parents to their offspring. Although our discussion focused on genetic modeling of quantitative traits, the results can be extended to other phenotypic traits such as binary outcomes in case-control studies using logistic regression models or time-to-event data using the Cox proportional hazard models.

Throughout the paper, we assumed that all the possible genotypes are available from the sampled individuals. If certain genotypes are not observable, then the expected genotypic values on these genotypes will not be estimable by themselves, which could change the interpretation of the model parameters as well. The models we have presented can also be modified to handle the situation when some individuals have missing genotypes at certain marker loci. When the missing genotypes at a marker locus have both alleles missing at the same time, we can simply introduce an indicator variable to code for the missing genotype at the marker. The regression coefficient of this indicator variable for this missing genotype can usually be interpreted as the difference between the expected genotypic value with missing genotype at the marker locus and the intercept of the model, while the other regression coefficients would keep the same interpretation as before.

It has to be pointed out that the relationships between the model parameters and the expected genotypic values are based on the assumption that the models correctly specify the structure of the expected genotypic values. When a fully parameterized model is applied, the definition of its model parameters does not depend on the allele frequencies, HWD among alleles within a locus, or the LD structure between alleles at different loci. In fitting a reduced model, however, the simplified model may not be totally correct in modeling all the expected genotypic values. In this case, depending on how accurately the simplified model approximates the expected genotypic values, the allele frequencies, HWD and LD structure between marker alleles could affect the definition and the LSE of its model parameters. In the presence of environmental variation in the phenotypic values, regardless of whether a fully parameterized or reduced model is applied, the allele frequencies, HWD or LD between marker alleles may affect the LSE of the model parameters and the power to detect the associated marker alleles, as shown in our simulation studies.

All the models we have discussed so far are F_∞ models. Statistically, these F_∞ models are fixed-effect models, which focus on modeling the expected genotypic values directly. On the other hand, Fisher's ANOVA models, which target the evaluation of the variation contributed by various allelic effects and interactions, can be treated as random-effect models (see [21]) in which the expected genotypic values come from a discrete random variable G(g) = E(G|g), with its limited set of genotypes g being randomly sampled from a study population. Both the F_∞ and the Fisher type models form a basis for the analysis of quantitative traits, and they provide different perspectives in assessing the genetic effects of QTL and markers. For biallelic markers, we proposed in [10] a 'mean corrected' Fisher (mc-Fisher) model for decomposition of the genotypic variance. In the multi-allele marker case, we can also construct similar mc-Fisher models by applying mean corrections to all the indicator variables of the paternal and maternal alleles in the allele coding F_∞ models. For example, based on the allele coding model (4), we can construct its corresponding mc-Fisher model by replacing the coding variables w_j and v_jk with ${\bar{w}}_{j} = w_{j} - 2 p_{j}$ and ${\bar{v}}_{j k} = (z_{1 j} - p_{j}) (z_{2 k} - p_{k}) = v_{j k} - (p_{j} w_{k} + p_{k} w_{j}) / 2 + p_{j} p_{k}$, respectively, where p_j is the allele frequency of A_j. Then the additive and dominance genetic variance components V_A and V_D of G(g), which are defined as the variation contributed by the additive allelic effects and the allelic interactions respectively, can be estimated from the ${\bar{w}}_{j}$'s and ${\bar{v}}_{j k}$'s separately. As pointed out in [10], the mc-Fisher model can provide an orthogonal partition of V(G) into the sum of V_A and V_D under Hardy-Weinberg equilibrium, and it can be fitted through the standard least-squares regression approach. 
Similar to the F_∞ models, the definition of the model parameters in such an mc-Fisher model also depends on the choice of the reference allele 'A_m'. But the estimates of the additive and dominance variance components V_A and V_D do not depend on such a choice. In addition, when a fully parameterized model is applied, the mc-Fisher model is equivalent to its original F_∞ model in modeling the expected genotypic values. Therefore, both models have the same residual variance and the same F-statistic for testing the overall effect of the marker locus. When reduced models are applied, the mc-Fisher model may no longer be equivalent to its original F_∞ model, especially when allelic interactions are involved.

Of the three coding schemes that we have discussed, the F_∞ coding is perhaps the most widely used in current genetic association studies of quantitative traits. From what we have shown, the three coding schemes essentially lead to equivalent models and have the same power to detect various genetic effects. In practice, just as with the existing coding schemes such as 'Reference', 'GLM' and 'Effect' that are commonly used in the analysis of categorical covariates [22], we usually only need to adopt one specific coding scheme in building the regression models. Which coding scheme should be applied depends on how conveniently it provides statistical inferences on the parameters of research interest. In general, the allele coding models can provide direct estimates of certain substitution effects of alleles and allelic interactions and, in the two-locus case, the allele coding models are perhaps the easiest among the three codings for building the relationships between their model parameters and the expected genotypic values. Besides, they are generically linked to the genetic variance components, as we have shown above. On the other hand, the allele-count coding models are attractive in that they often lead to simple comparisons among the three genotypic groups with 0, 1 or 2 copies of a particular allele. In the two-locus case, the definition of the model parameters in the allele-count coding models also remains as simple as (if not simpler than) that in the allele coding models, even in the presence of epistases. Meanwhile, both the allele and allele-count codings have the advantage that their lower-order main effects keep the same interpretation regardless of whether epistases are included in the model or not. In contrast, the definition of the lower-order main effects in the F_∞ coding models may vary depending on the absence or presence of epistases in the models. 
Even though the one-locus F_∞ coding model parameters are closely related to the additive and dominance effects, the two-locus F_∞ coding model parameters, including the lower-order main effects, have more complicated interpretations than those in the allele or allele-count coding models, especially when epistases are involved.

The coding of marker genotypes is not limited to the three allele-based coding schemes that we have discussed. Application of a coding scheme could also be subject to the number of individuals available in each genotype group. For example, under the model framework (7), the allele coding scheme typically creates w_j (g), v_j (g) for each allele type A_j , j = 1, ..., m. When the group of a homozygous genotype A_jA_j includes very few individuals for a particular allele A_j , we may want to combine this genotypic group with another genotype such as the one carrying one copy of the allele A_j . Then we can replace the original w_j (g) and v_j (g) by an allele presence-absence coding variable d_j (g) for this specific allele A_j while keeping two coding variables w_k (g), v_k (g) for other alleles A_k , which leads to a mixed use of the allele coding and this allele presence-absence coding variable. In certain situations, the genotype-based coding could also be very useful as it can provide direct tests of pair-wise comparisons of certain genotypic values. Compared with the genotype-based coding, the allele coding has the advantage of further dissecting the genetic effects into the allelic effects and allelic interactions, which allows us to specify reduced models with varying degrees of interactions among the main allelic effects, a useful tool in the model building procedures. Given a fixed coding, the likelihood ratio test can be applied to compare a full model with its reduced models. Statistical model selection tools such as the AIC and BIC criteria, which balance the goodness of fit to the data against the complexity of the models in terms of the number of parameters, could also be used to compare some non-nested reduced models or frameworks. 
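The model comparison described above can be sketched with ordinary least squares. The example below is an illustration rather than the procedure used in this paper: it assumes three alleles, hypothetical allele frequencies and effect sizes, a fully parameterized one-locus design with columns 1, w_1, ..., w_{m-1}, v_1, ..., v_m, and an additive-only reduced design that drops the v_j columns. AIC and BIC are computed from the Gaussian log-likelihood up to an additive constant.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 300

# Draw two alleles per individual (hypothetical allele frequencies).
alleles = rng.choice(m, size=(n, 2), p=[0.5, 0.3, 0.2])
w = np.stack([(alleles == j).sum(axis=1) for j in range(m)], axis=1).astype(float)
v = (w == 2).astype(float)  # homozygote indicators v_j(g)

# Phenotype simulated from additive allelic effects only (illustrative values).
y = 10.0 + 1.0 * w[:, 0] + 0.5 * w[:, 1] + rng.normal(size=n)

ones = np.ones((n, 1))
X_full = np.hstack([ones, w[:, : m - 1], v])  # 2m columns
X_add = np.hstack([ones, w[:, : m - 1]])      # m columns (reduced model)

def fit_rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def aic_bic(rss, n, k):
    """Gaussian AIC and BIC up to a constant; k counts coefficients plus the variance."""
    base = n * np.log(rss / n)
    return base + 2 * k, base + k * np.log(n)

rss_full, rss_add = fit_rss(X_full, y), fit_rss(X_add, y)
aic_full, bic_full = aic_bic(rss_full, n, X_full.shape[1] + 1)
aic_add, bic_add = aic_bic(rss_add, n, X_add.shape[1] + 1)

# Likelihood ratio statistic for the nested comparison and its degrees of freedom.
lr_stat = n * np.log(rss_add / rss_full)
df = X_full.shape[1] - X_add.shape[1]
print(f"LR = {lr_stat:.3f} on {df} df; AIC full/add = {aic_full:.1f}/{aic_add:.1f}")
```

Under the reduced model, the likelihood ratio statistic is approximately chi-square distributed with `df` degrees of freedom in large samples; a smaller AIC or BIC indicates the preferred model.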
The current study focuses on establishing the theoretical relationships between the model parameters and the expected genotypic values according to different coding schemes under various model frameworks. A power comparison of some reduced models from different coding schemes under various scenarios with respect to the allele frequencies and possible HWD or LD between marker alleles is beyond the scope of this study and may be worth further exploration.

Conclusions

In summary, we introduced three allele-based coding schemes to construct F_∞ models for association testing of multi-allele genetic markers with quantitative traits. Depending upon whether certain allelic effects or comparisons between genotypic groups are of the main research interest, investigators may adopt one of the three allele-based codings (i.e., allele, F_∞ or allele-count), or perhaps a genotype-based coding, in building an F_∞ model. Based on the F_∞ model from a given coding scheme, standard regression model fitting tools can then be applied to estimate or test for various genetic effects. Understanding the definition of the model parameters from different coding schemes under various model frameworks is crucial for constructing appropriate hypothesis tests and making correct statistical inferences in genetic association studies.

Appendices

A. Estimability of parameters in model (8)

Let G = (G(g₁), ..., G(g_N )) ^T denote the vector of the expected genotypic values of all the individuals in the sample, and $β^{*} = {(μ^{*}, α_{1}^{*}, . . ., α_{m}^{*}, δ_{1}^{*}, . . ., δ_{m}^{*})}^{T}$ be the vector of all the model parameters. We can rewrite model (8) in matrix form as G = Xβ* + e, where e = (e₁, ..., e_N ) ^T and the design matrix X is

X = [1_{N} W_{1} W_{2} \cdot \cdot \cdot W_{m} V_{1} V_{2} \cdot \cdot \cdot V_{m}]

(18)

with W_j = (w_j (g₁), ..., w_j (g_N )) ^T and V_j = (v_j (g₁), ..., v_j (g_N )) ^T for j = 1, ..., m. As every individual carries two and only two alleles at the locus, we have $\sum_{j = 1}^{m} W_{j} = 2 \cdot 1_{N}$ , which means that the first (m+1) column vectors 1_N, W₁, W₂, ..., W_m of the design matrix X are linearly dependent. So, rank(X) ≤ 2m; i.e., X is not a full column rank matrix.
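The linear dependency among the first m + 1 columns can be checked numerically. The sketch below assumes, consistent with the genotypic-value expressions used in this appendix, that w_j(g) counts the copies of allele A_j in genotype g and that v_j(g) indicates the homozygote A_jA_j; the function name `code_genotypes` and the 0-based allele indices are hypothetical.

```python
import numpy as np

def code_genotypes(genotypes, m):
    """Return (W, V): allele counts w_j(g) and homozygote indicators v_j(g)."""
    n = len(genotypes)
    W = np.zeros((n, m))
    V = np.zeros((n, m))
    for i, (a, b) in enumerate(genotypes):
        W[i, a] += 1
        W[i, b] += 1
        if a == b:
            V[i, a] = 1.0
    return W, V

m = 3
# One individual for each of the m(m+1)/2 possible genotypes (alleles indexed from 0).
genotypes = [(j, k) for j in range(m) for k in range(j, m)]
W, V = code_genotypes(genotypes, m)

# Design matrix X = [1_N, W_1, ..., W_m, V_1, ..., V_m].
X = np.column_stack([np.ones(len(genotypes)), W, V])

print(W.sum(axis=1))              # every entry equals 2
print(np.linalg.matrix_rank(X))   # strictly less than the 2m + 1 columns
```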

From (7), we have $G_{j k} = μ^{*} + α_{j}^{*} + α_{k}^{*}$ for j ≠ k, and $G_{j j} = μ^{*} + 2 α_{j}^{*} + δ_{j}^{*}$ . If we write G₀ = (G₁₂, G₁₃, ..., G_1m, G₂₃, ..., G_{m-1, m}, G₁₁, ..., G_mm ) ^T , then this model gives

G_{0} = X_{0} β^{*} = [\begin{matrix} 1_{s} & X_{A} & 0_{s \times m} \\ 1_{m} & 2 \cdot I_{m \times m} & I_{m \times m} \end{matrix}] β^{*}

where s = m(m - 1)/2, and

X_{A} = [\begin{matrix} 1 & 1 & 0 & \dots & 0 \\ 1 & 0 & 1 & \dots & 0 \\ ⋮ & ⋮ & ⋮ & ⋱ & ⋮ \\ 1 & 0 & 0 & \dots & 1 \\ 0 & 1 & 1 & \dots & 0 \\ ⋮ & ⋮ & ⋮ & ⋱ & ⋮ \\ 0 & 0 & 0 & \dots & 1 \end{matrix}]

Assume that the genotypes of the sampled individuals cover all possible genotypes A_jA_k for j, k = 1, ..., m. Then the design matrix X includes all the row vectors of X₀, which implies that rank(X) ≥ rank(X₀). It is clear that rank(X₀) = m + rank([1_sX_A ]). Because every row of X_A sums to 2, the vector 1_s = X_A 1_m /2 lies in the column space of X_A , so rank([1_sX_A ]) = rank(X_A ), and it can be shown that rank(X_A ) = m when m ≥ 3. Therefore, combined with rank(X) ≤ 2m, we have rank(X) = 2m when m ≥ 3. Note that when m = 2, we have s = 1 and rank(X) = 3.
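The rank statements can be confirmed numerically for small m. The following sketch builds X_A and X₀ from the block form given above and computes their ranks; the helper name `build_X0` is hypothetical, and the check is a numerical illustration for a few values of m rather than a proof.

```python
import numpy as np
from itertools import combinations

def build_X0(m):
    """Return (X_A, X0) for m alleles, following the block form of X0."""
    pairs = list(combinations(range(m), 2))  # heterozygotes A_jA_k with j < k
    s = len(pairs)                           # s = m(m - 1)/2
    X_A = np.zeros((s, m))
    for row, (j, k) in enumerate(pairs):
        X_A[row, j] = 1.0
        X_A[row, k] = 1.0
    top = np.hstack([np.ones((s, 1)), X_A, np.zeros((s, m))])
    bottom = np.hstack([np.ones((m, 1)), 2.0 * np.eye(m), np.eye(m)])
    return X_A, np.vstack([top, bottom])

ranks = {}
for m in range(2, 7):
    X_A, X0 = build_X0(m)
    ranks[m] = (int(np.linalg.matrix_rank(X_A)), int(np.linalg.matrix_rank(X0)))

for m, (rank_A, rank_0) in ranks.items():
    print(f"m = {m}: rank(X_A) = {rank_A}, rank(X0) = {rank_0}")
```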

From the theory of linear models, we know that for a vector λ = (λ₀, ..., λ_2m) ^T ∈ R^{2m+1}, a linear function λ^Tβ* of β* is estimable if and only if $λ ⊥ N (X)$ , where $N (X) = \{c \in R^{2 m + 1} | X c = 0\}$ is the null space of the design matrix X. It is also known that $N (X) \oplus ℛ (X) = R^{2 m + 1}$ , where $ℛ (X)$ is the linear space generated by the row vectors of X. Hence, when m ≥ 3, we have $\dim N (X) = (2 m + 1) - \dim ℛ (X) = (2 m + 1) - 2 m = 1$ . Note that $c = {(2, - 1_{m}^{'}, 0_{m}^{'})}^{'} \in N (X)$ due to the linear dependency among the column vectors 1_N, W₁, W₂, ..., W_m in the design matrix X. Therefore, for a vector λ = (λ₀, λ₁, ..., λ_2m) ^T ∈ R^{2m+1}, the linear function λ^Tβ* is estimable if and only if λ ⊥ c, or equivalently, $2 λ_{0} = \sum_{j = 1}^{m} λ_{j}$ . As a result, we know that in model (8) the functions of model parameters $G_{j k} = μ^{*} + α_{j}^{*} + α_{k}^{*}$ for j ≠ k, and $G_{j j} = μ^{*} + 2 α_{j}^{*} + δ_{j}^{*}$ for j = 1, ..., m are estimable, and the parameters $δ_{j}^{*} = G_{j j} - (μ^{*} + 2 α_{j}^{*}) = G_{j j} + G_{k l} - G_{j l} - G_{j k}$ , where j, k and l are mutually distinct, are also estimable for j = 1, ..., m. But the parameters μ* and $α_{1}^{*}, . . ., α_{m}^{*}$ themselves are not estimable.
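The estimability criterion can be checked directly. The sketch below assumes a design matrix with one row per possible genotype and m = 4 alleles; it tests whether λ lies in the row space of X by comparing matrix ranks, which is equivalent to λ ⊥ N(X). The helper names `full_design` and `is_estimable` are hypothetical.

```python
import numpy as np

def full_design(m):
    """Design matrix [1, w_1..w_m, v_1..v_m] with one row per possible genotype."""
    rows = []
    for j in range(m):
        for k in range(j, m):
            w = np.zeros(m)
            v = np.zeros(m)
            w[j] += 1
            w[k] += 1
            if j == k:
                v[j] = 1.0
            rows.append(np.concatenate([[1.0], w, v]))
    return np.array(rows)

def is_estimable(lam, X):
    """lam' beta is estimable iff appending lam to X does not increase the rank."""
    return np.linalg.matrix_rank(np.vstack([X, lam])) == np.linalg.matrix_rank(X)

m = 4
X = full_design(m)
c = np.concatenate([[2.0], -np.ones(m), np.zeros(m)])  # spans the null space of X

e = np.eye(2 * m + 1)
candidates = {
    "mu*": e[0],
    "alpha_1*": e[1],
    "delta_1*": e[m + 1],
    "alpha_1* - alpha_2*": e[1] - e[2],
    "G_12 = mu* + alpha_1* + alpha_2*": e[0] + e[1] + e[2],
}
results = {name: bool(is_estimable(lam, X)) for name, lam in candidates.items()}
for name, ok in results.items():
    # The rank-based answer agrees with the orthogonality condition lam . c = 0.
    print(f"{name}: estimable = {ok}, lam.c = {candidates[name] @ c:+.0f}")
```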

B. Estimability of parameters in model (9)

For model (9), we have its design matrix

W = [1_{N} W_{1} W_{2} \cdot \cdot \cdot W_{m - 1} V_{1} V_{2} \cdot \cdot \cdot V_{m}]

where W_j = (w_j (g₁), ..., w_j (g_N )) ^T and V_j = (v_j (g₁), ..., v_j (g_N )) ^T for j = 1, ..., m. It can be shown that W and the design matrix X defined in (18) for model (8) have the following relationship: W = XT or X = WS^T , where

T = [\begin{matrix} I_{m} & 0 \\ 0_{1 \times m} & 0_{1 \times m} \\ 0 & I_{m} \end{matrix}]

and

S^{T} = [\begin{matrix} I_{m} & d & 0_{m \times m} \\ 0_{m \times m} & 0_{m \times 1} & I_{m} \end{matrix}]

with d = (2, -1, ..., -1)' ∈ R^m . Let β = (μ, α₁, ..., α_{m-1}, δ₁, ..., δ_m ) ^T . Since (8) and (9) are two equivalent models, we have G = Xβ* = WS^Tβ* = Wβ, which yields

β = [\begin{matrix} μ \\ α_{1} \\ ⋮ \\ α_{m - 1} \\ δ_{1} \\ ⋮ \\ δ_{m} \end{matrix}] = S^{T} β^{*} = [\begin{matrix} μ^{*} + 2 α_{m}^{*} \\ α_{1}^{*} - α_{m}^{*} \\ ⋮ \\ α_{m - 1}^{*} - α_{m}^{*} \\ δ_{1}^{*} \\ ⋮ \\ δ_{m}^{*} \end{matrix}]

From this relationship, we have $δ_{j} = δ_{j}^{*}$ , j = 1, ..., m, which are estimable as shown in Appendix A. Besides, the intercept $μ = μ^{*} + 2 α_{m}^{*} = G_{j m} + G_{k m} - G_{j k}$ , with j, k and m mutually distinct, and $α_{j} = α_{j}^{*} - α_{m}^{*} = G_{j k} - G_{k m}$ , k ≠ j, k ≠ m, j = 1, ..., m - 1, are also estimable.
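The reparameterization between models (8) and (9) can be verified numerically for a small case. The sketch below assumes m = 3, a design with one row per genotype, and transformation matrices shaped so that the products are defined: T is (2m + 1) × 2m and S^T is taken as the 2m × (2m + 1) block matrix [[I_m, d, 0], [0, 0, I_m]]. The parameter values are random and purely illustrative.

```python
import numpy as np

m = 3
rng = np.random.default_rng(0)

# Design matrix X = [1, W_1..W_m, V_1..V_m] with one row per possible genotype.
rows = []
for j in range(m):
    for k in range(j, m):
        w = np.zeros(m)
        v = np.zeros(m)
        w[j] += 1
        w[k] += 1
        if j == k:
            v[j] = 1.0
        rows.append(np.concatenate([[1.0], w, v]))
X = np.array(rows)

I = np.eye(m)
Z = np.zeros((m, m))
# T drops the column W_m from X.
T = np.vstack([np.hstack([I, Z]), np.zeros((1, 2 * m)), np.hstack([Z, I])])
# S^T rebuilds W_m = 2 * 1_N - (W_1 + ... + W_{m-1}) from the columns of W.
d = np.concatenate([[2.0], -np.ones(m - 1)]).reshape(m, 1)
S_T = np.vstack([np.hstack([I, d, Z]), np.hstack([Z, np.zeros((m, 1)), I])])

W = X @ T                                   # design matrix of model (9)
beta_star = rng.normal(size=2 * m + 1)      # (mu*, alpha_1*..alpha_m*, delta_1*..delta_m*)
beta = S_T @ beta_star                      # (mu, alpha_1..alpha_{m-1}, delta_1..delta_m)

print(np.allclose(W @ S_T, X))              # X = W S^T
print(np.allclose(W @ beta, X @ beta_star)) # both models give the same genotypic values
```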

C. Relationships for fully parameterized two-locus models

(C.1) Relationships between parameters of the fully parameterized two-locus model (13) and the expected genotypic values are

\left\{\begin{array}{l}
μ = G_{m_{1} m_{1} m_{2} m_{2}} \\
α_{1 j} = G_{j m_{1} m_{2} m_{2}} - μ = G_{j m_{1} m_{2} m_{2}} - G_{m_{1} m_{1} m_{2} m_{2}} \\
α_{2 r} = G_{m_{1} m_{1} r m_{2}} - μ = G_{m_{1} m_{1} r m_{2}} - G_{m_{1} m_{1} m_{2} m_{2}} \\
δ_{1 j k} = G_{j k m_{2} m_{2}} - α_{1 j} - α_{1 k} - μ \\
\quad = G_{j k m_{2} m_{2}} - (G_{j m_{1} m_{2} m_{2}} + G_{m_{1} k m_{2} m_{2}}) + G_{m_{1} m_{1} m_{2} m_{2}} \\
δ_{2 r s} = G_{m_{1} m_{1} r s} - α_{2 r} - α_{2 s} - μ \\
\quad = G_{m_{1} m_{1} r s} - (G_{m_{1} m_{1} r m_{2}} + G_{m_{1} m_{1} s m_{2}}) + G_{m_{1} m_{1} m_{2} m_{2}} \\
(α_{1 j} α_{2 r}) = G_{j m_{1} r m_{2}} - α_{1 j} - α_{2 r} - μ \\
\quad = G_{j m_{1} r m_{2}} - (G_{j m_{1} m_{2} m_{2}} + G_{m_{1} m_{1} r m_{2}}) + G_{m_{1} m_{1} m_{2} m_{2}} \\
(δ_{1 j k} α_{2 r}) = G_{j k r m_{2}} - α_{1 j} - α_{1 k} - δ_{1 j k} - α_{2 r} - (α_{1 j} α_{2 r}) - (α_{1 k} α_{2 r}) - μ \\
\quad = G_{j k r m_{2}} - (G_{j k m_{2} m_{2}} + G_{j m_{1} r m_{2}} + G_{k m_{1} r m_{2}}) \\
\quad + (G_{j m_{1} m_{2} m_{2}} + G_{k m_{1} m_{2} m_{2}} + G_{m_{1} m_{1} r m_{2}}) - G_{m_{1} m_{1} m_{2} m_{2}} \\
(α_{1 j} δ_{2 r s}) = G_{j m_{1} r s} - α_{2 r} - α_{2 s} - δ_{2 r s} - α_{1 j} - (α_{1 j} α_{2 r}) - (α_{1 j} α_{2 s}) - μ \\
\quad = G_{j m_{1} r s} - (G_{m_{1} m_{1} r s} + G_{j m_{1} r m_{2}} + G_{j m_{1} s m_{2}}) \\
\quad + (G_{j m_{1} m_{2} m_{2}} + G_{m_{1} m_{1} r m_{2}} + G_{m_{1} m_{1} s m_{2}}) - G_{m_{1} m_{1} m_{2} m_{2}} \\
(δ_{1 j k} δ_{2 r s}) = G_{j k r s} - α_{1 j} - α_{1 k} - δ_{1 j k} - α_{2 r} - α_{2 s} - δ_{2 r s} \\
\quad - (α_{1 j} α_{2 r}) - (α_{1 j} α_{2 s}) - (α_{1 k} α_{2 r}) - (α_{1 k} α_{2 s}) \\
\quad - (α_{1 j} δ_{2 r s}) - (α_{1 k} δ_{2 r s}) - (δ_{1 j k} α_{2 r}) - (δ_{1 j k} α_{2 s}) - μ \\
\quad = G_{j k r s} - (G_{j m_{1} r s} + G_{k m_{1} r s} + G_{j k r m_{2}} + G_{j k s m_{2}}) \\
\quad + (G_{j k m_{2} m_{2}} + G_{j m_{1} r m_{2}} + G_{k m_{1} r m_{2}} + G_{j m_{1} s m_{2}} + G_{k m_{1} s m_{2}} + G_{m_{1} m_{1} r s}) \\
\quad - (G_{j m_{1} m_{2} m_{2}} + G_{k m_{1} m_{2} m_{2}} + G_{m_{1} m_{1} r m_{2}} + G_{m_{1} m_{1} s m_{2}}) + G_{m_{1} m_{1} m_{2} m_{2}}
\end{array}\right.

for j, k = 1, ..., m₁ - 1; r, s = 1, ..., m₂ - 1 and j ≤ k, r ≤ s.
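The closed-form contrasts in (C.1) can be checked against the recursive definitions on arbitrary genotypic values. The sketch below is a numerical illustration under stated assumptions: alleles are indexed from 0 with the last allele at each locus as reference, the array `G` holds random values that are symmetric in the two alleles at each locus, and the highest-order term (δ_1jk δ_2rs) is compared in both forms. All names are hypothetical.

```python
import numpy as np
from itertools import combinations_with_replacement as cwr

rng = np.random.default_rng(0)
m1, m2 = 3, 4
a, b = m1 - 1, m2 - 1  # 0-based indices of the reference alleles at locus 1 and locus 2

# Random expected genotypic values, symmetric in the two alleles at each locus.
G = rng.normal(size=(m1, m1, m2, m2))
G = G + G.transpose(1, 0, 2, 3)
G = G + G.transpose(0, 1, 3, 2)

# Recursive definitions (first equality of each relationship in C.1).
mu = G[a, a, b, b]
alpha1 = {j: G[j, a, b, b] - mu for j in range(a)}
alpha2 = {r: G[a, a, r, b] - mu for r in range(b)}
delta1 = {(j, k): G[j, k, b, b] - alpha1[j] - alpha1[k] - mu for j, k in cwr(range(a), 2)}
delta2 = {(r, s): G[a, a, r, s] - alpha2[r] - alpha2[s] - mu for r, s in cwr(range(b), 2)}
aa = {(j, r): G[j, a, r, b] - alpha1[j] - alpha2[r] - mu for j in range(a) for r in range(b)}
da = {(j, k, r): G[j, k, r, b] - alpha1[j] - alpha1[k] - delta1[j, k]
                 - alpha2[r] - aa[j, r] - aa[k, r] - mu
      for j, k in cwr(range(a), 2) for r in range(b)}
ad = {(j, r, s): G[j, a, r, s] - alpha2[r] - alpha2[s] - delta2[r, s]
                 - alpha1[j] - aa[j, r] - aa[j, s] - mu
      for j in range(a) for r, s in cwr(range(b), 2)}

def dd_recursive(j, k, r, s):
    """(delta_1jk delta_2rs) as G_jkrs minus all lower-order terms."""
    return (G[j, k, r, s] - alpha1[j] - alpha1[k] - delta1[j, k]
            - alpha2[r] - alpha2[s] - delta2[r, s]
            - aa[j, r] - aa[j, s] - aa[k, r] - aa[k, s]
            - ad[j, r, s] - ad[k, r, s] - da[j, k, r] - da[j, k, s] - mu)

def dd_closed(j, k, r, s):
    """(delta_1jk delta_2rs) as a signed sum of 16 expected genotypic values."""
    loc1 = [(1, (j, k)), (-1, (j, a)), (-1, (k, a)), (1, (a, a))]
    loc2 = [(1, (r, s)), (-1, (r, b)), (-1, (s, b)), (1, (b, b))]
    return sum(c1 * c2 * G[g1 + g2] for c1, g1 in loc1 for c2, g2 in loc2)

max_diff = max(abs(dd_recursive(j, k, r, s) - dd_closed(j, k, r, s))
               for j, k in cwr(range(a), 2) for r, s in cwr(range(b), 2))
print(max_diff)  # zero up to rounding error
```

The closed form is the product of two one-locus contrasts, which is why its 16 terms carry the alternating signs listed in (C.1).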

(C.2) Relationships between parameters of the fully parameterized two-locus model (14) and the expected genotypic values are

{\begin{array}{l} τ = μ + \sum_{j = 1}^{m_{1} - 1} (α_{1 j} + \frac{δ_{1 j j}}{2}) + \sum_{r = 1}^{m_{2} - 1} (α_{2 r} + \frac{δ_{2 r r}}{2}) \\ + \sum_{j = 1}^{m_{1} - 1} \sum_{r = 1}^{m_{2} - 1} [(α_{1 j} α_{2 r}) + \frac{(α_{1 j} δ_{2 r r})}{2} \\ + \frac{(δ_{1 j j} α_{2 r})}{2} + \frac{(δ_{1 j j} δ_{2 r r})}{4}] \\ = \frac{1}{4} \sum_{j = 1}^{m_{1} - 1} \sum_{r = 1}^{m_{2} - 1} G_{j j r r} + \frac{(3 - m_{2})}{4} \sum_{j = 1}^{m_{1} - 1} G_{j j m_{2} m_{2}} \\ + \frac{(3 - m_{1})}{4} \sum_{r = 1}^{m_{2} - 1} G_{m_{1} m_{1} r r} + \frac{(3 - m_{1}) (3 - m_{2})}{4} G_{m_{1} m_{1} m_{2} m_{2}} \\ a_{1 j} = α_{1 j} + \frac{δ_{1 j j}}{2} + \sum_{r = 1}^{m_{2} - 1} [(α_{1 j} α_{2 r}) + \frac{(α_{1 j} δ_{2 r r})}{2} \\ + \frac{(δ_{1 j j} α_{2 r})}{2} + \frac{(δ_{1 j j} δ_{2 r r})}{4}] \\ = \frac{1}{4} \sum_{r = 1}^{m_{2} - 1} (G_{j j r r} - G_{m_{1} m_{1} r r}) \\ + \frac{(3 - m_{2})}{4} (G_{j j m_{2} m_{2}} - G_{m_{1} m_{1} m_{2} m_{2}}) \\ a_{2 r} = α_{2 r} + \frac{δ_{2 r r}}{2} + \sum_{j = 1}^{m_{1} - 1} [(α_{1 j} α_{2 r}) + \frac{(α_{1 j} δ_{2 r r})}{2} \\ + \frac{(δ_{1 j j} α_{2 r})}{2} + \frac{(δ_{1 j j} δ_{2 r r})}{4}] \\ = \frac{1}{4} \sum_{j = 1}^{m_{1} - 1} (G_{j j r r} - G_{j j m_{2} m_{2}}) \\ + \frac{(3 - m_{1})}{4} (G_{m_{1} m_{1} r r} - G_{m_{1} m_{1} m_{2} m_{2}}) \\ d_{1 j j} = - \frac{δ_{1 j j}}{2} - \sum_{r = 1}^{m_{2} - 1} [\frac{(δ_{1 j j} α_{2 r})}{2} + \frac{(δ_{1 j j} δ_{2 r r})}{4}] \\ = - \frac{1}{4} \sum_{r = 1}^{m_{2} - 1} (G_{j j r r} - 2 G_{j m_{1} r r} + G_{m_{1} m_{1} r r}) \\ - \frac{(3 - m_{2})}{4} (G_{j j m_{2} m_{2}} - 2 G_{j m_{1} m_{2} m_{2}} + G_{m_{1} m_{1} m_{2} m_{2}}) \\ d_{2 r r} = - \frac{δ_{2 r r}}{2} - \sum_{j = 1}^{m_{1} - 1} [\frac{(α_{1 j} δ_{2 r r})}{2} + \frac{(δ_{1 j j} δ_{2 r r})}{4}] \\ = - \frac{1}{4} \sum_{j = 1}^{m_{1} - 1} (G_{j j r r} - 2 G_{j j r m_{1}} + G_{j j m_{2} m_{2}}) \\ - \frac{(3 - m_{1})}{4} (G_{m_{1} m_{1} r r} - 2 G_{m_{1} m_{1} r m_{2}} + G_{m_{1} m_{1} m_{2} m_{2}}) \\ d_{1 j k} = δ_{1 j k} + 
\sum_{r = 1}^{m_{2} - 1} [(δ_{1 j k} α_{2 r}) + \frac{(δ_{1 j k} δ_{2 r r})}{2}] \\ = \frac{1}{2} \sum_{r = 1}^{m_{2} - 1} (G_{j k r r} - G_{j m_{1} r r} - G_{k m_{1} r r} + G_{m_{1} m_{1} r r}) \\ + \frac{(3 - m_{2})}{4} (G_{j k m_{2} m_{2}} - G_{j m_{1} m_{2} m_{2}} - G_{k m_{1} m_{2} m_{2}} \\ + G_{m_{1} m_{1} m_{2} m_{2}}), j < k \\ d_{2 r s} = δ_{2 r s} + \sum_{j = 1}^{m_{1} - 1} [(α_{1 j} δ_{2 r s}) + \frac{(δ_{1 j j} δ_{2 r r})}{2}] \\ = \frac{1}{2} \sum_{j = 1}^{m_{1} - 1} (G_{j j r s} - G_{j j r m_{2}} - G_{j j s m_{2}} + G_{j j m_{2} m_{2}}) \\ + \frac{(3 - m_{1})}{4} (G_{m_{1} m_{1} r r} - G_{m_{1} m_{1} r m_{2}} - G_{m_{1} m_{1} s m_{2}} \\ + G_{m_{1} m_{1} m_{2} m_{2}}), r < s \\ (a_{1 j} a_{2 r}) = (α_{1 j} α_{2 r}) + \frac{(α_{1 j} δ_{2 r r})}{2} + \frac{(δ_{1 j j} α_{2 r})}{2} + \frac{(δ_{1 j j} δ_{2 r r})}{4} \\ = \frac{1}{4} (G_{j j r r} - G_{m_{1} m_{1} r r} - G_{j j m_{2} m_{2}} + G_{m_{1} m_{1} m_{2} m_{2}}) \\ (a_{1 j} d_{2 r r}) = - \frac{(α_{1 j} δ_{2 r r})}{2} - \frac{(δ_{1 j j} δ_{2 r r})}{4} \\ \begin{array}{l} = - \frac{1}{4} (G_{j j r r} - 2 G_{j j r m_{2}} + G_{j j m_{2} m_{2}}) \\ + \frac{1}{4} (G_{m_{1} m_{1} r r} - 2 G_{m_{1} m_{1} r m_{2}} + G_{m_{1} m_{1} m_{2} m_{2}}) \\ (d_{1 j j} a_{2 r}) = - \frac{(δ_{1 j j} α_{2 r})}{2} - \frac{(δ_{1 j j} δ_{2 r r})}{4} \\ = - \frac{1}{4} (G_{j j r r} - 2 G_{j m_{1} r r} + G_{m_{1} m_{1} r r}) \\ + \frac{1}{4} (G_{j j m_{2} m_{2}} - 2 G_{j m_{1} m_{2} m_{2}} + G_{m_{1} m_{1} m_{2} m_{2}}) \\ (a_{1 j} d_{2 r s}) = (α_{1 j} δ_{2 r s}) + \frac{(δ_{1 j j} δ_{2 r s})}{2} \\ = \frac{1}{2} (G_{j j r s} - G_{j j r m_{2}} - G_{j j s m_{2}} + G_{j j m_{2} m_{2}}) \\ - \frac{1}{2} (G_{m_{1} m_{1} r s} - G_{m_{1} m_{1} r m_{2}} - G_{m_{1} m_{1} s m_{2}} \\ + G_{m_{1} m_{1} m_{2} m_{2}}), r < s \\ (d_{1 j k} a_{2 r}) = (δ_{1 j k} α_{2 r}) + \frac{(δ_{1 j j} δ_{2 r r})}{2} \\ = \frac{1}{2} (G_{j k r r} - G_{j m_{1} r r} - G_{k m_{1} r r} + G_{m_{1} m_{1} r r}) \\ - \frac{1}{2} (G_{j k m_{2} m_{2}} - G_{j 
m_{1} m_{2} m_{2}} - G_{k m_{1} m_{2} m_{2}} \\ + G_{m_{1} m_{1} m_{2} m_{2}}), j < k \\ (d_{1 j j} d_{2 r s}) = - \frac{(δ_{1 j j} δ_{2 r s})}{2}, r < s \\ (d_{1 j k} d_{2 r r}) = - \frac{(δ_{1 j k} δ_{2 r r})}{2}, j < k \\ (d_{1 j j} d_{2 r r}) = \frac{(δ_{1 j j} δ_{2 r r})}{4} \\ (d_{1 j k} d_{2 r s}) = (δ_{1 j k} δ_{2 r s}), j < k, r < s \end{array} \end{array}

for j, k = 1, ..., m₁ - 1 and r, s = 1, ..., m₂ - 1, where the relationships between the parameters of model (14) and model (13) are built based on the equivalence between the two models. The relationships between the parameters of model (14) and the expected genotypic values can then be derived by replacing the parameters of model (13) with the expected genotypic values from the previously established results in (C.1).
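The substitution step can be illustrated for the intercept τ. The sketch below is a numerical check under stated assumptions, not a derivation: it evaluates the model (13) parameters from (C.1) with k = j and s = r as products of one-locus contrasts on random symmetric genotypic values, and compares the two expressions given for τ. Alleles are indexed from 0 with the last allele at each locus as reference, and all helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m1, m2 = 4, 5
a, b = m1 - 1, m2 - 1  # 0-based indices of the reference alleles

# Random expected genotypic values, symmetric in the two alleles at each locus.
G = rng.normal(size=(m1, m1, m2, m2))
G = G + G.transpose(1, 0, 2, 3)
G = G + G.transpose(0, 1, 3, 2)

def ref(x):
    """Contrast that simply evaluates the reference homozygote."""
    return [(1, (x, x))]

def alpha_c(j, x):
    """One-locus contrast G_{j x} - G_{x x} defining an additive effect."""
    return [(1, (j, x)), (-1, (x, x))]

def delta_c(j, x):
    """One-locus contrast G_{j j} - 2 G_{j x} + G_{x x} defining delta_jj."""
    return [(1, (j, j)), (-2, (j, x)), (1, (x, x))]

def effect(c1, c2):
    """Signed sum of genotypic values over the product of two one-locus contrasts."""
    return sum(s1 * s2 * G[g1 + g2] for s1, g1 in c1 for s2, g2 in c2)

# tau written in terms of the parameters of model (13).
tau_params = effect(ref(a), ref(b))  # mu
tau_params += sum(effect(alpha_c(j, a), ref(b)) + effect(delta_c(j, a), ref(b)) / 2
                  for j in range(a))
tau_params += sum(effect(ref(a), alpha_c(r, b)) + effect(ref(a), delta_c(r, b)) / 2
                  for r in range(b))
tau_params += sum(effect(alpha_c(j, a), alpha_c(r, b))
                  + effect(alpha_c(j, a), delta_c(r, b)) / 2
                  + effect(delta_c(j, a), alpha_c(r, b)) / 2
                  + effect(delta_c(j, a), delta_c(r, b)) / 4
                  for j in range(a) for r in range(b))

# tau written directly in terms of the expected genotypic values.
tau_values = (sum(G[j, j, r, r] for j in range(a) for r in range(b)) / 4
              + (3 - m2) / 4 * sum(G[j, j, b, b] for j in range(a))
              + (3 - m1) / 4 * sum(G[a, a, r, r] for r in range(b))
              + (3 - m1) * (3 - m2) / 4 * G[a, a, b, b])

print(tau_params, tau_values)
```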

(C.3) Relationships between parameters of the fully parameterized two-locus model (15) and the expected genotypic values are

\{\begin{matrix} π_{0} = μ = G_{m_{1} m_{1} m_{2} m_{2}} \\ π_{1 j} = α_{1 j} = G_{j m_{1} m_{2} m_{2}} - G_{m_{1} m_{1} m_{2} m_{2}} \\ π_{2 r} = α_{2 r} = G_{m_{1} m_{1} r m_{2}} - G_{m_{1} m_{1} m_{2} m_{2}} \\ η_{1 j j} = 2 α_{1 j} + δ_{1 j j} = G_{j j m_{2} m_{2}} - G_{m_{1} m_{1} m_{2} m_{2}} \\ η_{2 r r} = 2 α_{2 r} + δ_{2 r r} = G_{m_{1} m_{1} r r} - G_{m_{1} m_{1} m_{2} m_{2}} \\ η_{1 j k} = δ_{1 j k}, j < k; η_{2 r s} = δ_{2 r s}, r < s \\ (π_{1 j} π_{2 r}) = (α_{1 j} α_{2 r}) = G_{j m_{1} r m_{2}} \\ - (G_{j m_{1} m_{2} m_{2}} + G_{m_{1} m_{1} r m_{2}}) + G_{m_{1} m_{1} m_{2} m_{2}} \\ (π_{1 j} η_{2 r r}) = 2 (α_{1 j} α_{2 r}) + (α_{1 j} δ_{2 r r}) = G_{j m_{1} r r} \\ - (G_{j m_{1} m_{2} m_{2}} + G_{m_{1} m_{1} r r}) + G_{m_{1} m_{1} m_{2} m_{2}} \\ (π_{1 j} η_{2 r s}) = (α_{1 j} δ_{2 r s}), r < s \\ (η_{1 j j} π_{2 r}) = 2 (α_{1 j} α_{2 r}) + (δ_{1 j j} α_{2 r}) = G_{j j r m_{2}} \\ - (G_{j j m_{2} m_{2}} + G_{m_{1} m_{1} r m_{2}}) + G_{m_{1} m_{1} m_{2} m_{2}} \\ (η_{1 j k} π_{2 r}) = (δ_{1 j k} α_{2 r}), j < k \\ (η_{1 j j} η_{2 r r}) = 4 (α_{1 j} α_{2 r}) + 2 (α_{1 j} δ_{2 r r}) \\ + 2 (δ_{1 j j} α_{2 r}) + (δ_{1 j j} δ_{2 r r}) \\ = G_{j j r r} - (G_{j j m_{2} m_{2}} + G_{m_{1} m_{1} r r}) \\ + G_{m_{1} m_{1} m_{2} m_{2}} \\ (η_{1 j j} η_{2 r s}) = 2 (α_{1 j} δ_{2 r s}) + (δ_{1 j j} δ_{2 r s}) \\ = G_{j j r s} - (G_{m_{1} m_{1} r s} + G_{j j r m_{2}} + G_{j j s m_{2}}) \\ + (G_{m_{1} m_{1} r m_{2}} + G_{m_{1} m_{1} s m_{2}} + G_{j j m_{2} m_{2}}) \\ - G_{m_{1} m_{1} m_{2} m_{2}}, r < s \\ (η_{1 j k} η_{2 r r}) = 2 (δ_{1 j k} α_{2 r}) + (δ_{1 j k} δ_{2 r r}) \\ = G_{j k r r} - (G_{j k m_{2} m_{2}} + G_{j m_{1} r r} + G_{k m_{1} r r}) \\ + (G_{j m_{1} m_{2} m_{2}} + G_{k m_{1} m_{2} m_{2}} + G_{m_{1} m_{1} r r}) \\ - G_{m_{1} m_{1} m_{2} m_{2}}, j < k \\ (η_{1 j k} η_{2 r s}) = (δ_{1 j k} δ_{2 r s}), j < k, r < s \end{matrix}

for j, k = 1, ..., m₁ - 1 and r, s = 1, ..., m₂ - 1, where the relationships between the parameters of model (15) and model (13) are built based on the equivalence between the two models. The relationships between the parameters of model (15) and the expected genotypic values are then derived by replacing the parameters of model (13) with the expected genotypic values from the previously established results in (C.1).
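As a worked illustration of this substitution, consider the term (π_1j η_2rr). The expansions below are the ones implied by the relationships in (C.1) with s = r, so they rest on the assumption that model (13) has the form those relationships describe.

```latex
\begin{aligned}
G_{j m_{1} r r} &= μ + α_{1 j} + 2 α_{2 r} + δ_{2 r r} + 2 (α_{1 j} α_{2 r}) + (α_{1 j} δ_{2 r r}), \\
G_{j m_{1} m_{2} m_{2}} &= μ + α_{1 j}, \qquad
G_{m_{1} m_{1} r r} = μ + 2 α_{2 r} + δ_{2 r r}, \qquad
G_{m_{1} m_{1} m_{2} m_{2}} = μ, \\
G_{j m_{1} r r} - (G_{j m_{1} m_{2} m_{2}} + G_{m_{1} m_{1} r r}) + G_{m_{1} m_{1} m_{2} m_{2}}
  &= 2 (α_{1 j} α_{2 r}) + (α_{1 j} δ_{2 r r}) = (π_{1 j} η_{2 r r}).
\end{aligned}
```

The main effects cancel in the contrast, leaving only the interaction terms, which is the pattern shared by the other interaction parameters of model (15).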

References

1. Fisher RA: The correlation between relatives on the supposition of Mendelian inheritance. Trans Roy Soc Edinburgh. 1918, 52: 399-433.
2. Cockerham CC: An extension of the concept of partitioning hereditary variance for analysis of covariances among relatives when epistasis is present. Genetics. 1954, 39: 859-882.
3. Cockerham CC: Estimation of genetic variances. Statistical Genetics and Plant Breeding. Edited by: Henson WD, Robinson HF. 1963, Washington, D.C.: Natl Acad Sci Natl Res Council publ No. 982, 53-94.
4. Weir BS, Cockerham C: Two-locus theory in quantitative genetics. Proceedings of the international conference on quantitative genetics. Edited by: Pollack EBT Kempthorne O. 1977, Iowa State University Press, 247-269.
5. Kempthorne O: An Introduction to Genetic Statistics. 1969, Ames: Iowa State University Press
6. Wang T, Zeng ZB: Models and partition of variance for quantitative trait loci with epistasis and linkage disequilibrium. BMC Genetics. 2006, 7: Article 9.
7. Hansen TF, Wagner GP: Modeling genetic architecture: a multilinear theory of gene interaction. Theor Popul Biol. 2001, 59: 61-86. 10.1006/tpbi.2000.1508.
8. Alvarez-Castro JM, Carlborg O: A unified model for functional and statistical epistasis and its application in quantitative trait Loci analysis. Genetics. 2007, 176 (2): 1151-1167.
9. Zeng ZB, Wang T, Zou W: Modeling quantitative trait Loci and interpretation of models. Genetics. 2005, 169 (3): 1711-1725.
10. Wang T, Zeng ZB: Contribution of genetic effects to genetic variance components with epistasis and linkage disequilibrium. BMC Genetics. 2009, 10: Article 52.
11. Yang RC, Alvarez-Castro JM: Functional and statistical genetic effects with multiple alleles. Current Topics in Genetics. 2008, 3: 49-62.
12. Lynch M, Walsh B: Genetics and Analysis of Quantitative Traits. 1998, Sunderland, MA: Sinauer
13. Zaykin DV, Westfall PH, Young SS, Karnoub MA, Wagner MJ, Ehm MG: Testing association of statistically inferred haplotypes with discrete and continuous traits in samples of unrelated individuals. Hum Hered. 2002, 53 (2): 79-91. 10.1159/000057986.
14. Searle SR: Linear Models. 1971, New York, NY: John Wiley & Sons Inc.
15. Ravishanker N, Dey DK: A First Course in Linear Model Theory. 2002, Boca Raton, Florida: Chapman & Hall/CRC
16. Van Der Veen JH: Tests of non-allelic interaction and linkage for quantitative characters in generations derived from two diploid pure lines. Genetica. 1959, 30: 201-232. 10.1007/BF01535675.
17. Mather K, Jinks JL: Biometrical Genetics. 1982, London: Chapman and Hall, 3rd edition
18. Falconer DS, Mackay TFC: Introduction to Quantitative Genetics. 1996, Harlow, UK: Longman, 4th edition
19. Hayman BI, Mather KM: The description of genetic interactions in continuous variation. Biometrics. 1955, 11: 69-82. 10.2307/3001481.
20. Liu T, Johnson JA, Casella G, Wu R: Sequencing complex diseases with HapMap. Genetics. 2004, 168: 503-511. 10.1534/genetics.104.029603.
21. Searle SR, Casella G, McCulloch CE: Variance Components. 1992, Hoboken, NJ: John Wiley & Sons, Inc.
22. Stokes ME, Davis CS, Koch GG: Categorical Data Analysis using the SAS System. 2001, Cary, NC: SAS Institute Inc., 2nd edition


Author information

Authors and Affiliations

Division of Biostatistics, Institute for Health and Society, Medical College of Wisconsin, Milwaukee, WI, 53226, USA
Tao Wang


Corresponding author

Correspondence to Tao Wang.

Additional information

Authors' contributions

TW planned the study, conducted the derivation and wrote the manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Wang, T. On coding genotypes for genetic markers with multiple alleles in genetic association study of quantitative traits. BMC Genet 12, 82 (2011). https://doi.org/10.1186/1471-2156-12-82


Received: 31 May 2011
Accepted: 21 September 2011
Published: 21 September 2011
DOI: https://doi.org/10.1186/1471-2156-12-82
