Including non-additive genetic effects in Bayesian methods for the prediction of genetic values based on genome-wide markers
- Dörte Wittenburg^{1}Email author,
- Nina Melzer^{1} and
- Norbert Reinsch^{1}
DOI: 10.1186/1471-2156-12-74
© Wittenburg et al; licensee BioMed Central Ltd. 2011
Received: 27 April 2011
Accepted: 25 August 2011
Published: 25 August 2011
Abstract
Background
Molecular marker information is a common source to draw inferences about the relationship between genetic and phenotypic variation. Genetic effects are often modelled as additively acting marker allele effects. The true mode of biological action can, of course, be different from this plain assumption. One possibility to better understand the genetic architecture of complex traits is to include intra-locus (dominance) and inter-locus (epistasis) interaction of alleles as well as the additive genetic effects when fitting a model to a trait. Several Bayesian MCMC approaches exist for the genome-wide estimation of genetic effects with high accuracy of genetic value prediction. Including pairwise interaction for thousands of loci would probably go beyond the scope of such a sampling algorithm because then millions of effects are to be estimated simultaneously leading to months of computation time. Alternative solving strategies are required when epistasis is studied.
Methods
We extended a fast Bayesian method (fBayesB), which was previously proposed for a purely additive model, to include non-additive effects. The fBayesB approach was used to estimate genetic effects on the basis of simulated datasets. Different scenarios were simulated to study the loss of accuracy of prediction, if epistatic effects were not simulated but modelled and vice versa.
Results
If 23 QTL were simulated to cause additive and dominance effects, both fBayesB and a conventional MCMC sampler BayesB yielded similar results in terms of accuracy of genetic value prediction and bias of variance component estimation based on a model including additive and dominance effects. Applying fBayesB to data with epistasis, accuracy could be improved by 5% when all pairwise interactions were modelled as well. The accuracy decreased more than 20% if genetic variation was spread over 230 QTL. In this scenario, accuracy based on modelling only additive and dominance effects was generally superior to that of the complex model including epistatic effects.
Conclusions
This simulation study showed that the fBayesB approach is convenient for genetic value prediction. Jointly estimating additive and non-additive effects (especially dominance) has reasonable impact on the accuracy of prediction and the proportion of genetic variation assigned to the additive genetic source.
1 Background
Molecular marker information is commonly used to draw inferences about the relationship between genetic and phenotypic variation in various species, e.g. humans [1], dairy cattle [2] or mice [3]. Assuming linkage disequilibrium (LD) between quantitative trait loci (QTL) and markers, genetic effects can be estimated and explained as QTL effects captured by the neighbouring markers. If breeding values are the focal point, genetic effects are typically modelled as additively acting marker allele effects (e.g. [4, 5]). The mode of biological action can, of course, be different from the assumption of pure additivity. One possibility to better understand the genetic architecture of complex traits is to include intra-locus (dominance) and inter-locus (epistasis) interaction of alleles when fitting a model to a trait. The importance of non-additive effects for genetic variation has recently been investigated. Knowledge about non-additive effects is essential to benefit, for example, from heterosis effects [6], especially for cross-breeding schemes (poultry, plants etc.). In general, it can be expected that the prediction of the genetic value, in particular its additive part, is improved if non-additive effects are additionally modelled. For instance, Lee et al.[7] reported that the accuracy of prediction increased considerably when dominance effects were included compared to a purely additive genetic model when the phenotypes coat colour (+17% accuracy) or the percentage of CD8^{+} cells (+2% accuracy) were studied in mice. Added epistasis did not, however, contribute to the accuracy in this case. In an example with recombinant inbred lines of soybean [8], the accuracy of prediction was more than doubled under the epistatic model. Even though non-additive effects may occur on the level of gene action, most of the genetic variation might be assigned to additive effects when genes are at an extreme frequency [9]. The extent to which, for example, epistasis is involved in regulating complex traits is hardly known, but knowledge about it can be used to infer biological mechanisms and to reconstruct biological pathways [10]. In one of the first studies concerning non-additive influence on growth differences in chickens, Carlborg et al.[11] estimated that 10% of genetic variation in early growth (trait Gr18) was due to dominance and even 70% due to epistasis. This example showed the importance of interacting loci, though one may suppose an overestimation of the epistatic effects, a phenomenon already known as the Beavis effect [12] for single loci. Since this experiment was based on a cross of extremely different lines, further investigations are required to find evidence for interacting genes in purebreds.
Different approaches are available to model additive and non-additive genetic effects. Under the aspect of QTL detection, a genome scan can be carried out to uncover genetic effects using, for example, a variance component method [13, 14]. If additive and non-additive effects are to be modelled simultaneously over the whole genome, we have to be aware of "p bigger than n" problems, meaning there are more parameters than there are observations. To cope with the all-in-one situation, Xu presented a Bayesian approach [15], which parallels the idea of BayesA [4], and an empirical Bayes method [16] both enabling the genome-wide estimation of additive and non-additive marker effects. The Bayesian methods commonly used for the estimation of additive effects apply Markov chain Monte Carlo (MCMC) simulations which require a lot of computing time, but they convince in terms of accuracy in predicting genetic values. In particular, the BayesB approach [4] is superior to other methods, for instance ridge regression and partial least squares [17–19]. The MCMC sampling methods may collapse under high marker density if further non-additive effects are included. As an alternative, an approximate Bayesian approach is available which applies the analytically derived posterior density for a marker effect rather than samples thereof [20]. This approach (called fBayesB) was shown to be slightly less accurate, because in an iterative procedure only a single marker effect is studied at a time while the vector of phenotypes is corrected for all other previously estimated effects. The fBayesB strategy is much faster than the conventional Bayesian methods using MCMC. This solving approach offers the possibility to additionally account for genome-wide interacting effects and to estimate them with reasonable computational effort.
The objective of this study is to explore the impact of non-additive effects on the prediction of genetic values in a livestock population. An improved estimation of additive effects and a better prediction of genetic values is intended, when additive and non-additive effects are jointly involved in fitting a model to a trait. Since methods that aim to estimate non-additive effects in arbitrary populations are just emerging, it is especially important to validate such approaches with simulations. Therefore, with this study, we pursue methodological aspects, thereby assembling facts that help to interpret results obtained with practical data in future work. We consider additive, dominance and pairwise epistatic effects captured by biallelic markers spread over the whole genome. The details of statistical modelling are presented in the first part of the paper. We extend the fast Bayesian method (fBayesB), which was developed under pure additivity [20], to include non-additive effects. fBayesB is used to estimate the genetic effects on the basis of simulated datasets which resemble a dairy cattle population. Different scenarios are simulated to study the loss of accuracy of prediction if epistatic effects are not simulated but modelled and vice versa. In the second part, we summarise the results of analysing the simulated data. The amount of genetic variation assigned to each kind of genetic effect after genome-wide estimation of marker effects is determined. To briefly show how the approach behaves in practice, we also apply fBayesB to a real data example. In the third part, we outline some constraints of estimating non-additive effects via the fBayesB approach and discuss other solving strategies.
2 Methods
2.1 Statistical model
This model is set up in the way of an F_{∞} model [21]. Let μ be a population mean and 1 a vector of ones. The X and D are design matrices for allele substitution effects a= (a_{1}, ..., a_{ m } )' and dominance effects d= (d_{1}, ..., d_{ m } )', respectively. The entries of the design matrices are random variables which are realised depending on the observed marker genotypes (denoted as 11, 12, 22). For a homozygous genotype at locus j ∈ {1, ..., m} of animal i ∈ {1, ..., n}, X_{ i,j } = ± 1 and D_{ i,j } = 0; the positive effect is assigned to the more frequent allele. For a heterozygous genotype, X_{ i,j } = 0 and D_{ i,j } = 1.
where the design matrices X _{ a } and X _{ d } contain the corresponding entries of S _{ A,j } (j = 1, ..., m) and relate to the additive and dominance effects on the orthogonal scale, respectively.
L*(γ_{ s } , λ_{ s } ) denotes a mixture of a Laplace distribution with zero expectation and the point mass at zero. The mixing probability is γ_{ s } , then Pr(g_{ s,j } = 0) = 1 - γ_{ s } . Furthermore, $\text{Var}\left({g}_{s,j}\right)={\gamma}_{s}\frac{2}{{\lambda}_{s}^{2}}$, where λ_{ s } denotes a measure of uncertainty about the effects of the genetic variation source s. The hyper-parameters γ_{ s } and λ_{ s } are specified for each source, either additive (s = a) or dominance (s = d).
As an example, X _{ aa,j,k } = X _{ a,j } · X _{ a,k } , where the symbol · denotes the element-wise multiplication of column j and k of X _{ a } . Furthermore, X _{ ad,j,k } = X _{ a,j } · X _{ d,k } is calculated to obtain the coefficients for the effect g_{ ad,j,k } . This way, in total four times $\frac{m\left(m-1\right)}{2}$ epistatic effects are modelled. The prior remains the same as for M1, but we assume that the probability of having non-zero epistatic effects is smaller than the γ_{ s } for main effects.
2.2 Parameter estimation
where ${\mathit{y}}_{-j}^{\left(k\right)}$ denotes the vector of observed phenotypes corrected for all estimated effects except the j-th effect in iteration round k.
Set ${Y}_{j}={\left({\mathit{X}}_{s,j}^{\prime}{\mathit{X}}_{s,j}\right)}^{-1}{\mathit{X}}_{s,j}^{\prime}{\mathit{y}}_{-j}^{\left(k\right)}$ and ${\sigma}_{j}^{2}={\left({\mathit{X}}_{s,j}^{\prime}{\mathit{X}}_{s,j}\right)}^{-1}{\sigma}_{e}^{2}$. For convenience we denote ${Y}_{j}^{\pm}={Y}_{j}\pm {\lambda}_{s}{\sigma}_{j}^{2}.$
The Θ _{ U } (0; μ, σ^{2}) and Θ_{ L }(0; μ, σ^{2}) are the expected value of an upper and lower truncated normal distribution N(μ, σ^{2}), respectively, with truncation point zero. The Φ(x; μ, σ^{2}) denotes the normal distribution function evaluated at some point x and ϕ(x; μ, σ^{2}) is the normal density function.
Eventually, as a consequence of the standardisation, the genetic variance components are estimated as ${\widehat{\sigma}}_{s}^{2}=\sum _{j=1}^{{m}_{s}}{\u011d}_{s,j}^{2}$for each genetic variation source s. Note that this formula yields an approximation under LD because the covariance components Cov$\left({X}_{s,i,j}{\u011d}_{s,j},{X}_{s,{i}^{\prime},{j}^{\prime}}{\u011d}_{s,{j}^{\prime}}\right)$ of potentially linked loci j and j' are absent. Under the given re-parametrisation, the covariance Cov(X_{ s,i,j }, X_{ s,i',j' }) between genotype coefficients is not necessarily positive and the signs of the corresponding effects are not known. Therefore, it cannot be stated whether over- or underestimation of genetic variance components is expected. We briefly examine the impact of missing linkage information in our simulations.
The suitability of the statistical models M1 and M2 are compared among the different simulated scenarios in terms of accuracy, which is the empirical correlation between predicted and simulated DGV in a validation set. We implemented this fBayesB approach in Fortran F90.
When studying only the main genetic effects via M1, the results of fBayesB are compared with BayesB [4]. A Fortran implementation of BayesB of Berry & Stranden is available on http://www.genomicselection.net (obtained Sep 4, 2009). This version was extended to include dominance effects using a concatenated matrix ( X _{ a } X _{ d } ). In principle, it would be possible to additionally consider epistatic effects in BayesB, but this tool would probably require a few months to finish an adequate number of MCMC sampling rounds for a single simulated dataset.
2.3 Simulation study
Data generation
Mean (m) and variance (τ^{ 2 }) of normal distributions to simulate epistatic genetic effects
23-QTL scenario | 230-QTL scenario | |
---|---|---|
additive × additive | m = 0.2, τ^{2} = 0.3 | m = 0.02, τ^{2} = 0.03 |
additive × dominance | m = 0.2, τ^{2} = 0.3 | m = 0.02, τ^{2} = 0.03 |
dominance × additive | m = 0.2, τ^{2} = 0.2 | m = 0.02, τ^{2} = 0.02 |
dominance × dominance | m = 0.2, τ^{2} = 0.1 | m = 0.02, τ^{2} = 0.01 |
Scale of genetic effects
Note that the order of loci (either j < k or k < j) is necessary to assign the contribution of epistasis correctly to the different sources of genetic variation. Again, each main genetic effect was multiplied by the relevant standard deviation term.
Hyper-parameters and other settings
In this study, we involved prior knowledge about the proportion of non-zero effects of the genetic variation source s in the simulated dataset and chose γ_{ s } accordingly. In the 23-QTL scenario we set γ_{ a } = γ_{ d } = 0.005 and γ_{ s } = 10^{-6} for s ∈ {aa, ad, da, dd}. In the 230-QTL scenario we applied γ_{ a } = γ_{ d } = 0.05 and γ_{ s } = 10^{-6} for the epistatic effects. We will return to the issue of parameter choice in the Section Discussion.
Furthermore, to limit the number of iterations, we chose k_{max} = 1 000 and for the convergence criterion we used L = 10^{-8} for M1. Owing to the computational effort we set k_{max} = 200 and L = 10^{-6} for M2. Results are reported only for those repetitions where convergence was achieved.
In BayesB the main genetic effects were estimated simultaneously over the whole genome. A hyper-parameter π was required to give the proportion of non-zero genetic effects in total; we set π = 0.005 (π = 0.05) in the 23-QTL (230-QTL) scenario. Furthermore, we carried out 50 000 MCMC iterations (40% were neglected as burn-in) and within each iteration 1 000 rounds of the Metropolis-Hastings algorithm were employed.
Outline of data analysis
To begin with, we used every 10-th marker (m = 5 227), which included the true positions of the simulated QTL, in the statistical analysis. With this reduced genotype dataset, we evaluated differences in parameter estimation between fBayesB and BayesB based on the model with additive and dominance effects. A main issue was to study the impact of including or not including pairwise epistatic effects on the accuracy of genetic value prediction with fBayesB. The influence of a varying proportion of genetic variation on the accuracy of prediction was obtained by analysing the data produced with different broad-sense heritabilities. Further, we studied the consequences of spreading the genetic variation over a multitude of loci with almost equal amounts of variation in each source of genetic variation. In a next step, we used all SNP information (m = 52 273) without pre-selection of loci for the estimation of genetic effects and explored the applicability of fBayesB for a large genotype dataset. Finally, to study practical suitability, we estimated genetic effects in a sample of a heterogeneous stock of mice. Genotype and phenotype data are publicly available at http://gscan.well.ox.ac.uk/[28].
3 Results
On average 567 loci per dataset had MAF ≤ 0.01. These loci were omitted, but loci deviating from HWE (on average one locus per dataset) were not excluded from the analysis. The average LD between adjacent SNPs was r^{2} = 0.07 in the reduced genotype dataset with 5 227 SNPs.
Average estimated variance components (standard deviation in brackets) and average accuracy ρ of genetic value prediction*
Simulation without epistasis | |||||||||
---|---|---|---|---|---|---|---|---|---|
Method | Model | ${\sigma}_{a}^{2}$ | ${\sigma}_{d}^{2}$ | ${\sigma}_{aa}^{2}$ | ${\sigma}_{ad}^{2}$ | ${\sigma}_{da}^{2}$ | ${\sigma}_{dd}^{2}$ | ${\sigma}_{e}^{2}$ | ρ |
BayesB | M1 | 0.743 | 0.035 | - | - | - | - | 0.775 | 0.980 |
(0.578) | (0.039) | (0.605) | |||||||
fBayesB | M1 | 0.742 | 0.035 | - | - | - | - | 0.752 | 0.978 |
(0.579) | (0.039) | (0.587) | |||||||
fBayesB | M2 | 0.748 | 0.039 | 0.008 | 0.007 | 0.007 | 0.008 | 0.638 | 0.959 |
(0.583) | (0.041) | (0.013) | (0.016) | (0.014) | (0.017) | (0.484) | |||
Simulated components | 0.757 | 0.040 | - | - | - | - | 0.798 | - | |
Simulation with epistasis | |||||||||
Method | Model | ${\sigma}_{a}^{2}$ | ${\sigma}_{d}^{2}$ | ${\sigma}_{aa}^{2}$ | ${\sigma}_{ad}^{2}$ | ${\sigma}_{da}^{2}$ | ${\sigma}_{dd}^{2}$ | ${\sigma}_{e}^{2}$ | ρ |
BayesB | M1 | 1.313 | 0.158 | - | - | - | - | 2.721 | 0.785 |
(0.681) | (0.131) | (0.874) | |||||||
fBayesB | M1 | 1.310 | 0.161 | - | - | - | - | 2.619 | 0.781 |
(0.687) | (0.132) | (0.845) | |||||||
fBayesB | M2 | 1.338 | 0.193 | 0.299 | 0.138 | 0.065 | 0.057 | 1.811 | 0.833 |
(0.688) | (0.142) | (0.215) | (0.111) | (0.071) | (0.070) | (0.598) | |||
Simulated components | 1.409 | 0.217 | 0.346 | 0.133 | 0.089 | 0.020 | 2.214 | - |
Average ratio of additive genetic variance to total genetic variance*
Simulation without epistasis | ||
---|---|---|
Model | 23-QTL scenario | 230-QTL scenario |
M1 | 0.953 | 0.810 |
M2 | 0.918 | 0.581 |
Simulated ratio | 0.948 | 0.945 |
Simulation with epistasis | ||
Model | 23-QTL scenario | 230-QTL scenario |
M1 | 0.884 | 0.773 |
M2 | 0.626 | 0.401 |
Simulated ratio | 0.613 | 0.648 |
Average accuracy of genetic value prediction depending on broad-sense heritability H^{2}*
Simulation without epistasis | ||||
---|---|---|---|---|
Model | H^{2} = 0.5 | H^{2} = 0.3 | H^{2} = 0.1 | H^{2} = 0.5 best 10% |
M0 | 0.958 | 0.940 | 0.859 | 0.786 |
M1 | 0.978 | 0.953 | 0.844 | 0.774 |
M2 | 0.959 | 0.897 | 0.640 | 0.748 |
Simulation with epistasis | ||||
Model | H^{2} = 0.5 | H^{2} = 0.3 | H^{2} = 0.1 | H^{2} = 0.5 best 10% |
M0 | 0.741 | 0.707 | 0.581 | 0.618 |
M1 | 0.781 | 0.736 | 0.582 | 0.621 |
M2 | 0.833 | 0.718 | 0.339 | 0.598 |
In order to prove that we benefit from additionally modelling non-additive genetic effects if those were simulated, we compared the accuracy of genetic value prediction based on M1 with accuracy obtained from a conventional model including only additive genetic effects, called M0. Except for constellations with H^{2} = 0.1, accuracy of M1 was 1-2% (3-4%) higher in simulations without (with) epistasis than accuracy of M0, see Table 4. If we looked at the 10% animals with best predicted additive genetic value (i.e. the breeding value) in simulations with epistasis and H^{2} = 0.5, the accuracy of additive genetic value prediction was 0.618 with M0, 0.621 with M1 and 0.598 with the correct model M2. If we look at the 10% best animals when epistasis was not simulated, the accuracy of additive genetic value prediction was 0.786 based on M0, 0.774 with the correct model M1 and 0.748 with M2. Thus, model choice had an impact on predicting the total genetic values, but if only the extreme breeding values were of interest, e.g. for selection purposes, prediction with a conventional model (M0) was more precise than with the corresponding true model.
Average estimated variance components (standard deviation in brackets) and average accuracy ρ of genetic value prediction*
Simulation without epistasis | |||||||||
---|---|---|---|---|---|---|---|---|---|
Method | Model | ${\sigma}_{a}^{2}$ | ${\sigma}_{d}^{2}$ | ${\sigma}_{aa}^{2}$ | ${\sigma}_{ad}^{2}$ | ${\sigma}_{da}^{2}$ | ${\sigma}_{dd}^{2}$ | ${\sigma}_{e}^{2}$ | ρ |
BayesB | M1 | 0.631 | 0.056 | - | - | - | - | 0.652 | 0.860 |
(0.204) | (0.035) | (0.180) | |||||||
fBayesB | M1 | 0.699 | 0.165 | - | - | - | - | 0.413 | 0.760 |
(0.207) | (0.065) | (0.132) | |||||||
fBayesB | M2 | 0.732 | 0.304 | 0.036 | 0.065 | 0.068 | 0.074 | 0.170 | 0.608 |
(0.214) | (0.112) | (0.028) | (0.036) | (0.042) | (0.046) | (0.066) | |||
Simulated components | 0.709 | 0.043 | - | - | - | - | 0.754 | - | |
Simulation with epistasis | |||||||||
Method | Model | ${\sigma}_{a}^{2}$ | ${\sigma}_{d}^{2}$ | ${\sigma}_{aa}^{2}$ | ${\sigma}_{ad}^{2}$ | ${\sigma}_{da}^{2}$ | ${\sigma}_{dd}^{2}$ | ${\sigma}_{e}^{2}$ | ρ |
BayesB | M1 | 0.949 | 0.215 | - | - | - | - | 1.968 | 0.585 |
(0.250) | (0.067) | (0.266) | |||||||
fBayesB | M1 | 0.920 | 0.267 | - | - | - | - | 1.567 | 0.543 |
(0.197) | (0.080) | (0.282) | |||||||
fBayesB | M2 | 1.277 | 0.910 | 0.171 | 0.275 | 0.296 | 0.305 | 0.493 | 0.340 |
(0.230) | (0.269) | (0.086) | (0.106) | (0.127) | (0.126) | (0.257) | |||
Simulated components | 1.284 | 0.192 | 0.308 | 0.103 | 0.071 | 0.014 | 1.952 | - |
Comparison of empirical variances of predicted genetic values and genetic variance components estimated under LE*
Simulation without epistasis | |||||||
---|---|---|---|---|---|---|---|
23-QTL scenario | 230-QTL scenario | ||||||
Model | ${\sigma}_{a}^{2}$ | ${\sigma}_{d}^{2}$ | ${\sigma}_{\mathsf{\text{epi}}}^{2}$ | ${\sigma}_{a}^{2}$ | ${\sigma}_{d}^{2}$ | ${\sigma}_{\mathsf{\text{epi}}}^{2}$ | |
M1 | empirical | 0.743 | 0.035 | - | 0.711 | 0.163 | - |
under LE | 0.742 | 0.035 | - | 0.699 | 0.165 | - | |
M2 | empirical | 0.749 | 0.038 | 0.030 | 0.805 | 0.338 | 0.278 |
under LE | 0.748 | 0.039 | 0.030 | 0.732 | 0.304 | 0.243 | |
Simulation with epistasis | |||||||
23-QTL scenario | 230-QTL scenario | ||||||
Model | ${\sigma}_{a}^{2}$ | ${\sigma}_{d}^{2}$ | ${\sigma}_{\mathsf{\text{epi}}}^{2}$ | ${\sigma}_{a}^{2}$ | ${\sigma}_{d}^{2}$ | ${\sigma}_{\mathsf{\text{epi}}}^{2}$ | |
M1 | empirical | 1.309 | 0.161 | - | 0.981 | 0.266 | - |
under LE | 1.310 | 0.161 | - | 0.920 | 0.267 | - | |
M2 | empirical | 1.332 | 0.192 | 0.554 | 1.442 | 1.112 | 1.277 |
under LE | 1.338 | 0.193 | 0.559 | 1.277 | 0.910 | 1.047 |
Next we used the genome-wide SNP information in the statistical analysis (m = 52 273). An average of 5 685 loci per dataset were omitted because MAF ≤ 0.01. An average of nine loci deviated from HWE, but these loci were retained. We set γ_{ a } = γ_{ d } = 0.005 for both QTL scenarios. If only main genetic effects were simulated and modelled in the 23-QTL scenario with H^{2} = 0.5, the additive genetic variance component was obtained as ${\widehat{\sigma}}_{a}^{2}=0.796\left(se=0.588\right)$, whereas the dominance variance component was extremely overestimated as ${\widehat{\sigma}}_{d}^{2}=\phantom{\rule{1em}{0ex}}0.\mathsf{\text{493}}$ (se = 0.354). The accuracy ρ = 0.723 was still reasonably high. In the 230-QTL scenario, the accuracy of prediction reduced to ρ = 0.513 and ${\widehat{\sigma}}_{a}^{2}=\phantom{\rule{1em}{0ex}}0.\mathsf{\text{73}}0$ (se = 0.251), but ${\widehat{\sigma}}_{d}^{2}=\phantom{\rule{1em}{0ex}}0.\mathsf{\text{729}}$ (se = 0.308) was not estimated as well as with the reduced SNP set on the basis of M1. Including the pairwise epistatic effects via M2 exceeded practicability. On the basis of 5 227 SNP, more than 13 million effects had to be estimated for each source of epistatic variation and fBayesB required an average of six hours to converge. If 52 273 SNP markers are included, then approximately 1.3 billion effects have to be estimated for each of the four sources of epistasis. Though most markers or pairs of markers have no effect, their estimated genetic effects will be small but not exactly zero. It was not feasible to estimate about 5 billion effects via M2 under proper numerical precision owing to the restricted capacity of memory space. Furthermore, it is questionable how much computing time is required to execute several rounds of iteration. Thus, with 52 273 SNP markers, only M1 was applied to the simulated data with epistasis. This led to a reduced accuracy of ρ = 0.611 (ρ = 0.380) in the 23-QTL scenario (230-QTL scenario).
Estimated variance components for the real data example*
Model | ${\sigma}_{a}^{2}$ | ${\sigma}_{d}^{2}$ | ${\sigma}_{aa}^{2}$ | ${\sigma}_{ad}^{2}$ | ${\sigma}_{da}^{2}$ | ${\sigma}_{dd}^{2}$ | ${\sigma}_{e}^{2}$ |
---|---|---|---|---|---|---|---|
M0 | 0.169 | - | - | - | - | - | 0.405 |
M1 | 0.171 | 0.030 | - | - | - | - | 0.378 |
M2 | 0.174 | 0.046 | 0.000 | 0.000 | 0.026 | 0.000 | 0.303 |
4 Discussion
4.1 Hyper-parameters and convergence
When we investigated the influence of a varying proportion of genetic to phenotypic variance on genetic value prediction in the 23-QTL scenario, it was observed that fBayesB did not fulfil the convergence criterion in all situations. In the extreme case with M2 and H^{2} = 0.1, only 40% of all repetitions converged to a proper final solution and it happened that fBayesB simply aborted. (Usually, the algorithm converged after 13-16 iterations with M1 and after 26-28 steps with M2, but up to a 5-fold of iteration steps were necessary in the 230-QTL scenario.) In order to avoid termination, one could tune the "free" hyper-parameter λ_{ s } , which is responsible for the variation of a genetic effect a priori. For convenience, we assumed that the total prior variance was equal to one for each source of genetic variation s ∈ {a, d, aa, ad, da, dd}, see Equation (4). This prior guess depends on the hyper-parameter γ_{ s } which was equal among s ∈ {a, d} and s ∈ {aa, ad, da, dd}. Thus, it seems necessary to specifically adjust λ_{ s } and/or γ _{ s } to each source of genetic variation.
4.2 Proportion of non-zero effects
A preliminary study could show that the choice of the hyper-parameter γ_{ s } strongly influenced the accuracy of genetic value prediction and the ability of the fBayesB algorithm to converge (Melzer, Wittenburg, Repsilber: Simulating a more realistic genotype-phenotype map for development of methods to predict phenotypes based on genome-wide marker data - the livestock perspective, submitted). Since the aim of this paper was to investigate the suitability of fBayesB to cope with non-additive effects in general, we simply involved prior knowledge about the proportion of non-zero genetic effects when the hyper-parameter γ_{ s } had to be specified for each genetic variation source s ∈ {a, d, aa, ad, da, dd}. There are several possibilities which allow for a flexible setting of this hyper-parameter. In an intuitive manner, one could determine γ_{ s } via cross-validation as it was done in Melzer et al. Another encouraging approach was presented by Shepherd et al.[31], therein called emBayesB. It is a BayesB-like estimation of SNP effects without the time consuming Metropolis-Hastings algorithm, but with an EM algorithm for the estimation of γ_{ s } . It employs a binary variable indicating whether a marker is in LD with a QTL (i.e., it is a non-zero effect). So far, this approach was verified for additive genetic effects via simulations, but it is certainly applicable to the non-additive case as well. Not only the specification of the proportion of non-zero effects in the prior setting, however, is important. This study additionally showed that the more loci were responsible for genetic variation, the worse the genetic parameters were estimated, even though we accounted for this proportion in γ_{ s } . With higher proportions of small to intermediate genetic effects, the bias of estimation seriously accumulates via fBayesB. One way out is to reduce noise by eliminating zero effects. This objective is discussed in the next section.
4.3 Reduction of model dimensionality
SNP density continues to increase; soon whole-genome sequences will be used for statistical analysis [32]. The ability to uncover genetic effects with Bayesian MCMC methods worsens with increasing LD due to redundancy between markers [33]. Thus, in order to deal with the huge amounts of data, it becomes important to select relevant information. Selection is conceivable in two general ways.
In order to keep as many parameters as required in the statistical model, one could apply a filtering procedure. The significance of putative non-zero effects might be determined, for example, via a stochastic variable selection approach (SVS). In the field of genomic selection, which is based only on additive effects, an SVS implementation of Meuwissen and Goddard [34] was applied by Calus et al.[35] to simulated data as well as by Verbyla et al.[36] to dairy cattle data. In case of additional non-additive effects, SVS was developed and already successfully applied to obesity data in a mouse backcross population [37]. In that work, an upper bound of model dimensionality had to be fixed and indicator variables were involved specifying which main and epistatic effect had to be included in the model. The Bayes factor then gave evidence of putative QTL.
Dimensionality can also be reduced non-parametrically. As an example, a subset of SNPs may be selected via filtering based on entropy information and wrapping using a naive Bayesian classifier [38]. Alternatively, an informative set of SNPs can be identified on the basis of LD between loci, called tagSNP [39]. This strategy would probably also reduce the bias in variance component estimation due to LD, because only one marker represents a certain chromosome segment. A haplotyping strategy based on LD information was applied to SNP data in Australian beef cattle [40] but with limited success. The authors reported that about 30 000 SNP markers (and a large number of phenotypic records) are required for accurate breeding value prediction. Thus, we have to work with some contradiction: more markers for higher accuracy but less markers (or only the best markers) to reduce estimation errors. The best solution is probably obtained, when the models used are better able to distinguish between markers with and without effect. Meuwissen [41] presented other options to reduce a set of SNPs based on LD between loci or relatedness between individuals.
4.4 Non-additive effects
This study has shown that the inclusion of dominance effects in genetic value prediction improved accuracy compared to purely additive models (Table 4). We found that the incorporation of dominance effects was less challenging than the inclusion of epistasis, and we have made a robust step towards advancing insight into the genetic architecture. Regardless of whether dominance or epistatic effects are considered, adequate data are required to estimate non-additive effects. This is also true for periodic re-estimation of genetic effects. In contrast to genomic selection, where additive effects may be obtained from average yields of progeny of genotyped parents, genotyped individuals need to have an own phenotype (e.g. cows).
In general, and also confirmed in our investigations, parametric methods have difficulties to identify and to estimate epistatic effects. One reason is that the orthogonal decomposition of genetic effects only lead to proper results under idealised conditions (LE, absence of mutation and selection etc.) which are violated in practice [42]. As reviewed and discussed by Calus [5], non-parametric methods (e.g. [43]) have the potential to outperform parametric approaches if non-additive effects are included. With an application to broiler data [44], it was shown that kernel methods had a better predictive ability than parametric methods when genome-wide markers were used. For thousands of SNPs and millions of interactions, fBayesB is still computationally feasible but it shows an inherent bias of variance component estimation. Alternatively, machine learning techniques may discover hidden patterns of gene interaction without assuming their structure [45].
Once gene interactions are discovered, they may be used for mate allocation in livestock breeding, where individuals are mated to achieve favourable non-additive gene combinations to further increase genetic gain [46]. Apart from breeding applications, improved statistical modelling [41] and our cognitive interest in the formation of complex phenotypes will benefit from knowledge about the distribution of non-additive effects over the genome and their size.
4.5 Number of simulated QTL
An increase in the number of QTL was accompanied by a reduction in the quality of fBayesB for genetic value prediction. fBayesB was able to identify only the biggest QTL effects in the simulated scenarios, in which (nearly) the same amount of genetic variation was spread over 23 or 230 QTL. Thus, effect size in the 230-QTL scenario was roughly one-tenth of that in the 23-QTL case. This complicated the identification of genetic effects in general and, in particular, of non-additive effects, which contributed very little to the genetic variance when compared with additive effects. Many tiny effects were estimated with BayesB, even if genetic variation was caused by few QTL with large effects. In both QTL scenarios, accuracy of genetic value prediction was at a high level with BayesB. It may be more realistic to assume that most livestock traits are influenced by many loci and therefore best results can be expected with BayesB.
5 Conclusion
This simulation study showed that the fast Bayesian method (fBayesB) is convenient for genetic value prediction. It requires only a fraction of computing time compared to a conventional MCMC approach BayesB and also enables estimating pairwise interactions.
The number of simulated QTL, the proportion of genetic to phenotypic variance as well as the quantity of SNP in statistical analyses influenced accuracy of genetic value prediction and bias of variance component estimation. Both methods obtained similar results when few QTL with additive and dominance effects were simulated; the maximum accuracy was 98%. As expected, best results were obtained on the basis of the true model corresponding to the simulated scenario, but the loss of accuracy due to using the incorrect model was limited to 2-5%. If many QTL were responsible for genetic variation, accuracy decreased about 22-49% with fBayesB compared to the few QTL scenario, depending on the model. Accuracy based on modelling only additive and dominance effects was generally superior to the complex model, no matter if epistasis was simulated or not, and an additional gain of 4-10% accuracy was observed with BayesB. To sum up, existing approaches for genome-wide estimation of additive genetic effects can easily and robustly be extended by dominance effects to improve accuracy of genetic value prediction and to get further insight into the genetic architecture. In this simulation study, the inclusion of dominance was more important than involving all pairwise interactions, which did not improve prediction in general.
Declarations
Acknowledgements
This study is part of the FUGATO project "Bovine Integrative Bioinformatics for Genomic Selection (BovIBI)" with financial support of the German Federal Ministry of Education and Research (BMBF).
Authors’ Affiliations
References
- Visscher PM, Macgregor S, Benyamin B, Zhu G, Gordon S, Medland S, Hill WG, Hottenga JJ, Willemsen G, Boomsma DI, Liu YZ, Deng HW, Montgomery GW, Martin NG: Genome partitioning of genetic variation for height from 11,214 sibling pairs. American Journal of Human Genetics. 2007, 81 (5): 1104-1110. 10.1086/522934.PubMed CentralView ArticlePubMedGoogle Scholar
- Hayes BJ, Bowman PJ, Chamberlain AJ, Goddard ME: Invited review: Genomic selection in dairy cattle: progress and challenges. Journal of Dairy Science. 2009, 92 (2): 433-443. 10.3168/jds.2008-1646.View ArticlePubMedGoogle Scholar
- Legarra A, Robert-Granié C, Manfredi E, Elsen JM: Performance of genomic selection in mice. Genetics. 2008, 180: 611-618. 10.1534/genetics.108.088575.PubMed CentralView ArticlePubMedGoogle Scholar
- Meuwissen TH, Hayes BJ, Goddard ME: Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001, 157 (4): 1819-1829.PubMed CentralPubMedGoogle Scholar
- Calus MPL: Genomic breeding value prediction: methods and procedures. animal. 2010, 4 (02): 157-164. 10.1017/S1751731109991352.View ArticlePubMedGoogle Scholar
- Melchinger AE, Utz HF, Piepho HP, Zeng ZB, Schön CC: The Role of Epistasis in the Manifestation of Heterosis: A Systems-Oriented Approach. Genetics. 2007, 177 (3): 1815-1825. 10.1534/genetics.107.077537. [http://www.genetics.org/content/177/3/1815.abstract]PubMed CentralView ArticlePubMedGoogle Scholar
- Lee SH, van der Werf JHJ, Hayes BJ, Goddard ME, Visscher PM: Predicting unobserved phenotypes for complex traits from whole-genome SNP data. PLoS Genetics. 2008, 4 (10): e1000231-10.1371/journal.pgen.1000231.PubMed CentralView ArticlePubMedGoogle Scholar
- Hu Z, Li Y, Song X, Han Y, Cai X, Xu S, Li W: Genomic value prediction for quantitative traits under the epistatic model. BMC Genetics. 2011, 12: 15-[http://www.biomedcentral.com/1471-2156/12/15]PubMed CentralView ArticlePubMedGoogle Scholar
- Hill WG, Goddard ME, Visscher PM: Data and theory point to mainly additive genetic variance for complex traits. PLoS Genetics. 2008, 4 (2): e1000008-10.1371/journal.pgen.1000008.PubMed CentralView ArticlePubMedGoogle Scholar
- Carlborg Ö, Haley CS: Epistasis: too often neglected in complex trait studies?. Nature Reviews Genetics. 2004, 5 (8): 618-625. 10.1038/nrg1407.View ArticlePubMedGoogle Scholar
- Carlborg Ö, Kerje S, Schütz K, Jacobsson L, Jensen P, Andersson L: A global search reveals epistatic interaction between QTL for early growth in the chicken. Genome Research. 2003, 13 (3): 413-421. 10.1101/gr.528003.PubMed CentralView ArticlePubMedGoogle Scholar
- Beavis W: QTL analyses: power, precision, and accuracy. 1998, CRC PressGoogle Scholar
- Rönnegård L, Besnier F, Carlborg Ö: An Improved Method for Quantitative Trait Loci Detection and Identification of Within-Line Segregation in F_{2} Intercross Designs. Genetics. 2008, 178: 2315-2326. 10.1534/genetics.107.083162.PubMed CentralView ArticlePubMedGoogle Scholar
- Zimmer D, Mayer M, Reinsch N: Complex Genetic Effects in Quantitative Trait Locus Identification: A Computationally Tractable Random Model for Use in F2 Populations. Genetics. 2011, 187: 261-10.1534/genetics.110.122333.PubMed CentralView ArticlePubMedGoogle Scholar
- Xu S: Estimating polygenic effects using markers of the entire genome. Genetics. 2003, 163 (2): 789-801.PubMed CentralPubMedGoogle Scholar
- Xu S: An empirical Bayes method for estimating epistatic effects of quantitative trait loci. Biometrics. 2007, 63 (2): 513-521. 10.1111/j.1541-0420.2006.00711.x.View ArticlePubMedGoogle Scholar
- Fernando RL, Habier D, Stricker C, Dekkers JCM, Totir LR: Genomic Selection. Acta Agriculturae Scandinavica, Section A - Animal Sciences. 2007, 57: 192-195. 10.1080/09064700801959395.View ArticleGoogle Scholar
- Habier D, Fernando RL, Dekkers JCM: The impact of genetic relationship information on genome-assisted breeding values. Genetics. 2007, 177 (4): 2389-2397.PubMed CentralPubMedGoogle Scholar
- Solberg TR, Sonesson AK, Woolliams JA, Meuwissen TH: Reducing dimensionality for prediction of genome-wide breeding values. Genetics Selection Evolution. 2009, 41: 29-10.1186/1297-9686-41-29.View ArticleGoogle Scholar
- Meuwissen THE, Solberg TR, Shepherd R, Woolliams JA: A fast algorithm for BayesB type of prediction of genome-wide estimates of genetic value. Genetics Selection Evolution. 2009, 41: 2-10.1186/1297-9686-41-2. [http://www.gsejournal.org/content/41/1/2]View ArticleGoogle Scholar
- Kao CH, Zeng ZB: Modeling epistasis of quantitative trait loci using Cockerham's model. Genetics. 2002, 160 (3): 1243-1261.PubMed CentralPubMedGoogle Scholar
- Álvarez-Castro JM, Carlborg Ö: A unified model for functional and statistical epistasis and its application in quantitative trait Loci analysis. Genetics. 2007, 176 (2): 1151-1167.PubMed CentralView ArticlePubMedGoogle Scholar
- Cockerham CC: An Extension of the Concept of Partitioning Hereditary Variance for Analysis of Covariances among Relatives When Epistasis Is Present. Genetics. 1954, 39 (6): 859-882.PubMed CentralPubMedGoogle Scholar
- The Bovine Genome Sequencing and Analysis Consortium: The genome sequence of taurine cattle: a window to ruminant biology and evolution. Science. 2009, 324 (5926): 522-528.PubMed CentralView ArticleGoogle Scholar
- Hill W, Robertson A: Linkage disequilibrium in finite populations. TAG Theoretical and Applied Genetics. 1968, 38 (6): 226-231. 10.1007/BF01245622.View ArticlePubMedGoogle Scholar
- Hayes B, Goddard ME: The distribution of the effects of genes affecting quantitative traits in livestock. Genetics Selection Evolution. 2001, 33 (3): 209-229. 10.1186/1297-9686-33-3-209.View ArticleGoogle Scholar
- Bennewitz J, Meuwissen THE: The distribution of QTL additive and dominance effects in porcine F2 crosses. Journal of Animal Breeding and Genetics. 2010, 127 (3): 171-179. 10.1111/j.1439-0388.2009.00847.x.View ArticlePubMedGoogle Scholar
- Valdar W, Solberg LC, Gauguier D, Burnett S, Klenerman P, Cookson WO, Taylor MS, Rawlins JNP, Mott R, Flint J: Genome-wide genetic association of complex traits in heterogeneous stock mice. Nature Genetics. 2006, 38 (8): 879-887. 10.1038/ng1840.View ArticlePubMedGoogle Scholar
- Browning SR, Browning BL: Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. American Journal of Human Genetics. 2007, 81 (5): 1084-1097. 10.1086/521987.PubMed CentralView ArticlePubMedGoogle Scholar
- Valdar W, Solberg LC, Gauguier D, Cookson WO, Rawlins JNP, Mott R, Flint J: Genetic and environmental effects on complex traits in mice. Genetics. 2006, 174 (2): 959-984. 10.1534/genetics.106.060004.PubMed CentralView ArticlePubMedGoogle Scholar
- Shepherd RK, Meuwissen TH, Woolliams JA: Genomic selection and complex trait prediction using a fast EM algorithm applied to genome-wide markers. BMC Bioinformatics. 2010, 11: 529-10.1186/1471-2105-11-529.PubMed CentralView ArticlePubMedGoogle Scholar
- Meuwissen T, Goddard M: Accurate prediction of genetic values for complex traits by whole-genome resequencing. Genetics. 2010, 185 (2): 623-631. 10.1534/genetics.110.116590.PubMed CentralView ArticlePubMedGoogle Scholar
- Gianola D, de los Campos G, Hill WG, Manfredi E, Fernando R: Additive genetic variability and the Bayesian alphabet. Genetics. 2009, 183: 347-363. 10.1534/genetics.109.103952.PubMed CentralView ArticlePubMedGoogle Scholar
- Meuwissen THE, Goddard ME: Mapping multiple QTL using linkage disequilibrium and linkage analysis information and multitrait data. Genetics Selection Evolution. 2004, 36 (3): 261-279. 10.1186/1297-9686-36-3-261.View ArticleGoogle Scholar
- Calus MPL, Meuwissen THE, de Roos APW, Veerkamp RF: Accuracy of genomic selection using different methods to define haplotypes. Genetics. 2008, 178: 553-561. 10.1534/genetics.107.080838.PubMed CentralView ArticlePubMedGoogle Scholar
- Verbyla KL, Hayes BJ, Bowman PJ, Goddard ME: Accuracy of genomic selection using stochastic search variable selection in Australian Holstein Friesian dairy cattle. Genet Res. 2009, 91 (5): 307-311. 10.1017/S0016672309990243.View ArticleGoogle Scholar
- Yi N, Yandell BS, Churchill GA, Allison DB, Eisen EJ, Pomp D: Bayesian model selection for genome-wide epistatic quantitative trait loci analysis. Genetics. 2005, 170 (3): 1333-1344. 10.1534/genetics.104.040386.PubMed CentralView ArticlePubMedGoogle Scholar
- Long N, Gianola D, Rosa GJM, Weigel KA, Avendaño S: Machine learning classification procedure for selecting SNPs in genomic selection: application to early mortality in broilers. Journal of Animal Breeding and Genetics. 2007, 124 (6): 377-389. 10.1111/j.1439-0388.2007.00694.x.View ArticlePubMedGoogle Scholar
- Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA: Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. American Journal of Human Genetics. 2004, 74: 106-120. 10.1086/381000.PubMed CentralView ArticlePubMedGoogle Scholar
- Hayes BJ, Chamberlain AJ, McPartlan H, Macleod I, Sethuraman L, Goddard ME: Accuracy of marker-assisted selection with single markers and marker haplotypes in cattle. Genetical Research. 2007, 89 (4): 215-220.View ArticlePubMedGoogle Scholar
- Meuwissen T: Use of whole genome sequence data for QTL mapping and genomic selection. Proceedings of the 9th World Congress on Genetics Applied to Livestock Production. 2010, Leipzig, Germany: Gesellschaft für Tierzuchtwissenschaften e.V, [Abstract ID 0018, ISBN 978-3-00-031608-1]Google Scholar
- Gianola D, de los Campos G: Inferring genetic values for quantitative traits non-parametrically. Genetics Research. 2008, 90 (6): 525-540. 10.1017/S0016672308009890.View ArticlePubMedGoogle Scholar
- Gianola D, Fernando RL, Stella A: Genomic-assisted prediction of genetic value with semiparametric procedures. Genetics. 2006, 173 (3): 1761-1776. 10.1534/genetics.105.049510.PubMed CentralView ArticlePubMedGoogle Scholar
- González-Recio O, Gianola D, Long N, Weigel KA, Rosa GJM, Avendaño S: Nonparametric Methods for Incorporating Genomic Information Into Genetic Evaluations: An Application to Mortality in Broilers. Genetics. 2008, 178: 2305-2313. 10.1534/genetics.107.084293.PubMed CentralView ArticlePubMedGoogle Scholar
- Gianola D, de los Campos G, González-Recio O, Long N, Okut H, Rosa GJM, Weigel KA, Wu XL: Statistical Learning Methods For Genome-based Analysis Of Quantitative Traits. Proceedings of the 9th World Congress on Genetics Applied to Livestock Production. 2010, Leipzig, Germany: Gesellschaft für Tierzuchtwissenschaften e.V, [Abstract ID 0014, ISBN 978-3-00-031608-1]Google Scholar
- Goddard ME, Hayes BJ: Genomic selection. Journal of Animal Breeding and Genetics. 2007, 124: 323-330. 10.1111/j.1439-0388.2007.00702.x.View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.