In this section, we will describe the methods of BSR (BayesA) and SSVS (BayesB) for genomic selection and EM algorithm for BSR to obtain the estimates of parameters included in the model that maximize the posterior distribution function. Subsequently, we will modify BSR method (wBSR) by assigning the weight for each SNP according to the strength of its association to a trait for improvement of the prediction accuracy. The weight of SNP can be obtained from a prior probability of each SNP to be included in a model, which is also considered in SSVS procedure, using EM algorithm as well as the estimate of SNP effect. For the evaluation of the accuracies of the predicted GBVs, we apply wBSR with variable prior probabilities of SNP inclusion for simulated data sets as well as MCMC-based BSR and SSVS.

In the statistical model described below, we consider not haplotype effect but the effect of each single SNP. We assume that the number of SNPs genotyped is *N* and a training data set including *n* individuals with the records of phenotypes and SNP genotypes is available for the estimation of parameters in the model. We also assume that selection candidates consists of individuals with only SNP genotypes, for each of which GBV is predicted based on the model with SNP effects estimated with training data sets. We denote two alleles at each SNP by 0 and 1 and three genotypes by '0_0', '0_1', and '1_1'.

### Models for BSR and SSVS in genomic selection

In BSR (BayesA) method [

1,

2], the following linear model is fitted to the phenotypes of a training data set:

where **y** = (*y*
_{1}, *y*
_{2}, ..., *y*
_{
n
})' is a vector of phenotypic values of a trait for *n* individuals of a training data set, **u**
_{
l
}= (*u*
_{l1}, *u*
_{l2}, ..., *u*
_{
ln
})' is a vector of genotypes of *n* individuals at the *l*th SNP with *u*
_{
li
}taking a value of -1, 0, or 1 corresponding to the genotypes '0_0', '0_1', or '1_1', respectively, *g*
_{
l
}is the effect of the *l*th SNP, **b** = (*b*
_{1}, *b*
_{2}, ..., *b*
_{
f
})' is a vector of fixed non-genetic effects with dimension *f* including a general mean, **X** = (*x*
_{
ij
}) (*i* = 1, 2, ..., *n*; *j* = 1, 2, ..., *f*) is a design matrix relating **b** to **y** and **e** = (*e*
_{1}, *e*
_{2}, ..., *e*
_{
n
})' is a vector of random deviates with *e*
_{
i
}~*N*(0, *σ*
_{
e
}
^{2}). It is assumed that the prior distribution of the SNP effect, *g*
_{
l
}, is a normal distribution with a mean 0 and a variance *σ*
_{
gl
}
^{2}, which differs for every SNP. Moreover, the prior distribution of *σ*
_{
gl
}
^{2} is considered. In this study, we assume that it is a scaled inverted chi-squared distribution with a scale parameter *S* and a degree-of-freedom *ν*, *χ*
^{-2}(*ν*, *S*), following [1, 2]. The posterior distributions of relevant parameters, **b**, *g*
_{
l
}, *σ*
_{
gl
}
^{2} (*l* = 1, 2, ..., *N*) and *σ*
_{
e
}
^{2}, can be obtained by Gibbs sampling [1, 2]. For the individuals of selection candidates, GBV are predicted by
, where
is the estimate of *g*
_{
l
}. In this study, we consider not haplotype effect but the single marker effect for *g*
_{
l
}. The use of marker haplotypes instead of the single marker genotypes would cause slight modification of the model, but the procedure for estimation of effects and prediction of GBV is essentially the same.

In SSVS (BayesB) method, the model (1) is also adopted but a prior probability,

*p*, of each SNP to be included in the model is considered. Usually, a small value is given for

*p* based on the assumption that many of SNPs have actually no effects for a trait. The prior distribution of

*g*
_{
l
}is assumed to be a normal distribution with a mean 0 and a variance

*σ*
_{
gl
}
^{2} in SSVS as in BSR, whereas the prior distribution of

*σ*
_{
gl
}
^{2} is expressed as a mixture of two distributions corresponding to the inclusion and the exclusion of the SNP as follows:

assuming that the prior is *χ*
^{-2}(*ν*, *S*) when the SNP is included. When MCMC algorithm is applied for the estimation of the parameters in SSVS, *g*
_{
l
}and *σ*
_{
gl
}
^{2} are jointly updated with Metropolis-Hastings chain [1]. The GBV predicted by SSVS is presented by
as in BSR.

### EM algorithm for BSR

In Bayesian estimation, the inferences about the parameters are made based on the posterior distributions. MCMC algorithms can be used for obtaining the posterior information of the parameters in BSR method as described above. However, the posterior mode of each SNP effect which is a point estimate maximizing the density function of the posterior distribution can be calculated instead of a posterior expectation by some other iteration algorithm including EM algorithm. In QTL mapping using BSR method, Yi and Banerjee [8] utilized an EM algorithm to search the posterior mode of the marker effects included in the model. This EM algorithm can be applied for genomic selection with BSR method without any modification and we describe the estimation procedure for the EM algorithm in this section. Although, in [8], phenotypic data was transformed to have a mean 0 and a standard deviation 0.5 following Gelman et al. (2008) [9] and the derivations of the posterior estimates of parameters were illustrated in the framework of generalized linear model, original phenotypic data are subject to the EM algorithm here without any transformation and we derive the posterior estimates of parameters under the normality in what follows assuming that the trait of concern is polygenic and normally distributed.

The posterior distribution is given by combining a likelihood of the data and the prior distributions of the parameters. We denote parameters in BSR method as a vector form

**θ**,

The posterior distribution of

**θ** given the data of phenotypes,

**y**, and genotypes of SNP data,

**U** = (

**u**
_{1},

**u**
_{2}, ...,

**u**
_{
N
}), is denoted by

*g*(

**θ** |

**y**,

**U**) and written as

where *C* means a constant and it should be noted that the likelihood of **y** given the model parameters and genotypes is a normal distribution with a mean **Xb** +
and a variance *σ*
_{
e
}
^{2} and the prior of *g*
_{
l
}is a normal distribution with a mean 0 and a variance *σ*
_{
gl
}
^{2}, the prior of which is the scaled inverted chi-squared distribution *χ*
^{-2}(*ν*, *S*) as described above. The priors of **b** and *σ*
_{
e
}
^{2} are written by *p*(**b**) and *p*(*σ*
_{
e
}
^{2}), respectively, which are assumed uniform distributions over suitable ranges of the values here.

Following [

8], we regard the variances of SNP effects,

*σ*
_{
gl
}
^{2} (

*l* = 1, 2, ...,

*N*), as missing data and replace

*σ*
_{
gl
}
^{2} by the conditional posterior expectation of

*σ*
_{
gl
}
^{2}, denoted by

, given other parameters as E-step in the EM algorithm. Considering the expectation of scaled inverted chi-square distribution, it is given as

As M-step, we obtain the values of parameters other than *σ*
_{
gl
}
^{2} (l = 1, 2, ..., *N*) maximizing the log-posterior distribution with *σ*
_{
gl
}
^{2} replaced by
, which is expressed from (3) as

The mode of each parameter which maximizes the log-posterior can be given by solving an equation derived by making the partial derivative of the log-posterior with respect to the parameter equal to 0. Accordingly the modes of

*g*
_{
l
}(

*l* = 1, 2, ...,

*N*),

*b*
_{
j
}(

*j* = 1, 2, ...,

*f*) and

*σ*
_{
e
}
^{2}, denoted as

,

and

, satisfy the following equations:

The EM algorithm for BSR is summarized as follows:

1. E-step: *σ*
_{
gl
}
^{2} is estimated as
shown in (3) that is a conditional expectation given a current value of *g*
_{
l
}, which is
, for *l* = 1, 2, ..., *N*.

2. M-step: the values of *g*
_{
l
}(*l* = 1, 2, ..., *N*), *b*
_{
j
}(j = 1, 2, ..., *f*) and *σ*
_{
e
}
^{2} maximizing the log posterior distribution of parameters,
,
and
, are given according to (4), (5) and (6), where the value of each parameter are updated by replacing the other parameters by their current values.

E-step and M-step are repeated until the values of parameters converge. We stop this iteration when the change of values of parameters becomes small. For example, when
< 10^{-6}, where
and
are the current and the previous value of the parameters, the EM algorithm is regarded to be converged. We adopt this criterion for convergence of EM algorithm in the study.

### Modification of BSR

In SSVS, SNP effects can shrink more strongly than in BSR due to the assumption that only a small number of SNPs can be linked to QTL causing only a small portion of SNPs to have significant effects and many other SNPs to have negligible effects, which might result in the improvement of prediction accuracy for SSVS using a more parsimonious model. Although it was reported in [

5] that BSR could provide a more accurate result for QTL mapping with less than a hundred markers than SSVS developed by Yi et al. (2003) [

10], SSVS that is capable of deleting many SNPs with ignorable effects might perform as well or better than BSR in the case of a huge number of high-density SNPs involved in the prediction of GBV. However, the EM algorithm described above cannot be applied to SSVS because the prior distribution of

*σ*
_{
gl
}
^{2}, a mixture distribution combining

*χ*
^{-2}(

*ν*,

*S*) and 0 with probability

*p* and 1-

*p*, respectively, cannot be well treated with EM algorithm. To devise a cost-effective and EM-based method providing more accurate prediction for genomic selection with a higher degree of shrinkage, we develop a new modified BSR method incorporating a weight for each SNP depending on the strength of its association with a trait. In this method, we modify the model (1) by incorporating the variable

*γ*
_{
l
}indicating the inclusion of the

*l*th SNP in the model or exclusion of the

*l*th SNP from the model, where inclusion and exclusion of the SNP are indicated by

*γ*
_{
l
}= 1 and

*γ*
_{
l
}= 0, respectively. We assume that the prior probabilities of

*γ*
_{
l
}= 1 and

*γ*
_{
l
}= 0 are

*p* and 1-

*p*, respectively, as in SSVS. The modified model is written as

where **X**, **b**, **u**
_{
l
}, *g*
_{
l
}and **e** are as described in the model (1). We assume that the priors of *g*
_{
l
}and *σ*
_{
gl
}
^{2} are not influenced by the inclusion (*γ*
_{
l
}= 1) or exclusion (*γ*
_{
l
}= 0) of SNP in the model (2) and are as adopted in BSR. The method with the model (7), but utilizing these assumption, is called wBSR, meaning a modified BSR incorporating SNP weight, in this study since the same EM procedure as used in BSR for searching the posterior mode of parameters can be applied for this method and it is equivalent to an EM-based BSR procedure proposed by [8] when *p* = 1. We denote the variables indicating the inclusion of SNP effects in the model in a vector form as **γ** = (*γ*
_{1}, *γ*
_{2}, ..., *γ*
_{
N
}) which are treated as variables to be estimated in wBSR.

In wBSR, the posterior distribution

*g*(

**θ**,

**γ** |

**y**,

**U**) is modified from (2) and written as

*g*(

**θ**,

**γ** |

**y**,

**U**)

where the priors

*p*(

**b**) and p(

*σ*
_{
e
}
^{2}) are assumed uniform distributions. Applying the same argument as in EM algorithm used for BSR,

*σ*
_{
gl
}
^{2} is replaced by its conditional posterior expectation,

, in E-step which is given in (3). The variable

*σ*
_{
l
}indicating the inclusion of SNP in the model is unobserved, thus,

*σ*
_{
l
}is also replaced by its conditional posterior expectation

*ξ*
_{
l
}which can be written, from (8) and under the assumption that the priors of

*g*
_{
l
}and

*σ*
_{
gl
}
^{2} are independent of

*σ*
_{
l
}, as

where

**γ**
_{-l
}denotes

**γ** with the

*l*th component

*γ*
_{
l
}deleted and

In this expression, however,

*γ*
_{
j
}(

*j* ≠

*l*) is also unobserved. Therefore, we modify the expression for

*ξ*
_{
l
}by substituting

*γ*
_{
j
}with

*ξ*
_{
j
}for

*j* ≠

*l*. Accordingly, the conditional posterior expectation of

*γ*
_{
l
},

*ξ*
_{
l
}, is approximately obtained in E-step for

*l* = 1, 2, ...,

*N* following the formula:

where

In M-step, the values of

*b*
_{
j
}(

*j* = 1, 2, ...,

*f*) and

*σ*
_{
e
}
^{2} maximizing g(

**θ**|

**y**,

**U**),

, and

, satisfy the equations that are slightly changed from (

6) and (

7) and given as

For

*g*
_{
l
}, the value maximizing the posterior (8),

, depends on

*γ*
_{
l
}and is given as

= 0 for

*γ*
_{
l
}= 0 and

for

*γ*
_{
l
}= 1. As

*γ*
_{
l
}is unobserved, we substitute

*ξ*
_{
l
}for

*γ*
_{
l
}in the expressions of

,

and

. For

, the expression corresponding to

*γ*
_{
l
}= 1 is adopted for the iteration. In summary,

,

and

calculated in M-step are given as

It should be noted that *ξ*
_{
l
}given by (9) is an approximate posterior expectation of *γ*
_{
l
}that might be different from the posterior probability of SNP to be included in the model. Therefore, *ξ*
_{
l
}is referred to as the weight of the SNP that is regarded as an indicator of the strength of the association of the SNP with a trait. The SNP assigned a large weight with *ξ*
_{
l
}taking values near one is considered to essentially contribute to GBV while the contribution of the SNP assigned a small weight with *ξ*
_{
l
}taking values near the given prior value of *p* is regarded as negligible. The degree of shrinkage can be affected by the value of a prior probability *p* as well as the values of hyperparameters, *ν* and *S*, in *ξ*
^{-2}(*ν*, *S*), the prior distribution for *σ*
_{
gl
}
^{2}. The predicted GBV of wBSR is expressed as

### Simulation experiments

We evaluated the accuracy for the prediction of GBV using wBSR with variable *p* based on simulated data sets. The population and genome were simulated following the way as in [11]. In brief, the populations with an effective population size 100 were maintained by random mating for 1000 generations to attain mutation drift balance and linkage disequilibrium between SNPs and QTLs. The genome was assumed to consist of 10 chromosomes with each length 100 cM. Two scenarios were considered for the number of SNP markers available in the simulations and data sets under two scenarios were denoted as Data I and Data II. In Data I, 101 marker loci were located every 1 cM on each chromosome with total of 1010 markers on a genome. In Data II, 1010 equidistant marker loci were located on each chromosome with a total of 10100 markers. We assumed that equidistant 100 QTLs were located on each chromosome such that a QTL was in the middle of every marker bracket in Data I and the middle of every 10th marker bracket in Data II. Therefore, there were a total of 1000 QTLs located on a whole genome. The mutation rates assumed per locus per meiosis were 2.5 × 10^{-3} and 2.5 × 10^{-5} for marker locus and QTL, respectively. At least one mutation occurred in the most of all marker loci with such high mutation rate during the simulated generations. In the marker loci experiencing more than one mutation, the mutation remaining at the highest minor allele frequency (MAF) was regarded as visible, whereas the others were ignored, which caused the marker loci to have two alleles like SNP markers. The polymorphic QTLs at which mutation occurred only affected the trait, where the effects of QTL alleles were sampled from a gamma distribution with scale parameter 0.4 and shape parameter 1.66 and were assigned with positive or negative values with equal probabilities [1, 11].

In generation 1001 and 1002, the population size was increased to 1000. The population in the 1001th generation was treated as a training data, where the phenotypes of a trait and SNP genotypes of the individuals were simulated and analyzed with methods of genomic selection to estimate the SNP effects in the model. The phenotype of each individual in the 1001th generation was given as a sum of QTL effects over the polymorphic QTLs and environmental effects (or residuals) sampled from a normal distribution with a mean 0 and a variance 1 such that the heritability in the population was expected to be 0.5. The population in the 1002th generation was used as selection candidates, where the individuals were only genotyped for 1010 and 10100 SNP markers in Data I and Data II, respectively, without phenotypic records and GBV of each individual was predicted using a model with SNP effects estimated based on the population in the 1001th generation. The true breeding value (TBV) of the individual in the 1002th generation was also simulated as a sum of QTL effects corresponding to the QTL genotype and utilized for evaluation of the accuracy of predicted GBV but was regarded as unknown and unavailable in the estimation of SNP effects in the models. The accuracy was measured by the correlation between the predicted GBV and TBV.

For the evaluation of the accuracies of the predicted GBVs obtained by wBSR with *p* = 0.01, 0.05, 0.1, 0.2, 0.5 and 1.0, we simulated 100 and 20 data sets under the scenario of Data I and Data II, respectively. The accuracies of the GBVs predicted by BSR and SSVS based on MCMC algorithm were also evaluated on the same data sets in comparison with wBSR. In MCMC iteration, we repeated 11000 cycles using a burn-in period of the first 1000 cycles. The values of parameters were sampled every 10 cycles for obtaining the posterior means. In SSVS, we investigated the accuracies of predicted GBVs for *p* = 0.01, 0.05, 0.1, 0.2 and 0.5 in Data I but for *p* = 0.01, 0.05 and 0.1 in Data II due to large computational time required for MCMC algorithm. SNP markers with MAF less than 0.05, which were less than 10% of all SNPs, were not used for the estimation of effects and the prediction of GBV. We set *ν* = 4.012 and *S* = 0.002 for MCMC-based BSR and wBSR with *p* = 1.0 that is equivalent to an EM-based BSR proposed by [8], and *ν* = 4.234 and *S* = 0.0429 for SSVS and wBSR with other values of *p*. These values of *ν* and *S* were determined following [1].