We evaluated two models of genome selection for soybean embryogenesis, the main effect model and the epistatic effect model. The main effect model is already very efficient with a 0.33 r-square between the observed phenotypes and the predicted genomic values. With the epistatic effect model, the squared correlation was 0.78, more than twice the efficiency of the main effect model. This discovery provides surprisingly good news to soybean breeders. The squared correlation coefficient between the observed phenotypes and the predicted genomic values is different from the *R*
^{2} value in multiple regression analysis. The r-square here is the squared correlation between the observed phenotypes and the estimated genetic values in which the individual lines predicted contribute to the parameter estimation. The r-square of our epistatic effect model was very high (0.78). In the cross validation generated correlation, the lines predicted do not contribute to the parameter estimation, and thus the squared correlation is a good indication of the predictability. If a new line occurs from the same population but has no phenotypic value, using the marker effects estimated from the existing lines to predict the genetic value of the new line will be as accurate as 0.78. This can save tremendous resources for plant breeding because we can use marker information alone to select plants in several cycles without field evaluation of the phenotypes.

Cross validation analysis showed that the main effect model found 27 important markers, but none of the 27 markers appeared as main effects in the epistatic effect model. This means that epistasis plays a more important role in determining the variation of soybean embryogenesis. These 27 markers did not disappear completely; parts of their effects have been absorbed by the epistatic effects due to correlation of variables in the design matrix. The epistatic model provides better prediction of the genomic values, but it does not disqualify the main effect model. When epistatic effects are excluded from the model, parts of their effects have been absorbed by the main effects due to complicated correlation among the design matrices of the effects. If a breeder decides to use the additive effect model alone for genome selection, he/she still has an accuracy of 0.33.

We used the cross validation analysis to select markers and marker pairs for genome selection. We progressively increased the number of effects in the model to monitor the change in the accuracy of prediction. The optimal number of effects included is the one that maximizes the accuracy of prediction or minimizes the prediction error. The pattern of the change depends on the models used and the traits analyzed. For the main effect model, including more effects than necessary is detrimental to genome selection. For the epistatic effect model, the decrease of the prediction accuracy does not seem to be significant after the optimal number of effects is reached. The reason for the difference in the two models (additive and epistatic) is due to the fact that the large number of effects in the epistatic model caused strong shrinkage of the estimated regression coefficients. Once the prediction accuracy reached the peak value, additional effects included all have extremely small values (virtually equal to zero). Including these zero effects does not cause much damage to the prediction. These patterns of the correlation profiles cannot be generalized to other situations. Therefore, cross validation must be performed for different traits in different experiments.

Unfortunately, the marker density in this experiment was very low in the experiment, which prevented us from using the interval mapping and composite interval mapping approaches for fine mapping. The average interval for the 77 mapped markers was 40 cM. For such a low marker density, we were more concerned about not detecting any QTL. To our surprise, we found more QTL than we originally anticipated. Our cross validation analysis confirmed 27 main effects and 66 epistatic effects. Therefore, we conclude that the soybean embryogenesis is a polygenic trait. The current analysis provides a guideline for further study in fine mapping of QTL for embryogenesis of soybean. The next step is to develop more markers to saturate the entire genome. With a high density marker map, more efficient genome selection can be conducted.

The number of somatic embryos is a discrete trait, which can be modeled by the Poisson distribution. In this particular experiment, each original data point was the average of 10 plants. Therefore, treating the average value as a continuous trait is more reasonable than fitting the Poisson model. For curiosity, we did analyze the data using the Poisson distribution. We also performed a data transformation using the square root of the data [18]. The results were almost identical to the current analysis regarding the markers and marker pairs identified (data not shown).

The main purpose of this study was not to develop new statistical methods for genome selection; rather, we used an existing method [13] to investigate the possibility of using genome selection to improve soybean embryogenesis. The empirical Bayesian method is adequate to handle 80 markers with all pair-wise interactions. When marker density is high, the method may be limited for handling all pair-wise interactions. Past experience [5] showed that the method can handle the number of effects about 60 times as large as the sample size. In this case, the maximum number of effects that can be handled with 126 lines (sample size) is about 126 × 60 = 7560., which is equivalent to about 122 markers. For 1000 markers with roughly 500,000 effects, the sample size should be about 8000 in order to estimate these many effects. When the sample size is limited, improved methods are required. One of the authors in this study (XC) is developing a fast empirical Bayes (fEB) method, which can handle the number of effects as large as 800 times the sample size (unpublished result).

In plant breeding, it is common to select superior cultivars from a random pool of existing accessions. The empirical Bayesian method used in this study works equally well for such randomly selected populations. It is also common to select superior RILs from an RIL crossing experiment derived from two inbred lines [19]. The result of this study can be directly applied to plant breeding using RILs as plant material. In fact, we are in the process of adding more markers and designing a field experiment to apply the selected RILs to improve soybean embryogenesis.

This study demonstrates the importance of epistasis in determination of the embryogenesis of soybean. This finding is clearly in contrast to what Xu and Jia [5] discovered in barley where additive effects play a more important role than epistatic effects in all seven traits investigated in the experiment. However, it is consistent with the study in corn by Dudley and Johnson [9] who discovered that epistatic effects are important in genome selection for yield and oil content traits. Studying the grain yield and its component traits in maize, Ma et al. [20] found that the relative importance of main effects and epistatic effects varies among traits. In addition, they found that only a small proportion of the main-effect QTL interact with other QTL. The study by Yu et. al [21] showed that epistasis also plays a major role in hybrid vigor in rice. This study and the studies by others mentioned above all concluded that main effects and epistatic effects do not have any intrinsic connections. In other words, whether markers interact or not does not depend on the presence of main effects of the two interacting markers. This further supports the notion that epistatic effects and main effects must be studied simultaneously, rather than in sequence where main effects are studied first and epistatic effects are then evaluated only on those markers with significant main effects.

Although there were hundreds of studies in the past decades claiming discovery of major QTL in plants, only a few of them have successfully cloned QTL [22]. QTL mapping and cloning are definitely rewarding. However, many complex traits may be contributed mainly by QTL interactions [9, 20, 21]. Some may be contributed by polygene only, like the soybean embryogenesis presented in this study. QTL cloning in such cases is obviously not a practical strategy to perform marker aided selection. Fortunately, genomic selection, a special form of marker assisted selection (different from the classical marker assisted selection), has been invented to estimate the molecular breeding values by utilizing saturated markers of the entire genome [2]. Dudley and Johnson [9] further improved the efficiency of genome selection by including epistatic effects in the model. Our study further supports the study of Dudley and Johnson [9]. There is a critical difference between our study and that of Dudley and Johnson. They used a partial least squares (PLS) method to predict the total genomic values of plants, in which marker-trait association was indicated by the association of a few principal components. Our study, however, evaluated the association directly through the association of observed phenotype and marker genotypes, making the method feasible in practical breeding program.

The entire data analysis was conducted using the QTL procedure in SAS, called PROC QTL [23]. This program can perform QTL mapping for not only continuous traits but also discrete traits, such as binary, ordinal, binomial and Poisson traits. In addition, the QTL procedure provides an opportunity for the users to choose maximum likelihood, least squares, weighted least squares or Bayesian method for interval mapping, composite interval mapping or multiple interval mapping. Furthermore, the procedure can analyze multiple traits using the multivariate model. The Bayesian analyses presented in this study were conducted using this QTL procedure. The program is freely downloadable from our website (http://www.statgen.ucr.edu).