Models for prediction of fat, milk and protein yield in cows, using pedigree and genomic relationship information as inputs, and of wheat yield, using molecular markers as predictor variables, were studied. This was done using Bayesian regularized neural networks, and predictions were benchmarked against those from a linear neural network, which is a Bayesian ridge regression model. In the wheat data, the comparison was supplemented with results obtained by our group using RKHS or support vector methods. Different network architectures were explored by varying the number of neurons, and using a hyperbolic tangent sigmoid activation function in the hidden layer plus a linear activation function in the output layer. This combination has been shown to work well when extrapolating beyond the range of the training data [36]. The choice of the number of neurons can be based on cross-validation, as in the present data, or on standard Bayesian metrics for model comparison [11, 15].
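As an illustration of this architecture (a sketch in Python rather than the MATLAB code used in the study; all variable names are ours), a one-hidden-layer network with hyperbolic tangent activation in the hidden layer and a linear output computes:

```python
import numpy as np

def forward(X, U, b_hidden, w, b_out):
    """One-hidden-layer network: tanh hidden units, linear output.

    X        : (n, p) matrix of inputs (e.g., relationship coefficients).
    U        : (p, S) input-to-hidden weights for S neurons.
    b_hidden : (S,) hidden-layer biases.
    w        : (S,) hidden-to-output weights.
    b_out    : scalar output bias.
    """
    H = np.tanh(X @ U + b_hidden)   # (n, S) hidden activations
    return H @ w + b_out            # (n,) linear output layer

# Tiny synthetic example: 4 individuals, 3 inputs, 2 hidden neurons.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
yhat = forward(X, rng.normal(size=(3, 2)), np.zeros(2), np.ones(2), 0.0)
```

With one hidden neuron and a tanh replaced by the identity, this reduces to a linear (ridge-type) model, which is why the linear network serves as the natural benchmark.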

The Levenberg-Marquardt algorithm, as implemented in MATLAB, was adopted to optimize weights and biases, as this procedure has been effective elsewhere [38]. Bayesian regularization was adopted to avoid over-fitting and to improve generalization, and cross-validation was used to assess predictive ability, as in [28, 39].
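Bayesian regularization in the sense used here (MacKay's framework, as in MATLAB's implementation) minimizes a weighted sum of the data error and a weight-decay penalty, F = βE_D + αE_W. A minimal sketch of that objective, in our notation rather than the toolbox code:

```python
import numpy as np

def regularized_objective(residuals, weights, alpha, beta):
    """Bayesian-regularization objective F = beta*E_D + alpha*E_W,
    where E_D is the sum of squared errors and E_W is the sum of
    squared network weights (a weight-decay penalty)."""
    E_D = np.sum(residuals ** 2)
    E_W = np.sum(weights ** 2)
    return beta * E_D + alpha * E_W

# Larger alpha favours smaller weights (smoother functions);
# larger beta favours a closer fit to the training data.
F = regularized_objective(np.array([0.5, -0.5]), np.array([1.0, 2.0]),
                          alpha=0.1, beta=1.0)
# E_D = 0.5 and E_W = 5.0, so F = 1.0*0.5 + 0.1*5.0 = 1.0
```

In the full procedure α and β are re-estimated from the data during training, which is what shrinks the weights and curbs over-fitting.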

Bayesian neural networks reduce the effective number of weights relative to what would be obtained without regularization, which helps to prevent over-fitting [40]. For the networks we examined, even though the total number of parameters in Jerseys, for example, ranged from 300 to 1795, the effective number of parameters varied from only 91 to 136 when pedigree relationships were used as inputs, and from 127 to 166 when genomic relationships were used, illustrating the extent of regularization. There were differences in predictive ability among architectures, but the small sample used dictates a cautious interpretation. Nevertheless, the results seem to support networks with at least two neurons, as has been observed in several studies [20, 28, 41–43]. This suggests that linear models based on pedigree or on genomic relationships may not provide an adequate approximation to genetic signals resulting from complex genetic systems. Because highly parameterized models are penalized in the Bayesian approach, we were able to explore complex architectures. However, there was evidence of over-fitting in the Jersey training set, where correlations between observed and predicted data were always larger than 0.90, sometimes exceeding 0.95; this explains why correlations were much lower in the testing set, consistent with what has been observed in other studies with neural networks [42]. Although more parameters in a model can lead to a smaller error in the training data, it cannot be overemphasized that this error is not representative of prediction error in an independent data set, as shown by [43] in work on human stature.
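The "effective number of parameters" quoted above is, in MacKay's evidence framework, γ = k − 2α·tr(H⁻¹), where k is the total number of weights and biases, α the weight-decay hyperparameter, and H the Hessian of the regularized objective at its minimum. A hedged sketch with a synthetic diagonal Hessian (our names and numbers, for illustration only):

```python
import numpy as np

def effective_num_parameters(hessian, alpha, k):
    """MacKay's gamma = k - 2*alpha*trace(H^{-1}).

    Parameters whose posterior is dominated by the prior (small
    Hessian eigenvalues) contribute little to gamma, so gamma can be
    far below the nominal parameter count k."""
    return k - 2.0 * alpha * np.trace(np.linalg.inv(hessian))

# Synthetic example: 4 parameters; well-determined directions have
# large Hessian eigenvalues and each contribute close to 1 to gamma.
H = np.diag([1000.0, 1000.0, 1.0, 1.0])  # 2 well-determined, 2 poorly
gamma = effective_num_parameters(H, alpha=0.5, k=4)
# gamma = 4 - 2*0.5*(0.001 + 0.001 + 1 + 1) = 1.998
```

This is the mechanism behind, e.g., 1795 nominal parameters collapsing to an effective count near 150.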

Our results with ANN for wheat are at least as good as those obtained with the same data in two other studies. Crossa et al. [35] found the following cross-validation correlations for various methods: pedigree information (BLUP), 0.45; pedigree-based reproducing kernel Hilbert spaces regression (RKHS), 0.60; RKHS with both pedigree and markers, 0.61; Bayesian Lasso with markers, 0.46; Bayesian Lasso with markers and pedigree, 0.54; and Bayesian ridge regression on markers, 0.49. Long et al. [30] compared the Bayesian Lasso with four support-vector regression models formed by combining two kernels with two loss functions. The predictive correlation for wheat yield (average of 50 random repeats of the cross-validation exercise) was 0.52 for the Bayesian Lasso, and ranged between 0.50 and 0.58 for the support vector implementations. Hence, it seems clear, at least for wheat yield in this data set, that the non-parametric methods can outperform a strong learner, the Bayesian Lasso, and that the neural networks are competitive with the highly regarded support vector methods [11].
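The predictive correlations compared above come from repeated random cross-validation. A generic sketch of that exercise (our helper, applicable to any fit/predict pair, not the code of [30] or [35]):

```python
import numpy as np

def cv_predictive_correlation(X, y, fit, predict, n_repeats=50,
                              test_frac=0.1, seed=0):
    """Average Pearson correlation between observed and predicted
    phenotypes over random training/testing partitions."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = max(2, int(test_frac * n))
    cors = []
    for _ in range(n_repeats):
        idx = rng.permutation(n)
        test, train = idx[:n_test], idx[n_test:]
        model = fit(X[train], y[train])           # train on one part
        yhat = predict(model, X[test])            # predict the rest
        cors.append(np.corrcoef(y[test], yhat)[0, 1])
    return float(np.mean(cors))
```

Averaging over many random partitions, as in [30], reduces the dependence of the reported correlation on any single train/test split.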

A question of importance in animal and plant breeding is how an estimated "breeding value", i.e., an estimate of the total additive genetic effect of an individual, can be arrived at from an ANN output. There are at least two ways in which this can be done. One is by posing architectures with a neuron in which the inputs enter in a strictly linear manner, followed by a linear activation function on this neuron; the remaining neurons in the architecture, receiving the same inputs, would be treated non-linearly. A second one is obtained by observing that the infinitesimal model can be written as *y*_i = **z**′_i**u** + *e*_i, for some incidence row vector **z**′_i peculiar to individual *i*. Here, the breeding value of the *i*th individual can be written as BV_i = **z**′_i**u**. Likewise, consider a linear regression model for *p* markers, *y*_i = Σ_{j=1}^{p} *x*_ij *β*_j + *e*_i = **x**′_i**β** + *e*_i, where *β*_j is the substitution effect at marker locus *j*; *x*_ij = 0, 1, 2 is the observed number of copies of a given allele at locus *j* on individual *i*; and **x**′_i = {*x*_ij} and **β** = {*β*_j} are row and column vectors, respectively, each with *p* elements. Here, the "marked breeding value" of individual *i* would be MBV_i = **x**′_i**β**. Consider next a neural network with a hyperbolic tangent activation function throughout, that is

ŷ_i = tanh[*b* + Σ_{k=1}^{S} *w*_k tanh(*b*_k + Σ_{j=1}^{p} *p*_ij *u*_j^[k])].

Let **p**_i = {*p*_ij} be the vector of input covariates (e.g., genomic or additive relationships, marker genotype codes) observed on *i*. Adapting the preceding definitions to the ANN specification, one would have as breeding value (BV) of individual *i*

BV_i = tanh[*b* + Σ_{k=1}^{S} *w*_k tanh(*b*_k + Σ_{j=1}^{p} *p*_ij *u*_j^[k])].

Thus, the breeding value of individual *i* so defined depends on the values of the input covariates observed on this individual, on all connection strengths and bias parameters from inputs to neurons in the middle layer (the *u*'s and the *b*'s), and on all connection strengths from the middle layer to the output layer (the *w*'s). In order to obtain an estimate of breeding value, the unknown quantities would be replaced by the corresponding maximum a posteriori (MAP) estimates or, say, by the estimate of their posterior expectation if a Markov chain Monte Carlo scheme is applied [44].
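As an illustration of this second route, assuming the ANN breeding value is taken as the systematic (non-residual) part of a single-hidden-layer network with tanh activations throughout, it could be computed as follows (the function name and single-output architecture are our assumptions):

```python
import numpy as np

def ann_breeding_value(p_i, U, b_hidden, w, b_out):
    """Breeding value of one individual from a single-hidden-layer
    network with tanh activations in both layers.

    p_i      : (p,) input covariates observed on the individual.
    U        : (p, S) input-to-hidden connection strengths (the u's).
    b_hidden : (S,) hidden-layer biases (the b's).
    w        : (S,) hidden-to-output connection strengths (the w's).
    b_out    : output-layer bias.
    """
    hidden = np.tanh(U.T @ p_i + b_hidden)     # S hidden activations
    return float(np.tanh(w @ hidden + b_out))  # bounded in (-1, 1)
```

In practice the weights and biases passed in would be MAP estimates, or posterior means from a Monte Carlo scheme, and the output would be back-transformed to the original phenotype scale if the data were standardized for training.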

Another issue is that of assessing the importance of an input relative to that of others. For example, in a linear regression model on markers, one could use a point estimate of the substitution effect or its "studentized" value (i.e., the point estimate divided by the corresponding posterior standard deviation), or some measure that involves estimates of both substitution effects and allelic frequencies. A discussion of some measures of the relative importance of inputs in an ANN can be found in [28, 43]; one example is the ratio between the absolute value of the estimate of a given connection strength and the sum of the absolute values of all coefficients in the network.
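That last measure, the share of a coefficient's absolute value in the sum of absolute values of all network coefficients, can be sketched as (our implementation, not that of [28, 43]):

```python
import numpy as np

def relative_importance(all_coefficients):
    """Relative importance of each network coefficient:
    |coefficient| divided by the sum of absolute values of all
    coefficients, so the shares sum to one."""
    abs_c = np.abs(np.asarray(all_coefficients, dtype=float))
    return abs_c / abs_c.sum()

shares = relative_importance([2.0, -1.0, 0.5, -0.5])
# shares sum to 1; the first coefficient accounts for half the total
```

Note that, unlike studentized values, this measure ignores posterior uncertainty about the coefficients, so it ranks inputs by estimated magnitude only.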