Accuracy of genomic selection in simulated populations mimicking the extent of linkage disequilibrium in beef cattle

Background The success of genomic selection depends mainly on the extent of linkage disequilibrium (LD) between markers and quantitative trait loci (QTL), the number of animals in the training set (TS) and the heritability (h2) of the trait. The extent of LD depends on the genetic structure of the population and the density of markers. The aim of this study was to calculate accuracy of direct genomic estimated breeding values (DGEBV) using best linear unbiased genomic prediction (GBLUP) for different marker densities, heritabilities and sizes of the TS in simulated populations that mimicked previously reported extent and pattern of LD in beef cattle. Results The accuracy of DGEBV increased significantly (p < 0.05) with the increase in the number of bulls in the TS (480, 960 or 1920), trait h2 (0.10, 0.25 or 0.40) and marker densities (40 k or 800 k). Increasing the number of animals in the TS by 4-fold and using their phenotypes to estimate marker effects was not sufficient to maintain or increase the accuracy of DGEBV obtained using estimated breeding values (EBVs) when the trait h2 was lower than 0.40 for both marker densities. Comparing to expected accuracies of parent average (PA), the gains by using DGEBV would be of 27%, 13% and 10% for trait h2 equal to 0.10, 0.25 and 0.40, respectively, considering the scenario with 40 k markers and 1920 bulls in TS. Conclusions As reported in dairy cattle, the size of the TS and the extent of LD have major impact on the accuracy of DGEBV. Based on the findings of this simulation study, large TS, as well as dense marker panels, aiming to increase the level of LD between markers and QTL, will likely be needed in beef cattle for successful implementation of genomic selection.


Background
Genomic selection is a method of marker-assisted selection based on LD that potentially explores all QTL in the genome [1]. The breeding value (BV) is estimated by the sum of the effects of marker alleles or haplotypes covering the entire genome and its accuracy could be as high as 0.85 [1].
The BV calculated from the estimated effects of markers is often called DGEBV and the blended value between DGEBV and traditional EBV is often called Genomic Estimated Breeding Value (GEBV). The accuracy of DGEBV and GEBV depends on: 1) the level of LD between markers and QTL; 2) the number of animals in the TS; 3) the heritability of the trait and 4) the distribution of QTL effects [2].
Since 2009, national evaluations of Holstein bulls are being performed in the USA and Canada based on GEBV. Before that, a comprehensive study evaluating the reliability of GEBV of many economic traits in thousands of Holstein bulls genotyped with the Illumina BovineSNP50 Genotyping BeadChip (Illumina, Inc.,San Diego, California, USA) demonstrated that genomic PA of young animals were substantially higher than the reliability of traditional PA [3,4].
In beef cattle, the situation is quite different with respect to feasibility of genomic selection. There are several reasons for this, which include the genetic structure of herds, which involve a greater number of breeds with different origin and history of selection, resulting in significant differences in genetic parameters, such as LD and effective population size (N e ), which influence the accuracy of genomic selection [5][6][7]. Furthermore, much of the impetus for using genomics in beef cattle is to allow selection for traits that are not routinely recorded. Simulation studies are still more frequent than studies with real beef cattle data, despite the availability of databases, mainly in bos taurus breeds [e.g., [8]]. However, there still exists a lack of data available for several beef cattle breeds, especially for indicine (zebu) breeds. The aim of this study was to evaluate the potential accuracy of DGEBV for two marker densities, different heritability levels and different sizes of TS in simulated populations that mimicked previously reported extent and pattern of LD in beef cattle.

Population structure
Using the QMSim software [9], populations were simulated based on forward-in-time process [10] with either 40 k or 800 k single nucleotide polymorphism (SNP) markers, and 750 QTL across the 29 bos taurus autosomes (BTAs). In the first simulation step, 1000 generations with a constant size of 1000 were simulated, followed by 1020 generations with a gradual decrease in population size from 1000 to 200 in order to create initial LD and establish mutation-drift equilibrium in historical generations. The number of individuals of each sex remained the same in this step and the mating system was based on random union of gametes, randomly sampled from both the male and female gamete pools. Therefore only two evolutionary forces were considered in this step: mutation and drift. In the second step, an expansion of the population was created by initially randomly selecting 100 founder males and 100 founder females from the last generation of the historical population. In order to enlarge the population, eight generations were simulated with five offspring per dam and an exponential growth of the number of dams. The mating was again based on the random union of gametes and no selection was also considered in this step. In the case of simulating 800 k SNP markers, the expansion of the population was carried over six generations, instead of eight, due to computer time and memory requirements which resulted in a smaller population size.
In the next simulation step, two recent generation sets were simulated by selecting 640 males and 32000 females (POP1) or 160 males and 8000 females (POP2) from the last generation of the two expanded populations with 40 k or 800 k SNP markers, respectively. Then ten generations were simulated. Generations three to eight were used as TS and EBVs for this group were estimated by excluding information for generations nine and ten. The prediction set for both POP1 and POP2 included only animals born in generation ten.
The parameters used in the recent generations mimicked more closely to a real production system with one progeny per dam per year, 50% of male progeny, selection for high values of EBV and culling for low values of EBV with a replacement rate of 60% for sires and 20% for dams. Sires and dams were randomly mated.
The breeding values were estimated by BLUP, using Henderson's mixed linear equations [11] for an individual animal model, considering the true additive genetic variance. The rate of missing sire and dam information was 5%. A single trait with heritability of 0.10, 0.25 or 0.40 and phenotypic variance of 1.0 was simulated. The true breeding value of an individual was equal to the sum of the QTL additive effects and the phenotypes were generated by adding random residuals to the true breeding values. The whole simulation process was repeated 10 times. The parameters of simulation process are summarized in Table 1, while Figure 1 depicts the simulation steps.
The targeted extent and pattern of LD in the simulation was the average of the values reported by [ [12] and [13]] for different beef cattle breeds. Detailed information on the extent and pattern of LD were kindly provided by those authors.

Genome
The simulated genome consisted of 29 pairs of autosomes with length identical to the real bovine genome based on Btau_3.1 assembling [14] totaling 2333 cM. In most reported simulation studies, just one chromosome was simulated, due to the computing time and memory requirements. The advantage of simulating the real number of autosomes with length identical to the bovine genome is to create a more realistic scenario with respect to the number of physically unlinked marker and QTL loci. The SNP markers were evenly distributed and the initial number of markers was chosen such that it would generate two densities of segregating bi-allelic loci with minor allele frequency (MAF)>0.1: 40 k or 800 k. The markers were neutral in their effect on the trait. A number of QTL was simulated to generate 750 segregating loci with two, three or four alleles and MAF>0.1, whose positions were randomly distributed. Additive allelic effects were randomly sampled from gamma distribution with shape parameter equal to 0.4. The rate of missing marker genotypes was 0.01 and the rate of marker genotyping error was 0.005. A recurrent mutation rate of 10 -5 for both markers and QTLs was considered to establish mutation-drift equilibrium in historical generations. The same mutation rate was also applied in all subsequent generations after the historical ones. The parameters used for simulating the genome are given in Table 1 and a summary of the number of simulated SNP markers for each autosome is given in Table 2.

Training and prediction sets
The training set was composed of bulls from generations three to eight for both POP1 and POP2. The reason for excluding the first two generations was to use data from a population undergoing selection, which would be mostly the case in beef cattle. In order to estimate the effects of the 40 k segregating markers, all the bulls (n = 1920) that had more than 50 offspring (average progeny size = 73) were selected from POP1 for the three levels of heritability (0.10, 0.25 or 0.40). Of these, 960 and 480 bulls were randomly sampled so that the average accuracy of EBVs was kept the same across the TS sizes within each level of heritability. Therefore, the average progeny size was approximately the same (73) for the three TSs with an average accuracy of EBVs equal to 0.79, 0.90 or 0.94 for heritability of 0.10, 0.25 or 0.40, respectively. For the purpose of estimating the effects of 800 k markers in POP2, only 480 bulls had more than 50 offspring (average progeny size = 73) because of the smaller population size and, therefore, these were selected from POP2 for the three levels of heritabilities (0.10, 0.25 or 0.40). However, considering that this number of bulls with highly accurate EBV (n = 480) is realistic for most beef cattle breeding programs, this scenario would be useful for assessing the accuracy of genomic selection using a high density marker panel. Two other additional TS were created by randomly selecting 7680 and 1920 animals from POP1 and POP2, respectively. In this case, the phenotypes were used to estimate the marker effects, what represented a 4-fold increase in the TS. Figure 2 depicts these simulated scenarios with respect to the TS sizes.
The prediction sets were composed of 8000 and 2000 individuals randomly chosen from the 10 th generation of POP1 and POP2, respectively.

Genetic evaluations
The EBVs of animals were calculated by the standard animal model: where y is the vector of trait phenotypes, μ is the overall mean, Z is an incidence matrix and a is the vector of animals' additive genetic effects and e is a vector of random errors. The variance of y was assumed to be ZAZ'σ 2 a + Iσ 2 e where A is the additive genetic relationship matrix and σ 2 e is the residual variance. A ridge regression model was used to estimate SNP effects for using in computing DGEBV.

Generation Populations
Step  The results are for each of the 29 autosomes in the recent generations for moderate heritability (0.25) and POP1 (40 k) and POP2 (800 k) over 10 replicates. SD: Standard Deviation.
The model can be written as where y is the vector of either trait phenotypes or EBVs calculated from trait phenotypes and pedigree data by BLUP, μ is the overall mean, X is the matrix of marker genotypes for each animal and b is the vector of marker effects, and e is a vector of random errors.
The marker genotypes of an individual were re-coded as the number of copies of one of the SNP alleles, i.e., 0, 1 or 2. The ridge regression factor used assuming a common variance for the marker effects, i.e σ 2 a / n i=1 2p i q i where n is the number of markers. This is equivalent to GBLUP method proposed by VanRaden [15].
The DGEBV was computed for the animals in the prediction set aŝ a = X' β where X is the matrix of marker genotypes for each animal in the prediction set and β is the vector of estimated marker effects.
The gebv software [16] was used to estimate marker effects and DGEBV.
The accuracy of DGEBV was calculated as the correlation between the DGEBV and the true breeding value and standard errors were computed as the standard deviation of the accuracies across the 10 replicates, divided by √ 10 . The differences between the average accuracies of DGEBV across scenarios were tested by t-test.
The accuracies of DGEBV were regressed on number of animals in the TS, using a quadratic regression, to assess the effect of increasing the TS size.

Linkage Disequilibrium
Linkage Disequilibrium was measured by r 2 , which is the squared correlation of the alleles at 2 loci [17]: where It was demonstrated that r 2 is a more suitable measure to estimate LD for biallelic markers, such as SNPs, because r 2 is less sensitive to allelic frequency and sample size than D' [18][19][20].
Due to the computing time required to calculate all possible pair-wise LD, the LD statistics for all pair-wise LD were calculated only for BTA 1 and for one replicate. However, the statistics should illustrate well the decay of simulated LD for the entire genome. The statistics for LD between all adjacent SNP were calculated, however, for all 29 BTA.

Linkage disequilibrium
The average levels of LD between adjacent SNP in the recent generations of POP1 and POP2 are given in Table 2. Table 3 presents the average LD for different distances between all closely located SNP pairs (<1 Mb) in bins of 0.1 Mb for BTA1.
The overall r 2 between adjacent SNP across all autosomes for moderate heritability (0.25) in the recent generations was 0.25 and 0.32 for POP1 (40 k) and POP2 (800 k), respectively ( Table 2). This average r 2 between adjacent SNPs for POP1 is similar to the value reported by [21] for Gyr cattle in Brazil (r 2 = 0.21), using bulls genotyped with the Illumina BovineSNP50 chip and similar to the r 2 reported by [12] for a multi-breed herd genotyped with the Illumina BovineSNP50 chip in Canada (r 2 = 0.21 between markers 30-35 kb apart). Table 3 shows the average r 2 between all SNP pairs for different distance ranges up to 1 Mb for BTA1. The highest average estimated r 2 was in the range of 0.22-0.24 for a distance of 0-0.10 Mb. The r 2 values decreased with increasing distance between SNP, but not as sharply as the decrease reported for Holstein cattle [22]. The trend of decay of LD with the increase in physical distance was, as expected, exponential for both POP1 and POP2 (  In the current simulation study, for moderate heritability (0.25) and POP1 (40 k), the % of r 2 > 0.30 for ranges 0-0.1 Mb, 0-0.2 Mb, 0-0.5 Mb and 0-1 Mb was 28%, 25%, 15% and 9%, respectively. These results are similar to findings of [21] for Gyr cattle, who reported a % of r 2 > 0.30 of 23%, 20%, 14% and 10% for the same ranges, respectively.
Using the Illumina BovineSNP50 chip, [13] reported r 2 of 0.31, 0.22 and 0.15 for Angus and crossbred cattle for distance ranges between 0-0.03 Mb, 0.03-0.06 Mb and 0.06-0.10 Mb, respectively. The simulated results in the present study for the distance range 0-0.10 Mb seem in line with those results reported by [13].
Mckay et al. [24] reported the r 2 for eight breeds of cattle varying from 0.15 to 0.20 for a distance range from 0-0.1 Mb, which is similar to the r 2 observed in the simulated data for the same distance range (Table 3). Table 4 shows the average LD (r 2 ) between adjacent SNPs and distribution of SNP pairs across different LD ranges on chromosome 1 in the recent generations for three heritability levels and POP1 and POP2. The % of SNP pairs with r 2 > 0.30 did not differ across the heritabilities, probably because the number of generations under selection was small (10).
When comparing the marker density, 43% of SNP pairs showed high LD (r 2 > 0.30) in POP2 (800 k) while The results are for chromosome 1 in the recent generations for three heritability levels and POP1 (40 k) and POP2 (800 k) for one replicate. For POP2 (800 k), 40 k markers were randomly sampled to represent the 800 k panel.
33% of the SNP pairs showed high LD in POP1 (40 k) for moderate heritability (0.25). This difference of ten points seems not very high, but when it is translated to number of SNP pairs in high LD in each marker density, it becomes quite significant and has an impact on the accuracy of genomic selection, as shown later in the results.

Accuracy of DGEBV
Overall, the accuracy of DGEBV increased significantly (p < 0.05) with the increase in the number of bulls in the TS, heritability of the trait and density of markers (Table 5). When the EBV was used to estimate the marker effects in POP1 (40 k), the number of bulls in the TS, which varied from 480 to 1920, increased the accuracy of DGEBV for all three heritability levels.
The DGEBV accuracies were regressed on the number of bulls in the TS, using a quadratic regression. The estimated regressions were y = 0.256 + 0.00026x -5.2e -8 x 2 (p < 0.0001; R 2 = 0.91), y = 0.220 + 0.00036x -8.2e -8 x 2 (p < 0.0001; R 2 = 0.96) and y = 0.228 + 0.00037x -8.3e -8 x 2 (p < 0.0001; R 2 = 0.97) for heritability = 0.10, 0.25 and 0.40, respectively. Figure 3 depicts the estimated regressions for each heritability level. It is expected that the accuracy of DGEBV would increase 0.15, 0.20 and 0.21 points for an increase from 480 to 1480 bulls in the TS, and 0.11, 0.12 and 0.14 points from 920 to 1920 bulls in the TS, for heritability of 0.10, 0.25 and 0.40 respectively. These results do not fully agree with those from [3], who used Holstein bulls genotyped for the Illumina BovineSNP50 chip and reported that gains from genomic data increased almost linearly with number of bulls in the TS. Although it not possible to extrapolate out of the range of number of bulls in the TS, the current results suggest that it will be necessary genotyping a large number of bulls to obtain large gains with genomic selection, and a plateau might be expected, limiting the possible gain in accuracy of DGEBV for the density of markers considered (40 k) and the extent of LD simulated.
The highest accuracy (0.64) was observed for heritability of 0.40 and 1920 bulls in the TS (Table 5). This value implies that 41% of genetic variance of the trait was explained by marker effects. These results are similar to those reported by [8], using data from 2000 Angus bulls. The author reported correlations from 0.5 to 0.7 between conventional EBV and genomic predictions. The author, however, did not report the accuracy of conventional EBV used to calculate the correlations.
The effect of heritability was not as evident for 480 bulls in the TS, as it was for 960 and 1920 bulls (Table  5), reflecting the fact that the sample size became a more limiting factor than the heritability level in the TS of 480 bulls. The results are for chromosome 1 in the recent generations for three heritability levels and POP1 (40 k) and POP2 (800 k) over 10 replicates. The response variable used for calculating the marker effects also affected the accuracy of DGEBV. Table 5 shows that when the phenotypes were used as response variable to estimate marker effects in POP1 (40 k), but with a 4-fold TS size (n = 7680), the accuracy of DGEBV was higher than that from using EBVs from 480 bulls for all heritability levels (p < 0.05). However, when EBVs from 1920 bulls instead of own phenotypes from 7680 bulls were used to estimate the marker effects, the accuracy of DGEBV was higher for heritabilities of 0.10 and 0.25 (p < 0.05). Therefore, in this study, genotyping 4-fold more animals (7680 vs. 1920) and using own phenotypes instead of EBVs to estimate the marker effects was not sufficient to increase the accuracy of DGEBV for moderate to low heritability traits. Table 5 also presents the DGEBV accuracy for the 800 k markers scenario. Because of the computing time and memory requirements, it was possible to simulate only 480 bulls with accurate EBV in the TS. The estimated accuracies of DGEBV were 0.43, 0.46 and 0.48 for heritability 0.10, 0.25 and 0.40, respectively. The accuracies of DGEBV in this scenario were higher (p < 0.05) than those from 40 k SNP markers and 480 bulls in the TS and similar to those with 40 k SNP markers and 960 bulls in the TS. These results indicate that the higher level of LD between the 800 k SNP markers would require 2 times less animals in the TS. Muir [25], using simulated data, reported that increasing number of markers had contradictory effects on accuracy of GEBV, because of co-linearity between the effects of markers. Then it might be expected that the 800 k SNP scenario would benefit more from an increase in the TS size and differences with respect to the accuracy of DGEBV from the 40 k SNP scenario would increase.
When the number of bulls genotyped with 800 k SNP markers was 1920 and their phenotypes were used for estimating marker effects, the accuracy of DGEBV increased only for heritability 0.40 (p < 0.10), while for the other two levels of heritability the accuracy decreased (p < 0.05) compared to accuracies from 480 bulls in the TS and using EBV for estimating marker effects.
One of the main advantages of using genomic information is for traits difficult or expensive to measure for which traditional evaluations are not usually available or to obtain accurate EBV early in the animal's life, when the own phenotypic information has not been measured yet. For the latter situation, if we consider that the average EBV of the parents is the only available information early in the animal's life, one could compare the accuracies provided by PA and the accuracies of DGEBV. Since the reliability of PA is a quarter of the sum of reliabilities of the EBV of parents, considering the average reliability of EBV of bulls in the TS reported in this study and the average reliability of the EBVs of the dams in the simulated data, the expected accuracies for PA would be 0.44, 0.53 and 0.58 for heritability = 0.10, 0.25 and 0.40, respectively. These accuracies are smaller than those accuracies of DGEBV considering 1920 bulls in the TS for POP1 (40 k) for all heritability levels. Therefore, the gains in accuracy by using DGEBV would be of 27%, 13% and 10% compared to accuracy of PA 0.10: acc=0.256 + 0.00025625*n -5.2e -8 *n 2 (R 2 =91%) 0.25: acc=0.220 + 0.00035521*n -8.2e -8 *n 2 (R 2 =96%) 0.40: acc=0.228 + 0.00037188*n -8.3e -8 *n 2 (R 2 =97%) for these conditions. If PA is unavailable, then the use of DGEBV could potentially be even more valuable.
Schenkel et al. [4] reported results from a large genomic prediction validation study in Canada for 44 traits of dairy cattle, using 6403 bulls in the TS genotyped for the Illumina BovineSNP50 chip. The genomic predictions showed an increase of reliability for 42 of the 44 traits. For production traits the average gain in reliability was of 41%, using domestic proofs only and 74%, using both domestic and MACE proofs. Therefore, as expected, the gains reported in the current study that tried to mimic the LD structure that one might expect in a beef cattle population were lower than those reported in dairy cattle.
The main issue that has been discussed in most studies involving genomic selection is whether it will bring additional benefits, considering the cost of genotyping. An alternative to reduce the cost of genotyping and, thus, increase the number of genotyped animals, is the imputation of genotypes, which would allow genotyped animals with a less dense SNP panels (e.g., 3 k) to be imputed to a denser SNP panels (e.g., 50 k), using a reference population genotyped with the denser SNP panel [26]. This alternative should be considered primordial in the projects that involve genomic selection, mainly in countries where the number of available bulls with accurate EBV for genotyping is much smaller than those considered in this study. In this way, with the use of imputation, more bulls, even those with less accurate EBV, could be incorporated to the TS. Collaboration among institutions involved in genomic selection research and application in beef cattle through the sharing genotypes could also increase the TS sizes, which would make more feasible the incorporation of genomic information in breeding programs.

Conclusions
As reported in dairy cattle, the size of the training set and the extent of linkage disequilibrium have major impact on the accuracy of direct genomic EBV. Based on the findings of this simulation study, large training sets, as well as dense marker panels, aiming to increase the level of linkage disequilibrium between markers and QTL, will likely be needed in beef cattle for successful implementation of genomic selection.
Gains in accuracy by using direct genomic EBV and, therefore, before the animal having an accurate EBV, showed more advantage for traits with a moderate to low heritability.
Increasing the number of animals in the TS by 4-fold and using their phenotypes to estimate marker effects was not sufficient to maintain or increase the accuracy of direct genomic EBV obtained using estimated breeding values when the trait heritability was lower than 40% for both marker densities, i.e. 40 k and 800 k SNPs.
This investigation provides preliminary insights on expected gains in accuracy by using direct genomic EBV in beef cattle. Additional simulation studies are warranted.