An assessment of population structure in eight breeds of cattle using a whole genome SNP panel.

BACKGROUND
Analyses of population structure and breed diversity have provided insight into the origin and evolution of cattle. Previously, these studies have used a low density of microsatellite markers, however, with the large number of single nucleotide polymorphism markers that are now available, it is possible to perform genome wide population genetic analyses in cattle. In this study, we used a high-density panel of SNP markers to examine population structure and diversity among eight cattle breeds sampled from Bos indicus and Bos taurus.


RESULTS
Two thousand six hundred and forty one single nucleotide polymorphisms (SNPs) spanning all of the bovine autosomal genome were genotyped in Angus, Brahman, Charolais, Dutch Black and White Dairy, Holstein, Japanese Black, Limousin and Nelore cattle. Population structure was examined using the linkage model in the program STRUCTURE and Fst estimates were used to construct a neighbor-joining tree to represent the phylogenetic relationship among these breeds.


CONCLUSION
The whole-genome SNP panel identified several levels of population substructure in the set of examined cattle breeds. The greatest level of genetic differentiation was detected between the Bos taurus and Bos indicus breeds. When the Bos indicus breeds were excluded from the analysis, genetic differences among beef versus dairy and European versus Asian breeds were detected among the Bos taurus breeds. Exploration of the number of SNP loci required to differentiate between breeds showed that for 100 SNP loci, individuals could only be correctly clustered into breeds 50% of the time, thus a large number of SNP markers are required to replace the 30 microsatellite markers that are currently commonly used in genetic diversity studies.

between the Bos taurus and Bos indicus breeds. When the Bos indicus breeds were excluded from the analysis, genetic differences among beef versus dairy and European versus Asian breeds were detected among the Bos taurus breeds. Exploration of the number of SNP loci required to differentiate between breeds showed that for 100 SNP loci, individuals could only be correctly clustered into breeds 50% of the time, thus a large number of SNP markers are required to replace the 30 microsatellite markers that are currently commonly used in genetic diversity studies.

Background
Population structure and diversity within and between breeds of cattle have been studied to learn more about the origin, history and evolution of cattle [1][2][3]. Diversity studies and subsequent investigations concerning domestication events of Bos taurus and Bos indicus cattle have included sequencing from the displacement loop of mitochondrial DNA (mtDNA) [1]. Bradley et al. [1] used mtDNA sequence variation in 90 extant bovines from Africa, Europe and India to identify patterns of genetic variation consistent with the demographics of the domestication process. When nuclear marker have been used to study diversity in cattle, they have principally entailed microsatellite markers [2]. MacHugh et al. [2] used 20 microsatellites to help clarify the genetic relationships between cattle populations from Africa, Europe and Asia and provided support for a separate origin of domestication for Bos taurus and Bos indicus cattle. Analysis of allelic variation has been used to characterize the genetic relationships between breeds [4][5][6][7]. Kumar et al. [4] used 20 microsatellite markers to estimate the extent of genetic differentiation among breeds of cattle from India, Europe and the Near East. Assuming two ancestral populations, the mean admixture coefficients ranged from 0.0 to 0.1 in Indian Bos indicus breeds, 0.9 to 1.0 in European Bos taurus breeds and from 0.1 to 0.9 in hybrid breeds from the Near East. This variation in admixture coefficients reflects the ancestral divergence between the Bos taurus and Bos indicus subspecies. Similarly, Wiener et al. [5] characterized the diversity within and between eight British breeds of cattle using 30 microsatellite markers and found that the majority of the allelic variation (87%) was found within breeds. In addition, the studied breeds of cattle did not cluster according to their current geographic location, suggesting that the genetic origin of breeds was from different geographical regions. In a study of the origin of Chirikof Island cattle, MacNeil et al. [6] also found that 86% of the genetic variation in 34 microsatellite loci was found within Bos taurus breeds while the remaining 14% of genetic variation was found between breeds. However, the indigenous Chirikof Island cattle were strongly differentiated from the European Bos taurus cattle suggesting that a comparison between Asian Bos taurus breeds might next be appropriate. On the other hand, no significant divergence appears to exist between geographically separated populations of Holstein cattle probably due to historic occurrences of gene flow between populations and selection for similar traits [8]. Up to now most studies have focused on a small set of microsatellite loci, typically the 30 suggested by the FAO [9]. The true extent of autosomal diversity among cattle breeds has yet to be extensively explored. Here, we examine population substructure and interbreed diversity among eight breeds of cattle using 2,641 autosomal genome-wide SNPs.

Results and Discussion
Preliminary analyses were performed using the STRUC-TURE software. We first explored the appropriate number of iterations for the initial burn-in and estimation phases of the analysis. These preliminary analyses indicated that the probability of the number of ancestral populations (the K parameter from STRUCTURE) being greater than five was very small and therefore we restricted our analyses of all datasets to K ≤ 5 to limit computation time (data not shown). Analyses were performed on three datasets which used the full complement of markers but varied according to breed representation. The first analysis included data for all eight breeds, the second dataset included only the six taurine breeds and the third analysis included all Bos taurus breeds excluding the Japanese Black. The number of ancestral populations (K) that were subsequently admixed to form these breeds was estimated using the method described by Evanno et al. [10] and was found to be no greater than two for each data set ( Figure  1). The ΔK method of Evanno et al. [10] cannot be calculated at K = 1; however, the log-likelihood of the data, logP(X|K) in Figure 1, indicates that K = 1 can be excluded for all three of the analyzed datasets. The average estimated admixture coefficients (the Q parameters from STRUCTURE) of individuals from each breed are summarized in Figure 2 assuming two ancestral populations in each case.
The results presented in Figures 1a and 2a demonstrate that a considerable source of variation among cattle is the partitioning of breeds into the Bos indicus and Bos taurus subspecies. The variance between these groups (Table 1) accounted for 18.8% of the total variation (F CT = 0.19) which was significant (P ≤ 0.036). However, 71.06% of the genetic variation was found within populations. It should be noted that the SNP loci used in this study were detected in Bos taurus and their average minor allele fre-(A-C). ΔK (--) and the mean log P(X|K) (-ᮀ-) based on the 5 replicate STRUCTURE runs indicate that K = 2 is optimal for each dataset quency was much lower in Bos indicus. This ascertainment bias may have resulted in underestimated F ST values between breeds within both the Bos taurus and Bos indicus types, making them appear more similar than they really are and overestimating F ST values between the Bos taurus and Bos indicus breeds, making them appear more different than they really are. Despite this, the topography of the phylogeny and of breed composition from the STRUCTURE analyses should be correct, even if the distances between breeds are biased. Unfortunately, this supposition cannot be examined using our data, because the vast majority of the SNPs in the public domain and that were sampled for this study were discovered in Bos taurus cattle using procedures that guaranteed that the most common SNPs would be detected. The same problem exists for the studies that have historically employed microsatellite loci, since the sampled loci were cloned primarily from Bos taurus cattle and the loci selected for phylogenetic analysis were those possessing the most alleles when surveyed across populations. For example, Bos taurus and Bos indicus breeds have previously been clustered within subspecies using STRUCTURE in an analysis using 20 microsatellites. In a model with K = 2, Kumar et al. [4] found that the mean admixture coefficient of taurine breeds ranged from 0.9 to 1.0 while that for the indicine breeds ranged from 0 to 0.1. Our mean admixture coefficients ranged from 0.03 to 0.08 in Bos indicus and from 0.54 to 0.67 in Bos taurus breeds. While our findings are similar to those of Kumar et al. for the Bos indicus breeds, results for the Bos taurus breeds differ substantially. This may be due either to the difference in number of markers examined or, perhaps more likely, due to the different mutation rates between microsatellite and SNP loci. As the mutation rate for microsatellite loci is higher than for SNP loci [11], using microsatellite markers would most likely give higher estimates of divergence as measured by admixture.
To explore the population structure among the taurine breeds, a second STRUCTURE analysis was performed removing the two indicine breeds, and using data from the six taurine breeds (Figure 1b). This analysis identified Japanese Black cattle as being distinct from the cluster comprising the remaining five taurine breeds (Figure 2b). However, the partitioning was not strongly supported by the analysis of molecular variance (Fct = 0.09; P < 0.17; Table 1). The mean admixture coefficients for the European taurine breeds ranged from 0.43 to 0.60 while values for the Japanese Black ranged from 0.1 to 0.29. The upper and lower quartile range of the admixture coefficients for the individual Japanese Black animals were not as symmetric as found for the European taurine breeds ( Figure  2b) and were skewed towards the European taurine breeds, suggesting a recent influence of European Bos taurus breeds within Japanese Black. Previously published reports describe the use of European breeds to upgrade Japanese Black cattle [12] which is supported by these data. Several domestication events have been suggested for cattle involving different strains of aurochs, including an independent taurine domestication event in Asia [12,13]. These results suggest that the Japanese Black breed is genetically distinct from the European taurine breeds and because the divergence greatly exceeds the variation between the beef and dairy breeds (Figure 2b), we believe that an independent Asian domestication event is more likely to explain the divergence than does selection or drift following domestication. The within breed variation in the admixture coefficient Q in Figure 2 also supports this contention. Provided the Japanese Black population does not represent a recent cross among divergent populations, the increased variation within this population is consistent with the hypothesis of a local Asian domestication event. Additional Asian derived cattle breeds will need to be tested to assess the weight of evidence for this hypothesis. However, our data are completely consistent with the origin of Japanese Black cattle being from an independent Asian domestication.
The third STRUCTURE analysis considered the remaining Bos taurus breeds after excluding the Japanese Black and resulted in a clustering of the meat and dairy breeds (Figures 1c, 2c). The mean admixture coefficients demonstrate considerably less variation within the Continental European breeds, which is consistent with the small effective F CT is the correlation of random haplotypes within a group of populations, relative to that of random pairs of haplotypes drawn from the whole species. F SC is the correlation of the molecular diversity of random haplotypes within populations, relative to that of random pairs of haplotypes drawn from the region. F ST is the correlation of random haplotypes within populations, relative to that of random pairs of haplotypes drawn from the whole species.
population size that must have accompanied the introduction into North America of small samples of animals from these Continental breeds. The strong selection for milk production in the Holstein breed in conjunction with the extensive use of artificial insemination has reduced the genetic diversity within this breed and is apparent in these data. Surprisingly, therefore, the Dutch Black and White cattle had the greatest variation among all of the breeds studied suggesting that selection for milk production has been less intense in this breed than in Holsteins. Interestingly, 4.65% of the variation was found between the beef and dairy groups (F CT = 0.04) ( Table 1) with a p value of 0.10 that was suggestive, but not significant. This variation suggests that artificial selection within cattle for alternate agricultural purposes has led to a genome wide divergence among the beef and dairy breeds. Additional analyses in which the genomic regions at which divergence between the types is greatest are overlaid with detected meat and milk QTL would be of considerable interest.
All of the STRUCTURE analyses using the three datasets supported the existence of two ancestral populations partitioning the breed types. Assuming that these represent the true number of ancestral populations, we sought to answer the question, how many loci would be required to precisely estimate the number of ancestral populations? We randomly sampled the dataset of 2,641 loci to produce 10 datasets with 25, 50, 100 or 150 loci and repeated each of the previous analyses. The results presented in Figure 3 show the results of each of the replicate analyses. At the subspecies level (Figure 3a), the correct number of ancestral populations was accurately inferred with as few as 25 loci. This is clearly due to the large divergence between the Bos indicus and Bos taurus subspecies which is demonstrated by the difference in the mean admixture coefficients in Figure 2a. The second analysis, which included only taurine breeds, demonstrates that using as many as 150 randomly chosen loci only yields the correct number of clusters in 40% of the instances (Figure 3b). This is most likely a result of the closer relationship between the taurine breeds (Figure 2b), and the presence of two levels of substratification among these breeds (Asian vs. beef vs. dairy). The third analysis, which excluded the Japanese Black breed, more frequently detected two ancestral populations (Figure 3c), which primarily detected the remaining beef vs. dairy strata in the data. Not surprisingly, it is evident from this set of analyses that the number of random SNP loci needed to accurately infer population structure is dependent on the divergence between populations. Earlier studies seeking to characterize the genetic diversity within and between breeds of cattle used 30 microsatellites [5]. Approximately three times the number of SNPs are needed compared to microsatellites [14], therefore, 150 SNPs should have been ample for inference of population structure. However, 150 SNPs still detected incorrect clustering among taurine breeds. Seldin et al. [15] reported similar results when trying to differentiate Northern and Southern European human populations using 400 randomly selected SNP loci and suggested the limited number of SNPs used as the potential problem. Clearly, future studies which seek to evaluate the relationships among closely related cattle breeds will require a larger number of SNP loci to accurately infer breed relationships.
Finally, we generated pairwise population Fst estimates using the complete dataset of 2,641 loci (Table 1) and used these to construct an unrooted Neighbor-Joining tree ( Figure 4). This analysis is consistent with expectations, with the two French breeds clustering together, the Dutch Black and White and Holstein also clustered, with short branch lengths. The European group of Limousin, Charolais, Angus, Holstein and Dutch Black and White is separated from the Japanese Black, and the taurine breeds are separated from the indicine breeds. These estimates of genetic distance and the tree topology support the findings of the STRUCTURE analyses.

Conclusion
The recent completion of a draft bovine genome sequence assembly has provided sufficient numbers of SNP loci to replace microsatellite loci and augment mtDNA sequences for population genetic analyses in cattle. We have shown that SNP loci can be used to identify population substructure among cattle breeds. However, we have demonstrated that a large number of SNP loci must be used to obtain an equivalent degree of precision in estimates of diversity compared with microsatellite loci, due to the lower information content of individual SNP loci. At issue is the importance of ascertainment of these loci to the phylogenies that are constructed. Because the majority of available SNPs were detected as the most common SNP within Bos taurus breeds, certain biases must exist within the analyses. However, the extent of these biases can only be quantified when these analyses are repeated using unbiased samples of loci, which to date, do not exist. Family structure and the number of individuals per family varied between breeds but the general family structure consisted of a grandparent, parent and three or more progeny. To determine the phase of alleles on the chromo-somes using linkage information, we selected small families where members within the families were closely related but the families themselves were as unrelated as possible. This three generation family structure allowed for the efficient estimation of marker phase relationships in the progeny and also produced the most likely phase relationships in each of the parents and grandparents. al. [16]. Briefly, sequence information for SNPs (see Additional file 1) was obtained from public databases and SNPs were genotyped as a GoldenGate ® assay using an Illumina BeadStation 500 G [17]http://www.illu mina.com. Loci included in this study met the following criteria; minor allele frequency (MAF) ≥ 0.05 in Angus based on previous screens (data not shown) and concordant locus order between radiation hybrid (RH) maps [18] and genomic sequence location. The software GENO-PROB V2.0 [19,20] was used to assess genotype score quality and to produce whole chromosome phased maternal and paternal haplotypes based on the pedigree and map locations of the loci.

Population Structure Analysis
STRUCTURE and the linkage model of Falush et al. [21] were used to evaluate the extent of substructure among contemporary breeds of European and Asian Bos taurus and Bos indicus cattle. Exploratory STRUCTURE runs were used to determine the optimum number of iterations for the initial burn-in and estimation phases of the analysis to ensure that the Gibbs sampler had explored a sufficiently large sample space to provide reliable posterior probabilities. From these preliminary analyses, we determined that an initial burn-in of 10,000 iterations followed by 10,000 iterations for parameter estimation was sufficient to ensure convergence of parameter estimates (data not shown). We performed a series of analyses (runs) that were based on inclusion of differing combinations of cattle breeds in an attempt to determine the minimum number of ancestral populations that were admixed to best explain the genomic architecture of the current set of breeds. The first run used all of the animals from all 8 breeds. The second run used the 6 taurine breeds (Angus, Charolais, Limousin, Dutch Black and White Dairy, Holstein and Japanese Black) while the third run used the taurine breeds without the Japanese Black. To estimate the number of populations (the K parameter of STRUC-TURE), each of these three data sets was analyzed allowing the value of K to vary from 1 to 5 and each run was repeated five times to produce a total of 75 STRUCTURE runs. Using the method of Evanno et al. [10] we calculated ΔK which is an ad hoc quantity related to the second order rate of change of the log probability (likelihood) of the data Pr(X|K) (equation 12 in [22]) with respect to the number of population clusters K.
Assuming the full dataset of 2,641 loci would yield the most accurate estimate of the true number of ancestral populations, we sought to determine the effect of the number of loci analyzed on inferences of K. Three new data sets each with 10 replicates were created by randomly sampling 25, 50, 100 or 150 loci from the 2,641 markers. Each replicate was analyzed using STRUCTURE as previously described except the admixture model was used rather than the linkage model as linkage among the sampled loci was assumed to be lost due to randomly sampling loci throughout the genome.
Finally, Fst values, population corrected average pairwise differences and analyses of molecular variance (AMOVA) were performed with the program ARLEQUIN [23] using the phased genotypes for each animal produced by GENOPROB V2.0. Significance levels for variance components and F statistics were estimated using 10,000 permutations. MEGA [24] was used to construct a Neighbor-Joining tree from the pairwise Fst values ( Table 2).

Authors' contributions
SDMcK conceived the study and participated in its design, data collection, analysis and manuscript preparation. RDS participated in study design, data analysis and manuscript preparation. BMM participated in study design, genotyping and data analysis. JA provided bioinformatics support. WC organized collection of Dutch Black and White samples, DC provided Charolais samples, EDN organized collection of Nelore samples, CAG provided bioinformatics support and organized collection of Brahman samples, CG provided bioinformatics support, HM organized the collection of Japanese Black samples, LKM provided bioinformatics support, ZW provided statistical support,