The principal goal of this study was to evaluate marker selection methods and determine the minimum number of SNP markers from the BovineSNP50 BeadChip required to effectively and confidently assign individual genotypes to European cattle breeds. While all SNP selection methods yielded reduced marker panels capable of breed identification, the power of assignment varied markedly among analysis methods.
Behaviour of the marker selection methods
The pairwise Wright's FST selection method marginally outperformed other selection methods in the individual assignment analysis (Table 3, Figure 2). Nonetheless, three other selection methods (delta, pairwise W&C's FST and PCA) did not perform poorly, either at ranking markers or in assignment success rates. Across these four selection methods, to achieve 95% assignment success, < 80, < 100, < 140 and < 200 SNP markers were required at the stringency threshold levels of LLR > 0, LLR > 1, LLR > 2 and LLR > 3, respectively (Table 3, Figure 2). These four selection methods (delta, pairwise Wright's FST, pairwise W&C's FST and PCA) to a large extent agreed on the most informative SNP markers. The resulting estimates of genetic informativeness of each SNP marker were highly correlated across the four selection methods and there was a large degree of overlap among the top-ranked 500 SNP markers (Table 2). This was to be expected because all methods were applied to individual SNP marker allele frequencies. In addition, it has been demonstrated that delta and Wright's FST function similarly. However, PCA exhibited the poorest correlation with the other methods and the lowest overall individual assignment power. Paschou et al. advocated using PCA to determine marker informativeness because PCA renders an overall estimate for a SNP marker, whereas other selection methods require an average to be estimated from pairwise calculations when the number of populations (K) > 2. PCA is an approach used to characterise the structure of a set of variables (in this case SNPs). The inferred relationships between objects (e.g., populations/breeds) are determined by the structure of the covariance matrix between the marker allele frequencies. Thus, the informativeness of a given marker will depend on the other markers included in the analysis, and this could influence which markers PCA identified as informative. In contrast, delta and FST do not take into account the relationships amongst markers and the level of information of each marker is estimated independently of the others.
The remaining two selection methods, global Wright's and W&C's FST, performed comparatively poorly in the individual assignment test. As similarly observed by Kersbergen et al., global FST may not be appropriate to assess the level of genetic information in SNP markers when K > 2, as the method could result in the selection of SNP markers that are specific to distinct populations [Additional file 1: Supplemental Figure S1]. The selected SNP markers that were specific for only the most distinct breed were not segregating in the majority of the other breeds [Additional file 1: Supplemental Figure S1], and thus the expected heterozygosity would be low. Indeed, it is suggested that genetic markers with high expected heterozygosity are informative and therefore useful in individual assignment analysis [15, 33], such as those identified using pairwise Wright's FST, delta, pairwise W&C's FST and PCA. As a result, the performance of individual assignment tests using markers selected by global FST may be compromised compared to the other selection methods. Consequently, when K > 2 it is preferable to estimate FST, either Wright's or W&C's, on a population pairwise basis and then estimate the average across the pairwise comparisons to obtain an overall estimate for a marker.
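The contrast between global and averaged pairwise FST can be illustrated with a minimal sketch. It assumes a biallelic SNP, equal weighting of breeds and the textbook form FST = (HT - HS) / HT; the function names and the allele frequencies are hypothetical and are not taken from the study, whose exact estimators may differ.

```python
from itertools import combinations


def wright_fst(freqs):
    """Wright's FST for one biallelic SNP.

    freqs: allele frequency of the same allele in each population.
    Assumes equal population weights (an illustrative simplification).
    """
    p_bar = sum(freqs) / len(freqs)
    h_t = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled
    if h_t == 0:
        return 0.0  # marker is monomorphic across these populations
    h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)  # mean within
    return (h_t - h_s) / h_t


def mean_pairwise_fst(freqs):
    """Average of FST computed separately for every pair of populations."""
    pairs = list(combinations(freqs, 2))
    return sum(wright_fst(pair) for pair in pairs) / len(pairs)


# Hypothetical allele frequencies in six breeds (illustrative only).
private_snp = [0.95, 0.0, 0.0, 0.0, 0.0, 0.0]  # segregates in one breed only
shared_snp = [0.9, 0.1, 0.9, 0.1, 0.9, 0.1]    # differentiates many pairs

for name, snp in (("private", private_snp), ("shared", shared_snp)):
    print(name, round(wright_fst(snp), 3), round(mean_pairwise_fst(snp), 3))
```

With these made-up frequencies, global FST ranks the breed-specific marker highest, whereas the pairwise average favours the marker that separates many breed pairs. This is the behaviour described above for markers that segregate only in the most distinct breed.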
Assignment precision: minimum number of markers required
Since pairwise Wright's FST outperformed the other selection methods (Table 3), this selection method was subsequently adopted to estimate the minimum number of SNP markers required to achieve the desired assignment success. At the most commonly used stringency threshold (LLR > 0) and the accepted level of appropriate assignment success (95%), < 60 SNP markers were required for the correct assignment of the 384 individual genotypes. When stricter stringency threshold levels were applied, the number of SNP markers required to attain 95% assignment success increased (Table 3). Depending on the chosen degree of confidence, the required number of markers ranges from 60 to 140 SNPs (80, 105 and 140 at LLR > 1, LLR > 2 and LLR > 3, respectively). While the percentage of assignment success decreases with increasing stringency thresholds, so too does the risk of false assignment. Consequently, there is greater confidence in the estimated genotype likelihoods and LLR calculations if a strict stringency threshold (LLR > 3) is adopted.
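The effect of the stringency threshold can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: it assumes Hardy-Weinberg proportions, independent loci, and an LLR defined as the log10 likelihood of the most likely breed minus that of the second most likely breed. The breed labels, allele frequencies, frequency floor and function names are all hypothetical.

```python
import math


def genotype_log10_likelihood(genotype, freqs, floor=0.01):
    """log10 likelihood of a multilocus genotype in one breed.

    genotype: copies (0, 1 or 2) of the reference allele at each SNP.
    freqs: reference-allele frequency at each SNP in the breed.
    floor: keeps frequencies away from 0 and 1 so log10 is defined
           (an illustrative choice, not a value from the study).
    """
    total = 0.0
    for copies, p in zip(genotype, freqs):
        p = min(max(p, floor), 1 - floor)
        q = 1 - p
        # Hardy-Weinberg genotype probabilities for 0, 1 and 2 copies.
        total += math.log10((q * q, 2 * p * q, p * p)[copies])
    return total


def assign(genotype, breed_freqs, threshold=0.0):
    """Return (breed or None, LLR); None means the threshold was not met."""
    scores = sorted(
        ((genotype_log10_likelihood(genotype, f), breed)
         for breed, f in breed_freqs.items()),
        reverse=True,
    )
    (best, breed), (second, _) = scores[0], scores[1]
    llr = best - second
    return (breed if llr > threshold else None), llr


# Hypothetical reference frequencies for two breeds at three SNPs.
breed_freqs = {"A": [0.9, 0.8, 0.1], "B": [0.2, 0.3, 0.7]}

print(assign([2, 2, 0], breed_freqs, threshold=3.0))  # clear-cut genotype
print(assign([1, 1, 1], breed_freqs, threshold=0.0))  # assigned, weakly
print(assign([1, 1, 1], breed_freqs, threshold=1.0))  # left unassigned
```

In this toy case the heterozygous genotype is assigned at LLR > 0 but left unassigned at LLR > 1, which mirrors the trade-off described above: a stricter threshold lowers the assignment rate but also the risk of false assignment, so more markers are needed to reach the same success rate.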
It is difficult to compare the results obtained here to other studies conducted on individual assignment analysis in cattle breeds. First, most previous studies used microsatellite markers and, second, these studies had only a limited number of markers (e.g., [5, 8]). These studies also primarily focused on the practicality of assigning individuals among cattle breeds with the available markers and were not concerned with how many markers would be required to achieve confident assignment of individual genotypes. In a study of French cattle breeds, Maudet et al. found that, using 23 microsatellite loci, > 93% of individuals could be assigned to their breed of origin. A more recent study used SNP markers but did not have a large dataset at its disposal and could, again, only address the practicality of individual assignment with the limited set of available markers. Using 90 SNP markers genotyped in 24 European cattle breeds, its authors were able to correctly assign 85% of individuals to their breed of origin. McKay et al. used STRUCTURE to assess the number of loci required to estimate the number of ancestral populations in 6 Bos taurus breeds. The use of 150 randomly chosen loci (from a dataset of 2,641 loci) yielded the correct number of clusters in only 40% of cases, consistent with the reduced assignment power for randomly selected markers found in the current study (Figure 2). The lower assignment power in those studies was most probably a direct consequence of using an insufficient number of informative loci. The comparatively high assignment power of fewer SNP markers in the current study was probably due to the availability of > 40,000 SNP markers and the benefit of selecting markers that contain the most genetic information with respect to the reference populations. Only a few highly polymorphic microsatellite loci are required in individual assignment studies. However, dense SNP panels are now available for many species and SNP markers possess numerous advantages, including cost, throughput and reliability, making them a favourable choice over microsatellites.
Assignment success: individual breeds
It is evident that certain breeds in this study require far fewer markers to achieve > 95% assignment success than others, regardless of the selection method used (Table 4, Figure 3). For example, the Jersey, Brown Swiss, Guernsey and Piedmontese breeds achieved 100% assignment success, even at stricter stringency thresholds using 50 SNP markers (pairwise Wright's FST, LLR > 2, Table 4). In contrast, the French breeds like the Charolais, Limousin and Simmental achieved ~ 90% assignment success at LLR > 0, which fell to < 50% with increasing stringency threshold using 50 SNP markers (Table 4). Similarly, the breeds that exhibited a lower power of assignment success (Table 4) also had higher type I and II error rates [Additional file 1: Supplemental Table S1].
A problem associated with the use of SNP markers in population genetics is ascertainment bias, which could influence population genetic estimates and may contribute to differences in assignment performance for individual breeds. Heterogeneity amongst sample representatives can introduce ascertainment bias and breeds not included in the SNP discovery process could have lower minor allele frequencies (MAF) [15, 36]. The average MAF was lowest in the Brown Swiss, Guernsey and Jersey breeds (Table 5), one of which was represented in the SNP discovery process. Moreover, the three breeds that were central to the process (Angus, Hereford, Holstein) did not have the highest average MAF values. In addition, no one particular SNP discovery method was over-represented in the top identified SNP markers [Additional file 2: Supplemental Table S2], as the discovery method proportions were similar to those represented on the BovineSNP50 assay. SNP ascertainment bias would have been more pronounced if B. t. indicus breeds had been included in this study. Morin et al. concluded that ascertainment bias may be an issue in the assessment of population size and demographic changes. It is least important for individual identification and assignment tests, where the intentional selection of informative markers provides greater power than do randomly chosen markers.
A factor that could affect the power of assignment success and variation in power of assignment between breeds is the level of pairwise genetic differentiation amongst the breeds. It is known that the number of markers required to obtain a high accuracy of assignment is influenced by the level of population genetic differentiation [8, 37]. That is, it depends closely on the populations under consideration and respective levels of genetic heterogeneity. As demonstrated in Figure 3, the level of genetic differentiation of a breed, measured by FST, is correlated with power of assignment success. Low breed genetic differentiation was observed in Charolais and Simmental, which similarly showed higher rates of Type I and II errors (Figure 3, [Additional file 1: Supplemental Table S1]). False positive assignments also occurred between breeds of known recent ancestry, for example, Angus and Red Angus, and Finnish Ayrshire and Norwegian Red. In addition, cases of mistaken assignment occurred between Charolais, Simmental, Limousin and Shorthorn, where the pairwise FST values amongst these breeds were < 0.1. In a study on individual assignment using microsatellites, Ciampolini et al. reported that of the four breeds under consideration, Charolais and Limousin had the lowest level of pairwise genetic differentiation and were the most difficult to discriminate between (FST = 0.041). As assignment success is a function of both the number of markers and population genetic differentiation, the level of breed genetic differentiation is indicative of the potential number of SNP markers necessary to attain high levels of power in individual assignment tests [6, 37].
Informative marker panels in population genetics
Evaluation of the selection methods revealed that only a small proportion of the markers from the BovineSNP50 BeadChip were highly informative for discriminating among 17 breeds, and the majority contained medium to low levels of genetic information (Figure 1). This is consistent with the development of the assay in which SNPs with high MAF across B. t. taurus breeds were preferentially selected in the assay design. Consequently, sets of randomly chosen SNP markers contained sufficient genetic information to produce moderate levels of individual assignment power (Figure 2). In contrast, a substantially reduced set of highly informative SNP markers was capable of precisely discriminating amongst the European cattle breeds (Figure 2).
Studies have shown that a reduced set of selected informative markers can effectively capture the genetic structure of human populations [23, 24]. For instance, Lao et al. found that 10 SNP markers from a 10K SNP array contained enough genetic information to differentiate individuals from Africa, Europe, Asia and America, and that additional loci contributed very little extra information. Indeed, it is generally considered that uninformative markers (i.e., monomorphic loci) may add noise to the results and compromise the power of population genetic studies [38, 39]. It could be useful to create a minimum panel of maximum power, particularly when using Bayesian genotypic clustering software such as STRUCTURE to elucidate population structure, because these approaches are computationally demanding (a demand that intensifies as the number of markers increases). Consequently, it is practical and cost-effective to apply a selection method to dense assays to isolate the highly diagnostic markers and increase the power of analysis.
The number of markers required for population assignment will depend on the species, the populations under consideration, their respective level of genetic differentiation and the desired stringency of assignment. For instance, within dogs 27% of the genetic variation is found between breeds, whereas for humans the level between populations is only 5%-10%. As a result, the number of SNP markers required for individual assignment and discrimination amongst populations (breeds) will differ between species under consideration.