Skip to main content

Assessing the power of principal components and wright’s fixation index analyzes applied to reveal the genome-wide genetic differences between herds of Holstein cows



Due to the advent of SNP array technology, a genome-wide analysis of genetic differences between populations and breeds has become possible at a previously unattainable level. The Wright’s fixation index (Fst) and the principal component analysis (PCA) are widely used methods in animal genetics studies. In paper we compared the power of these methods, their complementing each other and which of them is the most powerful.


Comparative analysis of the power Principal Components Analysis (PCA) and Fst were carried out to reveal genetic differences between herds of Holsteinized cows. Totally, 803 BovineSNP50 genotypes of cows from 13 herds were used in current study. Obtained Fst values were in the range of 0.002–0.012 (mean 0.0049) while for rare SNPs with MAF 0.0001–0.005 they were even smaller in the range of 0.001–0.01 (mean 0.0027). Genetic relatedness of the cows in the herds was the cause of such small Fst values. The contribution of rare alleles with MAF 0.0001–0.01 to the Fst values was much less than common alleles and this effect depends on linkage disequilibrium (LD). Despite of substantial change in the MAF spectrum and the number of SNPs we observed small effect size of LD - based pruning on Fst data. PCA analysis confirmed the mutual admixture and small genetic difference between herds. Moreover, PCA analysis of the herds based on the visualization the results of a single eigenvector cannot be used to significantly differentiate herds. Only summed eigenvectors should be used to realize full power of PCA to differentiate small between herds genetic difference. Finally, we presented evidences that the significance of Fst data far exceeds the significance of PCA data when these methods are used to reveal genetic differences between herds.


LD - based pruning had a small effect on findings of Fst and PCA analyzes. Therefore, for weakly structured populations the LD - based pruning is not effective. In addition, our results show that the significance of genetic differences between herds obtained by Fst analysis exceeds the values of PCA. Proposed, to differentiate herds or low structured populations we recommend primarily using the Fst approach and only then PCA.


Farmed animals should have large genetic variation in exterior, production and fitness traits. Genetic variation is the basis for survival and maintaining of cattle populations. On genome level variation appears as considerable allelic diversity and heterozygosity. Genomic data help to track herds’ genetic divergence at molecular level. Knowledge of genetic diversity is also important for small breed conservation and crossbreeding strategies [1, 2]. Contemporary technologies allow to use massive SNPs data for these goals.

Several tools could be used to estimate genetic diversity in populations. The most effective and commonly used are principal component (PCA) and Wright’s fixation index (Fst) analysis. Both methods are widely used to estimate genetic difference between populations. Genomic PCA finds the eigenvectors of the covariance matrix derived from the genotypes of animals. These eigenvectors provide the efficient linear combination of marker data which the most effectively differentiate of various samples, without requiring apriori sample classification information. The resolution of highly structured populations with PCA depends on non-random patterns of genetic variation. To reduce the impact of this factor, one should filter the data by removing a marker from every pair of markers which are in tight LD [3,4,5], or implement a shrinkage PCA [6], and apply iterative pruning PCA [7, 8].

More than 70 years have passed since S. Wright introduced a fixation index to measure genetic difference between populations [9]. His approach proved to be very fruitful for the further development of population genetics. Over these years many Fst statistics have been proposed. Among them the most commonly used estimators are those presented by Weir & Cockerham [10] and Nei [11]. But, the first one is sensitive to sample size and the second one consistently overestimates Fst [12]. Another approach used in the Hudson’s estimator [13]. It is not sensitive to sample size ratio, not systematically overestimate Fst, and it is accurate and stable under various ascertainment schemes [12].

Generally accepted that the rare alleles play an important role in evolution. Analysis performed by Gorlov et al. [14] suggests that including rare SNPs in genotyping platforms will advance identification of causal SNPs in case-control association studies, particularly as sample sizes increase. This effect is confirmed by the genomic breeding value evaluation of dairy cattle [15] and the effect of rare alleles on estimated genomic relationships from whole genome sequence data [16]. For animals the frequency of SNPs alleles in the range from 0 to 0.5 obtained with arrays is nearly uniform while for sequence data this distribution is substantially biased to rare alleles [5]. Obtained phenomenon can lead to ascertainment bias in the evaluation between populations difference with SNPs arrays. Removing low-frequency and rare SNPs alleles (MAF < 0.02) can significantly distort results of PCA analysis [17]. Human population studies have shown inflation of ascertainment schemes on Fst values calculation [18,19,20]. In other studies were observed upward bias in Fst values [5, 21]. Clark et al. [22] on human populations demonstrated that data sets based on different ascertainment schemes gave different patterns of Fst values. Moreover, the raw array data and those with polymorphic SNPs in the wild chicken samples underestimated pairwise Fst values between breeds which had low Fst (< 0.15) in the whole genome resequencing (WGS) data, and overestimated this parameter for high WGS Fst (> 0.15) [5]. It should be borne in mind that Fst value can depend heavily on the level of variation present in a sample and the frequency of the most frequent allele [23]. Indeed, Jost [24] argued that Fst may be so affected by genetic diversity that it should not be used as a measure of population differentiation, gene flow or relatedness.

The Leningrad region is the highest average milk yield producing region in Russia, with approximately 60,000 cows of Holsteinized Black and White cattle. Dutch, Danish, and Swedish Black and White bulls and heifers were imported to Russia during the 1930s. The Black and White breed was officially registered in Russia in 1959. To improve milk traits of Black and White cattle, local farmers started to use imported from USA (since 1978) and the Netherlands (since 2002) Holstein bulls and semen. Currently, the commercial Russian Black and White cattle population can be considered as Holstein due to long-term crossing only with Holstein bulls.

In presented study, we tried to evaluate the following objective. 1. Evaluate the correspondence of MAF and linkage disequilibrium. 2. Assess the impact of outliers removal on Fst data. 3. Evaluate LD based pruning methodology on Fst values. 4. Evaluate impact the MAF of SNPs on Fst values. 5. Evaluate significance of Fst values. 6. Evaluate PCA analysis data. 7. Assess the power of Fst and PCA analyzes.


Evaluation the correspondence of MAF and linkage disequilibrium

The effect of LD - based pruning on the number of SNPs was large (see Additional file 1: Figure S1). To estimate impact of LD - based pruning on MAF of SNPs we calculated the distribution of MAF in eight bins (Fig. 1). The proportion of SNPs regarding the MAF bins in the complete and the pruned data was noticeably different. LD – based pruning completely removed monomorphic SNPs, disproportionally removed SNPs with MAF 0.2–0.4 while proportion of rare and common SNPs with MAF 0.0001–0.1 and 0.5 increased (Fig. 1). It can be suggested that in average SNPs with MAF 0.1–0.4 distributed in genome closer to each other than remaining SNPs leading to the largest LD between them.

Fig. 1

Proportion of SNPs in the complete and pruned data

Assessing the impact of outliers removal on Fst data

On the first step we evaluated the impact of the outliers on Fst values. We calculated Fst values for Pairwise set of complete data both with correction and without correction on outliers (see Additional file 1: Table S1). After outliers correction in EIGENSOFT 799 cows and 46,626 SNPs were remained.

Absence of the outliers correction leads to a bias of Fst values but only for 6 from 78 pairs of the herds. In all cases the difference between Fst values was ±0.001 with exception of 4 and 13 pair of the herds having 0.002 difference. Nearly the same Fst values was also stored for Pairwise set where was excluded SNPs with MAF < 0.01. Among these Fst values only six pairs of the herds differed by 0.001 from Fst values for complete data in Table S2 (see Additional file 1) and three of them were the same as in result of outliers correction.

Evaluation LD based pruning methodology on Fst values

Linkage disequilibrium pruning (LD < 0.1) had the same effect size on Fst values ±0.001 as the outliers had but affected more pairs of the herds 22 vs. 6 for outliers effect (Table 1 and see Additional file 1: Table 1). In point of fact the effect size on Fst was not large despite of considerable decrease in the total number of SNPs (5827 vs. 48,108) and their proportion in the SNPs bins (Fig. 1). Thus, LD - based pruning had a moderate effect on Fst values but it affected more pairs of herds.

Table 1 Estimates of Fst values for complete and pruned data

Evaluation impact of SNPs MAF on Fst values

To evaluate impact of SNPs MAF on Fst values, we divided the entire MAF interval 0.0001–0.5 into 6 bins and calculated for each of them the mean Fst value across Pairwise sets formed from complete and pruned data (Fig. 2). The rare SNPs alleles with MAF 0.0001–0.005 had the smallest mean Fst value (0.0027) across all herds than those for remaining SNPs (see Additional file 1: Table S3). It can be concluded that in average between herds differences calculated for rare alleles were less of those for common alleles. For MAF in the range of 0.1–0.5 the difference between the mean Fst values across beans for two Data sets was not significant. As a result of mutual compensation of the mean Fst values in complete and pruned data in whole MAF range, the total summed value of Fst value between them was insignificant (see Additional file 1: Table S3). Thus, these results again confirm a small effect size of LD – based pruning on Fst values only for rare SNPs not common SNPs alleles.

Fig. 2

Dependence of mean Fst on the MAF range

Evaluation significance of Fst values

To assess significance of Fst values in the Table 1 we carried out the pairwise herds permutations of the cows treating them as H null-distribution. The results of these Fst values calculations are listed in Table S4 (see Additional file 1). Then, we calculate P-values for each pair of the herds in Pairwise set using Student’s t-test (Table 2). All of them were with P – values in the range from 1.0e-06 to 3.6e-60 with mean 6.5e-18 and median 3.6e-40, thereby it is higly skewed distribution. To calculate Fst for H null-distribution we carried out only 5 permutations for each of 78 pairs of the herds as it was time consuming process and result of P – values estimates would be only slightly underestimated. In the Table 1 the minimum Fst values were 0.002–0.003. The pairs of the herds corresponding to these values are the candidates for genetically most similar herds. However, when comparing these herds in Table 1 the errors were not taken into account. The probabilities of making a type 1 error for all 78 herd combinations are given in Table 2. To evaluate the genetic differences between the herds we have chosen cut off P ≤ 1.0e-30 (P ≤ 1.28e-32 taking into account the Bonferroni correction) in which, as a rule, the differences between the herds at Fst values 0.002–0.003 should be insignificant. The results are shown in the Table 3. Insignificant pairs of herds were 2 and 8, 9, 11, 12 (4 pairs); 3 and 5, 8, 9, 10 (4 pairs); 8 and 2, 3, 9, 11 (4 pairs); 9 and 2, 3, 11, 12 (4 pairs). The pairs of herds with 2, 3, 8 and 9 herds had 4–6 Fst values 0.002–0.003 (Table 1). Therefore, the results of identifying insignificant pairs of herds (Table 3) correspond to the minimal Fst data in the Table 1. In the Table 3 most significant pairs of herds at this cut off were the herd 4 (10 pars), 7 (12 pairs), 13 (12 pairs) or a more stringent level of significance at cut off P ≤ 1.28e-39 the herds 4, 7 and 13 had 10, 8 and 11 significant pairs of the herds (Table 3).

Table 2 Estimates between herds genetic differences (P – values) a
Table 3 Between herds genetic differences for complete data revealed by Fst analysis

It was necessary to determine the most significant pairs of herds. The most significant at cut off P ≥ 1.28e-50 pairs of the herds were 2 and 5, 6; 4 and 2, 3, 5, 12; 5 and 11; 7 and 1, 2, 9; 13 and 5, 9, 12 (Table 2). These pairs of the herds correspond to the most genetically different pairs of the herds, while Fst data errors were taken into account as well. Summarizing the results of P - values calculating we can assert about a high level of significance the Fst analysis.

Evaluation PCA analysis data

The eigenvalues of 100 eigenvectors calculated from the covariance matrix of alleles from 803 cows monotonically decreased from 9.5 to zero. It proves that the structure of the covariance matrix was enough homogeneous. Overall P- values and percent of variance (in brakets) for ten eigenvectors calculated for complete and pruned data were 2.8e-15 (1.16), 0.20 (1.05), 3.9e-14 (1.02), 1.9e-08 (0.88), 9.7e-03 (0.76), 2.3e-03 (0.72), 8.2e-03 (0.71), 6.0e-09 (0.66), 4.9e-05 (0.62), 5.6e-04 (0.59) (1) and 3.3e-16 (0.84), 6.4e-06 (0.79), 2.0e-04 (0.76), 3.4eE-06 (0.70), 2.6e-05 (0.67), 3.2e-08 (0.58), 2.0e-03 (0.55), 4.0e-04 (0.54), 2.2e-07 (0.53), 3.0e-03 (0.51) (2) respectively, i.e. they were similar. However the overall P - value for the second eigenvector of pruned data has become significant (6.4e-06) and at the same time overall P - value for third eigenvector on many orders of magnitude decreased (3.9e-14 vs. 2.0e-04). Such was the effect of LD - based pruning on overall P - values. From the list of overall P – values should be clear what main significant “axes of variation” were. From the list of variances for each eigenvector (1) and (2) can be calculated the variances to be used after summing ten eigenvectors. It were 8.17% for complete data and 6.47% for pruned data. Whence, the more eigenvectors will be summed, the more value of variance will be used.

Having the small Fst values and gradual decrease of the eigenvalues we calculated the mean for every herd in the PC scales to statistical description between herds genetic differences revealed by PCA. The plot of the means for all herds along PC 1 and PC 3 is shown on Fig. 3 and along PC 1 and PC 4 is shown on Fig. 4. To assess significance of genetic difference between 13 herds based on PC 1 we listed (+) (denoting between herds significance) in Table 4 obtained from P – values in Table 2 where cut off of significance was taken at P ≤ 0.05 but given the Bonferroni correction we get P ≤ 6.4e-4. Further, for brevity, we write P ≤ 0.05 instead P ≤ 6.4e-4. For PC 1 among 78 pairs of the herds there were 14 significant pairs of the herds. Most often significant data were observed for herds 4 and 13. Some significant results obtained with Fst statistic also confirmed with PCA for eigenvectors 1. For example, the greatest pairwise Fst – values for herd 4 were confirmed by noticeably higher level of significance revealed by PCA (Table 2). Furthermore, insignificant pairs of the herds 1 and 4, 4 and 6, 4 and 13 for PC 1 correspond to smallest Fst values for pairs of the herds formed with the herd 4 (Table 1). It should be noted a negligible effect size of LD based pruning on between herds’ significance for eigenvector 1 (Table 2).

Fig. 3

Position of mean Fst values of the herds along PC 1 and Pc 3. Each point denotes the mean herd position along PC 1 and PC 3 for complete data

Fig. 4

Position of mean Fst values of the herds along PC 1 and Pc 4. Each point denotes the mean herd position along PC 1 and PC 4 for complete data

Table 4 Between herds genetic difference for complete data revealed by PC 1 and PC 3

The same procedure was carried out for PC 3 (Table 4). Among Pairwise set there were 16 significant pairs of herds. The most often significant data were obtained as well for herd 4 not herd 13. Out of fourteen significant pair of the herds revealed PC 1 only 9 coincide with sixteen significant pair of the herds revealed PC 3. Thus, PC 3 score is different from the PC 1 one. Obviously, it would be incorrect to make a conclusion about between herds significant differences if we used data for a separate eigenvector (Table 2).

Comparing the visible pattern of location the mean values of the herds along PC 1–3 and PC 1–4 we can draw some general conclusions (Figs. 3 and 4). The trajectory connecting herds 4–7–6-13-1 preserved on both figures. Other herds visually shifted relative to each other although not all of those displacements were significant at P < 0.05 as was shown along eigenvectors 1 and 3. However, the difference between these pairs of the herds was highly significant when we measured them with Fst statistics (Table 2). Thus, visual differences of the herds positions on Figs. 3 and 4 might be incorrect if we used only visual information along separate eigenvectors.

The lack of overall significance (P < 0.20) of second eigenvector for complete data and insignificance of the most pairs of the herds in Pairwise set indicates that there are not between herds genetic difference for this axis. Therefore, these data were excluded from consideration.

Furthermore, based on complete data, we tested the level of PCA data significance if P – values for Pairwise set were calculated from summed ten PC. Appropriate P – values are given in the Table 2 and significant pairs of the herds which were denoted as (+) are listed in the Table 5 at cutoff P ≤ 0.05. Among them there were 47 pairwise significant combinations of the herds for summed PC 1–10. The most significant result was obtained for herd 4 and 7 while insignificant results for herd 8, 9 and 10. Thus, giving summed genetic variance from 10 eigenvectors lead to noticeably increase the level of significance and change conclusions about data significance as was shown for PC 1 and PC 3.

Table 5 Between herds genetic difference for complete data revealed by summed PC 1–10 and PC 1–20

To verify the level of significance further we calculated P – values for Pairwise set of the herds from complete data across summed 20 eigenvectors (Table 2). It turned out that for cutoff at P ≤ 0.05, 61 from 78 pairs of the herds were significant (Table 5). The most significant pairs of herds were 1, 4, 7, 8, 11 and 12 and the most insignificant pair of the herds was formed with the herd 3. Considering the data for summed ten and twenty eigenvectors, it is important to note that significant pairs of the herds varied greatly with an increase in the number of summed eigenvectors. Thus, increasing the number of summed eigenvectors leads to overall increase of significance level.

To include complete variance available from PCA analysis we calculated P – values for 100 summed eigenvectors (Table 2). For complete data P-values distribution had mean 2.2e-07 and median 2.2e-15, thereby the distribution is highly skewed. The herd 3 had minimum P - values with other herds (Table 2) therefore based on these values we selected significant pairs of the herds at cutoff P ≤ 1.0e-10 and given the Bonferroni correction P ≤ 1.28e-12. The results are shown in Table 6. The herd 3 formed 6 insignificant pairs of herds 3 and 6, 8, 9, 10, 11, 12 and herd 8 formed 9 insignificant pairs of the herds 8 and 1, 2, 3, 5, 6, 9, 10, 11, 12. Thus, the herd 8 and 3 was the most genetically related with other herds and this result do not contradict Fst values 0.002 and 0.003 prevailing in pairwise set for these herds (Table 1). It was necessary to determine the most significant pairs of the herds. The most significant pairs of the herds at cutoff P ≤ 1.28e-20 were 4 and 2, 3, 7, 9, 12, 13 (6 pairs); 7 and 1, 2, 3, 4, 5, 6, 9, 10, 11, 12, 13 (11pairs); 13 and 2, 3, 4, 7, 9, 10, 12 (7 pairs). This result for herds 7 and 13 is due to smaller standard errors for these herds than for herd 4 (see Additional file 1: Table 5).

Table 6 Between herds genetic differences for complete data revealed by PC 1–100

For pruned data P-values distribution had mean 2.6e-06 and median 1.8e-16. Thereby, the complete and pruned data distributions are similar. For the same cutoff P ≤ 1.28e-12 as for complete data, the data in the Table 2 were ranked (Table 7). The herd 3 formed 4 insignificant pairs of the herd 3 and 6, 8, 9, 12. The herd 8 formed 9 insignificant pair of the herd 8 and 1, 2, 3, 4, 5, 6, 9, 11, 12. Among 15 pair of herds 3 and 8 for complete data only 11 of those pairs coincide with pruned data. The most significant pair of the herds with cutoff P ≤ 1.28e-20 were 4 and 2, 3, 7 (3 pairs); 5 and 1, 2, 7, 11 (4 pairs); 7 and 1, 2, 4, 5, 6, 9, 10, 11, 12, 13 (10 pairs); 13 and 2, 7, 9, 10, 12 (5 pairs). Thus, P – values for complete and pruned data match good enough (except of the herds 4 and 5).

Table 7 Between herds genetic differences for pruned data revealed by summed PC 1–100

Assessing the power of Fst and PCA analyzes

In the Table 2 listed P – values for Pairwise set of the herds calculated with PCA and Fst analyzes. According to these data for summed 100 eigenvectors, P – values were the smallest of those for any other eigenvector or summed 10 and 20 eigenvectors. This result was due to use the complete variance from initial data. Further, comparing P – values of PCA and Fst analyzes draw a conclusion that Fst P – values were many orders of magnitude less those of summed 100 eigenvectors. Across Pairwise set the PCA calculated power was within the range of 0.8–1.0, while for Fst it was within the range of 0.9–1 that is the probabilities of a type II error are similar. In total, considering by several orders of magnitude smaller P – values for Fst, we can conclude that probability type I error for the Fst analysis was far less the PCA one. Therefore, it should be accepted that the data from the Fst analysis are more reliable.


Verification of the genetic diversity across herds is useful in a variety of biological context, especially in breeding, selection and conservation of breeds as well as crossbreeding strategies. In fact, the maintenance of genetic variation in the breeds is extremely important. The problem is to know whether such differentiation reflects meaningful differences. Genome – wide data allow to carry out the population analysis at unprecedented earlier level. We can expect herein to be able to resolve some diversity across herds’ genetic data. To solve this problem we apply two tools. First one was Wright’s Fst statistics [9]. Second one was recently proposed PCA tool as an alternative approach to determine within and between populations diversity [3].

Comparing the power of Fst and PCA analyzes

Natural models of population structure predict that the most of the eigenvalues of theoretical covariance will be «small», nearly equal, and arise from sampling noise, while just a few eigenvalues will be «large», reflecting past demographic events [3]. It is not relevant for commercial dairy herds. Monotonic decrease of the eigenvalues was observed. This indicates a relatively homogeneous genetic structure of the herds due to a big enough gene flow between herds as result of artificial selection. For example the proportions of the cows born from the same bulls in 78 pair of the herds was up to 32% [25].

In fundamental analysis of genetic diversity with PCA, Patterson et al. [3] discovered a threshold (as measured, for example, by Fst) below which the population structure was essentially undetectable. They proposed that for two equal size subpopulations, there is a threshold value of Fst calculated as 1/ √nm (where n is the number of animals and m is the number of SNPs), below which there will be essentially no evidence of the populations structure. In our study this threshold is within 0.0005–0.0006 for complete data which is considerably lower obtained minimal Fst value 0.002 and threshold 0.0015–0.0018 for pruned data which is comparable with minimal Fst value 0.002 (Table 1). Findings show that proximity to the threshold of PCA analysis for pruned data did not affect samples testing (Table 1).

Therefore, we predict that the power of our PCA analysis would be sufficient to reveal detectable across herds genetic differences. But, PCA calculates correctly if the markers are independent (between them have not sizable LD) [3, 17]. Several approaches have been proposed to achieve this goal, namely, shrinkage PCA [6], iterative pruning PCA [7, 8] and LD - based pruning [3]. We used LD - based pruning. For our local Holsteinized herds LD - based pruning has no affect on Fst and PCA data analysis. The same result for PCA was obtained when studying the genetic diversity of Spanish beef cattle at much greater Fst values (0.026–0.068) [4]. However, for human populations LD - based pruning has a sizable effect (e. g [3, 17].). Perhaps the PCA would be sensitive to LD - based pruning when populations or herds have pronounced genetic structure. It should be noted that after LD - based pruning the second eigenvector overall P - value became significant 6.4E-06 (see (2)). Consequently, insignificance of the second eigenvector may be possible result of LD between SNPs. Thus, the effect of LD on estimates between herds’ differences is moderate but not as great as for human population [3]. We propose this is a consequence of genetic relatedness of the cows in the herds. Really, the cows from 13 herds had considerable genetic relatedness owing to use high proportion up to 32% the same sires in the herds [25].

To clarify effect of LD on between herds differences, consider the results of Fst analysis. The pattern of MAF before and after LD - based pruning changes considerably (Fig. 1). Despite of this effect on MAF as pairwise Fst values (Table 1) and mean Fst values in MAF bins (see Additional file 1: Table S3) have not changed considarably. Meanwhile, the rare alleles have the smaller mean Fst values than common alleles, particularly for MAF 0.0001–0.005 (see Additional file 1: Table S3) and Fst values gradually increase up to MAF 0.1. Thus, the rare SNPs alleles are less differentiated between the herds than common alleles but they did not have substantial effect on Fst values. The less differentiation rare alleles may be suggested as a by-effect of artificial selection on the highest breeding values.

It was proposed if population has gone through bottleneck then Fst values could be greater for rare alleles as compared with common alleles [3]. We have observed the opposite effect, a decrease of Fst values for rare alleles. This means the bottleneck event in the breeding history of the herds if it had occurred it might have revealed by PCA.

The PCA plots of the herds means along first, third and fourth PC for the cows from complete data are shown on Figs. 3 and 4. The position of forth herd is outstanding on these images. The mean pairwise Fst value with other 12 herds for herd 4 is also the greatest 0.0087 compared with those of other 12 herds (0.0038–0.0063) [25]. We assume that this result is caused by the heavy use of bulls from the Netherlands between 2000 and 2007 years in herd 4, while bulls imported mainly from the USA and Canada were more recently used in the other 12 herds. Therefore, all pairwise combinations of the forth herd with other twelve herds are highly significant for first and third (except of pair the herds 1 and 4, 6 and 4), summed PC 1–10 and PC 1–20, summed 10–100 eigenvectors (except 4 and 6 pair of the herds) (Table 2).

Unlike the position of the herd 4, the position of third and eighth herds was nearly in the middle of the cluster which located herds 3, 8 10, 11, 12 on Fig. 3. These herds have minimal mean pairwise Fst – values with other 12 herds 0.0038 and 0.0039 [25]. The herd 3 forms many insignificant at P ≤ 0.05 pairwise herds combinations with other herds revealed by PCA 1–20 (Table 5) and at P ≤ 1.26e-12 by PC 1–100 (Table 6). Such properties of third herd are the result of genetic relatedness of the cows from these herd with the cows of other 12 herds mostly due to a large percentage of cows (up to 32%) born from the same bulls used in other 12 herds [25]. The same is true for herd 8 revealed by PC 1–100 (Table 6). The herd 13 is in the middle of both images on Figs. 3 and 4. Therefore, the herd 13 forms many highly significant pairs of herds revealed by PC 1–100 and Fst (Tables 2, 6 and 7).

Thus, mutual position of the herds and their pairwise significance depends on the eigenvectors since they are orthogonal and in each of them used only a part of genetic variance. It cannot be used the certain eigenvector to evaluate genetic differences for low genetically different pairs of the herds. Only summed eigenvectors are able to accurately assess these differences not contradictory to Fst approach. This conclusion is fully confirmed by the results obtained from summed 100 eigenvectors (Table 2).

Examples between cattle breeds genetic differences

For the large – scale SNP data the PCA and Fst are widely used to summarize the structure of genetic variation in the populations. Consider some findings available from publications studying a moderate between populations difference. Analysis of Russian cattle breeds demonstrate a very low differentiation of Black and White breed from Holstein - Friesian breed (Fst = 0.02) [26]. The authors did not use PCA. In another research Fst value for Black and White and Holstein breeds was 0.035 and Black and White breed formed a cluster with the breeds from Northern Europe on multi dimentional scaling (MDS) images [27]. PCA analysis applied to a distance matrix based on identity by state (IBS) showed a grouping of Spanish beef cattle breeds [4]. The degree of genetic differentiation was small to moderate as the pairwise fixation index of genetic differentiation among breeds estimates ranged from 0.026 to 0.068. Obtained results indicate large within-breed diversity and a low degree of divergence among the autochthonous Spanish beef cattle breeds studied. Among 47 worldwide breeds the USA and French Holstein have Fst value 0.004 and they are indistinguishable across PC 1 – PC 2 [28]. Authors concluded that PCA may fail to detect spatial structuring if this is not associated with the most pronounced genetic differentiation. Some degree of differentiation was shown with PCA between the USA and New Zealand Jersey bulls and cows [29]. The mean (max) Fst across the genome for AU versus US cows was 0.008 (0.12) and the average (max) for US versus AU, US versus NZ, and AU versus NZ was 0.006 (0.08), 0.029 (0.21) and 0.009 (0.07), respectively. Authors suggest that although some differentiation based on Fst was seen, especially for US versus NZ cows, the other populations appear to be similar. Noteworthy, differentiation between Australian and the USA Jersey cow populations was marginal in comparison with populations of the bulls. On PC 1 – Pc 2 image it was impossible to differentiate them geographically. Taken together the PCA and Fst results show that two artificially unselected breeds were not well differentiated and still cover a considerable part of the original genetic diversity [30]. On the contrary, artificially selected breeds show significantly highest differentiation. The highest overlap of genetic variation was found between Anatolian Black and Illyrian Mountain Buša (Fst = 0.037). This breeds were very close to each other in the PC 1 - PC 2 and PC 1 - PC 3 images and statistical prove on genetic differences between them are not given. Most of the remaining breeds also had their smallest Fst value (Fst = 0.037–0.096) when compared to Illyrian Mountain Buša. In indigenous six cattle populations of Ethiopia and Korea, PCA evidently distinguishes Ethiopian cattle populations from Hanwoo breed [31]. The most similar populations are Ambo – Arsi, Horro wih Fst 0.002 (P < 0.01) and they are very close to each other on PC 1 – PC 2 image but statistical data are not shown. Ancestry analysis of the new world cattle evidences that the first axis of PC was associated with the indicine – taurine split and the second PC axis was associated with the divergence between European and African taurine cattle [32]. The authors calculated the overall P-values based on TracyWidom test and shown that 154 axes in the 50 k dataset were statistically significant. In another research PCA and Fst showed minimal structure within the Guernsey breed, with no complete segregation of animals reflecting geographic origin (Fst = 0.006–0.011) and PCA with no distinct visual animal separation [33].

It is important to note that as a rule in above mention publications the eigenvalues (variance) decrease faster than in our study. Apparently this is the result a considerably low between herds genetic difference comparing to the breeds. We should keep in mind that visual evaluations of the genetic distance between herds on PCA images may be incorrect without of statistical prove. As a rule, any statistical treatment of PCA between populations’ results was not given and such images are only illustrative.

What conclusions from obtained results should be done regarding the power of Fst and PCA analyzes? Wright’s Fst is based on maximization of allele frequencies differences between populations through variance component. Used by us Hudson’s estimator provides the genetic difference between populations compared to the genetic difference within populations through variance component as well. Applied in our research PCA relies on covariance matrix of SNPs alleles among animals [3] and it able to find between herds genetic variation. That is Fst and PCA based on similar mathematical approaches and additional simulation analysis is needed to determine why Fst analysis gives more significant data. Summarizing the results of Fst and PCA analyzes should be noted that the power of both analyzes similar but probability of making a type I error is much less for Fst approach. It can be concluded at that point the Fst analysis is superior to PCA.


Firstly, despite of genetic relatedness of the cows in the herds, Fst and PCA analyzes are able to differentiate between herds genetic differences. But, PCA applied to the herds might only be efficient when summed results of several eigenvectors will be used. Secondly, despite of considerable change in the number of SNPs and their MAF spectrum due to LD - based pruning, it has a small effect on the results of Fst and PCA analyzes. We suggest that this is a consequence of homogeneous genetic structure of the herds. Our findings show that Fst method give the more significant data than PCA but PCA approach might be useful due to visualization of some genetics features of the herds.


Animal resources and SNP genotyping

Data and genotypes were obtained from Committee on Agro-Industrial Complex of Leningrad region. Cows genotypes were available from 13 breeding herds locating in Leningrad region (Russia) born in range from 2010 to 2013. Animals for genotyping were selected by farmers with respect to pedigree structure of the herd. Number of animals genotyped depends on number of sires used in herd at least one daughter was presented by sire. In case of multiple daughters were presented per single sire, sire of dams were different. Sampled animals are presented 8–15% of total number of milking cows (see Additional file 1: Table S5). Altogether, 500 cows were genotyped with BovineSNP50 v.2.0 array (Illumina Ca. USA) and 300 cows with BovineSNPIDBv3 array (Illumina Ca. USA). In the first quality control step, SNPs with quality score less than 0.7 were removed. Then, imputation of the BovineSNPIDBv3 chip data (40 K SNPs) to BovineSNP50 v.2.0 chip data (50 K SNPs) was carried out with the Beagle software [34]. Genotyping quality control (QC) was done with PLINK 1.9 [35]. Only autosomal chromosome were considered. Three Data sets (Ds) were done by stepwise adding of QC criteria. In complete Ds missing rate per SNP was no more than 5% and probability of deviation from Hardy-Weinberg equilibrium (HWE) was less than 1.0E-03. It includes 804 cows which were genotyped with 48,108 SNPs (Total genotyping rate was > 0.99). In other Ds SNPs with MAF < 0.01 were removed resulted in 43,298 SNPs [25]. To pruned data, LD (0.1) - based SNPs pruning with –indep command in PLINK was applied to obtain pruned data, including only 5827 SNPs. Further, for each sample 78 pairwise comparisons between 13 herds (hereafter called Pairwise set) were formed.

Fst analysis

The Fst values were estimated with EIGENSOFT 6.0.1 software [3], where Hudson’s estimator was implemented. The standard errors of Hudson’s Fst estimates were calculated using a block-jackknife approach implemented in EIGENSOFT. To estimate significance of Fst values the permutations of the cows between each pair of the herds was carried out to get the distribution under H0 hypothesis. We used PLINK commands –make –pheno and –fiter –mfilter 5 to perform 5 pairwise permutations. Then within each of permuted pair of the herds, the cows were allocated into two groups with the same size as the original pairs of the herds had and they were coded as case and control. Further, we calculated Fst value for each of 5 permuted pairs of the herds. Finally the mean Fst value and standard error for each 5 permuted pair of the herds was calculated. Altogether, 78 mean Fst values under H0 distribution were calculated and then 78 P – values were estimated using one-sided Student’s t-test. Power of t-test was calculated with «powerAnalysis» program in R software environment [36]. The standard error of mean Fst for MAF in the range 0.0001–0.5 was calculated as MSE = \( \frac{1}{m}\sqrt{\sum_1^m{SE}_i^2} \) where m is the number of evaluated MAF bins equal to 6. When calculating the MSE for Fst value in each bin, m was equal to 78 pairs of the herds.

Principal components analysis

For calculation of PCA the EIGENSOFT 6.0.1 software was used [3]. The outliers removal was carried out during each iteration. We used 6 standard deviations which an animal must exceed along one of the top principal components in order to be removed as an outlier. ANOVA P-values between each pair of the herds along principal components were calculated. For each pair of the herds, the above mentioned ANOVA statistics are summed across 10, 20 up to 100 eigenvectors. The distributions were approximately chisq with d.f. equal to the number of eigenvectors minus one. Likewise P-values were calculated for each 78 pairs of herds including summed PC 1–10, PC 1–20 and PC 1–100. For each of leading component PC 1, PC 3 and PC 4 the mean herd value was calculated. Then, they were plotted with R software environment [36]. Power of PCA analysis for summed PC 1–100 eigenvectors was calculated with «powerAnalysis» software in R software environment [36]. When comparative evaluating P - values in Table 2 Bonferroni corrections by formula: P = α/m was used, where α is the desired overall alpha level and m is the number of hypotheses.

Availability of data and materials

The cows genotypes used during the current study are available from the corresponding author upon a reasonable request.


Fst :

Wright’s fixation index


Principal Components Analysis


Linkage Disequilibrium


  1. 1.

    de Cara MA, Villanueva B, Toro MA, Fernandez J. Using genomic tools to maintain diversity and fitness in conservation programmes. Mol Ecol. 2013;22:6091–9.

    Article  PubMed  Google Scholar 

  2. 2.

    Engelsma KA, Veerkamp RF, Calus MP, Windig JJ. Consequences for diversity when animals are prioritized for conservation of the whole genome or of one specific allele. J. Anim. Breed. Genet. 2014;131(1):61–70.

    CAS  Article  PubMed  Google Scholar 

  3. 3.

    Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genetics. 2006;2:e190.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  4. 4.

    Canas-Alvarez JJ, Gonzalez-Rodriguez A, Munilla S, Varona L, Diaz C, Baro JA, et al. Genetic diversity and divergence among Spanish beef breeds assessed by a bovine high-density SNP chip. J. Anim. Sci. 2015;93:5164–74.

    CAS  Article  PubMed  Google Scholar 

  5. 5.

    Malomane DK, Reimer C, Weigend S, Weigend A, Sharifi AR, Simianer H. Efficiency of different strategies to mitigate ascertainment bias when using SNP panels in diversity studies. BMC Genomics. 2018;19:22.

    Article  PubMed  PubMed Central  Google Scholar 

  6. 6.

    Zou F, Lee S, Knowles MR, Wright FA. Quantification of population structure using correlated SNPs by shrinkage principal components. Hum. Heredity. 2010;70:9–22.

    CAS  Article  PubMed  Google Scholar 

  7. 7.

    Intarapanich A, Shaw PJ, Assawamaakin A, Wangkumhang P, Ngamphiw C, Chaichoompu K, et al. Iterative pruning PCA improves resolution of highly structured populations. BMC Bioinformatics. 2009;10:382.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  8. 8.

    Limpiti T, Intarapanich A, Assawamakin A, Philip JS, Wangkumhang P, Piriyapongsa J, Ngamphiw C, Tongsima S. Study of large and highly stratified population dataset by combining iterative pruning principal component analysis and structure. BMC Bioinformatics. 2011;12:255.

    Article  PubMed  PubMed Central  Google Scholar 

  9. 9.

    Wright S. The genetical structure of populations. Ann Eugenics. 1949;15:323–54.

    Article  Google Scholar 

  10. 10.

    Weir BS, Cockerham CC. Estimating F-statistics for the analysis of population structure. Evolution. 1984;38:1358–70.

    CAS  Article  PubMed  Google Scholar 

  11. 11.

    Nei M. Defenition and estimation of fixation indices. Evolution. 1986;40:643–5.

    Article  PubMed  Google Scholar 

  12. 12.

    Bhatia G, Patterson N, Sankararaman S, Price AL. Estimating and interpreting Fst: The impact of rare variants. Genome Res. 2013;23:1514–21.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  13. 13.

    Hudson RR, Slatkin M, Maddison WP. Estimation of level of gene flow from DNA sequence data. Genetics. 1992;132:583–9.

    CAS  PubMed  PubMed Central  Google Scholar 

  14. 14.

    Gorlov IP, Gorlova OY, Sunyaev SR, Spitz MR, Amos CI. Shifting paradigm of association studies: value of rare single nucleotide polymorphism. Am. J. Hum. Genet. 2008;82:100–12.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  15. 15.

    Zhang Q, Calus MPL, Guldbrandtsen B, Lund MS, Sahana G. Contribution of rare and low frequency whole genome sequence variants to complex traits variation in dairy cattle. Genet Sel Evol. 2017;49:60.

    Article  PubMed  PubMed Central  Google Scholar 

  16. 16.

    Eynard SE, Windig JJ, Leroy G, van Binsbergen R, Calus MPL. The effect of rare alleles on estimated genomic relationships from whole genome sequence data. BMC Genetics. 2015;12(16):24.

    Article  Google Scholar 

  17. 17.

    Galinsky KJ, Bhatia G, Loh PR, Georgiev S, Mukherjee S, Patterson NJ, Price AL. Fast principal-component analysis reveals convergent evolution of ADH1B in Europe and East Asia. Am. J Hum Genet. 2016;98(3):456–72.

    CAS  Article  Google Scholar 

  18. 18.

    Mathieson I, McVean G. Differential confounding of rare and common variants in spatially structured populations. Nat Genetics. 2012;44(3):243–6.

    CAS  Article  PubMed  Google Scholar 

  19. 19.

    Lachance J, Tishkof SA. SNP ascertainment bias in population genetic analyses: why it is important, and how to correct it. Bioessays. 2013;35(9):780–6.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  20. 20.

    McTavish EJ, Hilliis DM. How do SNP ascertainment schemes and population demographics affect inferences about population history? BMC Genomics. 2015;16:266.

    Article  PubMed  PubMed Central  Google Scholar 

  21. 21.

    Albrechtsen A, Nielsen FC, Nielsen R. Ascertainment biases in SNP chips affect measures of population divergence. Mol. Biol. Evol. 2010;27(11):2534–47.

    CAS  Article  Google Scholar 

  22. 22.

    Clark SA, Kinghorn BP, Hickey JM, Van der Werf JHJ. The effect of genomic information on optimal contribution selection in livestock breeding programs. Gen. Sel. Evol. 2013;45(1):44.

    Article  Google Scholar 

  23. 23.

    Jakobsson M, Edge MD, Rosenberg NA. The relationship between FST and the frequency of the most frequent allele. Genetics. 2013;193:515–28.

    Article  PubMed  PubMed Central  Google Scholar 

  24. 24.

    Jost L. GST and its relatives do not measure differentiation. Mol. Ecol. 2008;17:4015–26.

    Article  PubMed  Google Scholar 

  25. 25.

    Smaragdov MG, Kudinov AA, Uimari P. Assessing the genetic differentiation of Holstein cattle herds in the Leningrad region using Fst statistics. Agri. Food Sci. 2018;27:96–101.

    Article  Google Scholar 

  26. 26.

    Yurchenko A, Yudin N, Aitnazarov R, Plyusnina A, Brukhin V, Soloshenko V, et al. Genome-wide genotyping uncovers genetic profiles and history of the Russian cattle breeds. Heredity. 2018;120(2):125–37.

    CAS  Article  PubMed  Google Scholar 

  27. 27.

    Sermyagin A, Dotsev A, Gladyr EA, Trsaspov AA, Deniskova TE, et al. Whole-genome SNP analysis elucidates the genetic structure of Russian cattle and its relationship with Eurasian taurine breeds. Genet. Sel. Evol. 2018;50:37.

    Article  PubMed  PubMed Central  Google Scholar 

  28. 28.

    Gautier M, Laloe D, Moazami-Goudarzi K. Insight into the genetic history of French cattle from dense SNP data on 47 worldwide breeds. PLoS One. 2010;5:e13038.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  29. 29.

    Howard JT, Maltecca C, Haile-Mariam M, Hayes BJ, Pryce JE. Characterizing homozygosity across United States, New Zealand and Australian Jersey cow and bull populations. BMC Genomics. 2015;16:187.

    Article  PubMed  PubMed Central  Google Scholar 

  30. 30.

    Rothammer S, Seichter D, Forster M, Medugorac IA. A genome-wide scan for signatures of differential artificial selection in ten cattle breeds. BMC Genomics. 2013;14:908.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  31. 31.

    Edea Z, Dadi H, Kim SW, Dessie T, Lee T, Kim H, et al. Genetic diversity, population structure and relationships in indigenous cattle populations of Ethiopia and Korean Hanwoo breeds using SNP markers. Front Genet. 2013;4(Article 35):1–9.

    Article  Google Scholar 

  32. 32.

    McTavish EJ, Decker JE, Schnabel RD, Taylor JF, Hillis DM. New world cattle show ancestry from multiple independent domestication events. Proc Natl Acad Sci U S A. 2013;110:E1398–406.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  33. 33.

    Cooper TA, Eaglen SAE, Wiggans GR, Jenko J, Huson HJ, Morrice DR, et al. Genomic evaluation, breed identification, and population structure of Guernsey cattle in North America, Great Britain, and the Isle of Guernsey. J. Dairy Sci. 2016;99:5508–15.

    CAS  Article  PubMed  Google Scholar 

  34. 34.

    Browning BL, Browning SR. Genotype imputation with millions of reference samples. Am J Hum Genet. 2016;98:116–26.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  35. 35.

    Pursell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set for whole genome association and population based linkage analyses. Am J Hum Genet. 2007;81:559–75.

    CAS  Article  Google Scholar 

  36. 36.

    R Development Core Team R: a language and enviroment for statistical computing. R foundation for statistical computing. Viena. Accessed 1 Dec 2018.

Download references


We gratefully acknowledge the editor and anonymous reviwers. Their expertise contributed to improve the quality of the manuscript.


This research was financially supported by Russian Ministry of Science and Education № AAAA-A18–118021590138-1.

Author information




SMG performed design of the study, data analysis, carried out calculation of Fst and PCA, prepared the manuscript. KAA selected the samples of the cows in the herds and the samples of biological material, extract DNA, editing of the cows genotypes and participated in some calculations. All authors read and approved the final manuscript.

Corresponding author

Correspondence to M. G. Smaragdov.

Ethics declarations

Ethics approval and consent to participate

The reported study was performed in accordance with the ethical guidelines of the L.K. Ernst Federal Science Center for Animal Husbandry. The protocol was approved by the Commission on the Ethics of Animal Experiments of the L.K. Ernst Federal Science Center for Animal Husbandry. The animal tissue samples were collected by trained personnel under strict veterinary rules in accordance with the Rules for conducting of laboratory research (tests) in the implementation of the veterinary control (supervision) approved by Council Decision Eurasian Economic Commission № 80 (November 10, 2017).

Consent for publication

Not applicable.

Competing interests

The authors declare they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1: Table S1.

Effect of outliers on estimates of Fst values for complete data. a - Fst values for complete data corrected on the outliers are above the diagonal and Fst values for complete data does not corrected on the outliers are below the diagonal. b - Increased Fst values are in bold and decreased Fst values are in bold Italic. Table S2. Effect of rare alleles with MAF < 0.01 on estimates of Fst values. a - Fst values for complete data after removal of the alleles with MAF < 0.01 are below the diagonal and Fst values for complete data does not corrected on MAF < 0.01 are above the diagonal. b - increased Fst values are in bold and decreased Fst values are in bold Italic. Table S3. Mean Fst values across Pairwise set of the complete data in MAF bins. * - In each MAF bin 78 Fst values was averaged. Statistical estimates were obtained with t-test. ** - MSE calculation see at materials and methods. Table S4. Estimates of Fst values calculated for H0 distribution. Fst values should be multiplied by 10− 4. Table S5. Standard errors of the Fst – values computed by EIGENSOFT 6.0.1. Standard errors of Fst obtained from complete data are above diagonal and from pruned data are below diagonal. SE values should be multiplied by 10− 4. Table S6. Description of the herds and number of the genotyped cows. * - Country of origin of the sires of the genotyped cows, NL – the Netherlands. Figure S1. Effect of LD - based pruning on the number of SNP in the complete data.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Smaragdov, M.G., Kudinov, A.A. Assessing the power of principal components and wright’s fixation index analyzes applied to reveal the genome-wide genetic differences between herds of Holstein cows. BMC Genet 21, 47 (2020).

Download citation


  • Principal components
  • Fixation index
  • Minor allele frequency
  • Dairy cattle
  • Genetic diversity