A simple method for estimating genetic diversity in large populations from finite sample sizes
© Bashalkhanov et al. 2009
Received: 6 April 2009
Accepted: 16 December 2009
Published: 16 December 2009
Skip to main content
© Bashalkhanov et al. 2009
Received: 6 April 2009
Accepted: 16 December 2009
Published: 16 December 2009
Sample size is one of the critical factors affecting the accuracy of the estimation of population genetic diversity parameters. Small sample sizes often lead to significant errors in determining the allelic richness, which is one of the most important and commonly used estimators of genetic diversity in populations. Correct estimation of allelic richness in natural populations is challenging since they often do not conform to model assumptions. Here, we introduce a simple and robust approach to estimate the genetic diversity in large natural populations based on the empirical data for finite sample sizes.
We developed a non-linear regression model to infer genetic diversity estimates in large natural populations from finite sample sizes. The allelic richness values predicted by our model were in good agreement with those observed in the simulated data sets and the true allelic richness observed in the source populations. The model has been validated using simulated population genetic data sets with different evolutionary scenarios implied in the simulated populations, as well as large microsatellite and allozyme experimental data sets for four conifer species with contrasting patterns of inherent genetic diversity and mating systems. Our model was a better predictor for allelic richness in natural populations than the widely-used Ewens sampling formula, coalescent approach, and rarefaction algorithm.
Our regression model was capable of accurately estimating allelic richness in natural populations regardless of the species and marker system. This regression modeling approach is free from assumptions and can be widely used for population genetic and conservation applications.
Accurate estimation of genetic diversity parameters in large natural populations using finite sample sizes is one of the central issues in population and conservation genetic studies and applications. Small sample sizes can lead to significant errors in estimating the genetic diversity of the species in question. For effective genetic resource conservation, sufficient allelic richness and a minimum number of carriers for each allele must be present in the conservation population to ensure its self-sufficiency over generations, otherwise its entire purpose may be compromised if the sampling criteria are not met . This aspect is often overlooked when conservation programs are developed and minimum viable population sizes are determined .
Allelic diversity (richness) is one of the most important and commonly used estimators of genetic diversity in populations. It strongly depends on the effective population size and past evolutionary history . However, the number of observed alleles and their frequency distribution also depend on the sample size and the genetic marker system used. Thus, a practical method for reliable estimation of genetic diversity parameters in large populations is needed for population genetic studies and to develop scientifically sound strategies for genetic resource conservation.
Based on the probability theory alone, one can calculate the sample size required to detect alleles with a certain threshold frequency [4, 5]. Rarefaction  and repeated random subsampling  are increasingly popular methods for standardizing the allelic richness for unequal sample sizes. However, there are several possible limitations in using these approaches. First, many estimates are based on the ideal population model. Most temperate and boreal species have experienced tremendous migrations and disturbances since the last glacial maximum. The northernmost populations are evolutionary young and remain dynamic, showing significant deviations from the equilibrium state. Second, distribution of allele frequencies strongly varies among species and marker types. Third, due to the non-linear relationship between sample size and observed allelic richness, simple extrapolation beyond the maximum sample size may not be feasible . Bayesian approaches have also been introduced, but they still cannot predict the allelic richness in large populations when the sample sizes are limited .
Estimating the parameter θ in a natural population is still complicated: i) in the expression θ = 4N e μ, there is no or very little information on the mutation rates in plant populations, plus the observed mutation rates are locus-specific, and can be confounded by selection and migration; ii) the effective population size (N e ) can be estimated by the coalescent approach [12, 13], but the inferences depend on the underlying population genetic model, and the related assumptions may not hold true for the real natural population in question. For nucleotide sequences, Nei introduced nucleotide diversity ϕ as another estimator of 4N e μ , but it is locus-specific and sensitive to the sample size.
Although the Ewens sampling formula and coalescent approach provide theoretical expectations for the allelic richness in a given sample, they normally assume an ideal random mating population of constant size, and without migration and selection. However, natural populations rarely conform to these and other ideal population assumptions. Selection effects may be heterogeneous in time and space, and are extremely difficult to realistically model. Random mating may be hampered by spatial genetic structure and selfing [15, 16]. A simple, assumption-free and robust method is needed for estimating allelic diversity in large natural populations. Here, we introduce a simple and robust approach to estimate the genetic diversity in large natural populations based on the empirical finite sample data.
The empirically derived equation (5a) is similar to the modified Ewens sampling formula in equation (3).
We tested the regression model (5) using (i) large empirical data sets for four conifer tree species with contrasting population genetic characteristics, and (ii) simulated population genetic data sets created using Markov-chain-based algorithm with different inherent migration and selfing rates.
Allelic richness estimated by regression, coalescent and rarefaction
Source data set
Estimated allelic richness
No. of loci
( n = 120)
( n = 120)
( n = 120)
( n = 120)
( n = 120)
An additional data set for allozyme markers for eastern white pine was also analyzed. Allozymes have been extensively used in population and conservation genetic studies before the advent of microsatellite markers. Although other markers, such as RAPD (random amplified polymorphic DNA), and AFLP (amplified fragment length polymorphism) have been used in population genetic studies, these markers are not well suited for such studies and have fallen out of favour, primarily due to their diallelic and dominant nature. Codominant SNP (single nucleotide polymorphism) markers are being used in population genetic studies. However, most of them also suffer from the limitation of being diallelic. Since the objective of the present study was to predict the allelic richness in large populations, we used microsatellite and allozyme markers for validating our model, since these markers are codominant and multiallelic.
Pinus strobus and Picea glauca are predominantly outcrossing species - average multilocus outcrossing rates (tm) are 0.924, and 0.940, respectively [18, 19], and Picea rubens and Thuja occidentalis are mixed-mating selfing-tolerant species - tm = 0.595, and 0.635, respectively [20, 21]. Samples were collected in natural populations. In Picea rubens and Picea glauca stands, trees were randomly selected with minimum spacing of 30-50 m between the trees to avoid the possible family structure effects. In the Pinus strobus and Thuja occidentalis stands, all mature trees within the population were sampled. The number of individuals sampled per population varied from 95 to 180 (Table 1). The number of microsatellite loci used ranged from 6 to 13 (typically employed for population genetic studies). Although the Pinus strobus populations were genotyped for 54 allozyme loci , we used data for 15 most polymorphic loci to validate our model.
The allelic richness estimates predicted by our regression model were compared with the Ewens sampling formula, coalescent approach, and rarefaction algorithm predictions. Since the experimental data sets had only up to 180 individuals per population, pseudo-simulation data sets of ~10,000 individuals per population were created for each of the four conifer species from their empirical genotype data (Table 1) to address the collection of finite samples from a large natural population. This was done by randomly replicating each genotype within population equal number of times until a population size of ~10,000 was reached, so the resulting data sets had the same distribution of allele frequencies as the original populations. Then random sampling was applied to create test subsamples of 15, 25, 35, 45, 60, 90, and 120 individuals in 50 replicates, and the mean number of alleles per locus was calculated for each sample size. Computations were performed using a Visual Basic program for Microsoft Excel.
The resulting allelic richness values were used to derive the estimates of ρ and the β coefficients in equation (5) using the Gauss-Newton method implemented in the NLIN procedure in the SAS 9.1.3 statistical package (SAS Institute, Cary, NC). An example of the input data and SAS NLIN output, showing derivation of the regression coefficients (ρ, β n , and β A ) in (5), is provided in the Additional file 1.
We also tested the simplified Ewens formula (3) as a predictor for allelic richness. First, θ was calculated from the allelic richness values observed in the source data sets for natural populations (Table 1) using the modified Ewens formula (3). Then the resulting θ was used in equation (3) to estimate the predicted allelic richness at various sample sizes. We also estimated the θ values for the source data sets obtained from natural populations using the popular maximum likelihood coalescent approach implemented in the MIGRATE 3.0 program [22, 23]. The resulting θ estimates were used in the equation (3) to calculate the allelic richness for various sample sizes as indicated above. Also, we estimated the predicted allelic richness values (at n = 120) in our experimental samples using the rarefaction procedure implemented in the HP-RARE 1.0 program .
Additionally, to estimate the effects of sample size on the observed genetic diversity and genetic subdivision parameters, we calculated the observed and expected heterozygosity, Shannon information index, and F ST for Picea rubens and Pinus strobus data sets using the GENALEX 6.1. program .
To validate our model, we created 10 artificial data sets each containing 2 populations of 10,000 individuals, with selected combinations of inherent migration and selfing rates, using the Markov chain-based simulation algorithm implemented in the EASYPOP 2.1 program . Migration rates (Nm) were set at 0, 1, 10, 50 and 100 migrants per generation, and selfing (s) was set at 0, 0.2, 0.6, and 0.99 to cover a wide range of mating system and gene flow scenarios. Different combinations of migration and selfing rates would approximate possible deviations from the ideal population model for a wide variety of organisms. High degrees of selfing are not unusual in many mixed-mating selfing-tolerant conifer species, such as Thuja occidentalis , and extensive gene flow is generally observed in natural plant populations . Mutation rates were set to 0.0002, with the K-allele mutation model implied, and all loci had 20 possible allelic states - these parameters are typical for microsatellite markers . The population size was set constant at n = 10,000 to represent a typical large natural plant population. The initial allele states were assigned randomly, and then populations were allowed to evolve under the above-mentioned evolutionary scenarios for 20,000 generations to yield the data set A. Then a sample of n = 200 individuals (close to n = 180 in experimental populations of P. rubens) was taken from the resulting population, and randomly replicated 50 times as described above to create the data set B of n = 10,000 individuals. Then repeated random subsampling was performed on both data sets A and B (for n = 15, 25, 35, 45, 60, 90, 120, 500, 2,000, 5,000), and the allelic richness was calculated in 50 replicates as described above. Various combinations of Nm and selfing parameters used are provided in the Additional File 2.
As mentioned above, the empirically derived equation (5a) is similar to the modified Ewens sampling formula (3). Allelic richness estimates predicted by the Ewens formula (3) significantly deviated from the empirical estimates obtained by repeated random subsampling (Figure 2; Figure 3). As the equations (5a) and (3) are mathematically congruent, ρ in the equations (5) and (5a) may be interpreted as a simplified empirical estimator for θ. The empirically derived regression coefficients β n , and β A would provide correction for possible deviations of the experimental population from the ideal population model.
We also compared allelic richness estimates obtained for the four conifer species using our regression model equation (5) and the rarefaction procedure. Rarefaction estimates were close to the subsampling and regression results obtained for n = 120 (Table 1). Rarefaction is commonly used to standardize allelic richness estimates to the smallest sample size used in a given study, but it cannot extrapolate the allelic richness beyond the values observed in the empirically analyzed samples . Thus, it cannot be used for estimating the number of alleles in large populations. Our non-linear regression model is a good predictor for the allelic richness at large sample sizes. It effectively addresses the possible deviations from the ideal population model by introducing the empirically derived regression coefficients β.
A valid concern would be that the original sample set used for the subsampling procedure may contain only a fraction of the allelic diversity present in a large natural population. Our results indicate that allelic richness estimates obtained by the model developed here in the amplified data were consistent with that actually observed in the total simulated population. The allelic diversity of various samples drawn from the entire simulated population of 10,000 individuals (data set A) was consistent with that drawn from 50-times pseudo-replicated population of 200 individuals (data set B) (Figure 4). The replicated data sets based on n = 200 (data set B) adequately represented the major proportion of the allelic diversity existing in the entire simulated population (data set A), and pseudo-replication and subsampling apparently had little effect on allelic diversity estimates at the sample sizes below n < 100 typically used in population genetic studies. At large sample sizes (n > 500), the replicated data sets tend to underestimate the allelic richness in comparison with samples drawn from the true simulated population (Additional File 2). Capturing the low frequency (p = 10-2..10-4) alleles in a finite population would require sample sizes close to the entire number of individuals in the population.
It should be noted that existence of spatial genetic structure in a population can affect the observed allelic diversity estimate in a sample. In two of the four studied species, spatial genetic structure up to ~25 meters has been observed (Rajora, unpublished; Pandey and Rajora, submitted). Since the sampling distance normally used for population genetic studies in forest trees (30-50 m) is greater than the observed spatial genetic structure, the latter has little effect on the allelic richness estimates.
The logarithmic nature of the relationship between allelic richness and sample size holds true regardless of the organism and marker system used. In addition to our own data sets for conifer tree species, we observed this relationship in a number of other studies published for various taxa, e.g. [29–32]. In the present study, we provide a simple, direct and robust method to predict the allelic diversity in large natural populations. Leberg  mentioned one possible limitation of such extrapolation: it requires a significant number of samples for the initial estimation of θ, but in our opinion, the robustness of the subsequent results far outweighs the expenses associated with running a small pilot study.
Our approach takes into account possible deviations from the ideal population model occurring in such complex systems as natural forest tree populations, where long distance gene flow, population bottlenecks, selection, varying mating systems, and overlapping generations are the norm. One of the other advantages of our model over the coalescent approach is that it does not require high computation resources.
The minimum sample size for population genetics and conservation studies has been a hotly debated topic. Although it is usually desired to capture 90-95% of allelic diversity, it is often not feasible, as the true number of alleles in the population is rarely known. A recent study by Gapare and Aitken  claims that sample sizes of approximately 150 individuals per population would be enough to capture 95% of its alleles. However, the "true" number of alleles was observed at n = 200. Our simulation study and experimental results indicate that it is unlikely that a sample of 200 individuals would capture all alleles in a real natural population. The bivariate linear regression model used in  may not be an accurate predictor of allelic richness in large populations because the relationship between the sample size and the observed allelic richness is non-linear .
For conservation and adaptation studies, rare alleles may be especially important as they may represent the populations' potential to adapt in changing environmental conditions. Usually, very large sample sizes are suggested for conservation populations , although in our opinion the size of the conservation population can be optimized depending on the distribution of allele frequencies in the parent population. As the allelic richness is approximately a log function of the sample size, after certain threshold n (for example, n ~150 in red spruce - Figure 1), the observed allelic richness increases almost exclusively by rare alleles. At a very large n, doubling the sample size would allow only a minor increase in the allelic diversity (Additional File 2). Sampling artifacts arising from the existing spatial genetic structure may further reduce the observed allelic richness increment. Our regression model could be a good predictor for the number of rare alleles in natural populations. Once the regression model parameters have been established for a given species and marker system, the results should be applicable for other populations within the species.
Our non-linear regression model provides a simple and robust approach to estimate the genetic diversity in large natural populations based on the empirical data. Since the regression coefficients in our model are derived empirically, and there are no assumptions to violate, it allows for quick and easy estimation of allelic diversity in large natural populations based on finite sample sizes. The model is independent of the marker mutation mode and population history, and works well with high selfing and predominantly outcrossing species. It has been validated on simulated data sets, as well as on the experimental data for different species and molecular marker systems. Therefore, our model is more accurate, simple and practical than the coalescent or Ewens approach. The proposed method can be widely applicable in population genetic studies, and it may provide the missing link for conservation and management decision support.
The research was funded by the Canada Research Chair Program (CRC950-201869) funds and the Natural Sciences and Engineering Research Council of Canada Discovery Grant RGPIN 170651 to O.P. Rajora. S. Bashalkhanov was supported by the University of New Brunswick start up funds provided to O.P. Rajora and a Canadian Forest Service graduate student's supplemental stipend. M. Pandey was financially supported from the Canada Research Chair Program (CRC950-201869) funds to O.P. Rajora. Genotyping for Thuja occidentalis, and Picea glauca was carried out primarily by Dr. Lisa O'Connell and Dr. Ishminder Mann. The authors appreciate the useful comments and suggestions of three anonymous reviewers.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.