Population size influences the type of nucleotide variations in humans

Subramanian, Sankar

doi:10.1186/s12863-019-0798-9

Research article
Open access
Published: 05 December 2019

Sankar Subramanian ORCID: orcid.org/0000-0002-2375-3254¹

BMC Genetics volume 20, Article number: 93 (2019)


Abstract

Background

It is well known that the effective size of a population (N_e) is one of the major determinants of the amount of genetic variation within the population. However, it is unclear whether the types of genetic variations are also dictated by the effective population size. To examine this, we obtained whole genome data from over 100 populations of the world and investigated the patterns of mutational changes.

Results

Our results revealed that for low frequency variants, the ratio of AT→GC to GC→AT variants (β) was similar across populations, suggesting that the pattern of mutation is similar in various populations. However, for high frequency variants, β showed a positive correlation with the effective population size of the populations. This suggests a much higher proportion of high frequency AT→GC variants in large populations (e.g. Africans) compared to those with small population sizes (e.g. Asians). These results imply that the substitution patterns vary significantly between populations. These findings could be explained by the effect of GC-biased gene conversion (gBGC), which favors the fixation of G/C over A/T variants in populations. In large populations, gBGC causes high β. However, in small populations, genetic drift reduces the effect of gBGC, resulting in reduced β. This was further confirmed by a positive relationship between N_e and β for homozygous variants.

Conclusions

Our results highlight the huge variation in the types of homozygous and high frequency polymorphisms between world populations. We observed the same pattern for deleterious variants, implying that the homozygous polymorphisms associated with recessive genetic diseases will be more enriched with G or C in populations with large N_e (e.g. Africans) than in populations with small N_e (e.g. Europeans).

Background

The out of Africa hypothesis predicts that the ancestors of the human populations around the world originated in Africa, migrated out of the continent and eventually colonized different parts of the world [1]. During this process, the ancestral populations underwent a series of population bottlenecks along the migratory routes. Due to this founder effect, the ancestral population size is expected to decline with increasing distance from Africa. Previous empirical studies confirmed this prediction and showed that populations in Africa are the most genetically diverse and that the diversity declined with increasing geographic distance from Africa particularly along the colonization routes [2,3,4,5,6]. These observations clearly suggest significant variation in the nucleotide diversity among global populations.

Nucleotide diversity (π = 4N_eμ) is a measure of genetic variation, which is determined by mutation rate (μ) and effective population size (N_e). Since mutation rate is similar across human populations, the observed difference in the diversity of world populations is largely due to the variations in effective population sizes. Although a recent study suggested a higher rate of mutation in non-Africans, the magnitude of this effect was very small (~5%) [7]. Recent population genomic studies showed large variation in the number of polymorphisms observed between world populations [8]. Populations in Africa have ~5 million Single Nucleotide Variations (SNVs) whereas those in East Asia have ~4.1 million, which is ~20% less. Although the variation in the number of polymorphisms is well known, it is unclear if there are differences in the types of polymorphisms between world populations. Are the frequencies of different types of nucleotide changes (e.g. A→G or T→C) similar across populations? This question arises from our understanding of the phenomenon of GC-biased gene conversion (gBGC).

gBGC is a recombination-associated process that favors G/C over A/T nucleotides during the repair of mismatches that occur in heteroduplex DNA during meiosis [9,10,11,12,13]. Although this process is not associated with natural selection, the efficiency of gBGC could also be reduced by genetic drift. Therefore, the effect of gBGC is expected to be much weaker in small populations than in large populations [12, 13] and consequently, the frequencies of AT→GC polymorphisms are expected to differ among world populations. It is important to characterize the population specific patterns of genetic variants, as these patterns may have immense implications for human health.

For instance, previous studies have shown a positive correlation between the number of deleterious homozygous SNVs present in human populations and their distance from East Africa [14]. Furthermore, non-Africans were found to have a much higher proportion of high frequency or homozygous deleterious variants than Africans [15, 16]. These observations suggest that due to the effects of genetic drift, small populations (typically located away from Africa) have a higher fraction of high frequency (and homozygous) deleterious mutations than large populations. On the other hand, gBGC-mediated skews in the frequencies of deleterious AT→GC (relative to GC→AT) polymorphisms were also reported [17, 18]. However, it is unclear whether the extent of such skews is influenced by the effective population sizes of various global populations. Therefore, using data from the 1000 Genomes Project we investigated the pattern of nucleotide changes observed in the SNVs segregating at different allele frequencies [8]. Furthermore, we also analyzed homozygous and heterozygous variant data for 126 distinct populations from around the world, obtained from the Simons Genome Diversity Project [7].

Results

Allele frequencies and types of nucleotide changes in human populations

To quantify the difference in the patterns of observed AT→GC and GC→AT changes we derived a measure β, based on the Watterson estimator (θ_W) as described in the methods (Eq. 1). The measure β is the ratio of AT→GC (μ_AT→GC) and GC→AT (μ_GC→AT) mutation rates, which captures the mutational equilibrium between AT and GC nucleotides. The ratio β is expected to be 1 if the observed AT→GC and GC→AT changes are due to the forward and reverse mutation rates alone. Any deviation from this ratio (β = 1) suggests a bias in the substitution of one type over the other. To examine the variation in the patterns of nucleotide changes we obtained the 1000 Genomes phase II data for 26 distinct populations of the world. Since most of the genomes from Latin America were admixed with Europeans/Africans, we separated 20 Peruvian genomes that had < 0.5% admixture. We used these to represent an un-admixed Native American population and hence the total number of populations analyzed was 27. The SNVs of these populations were grouped into eleven categories based on their Derived Allele Frequencies (DAF). The ratio β was estimated for each category of SNVs belonging to each population. We estimated the effective size (N_e) of each population based on the mutation rate (μ) and nucleotide diversity (π) (see methods).
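Grouping SNVs by derived allele frequency can be sketched as follows. This is a minimal illustration, not the author's code: the text names only the DAF < 0.025, DAF = 0.2–0.3 and DAF > 0.9 categories, so the remaining bin edges in `ASSUMED_EDGES`, the function name `daf_bin`, and the choice to place boundary values in the lower bin are all assumptions.

```python
from bisect import bisect_left

# Hypothetical bin edges giving eleven categories. Only the first bin
# (< 0.025), the 0.2-0.3 bin and the last bin (> 0.9) are named in the text.
ASSUMED_EDGES = [0.025, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]


def daf_bin(daf: float, edges=ASSUMED_EDGES) -> int:
    """Return the 0-based index of the DAF category for a segregating SNV.

    A value equal to an edge is placed in the lower bin (an assumption).
    """
    if not 0.0 < daf < 1.0:
        raise ValueError("a segregating derived allele has 0 < DAF < 1")
    return bisect_left(edges, daf)
```

With these assumed edges, an SNV with DAF 0.25 falls in category 3 (0.2–0.3) and one with DAF 0.95 falls in category 10 (> 0.9).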

Figure 1 shows the relationship between N_e and β for SNVs belonging to two extreme allele frequencies. For DAF < 0.025, the estimates of β were almost equal to 1 for all populations. In contrast, we observed a significant positive correlation (P < 10^−6) for SNVs with very high DAF (> 0.9). Another major pattern was that the β estimates were similar among populations belonging to the same geographical locations and very different among populations from distinct locations. However, this was not true for admixed Americans (black dots). We then examined this relationship for SNVs with different derived allele frequencies. We did not find any significant relationship between N_e and β for low frequency SNVs with DAF ≤ 0.2 (P > 0.10) (Fig. 2a). However, significant positive correlations (at least P < 10^−2) were observed between N_e and β for SNVs with DAF > 0.2. The magnitude of the correlation increased with the increase in DAF, which is evident from the rise in the slopes of the regression lines. This increase is clearer in Fig. 2b, which shows the positive relationship between DAF and the slopes of the regression lines shown in Fig. 2a. The slope observed for SNVs with DAF > 0.9 was 6 × 10^−5, which is about 15 times higher than for SNVs with DAF = 0.2–0.3 (4 × 10^−6). For high frequency SNVs (DAF > 0.9), the mean β estimated for African genomes (1.72) was 27% higher than that observed for Peruvian genomes (1.35).

We also plotted β against DAF to show how this estimate changes with increasing DAF. For this purpose, we selected five representative populations with significantly different N_e. As shown in Fig. 3, positive relationships between DAF and β were observed for all populations and β increased with increasing DAF. However, based on the slopes of the regression lines (see Fig. 2 legend-inset) the rate of increase correlates with N_e. The slope was the highest for Africans (0.55), intermediate for Eurasians (0.34–0.43) and the lowest for Peruvians (0.28).

Patterns of homozygous and heterozygous variations

To further examine the patterns of substitutions in a much larger number of populations we obtained the genotype data of the Simons Genome Diversity Project for genomes representing 126 distinct ethnic populations. We examined the patterns of nucleotide changes for homozygous and heterozygous SNVs. For each genome, we estimated β for homozygous and heterozygous SNVs. The nucleotide diversity was estimated by comparing the pairs of chromosomes of each genome and N_e was calculated using the mutation rate obtained from previous studies (see methods). We then plotted β against N_e. For homozygous SNVs, our regression analysis produced a significant positive correlation (P < 10^−6) between the two variables (Fig. 4a). Although the effective population sizes vary widely between world populations, they are roughly similar for the populations in a geographical location. Interestingly, the β values for homozygous SNVs also varied considerably but were similar for populations within a geographical region. This is clear from the mean estimates shown in the inset of Fig. 4a. For consistency, we also examined this relationship using the 1000 Genomes phase II data and found a similar, highly significant relationship between N_e and β for homozygous SNVs (P < 10^−6) (Additional file 1: Figure S1). On the other hand, β for heterozygous SNVs did not show any significant relationship (P = 0.36) with N_e (Fig. 4b).

Proportion of deleterious AT→GC and GC → AT changes in global populations

To understand the implications of the observed patterns on human health, we examined the patterns of nucleotide changes for deleterious SNVs. To determine the deleteriousness of the SNVs we used the robust method, Combined Annotation-Dependent Depletion (CADD). This method integrates over 60 diverse annotations (to measure the extent of deleteriousness of a variant) into a single measure (C score) [19]. We designated SNVs with a C score of > 15 as deleterious. To distinguish the frequency of different types of nucleotide changes we estimated the proportion of deleterious AT→GC (P_GC) and GC→AT (P_AT) changes using eqs. 3 and 4 (methods). These estimates were plotted against N_e. For deleterious SNVs with DAF > 0.9 we found highly significant relationships between N_e and P_GC (P < 10^−6) and between N_e and P_AT (P < 10^−6) (Fig. 5a). Importantly, for large populations (Africans) the proportion of deleterious AT→GC mutations was 54% higher than the proportion of GC→AT mutations. For small populations (Peruvians) this difference was only 26%. We performed a similar analysis using the homozygous SNVs from 126 world populations and obtained similar results (Fig. 5b). The difference between P_GC and P_AT was highest for Africans (51%) and lowest (22%) for Native Americans.

Discussion

In this study, we showed that the types of nucleotide changes in world populations are shaped by their effective population sizes. Our results revealed a much higher proportion of AT→GC variations in populations with large effective sizes (e.g. Africans) compared to those with small sizes (e.g. Native Americans). These observations could be explained based on the well-known recombination-associated GC-biased gene conversion (gBGC) [9,10,11,12,13]. The two strands of DNA are connected by double hydrogen bonds between A and T bases (weak) as well as triple hydrogen bonds between G and C bases (strong). It has been shown that gBGC favors the changes involving AT→GC (or weak → strong) compared to GC→AT (or strong → weak) during the process of fixation. We developed a measure β to capture this bias in substitutions. For rare SNVs, β estimates were close to 1 (β ≈ 1) for all populations (Fig. 1). This suggests that the rate of AT→GC mutations per (A or T) base is equal to the rate of GC→AT mutations per (G or C) base. Therefore, the observed number of rare SNVs reflects the mutation patterns alone.

In contrast, for high frequency variations β estimates were significantly higher than 1 (β > 1). Importantly, β increased with the increase in DAF as shown clearly in Fig. 3. These results suggest a preferential fixation of AT→GC over GC→AT mutations over time. However, based on the slopes of the regression lines the results also suggest that the rate of preferential fixation of AT→GC mutations in small populations is low compared to that in large populations, because genetic drift reduces the efficiency of gBGC. This result is supported by a previous study using human population genetic data that suggested that on average gBGC is stronger in African than in non-African populations [13].

To further support our claim, we examined the changes between A↔T (weak bond) and between G↔C (strong bond) nucleotides using high frequency SNVs (DAF > 0.9). Our results showed that β estimated for A→T/T→A or C→G/G→C were close to 1 for all populations and there was no significant relationship with N_e (Additional file 1: Figure S2). We also examined this using homozygous and heterozygous SNVs from the 126 populations and observed no significant relationship between β and N_e (Additional file 1: Figure S3A and S3B). Since the changes are between the same types of nucleotides (with respect to weak or strong hydrogen bonds) there was no effect of gBGC on the fixation of one type of nucleotide over the other. This provides further evidence that the results of our study are not due to methodological artifacts.

Since gBGC does not affect the changes between A↔T (weak bond) and between G↔C (strong bond), previous studies have used the rate of these changes as a normalizing factor to assess the magnitude of gBGC on GC↔AT changes [13, 17, 18]. Following this, we normalized AT→GC with A↔T changes and GC→AT with G↔C changes respectively and developed a normalized ratio, β’ (eq. 2, methods). The relationship between N_e and β’ was also highly significant and comparable to previous results obtained for high frequency (Additional file 1: Figure S4A) and homozygous SNVs from the 1000 Genomes project (Additional file 1: Figure S4B) and the Simons Genome Diversity Project (Additional file 1: Figure S4C). This further supports our results as the normalization eliminates any variation in the mutation rates between populations.

The results based on homozygous and heterozygous SNVs shown in Fig. 4 further support the results based on DAF presented in Figs. 1 and 2. For instance, Figs. 1 and 2a showed there was no significant relationship between β and N_e for low frequency SNVs. Since low frequency SNVs predominantly exist as heterozygous SNVs in a genome, results based on the former are expected to be similar to those based on the latter (Figs. 1 and 4b). Similarly, as high frequency SNVs are more likely to be present as homozygous SNVs in genomes, the results for these two types of SNVs are alike. This is evident from the results shown in Figs. 1 and 4a.

Since gBGC is mediated by recombination, its effects were found to be strong for highly recombining regions. To examine this, we obtained SNVs from regions with low (< 2 cM), medium (2–20 cM) and high (> 20 cM) rates of recombination. The results showed a much higher β for the variants in high recombination regions (Fig. 6). However, the magnitude of the relationship between β and N_e was similar for all three regions; the slopes of the regression lines were 0.000039 (low), 0.000042 (medium) and 0.000041 (high). This suggests that the influence of population size on gBGC is relatively similar across chromosomal regions with varying degrees of recombination.

Previous studies have shown a significant negative correlation between heterozygosity and the distance (of the location of the populations) from Africa. This correlation is expected based on the prediction that during migration out of Africa human populations underwent a series of population bottlenecks or founder effects along the migratory route [2,3,4,5]. This is because only a subset of people migrated from the original location to new sites and hence the size of the populations reduced with distance from Africa. From previous studies, we obtained the geographic distance of 41 non-African populations from Eastern Africa (Addis Ababa) [2, 6] and we plotted the estimates of β against them. We obtained a highly significant negative correlation for homozygous variants (P < 10^−6) (Fig. 7a) but not for heterozygous variants (P = 0.2) (Fig. 7b). This is very similar to the results shown in Fig. 4. In this analysis, the geographic distance from Africa was used as a proxy for N_e. Hence this result independently confirms our findings and also justifies the method used in this study to estimate N_e from nucleotide diversity.

We estimated N_e under the assumption that mutation rate (μ) and the rate of accumulation of mutations are both similar between populations. However, a recent study suggested a slightly (~5%) higher rate of mutation accumulation in non-Africans compared to Africans [7]. To accommodate the elevated diversity, we subtracted 5% of the observed diversity for non-Africans while estimating N_e and re-analyzed the data. This produced almost identical patterns and similar strengths of correlation to those reported in Fig. 4 (Additional file 1: Figure S5).

Conclusions

We have shown that the types of SNVs observed in different human populations are very likely to be modulated by their effective population sizes. Since this pattern was universal for genome-wide variations we showed that deleterious SNVs also follow this pattern. Our results showed that populations with large effective sizes (e.g. Africans) displayed the greatest difference between the proportions of high frequency deleterious AT→GC and GC→AT SNVs. This difference was much lower in populations with small effective sizes (e.g. Native Americans). This has significant implications for human health as it implies that high frequency disease-associated mutations in Africans will be more enriched with AT→GC SNVs than in Native Americans. Furthermore, we showed that deleterious homozygous SNVs are also predominantly AT→GC in Africans, and to a greater extent than in non-Africans. This suggests the possibility that recessive genetic disorders in Africans are more likely to be caused by AT→GC variants than in non-Africans. Therefore, our study recommends that genome-wide association studies should consider the frequency of population specific nucleotide changes.

Methods

Genome data

We obtained genotype data from the 1000 Genomes Project – Phase II [8]. The genome-wide variations came from 26 populations, including Africans (seven populations: 661 individuals), South Asians (five: 489), Europeans (five: 503), East Asians (five: 504) and South Americans (four: 347). Although there were 85 Peruvian genomes available, most of these were admixed with Europeans and Africans. Hence, we used the likelihood-based clustering algorithm Admixture [20] and examined the proportion of admixture in each Peruvian genome. Our results showed that only 20 genomes (40 chromosomes) had < 0.5% admixture from other populations and we included these un-admixed Peruvians as the 27th population in our analyses. To identify derived alleles, orientations of SNVs were determined using the ancestral state of the nucleotides, which was inferred from six primate EPO alignments [21]. The SNVs were divided into eleven categories based on their derived allele frequencies (see Fig. 2-legend). For the SNVs in each category we computed the counts of six types of changes: A/T → G/C, G/C → A/T, C → G, G → C, A → T and T → A (see below).
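The tally of the six change types can be sketched as below. This is an illustrative sketch rather than the author's implementation; the function names, the string labels such as `"AT>GC"`, and the input format (pairs of ancestral and derived single-letter alleles) are assumptions.

```python
from collections import Counter

WEAK = {"A", "T"}    # bases paired by two hydrogen bonds
STRONG = {"G", "C"}  # bases paired by three hydrogen bonds


def classify_change(ancestral: str, derived: str) -> str:
    """Assign an ancestral -> derived change to one of the six types."""
    a, d = ancestral.upper(), derived.upper()
    if a == d or a not in WEAK | STRONG or d not in WEAK | STRONG:
        raise ValueError(f"not a single nucleotide change: {ancestral}>{derived}")
    if a in WEAK and d in STRONG:
        return "AT>GC"
    if a in STRONG and d in WEAK:
        return "GC>AT"
    # Remaining changes stay within the weak or within the strong class:
    # A>T, T>A, C>G or G>C.
    return f"{a}>{d}"


def count_changes(snvs):
    """Count change types for an iterable of (ancestral, derived) pairs."""
    return Counter(classify_change(a, d) for a, d in snvs)
```

Keeping the within-class changes (A↔T and G↔C) as separate labels leaves them available for a normalization step.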

We also obtained the genotype data from the Simons Genome Diversity Project [7]. To examine the patterns of nucleotide changes we used the homozygous and heterozygous SNVs present in a single representative genome from each of the 126 populations. We excluded four African hunter-gatherer populations from our analysis as it was difficult to ascertain the correct orientation of the nucleotide changes in these genomes. For each genome, we estimated the number of homozygous and heterozygous changes belonging to the six types described above.

Deleterious mutations

To determine the deleterious nature of an SNV we used a robust method, Combined Annotation-Dependent Depletion (CADD), that integrates diverse annotations into a single measure (C score) [19]. The extent of deleteriousness was further determined by estimating the corresponding selective coefficients for these scores [22]. For instance, SNVs with a CADD score of 15–20 have a selection coefficient (s) of 0.0001 and this is considered to significantly affect the fitness of humans as most of the nonsynonymous polymorphisms have a score above this threshold. The C scores for each SNV in the 1000 Genomes Project data were publicly available (http://cadd.gs.washington.edu/download). Using an in-house Perl script, we combined this score with the genome data by using the chromosomal co-ordinates of the SNVs. For the deleterious variant analysis, we included only the SNVs for which the C score was available. We used a C score of ≥15 to determine a mutation to be deleterious in nature following previous studies [19, 23]. However, using a different threshold produced almost identical results.
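The coordinate-based merge of C scores with SNVs can be sketched as follows. The author used an in-house Perl script that is not shown, so this Python version is only an assumed equivalent: the function name `annotate_deleterious`, the `(chrom, pos, ref, alt)` key and the in-memory dictionary of scores are hypothetical choices.

```python
def annotate_deleterious(snvs, cadd_scores, threshold=15.0):
    """Attach C scores to SNVs and flag those at or above the threshold.

    snvs: iterable of (chrom, pos, ref, alt) tuples (assumed key format).
    cadd_scores: dict mapping the same tuples to a C score.
    Returns a list of (key, score, is_deleterious) tuples.
    """
    annotated = []
    for key in snvs:
        score = cadd_scores.get(key)
        if score is None:
            continue  # SNVs without a C score are excluded from the analysis
        annotated.append((key, score, score >= threshold))
    return annotated
```

Passing a different `threshold` gives a quick way to check how sensitive downstream counts are to the cutoff.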

Estimating the ratio of mutation rates

The ratio of AT→GC and GC→AT mutation rates could be estimated based on the Watterson estimator [24], θ = 4N_eμ = S/a_n, where N_e is the effective population size, S is the number of segregating sites per site and a_n = $ \sum \limits_{i=1}^{n-1}\frac{1}{i} $. We can use this estimator considering only one type of mutation as:

$$ {\theta}_{A\to G}=4{N}_e{\mu}_{A\to G}=\frac{S_{A\to G}}{a_n} $$

$$ {\theta}_{G\to A}=4{N}_e{\mu}_{G\to A}=\frac{S_{G\to A}}{a_n} $$

The ratio of forward and reverse nucleotide changes (β) could be obtained as:

$$ \beta =\frac{\mu_{A\to G}}{\mu_{G\to A}}=\frac{\theta_{A\to G}}{\theta_{G\to A}}=\frac{S_{A\to G}}{S_{G\to A}} $$
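The cancellation of a_n (and of 4N_e) in this ratio can be checked numerically. The sketch below is illustrative only; the function names and the example values of S are assumptions, not data from the study.

```python
def harmonic_a_n(n: int) -> float:
    """a_n = sum of 1/i for i = 1 .. n-1, for a sample of n chromosomes."""
    if n < 2:
        raise ValueError("need at least two chromosomes")
    return sum(1.0 / i for i in range(1, n))


def watterson_theta(s_per_site: float, n: int) -> float:
    """Watterson estimator: theta = S / a_n, with S in segregating sites per site."""
    return s_per_site / harmonic_a_n(n)


# Hypothetical per-site values for one sample of n = 40 chromosomes.
s_a_to_g, s_g_to_a, n = 0.0012, 0.0010, 40
ratio_of_thetas = watterson_theta(s_a_to_g, n) / watterson_theta(s_g_to_a, n)
ratio_of_s = s_a_to_g / s_g_to_a  # same value: a_n cancels in the ratio
```

Because both θ values come from the same sample, the ratio does not depend on the sample size n.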

The number of segregating sites per site or SNVs per site of a genome can be estimated as:

$$ {S}_{A\to G}=\frac{M_{A\to G}}{N_A}\quad \text{and}\quad {S}_{G\to A}=\frac{M_{G\to A}}{N_G} $$

where M_A → G and M_G → A are the number of observed A → G and G → A mutations in a genome respectively and N_A and N_G are the number of ancestral A and G nucleotides. This formula can be extended for the combined AT→GC and GC → AT mutation rates because each pattern is mutually exclusive and hence the ratio of nucleotide changes (β) is:

$$ \beta =\frac{\mu_{AT\to GC}}{\mu_{GC\to AT}}=\frac{\theta_{AT\to GC}}{\theta_{GC\to AT}}=\frac{S_{AT\to GC}}{S_{GC\to AT}} $$

The number of segregating sites or SNVs in a genome can be calculated as:

$$ {S}_{AT\to GC}=\frac{M_{A\to G}+{M}_{A\to C}}{N_A}+\frac{M_{T\to C}+{M}_{T\to G}}{N_T} $$

$$ {S}_{GC\to AT}=\frac{M_{G\to A}+{M}_{G\to T}}{N_G}+\frac{M_{C\to T}+{M}_{C\to A}}{N_C} $$

Since A and T as well as C and G are complementary to each other in a double-stranded DNA they are equal in number. Therefore, β can be expressed as,

$$ \beta =\frac{M_{A\to G}+{M}_{A\to C}+{M}_{T\to C}+{M}_{T\to G}}{M_{G\to A}+{M}_{G\to T}+{M}_{C\to T}+{M}_{C\to A}}\times \frac{N_{GC}}{N_{AT}} $$

$$ \beta =\frac{M_{AT\to GC}}{M_{GC\to AT}}\times \frac{N_{GC}}{N_{AT}}\qquad (1) $$

This derivation shows that the ratio of forward and reverse rates of changes can be calculated by simply taking the ratio of the observed counts of AT→GC (M_{AT→GC}) and GC→AT (M_{GC→AT}) changes and multiplying it by the ratio of the number of GC (N_GC) to AT (N_AT) nucleotides in a genome. Since eq. 1 represents the ratio of mutation rates (μ_AT→GC and μ_GC→AT), this ratio is expected to be 1 (β = 1) if the observed nucleotide changes result solely from these mutation rates. Any deviation from this suggests a bias in the substitution process. While β > 1 indicates an excess of AT→GC substitutions, β < 1 implies an excess of GC→AT substitutions.
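Eq. 1 translates directly into a few lines of code. The following is a minimal sketch; the function name `beta_ratio` and the counts used in the usage note are hypothetical.

```python
def beta_ratio(m_at_to_gc: int, m_gc_to_at: int, n_at: int, n_gc: int) -> float:
    """Eq. 1: beta = (M_AT->GC / M_GC->AT) * (N_GC / N_AT).

    m_at_to_gc, m_gc_to_at: observed counts of the two change types.
    n_at, n_gc: numbers of ancestral A/T and G/C nucleotides.
    """
    if m_gc_to_at <= 0 or n_at <= 0:
        raise ValueError("M_GC->AT and N_AT must be positive")
    return (m_at_to_gc / m_gc_to_at) * (n_gc / n_at)
```

For example, with made-up counts of 150 AT→GC and 100 GC→AT changes in a genome with 600 A/T and 400 G/C sites, `beta_ratio(150, 100, 600, 400)` is 1 (up to floating-point rounding): the raw excess of AT→GC changes is fully explained by the larger number of A/T sites.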

GC-biased gene conversion is known to affect only the changes involving weak (A or T) to strong (G or C) nucleotides but not the changes within weak (A↔T) or within strong (G↔C) nucleotides. Hence the latter is not expected to vary between populations with different N_e. Therefore, we used this as a normalization factor and developed a normalized ratio of AT→GC to GC → AT (β’).

The A → G mutation rate can be normalized using A → T rate and the normalized rate (τ_A → G) can be expressed as:

$$ {\tau}_{A\to G}=\frac{\mu_{A\to G}}{\mu_{A\to T}}=\frac{\theta_{A\to G}}{\theta_{A\to T}}=\frac{S_{A\to G}}{S_{A\to T}}=\frac{\left({M}_{A\to G}/{N}_A\right)}{\left({M}_{A\to T}/{N}_A\right)}=\frac{M_{A\to G}}{M_{A\to T}} $$

Similarly, we can obtain this expression for AT→GC and GC → AT rates as:

$$ {\tau}_{AT\to GC}=\frac{\mu_{A\to G}}{\mu_{A\to T}}+\frac{\mu_{A\to C}}{\mu_{A\to T}}+\frac{\mu_{T\to C}}{\mu_{T\to A}}+\frac{\mu_{T\to G}}{\mu_{T\to A}} $$

$$ {\tau}_{AT\to GC}=\frac{M_{A\to G}+{M}_{A\to C}}{M_{A\to T}}+\frac{M_{T\to C}+{M}_{T\to G}}{M_{T\to A}} $$

$$ {\tau}_{GC\to AT}=\frac{\mu_{G\to A}}{\mu_{G\to C}}+\frac{\mu_{G\to T}}{\mu_{G\to C}}+\frac{\mu_{C\to T}}{\mu_{C\to G}}+\frac{\mu_{C\to A}}{\mu_{C\to G}} $$

$$ {\tau}_{GC\to AT}=\frac{M_{G\to A}+{M}_{G\to T}}{M_{G\to C}}+\frac{M_{C\to T}+{M}_{C\to A}}{M_{C\to G}} $$

Therefore, the normalized ratio of AT→GC to GC → AT (β’) is:

$$ {\beta}^{\prime }=\frac{\tau_{AT\to GC}}{\tau_{GC\to AT}}\qquad (2) $$
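The normalized ratio in eq. 2 can be computed from the twelve directional counts. This is a sketch under assumptions: the function name `normalized_beta` and the `"X>Y"` dictionary keys are hypothetical conventions, not part of the original analysis.

```python
def normalized_beta(m: dict) -> float:
    """Eq. 2: beta' = tau_AT->GC / tau_GC->AT.

    m maps change labels such as "A>G" to observed counts. Weak-to-strong
    counts are normalized by A<->T changes and strong-to-weak counts by
    G<->C changes, following the tau expressions above.
    """
    tau_at_gc = (m["A>G"] + m["A>C"]) / m["A>T"] + (m["T>C"] + m["T>G"]) / m["T>A"]
    tau_gc_at = (m["G>A"] + m["G>T"]) / m["G>C"] + (m["C>T"] + m["C>A"]) / m["C>G"]
    return tau_at_gc / tau_gc_at
```

Because each count is divided by a count from the same ancestral base, the numbers of ancestral nucleotides (N_A, N_T, N_G, N_C) drop out and are not needed as inputs.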

The relationship between nucleotide diversity (π), mutation rate (μ) and effective population size (N_e) for diploid organisms is π = 4N_eμ. Using this relationship, we calculated the effective population size as N_e = π/4μ. We used the observed nucleotide diversity of a population or of a diploid genome and used a mutation rate of 1.2 × 10^−8 substitutions per site per generation following many studies based on human pedigree genome data [25, 26]. A recent study suggested that the rate of mutation accumulation in non-African genomes could be slightly (5%) higher than that of Africans [7]. To accommodate this difference, we subtracted 5% of the nucleotide diversity for non-African populations only while calculating the effective population sizes and obtained almost identical results (Additional file 1: Figure S5).
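The calculation of N_e, including the optional 5% adjustment, can be sketched as follows. The function name, the `diversity_reduction` parameter and the example diversity value are assumptions for illustration; only the mutation rate comes from the text.

```python
MU = 1.2e-8  # mutation rate per site per generation, as stated in the text


def effective_population_size(pi: float, mu: float = MU,
                              diversity_reduction: float = 0.0) -> float:
    """N_e = pi / (4 * mu) for a diploid population.

    diversity_reduction: fraction of pi to subtract before the calculation,
    e.g. 0.05 for the sensitivity analysis on non-African populations.
    """
    if not 0.0 <= diversity_reduction < 1.0:
        raise ValueError("diversity_reduction must be in [0, 1)")
    adjusted_pi = pi * (1.0 - diversity_reduction)
    return adjusted_pi / (4.0 * mu)
```

For a hypothetical diversity of π = 9.6 × 10^−4, this gives N_e of about 20,000, or about 19,000 once 5% of the diversity is subtracted.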

Estimation of the proportion of AT→GC counts

We also estimated the proportions of AT→GC (P_GC) and GC→AT (P_AT) counts, which were calculated as:

$$ {P}_{GC}=\frac{N_{A\to G}+{N}_{A\to C}+{N}_{T\to C}+{N}_{T\to G}}{N}\qquad (3) $$

$$ {P}_{AT}=\frac{N_{G\to A}+{N}_{G\to T}+{N}_{C\to A}+{N}_{C\to T}}{N}\qquad (4) $$

where N is the number of all types of nucleotide changes. The standard errors of P_GC and P_AT were calculated using the binomial variance.
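The proportions in eqs. 3 and 4 and their binomial standard errors can be sketched as below. The function name and the example counts are hypothetical; the standard error uses the usual binomial form sqrt(p(1 − p)/N).

```python
import math


def proportion_with_se(count: int, total: int):
    """Return (p, se) where p = count / total and se = sqrt(p * (1 - p) / total)."""
    if total <= 0 or not 0 <= count <= total:
        raise ValueError("need 0 <= count <= total and total > 0")
    p = count / total
    se = math.sqrt(p * (1.0 - p) / total)
    return p, se


# Hypothetical example: 50 AT->GC changes among 100 changes of all types.
p_gc, se_gc = proportion_with_se(50, 100)
```

The same function applies to the GC→AT proportion by passing the corresponding count with the same total N.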

Availability of data and materials

The whole genome datasets analyzed during the current study were obtained from the 1000 Genomes Project – Phase II (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp) and the Simons Genome Diversity Project (https://www.simonsfoundation.org/simons-genome-diversity-project/).

Abbreviations

CADD: Combined Annotation-Dependent Depletion
DAF: Derived Allele Frequencies
gBGC: GC-Biased Gene Conversion
SNV: Single Nucleotide Variation

References

Stringer C. Human evolution: out of Ethiopia. Nature. 2003;423(6941):692–3 695.
Article CAS PubMed Google Scholar
DeGiorgio M, Jakobsson M, Rosenberg NA. Out of Africa: modern human origins special feature: explaining worldwide patterns of human genetic variation using a coalescent-based serial founder model of migration outward from Africa. Proc Natl Acad Sci U S A. 2009;106(38):16057–62.
Article CAS PubMed PubMed Central Google Scholar
Handley LJ, Manica A, Goudet J, Balloux F. Going the distance: human population genetics in a clinal world. Trends Genet. 2007;23(9):432–9.
Article CAS PubMed Google Scholar
Prugnolle F, Manica A, Balloux F. Geography predicts neutral genetic diversity of human populations. Curr Biol. 2005;15(5):R159–60.
Article CAS PubMed PubMed Central Google Scholar
Ramachandran S, Deshpande O, Roseman CC, Rosenberg NA, Feldman MW, Cavalli-Sforza LL. Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa. Proc Natl Acad Sci U S A. 2005;102(44):15942–7.
Article CAS PubMed PubMed Central Google Scholar
Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, Ramachandran S, Cann HM, Barsh GS, Feldman M, Cavalli-Sforza LL, et al. Worldwide human relationships inferred from genome-wide patterns of variation. Science. 2008;319(5866):1100–4.
Article CAS PubMed Google Scholar
Mallick S, Li H, Lipson M, Mathieson I, Gymrek M, Racimo F, Zhao M, Chennagiri N, Nordenfelt S, Tandon A, et al. The Simons genome diversity project: 300 genomes from 142 diverse populations. Nature. 2016;538(7624):201–6.
Article CAS PubMed PubMed Central Google Scholar
Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. A global reference for human genetic variation. Nature. 2015;526(7571):68–74.
Article PubMed Google Scholar
Duret L, Semon M, Piganeau G, Mouchiroud D, Galtier N. Vanishing GC-rich isochores in mammalian genomes. Genetics. 2002;162(4):1837–47.
CAS PubMed PubMed Central Google Scholar
Marais G. Biased gene conversion: implications for genome and sex evolution. Trends Genet. 2003;19(6):330–8.
Article CAS PubMed Google Scholar
Duret L, Galtier N. Biased gene conversion and the evolution of mammalian genomic landscapes. Annu Rev Genomics Hum Genet. 2009;10:285–311.
Article CAS PubMed Google Scholar
Galtier N, Duret L, Glemin S, Ranwez V. GC-biased gene conversion promotes the fixation of deleterious amino acid changes in primates. Trends Genet. 2009;25(1):1–5.
Article CAS PubMed Google Scholar
Glemin S, Arndt PF, Messer PW, Petrov D, Galtier N, Duret L. Quantification of GC-biased gene conversion in the human genome. Genome Res. 2015;25(8):1215–28.
Article CAS PubMed PubMed Central Google Scholar
Henn BM, Botigue LR, Peischl S, Dupanloup I, Lipatov M, Maples BK, Martin AR, Musharoff S, Cann H, Snyder MP, et al. Distance from sub-Saharan Africa predicts mutational load in diverse human genomes. Proc Natl Acad Sci U S A. 2016;113(4):E440–9.
Article CAS PubMed Google Scholar
Do R, Balick D, Li H, Adzhubei I, Sunyaev S, Reich D. No evidence that selection has been less effective at removing deleterious mutations in Europeans than in Africans. Nat Genet. 2015;47(2):126–31.
Subramanian S. Europeans have a higher proportion of high-frequency deleterious variants than Africans. Hum Genet. 2016;135(1):1–7.
Lachance J, Tishkoff SA. Biased gene conversion skews allele frequencies in human populations, increasing the disease burden of recessive alleles. Am J Hum Genet. 2014;95(4):408–20.
Xue C, Chen H, Yu F. Base-biased evolution of disease-associated mutations in the human genome. Hum Mutat. 2016;37(11):1209–14.
Kircher M, Witten DM, Jain P, O'Roak BJ, Cooper GM, Shendure J. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet. 2014;46(3):310–5.
Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009;19(9):1655–64.
Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. A map of human genome variation from population-scale sequencing. Nature. 2010;467(7319):1061–73.
Racimo F, Schraiber JG. Approximation to the distribution of fitness effects across functional categories in human segregating polymorphisms. PLoS Genet. 2014;10(11):e1004697.
Subramanian S. Using the plurality of codon positions to identify deleterious variants in human exomes. Bioinformatics. 2015;31(3):301–5.
Watterson GA. On the number of segregating sites in genetical models without recombination. Theor Popul Biol. 1975;7:256–76.
Conrad DF, Keebler JE, DePristo MA, Lindsay SJ, Zhang Y, Casals F, Idaghdour Y, Hartl CL, Torroja C, Garimella KV, et al. Variation in genome-wide mutation rates within and between human families. Nat Genet. 2011;43(7):712–4.
Roach JC, Glusman G, Smit AF, Huff CD, Hubley R, Shannon PT, Rowen L, Pant KP, Goodman N, Bamshad M, et al. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science. 2010;328(5978):636–9.

Acknowledgments

The author thanks Alex Quin for critical comments.

Funding

This study was supported by a grant from the Australian Research Council (LP160100594).

Author information

Authors and Affiliations

GeneCology Research Centre, University of the Sunshine Coast, 90 Sippy Downs Drive, Sippy Downs, QLD 4556, Australia
Sankar Subramanian

Authors

Sankar Subramanian

Contributions

Conceptualization: SS; Data Analysis: SS; Writing: SS.

Corresponding author

Correspondence to Sankar Subramanian.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The author declares that he has no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1:

Figure S1. Correlation between effective population size (N_e) and the ratio of AT→GC to GC→AT (β) estimated for homozygous SNVs from whole genomes of the 1000 Genomes Project. The relationship was highly significant (P < 10⁻⁶).

Figure S2. Relationship between the effective population size (N_e) and the ratio of nucleotide changes within the same types (β): (A) within strong types, i.e. C→G/G→C; (B) within weak types, i.e. A→T/T→A. The ratios were estimated using the high frequency SNVs (DAF > 0.9) belonging to 27 populations obtained from the 1000 Genomes Project.

Figure S3. Relationship between the effective population size (N_e) and the ratio of nucleotide changes within the same types (β): (A) within strong types, i.e. C→G/G→C; (B) within weak types, i.e. A→T/T→A. The ratios were estimated using the homozygous SNVs belonging to 126 populations obtained from the Simons Genome Diversity Project.

Figure S4. Relationship between the effective population size (N_e) and the normalized ratio of AT→GC to GC→AT changes (β′), calculated using equation 2 (see Methods). A↔T and G↔C changes were used to normalize AT→GC and GC→AT changes, respectively. (A) High frequency SNVs (DAF > 0.9) and (B) homozygous SNVs from the 1000 Genomes Project; (C) homozygous SNVs from the Simons Genome Diversity Project. The relationships were highly significant (P < 10⁻⁶).

Figure S5. The relationship between effective population size (N_e) and the normalized ratio AT→GC/GC→AT (β) estimated for homozygous SNVs present in individual genomes belonging to 126 distinct populations of the world. This is very similar to Fig. 4A, except that the nucleotide diversities of non-Africans were reduced by 5% when calculating N_e in order to neutralize the difference in mutation accumulation rates between Africans and non-Africans reported recently. The correlation was highly significant (P < 10⁻⁶).

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Subramanian, S. Population size influences the type of nucleotide variations in humans. BMC Genet 20, 93 (2019). https://doi.org/10.1186/s12863-019-0798-9

Received: 04 July 2019
Accepted: 01 December 2019
Published: 05 December 2019
DOI: https://doi.org/10.1186/s12863-019-0798-9
