Calculating expected DNA remnants from ancient founding events in human population genetics
© Stacey et al. 2008
Received: 10 July 2008
Accepted: 17 October 2008
Published: 17 October 2008
Recent advancements in sequencing and computational technologies have led to rapid generation and analysis of high-quality genetic data. Such genetic data have achieved wide acceptance in studies of historic human population origins and admixture. However, in studies relating to small, recent admixture events, genetic factors such as historic population sizes, genetic drift, and mutation can have pronounced effects on data reliability and utility. To address these issues we conducted genetic simulations targeting influential genetic parameters in admixed populations.
We performed a series of simulations, adjusting variable values to assess the effect of these genetic parameters on current human population studies and on what these studies infer about past population structure. Final mean allele frequencies varied from 0.0005 to over 0.50, depending on the parameters.
The results of the simulations illustrate that, while genetic data may be sensitive and powerful in large genetic studies, caution must be used when applying genetic information to small, recent admixture events. For some parameter sets, genetic data will not be adequate to detect historic admixture. In such cases, studies should consider anthropologic, archeological, and linguistic data where possible.
In the past 20 years, DNA sequence data and advanced computational techniques have provided an unparalleled resource in the study of human origins and migration. These tools have demonstrated a Pleistocene colonization of America by Asian populations [3, 4] and have even prompted calculations of the size of the original human founding populations. Similarly, DNA sequence data have helped demonstrate the dynamics of large human populations such as early human migration out of Africa, the American migration, the Lemba migration in Africa, the migratory history of the Baltic States, and many others. Researchers have even used the population genetics of human disease vectors to trace human migration events. It would be difficult to overestimate the role genetic data have played, and will continue to play, in our ability to reconstruct historic population events.
But while sequence data have been used to study many forms of human migration, their utility in the study of small-scale migration is still in question. Research into small migrations like the Norse settlements in Greenland, a possible Polynesian migration to the New World, the North African slave migration to America, and the pre-Columbian European migration to America [13, 14] has traditionally been based primarily on evidence other than DNA sequence information. Recently, however, researchers have begun to apply genetic data to these smaller historical migrations and to draw conclusions about small historic populations using current DNA. For example, DNA information has recently been used to study the small indigenous populations of Tierra del Fuego, and to analyze Caucasian admixture in specific African American populations [16, 17]. It should be noted that genetic data have been used to study the large Norse migration to Ireland, but are an afterthought when researching the short-lived Norse occupation of Canada [19, 20].
This raises questions about the utility of genetic data in providing evidence for historic migrations and inferences of unknown past events. While genetic studies can provide considerable information, they are also accompanied by variation and stochasticity. Because of these limitations, even the most complete studies of human populations have been called "not unequivocal" or "sobering" by those conducting the research. Recent reports have also addressed the limited depth of current genetic studies, indicating that most studies draw conclusions after sequencing less than 1% of subjects' genomes and sampling only a small number of individuals from a population. Such methods can be especially problematic when dealing with historic admixture events that are very small. The difficulty is a function of the current architecture of genetic studies: researchers sample loci from a group of individuals and categorize individuals into groups based on which alleles they have at the loci tested [24, 25]. These categorizations are based on the most prevalent or probable genetic markers in an individual's genome. The results of these studies, then, can overlook genetic markers that simply are not sampled, which is common in small admixture events. Additionally, stochastic events can lead to allele fixation and further complicate matters, particularly in small populations. It has been suggested that studies of even the largest migrations should couple genetic information with archeological, anthropological, and linguistic data.
As our ability to collect and analyze DNA sequence data increases, understanding the probabilities and variability associated with admixture becomes especially important. In this study, we explore the utility of DNA sequence data in small, recent human migration studies. We use forward-based genetic simulation to explore three questions: 1) what variables contribute to the presence (or absence) of historic markers in today's genomes, 2) how do these variables affect the probability of finding historically admixed DNA in today's populations, and 3) how can studies be designed to maximize information from genetic data? These questions are answered through genetic simulation and a sample size study aimed at suggesting the numbers of subjects and loci that should be sampled to successfully detect small-scale admixture. In our simulations, we assume that migrant allele frequencies are known a priori. The simulations test our ability to detect these known migrant alleles in admixed descended populations. We find that genetic parameters, the stochasticity of genetic drift, and experimental design all play an important role in the ability to find historic DNA in current admixed populations.
We used the simuPOP software package for forward-based genetic simulations. In each simulation, a "migrant" population with distinct, known alleles was admixed with a "native" population. We followed the combined population through time and recorded the frequency of migrant alleles at each generation. Because migrant genetic parameters were known a priori, these simulated allele frequencies allow us to assess how parameters affect the ability to detect migrant alleles in an admixed descendant population. We used a generation time of 23 years as a compromise among differing estimates of human generation times [28–30]. The simuPOP module allows numerous genetic variables to be altered and studied independently. The variables of interest in these initial simulations are basic genetic variables: native population size, migrant population size, mutation rate, time since admixture event, and initial allele frequencies. These variables allow us to assess the effect that population sizes, mutation, genetic drift, and allele frequency have on the amount of migrant DNA present in the admixed population after a number of generations. Our simulations have been designed so that total population sizes are as analogous to effective population sizes (Ne) as possible. We assume that each individual has an equal expectation of producing progeny, that sex ratios are equal, and that population size remains constant over time. These assumptions allow the population size used in our study to be interpreted as an effective population size, though under some definitions of Ne the effective size will differ from the values we assigned. The statistics and results in this study are based on the allele frequencies retrieved from the simuPOP software. We imported these numbers into the R statistical package for numeric and graphical analysis.
In our genetic simulations, we make a number of assumptions about the populations: random mating, absence of selection, no gene flow, and constant population size from the time of the migratory event to the present. Actual populations experience some gene flow with neighboring populations [32, 33]; however, we do not consider this in our simulations, in an attempt to create a best-case scenario for the migrant allele. If such gene flow did occur, it could only decrease the chances of detecting the migration event by lowering the frequency of the migrant allele in the admixed population. In addition, real populations often experience growth following admixture. However, assuming that the migrant allele is growing at the same rate as the other alleles (random mating), the allele frequency should not be changed directly by population size increase, although the effects of drift could become less pronounced as a result of a greater population size. Further studies and simulations using population growth rates may be helpful in addressing the effects of population growth.
Our simulations can be grouped into two separate categories. The first is a series of simulations designed to assess how the parameters mentioned above can influence the presence of historic migrant DNA in today's populations. More concretely, these simulations answer this question: how does each genetic parameter affect the frequency of migrant alleles in an admixed population? Our simulations tested the effect of 4 variables: size of migrant population, size of native population, time since admixture event, and mutation rate at the locus of interest. We assigned each variable a high value and a low value based on current literature and ran a total of 16 simulations using a full factorial experimental design, so that every combination of high and low values was simulated. This allowed us to study variables independently and assess how they affect the frequency of the migrant allele over time. We compare the impact of each variable by comparing the migrant allele frequencies of parameter sets that differ only in that variable while the other variables are held constant.
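The two-level, four-factor layout described in this paragraph can be enumerated programmatically. The following is a minimal sketch only: the low and high values are placeholders invented for illustration (they are not the values used in the study), and the names `LEVELS` and `full_factorial` are hypothetical.

```python
from itertools import product

# Hypothetical low/high levels for each factor. These are placeholders,
# NOT the values used in the study.
LEVELS = {
    "migrant_size": (50, 1000),
    "native_size": (1000, 50000),
    "generations": (20, 100),
    "mutation_rate": (1e-5, 1e-3),
}


def full_factorial(levels):
    """Return one dict per combination of factor levels.

    For k two-level factors this yields 2**k parameter sets.
    """
    names = list(levels)
    combos = product(*(levels[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


designs = full_factorial(LEVELS)
print(len(designs))  # number of parameter sets for four two-level factors
```

To isolate the effect of one variable, one would compare pairs of parameter sets from `designs` that agree on the other three factors.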
In the first simulations, we modeled only one locus per individual and assumed no recombination; one locus is adequate to assess the role of these parameters on allele frequencies. We also initialized the migrant population with the migrant allele fixed (all migrant individuals possessed the migrant allele). This is unrealistic, but provides a best-case scenario for detecting the migrant allele. We replicated each simulation 250 times.
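A single-locus forward simulation of this kind can be sketched in plain Python. The block is not the simuPOP script used in this study; it is a simplified illustration under stated assumptions: a haploid Wright-Fisher population (one gene copy per individual), one-way mutation that converts migrant alleles to other states, and small placeholder parameter values so that it runs quickly. The function names are hypothetical.

```python
import random
import statistics


def simulate_final_frequency(n_migrant, n_native, generations, mutation_rate, rng):
    """Run one replicate and return the final migrant allele frequency.

    Assumptions (simplifications, not the study's exact model):
    haploid individuals, constant population size, random mating,
    no selection, and the migrant allele fixed in the migrants at time 0.
    """
    n_copies = n_migrant + n_native
    freq = n_migrant / n_copies
    for _ in range(generations):
        # One-way mutation removes a fraction of migrant alleles.
        p = freq * (1.0 - mutation_rate)
        # Genetic drift: each gene copy in the next generation is drawn at random.
        count = sum(1 for _ in range(n_copies) if rng.random() < p)
        freq = count / n_copies
    return freq


def replicate(n_reps, seed=1, **params):
    """Repeat the simulation and return (mean, standard deviation) of final frequencies."""
    rng = random.Random(seed)
    finals = [simulate_final_frequency(rng=rng, **params) for _ in range(n_reps)]
    return statistics.mean(finals), statistics.stdev(finals)


# Placeholder values chosen only to keep the example fast.
mean_freq, sd_freq = replicate(
    n_reps=50, n_migrant=20, n_native=180, generations=25, mutation_rate=0.0
)
print(mean_freq, sd_freq)
```

Raising `n_reps` to 250 would mirror the number of replicates reported here, at the cost of a longer run time.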
The second simulation category was a single simulation designed to mimic the genetic landscape of a true admixed population. We assigned mid-range values for migrant population size (200), native population size (5,000), and generations (100). In order to more realistically model a current study, we followed 1,000 loci on 20 different chromosomes on each individual. This represents a sample much larger than the recommended number needed in order to detect large human admixture. A standard recombination rate of 1.26 cM/Mb was used, though the human recombination rate has been shown to be negligible over 100 generations. At the beginning of the simulation, an initial migrant allele frequency and a mutation rate were randomly generated for each of the 1,000 loci on each individual, in order to model the DNA seen in actual human genetics research. The methods of random generation are outlined below.
In order to understand what must be done to successfully study data from historic admixture, we constructed a sample size study using the data from simulation 2. Small human genetics studies test approximately 50 loci when studying populations. Given the calculated frequency of migrant alleles in our simulated population, we calculated the number of migrant alleles that would be seen, on average, in each human subject of a genetic study. This is accomplished using the cumulative distribution function (CDF) of a binomially distributed random variable where the size parameter is 50 and the probability parameter is the expected migrant allele frequency. In comparison, one of the larger human genetic studies to date sequenced 993 loci in each human subject. Accordingly, we followed the same protocol to investigate a study of this magnitude, using the binomial CDF with a size parameter of 993 and the same probability parameter.
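The binomial calculation described in this paragraph can be sketched as follows. The probability parameter 0.01017 is the expected migrant allele frequency reported for simulation 2; the helper names `binom_pmf` and `binom_cdf` are hypothetical, and treating loci as independent draws is the assumption built into the binomial model.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def binom_cdf(k, n, p):
    """Probability of at most k successes in n independent trials."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


MIGRANT_FREQ = 0.01017  # expected migrant allele frequency from simulation 2

for n_loci in (50, 993):
    expected_count = n_loci * MIGRANT_FREQ
    p_at_least_one = 1.0 - binom_cdf(0, n_loci, MIGRANT_FREQ)
    print(n_loci, round(expected_count, 2), round(p_at_least_one, 3))
```

The complement of the CDF at zero gives the chance that a subject shows at least one migrant allele, and the product of size and probability gives the mean count per subject.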
The most recent studies have again raised the bar for loci per subject, sampling 650,000 loci in each individual. Although sampling more loci will find a larger number of migrant alleles, the proportion of such markers in the population does not change when more loci are sampled. The study conducted by Li et al. (2008) sampled about 20 individuals per population group, a number similar to previous studies. Accordingly, we investigated the sample size necessary to find at least one migrant allele at each of the loci sequenced in a large genetic study.
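One way to reason about the number of subjects needed at a single locus is sketched here. This is an illustration, not the procedure used in the study: it assumes diploid subjects contributing two independent gene copies each, considers one locus at a time, and uses a 95% detection target chosen arbitrarily for the example. The function names are hypothetical.

```python
from math import ceil, log


def detection_probability(allele_freq, n_subjects, copies_per_subject=2):
    """Chance of seeing at least one migrant allele at one locus in the sample."""
    return 1.0 - (1.0 - allele_freq) ** (copies_per_subject * n_subjects)


def subjects_needed(allele_freq, target_prob=0.95, copies_per_subject=2):
    """Smallest number of subjects whose detection probability reaches target_prob."""
    if not 0.0 < allele_freq < 1.0 or not 0.0 < target_prob < 1.0:
        raise ValueError("allele_freq and target_prob must lie strictly between 0 and 1")
    copies = log(1.0 - target_prob) / log(1.0 - allele_freq)
    return ceil(copies / copies_per_subject)


MIGRANT_FREQ = 0.01017  # expected migrant allele frequency from simulation 2

print(detection_probability(MIGRANT_FREQ, 20))  # about 20 subjects per group
print(subjects_needed(MIGRANT_FREQ))            # subjects needed for the 95% target
```

Requiring detection at every locus simultaneously would demand a larger sample than this single-locus figure, because the per-locus probabilities multiply.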
In the case of a large study with as many as 993 loci, based upon the expected migrant allele frequency of 1.017%, almost every subject would demonstrate at least one migrant allele (Figure 7b). In fact, most subjects would demonstrate more than 9 migrant alleles. However, while large studies would expect to succeed in finding more migrant alleles in today's population, this alone cannot link the admixed population to the migrant population. The migrant alleles will still only represent, on average, 1% of every allele sequenced in the entire study. Therefore, although 9 migrant alleles may, on average, be found in each subject, it is hard to know if the migrant alleles will be redundant among loci and subjects or spread evenly throughout all the loci in the study. Additionally, these numbers could be considerably lower depending on the allele frequency in the migrating population.
Our results provide some important insights into detecting historic admixture. The simulations we present illustrate the effect that initial parameters have on the outcome of human admixture. Simple adjustments in the parameters in our simulation series changed the expected allele frequency outcome from as low as 0.0005 to over 0.50, an increase of three orders of magnitude. The results of any admixture study using genetic data, then, are highly dependent on the variables presented in these simulations: mutation rate, population sizes, and time since admixture (number of generations).
High mutation rates can decrease the expected migrant allele frequency and the variability by more than 50 percent, especially in populations that experienced earlier migrations. For example, an increased mutation rate can change the mean final allele frequency from 0.0243 to 0.0128, or from 0.5016 to 0.2699 (depending on other variables, as reported in Figure 3). Researchers should keep this in mind when selecting loci for analysis. Because some DNA mutation rates are highly variable, choice of locus can have a profound impact on the number of migrant alleles detected years later. Many studies advocate the use of mtDNA due to data collecting feasibility and other factors. However, because the mutation rate is generally higher in mtDNA, it could corrupt signal in studies addressing historic admixture, even when the time frame is relatively recent.
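The way mutation rate and elapsed time jointly erode the expected migrant allele frequency can be illustrated with a simple closed-form sketch. It assumes one-way mutation away from the migrant allele with no back mutation, in which case drift leaves the expectation unchanged and only mutation lowers it. The mutation rates and generation counts are placeholders, not the values used in the simulations, and `expected_frequency` is a hypothetical helper.

```python
def expected_frequency(p0, mutation_rate, generations):
    """Expected migrant allele frequency after one-way mutation.

    Each generation a fraction `mutation_rate` of migrant alleles mutates
    to another state, so the expectation decays geometrically.
    """
    return p0 * (1.0 - mutation_rate) ** generations


P0 = 0.05  # placeholder starting frequency

for mu in (1e-6, 1e-4, 1e-2):      # placeholder mutation rates
    for t in (20, 100):            # placeholder generation counts
        print(mu, t, expected_frequency(P0, mu, t))
```

Under this assumption a slowly mutating locus retains nearly all of its expected signal over 100 generations, while a fast-mutating locus loses a large share of it.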
The sizes of the migrant and native populations are fundamental for an understanding of expected allele frequency. With times since admixture as short as those we consider in our simulations, the most important factors are the sizes of the migrating and native populations. In our simulations, if the native population is large, changing the migrating population size results in a change of mean final allele frequency from 0.0243 to 0.0010. If the native population is small, those numbers change to 0.5016 and 0.0407. These are the most significant differences illustrated by our simulations and they attest to the important role of population sizes. Researchers should not expect to find many alleles from a small migratory group of 50 individuals in a large population today, even if sampling methods are exhaustive.
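The dependence on population sizes follows from the starting proportion of migrant gene copies at the time of admixture, which a short calculation makes concrete. In this sketch the migrant group of 50 comes from the text, while the two native population sizes are placeholders chosen for illustration; `initial_migrant_frequency` is a hypothetical helper.

```python
def initial_migrant_frequency(n_migrant, n_native, migrant_allele_freq=1.0):
    """Frequency of the migrant allele immediately after admixture.

    `migrant_allele_freq` is the allele's frequency within the migrant group
    (1.0 corresponds to the best case in which it is fixed in the migrants).
    The allele is assumed to be absent from the native population.
    """
    return migrant_allele_freq * n_migrant / (n_migrant + n_native)


# 50 migrants entering a small or a large native population (placeholder sizes).
for n_native in (1_000, 50_000):
    print(n_native, initial_migrant_frequency(50, n_native))
```

In the absence of mutation and selection, drift does not change the expected frequency, so this starting proportion is also the expected frequency in later generations.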
Additionally, we see that time plays an important role. The standard deviations presented in Table 1 demonstrate that allelic frequencies vary widely, particularly as the number of generations increases. High mutation rates combined with large time spans can reduce migrant allele frequencies significantly. When the mutation rate is low, however, the time since admixture does not affect the final mean allele frequency much (or at all), but it still has a profound impact on the standard deviation. For example, a change in time since admixture in one parameter set almost doubles the standard deviation from 0.0525 to 0.1044. As time increases, genetic drift causes the spread of final allele frequencies to increase, particularly when the population sizes are small. Thus, as the time since the admixture event increases, sample size for both loci and subjects becomes increasingly important.
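The growth of the spread with time can be compared against the standard Wright-Fisher expectation for pure drift, sketched here. The formula assumes a constant-size, randomly mating diploid population with no mutation or selection; the parameter values are placeholders rather than the study's, and `drift_sd` is a hypothetical helper.

```python
from math import sqrt


def drift_sd(p0, n_individuals, generations, ploidy=2):
    """Standard deviation of allele frequency after pure drift.

    Uses Var(p_t) = p0 * (1 - p0) * (1 - (1 - 1/C)**t), where C is the
    number of gene copies in the population (ploidy * n_individuals).
    """
    copies = ploidy * n_individuals
    variance = p0 * (1.0 - p0) * (1.0 - (1.0 - 1.0 / copies) ** generations)
    return sqrt(variance)


# Placeholder parameter values for illustration.
for n_individuals in (500, 5_000):
    for t in (20, 100):
        print(n_individuals, t, drift_sd(0.04, n_individuals, t))
```

The expression shows both trends noted in the text: the spread increases with the number of generations and is larger when the population is small.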
In our second simulation, most of the migrant alleles are present in less than 2% of the population. In a study where only a few subjects from each of many human populations are sampled, alleles from a small-scale admixture will usually not be recovered at all. Moreover, these rare alleles could easily be ignored in favor of haplotypes that better categorize the population into clusters.
Our results demonstrate a profound and general fact: the values of these genetic parameters can drastically alter the expected frequency of migrant alleles in today's populations. Even in our simulations, where steps have been taken to ensure a best-case scenario for the migrant allele, there is often a large spread of possible outcomes. DNA data have been touted as a panacea for recovering information about the past, but their use depends so extensively on factors that are beyond our control that their application is not always appropriate. It is imperative, therefore, that researchers understand the implications of the variables we have presented and not rely solely on DNA sequence data when researching small, recent human migrations. We can only hope to understand basic details of population history when quantifying genetic data, and even valid results derived from genetic data may still be misleading if viewed in isolation, as demonstrated by Harpending et al. [46, 47].
Improving probability of detecting historic admixture
• Large migrant population
• Small native population
• Low mutation rate at loci of interest
• Fewer generations since admixture event
• Identify informative migrant alleles
• Test large number of loci
• Large sample size for each population
• Establish methods for detecting rare alleles
• Collaborative approach (archeology, anthropology, linguistics)
Perhaps most importantly, it must be remembered that drift is stochastic and that historic genetic parameters are, for the most part, unknown. Thus, the absence of specific genetic data is not conclusive evidence against historic admixture. Our results illustrate several parameter sets that would cause admixture to be either completely or practically undetectable today. To address the inconsistent results found in DNA, all but the largest genetic studies need to continue to consider anthropologic, archeological, and linguistic data when formulating conclusions. Finally, our study demonstrates the utility of simulation studies to put bounds on parameter values and sample sizes for studies of human migration events.
The ability to detect historic admixture and make correct inferences based on genetic data depends on the interplay between population sizes, mutation rates, time, and other parameters. We explore the parameter space of historic alleles in current populations and demonstrate the broad implications of each of these genetic parameters on modern allele frequencies. Our results provide guidelines with respect to the population genetic parameters and their values needed to detect migrant alleles in an admixed population. While studies that focus on large admixture events should be able to draw specific and valid conclusions, we suggest that genetic data be used with caution when studying small admixture events. The random nature of admixed genetic data seen in these simulations demonstrates that the utility of genetic data is dependent on the context of each individual study. Increasing the number of loci and the number of individuals sampled will increase the probability of detecting small traces of signal, but other sources of evidence should always be considered where possible.
We thank Ryan Parr for comments on an earlier draft of this manuscript. This work was supported by an Eliza R. Snow Fellowship from Brigham Young University.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.