Estimating demographic parameters from large-scale population genomic data using Approximate Bayesian Computation
- Sen Li^{1} and
- Mattias Jakobsson^{1, 2}
DOI: 10.1186/1471-2156-13-22
© Li and Jakobsson; licensee BioMed Central Ltd. 2012
Received: 31 August 2011
Accepted: 27 March 2012
Published: 27 March 2012
Abstract
Background
The Approximate Bayesian Computation (ABC) approach has been used to infer demographic parameters for numerous species, including humans. However, most applications of ABC still use limited amounts of data, from a small number of loci, compared to the large amount of genome-wide population-genetic data which have become available in the last few years.
Results
We evaluated the performance of the ABC approach for three 'population divergence' models - similar to the 'isolation with migration' model - when the data consist of several hundred thousand SNPs typed for multiple individuals, by simulating data from known demographic models. The ABC approach was used to infer demographic parameters of interest, and we compared the inferred values to the true parameter values that were used to generate hypothetical "observed" data. For all three case models, the ABC approach inferred most demographic parameters (for example, population divergence times and past population sizes) quite well and with narrow credible intervals, but some parameters were more difficult to infer, such as population sizes at present and migration rates. We compared the ability of different summary statistics to infer demographic parameters, including haplotype and LD based statistics, and found that the accuracy of the parameter estimates can be improved by combining summary statistics that capture different parts of the information in the data. Furthermore, our results suggest that poor choices of prior distributions can in some circumstances be detected using ABC. Finally, increasing the amount of data beyond some hundred loci will substantially improve the accuracy of many parameter estimates using ABC.
Conclusions
We conclude that the ABC approach can accommodate realistic genome-wide population genetic data, which may be difficult to analyze with full likelihood approaches, and that the ABC can provide accurate and precise inference of demographic parameters from these data, suggesting that the ABC approach will be a useful tool for analyzing large genome-wide datasets.
Background
Bayesian inference rests on Bayes' theorem, which can be written as
$P(\theta|D) \propto P(\theta)\,P(D|\theta)$,
where P(θ|D) is the conditional distribution of some parameter of interest (θ) given the data (D), P(θ) is the prior distribution of the parameter, and P(D|θ) is the probability of the data given the parameter (the likelihood function: L(θ) = P(D|θ)). According to the expression above, the conditional distribution P(θ|D) of the parameter given the data, which is called the posterior distribution, is proportional to the product of the prior distribution and the likelihood. For most practical cases in evolutionary biology and population genetics, the likelihood function is very difficult to compute because of the large amount of data and the potentially complex models, and exact approaches requiring evaluation of the full likelihood are often restricted to simple evolutionary models [4].
Approximate Bayesian Computation (ABC) can be used to make inference for complex models with high dimensional data [5]. First, in order to overcome the difficulty in evaluating exact likelihoods, ABC approximates the likelihoods of the parameters based on a tolerance level (with respect to some metric) for the difference between observed and simulated data. As the tolerance level goes to zero, ABC produces a sample from the posterior distribution. Second, to overcome the difficulty in evaluating high dimensional data, ABC evaluates summary statistics that reduce the dimensionality of the data. The summary statistics are ideally chosen so that they capture as much information as possible from the data about the parameter(s) of interest. However, because only non-sufficient summary statistics exist for most complex models and parameters of interest, the effects of such statistics will be case-dependent, and the effect of mapping the data space to an arbitrarily chosen space of non-sufficient summary statistics is not well known. Tavaré et al. (1997) [5] described a straightforward rejection algorithm for approximate Bayesian inference, which was extended by Pritchard et al. (1999) [6] to allow some level of deviation between the observed and simulated data. In short, their algorithm proceeds as follows. First, simulate a large number of datasets based on different values for some parameter of interest, where the parameter values are sampled from a prior distribution. Next, calculate summary statistics of the simulated datasets, and accept or reject parameter values on the basis of the difference between the simulated summary statistics and the summary statistics of some observed data. Finally, the accepted parameter values represent an approximate sample from the posterior distribution of the parameter of interest.
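The rejection scheme described above can be illustrated with a minimal sketch. The toy model here is an assumption made purely for illustration: the "simulator" draws from a normal distribution with unknown mean instead of running a coalescent simulation, the prior is uniform on (-5, 5), and the single summary statistic is the sample mean. The function names `simulate`, `summary` and `abc_rejection` are hypothetical and do not come from the original study.

```python
import random
import statistics

def simulate(theta, n, rng):
    """Toy simulator (assumed): n draws from Normal(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]

def summary(data):
    """Summary statistic of a dataset: here simply the sample mean."""
    return statistics.fmean(data)

def abc_rejection(s_obs, n_obs, n_sims, tolerance, rng):
    """Keep prior draws whose simulated summary lies within `tolerance` of s_obs."""
    accepted = []
    for _ in range(n_sims):
        theta = rng.uniform(-5.0, 5.0)                 # draw a candidate from the prior
        s_sim = summary(simulate(theta, n_obs, rng))   # simulate data and summarise it
        if abs(s_sim - s_obs) <= tolerance:            # accept or reject the candidate
            accepted.append(theta)
    return accepted

rng = random.Random(1)
observed = simulate(2.0, 50, rng)   # stand-in for observed data (true theta = 2)
posterior = abc_rejection(summary(observed), 50, 20000, 0.1, rng)
print(len(posterior), statistics.fmean(posterior))
```

The accepted values in `posterior` play the role of the approximate posterior sample; lowering `tolerance` tightens the approximation at the cost of fewer accepted draws.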
There are several more advanced versions of the basic rejection approach, such as local linear regression adjustment [7], non-linear feed-forward neural networks [8], ABC with Markov chain Monte Carlo [3], and ABC with sequential Monte Carlo [9].
Population divergence models, or 'isolation with migration' models, have been used extensively in order to describe properties of populations and species, and to explore increasingly complex demographic scenarios e.g. [10–18]. These models can often be good approximations of scenarios that involve populations splitting off from an ancestral population e.g. [13, 18], such as the colonization of islands or distant continents, or the domestication of livestock and crops. In recent years, ABC has also been used to infer demographic parameters of humans from genetic data. For example, Fagundes et al. (2007) [19] estimated several demographic and historical parameters using divergence models, such as the timing of modern humans' exodus from Africa and the time of colonization of the Americas, based on data from 50 nuclear loci sequenced in African, Asian and Native American samples. Similarly, Wegmann et al. (2009) [20] applied ABC with a Markov chain Monte Carlo approach to estimate divergence times and migration rates between three African populations based on 331 microsatellites. Bertorelle et al. (2010) [21] conducted a survey of ABC-related publications since 2002. In their survey, they found that 43% of publications used microsatellite markers as the source of genetic information, which was the most common source, and the remaining fraction was divided between nuclear and mitochondrial sequence data. The median number of loci for STR and nuclear sequence data was 9 and 19, respectively. Bertorelle et al. (2010) [21] concluded that most applications of ABC still use limited amounts of data, often due to using a small number of loci, compared to the amount of genome-wide population-genetic data which has become available in the last few years [22–26]. Recently, Wollstein et al. (2010) [27] used an ABC approach to investigate the demographic history of Oceania based on approximately 1 million SNPs.
Based on that data, and accounting for ascertainment bias, they could provide a more detailed picture of human history and the peopling of Oceania than has previously been painted. However, most studies that use ABC are based on a small number of markers (e.g. Bertorelle et al. (2010) [21]), leading in many cases to imprecise parameter estimation [28] and to questions about the power of ABC under some scenarios [29, 30].
In this article, we investigate the performance and power of the ABC approach when we have access to large amounts of genome-wide population-genetic data. We study the ABC approach with local linear regression adjustment for several population divergence models. Simulated data is generated under 'human-like' conditions and from a particular known demographic model (some 150,000 to 300,000 SNPs are generated). Three population divergence models with increasing complexity are investigated, and we compare estimation accuracy of particular parameters under the different models. The effect of the number of loci on the performance of the ABC approach is also investigated.
Methods
Population models
In the second model (Figure 1B), an ancestral population was split into two sub-populations with size N_{1}^{'} and N_{2}^{'} at time T. The size of the ancestral population was set to N_{A} = N_{1}^{'} + N_{2}^{'}. Each sub-population grew exponentially starting at time T with different rates α_{1} and α_{2}. At present, the size of each sub-population was N_{1} and N_{2}. In this model, we assumed that there was no migration between populations. We aim to estimate past population sizes N_{1}^{'} and N_{2}^{'}, and present population sizes N_{1} and N_{2}. Since both past and present population sizes will be drawn from prior distributions for each proposed set of parameters, the growth rates α_{1} and α_{2} are fixed and can be computed by $\alpha_{i} = \ln\left(N_{i}/N_{i}^{\prime}\right)$, $i = 1$ or $2$. The divergence time T was assumed to be known in this model (T = 0.1 × 4N_{e} = 4,000 generations).
The third model (Figure 1C) was a combination of model 1 and model 2. We treated all parameters as unknown and were interested in estimating all seven of them: divergence time T, migration rates m_{12} and m_{21}, past population sizes N_{1}^{'} and N_{2}^{'}, and present population sizes N_{1} and N_{2}.
Simulated data
Population-genetic SNP data, comparable in size to human SNP-chip data or large scale re-sequencing data, was simulated using Hudson's ms program [31]. We simulated 10,000 genome-regions each of size 100 kb, for a total of 200 chromosomes (200 haploid individuals or 100 diploid individuals), 100 chromosomes from each sub-population. The mutation rate (θ = 4N_{e}μ) per genome-region was set to 5, the recombination rate (ρ = 4N_{e}r) per genome-region was set to 40. By assuming an effective population size of N_{e} = 10,000, the mutation rate θ corresponded to μ_{s} = 1.25 × 10^{-9} per base pair per generation, and the recombination rate ρ corresponded to r_{s} ≈ 1.00 × 10^{-8} per base pair per generation. The recombination rate r was chosen to match recombination rates estimated from large-scale genomic data and the mutation rate μ was chosen to correspond to an incomplete set of SNPs (which was lower than the value ~10^{-8} for human genome) [32], on the order of 1/8^{th} of all SNPs in the region. Furthermore, SNPs with minor allele frequency less than or equal to 5% were removed in order to mimic ascertained SNPs. With these values of the parameters, one replicate of the simulated data resulted in a few hundred thousand SNPs. The 10,000 genome-regions were assumed to be independent of each other. The ms commands for generating data under the three models are provided in the Additional file 1.
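The conversion from the scaled per-region rates to per-base-pair rates quoted above is simple arithmetic, and it can be checked with a short sketch. The helper names `per_site_rate` and `passes_maf_filter` are hypothetical, and the filter assumes that the minor allele frequency is evaluated on a single pooled count per SNP, which the text does not specify.

```python
def per_site_rate(scaled_rate, n_e, region_length_bp):
    """Convert a scaled rate (4 * N_e * x per region) into x per base pair per generation."""
    return scaled_rate / (4 * n_e) / region_length_bp

N_E = 10_000          # effective population size assumed in the text
REGION_BP = 100_000   # each genome-region is 100 kb

mu_s = per_site_rate(5, N_E, REGION_BP)    # theta = 5 per genome-region
r_s = per_site_rate(40, N_E, REGION_BP)    # rho = 40 per genome-region
print(mu_s, r_s)

def passes_maf_filter(allele_count, n_chromosomes, threshold=0.05):
    """Keep a SNP only if its minor allele frequency is strictly above the threshold."""
    minor_count = min(allele_count, n_chromosomes - allele_count)
    return minor_count / n_chromosomes > threshold
```

With these inputs, `mu_s` and `r_s` reproduce the values 1.25 × 10^{-9} and 1.00 × 10^{-8} given in the text, and a SNP whose minor allele is carried by 10 of 200 chromosomes (exactly 5%) is removed by the filter.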
ABC with local linear regression adjustment
To estimate demographic parameters, we used the ABC approach with local linear regression adjustment introduced by Beaumont et al. (2002) [7], which also utilizes smooth weighting of candidate parameters instead of relying only on a rejection algorithm (e.g. Pritchard et al. 1999 [6]). The linear regression is an innovation that has been successful in reducing the computational load of ABC, but, because the sampled values are post-processed, it can produce nonsense values for the posterior distribution that fall outside the prior distribution. Parameter values not included in the prior distribution cannot appear in the posterior distribution in parametric models as a consequence of Bayes' theorem, and a transformation of the accepted parameter values [33] solves this issue, so that only values that appear in the prior distribution can appear in the posterior distribution.
In detail, our ABC procedure can be described as follows:
(1) Sample a set of candidate parameters, θ_{ i } , from each of the prior distributions;
(2) Simulate population-genetic data using the sampled parameter-values θ_{ i } under a particular model;
(3) Compute summary statistics from simulated data;
(4) Compute the Euclidean norm of the differences between the set of simulated summary statistics, $S_{i}^{*} \equiv \left(S_{i1}^{*}, \dots, S_{iq}^{*}\right)$, and the set of observed summary statistics, $S \equiv \left(S_{1}, \dots, S_{q}\right)$, so that $\|S_{i}^{*} - S\| = \sqrt{\sum_{j=1}^{q} \left(S_{ij}^{*} - S_{j}\right)^{2}}$, where q is the number of summary statistics. All summary statistics were standardized before computing the Euclidean norm.
(5) Select a fixed fraction of candidate parameter-sets that have the smallest values of $\|S_{i}^{*} - S\|$ and use the Epanechnikov kernel to weight the candidate parameter-sets. Adjust candidate parameters by using a local linear regression approach [7].
We generated 50,000 replicate simulated datasets for each model and choice of prior that was investigated, which corresponds to 500 million simulated genome-regions of size 100 kb. The tolerance level was set to 1% (except for the investigation of tolerance level). In order to ensure that the estimated posterior distribution obtained by the local linear regression approach stayed within the bounds of the prior distribution, we transformed the values of the parameters accepted in the rejection step before the regression step [33]. The adjusted parameter-values obtained in this way are treated as draws from the posterior distributions, and can be used as an approximation of the posterior distributions of the parameters of interest.
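Steps (4) and (5), together with the transformation step, can be sketched for the simplest possible case of one parameter and one summary statistic. Several choices below are assumptions made for illustration rather than the authors' implementation: the logit-type map stands in for the transformation of [33], the toy "simulator" returns the parameter plus Gaussian noise, and all function names are hypothetical. With several summary statistics the regression would be multivariate, but the logic is the same.

```python
import math
import random

def to_unbounded(theta, lo, hi):
    """Logit-type map from (lo, hi) to the real line (assumed stand-in for [33])."""
    p = (theta - lo) / (hi - lo)
    return math.log(p / (1.0 - p))

def to_bounded(phi, lo, hi):
    """Inverse of to_unbounded; the result always lies inside (lo, hi)."""
    return lo + (hi - lo) / (1.0 + math.exp(-phi))

def abc_regression(thetas, stats, s_obs, lo, hi, tolerance=0.01):
    """Return (adjusted values, kernel weights) for one parameter and one statistic."""
    # Standardise the statistic so that distances are scale-free.
    mean_s = sum(stats) / len(stats)
    sd_s = math.sqrt(sum((s - mean_s) ** 2 for s in stats) / (len(stats) - 1))
    diffs = [(s - s_obs) / sd_s for s in stats]

    # Keep the fraction `tolerance` of simulations closest to the observed statistic.
    n_keep = max(2, round(tolerance * len(thetas)))
    order = sorted(range(len(thetas)), key=lambda i: abs(diffs[i]))[:n_keep]
    delta = abs(diffs[order[-1]])                 # largest accepted distance

    x = [diffs[i] for i in order]
    y = [to_unbounded(thetas[i], lo, hi) for i in order]
    w = [1.0 - (xi / delta) ** 2 for xi in x]     # Epanechnikov kernel, constant dropped

    # Weighted least squares fit of y = a + b * x.
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx

    # Project each accepted value onto x = 0 (the observed statistic), then back-transform.
    adjusted = [to_bounded(yi - slope * xi, lo, hi) for xi, yi in zip(x, y)]
    return adjusted, w

rng = random.Random(7)
prior_lo, prior_hi = 0.0, 0.5                     # uniform prior, as used for T
thetas = [rng.uniform(prior_lo, prior_hi) for _ in range(50000)]
stats = [rng.gauss(t, 0.02) for t in thetas]      # toy simulator: statistic = T + noise
adjusted, weights = abc_regression(thetas, stats, 0.1, prior_lo, prior_hi)
post_mean = sum(wi * ti for wi, ti in zip(weights, adjusted)) / sum(weights)
print(post_mean)
```

Because the regression is carried out on the transformed scale, every back-transformed value in `adjusted` falls inside the prior bounds, which is the property the transformation step is meant to guarantee.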
Summary statistics
We used eight different classes of summary statistics for the ABC approach. All summary statistics were computed individually for the two sub-populations, except for F_{ ST } , resulting in a total of 15 summary statistics. Many of the summary statistics were based on 'haplotypes' and the genetic data were assumed to be phased. We define a haplotype locus by the chunk of DNA that extends from the SNP position a along the genome to the SNP position a + w (for a particular window size w). A haplotype-allele is defined as the combination of variants at all SNPs within the window w for a particular chromosome (and for a particular haplotype locus) [34]. The 8 different classes of summary statistics are:
where n is the number of sampled chromosomes.
where n is the number of sampled chromosomes, and S is the number of SNPs.
where f_{ ij } is the frequency of allele i for SNP j, n is the number of sampled chromosomes, and S is the number of SNPs.
where j_{1} and j_{2} denote the frequency of allele 1 and allele 2 at SNP J and k_{1} and k_{2} denote the frequency of allele 1 and allele 2 at SNP K, and x_{11} denotes the frequency of the J_{1}K_{1} haplotype. The average r^{2} is computed across all pairs of SNPs (that are located between 9 kb and 11 kb from each other) to get LDR.
(5) The number of distinct haplotype-alleles (NOA) per genome-region.
(6) The number of private haplotype-alleles in each sub-population (NPA) per genome-region.
(7) Tajima's D (TAD). Computed for all SNPs in each genome-region following [36].
(8) F_{ ST } (FST). Computed for all SNPs in each genome-region using equation 5.3 in [37].
The summary statistics were computed for each of the two populations, except for F_{ ST } , and all summary statistics were averaged across the 10,000 genome regions. We also tested using the variances (across the genome regions) instead of the means in the ABC procedure; see the Discussion.
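As an illustration of the LD-based statistic, the standard definition of r^{2} in terms of the quantities named above (allele frequencies j_{1}, j_{2}, k_{1}, k_{2} and haplotype frequency x_{11}) is $r^{2} = (x_{11} - j_{1}k_{1})^{2} / (j_{1}j_{2}k_{1}k_{2})$. The sketch below is a hedged example, not the authors' code: it assumes phased haplotypes coded as 0/1 (with 1 playing the role of allele 1), SNPs that are polymorphic in the sample, and an inclusive 9 kb to 11 kb distance window; the function names are hypothetical.

```python
def r_squared(snp_j, snp_k):
    """r^2 between two biallelic SNPs, each a list of 0/1 alleles (one per chromosome)."""
    n = len(snp_j)
    j1 = sum(1 for a in snp_j if a == 1) / n      # frequency of allele 1 at SNP J
    k1 = sum(1 for b in snp_k if b == 1) / n      # frequency of allele 1 at SNP K
    j2, k2 = 1.0 - j1, 1.0 - k1
    # Frequency of the haplotype carrying allele 1 at both SNPs.
    x11 = sum(1 for a, b in zip(snp_j, snp_k) if a == 1 and b == 1) / n
    return (x11 - j1 * k1) ** 2 / (j1 * j2 * k1 * k2)

def mean_r_squared(positions, haplotypes, min_dist=9_000, max_dist=11_000):
    """Average r^2 over SNP pairs whose distance in bp lies in [min_dist, max_dist].

    haplotypes[s] holds the alleles of SNP s; returns None if no pair qualifies.
    """
    values = []
    for a in range(len(positions)):
        for b in range(a + 1, len(positions)):
            if min_dist <= abs(positions[b] - positions[a]) <= max_dist:
                values.append(r_squared(haplotypes[a], haplotypes[b]))
    return sum(values) / len(values) if values else None
```

In the setting of the text, `mean_r_squared` would be evaluated per genome-region and per sub-population and then averaged across the 10,000 regions.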
Results
Comparison of population models
True values and prior distributions of each parameter of the 3 models
Parameter | True value | Prior distribution (uniform) | Estimated in |
---|---|---|---|
T | 0.1 | (0, 0.5) | model 1, model 3 |
m _{12} | 2 | (0, 5) | model 1, model 3 |
m _{21} | 1 | (0, 5) | model 1, model 3 |
N _{1} ^{'} | 2,000 | [100, 10,000] | model 2, model 3 |
N _{2} ^{'} | 5,000 | [100, 10,000] | model 2, model 3 |
N _{1} | 100,000 | [10,000, 200,000] | model 2, model 3 |
N _{2} | 150,000 | [10,000, 200,000] | model 2, model 3 |
Summary of the posterior sample for each parameter in model 1
Parameter | True value | Mean | Difference | 95% interval |
---|---|---|---|---|
T | 0.1 | 0.1003 | 0.30% | [0.0952, 0.1063] |
m _{12} | 2 | 1.9803 | 0.99% | [1.6516, 2.1560] |
m _{21} | 1 | 0.9454 | 5.46% | [0.5033, 1.3935] |
Summary of the posterior sample for each parameter in model 2
Parameter | True value | Mean | Difference | 95% interval |
---|---|---|---|---|
N _{1} ^{'} | 2,000 | 2,054 | 2.70% | [1,907, 2,244] |
N _{2} ^{'} | 5,000 | 4,883 | 2.34% | [4,580, 5,141] |
N _{1} | 100,000 | 119,913 | 19.91% | [82,623, 177,468] |
N _{2} | 150,000 | 162,167 | 8.11% | [135,291, 190,948] |
Summary of the posterior sample for each parameter in model 3
Parameter | True value | Mean | Difference | 95% interval |
---|---|---|---|---|
T | 0.1 | 0.1000 | 0.00% | [0.0654, 0.1449] |
m _{12} | 2 | 2.3454 | 17.27% | [0.4864, 4.6692] |
m _{21} | 1 | 0.8356 | 16.44% | [0.0686, 3.1288] |
N _{1} ^{'} | 2,000 | 1,987 | 0.65% | [1,106, 3,178] |
N _{2} ^{'} | 5,000 | 4,943 | 1.14% | [3,838, 5,828] |
N _{1} | 100,000 | 131,615 | 31.62% | [66,355, 192,624] |
N _{2} | 150,000 | 160,328 | 6.89% | [122,769, 194,806] |
We compared the three parameters in common between model 1 and model 3, and the four parameters in common between model 2 and model 3. Generally, the estimation of each parameter was more accurate under model 1 or model 2 than under model 3: the posterior means were closer to the true values and the 95% credible intervals were narrower. In particular, the 95% credible intervals estimated in models 1 and 2 were much narrower than those in model 3. These observations were not surprising and reflect the notion that the more complex a model is, the less accurate the parameter estimation will be (given the same estimation conditions). However, the means of the posterior samples of T, N_{1}^{'} and N_{2}^{'} were still quite close to the true values even for the complex model 3, whereas the migration rates and present population sizes were more difficult to estimate, as illustrated by the wide credible intervals (Figure 4).
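The quantities reported in the tables above (posterior mean, relative difference from the true value, and 95% interval) can be obtained from a posterior sample along the following lines. This is a minimal sketch under assumptions: the "Difference" column is taken to be |mean − true| / true, which is consistent with the tabulated values, and the interval uses simple empirical quantiles without interpolation. The function names are hypothetical.

```python
def relative_difference(estimate, true_value):
    """Absolute deviation from the true value, expressed in percent of the true value."""
    return abs(estimate - true_value) / abs(true_value) * 100.0

def summarize_posterior(sample, true_value, level=0.95):
    """Return (mean, relative difference in %, (lower, upper)) for a posterior sample."""
    ordered = sorted(sample)
    n = len(ordered)
    mean = sum(ordered) / n
    tail = (1.0 - level) / 2.0
    lower = ordered[int(tail * (n - 1))]           # empirical lower quantile
    upper = ordered[int((1.0 - tail) * (n - 1))]   # empirical upper quantile
    return mean, relative_difference(mean, true_value), (lower, upper)

# Example: the posterior mean of m_21 under model 1 versus its true value.
print(relative_difference(0.9454, 1.0))
```

For the model 1 estimate of m_{21}, the call above gives approximately 5.46, matching the tabulated difference.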
The estimation of divergence time, migration rates, past population sizes and present population sizes in model 3 when the mutation rate or the recombination rate varies across the genome regions
Parameter (true value) | Statistic | Varying mutation rate (θ = 4N_{e}μ): E[θ] = 5/3 | Varying mutation rate: E[θ] = 5 | Varying mutation rate: E[θ] = 5 × 3 | Varying recombination rate (ρ = 4N_{e}r): E[ρ] = 40/2 | Varying recombination rate: E[ρ] = 40 | Varying recombination rate: E[ρ] = 40 × 2 |
---|---|---|---|---|---|---|---|
T (0.1) | Mean | 0.0700 | 0.1070 | 0.1168 | 0.0662 | 0.1154 | 0.1489 |
T (0.1) | 95% CI | [0.0252, 0.1383] | [0.0656, 0.1759] | [0.0726, 0.1864] | [0.0274, 0.1259] | [0.0626, 0.1883] | [0.1018, 0.2054] |
m _{12} (2.0) | Mean | 3.1771 | 2.4135 | 4.9080 | 1.5306 | 2.5406 | 2.6228 |
m _{12} (2.0) | 95% CI | [0.4462, 4.8543] | [0.4741, 4.6774] | [4.5929, 4.9939] | [0.0379, 4.2540] | [0.5208, 4.6828] | [1.0433, 4.6559] |
m _{21} (1.0) | Mean | 0.7207 | 0.9144 | 0.0476 | 4.6749 | 0.9631 | 0.1092 |
m _{21} (1.0) | 95% CI | [0.0465, 3.4811] | [0.1033, 3.0788] | [0.0000, 0.2378] | [2.8810, 4.9834] | [0.1032, 3.1232] | [0.0190, 0.3659] |
N _{1} ^{'} (2,000) | Mean | 1,268 | 2,349 | 9,014 | 1,820 | 2,474 | 4,779 |
N _{1} ^{'} (2,000) | 95% CI | [499, 2,460] | [1,195, 4,227] | [7,253, 9,884] | [796, 3,402] | [1,186, 4,390] | [2,738, 6,790] |
N _{2} ^{'} (5,000) | Mean | 1,777 | 4,777 | 9,338 | 3,562 | 4,888 | 4,104 |
N _{2} ^{'} (5,000) | 95% CI | [770, 3,361] | [3,344, 6,235] | [7,878, 9,971] | [2,301, 4,836] | [3,324, 6,276] | [2,620, 5,496] |
N _{1} (100,000) | Mean | 164,116 | 116,700 | 75,603 | 44,265 | 115,277 | 192,358 |
N _{1} (100,000) | 95% CI | [94,583, 198,448] | [47,239, 191,310] | [40,493, 133,925] | [13,558, 159,882] | [44,780, 191,187] | [177,723, 199,615] |
N _{2} (150,000) | Mean | 183,527 | 154,968 | 24,650 | 45,778 | 151,010 | 196,899 |
N _{2} (150,000) | 95% CI | [14,273, 199,210] | [102,065, 195,442] | [12,964, 75,434] | [16,968, 140,253] | [98,170, 195,246] | [190,142, 199,659] |
Choosing a poor prior
Estimation of divergence time T for model 3 in cases where the prior distribution does not encompass the true parameter value
True T | With transformation: mean | With transformation: 95% interval | Without transformation: mean | Without transformation: 95% interval | Without regression: mean | Without regression: 95% interval |
---|---|---|---|---|---|---|
0.60 | 0.4994 | [0.4954, 0.5000] | 0.6226 | [0.0977, 1.1443] | 0.4607 | [0.3945, 0.4991] |
0.65 | 0.5000 | [0.5000, 0.5000] | 0.5952 | [0.5618, 0.6297] | 0.4633 | [0.4039, 0.4991] |
0.70 | 0.5000 | [0.5000, 0.5000] | 0.7502 | [0.5228, 0.9915] | 0.4669 | [0.4158, 0.4994] |
0.75 | 0.5000 | [0.5000, 0.5000] | 0.5447 | [0.4450, 0.6692] | 0.4703 | [0.4216, 0.4995] |
0.80 | 0.5000 | [0.5000, 0.5000] | 1.2404 | [0.4929, 1.8773] | 0.4729 | [0.4275, 0.4994] |
0.85 | 0.5000 | [0.5000, 0.5000] | 0.7836 | [0.5244, 0.9533] | 0.4731 | [0.4308, 0.4994] |
0.90 | 0.5000 | [0.5000, 0.5000] | 0.6255 | [0.5325, 0.7240] | 0.4738 | [0.4312, 0.4994] |
0.95 | 0.5000 | [0.5000, 0.5000] | 0.6322 | [0.4716, 0.7902] | 0.4749 | [0.4376, 0.4994] |
1.00 | 0.5000 | [0.5000, 0.5000] | 0.6887 | [0.5749, 0.8131] | 0.4754 | [0.4358, 0.4991] |
Comparison of summary statistics
Increasing the number of loci
Discussion
From the comparison of the three different models, it was clear that the accuracy in the parameter estimates was lower for the more complex model 3 with seven unknown parameters compared to models 1 and 2, which have three and four unknown parameters. This observation was not surprising since the less complex models were nested models within model 3. More generally, by increasing the number of simulated data sets in the ABC procedure, the parameter estimation was improved (especially the 95% credible interval). For example, for model 3, five of the seven parameter estimates were substantially improved if 50,000 replicates were used instead of 10,000 replicates, and the remaining two parameter estimates (the two current population sizes) were very similar regardless of the number of replicates. This result confirms that we can overcome some of the difficulties in parameter estimation for complex models by increasing the number of simulation steps in the ABC procedure, but it will mainly decrease the Monte Carlo error and allow a reduction of the tolerance level. However, the number of replicate simulations can be a limiting factor for ABC analyses of large scale genome-wide data because of computing time. In our case, approximately 8-9 minutes of CPU time was needed to generate one simulated dataset (10,000 genome regions of size 100 kb) for model 3 using ms [31], which corresponds to about 7,000 computer hours for 50,000 simulated datasets.
Among the seven parameters in our models, the migration rates and present population sizes were the most difficult to estimate. A divergence model without migration and an island migration model may result in quite similar gene genealogies for sampled individuals and there will be little information contained in most types of summary statistics to distinguish the differences [13]. Moreover, in a divergence model, the estimates of migration rates may depend on the divergence time since a model with large divergence time and large migration rates can generate gene genealogies that are similar to the gene genealogies of a model with short divergence time and small migration rates. However, under a scenario of divergence and gene flow, the variation in genealogical histories for different parts of the genome could in principle be used to separate migration rates and divergence times. To determine similarity between the "observed" data and the simulated data, we used the means of the summary statistics. Another option would be to use variance, quartiles (e.g. [39]), or a combination of different summaries of the distributions. We tested the performance of using the variances (instead of the means) of the summary statistics to infer the divergence time and the two migration rates in model 1 (true T = 0.1, m_{12} = 2.0, and m_{21} = 1.0). However, the parameter estimates were more accurate based on means than based on variances for T and m_{21}, and similarly accurate for m_{12} (posterior means of 0.1003 vs. 0.1008 for T, 1.9803 vs. 2.0915 for m_{12}, and 0.9454 vs. 0.8234 for m_{21}). If we use both the mean and the variance, the parameter estimates were quite close to the result based only on means, but with larger 95% credible intervals ([0.0952, 0.1063] vs. [0.0816, 0.1218] for T, [1.6516, 2.1560] vs. [1.2458, 3.0073] for m_{12}, and [0.5033, 1.3935] vs. [0.1575, 1.6617] for m_{21}).
Note also that the number of summary statistics for the combined case is twice the number of summary statistics for the case based on means. The lower quality of the estimates may be related to increasing the dimensionality of the problem as the number of summary statistics increases when considering both means and variances [8].
In models 2 and 3, it is assumed that the population sizes grow exponentially and the populations spend a very short amount of time with a size close to the 'present population size', hence the populations are far from being at an equilibrium. The summary statistics that we used capture events over a long period of time, and during most of this time, the populations have much smaller sizes compared to the present population sizes. Therefore, little information about present population sizes was contained in the summary statistics, which could explain the difficulty in estimating present population sizes compared to past population sizes.
We investigated cases where the true parameter value was outside the range of the prior distribution (Table 6) in order to determine its effect on the parameter estimation and the potential warning signals to pay attention to. If we use the ABC approach with local linear regression adjustment including the transformation step, the posterior sample will be limited to the range of the prior distribution, so the ABC practitioner needs to be observant of posterior distributions that are pushed close to the boundaries of the prior. For comparison, the ABC approach with local linear regression adjustment without a transformation step produces mean values of the posterior samples that were often fairly close to the true values even though the range of the prior distribution did not include the true value (but note that such results violate Bayes' theorem, see above). In many Bayesian analyses, when the prior distribution does not include the true value of the parameter in its support, the posterior sample might be skewed towards the true value, indicating that something might be wrong with the choice of prior distribution. The ABC approach with local linear regression adjustment and transformation seems to preserve this property, which is reassuring. In practice, if the posterior distribution ends up close to a bound of the prior, we should adjust the range of the prior distribution so that the posterior is well within the prior distribution.
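One simple way to look for the warning signal discussed above is to measure how much of the posterior sample sits next to a prior bound. The sketch below is a hypothetical heuristic, not a procedure from the study: the function name and the 5% margin are assumptions, and any cutoff for raising a flag would have to be chosen by the practitioner.

```python
def fraction_near_bounds(sample, lo, hi, margin=0.05):
    """Share of posterior draws lying within `margin` of the prior width from either bound."""
    width = hi - lo
    near = sum(
        1 for x in sample
        if x - lo <= margin * width or hi - x <= margin * width
    )
    return near / len(sample)

# Posterior draws piled against the upper bound of a uniform (0, 0.5) prior.
print(fraction_near_bounds([0.4994, 0.4990, 0.4999], 0.0, 0.5))
```

Under a uniform prior, a fraction far above 2 × `margin` (the prior mass in the two boundary zones) suggests that the posterior is being pushed against a bound and that the prior range may need to be widened.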
Huang et al. (2011) [28] demonstrated increasing power for inferring divergence times with increasing numbers of loci, but limited their investigation to relatively small numbers of loci (< 100). We investigated much larger numbers of loci and found that the mean values of the posterior sample approach the true values when approximately 1,000 loci (or more) were used (Figure 8). The width of the 95% credible interval decreases rapidly as the number of loci increases from 100 to some 2,000, after which the interval narrows more slowly. The same trends were observed for both the divergence time and migration rates (Figure 8). These results suggest that increasing the number of loci from around a hundred to several thousand improves the accuracy of parameter estimation using ABC. Although the greatest improvement appears for fewer than 2,000 loci, we note that model 1 is a relatively simple model, and for more complex models, the accuracy of the parameter estimation may continue to improve beyond 2,000 loci.
Both the mutation and the recombination rates are likely to vary across the genome. However, we assumed that the mutation and recombination rates did not vary along the genome, and they were fixed to a known value for the simulations used in the ABC. We further demonstrate that this assumption works well even if the mutation and the recombination rates vary around some mean values as long as these mean values are similar to the fixed values used in the ABC. To make the simulations in the ABC approach even more realistic, we could draw mutation and recombination rates for each genome-region from some distribution and potentially estimate the empirical mutation and recombination rates. Alternatively, the mutation and recombination rates could be treated as nuisance parameters that are only included to make the simulated data better resemble the empirical data.
Conclusions
To conclude, we find that increasing the amount of data from a few loci, or a few hundred loci, to thousands of loci can substantially improve the accuracy of parameter estimation using ABC. In contrast to many full-likelihood inference approaches, the ABC approach is well suited for analyzing large amounts of population genomic data, using for example haplotype-based summary statistics.
Declarations
Acknowledgements
We are grateful for financial support provided by a Sven and Lilly Lawski's Graduate Student Scholarship to SL, the Swedish Research Council, and the Swedish Research Council Formas. The computations were performed on resources provided by Swedish National Infrastructure for Computing (SNIC) through Uppsala Multidisciplinary Centre for Advanced Computational Science (UPPMAX) under Project p2009042 and s00110-213. We also thank Martin Lascoux and three anonymous reviewers for helpful comments and suggestions.
Authors’ Affiliations
References
- Stephens M: Inference Under the Coalescent. Handbook of Statistical Genetics. Edited by: Balding DJ, Bishop M, Cannings C. 2008, Chichester, UK: John Wiley & Sons, Ltd, 878-908. Third edition.
- Beaumont MA, Rannala B: The Bayesian revolution in genetics. Nat Rev Genet. 2004, 5: 251-261.
- Marjoram P, Molitor J, Plagnol V, Tavaré S: Markov chain Monte Carlo without likelihoods. Proc Natl Acad Sci USA. 2003, 100: 15324-15328. 10.1073/pnas.0306899100.
- Csilléry K, Blum MG, Gaggiotti OE, François O: Approximate Bayesian Computation (ABC) in practice. Trends Ecol Evol. 2010, 25: 410-418. 10.1016/j.tree.2010.04.001.
- Tavaré S, Balding DJ, Griffiths RC, Donnelly P: Inferring Coalescence Times From DNA Sequence Data. Genetics. 1997, 145: 505-518.
- Pritchard JK, Seielstad MT, Perez-Lezaun A, Feldman MW: Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol Biol Evol. 1999, 16: 1791-1798. 10.1093/oxfordjournals.molbev.a026091.
- Beaumont MA, Zhang W, Balding DJ: Approximate Bayesian Computation in Population Genetics. Genetics. 2002, 162: 2025-2035.
- Blum M, François O: Non-linear regression models for Approximate Bayesian Computation. Stat Comput. 2010, 20: 63-73. 10.1007/s11222-009-9116-0.
- Sisson SA, Fan Y, Tanaka MM: Sequential Monte Carlo without likelihoods. Proc Natl Acad Sci. 2007, 104: 1760-1765. 10.1073/pnas.0607208104.
- Nielsen R, Wakeley J: Distinguishing Migration From Isolation: a Markov Chain Monte Carlo Approach. Genetics. 2001, 158: 885-896.
- Rosenberg NA, Feldman MW: The relationship between coalescence times and population divergence times. Modern Developments in Theoretical Population Genetics. Edited by: Slatkin M, Veuille M. 2002, Oxford: Oxford University Press, 130-164.
- Rosenberg NA: The shapes of neutral gene genealogies in two species: probabilities of monophyly, paraphyly, and polyphyly in a coalescent model. Evolution. 2003, 57: 1465-1477.
- Hey J, Nielsen R: Multilocus Methods for Estimating Population Sizes, Migration Rates and Divergence Time, With Applications to the Divergence of Drosophila pseudoobscura and D. persimilis. Genetics. 2004, 167: 747-760. 10.1534/genetics.103.024182.
- Hey J: On the Number of New World Founders: A Population Genetic Portrait of the Peopling of the Americas. PLoS Biol. 2005, 3: e193.
- Jakobsson M, Hagenblad J, Tavaré S, Säll T, Halldén C, Lind-Halldén C, Nordborg M: A Unique Recent Origin of the Allotetraploid Species Arabidopsis suecica: evidence from Nuclear DNA Markers. Mol Biol Evol. 2006, 23: 1217-1231. 10.1093/molbev/msk006.
- Jakobsson M, Rosenberg NA: The probability distribution under a population divergence model of the number of genetic founding lineages of a population or species. Theor Popul Biol. 2007, 71: 502-523. 10.1016/j.tpb.2007.01.004.
- Becquet C, Przeworski M: A new approach to estimate parameters of speciation models with application to apes. Genome Res. 2007, 17: 1505-1519. 10.1101/gr.6409707.
- Hey J: Isolation with Migration Models for More Than Two Populations. Mol Biol Evol. 2010, 27: 905-920. 10.1093/molbev/msp296.
- Fagundes NJR, Ray N, Beaumont M, Neuenschwander S, Salzano FM, Bonatto SL, Excoffier L: Statistical evaluation of alternative models of human evolution. Proc Natl Acad Sci. 2007, 104: 17614-17619. 10.1073/pnas.0708280104.
- Wegmann D, Leuenberger C, Excoffier L: Efficient Approximate Bayesian Computation Coupled With Markov Chain Monte Carlo Without Likelihood. Genetics. 2009, 182: 1207-1218. 10.1534/genetics.109.102509.
- Bertorelle G, Benazzo A, Mona S: ABC as a flexible framework to estimate demography over space and time: some cons, many pros. Mol Ecol. 2010, 19: 2609-2625. 10.1111/j.1365-294X.2010.04690.x.View ArticlePubMedGoogle Scholar
- The International HapMap 3 Consortium: Integrating common and rare genetic variation in diverse human populations. Nature. 2010, 467: 52-58. 10.1038/nature09298.PubMed CentralView ArticleGoogle Scholar
- Jakobsson M, Scholz SW, Scheet P, Gibbs JR, VanLiere JM, Fung H, Szpiech ZA, Degnan JH, Wang K, Guerreiro R, Bras JM, Schymick JC, Hernandez DG, Traynor BJ, Simon-Sanchez J, Matarin M, Britton A, van de Leemput J, Rafferty I, Bucan M, Cann HM, Hardy JA, Rosenberg NA, Singleton AB: Genotype, haplotype and copy-number variation in worldwide human populations. Nature. 2008, 451: 998-1003. 10.1038/nature06742.View ArticlePubMedGoogle Scholar
- Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, Ramachandran S, Cann HM, Barsh GS, Feldman M, Cavalli-Sforza LL, Myers RM: Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation. Science. 2008, 319: 1100-1104. 10.1126/science.1153717.View ArticlePubMedGoogle Scholar
- Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, Stephens M, Bustamante CD: Genes mirror geography within Europe. Nature. 2008, 456: 98-101. 10.1038/nature07331.PubMed CentralView ArticlePubMedGoogle Scholar
- Reich D, Thangaraj K, Patterson N, Price AL, Singh L: Reconstructing Indian population history. Nature. 2009, 461: 489-494. 10.1038/nature08365.PubMed CentralView ArticlePubMedGoogle Scholar
- Wollstein A, Lao O, Becker C, Brauer S, Trent RJ, Nürnberg P, Stoneking M, Kayser M: Demographic History of Oceania Inferred from Genome-wide Data. Curr Biol. 2010, 20: 1983-1992. 10.1016/j.cub.2010.10.040.View ArticlePubMedGoogle Scholar
- Huang W, Takebayashi N, Qi Y, Hickerson M: MTML-msBayes: Approximate Bayesian comparative phylogeographic inference from multiple taxa and multiple loci with rate heterogeneity. BMC Bioinformatics. 2011, 12: 1-PubMed CentralView ArticlePubMedGoogle Scholar
- Beaumont MA: Joint determination of topology, divergence time and immigration in population trees. Simulations, Genetics and Human Prehistory. Edited by: Matsumura S, Forster P, Renfrew C. 2008, Cambridge, UK: McDonald Institute for Archaeological Research, 135-154.Google Scholar
- Blum MGB: Approximate Bayesian Computation: A Nonparametric Perspective. J Am Stat Assoc. 2011, 105: 1178-1187.View ArticleGoogle Scholar
- Hudson RR: Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002, 18: 337-338. 10.1093/bioinformatics/18.2.337.View ArticlePubMedGoogle Scholar
- The 1000 Genomes Project Consortium: A map of human genome variation from population-scale sequencing. Nature. 2010, 467: 1061-1073. 10.1038/nature09534.PubMed CentralView ArticleGoogle Scholar
- Hamilton G, Stoneking M, Excoffier L: Molecular analysis reveals tighter social regulation of immigration in patrilocal populations than in matrilocal populations. Proc Natl Acad Sci. 2005, 102: 7476-7480. 10.1073/pnas.0409253102.PubMed CentralView ArticlePubMedGoogle Scholar
- Conrad DF, Jakobsson M, Coop G, Wen X, Wall JD, Rosenberg NA, Pritchard JK: A worldwide survey of haplotype variation and linkage disequilibrium in the human genome. Nat Genet. 2006, 38: 1251-1260. 10.1038/ng1911.View ArticlePubMedGoogle Scholar
- Hill WG, Robertson A: Linkage disequilibrium in finite populations. Theor Appl Genet. 1968, 38: 226-231. 10.1007/BF01245622.View ArticlePubMedGoogle Scholar
- Tajima F: Statistical Method for Testing the Neutral Mutation Hypothesis by DNA Polymorphism. Genetics. 1989, 123: 585-595.PubMed CentralPubMedGoogle Scholar
- Weir BS: Genetic Data Analysis II. 1996, Sunderland, MA, USA: Sinauer Associates, Inc., 174-Google Scholar
- Rousseeuw PJ: Least Median of Squares Regression. J Am Stat Assoc. 1984, 79: 871-880. 10.2307/2288718.View ArticleGoogle Scholar
- Blum MGB, Jakobsson M: Deep Divergences of Human Gene Trees and Models of Human Origins. Mol Biol Evol. 2011, 28: 889-898. 10.1093/molbev/msq265.View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.