Fast "coalescent" simulation
© Marjoram and Wall; licensee BioMed Central Ltd. 2006
Received: 22 August 2005
Accepted: 15 March 2006
Published: 15 March 2006
The amount of genome-wide molecular data is increasing rapidly, as is interest in developing methods appropriate for such data. There is a consequent increasing need for methods that are able to efficiently simulate such data. In this paper we implement the sequentially Markovian coalescent algorithm described by McVean and Cardin and present a further modification to that algorithm which slightly improves the closeness of the approximation to the full coalescent model. The algorithm ignores a class of recombination events known to affect the behavior of the genealogy of the sample, but which do not appear to affect the behavior of generated samples to any substantial degree.
We show that our software is able to simulate large chromosomal regions, such as those appropriate in a consideration of genome-wide data, in a way that is several orders of magnitude faster than existing coalescent algorithms.
This algorithm provides a useful resource for those needing to simulate large quantities of data for chromosomal-length regions using an approach that is much more efficient than traditional coalescent models.
Given the increasing prevalence of genome-wide data, and the development of methodologies for the analysis of such data, there is an increasing need for tools that can simulate data appropriate for long, genomic regions. Two options suggest themselves:
1. Model and simulation: The traditional approach has been to use a model that is
(a) thought to be a reasonable approximation to the evolutionary history for the organism of interest, and
(b) easy to simulate.
2. Existing data and perturbation: An alternate, newer approach is to take an existing data set and then perturb it in some fashion to produce "new data from old". A simple example of such an approach would be re-sampling. More specific examples can be found in [3, 4].
The first approach has the advantage of being able to produce data that is not dependent on an existing data set. However, the model it uses will be, by definition, an approximation to the evolutionary processes that produced the real data. The second approach, while being dependent on the presence of an initial data set, has the advantage that the evolutionary model underlying the unperturbed data is correct. We don't know how the data got there, but it is 'real' data, so it got there via the correct evolutionary history. However, the need to then perturb the initial data to produce new data sets adds noise to the evolutionary process and thereby results in data that is only an approximation to reality. Furthermore, the extent of the dependence of the new data sets on the initial data set is unclear, and it is therefore not obvious how typical such data might be of other, unobserved, real data.
We believe both of these approaches have merit. In this paper we restrict ourselves to a discussion of the former approach, in which we use an evolutionary model to simulate new data sets. The use of the standard coalescent model becomes impractical as the length of the simulated region increases. However, the coalescent has been proven to be a powerful simulation tool in these contexts (e.g., ). Thus, in this paper we exploit an approximation to the full coalescent algorithm. This approximation, the sequentially Markov coalescent (SMC), was introduced by McVean and Cardin . It is able to simulate significantly longer regions while maintaining the properties of short-range summary statistics. Since our particular interest is in the development of such algorithms as a tool for the testing of disease mapping methodologies, we pay close attention to the behavior of linkage disequilibrium [LD] in data simulated under the SMC model. The coalescent was introduceid in . It provided an elegant and efficient model for the evolution of a population of randomly-mating, neutral, haploid individuals. As such it has become a very widely used tool. Over time, generalizations have been introduced to deal with the more obviously restrictive aspects of the original model. For example, recombination was introduced in . Selection was introduced in [7, 8]. Useful reviews are found in [9–11].
Our interest here centers on the use of the coalescent algorithm to simulate long chromosomal regions. When long regions are considered, and the recombination rate is therefore very high, the coalescent algorithm becomes somewhat problematic to use. Run-times become longer (see "Results") and memory requirements become greater.
In a case in which two widely-separated regions were being considered, one might simulate these two regions independently, relying on the fact that the regions would be essentially unlinked. However, when one is studying a long, continuous region such a strategy becomes inappropriate since linkage disequilibrium is likely to be present along the entire region. (In a situation in which recombination hotspots were present, one might try to independently simulate regions between hotspots.)
Rapid simulation of coalescent ancestries is central to estimation methods such as rejection algorithms, or to the use of simulation-studies as a test-bed for new methodologies. Thus we use a simple approximation to the coalescent in which the difficulties associated with simulating long chromosomal regions are mitigated.
Hudson  introduced recombination into the coalescent model. Griffiths and Marjoram  then embedded this within the ancestral recombination graph (ARG), a more tractable description of the coalescent model in the presence of recombination. Shortly thereafter, Wiuf and Hein [13, 14] introduced an alternate description of the ancestral process with recombination in which the sample is constructed by moving "along the chromosome". Their algorithm gains efficiency by ignoring a class of recombination events that do not affect the present day sample.
In order to discuss this further, we review the concept of ancestral material. A chromosomal region in an individual is considered to be ancestral if it is eventually inherited by any of the sample of interest drawn from the present day population. Thus, individuals in previous generations are likely to contain chromosomal regions that are both ancestral and non-ancestral.
In essence there are five types of recombination events that occur on the full ARG:
1. Recombination in ancestral material;
2. Recombination in non-ancestral material that has ancestral material to both sides;
3. Recombination in non-ancestral material that has ancestral material only to the left;
4. Recombination in non-ancestral material that has ancestral material only to the right;
5. Recombination in an individual that carries no ancestral material.
As the recombination parameter, ρ, increases, the number of recombinations in the ARG, which is of the order of ρ log(n) for a sample of size n, (e.g., ) grows. A simulation of the full ARG would contain all such recombination events, and hence be highly inefficient. This is not, primarily, due to the large number of recombination events per se, but rather is caused by the growing size of the ARG, which makes increasing demands on computer memory.
Simulating the ancestral recombination graph
There is a class of recombination events which occur to lines on the full ARG but which do not affect the properties of samples generated from that graph. These are recombinations which occur in regions which are non-ancestral for that line (i.e. are not passed on to the sample of interest at the bottom of the graph) and which do not have ancestral regions to both their left and right (corresponding to events of types 3, 4 and 5 in the list of the previous section). The Wiuf and Hein algorithm gains efficiency over an algorithm based on the full ARG by excluding recombinations of types 4 and 5. In particular, it excludes recombinations which occur in non-ancestral material and which only have ancestral material to their right on that line, but not those that have ancestral material only to their left. This has the curious feature of making the density of recombination events in the simulated graph increase as we move along the chromosomal region. (However, it is important to note that these 'extra' recombination events occur in non-ancestral material and do not influence the composition of the final sample.) Since, for large ρ, the efficiency of algorithms that simulate (a subset of) the ARG is largely a function of the amount of memory required to store the graph, this makes the Wiuf and Hein algorithm more efficient than algorithms based on the full ARG. The popular ms algorithm of Hudson  also excludes recombinations in non-ancestral material that only have ancestral material to their left (i.e. of type 3). Thus, all events that do not affect the current sample are ignored in the ms algorithm, and it is therefore more efficient than the Wiuf and Hein algorithm.
The novelty of the SMC scheme proposed by McVean and Cardin, is that before adding the new line corresponding to a recombination event they delete the old (existing) line for that recombination, (i.e. all parts of the old line between the point at which the recombination occurred, and the point at which the old line coalesced, are deleted). Thus, each graph we construct is in fact a tree, and knowledge of the deleted lines is lost rather than being stored within the total graph constructed so far (as in the algorithm of Wiuf and Hein). Consequently, the SMC algorithm explicitly disallows coalescence between lineages with no overlapping ancestral material. The motivation is to significantly increase algorithmic efficiency, due to lower memory requirements, while retaining most of the LD structure . This is possible largely because the process by which trees are constructed as we move along the chromosome using the SMC is now Markovian.
In our implementation of their algorithm we include a slight modification in which the old (existing) line for the recombination is deleted after (rather than before) the new line is added. Thus, in this modified version, which we refer to as SMC', the intuitive interpretation is that we only include recombination events that occur in ancestral material and ignore all events occurring in non-ancestral regions. Our motivation for doing this is as follows. The original specification of the SMC algorithm has the consequence of excluding a class of recombination events that occur in ancestral material but do not affect the pattern of LD in the data (because the two lines resulting from the recombination coalesce with each other before coalescing with any other line). We denote this class of recombinations by R. Thus, since all recombinations are forced to be other than type R, the rate at which recombinations of type not equal to R occurs will now be slightly higher than it would normally be under the full coalescent model (for a given recombination parameter ρ). This suggests that LD will decay slightly more quickly in the original, SMC, version of the algorithm (see "Results" for more details). Our FastCoal software implements both the SMC and SMC' versions of the algorithm.
We give results to supplement those in  and demonstrate that the SMC/SMC' approximation is much more efficient than the coalescent for high ρ and produces data that is almost indistinguishable from that produced by an exact coalescent algorithm. The algorithm simulates an approximation to the true coalescent model. However, the degree of approximation is extremely close, at least in terms of patterns of LD. Furthermore, implementation of the algorithm results in software that is very significantly faster, and has much lower memory requirements, when ρ is large. In fact, the memory required by the algorithm is independent of ρ (in direct contrast to ms, for example).
Note that by discarding the old line associated with each recombination, at any given moment the algorithm stores information for one coalescent tree rather than a more complicated and memory-intensive graph. This allows the SMC/SMC' algorithm to run efficiently for high ρ (when the size of the corresponding graph becomes very large). Other than that, the SMC/SMC' process and Wiuf & Hein algorithms are essentially the same. Thus it follows, in a manner directly analogous to that in [13, 14], that T(x), the marginal genealogy at a particular location x, is still exactly described by the traditional coalescent process. See  for an extended discussion of the properties of the SMC algorithm and a derivation of theoretical results.
We now consider the degree to which data produced by the approximation algorithm is similar to that produced by traditional coalescent algorithms and then demonstrate the relative computational efficiencies for a variety of parameter values.
Wiuf and Hein's algorithm
We start by summarizing Wiuf and Hein's method [13, 14]. Their algorithm provides a way of constructing a subset of the ARC by moving 'along the chromosome', constructing the tree appropriate for each point on the chromosome and storing those trees within a graph that is a subset of the full ARG. Recall that x denotes a location in the interval being simulated. The algorithm proceeds as follows:
1. Set x = 0 and generate a coalescent tree for x. Denote this tree by G(x). Denote the length of the graph at x by L(x).
2. Generate y ~ Exp(LM(x)), the distance along the chromosome to the next recombination event.
3. Pick a point g on the graph G(x)uniformly.
4. Add a recombination event to the graph at that g. The recombination occurs at chromosomal location x + y. The left emerging branch follows the path of the existing line at that point. The right emerging line coalesces at some point higher up on the graph (possibly past the MRCA) according to the usual coalescent probabilities. In particular, it coalesces with each existing line at rate 1.
5. Set x = x + y. Let G(x) denote the total graph constructed so far (i.e. G(x) contains all branches appropriate for any z < x). Set L(x) equal to the total length of G(x).
6. If x + y < 1 return to 2.
The method of Wiuf and Hein simulates a substantial sub-sample of the full ARG. Thus, its burden on computer memory is also substantial, to the point of being intractable for long genomic regions. McVean and Cardin introduced the SMC algorithm, an approximation to the process of Wiuf and Hein. The SMC algorithm reduces the topology being simulated to a tree rather than a graph. We now introduce our variation on the SMC algorithm, which we refer to as SMC'.
The SMC' algorithm
The SMC' algorithm proceeds as follows:
1. Set x = 0 and generate a coalescent tree for x. Denote this tree by T(x). Denote the length of the tree at x by L(x).
2. Generate y ~ Exp(L(x)), the distance along the chromosome to the next recombination event.
3. Pick a point g on the tree T(x) uniformly.
4. Add a recombination event to the graph at that g. The recombination occurs at chromosomal location x + y. The left emerging branch follows the path of the existing line at that point. We refer to this as the old branch. The right emerging line coalesces at some point higher up on the graph (possibly past the MRCA) according to the usual coalescent probabilities. In particular, it coalesces with each existing line at rate 1.
5. Delete the part of the old (i.e. left) branch that lies between the newly added recombination event and the point at which the old branch coalesces with another line. At this point we are left with a tree (rather than a graph).
6. Set x = x + y. Let T(x) denote the tree constructed at x. Set L(x) equal to the length of T(x).
7. If x + y < 1 return to 2.
We illustrate this algorithm in Figure 2 [see Additional file 2]. In the original SMC algorithm presented by McVean and Cardin steps 4 and 5 are conducted in reverse order.
As we discussed earlier, this algorithm has the property that, at any point in time, the topology being considered is a tree rather than a graph. Furthermore, as discussed in , the algorithm is now Markovian as we move along the chromosome. As such it can be efficiently stored in memory, the amount of memory required being independent of the recombination rate.
Behavior of algorithm – tree heights
As noted in , exploiting the approximation described by the SMC algorithm (or the SMC' variation), implies that we are no longer simulating exact coalescent ancestries. For the reasons discussed above and in  it rather straightforwardly follows that time to most recent common ancestor (TMRCA) for any point x ∈ [0,1] will have the same distribution as for the standard coalescent.
Mean Height of i th tree for SMC. We show the mean TMRCA for the i th tree along the chromosome, when it exists, as a function of the recombination rate. Data was simulated for a sample size of n = 2. Results are given for ms, SMC and SMC'.
ρ = 1
ρ = 100
Mean Height of i th tree for SMC. We show the mean TMRCA for the i th tree along the chromosome, when it exists, for a sample size of n = 20. Results are given for ms, SMC and SMC'
ρ = 100
The results show that the SMC' algorithm appears to provides a closer approximation to the full coalescent model than does SMC. As one might expect a priori, the degree of difference decreases as the sample size increases. We illustrate this in Table 2, where results are shown for a sample size of 20.
It is not clear how this difference in behavior might affect the properties of the data being simulated, but it suggests that the covariance of the tree heights at any two positions x and y along the region of interest will be highest under ms, with SMC' leading to a somewhat lower covariance and SMC leading to a further lowering of the covariance. (For further evidence of this effect see the results for LD below.)
Behavior of algorithm – run times
We compare the run times of our software with those of ms (Hudson 2002). We concentrate on parameter values that are appropriate for modelling the data that will come from future large-scale association studies. All simulations assume that θ = 4Nu = 10-4 (where N is the effective population size and u is the mutation rate per base pair per generation) and ρ = 4Nr = 5 * 10-4 (where r is the crossover rate per base pair per generation). The θ value gives a SNP density of one SNP with a minor allele frequency (MAF) of at least 0.05 every 3.4 Kb and the ρ value assumes plausible human parameter values (N = 104 and an average recombination rate of 1.25 cM/Mb).
Run-times. Average time per simulation, as a function of sample size n, based on 20 trials, assuming θ = 10-4/bp and ρ = 5 * 10-4/bp. Simulations were run on a 2.8 GHz Intel Xeon processor. Dashes correspond to simulations that could not be completed because they required too much (> 3 GB RAM) memory.
Behavior of algorithm – LD
Mean of LD summaries. We show results for the mean value of summaries of LD behavior for ms and the SMC and SMC' algorithms. We simulated a 2 Mb region for a sample of size 500. Numbers shown are averaged over 1000 replicates. Markers with MAF less than 0.05 were excluded.
% seq. in hap. blocks
Hudson's ms algorithm is an excellent and widely used tool. As we enter an age in which genome-wide studies are becoming increasingly frequent, the ability to efficiently simulate long chromosomal regions becomes more important. Consequently, the need for an efficient alternative to ms arises. While we feel that ms should continue to be the algorithm of choice when computational demands do not prohibit its use, we feel the SMC/SMC' algorithm provides a useful alternative in this new paradigm. In particular, it will lead to significant improvements in efficiency for methods such as rejection algorithms or computationally intensive simulation studies. While it models an approximation to the full coalescent model, the degree of approximation appears to be very good.
Software to implement the SMC algorithm is available as open-source, freely distributable C++ code. It can be used to generate data according to both the SMC and SMC' versions of the algorithm. The current implementation includes the possibility of allowing for changes in population size, in the form of an exponential growth model. This is dealt with in the standard way by altering the rates at which sequences coalesce (see , for example, for details). Clearly, there are a range of other complicating factors that one might wish to add to this code. Our view is that variation in recombination or mutation rates is best handled via post-processing the data produced by the standard form of the algorithm. For example, if one generates data for given values of θ and ρ one can produce recombination hotspots by contracting a region by a factor of λ (where λ is the relative rate of recombination in the hotspot) followed by a consequent thinning of mutations (including each mutation in that region with probability 1/λ if one wishes to keep the mutation rate constant). See  for details. The authors encourage interested parties to submit functions to allow for such complications. We will maintain these in a central repository.
We have developed software (FastCoal) to implement the SMC algorithm of . This algorithm approximates the standard coalescent process. We also introduce a modified version of the algorithm, SMC', which appears to produce a slightly closer approximation to the full coalescent model. The approximation makes the SMC/SMC' algorithm an appropriate choice for simulating long, chromosomal regions, for which existing algorithms become computationally intractable. We have shown that despite the fact that this method is an approximation to the exact coalescent model, it appears to produce data this is virtually indistinguishable from the exact model, at least in terms of patterns of pairwise LD and marginal TMRCAs. The behavior of LD is particularly relevant for genome-wide mapping studies, so we feel our results give convincing evidence that this software can be used to provide test data in a highly efficient manner when testing new genome-wide mapping methodologies.
Availability and requirements
The FastCoal software, written in C++, is available from PM at firstname.lastname@example.org and runs on Windows platforms.
List Of abbreviations
ancestral recombination graph
minor allele frequency
(time to) most recent common ancestor.
The authors gratefully acknowledge funding from NIH grants GM069890-01A1 and HG002790-01A1, as well as thanking Simon Tavaré and Magnus Nordborg for helpful discussions. We thank Gil McVean and two anonymous reviewers for helpful comments on an earlier version of this manuscript.
- Kingman JFC: On the genealogy of large populations. J Appl Prob. 1982, 19A: 27-43.View ArticleGoogle Scholar
- Hudson RR: Properties of a neutral allele model with intragenic recombination. Theor Popn Biol. 1983, 23: 183-201. 10.1016/0040-5809(83)90013-8.View ArticleGoogle Scholar
- Durrant C, Zondervan KT, Cardon LR, Hunt S, Deloukas P, Morris AP: Linkage Disequilibrium Mapping via Cladistic Analysis of Single-Nucleotide Polymorphism Haplotypes. Am J Hum Genet. 2004, 75: 35-43. 10.1086/422174.PubMedPubMed CentralView ArticleGoogle Scholar
- Jiang R, Marjoram P, Stram D: "New data from old" – simulation of test data for mapping studies. 2005,Google Scholar
- Schaffner SF, Foo C, Gabriel S, Reich D, Daly MJ, Altshuler D: Calibrating a coalescent simulation of human genome sequence variation. Genome Research. 2005, 15: 1576-1583. 10.1101/gr.3709305.PubMedPubMed CentralView ArticleGoogle Scholar
- McVean GAT, Cardin NJ: Approximating the coalescent with recombination. Phil Trans R Soc B. 2005, 360: 1387-1393.PubMedPubMed CentralView ArticleGoogle Scholar
- Krone SM, Neuhauser C: Ancestral processes with selection. Theor Popn Biol. 1997, 51: 210-237. 10.1006/tpbi.1997.1299.View ArticleGoogle Scholar
- Neuhauser C, Krone SM: The genealogy of samples in models with selection. Genetics. 1997, 145: 519-534.PubMedPubMed CentralGoogle Scholar
- Hudson RR: Gene genealogies and the coalescent process. Oxford Surveys in Evolutionary Biology. Edited by: Futuyma D, Antonovics J. 1991, 7: 1-44.Google Scholar
- Tavaré S: Line-of-descent and genealogical processes, and their applications in population genetics models. Theor Popn Biol. 1984, 26: 119-164. 10.1016/0040-5809(84)90027-3.View ArticleGoogle Scholar
- Nordborg M: Coalescent theory. Handbook of Statistical Genetics. Edited by: Balding DJ, Bishop MJ, Cannings C. 2001, New York: John Wiley & Sons, Inc, 179-208.Google Scholar
- Griffiths RC, Marjoram P: An ancestral recombination graph. Progress in Population Genetics and Human Evolution, IMA Volumes in Mathematics and its Applications. Edited by: Donnelly P, Tavaré S. 1997, Springer Verlag, 87: 100-117.Google Scholar
- Wiuf C, Hein J: Recombination as a Point Process along Sequences. Theor Popul Biol. 1999, 55: 248-259. 10.1006/tpbi.1998.1403.PubMedView ArticleGoogle Scholar
- Wiuf C, Hein J: The ancestry of a sample of sequences subject to recombination. Genetics. 1999, 151: 1217-1228.PubMedPubMed CentralGoogle Scholar
- Hudson RR: Generating samples under a Wright-Fisher neutral model. Bioinformatics. 2002, 18: 337-338. 10.1093/bioinformatics/18.2.337.PubMedView ArticleGoogle Scholar
- Hudson RR, Kaplan NL: Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics. 1985, 111: 147-164.PubMedPubMed CentralGoogle Scholar
- Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, Liu-Cordero SN, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander ES, Daly MJ, Altshuler D: The structure of haplotype blocks in the human genome. Science. 2002, 296: 2225-2229. 10.1126/science.1069424.PubMedView ArticleGoogle Scholar
- Wall JD: A comparison of estimators of the population recombination rate. Mol Biol Evol. 2000, 17: 156-163.PubMedView ArticleGoogle Scholar
- Hudson RR: Two-locus sampling distributions and their application. Genetics. 2001, 159: 1805-1817.PubMedPubMed CentralGoogle Scholar
- Hein J, Schierup MH, Wiuf C: Gene Genealogies, Variation and Evolution. 2005, Oxford: Oxford University PressGoogle Scholar
- Li N, Stephens M: Modelling Linkage Disequilibrium, and identifying recombination hotspots using SNP data. Genetics. 2003, 165: 2213-2233.PubMedPubMed CentralGoogle Scholar