Insertion-deletion polymorphisms (indels) as genetic markers in natural populations

Background: We introduce the use of short insertion-deletion polymorphisms (indels) for genetic analysis of natural populations. Results: Sequence reads from light shot-gun sequencing efforts of different dog breeds were aligned to the dog genome reference sequence, and gaps corresponding to indels were identified. One hundred candidate markers (4-bp indels) were selected and genotyped in unrelated dogs (n = 7) and wolves (n = 18). Of the 94 markers that could be amplified and scored, 81 and 76 were validated as polymorphic loci in the respective samples. Mean indel heterozygosity in a diverse set of wolves was 19%, and 74% of the loci had a minor allele frequency of >10%. Indels found to be polymorphic in wolves were subsequently genotyped in a highly bottlenecked Scandinavian wolf population. Fifty-one loci turned out to be polymorphic, showing their utility even in a population with low genetic diversity. In this population, individual heterozygosity measured at indel and microsatellite loci was highly correlated. Conclusion: With an increasing amount of sequence information gathered from non-model organisms, we suggest that indels will come to form an important source of genetic markers, easy and cheap to genotype, for studies of natural populations.


Background
Advancement in population and evolutionary genetic research has been accompanied by, or perhaps better phrased, been a consequence of, continuous improvement in the way genetic similarity or dissimilarity between genomes is assessed. Seen in a long-term perspective, genetic marker methodology has evolved from focusing on phenotypes, via immunological parameters and proteins, to genotypes. Following their introduction to the study of natural populations about 15 years ago [1][2][3], microsatellites, or short simple tandem repeats, have been the genotype-based marker approach of choice for many applications where the relatedness between individuals, populations or species is sought. Preceding and subsequently in parallel to this, non-repetitive DNA sequence variation has been assessed through various approaches, including DNA sequencing, restriction fragment length polymorphism (RFLP) analysis, single strand conformation polymorphism (SSCP) analysis, random amplified polymorphic DNA (RAPD) analysis and amplified fragment length polymorphism (AFLP) analysis [4]. More recently, single nucleotide polymorphisms (SNPs) have increasingly found application in studies of natural populations [5,6].
The benefits of microsatellites are several and well known. They are multi-allelic, show high heterozygosity and are relatively easy to analyse at moderate cost. Because of their high polymorphism information content, a rather limited number of markers suffices for many applications in molecular ecology and population genetics. It is usually not too difficult to isolate the required markers from DNA libraries [7] or to employ markers originally developed for related species [8]. SNPs have merit as genetic markers for other reasons. They are very common, with genomic densities outnumbering those of microsatellites by orders of magnitude. Large numbers of individuals may be genotyped at large numbers of loci by simple and fast automated methods, and data interpretation is usually straightforward [5,9]. Moreover, SNP variation at protein-coding genes and in other functionally constrained regions of the genome is likely to form the main genetic background to phenotypic variation. Furthermore, biallelic SNPs evolve in a manner well described by simple mutation models. There are good reasons to believe that they will in many cases gradually come to replace microsatellites in molecular ecology and population genetics/genomics research [6].
However useful, both microsatellites and SNPs suffer from some shortcomings. The complex and heterogeneous mutation pattern of microsatellites [10] introduces ambiguities into downstream data analysis. Genotyping errors may occur because of stutter bands and technical artefacts (allelic dropouts, null alleles, false alleles, size homoplasy) [11]. As for SNPs, many more markers are needed to obtain the same amount of information [6,9]. Moreover, despite the many elegant genotyping methods available [9], most of them are relatively costly at small or medium scales and require special equipment for high-throughput genotyping.
With a few years' lag, the introduction of new genetic markers to the study of natural populations has generally followed methodological developments made in the genetic analysis of model organisms [4]. Currently, there is an increasing focus on short insertion and deletion polymorphisms (indels) in genomic research on humans [12,13] and model species such as Drosophila melanogaster [14] and chicken (Gallus gallus) [15]. Indels have been recognised as an abundant source of genetic markers that are widely spread across the genome, though not as common as SNPs. For instance, Mills et al. [13] used data from re-sequencing surveys to identify 415,436 indels segregating in human populations, and they estimated that among the total of >10 million polymorphisms known in humans, some 1.5 million represent indels. This indicates that indels could form a very common class of genetic markers also in non-model species, particularly given that genetic diversity in many natural populations typically seems to be higher than in humans [5,6,16]. Most importantly, indels can be genotyped with simple procedures based on size separation. Another advantage is the minuscule chance of two indel mutations of exactly the same length happening at the same genomic position, meaning that shared indels can confidently be seen as representing identity-by-descent [cf. [17]].
In this study we present a test of the usefulness of indel markers in natural populations. We use a bioinformatics approach to survey dog shot-gun reads [18] for the presence of indels and based on this we design a pipeline for development of PCR-based indel markers. We subsequently genotype 100 indels in natural wolf populations and compare the results with data on microsatellite variability obtained from the same animals.

Results
There are ≈100,000 shot-gun reads available from each of 9 different dog breeds, sequence data that come in addition to data obtained from the partial [19] or full genome sequencing [18] of two dogs. We surveyed 200,000 of these trace reads for the occurrence of short insertion and deletion polymorphisms, as detected by alignment against the reference sequence of one female boxer [18]. Note that there is essentially no sequence overlap among trace reads, so each alignment consistently comprised only two alleles drawn from the population of dogs. In total, this yielded 30,116 length polymorphisms, corresponding to about one length variant every 2400 bp. Consistent with what has been found in other organisms [cf. 12,13,15,20], the great majority of indels were very short, with a dominance of 1-bp events (Table 1). From these polymorphism data we chose 4-bp indels for further analysis since they are easily scored by size separation and relatively abundant in the genome. We selected 100 4-bp non-repetitive indels located within unique sequence. They were spread across the canine genome and consistently represented autosomal loci, the great majority of them likely to reside in non-protein-coding sequence. Of the 100, 94 could be readily amplified and scored and were selected for further analysis (Table 2).
Using conventional genotyping based on fragment length separation in a DNA sequencing instrument, 81 out of the 94 putative markers were found to be polymorphic in a screening of 7 dogs and 76 of them were polymorphic in a global sample of 18 wolves (Figure 1A). As PCR primers were designed to generate amplicons of varying size within the 70-120-bp interval, combinations of multiplex reactions (three markers per PCR) were readily formed. This allowed simultaneous amplification, and consequently simultaneous genotyping within a single capillary, of several markers even when using the same fluorophore (Figure 1B).
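The grouping of markers into three-plex reactions can be illustrated with a minimal sketch. It assumes that each marker is described by the amplicon size of its longer allele, that the two alleles differ by 4 bp, and that a margin of 6 bp between the size ranges of different markers is enough to keep peaks apart in one capillary. The margin, the function names and the greedy strategy are all hypothetical and are not taken from the procedure actually used.

```python
from typing import List, Tuple

INDEL_BP = 4      # size difference between the two alleles of a 4-bp indel
MIN_GAP_BP = 6    # assumed safety margin between markers (hypothetical value)
PLEX = 3          # three markers per PCR, as described in the text


def allele_range(long_allele_bp: int) -> Tuple[int, int]:
    """Return (short, long) amplicon sizes for a biallelic 4-bp indel."""
    return long_allele_bp - INDEL_BP, long_allele_bp


def compatible(a: int, b: int) -> bool:
    """True if the size ranges of two markers are at least MIN_GAP_BP apart."""
    a_lo, a_hi = allele_range(a)
    b_lo, b_hi = allele_range(b)
    return a_lo - b_hi >= MIN_GAP_BP or b_lo - a_hi >= MIN_GAP_BP


def greedy_multiplex(markers: List[Tuple[str, int]]) -> List[List[str]]:
    """Place each marker in the first group that has room and no size clash."""
    groups: List[List[Tuple[str, int]]] = []
    for name, size in sorted(markers, key=lambda m: m[1]):
        for group in groups:
            if len(group) < PLEX and all(compatible(size, s) for _, s in group):
                group.append((name, size))
                break
        else:
            # No existing group can take this marker: open a new reaction.
            groups.append([(name, size)])
    return [[name for name, _ in group] for group in groups]
```

A greedy pass is not guaranteed to give the smallest number of reactions, but with amplicons spread over a 50-bp window it is usually sufficient for a first layout.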
In wolves, 74% of the polymorphic loci had a minor allele frequency of >10% and 49% of >20%. The average observed and expected heterozygosities were respectively 19.4% and 26.1% in wolves, while they were 26.8% and 35.5% in dogs. The distribution of wolf heterozygosities is shown in Figure 2.
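For readers who wish to reproduce summary statistics of this kind, the following minimal sketch computes observed heterozygosity, expected heterozygosity and minor allele frequency for one biallelic locus. The genotype encoding (pairs of allele labels such as "I" and "D") is hypothetical, and the optional small-sample correction factor 2n/(2n - 1) is an assumption; it is not stated here whether the software used for the reported values applied it.

```python
from collections import Counter
from typing import Dict, List, Tuple

Genotype = Tuple[str, str]  # e.g. ("I", "D"); allele labels are hypothetical


def locus_summary(genotypes: List[Genotype], unbiased: bool = True) -> Dict[str, float]:
    """Observed and expected heterozygosity and minor allele frequency of one locus."""
    n = len(genotypes)
    if n == 0:
        raise ValueError("no genotypes supplied")
    allele_counts = Counter(allele for g in genotypes for allele in g)
    total = 2 * n  # number of sampled gene copies
    freqs = [count / total for count in allele_counts.values()]
    # Observed heterozygosity: share of individuals carrying two different alleles.
    h_obs = sum(1 for a, b in genotypes if a != b) / n
    # Expected heterozygosity under Hardy-Weinberg proportions: 1 - sum(p_i^2).
    h_exp = 1.0 - sum(p * p for p in freqs)
    if unbiased:
        h_exp *= total / (total - 1)  # assumed small-sample correction
    # For a biallelic locus the minor allele is the rarer one; 0 if monomorphic.
    maf = min(freqs) if len(freqs) > 1 else 0.0
    return {"h_obs": h_obs, "h_exp": h_exp, "maf": maf}
```

Averaging `h_obs` and `h_exp` over loci gives mean values comparable in kind to those reported above, and counting loci with `maf` above 0.1 or 0.2 gives the corresponding proportions.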
The 76 indels found to be polymorphic in the global sample of wolves were subsequently genotyped in 27 wolves from a Swedish population. Fifty-one loci were polymorphic and showed an observed mean heterozygosity of 25.3%, or 17.0% if including all 76 markers. The same wolves were also genotyped for a set of 20 microsatellites known to be informative in this population [e.g. [21]]. Expected heterozygosities for these loci ranged from 28% to 75%. There was a positive correlation between mean heterozygosity at indel and microsatellite loci in individual wolves (r² = 0.41, P < 0.001; Figure 3).
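The comparison of individual heterozygosity between the two marker types can be sketched as follows. Individual heterozygosity is taken here as the proportion of typed loci at which an animal is heterozygous, and the association is summarised with a Pearson correlation coefficient, whose square corresponds to an r² value of the kind reported above. The handling of missing genotypes and all function names are assumptions made for illustration.

```python
import math
from typing import Hashable, Optional, Sequence, Tuple

Genotype = Optional[Tuple[Hashable, Hashable]]  # None marks a missing genotype


def individual_heterozygosity(genotypes: Sequence[Genotype]) -> float:
    """Proportion of typed loci at which one individual is heterozygous."""
    typed = [g for g in genotypes if g is not None]
    if not typed:
        raise ValueError("individual has no typed loci")
    return sum(1 for a, b in typed if a != b) / len(typed)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)
```

In use, one value per wolf would be computed from its indel genotypes and one from its microsatellite genotypes, and `pearson_r` applied to the two resulting lists; squaring the result gives r².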

Discussion
Our study shows the feasibility of using large-scale genomic sequence data for extracting putative insertion and deletion polymorphisms, loci that can subsequently be validated as informative genetic markers at the population level. It also demonstrates the feasibility of transferring genomic data from a model species to a natural population of a close relative. Dogs were domesticated from wolves 10,000-100,000 years ago [22][23][24], and their divergence has since been accentuated by strong artificial selection during domestication. Finally, by genotyping indels and microsatellites in the same wolves, it also shows that polymorphism levels of the two marker types are highly correlated.
A lack of large-scale genome sequence information has until now hampered the introduction of indels as genetic markers in non-model species. It can be anticipated that this will change in the near future. There is a rapid increase in the number of genome sequencing initiatives, and new sequencing technology, like "454-sequencing" [25], offers immense possibilities for generating massive amounts of sequence data from hitherto uncharacterised genomes. Importantly, the depth of sequence coverage provided by new technology means that it is well suited for sequence analysis of pools of individuals, from which a wealth of polymorphism data can be obtained [26]. For example, if 100 Mb of sequence is generated from each of two individuals (with a 1-Gb genome) in two megasequencing runs, and with an indel density of 1 every 2 kb in pairwise comparisons, several hundred indels are expected to be detected.
Indel density has not been as well characterized in natural populations as nucleotide diversity. In domestic chicken, the pairwise heterozygosity for indels is 2 × 10⁻⁴ per bp [15]. In a natural population of collared flycatchers, Backström et al. [27] found a similar occurrence of indels, 1-2 × 10⁻⁴ per bp. In this study we found about 30,000 indels in about 72 Mbp of dog sequence, which translates into a heterozygosity of 4 × 10⁻⁴ per bp. This includes length variants in unique sequence as well as in repetitive DNA, like microsatellites. Using a similar search algorithm and a similar type of shot-gun vs. genomic reference data set for chicken, we recently found that about half of all length variants detected in this way represent tandem repeats [15]. This would suggest that in dogs, the heterozygosity for short non-repetitive indels is about 2 × 10⁻⁴ per bp, similar to chicken. Moreover, the length distribution of dog indels (Table 1) shows congruence with such data from chicken.
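The arithmetic behind these per-bp figures can be retraced in a few lines. The sketch below starts from the spacing reported in Results (about one length variant every 2400 bp) and applies the "about half are tandem repeats" proportion borrowed from the chicken comparison; treating that proportion as exactly 0.5 for dogs is an assumption, and the variable and function names are illustrative.

```python
BP_PER_INDEL = 2_400            # "about one length variant every 2400 bp"
TANDEM_REPEAT_FRACTION = 0.5    # "about half", assumed to carry over from chicken [15]


def indel_heterozygosity(bp_per_indel: float) -> float:
    """Pairwise indel heterozygosity per bp from the mean spacing of variants."""
    if bp_per_indel <= 0:
        raise ValueError("spacing must be positive")
    return 1.0 / bp_per_indel


# All length variants, including those in repetitive DNA.
total_h = indel_heterozygosity(BP_PER_INDEL)

# Length variants remaining after discounting tandem repeats.
non_repetitive_h = total_h * (1.0 - TANDEM_REPEAT_FRACTION)

print(f"all length variants: {total_h:.1e} per bp")
print(f"non-repetitive indels: {non_repetitive_h:.1e} per bp")
```

Rounded to one significant figure, these values correspond to the 4 × 10⁻⁴ and 2 × 10⁻⁴ per bp quoted in the text.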
The Swedish wolf population was functionally extinct by the 1960s-1970s but has subsequently recovered to a current size of well over 100 individuals [28]. All contemporary Scandinavian wolves are thought to originate from only three founders, eastern immigrants that arrived in Sweden around 1980 and 1990, respectively [21]. The strong bottleneck, subsequent inbreeding and the associated loss of genetic diversity experienced by this population [21,29] give the opportunity to test the utility of indel markers in a small and endangered natural population. The finding that about 50% of the in silico predicted indels from pairwise sequence comparisons of dog alleles are informative in this wolf population confirms the usefulness of indel markers even in a population with limited genetic diversity.
The mean heterozygosity of the 51 polymorphic indels within the Scandinavian wolf population (25%) is somewhat lower than what has been observed for 21 SNPs (34%) in the same population [29]. However, those SNPs were initially identified from a screening of a limited number of Scandinavian wolves, so there was an ascertainment bias in favour of markers with high polymorphism information content. Generally, for those indels and SNPs that represent neutral markers, there should be no reason to believe that heterozygosity for polymorphic loci differs between the two marker categories. Indels in coding sequence are likely to be deleterious more often than point mutations, at least indels that cause frame-shift mutations, which should act to reduce their diversity through negative selection. On the other hand, point mutations in coding sequence may more often than indels be subject to positive selection, which also reduces diversity. In any case, although probably comparable to SNPs, indels do show less variation than microsatellites. Thus, to obtain the same resolving power in relatedness analyses, a higher number of biallelic markers is needed compared to multiallelic microsatellites [30][31][32]. However, the rich abundance of indels in genome sequence surveys and the ease with which they are genotyped (Figure 1A) and multiplexed (Figure 1B) add to their benefit. Moreover, it is possible to design microarrays specifically for short indels, by which genotyping costs become very low [32].

Conclusion
With an increasing amount of sequence information gathered from non-model organisms, we suggest that indels will come to form an important source of genetic markers, easy and cheap to genotype, for studies of natural populations.

Samples
Genomic DNA was extracted from wolf tissue samples using standard phenol-chloroform extraction protocols or [...] to analyse the genetic diversity at the intra-populational level using tissue samples of 27 wolves collected between 1985 and 2005 from roadkills or shot animals from Sweden [21,29].

Selection of markers
A total of about 200,000 dog trace read sequences were obtained from GenBank. These sequence tags were almost exclusively derived from light shot-gun sequencing of unrelated dogs that was done in conjunction with the sequencing of the dog genome [18]. An automated pipeline was set up to survey the sequences for potential indels and to design primers. The initial step in the pipeline was to place all STS sequences onto the dog genome. This was done using local NCBI BLAST [33], with a conservative setting requiring an E value of less than 10⁻⁷⁰. To avoid possible duplicated loci, all cases where there was more than one BLAST hit were discarded. Next, the BLAST results were surveyed for 4-bp indels, recognised as 4-bp gaps in alignments of shot-gun reads and the genome reference sequence. To avoid selection of microsatellites, only those 4-bp indels where neither flank was identical to the indel were used for further processing. For each indel with at least 70 bp of flanking sequence on both sides, Primer3 [34] was used for primer design. Primers were requested from the program for fragment lengths between 70 and 120 bp. The primers were constrained by a required melting temperature between 58 and 62°C, as well as a primer length between 19 and 22 bp. Finally, the primers were evaluated with regard to self-complementarity, as well as for the possibility of the resulting product forming a hairpin. This was done through a simple complementarity testing procedure in which possible self-complementarity at the sharp end of the hairpin was scored higher, with the score decreasing inwards. The top 100 loci passing through all steps were picked for screening. Primers were fluorescently labelled with either FAM, HEX or TET.
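The gap-screening step of such a pipeline can be sketched as follows. The example takes one gapped pairwise alignment (reference and read as equally long strings with "-" for gaps), keeps gaps of exactly 4 bp that have at least 70 alignment columns on each side, and discards those whose sequence is identical to the 4 bp immediately to the left or right. It is a simplified illustration and not the pipeline actually used: parsing of BLAST output, the single-hit filter and primer design are left out, flank length is counted in alignment columns, and all names are hypothetical.

```python
import re
from typing import List, NamedTuple

INDEL_LEN = 4    # indel length targeted in the text
MIN_FLANK = 70   # minimum flanking sequence required on each side (bp)


class Indel(NamedTuple):
    aln_start: int   # 0-based alignment column where the gap begins
    sequence: str    # bases present in the ungapped strand
    gap_in: str      # "reference" or "read": the strand carrying the gap


def find_indels(ref_aln: str, read_aln: str) -> List[Indel]:
    """Return 4-bp gaps with long flanks that do not repeat an adjacent 4-mer."""
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned strings must have equal length")
    ref_aln, read_aln = ref_aln.upper(), read_aln.upper()
    hits: List[Indel] = []
    strands = ((ref_aln, read_aln, "reference"), (read_aln, ref_aln, "read"))
    for gapped, other, label in strands:
        for match in re.finditer(r"-+", gapped):
            start, end = match.span()
            if end - start != INDEL_LEN:
                continue  # keep only gaps of exactly 4 bp
            if start < MIN_FLANK or len(gapped) - end < MIN_FLANK:
                continue  # not enough flanking sequence for primer design
            seq = other[start:end]
            left = other[start - INDEL_LEN:start]
            right = other[end:end + INDEL_LEN]
            if seq in (left, right):
                continue  # indel repeats a flank: likely a tandem repeat
            hits.append(Indel(start, seq, label))
    return sorted(hits)
```

Comparing the indel only with the directly adjacent 4-mer on each side mirrors the flank rule described above; a stricter repeat filter would need additional checks.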

Genotyping and data analysis
Amplification by polymerase chain reaction (PCR) was performed in a 10 µl solution containing 20 ng DNA, 0.25 U AmpliTaq Gold polymerase with 1× AmpliTaq Gold PCR buffer (Applied Biosystems), 2.5 mM MgCl₂, 0.3 µM of each primer and 0.4 mM dNTP. The PCR profile for the indel markers included initial heating at 95°C for 5 min, followed by 35 cycles of 95°C for 30 s, 58°C for 30 s and 72°C for 1 min, and a final extension at 72°C for 10 min. The profile for microsatellites included an initial denaturation step of 95°C for 10 min, 11 touch-down cycles of 94°C for 30 s, 58°C for 30 s (decreasing by 0.5°C in each cycle) and 72°C for 1 min, then 28 cycles of 94°C for 30 s, 52°C for 30 s and 72°C for 1 min, and a final extension at 72°C for 10 min. PCR products were run on a MegaBACE 1000 capillary sequencer (Amersham Biosciences) and analyzed using the accompanying software Genetic Profiler 2.2. Observed and expected heterozygosities were calculated using Microsatellite Toolkit for MS Excel [35], and the correlation between the observed individual heterozygosities according to indel and microsatellite data was estimated.

[...] the molecular analyses. HE conceived of and coordinated the study, and wrote the paper together with ÜV. All authors read and approved the final manuscript.