SNP identification, verification, and utility for population genetics in a non-model genus
© Williams et al. 2010
Received: 5 February 2010
Accepted: 30 April 2010
Published: 30 April 2010
Skip to main content
© Williams et al. 2010
Received: 5 February 2010
Accepted: 30 April 2010
Published: 30 April 2010
By targeting SNPs contained in both coding and non-coding areas of the genome, we are able to identify genetic differences and characterize genome-wide patterns of variation among individuals, populations and species. We investigated the utility of 454 sequencing and MassARRAY genotyping for population genetics in natural populations of the teleost, Fundulus heteroclitus as well as closely related Fundulus species (F. grandis, F. majalis and F. similis).
We used 454 pyrosequencing and MassARRAY genotyping technology to identify and type 458 genome-wide SNPs and determine genetic differentiation within and between populations and species of Fundulus. Specifically, pyrosequencing identified 96 putative SNPs across coding and non-coding regions of the F. heteroclitus genome: 88.8% were verified as true SNPs with MassARRAY. Additionally, putative SNPs identified in F. heteroclitus EST sequences were verified in most (86.5%) F. heteroclitus individuals; fewer were genotyped in F. grandis (74.4%), F. majalis (72.9%), and F. similis (60.7%) individuals. SNPs were polymorphic and showed latitudinal clinal variation separating northern and southern populations and established isolation by distance in F. heteroclitus populations. In F. grandis, SNPs were less polymorphic but still established isolation by distance. Markers differentiated species and populations.
In total, these approaches were used to quickly determine differences within the Fundulus genome and provide markers for population genetic studies.
High throughput sequencing and genotyping has become increasingly faster, less expensive and more accurate. In recent years this has lead to the establishment of myriad data sets ranging from increased coverage of variation in the human genome at the individual level [1–5] to the sequencing of non-model prokaryotic and eukaryotic genomes and transcriptomes [6–11]. For many organisms sequencing of entire genomes is still unattained, but smaller, more targeted portions of the genome can be easily sequenced and genotyped. Such data can provide genome-wide sequence information which can be used to characterize population and selection pressure parameters as well as provide evolutionary insights that are broadly applicable .
One non-model genus, Fundulus, includes closely related species that range in physiology, environmental and habitat preference, and geographic locales; Fundulus heteroclitus and Fundulus majalis inhabit the Atlantic coast, and Fundulus grandis and Fundulus similis inhabit the Gulf Coast. Many Fundulus species and/or populations have extensive euryhaline capabilities, respond well to varying ranges of hypoxia [13–15], live along a steep thermocline, and have adapted to extremely polluted areas . A variety of studies have investigated the underlying genetic basis of this teleosts' phenotypic plasticity. While some of the transcriptome is known for F. heteroclitus [17–27] much of the genome-wide variation within and between populations and species for this genus is relatively unknown.
Establishing a set of genetic markers, which can be used to assess regions of the genome involved in local adaptation and in speciation is important to understand fundamental similarities and differences between populations and species of Fundulus. Once markers are established they can be further studied to look for signatures of selection to any number of evolutionary forces (e.g., pollution, hypoxia, salinity, temperature). A few studies have established genetic differences between populations of F. heteroclitus mainly with respect to phylogeographic constraints [28, 29] or selection [30–38]. These studies used microsatellite, mitochondrial DNA, and AFLP analyses as well as targeted gene approaches. Single nucleotide polymorphisms (SNPs) are a useful starting point to scan large and disparate regions of the genome due to their abundance in both coding and non-coding regions, their co-dominant nature, and lack of ambiguity.
SNPs have been used to establish differences between individuals , populations [40–42] and species [43, 44]. They also are useful markers for propensity to disease [45–47], disease states , and evidence of the genetic basis of adaptation [49–52]. In vertebrates, a SNP occurs on average every 100 to 1000 base pairs and often is in linkage disequilibrium with many other SNPs along the chromosome, forming strong haplotypes, which can be easily identified . Unfortunately, SNP resources are not readily available in the majority of non-model species lacking genomic resources. With this in mind, we set out to establish a set of SNP markers to identify differences between Fundulus populations and species.
Genomic DNA from spleen and testes was extracted by phenol and chloroform as described in Wirgin et al. , and DNA was resuspended in 50 μL 0.1× TE buffer. Genomic DNAs from fin clips were extracted using a modified version of Aljanabi and Martinez  and DNA was resuspended in 50 μL 0.1× TE buffer. This experiment was performed according to an approved Institutional Animal Care and Use Committee at North Carolina State University.
Adapters and primers used in the amplification of genomic DNA.
BspEI (5' to 3')
EcoRI (5' to 3')
EcoRI (5' to 3')
BspEI (5' to 3')
EcoRI (5' to 3')
BspEI (5' to 3')
EcoRI (5' to 3')
BspEI (5' to 3')
Preselective PCR reactions with primers specific to adaptors (Table 1) were performed in a total volume of 25 μL containing 2 μL of diluted (1:10 in 0.1× Tris-EDTA buffer) ligation product with EcoRI primer (Integrated DNA Technologies; 10 pmol), BspE1 primer (Integrated DNA Technologies; 10 pmol) and 1 U Taq. PCR conditions were 20 cycles of 94°C for 10 sec, 49° for 30 sec, and 72°C for one min. Following the preselective amplification, a selective amplification was carried out to decrease the number of fragments amplified in each individual to approximately 200 by extending the primer on the 3' end. Preselective PCR products were diluted (1:10) and 2 μL of diluted product was amplified with primers (Table 1) to EcoRI+ AAG (Integrated DNA Technologies; 10 pmol) and BspEI +C (Integrated DNA Technologies; 10 pmol) with 1 U Taq in a 25 μL total volume. PCR conditions in the first cycle were 94°C for 10 sec, 65°C for 30 sec, and 72°C for one minute with the annealing temperature reduced by 0.5°C for 20 cycles, then 25 cycles of 94°C for 10 sec, 55°C for 30 sec, and 72°C for one minute.
Emulsion PCR was carried out on PCR products as described . Amplification of the PCR product on the bead was controlled for by quantifying and calculating the size of the amplicon pool using a Bioanalyzer 2100 so that there was a minimum of 2 × 106 copies of DNA that ranged in size from 100 to 700 base pairs. Subsequent products were sequenced on a Roche/454 Life Sciences GS FLX Sequencer at the University of South Carolina's Environmental Genomics Core Facility. The PicoTiter plate was subdivided into eight regions with an expectation of 30,000 reads per region .
Sequences were trimmed of their barcodes. All 626 sequences with at least one ambiguous base were removed since the presence of even a single ambiguous base is an effective indicator of low-quality sequence . Because shorter than expected read lengths also correlate strongly with incorrect reads , another three percent of the sequences (whose lengths were smaller than 100 bp) were removed. The remaining reads were aligned using CAP3 . Quality scores were rescaled to be comparable to the usual Phred Score using ARACHNE .
SNPs were called at both the individual level and population level. At the individual level, SNPs were called using both a Bayesian method and a likelihood ratio test (LRT) method. For the Bayesian method, 10-4 was used as the prior for the mutation rate . At the population level, for each locus on the contig, we simulated the error model and marked a locus as a potential SNP if it had a larger number of second alleles in comparison to the critical value from the error model. Furthermore, a potential SNP site had to have at least three individuals sequenced to 2× at that locus unless another potential SNP site was within five basepairs or over 90% of the individuals had been classified as heterozygous at the individual level. This was done to minimize the rate of false positives caused by homologs.
For the Bayesian model, for each contig, Prior = 1 × 10-4 represents the mutation rate; N represents the total number of unique mapping loci with multiple allelic types; A i and a i represent, respectively, the major and minor alleles at locus i; N i represents the total number of alleles observed for locus i, and Y j is the type of the j th allele copy among these N i alleles where j = 0 ⋯ N i ; finally, e j is the probability of error of the j th allele where the error probability is computed as and where Q is the corresponding quality score after rescaling.
Based on the posterior probabilities from above, we classified each of these N loci as homozygous or heterozygous exclusively. If a locus was classified as heterozygous, it was further tested using a likelihood ratio test (LRT) as follows:
where X j stands for the true allele that we should have observed. For each Y j , we have an error probability of e j associated with it.
Where I j = 1 if Y j = A i ; and I j = 0 if Y j = a i
The LRT was performed with the hypothesis of H O : p = 0.5 versus H a : p >0.5 and -2 × LRT ~ χ2(1).
In order to call SNPs at the population level, we simulated the error model for each locus with multiple allelic types; we assumed that a particular locus was homozygous with major allele A i and randomly simulated N i number of alleles copies to be A i or any of the other three allele types from a uniform distribution with probability (1 - e j ) and e j respectively. We repeated this process 10,000 times and recorded the different numbers of second alleles found in the simulation. The critical value was chosen as the number of second alleles with a right-side p-value of 0.001.
Multiplex assays targeting 458 SNPs in 250 F. heteroclitus individuals, 90 F. grandis individuals, 23 F. majalis individuals, and 21 F. similis individuals were attempted using the Sequenom MassARRAY technology. These consisted of 81 putative SNPs identified by the F. heteroclitus pyrosequencing, 350 putative SNPs previously identified in F. heteroclitus ESTs , and 27 putative SNPs from 22 genes containing, amongst others, SNPs in the aryl hydrocarbon receptor , lactate dehydrogenase B , and the proximal promoter of cytochrome P4501A (unpublished). Assays were designed using the MassARRAY Assay Design Software with the goal of maximizing multiplexing of 36 SNPs per well (Sequenom, San Diego, CA, USA). Only SNPs where 70 base pairs were annotated on either side of the polymorphism were included in the study. There were 14 SNPs previously identified with 454 pyrosequencing where this criterion was not met. If multiple SNPs were proximal (< 70 base pairs) to one another, one SNP was chosen and the other(s) was translated into a degenerate nucleotide (e.g., K = G or T). Reaction conditions were performed by iPLEX chemistry as recommended by Sequenom across 13 plates at the University of Minnesota's BioMedical Genomics Center. SNP genotypes were called using the Sequenom System Typer 4.0 Analysis package. This software uses a three-parameter model to calculate the significance of each putative genotype. Based on the relative significance, a final genotype is called and assigned a particular name (e.g., conservative, moderate, aggressive, user call). Non-calls also were noted (e.g., low probability, bad spectrum).
Arlequin v.3.11 was used to calculate genetic diversity among populations (of F. heteroclitus and F. grandis) by calculating the percentage of polymorphic SNPs (P O ), observed (H O ) and expected heterozygosity (H E ), and the within-population fixation index (F) . Fixation index deviations from zero were tested by 10,000 permutations of alleles between individuals. Hardy-Weinberg equilibrium also was tested in each population. An analysis of molecular variance (AMOVA) was performed to calculate the distribution of variance within populations, between North and South regions, and between F. heteroclitus populations within North and South regions. For F. grandis, the AMOVA was performed to calculate the distribution of variance within populations as well as between populations longitudinally along the Gulf of Mexico. Since SNPs were initially identified from F. heteroclitus sequence data, a maximum of 5% missing data was used as a parameter for calculations involving F. heteroclitus and 10% for all others.
A Mantel test was performed to assess the assumption of isolation by distance using XLSTAT 2009 for F. heteroclitus and F. grandis.
STRUCTURE v.2.2 [67, 68] was used to estimate the number of populations (K) in F. heteroclitus, F. grandis, F. majalis and F. similis along both the Western Atlantic and the Gulf of Mexico and to assign individuals to these populations. The Monte Carlo Markov Chain was run for 105 iterations following a burn-in period of 105 iterations for K = 1 to 14 using the correlated allele frequencies model and assumed admixture. Distruct v. 1.1  was used to generate bar plots to depict classifications with the highest probability under the model. JMP Genomics 3.2 for SAS 9.1.3 conducted principal component analysis on all samples to establish population structure.
A total of 111,001 reads were obtained in one run of the GS FLX instrument producing 5,346,445 total bases of sequence (average read length of 218 bases) with 99.98% of bases having a quality score of 20 or greater. Across the eight regions of the plate, there were on average 1,982 reads per individual. The third barcode produced many less reads per region (<1,000) amongst all regions. All other barcodes performed very similarly with respect to the number of reads per individual across regions. Only 46% of the number of expected reads (111,001 instead of 240,000) were obtained from sequencing. Prior to sequencing, the amplification success of loci on the beads was checked for quality using a Bioanalyzer 2100, and all samples passed. However, three of the eight regions produced half the expected number of reads and a fourth region produced only 15% the expected number of reads. This indicated local problems in sequencing with respect to particular regions and the samples in those regions rather than the plate as a whole. All control beads passed the filter control with an average percentage of 90% across all regions, whereas the percentage of samples passing the filter control varied between regions and averaged 36%: regions with fewer than expected reads had fewer samples passing the filter control. Two regions had very high failure rates due to mixed samples, indicating more than one amplicon per bead.
Genotyping success of SNP markers using the MASSARRAY multiplex assay.
Number of SNPs
Percentage of SNPs
SNPs called in >95% of F. heteroclitus individuals
SNPs called in <80% of all individuals
SNPs called in >90% but <95% of all individuals
Monomorphic SNPs called in >95% of all individuals
Polymorphic SNPs called in >95% of all individuals
SNPs called in <90% of all individuals identified in 454
SNPs called in >90% of all individuals identified in 454
SNPs which were identified by Sequenom software as low probability in greater than 50% of all individuals were removed (17 SNPs in total). An additional 20 SNPs were excluded from analyses due to their excessive heterozygosity across individuals and populations of F. heteroclitus. These SNPs may represent segmental duplication where the two duplicate regions are identical, except that a SNP has been driven to high frequency or become fixed in one of the duplicates.
Genetic parameters of sampled populations in two species of Fundulus.
% Departure from HWE
New Bedford Harbor
% Departure from HWE
SNPs identified via 454 sequencing did not have genetic parameters that differed from SNPs identified in ESTs with the exception of Hardy-Weinberg equilibrium. 454-derived SNPs had a higher percentage of SNPs not in Hardy-Weinberg equilibrium due to a lack of heterozygosity (22% versus 9%).
Many SNP loci (60%) in F. heteroclitus had a frequency greater than 0.10 and were considered common SNPs (Additional file 1). In contrast, 90% of SNPs in F. grandis had low minor allele frequencies below 0.10.
In F. heteroclitus, AMOVA showed that most of the variation was distributed within populations (59.05%), but another large proportion of variation (31.1%) was distributed among northern and southern regions. The remaining 9.85% of variation was explained by differences among populations within regions. In F. grandis, most of the variation was distributed within populations (82.4%), and a smaller proportion (17.6%) of variation was distributed longitudinally between populations across the Gulf of Mexico.
A Mantel test showed significant isolation by distance among F. heteroclitus populations (p < 0.001) and F. grandis populations (p = 0.032).
We used high throughput sequencing and genotyping technology to identify and verify SNP markers in four non-model species within the Fundulus genus. Genotype data sharply differentiated northern and southern populations of F. heteroclitus as well as other species in this genus (F. grandis, F. majalis, and F. similis). Within the species where SNPs were originally annotated, most can be successfully verified and used to study population structure as well as the role and outcome of selection forces on a genome-wide scale.
Using the 454 FLX pyrosequencing system, we observed 111,001 reads yielding an average of 22× coverage across 1,464 contigs. Read lengths and quality scores were similar to many other studies using the 454 FLX system to sequence uncharacterized genomes [8, 70], but we identified fewer SNPs. Two-hundred and sixty-one SNPs were identified in 96 of these contigs (81 were further verified with the Sequenom MassARRAY platform). The percentage of contigs containing SNPs did differ between experiments: we obtained 0.07% of contigs containing SNPs while pyrosequencing of Eucalyptus ESTs identified 0.05% of contigs containing SNPs  and pyrosequencing of size selected, genomic DNA from swine identified 11.4% of contigs contained SNPs .
Our 454 pyrosequencing of genomic DNA was originally designed to both discover and genotype SNPs within and among populations of F. heteroclitus. Thus, we attempted to perform genome reduction with selective PCR reactions to approximately 200 loci that could be sequenced in 10 populations of 8 individuals. With 30,000 reads per one-eighth of a 454 sequencing plate, each region would have 15× coverage per individual or 980× coverage across all populations, enabling accurate genotype calls for most individuals. However, preselective amplication was not perfect, and many more than 200 loci were sequenced; most amplified only a single time in a single individual (these singlets therefore were not useful for variant detection). Furthermore, we obtained only 46% of the expected number of reads. In the end, these problems led to the inability to directly call individual genotypes. We were hoping to both identify SNPs and genotype individuals in a single step, but a more successful approach (as evidenced by the swine group ) is to make reduced representation libraries from many pooled individuals for SNP discovery followed by individual genotyping. Because a pool of individuals is used, this approach identifies few singlets and thus enhances the number of reads per contig. Furthermore, improvements in both the number and length of reads using the Titanium series FLX 454 system compared to the original FLX system we used will increase the number of identified SNPs.
To increase our ability to measure population genetic parameters within and among populations, we verified SNPs identified through 454 sequencing and additional SNPs annotated from F. heteroclitus cDNAs using the MassARRAY system. Similar percentages of 454 pyrosequencing derived SNPs and SNPs identified from ESTs were verified (80% and 83%, respectively). Of the 458 putative SNPs, 379 (82.75%) were polymorphic, but only 264 had a greater than 90% successful call rate among all individuals. Among F. heteroclitus, most SNPs amplified (61.3% were called in >95% of individuals) indicating that differences in amplification rate between species led to the lower overall call rate. In white spruce, 91% of SNPs verified with the Illumina SNP bead array platform [71, 72] were true. Comparable to F. heteroclitus, 70% of SNPs in spruce were called in greater than 95% of individuals . Overall, verification of SNPs was powerful in providing information over many markers and individuals and was able to provide data to determine differences within populations, between populations and between species.
Species differentiation was demonstrated using principle component analysis (PCA) as well as STRUCTURE analysis. Both analyses showed separation between F. heteroclitus, F. grandis and F. majalis and similis as well as population structure within F. heteroclitus (Figure 4). These analyses provided the most resolution (even among distinguishing populations) in F. heteroclitus because the SNPs were originally identified in this species (i.e., due to an ascertainment bias). PCA and STRUCTURE did not differentiate sister species, F. similis and F. majalis, from each other or establish population structure within these species. Small sample sizes (1 to 10 individuals per population), high levels of monomorphism (average of 28% of all SNPs), and the fact that only 10% of SNP alleles differed between these two species, decreased the power to detect such differences when analyzed in conjunction with F. heteroclitus and F. grandis. Population structure also was masked in F. grandis when data was analyzed with other species. However, when F. grandis individuals were analyzed separately, they also showed distinct population structure (data not shown). One other study has reported multiple fixed differences in mitochondrial sequences between F. heteroclitus and F. grandis , but no other study to date has evaluated differences at many loci between all four species used in this study.
Within F. heteroclitus and F. grandis species, within-population fixation indices (FIS, averaged across all loci) ranged from 0.09 to 0.32. Among F. heteroclitus, all populations had an overall significant deficiency of heterozygotes indicated by positive FIS values. In these populations, approximately 10% of loci had similarly very large FIS values (>0.5) across populations causing the skew in the average FIS value for each population. Within a population, these loci were predominately homozygous for one allele with a complete absence of the heterozygote and one or a few individuals homozygous for the alternative allele. The loci which presented this pattern were called conservatively at both alleles by Sequenom software across all individuals indicating that genotyping error was not the main reason for this pattern.
Furthermore, all northern populations were predominately homozygous for one allele and all southern populations were predominately homozygous for the alternative allele indicating strong demographic patterns in the data. The same demographic pattern was not found in F. grandis. Among F. grandis populations, most (70%) SNPs with high FIS values were different between populations. This is in contrast to F. heteroclitus populations where loci with high FIS values were shared across populations. Within any one F. grandis population, one allele was predominant as a homozygote with one or a few individuals with the alternative homozygote. The most parsimonious explanation is that there is undetected substructure.
SNPs in Hardy-Weinberg were shown to be moderately polymorphic (average of 60%) in F. heteroclitus. In F. grandis, SNPs were shown to lack polymorphism (7.18%). The higher percentage of monomorphic loci in F. grandis likely is due to ascertainment bias in SNP discovery caused by only using F. heteroclitus populations. Many of the monomorphic loci (24%) represent fixed differences between F. heteroclitus and F. grandis. Thus, while SNP markers developed in F. heteroclitus are not necessarily polymorphic in other Fundulus species, they still can be used to differentiate F. heteroclitus from other species.
Among F. heteroclitus populations, genotype data revealed strong latitudinal clines between the Northern and Southern F. heteroclitus populations. PCA, STRUCTURE, FST values, and the isolation by distance test identified that individuals from Northern populations (above 40-41°N) were distinct from Southern populations. This split is centered around the southern-most extent of the Atlantic coastal advancement during the late Pleistocene . Specifically, observed heterozygosity and allelic richness across all loci is significantly lower (p = 0.043, p = 0.042, respectively) in the north than in the south. These differences have been shown previously in morphological features  numerous allozyme loci [34–36, 75] and microsatellites . The larger historical population size of F. heteroclitus in the south  would maintain greater heterozygosity and allelic richness at shared loci; in the north, where population sizes are smaller, loci have a higher probability of becoming fixed.
Four STRUCTURE clusters encompass the six northern populations while only two clusters encompass the five southern populations (Figure 5A). Separate northern clusters may be driven by smaller population sizes in which drift is greater. When genetic drift has a larger effect it becomes easier to distinguish populations because the average difference in allele frequencies of a marker in different populations will be greater. This is illustrated by a larger average FST of 0.20 among northern populations in comparison to that of an average FST of 0.10 among southern populations. This statistic is also evident for the north and south split, where populations from respective regions had an extremely high FST value of 0.44 when compared against one another. Similar genetic divergence has been reported for F. heteroclitus using microsatellites (0.196 among northern populations, 0.117 among southern populations and 0.330 for the two most divergent populations, Nova Scotia and Georgia ). Similar demographic patterns have been described in freshwater fish  and marine species such as goby  and blue crab , and, as in Fundulus, these patterns are attributed to Pleistocene events.
A similar latitudinal cline occurs between populations of F. grandis, and a Mantel test shows significant isolation by distance. However, there were no significant differences between either levels of polymorphism or observed heterozygosity along latitude or longitude. Williams et al., 2008 reported significant isolation by distance as well as decreased allelic richness with increasing latitude. In this 2008 study, microsatellites were used, and two additional sites southern to those used in our study were included. Since microsatellites have many more alleles than SNPs and two additional sites were found to have relatively higher allelic richness in comparison to all other sampling sites along the gulf, this may account for the differences found in levels of polymorphism.
By targeting SNPs contained in both coding and non-coding areas of the genome, we are able to better understand how evolutionary forces are shaping the Fundulus genome. Similar studies using high throughput methods to sequence SNP markers have been developed in Atlantic cod , white spruce , Eucalyptus , and swine . Like our study, these studies expanded their own species' knowledge base with respect to potential markers for studying evolutionary adaptation (in the case of cod and spruce), genome-wide assessment of diversity (Eucalyptus) or for use in breeding programs (swine)
The authors thank G. Bozinovic and M. Everett for assistance in the collection of samples and D. Crawford for valuable input into methodology. Part of this work was carried out by using the resources of the Computational Biology Service Unit from Cornell University which is partially funded by Microsoft Corporation. Funding was partially provided by NIEHS Training Grant ES525163 award from the Department of Environmental and Molecular Toxicology at North Carolina State University to LMW, NIH RO1 ES011588 to MFO, NSF DEB0948510 to ARB, and NIH R01 HG003229 CDB.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.