HaploSNPer: a web-based allele and SNP detection tool
© Tang et al; licensee BioMed Central Ltd. 2008
Received: 22 October 2007
Accepted: 28 February 2008
Published: 28 February 2008
Single nucleotide polymorphisms (SNPs) and small insertions or deletions (indels) are the most common types of polymorphism and are frequently used for molecular marker development. Such markers have become very popular for all kinds of genetic analysis, including haplotype reconstruction. Haplotypes can be reconstructed for whole chromosomes but also for specific genes, based on the SNPs present. Haplotypes in the latter context represent the different alleles of a gene. The computational approach to SNP mining is becoming increasingly popular because of the continuously increasing number of sequences deposited in databases, which allows a more accurate identification of SNPs. Several software packages have been developed for SNP mining from databases. Of these, QualitySNP is the only tool that combines SNP detection with the reconstruction of alleles, which results in a lower number of false positive SNPs and also works much faster than other programs. We have built a web-based SNP discovery and allele detection tool (HaploSNPer) based on QualitySNP.
HaploSNPer is a flexible web-based tool for detecting SNPs and alleles in user-specified input sequences from both diploid and polyploid species. It includes BLAST for finding homologous sequences in public EST databases, CAP3 or PHRAP for aligning them, and QualitySNP for discovering reliable allelic sequences and SNPs. All possible and reliable alleles are detected by a mathematical algorithm using potential SNP information. Reliable SNPs are then identified based on the reconstructed alleles and on sequence redundancy.
Thorough testing of HaploSNPer (and the underlying QualitySNP algorithm) has shown that EST information alone is sufficient for the identification of alleles and that reliable SNPs can be found efficiently. Furthermore, HaploSNPer supplies a user-friendly interface for visualization of SNPs and alleles. HaploSNPer is available from http://www.bioinformatics.nl/tools/haplosnper/.
Single nucleotide polymorphisms (SNPs) and small insertions or deletions (indels) are the most common types of genetic polymorphism and are well suited for molecular marker development because of their abundance within the genome and their slow mutation rate. SNPs have become very popular for haplotype reconstruction. The common approach to haplotype reconstruction is to partition the genome into blocks that are in high linkage disequilibrium, based on the analysis of genome-wide SNPs, and to structure the haplotypes in each block with limited diversity. This approach has been implemented in a number of programs, such as HAP.
Haplotypes can also be reconstructed for specific genes, based on the SNPs present in the gene. Haplotypes in this context represent the different alleles of a gene. It is possible to reconstruct alleles with SNPs that are identified in multiple EST sequences of specific genes: several closely linked SNPs from EST sequences of a gene can completely define the alleles of that gene [5–7]. A set of SNPs discriminating all identified alleles can be used to study the association between candidate gene genotypes and phenotypes, and to select individuals with specific genotypes. The computational approach to SNP mining is increasingly fruitful because of the continuously increasing number of sequences deposited in databases. Several software packages have been developed to mine for SNPs [8–12]. However, public repositories often do not store the trace or quality files of the sequences they hold. Some tools [8, 9, 11, 12] cannot process publicly available sequences because they require sequences with trace or quality files, or even the corresponding genomic sequences. Only a few software tools can detect SNPs from sequence information alone [5, 10]. QualitySNP is the only tool that combines SNP detection with the reconstruction of alleles from public EST data without the requirement for trace/quality files or genomic sequences. In this program, gene haplotypes representing alleles are defined by a mathematical algorithm based on potential polymorphisms. Reliable SNPs are identified using these reconstructed alleles and a confidence score that is calculated from sequence redundancy in high and low quality regions. SNPServer is the only web-based tool for SNP discovery that permits the real-time detection of SNPs related to any specified sequence of interest. This tool builds on autoSNP, which uses the frequency of occurrence of a polymorphism and the co-segregation of multiple SNPs to identify reliable SNPs.
However, autoSNP cannot distinguish paralogs, which are a major cause of false SNP detection [5, 14]. Moreover, compared to QualitySNP, autoSNP detects many more false positive SNPs and requires more calculation time, in particular for large datasets.
In this paper we describe HaploSNPer, a web-based tool for the reliable detection of alleles and SNPs. It is based on finding homologous sequences in user-specified sequence databases using a user-supplied seed sequence, or on a collection of input sequences. HaploSNPer combines the QualitySNP algorithm with database search and sequence alignment tools into an efficient pipeline.
1. Haplotype reconstruction
In the QualitySNP algorithm, a potential haplotype is defined as a group of sequences within a cluster that have the same nucleotide at every polymorphic site. Haplotype reconstruction proceeds in two steps. First, the similarity between a candidate sequence and a haplotype group is calculated at each single potential SNP and compared with a threshold to determine whether the nucleotide at that SNP position is identical in the candidate sequence and the haplotype group. Second, the similarity over all potential SNPs is compared with a second threshold to determine whether the candidate sequence can be reliably assigned to the haplotype group. By using the similarity per polymorphic site as well as the similarity over all polymorphic sites, and by setting appropriate threshold values, the alleles represented by the haplotype groups can be reconstructed reliably even from sequences containing sequencing errors.
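The two-threshold test described above can be illustrated with a minimal Python sketch. It is not the QualitySNP implementation: the function names, the definition of per-site similarity as the fraction of group members sharing the candidate's nucleotide, the definition of overall similarity as the fraction of matching sites, and the overall threshold of 0.8 are all assumptions made for illustration. Only the 75% per-site default is taken from the text.

```python
from collections import Counter


def site_similarity(candidate_base, group_bases):
    """Fraction of group members that share the candidate's nucleotide at one site.

    Assumed definition, chosen for illustration only.
    """
    if not group_bases:
        return 0.0
    return Counter(group_bases)[candidate_base] / len(group_bases)


def can_join(candidate, group, snp_sites, site_threshold=0.75, overall_threshold=0.8):
    """Decide whether an aligned candidate sequence fits a haplotype group.

    candidate         -- aligned sequence (string)
    group             -- list of aligned sequences already in the haplotype group
    snp_sites         -- alignment columns of the potential SNPs
    site_threshold    -- per-site similarity threshold (0.75 is the default in the text)
    overall_threshold -- threshold over all sites (0.8 is a placeholder, not from the text)
    """
    if not snp_sites:
        return True  # no polymorphic sites, so nothing separates the sequences
    matching = 0
    for pos in snp_sites:
        group_bases = [seq[pos] for seq in group]
        # Step 1: does the candidate agree with the group at this site?
        if site_similarity(candidate[pos], group_bases) >= site_threshold:
            matching += 1
    # Step 2: does it agree at enough sites overall?
    return matching / len(snp_sites) >= overall_threshold


# One group member carries a sequencing error at column 2 (T instead of G).
group = ["ACGTA", "ACGTA", "ACGTA", "ACTTA"]
sites = [1, 2, 4]
same_allele = can_join("ACGTA", group, sites)
other_allele = can_join("ATTTC", group, sites)
```

In this toy alignment the first candidate still joins the group despite the error in one member, because three of four members agree with it at column 2, which just meets the 0.75 per-site threshold.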
2. Recognition of paralogs
Clusters containing paralogous sequences can be expected to contain more polymorphisms than clusters with only allelic sequences. A method based on the number and frequency of polymorphisms, such as that implemented in POLYBAYES, may therefore separate paralogs from alleles. However, some EST clusters show a larger than average number of SNPs because some genes or regions of genes evolve more rapidly than others. These SNPs represent allelic variation but will be mistaken for variation between paralogs by such an approach. The haplotypes that are initially identified by QualitySNP are potential haplotypes that may be groups of alleles and paralogs. In QualitySNP, paralogs are distinguished from alleles based on the difference in SNP numbers between the potential haplotypes of the same cluster. The standard deviation (D-value) of the number of potential SNPs among haplotypes in a cluster is calculated and used to assess the probability that the cluster contains haplotypes that are in fact not alleles but paralogs.
With increasing D-value the difference in number of SNPs among haplotypes is larger, so there is a higher probability of including alleles as well as paralogs in the cluster. At lower D-values, the probability that all the potential haplotypes found by QualitySNP are indeed alleles of a gene is higher, so most clusters with low D-values will contain few or no paralogs.
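As a rough illustration of the D-value, the following Python sketch computes the standard deviation of per-haplotype SNP counts for one cluster. The function name and the input format are hypothetical, and the text does not state whether QualitySNP uses the population or the sample standard deviation; the population form is assumed here.

```python
from statistics import pstdev


def d_value(snp_counts):
    """Standard deviation of the number of potential SNPs per haplotype in a cluster.

    snp_counts -- one SNP count per potential haplotype (how the counts are
                  obtained is outside this sketch).
    Assumes the population standard deviation; the source only says
    "standard deviation".
    """
    if len(snp_counts) < 2:
        return 0.0  # a single haplotype has no spread
    return pstdev(snp_counts)


# Haplotypes with similar SNP counts: low D-value, consistent with alleles only.
allelic_like = d_value([2, 3, 2])
# One haplotype with many more SNPs than the rest: high D-value, possible paralog.
paralog_like = d_value([1, 1, 7])
```

With these made-up counts the second cluster has the larger D-value, which matches the interpretation given above: the larger the spread in SNP numbers among haplotypes, the more likely the cluster mixes alleles and paralogs.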
Several parameters can be set to tailor the performance and output of HaploSNPer to the specific requirements of the user (Figure 2). A database corresponding to the species of interest can be selected as the target database from the list provided by HaploSNPer. These databases contain all publicly available EST sequences extracted from the EMBL database and will be updated regularly. Currently HaploSNPer links to databases of nine animal and thirteen plant species. CAP3 or PHRAP can be chosen for sequence alignment. For SNP mining, CAP3 is recommended as it uses individual sequence overlap for cluster construction, while PHRAP tends to extend the consensus sequence by overlap. However, PHRAP is much faster than CAP3. BLASTN, the program of the widely used BLAST package for searching DNA databases for sequence similarities, is used to search for sequences similar to the input (seed) sequence. The BLASTN E-value is used as the cut-off for selecting significantly similar sequences. A series of E-value thresholds has been tested on several sequences; an E-value of 1e-60 usually results in the selection of sequences sufficiently similar to the seed sequence, and this value is set as default in HaploSNPer. As the E-value of BLASTN increases with the increasing size of both the query and the database, the threshold can be adjusted by the user according to the outcome of trial runs using different E-values and according to their experience.
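The effect of the E-value cut-off can be sketched in Python as a filter over BLAST hits. The sketch assumes hits are available in the standard 12-column tabular BLAST format, in which the E-value is the eleventh column; whether HaploSNPer uses this format internally is not stated here, and the function name and sequence identifiers are hypothetical.

```python
def filter_hits(tabular_lines, max_evalue=1e-60):
    """Keep BLAST hits whose E-value is at or below the cut-off.

    tabular_lines -- lines in 12-column tabular BLAST format (assumed input):
                     query, subject, %identity, length, mismatches, gap opens,
                     q.start, q.end, s.start, s.end, E-value, bit score
    max_evalue    -- cut-off; 1e-60 is the HaploSNPer default named in the text
    Returns a list of (subject_id, evalue) tuples.
    """
    hits = []
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        evalue = float(fields[10])
        if evalue <= max_evalue:
            hits.append((fields[1], evalue))
    return hits


# Made-up hits against a seed sequence; identifiers are placeholders.
example = [
    "seed\tEST_0001\t99.1\t450\t4\t0\t1\t450\t12\t461\t1e-120\t440",
    "seed\tEST_0002\t95.0\t300\t15\t0\t1\t300\t5\t304\t3e-75\t290",
    "seed\tEST_0003\t88.0\t120\t14\t1\t30\t149\t1\t120\t2e-20\t95.3",
]
strict = filter_hits(example)                    # default cut-off 1e-60
relaxed = filter_hits(example, max_evalue=1e-10) # a looser trial setting
```

Running the same hit list with several cut-offs, as in the last two lines, mirrors the trial runs suggested above for choosing a threshold that suits a given query and database size.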
Several parameters control the performance of the QualitySNP part of the pipeline. The threshold for similarity per polymorphic site can be set based on the expected percentage of good quality sequences; a threshold of 75% achieved satisfactory results in our previous study and is also the default value in HaploSNPer. The threshold for similarity over all polymorphic sites can be set according to the (assumed) similarity between the alleles. Setting these thresholds too low may result in several different alleles or even paralogous sequences being classified as a single haplotype, while setting them too high will result in the separation of allelic sequences into different haplotypes because of sequencing errors.
The extent of the low-quality regions at the 5' and 3' ends of the sequences can be specified; these regions require more redundant information than high-quality regions. Based on examination of public EST sequences, we found that the 5' low-quality region is generally around 30 nucleotides in length, while the 3' low-quality region is about 20% of the sequence length.
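The region boundaries given above can be expressed as a small Python helper. This is only a sketch of the stated rule of thumb (about 30 nucleotides at the 5' end, about 20% of the read at the 3' end); the function name, the 0-based coordinates, and the rounding of the 3' boundary are assumptions, and the extra redundancy required inside these regions is not modelled.

```python
def is_low_quality(pos, seq_len, five_prime_len=30, three_prime_fraction=0.20):
    """Return True if a position falls in an assumed low-quality end region.

    pos                  -- 0-based position counted from the 5' end of the read
    seq_len              -- length of the read
    five_prime_len       -- extent of the 5' low-quality region (about 30 nt in the text)
    three_prime_fraction -- share of the read treated as 3' low quality (about 20%)
    """
    three_prime_len = round(seq_len * three_prime_fraction)
    three_prime_start = seq_len - three_prime_len
    return pos < five_prime_len or pos >= three_prime_start


# For a 500-nt read: positions 0-29 and 400-499 are treated as low quality.
read_length = 500
flags = [is_low_quality(p, read_length) for p in (0, 29, 30, 399, 400, 499)]
```

Both extents are exposed as parameters because the text describes them as user-specifiable settings rather than fixed constants.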
HaploSNPer reports the D-value for every cluster, but does not use it as a selection criterion. As discussed in the previous section, higher D-values indicate a higher probability that a cluster contains paralogous as well as allelic sequences. In our previous study on a potato dataset, clusters with a D-value below 0.6 were shown to be generally composed of allelic sequences only. The choice of an appropriate threshold for the D-value requires additional study for each specific application; however, the D-value can always be used to rank clusters by the probability that they contain paralogous sequences.
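Ranking clusters by a reported D-value can be sketched as follows. The function name, the cluster names and the D-values are made up, and the 0.6 cut-off is the value observed for the potato dataset mentioned above, used here only as an example flag for manual review and not as a general recommendation.

```python
def rank_clusters(d_values, cutoff=0.6):
    """Order clusters from lowest to highest D-value and flag those below a cut-off.

    d_values -- mapping of cluster name to reported D-value
    cutoff   -- illustrative threshold (0.6 comes from the potato study in the
                text; other datasets may need a different value)
    Returns a list of (cluster, d_value, below_cutoff) tuples.
    """
    ranked = sorted(d_values.items(), key=lambda item: item[1])
    return [(name, d, d < cutoff) for name, d in ranked]


# Hypothetical D-values for three clusters.
report = rank_clusters({"cluster_A": 1.2, "cluster_B": 0.1, "cluster_C": 0.5})
```

Clusters at the top of the resulting list are the least likely to contain paralogs; those at the bottom deserve a closer look before their SNPs are used as markers.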
Results and discussion
The output produced by HaploSNPer consists of three parts. The first part displays the parameter settings; the second part lists the information on clusters and haplotypes together with the statistical information on SNPs; and the third part displays the haplotypes, SNPs and sequence alignment for each cluster.
Through extensive testing we have shown that HaploSNPer (and the underlying QualitySNP algorithm) can efficiently detect reliable SNPs, reconstruct haplotypes and therefore identify different alleles using only EST sequence information. Furthermore, HaploSNPer supplies a user-friendly interface for visualization of SNPs and alleles, which supports the selection of informative SNPs and allele-specific markers.
Availability and requirements
Project name: HaploSNPer
Project home page: http://www.bioinformatics.nl/tools/haplosnper/
Operating system(s): platform independent;
Programming languages: C, PHP, C-shell
Any restrictions to use by non-academics: none
The authors wish to thank Harm Nijveen for testing HaploSNPer and giving valuable suggestions for its improvement.
- Syvanen AC: Accessing genetic variation: genotyping single nucleotide polymorphisms. Nature Reviews Genetics. 2001, 2: 930-942. 10.1038/35103535.
- The International HapMap Consortium: A haplotype map of the human genome. Nature. 2005, 437: 1299-1320. 10.1038/nature04226.
- Halperin E, Eskin E: Haplotype reconstruction from genotype data using imperfect Phylogeny. Bioinformatics. 2004, 20: 1842-1849. 10.1093/bioinformatics/bth149.
- Rafalski A: Applications of single nucleotide polymorphisms in crop genetics. Current Opinion in Plant Biology. 2002, 5: 94-100. 10.1016/S1369-5266(02)00240-6.
- Tang JT, Vosman B, Voorrips RE, van der Linden GC, Leunissen JAM: QualitySNP: a pipeline for detecting single nucleotide polymorphisms and insertions/deletions in EST data from diploid and polyploid species. BMC Bioinformatics. 2006, 7: 438-453. 10.1186/1471-2105-7-438.
- Russell J, Booth A, Fuller J, Harrower B, Hedley P, Machray G, Powell W: A comparison of sequence-based polymorphism and haplotype content in transcribed and anonymous regions of the barley genome. Genome. 2004, 47: 389-398.
- Schneider K, Weisshaar B, Borchardt DC, Salamini F: SNP frequency and allelic haplotype structure of Beta vulgaris expressed genes. Molecular Breeding. 2001, 8: 63-74. 10.1023/A:1011902916194.
- Picoult-Newberg L, Ideker TE, Pohl MG, Taylor S, Donaldson MA, Nickerson DA, Boyce JM: Mining SNP from EST databases. Genome Research. 1999, 9: 167-174.
- Buetow KH, Edmonson MN, Cassidy AB: Reliable identification of large numbers of candidate SNP from public EST data. Nature Genetics. 1999, 21: 323-325. 10.1038/6851.
- Barker G, Batley J, O'Sullivan H, Edwards KJ, Edwards D: Redundancy based detection of sequence polymorphisms in expressed sequence tag data using autoSNP. Bioinformatics. 2003, 19: 421-422. 10.1093/bioinformatics/btf881.
- Dantec LL, Chagné D, Pot D, Cantin O, Garnier-Géré P, Bedon F, Frigerio JM, Chaumeil P, Léger P, Garcia V, et al: Automated SNP detection in expressed sequence tags: statistical considerations and application to maritime pine sequences. Plant Molecular Biology. 2004, 54: 461-470. 10.1023/B:PLAN.0000036376.11710.6f.
- Weckx S, Del Favero J, Rademakers R, Claes L, Cruts M, De Jonghe P, Van Broeckhoven C, De Rijk P: novoSNP, a novel computational tool for sequence variation discovery. Genome Research. 2005, 15: 436-442. 10.1101/gr.2754005.
- Savage D, Batley J, Erwin T, Logan E, Love CG, Lim GA, Mongin E, Barker G, Spangenberg G, Edwards D: SNPServer: a real-time SNP discovery tool. Nucleic Acids Research. 2005, 33: W493-495. 10.1093/nar/gki462.
- Marth GT, Korf I, Yandell MD, Yeh RT, Gu Z, Zakeri H, Stitziel NO, Hillier LD, Kwok P, Gish WR: A general approach to single-nucleotide polymorphism discovery. Nature Genetics. 1999, 23: 452-456. 10.1038/70570.
- Altschul SF, Madden T, Schaeffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
- Huang X, Madan A: CAP3: a DNA sequence assembly program. Genome Research. 1999, 9: 868-877. 10.1101/gr.9.9.868.
- Phrap. [http://www.phrap.org/]
- Ewing B, Green P: Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research. 1998, 8: 186-194.
- Smit AFA, Hubley R, Green P: [http://repeatmasker.org/]
- Malde K, Coward E, Jonassen I: A graph based algorithm for generating EST consensus sequences. Bioinformatics. 2005, 21: 1371-1375. 10.1093/bioinformatics/bti184.
- Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research. 2007, 35: D61-D65. 10.1093/nar/gkl842.
- Van Herpen T, Goryunova S, van der Schoot J, Mitreva M, Salentijn E, Vorst O, Schenk M, van Veelen P, Koning F, van Soest L, et al: Alpha-gliadin genes from the A, B, and D genomes of wheat contain different sets of celiac disease epitopes. BMC Genomics. 2006, 7: 1-13. 10.1186/1471-2164-7-1.
- HaploSNPer. [http://www.bioinformatics.nl/tools/haplosnper/manuals/HaploSNPer_manual.html]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.