Snat: a SNP annotation tool for bovine by integrating various sources of genomic information
© Jiang et al; licensee BioMed Central Ltd. 2011
Received: 15 April 2011
Accepted: 7 October 2011
Published: 7 October 2011
Most recently, with maturing of bovine genome sequencing and high throughput SNP genotyping technologies, a large number of significant SNPs associated with economic important traits can be identified by genome-wide association studies (GWAS). To further determine true association findings in GWAS, the common strategy is to sift out most promising SNPs for follow-up replication studies. Hence it is crucial to explore the functional significance of the candidate SNPs in order to screen and select the potential functional ones. To systematically prioritize these statistically significant SNPs and facilitate follow-up replication studies, we developed a bovine SNP annotation tool (Snat) based on a web interface.
With Snat, various sources of genomic information are integrated and retrieved from several leading online databases, including SNP information from dbSNP, gene information from Entrez Gene, protein features from UniProt, linkage information from AnimalQTLdb, conserved elements from UCSC Genome Browser Database and gene functions from Gene Ontology (GO), KEGG PATHWAY and Online Mendelian Inheritance in Animals (OMIA). Snat provides two different applications, including a CGI-based web utility and a command-line version, to access the integrated database, target any single nucleotide loci of interest and perform multi-level functional annotations. For further validation of the practical significance of our study, SNPs involved in two commercial bovine SNP chips, i.e., the Affymetrix Bovine 10K chip array and the Illumina 50K chip array, have been annotated by Snat, and the corresponding outputs can be directly downloaded from Snat website. Furthermore, a real dataset involving 20 identified SNPs associated with milk yield in our recent GWAS was employed to demonstrate the practical significance of Snat.
To our best knowledge, Snat is one of first tools focusing on SNP annotation for livestock. Snat confers researchers with a convenient and powerful platform to aid functional analyses and accurate evaluation on genes/variants related to SNPs, and facilitates follow-up replication studies in the post-GWAS era.
Currently, genome wide association studies (GWAS) have been widely accepted as a primary approach for gene identification concerning complex traits. A subset of SNPs related to the trait of interest can be derived from GWAS at a specified level of statistical significance. To further determine true association findings in GWAS, the common strategy is to sift out the most promising SNPs for follow-up replication studies. Hence it is crucial to explore the functional significance of the candidate SNPs in order to screen and select the potential functional ones. So far, a variety of public bioinformatics databases, e.g., NCBI Entrez Gene , UniProt , Gene Ontology , KEGG PATHWAY  and AnimalQTLdb , etc., contain different aspects of biological information required for SNP functional annotation. However, it is infeasible to mine the relevant research data from these public sources by a single query. Furthermore, it is a daunting task to integrate various sources of functional information of interest among the large assortment of data in a manual fashion.
To deal with SNP annotation, a number of bioinformatics tools have been created. However, the majority of these tools are for humans, e.g., SNPit , SNPnexus , and few tools are available for other species such as bovine. FunctSNP  is currently the only tool available for bovine SNP annotation. Although FunctSNP provides various functions to search and manage annotated data related to SNPs, the main limitations maybe exist: Firstly, it is not straightforward and convenient to annotate even one SNP using FunctSNP, because users should initially construct a local database before SNP annotation and cannot be run online through a web interface, leading to low efficiency and time consuming. Secondly, FunctSNP merely outputs some identifiers relating to some public databases in queries, e.g. GO:0000122 (GO term), bta00340 (KEGG Pathway entry), A7YWP4 (UniProt accession), and more detailed descriptions for these identifiers should be obtained manually via accessing the relevant public domains. Finally, FunctSNP only recognizes the cluster ID (rs#) and accurate position of a SNP in dbSNP, triggering an obviously application limitation since users are usually interested in those SNPs without specific cluster IDs or not included in dbSNP in most cases, such that FunctSNP can not deal with these SNPs under this situation.
Focusing on tackling the above limitations, we developed a SNP annotation tool (Snat) to provide a wide array of functional SNP annotations for bovine. In Snat, the most novelty is that recent versions of information from FTPs and webpages of the public domains have been extracted and integrated to construct a composite database. Furthermore, SNPs at arbitrary positions can be well annotated through a single query, regardless of whether they are involved in the dbSNP database or not.
Snat aims at the design for a novel query scheme which can provide precise and comprehensive annotation for bovine SNPs. This would be helpful for mining potential clues of functional importance. Snat is written in Perl as well as SQL scripts with modular architecture. The design of the query scheme with common interfaces supports multiple options for each annotation task.
Besides web-based application of Snat, we also developed an alternative version of Snat in command-line mode, which can be run locally to query and annotate SNPs. Specifically, via downloading the integrated database from the Snat site as a SQL script, users can construct the local database, which is identical with that of web-based version. Based on the local database, SNP annotation can be conducted to print results into a text file via Perl scripts in a command-line fashion. The raw text report, generated by either the locally running Perl programs or downloaded from Snat website, looks not well-formatted but is created with a specified access rule. Thus it can be readily processed via running a Perl program provided by Snat website to produce annotation data structured in TSV files for further analyses.
Due to incorporating various sources of existing public data into a retrieved database, it is feasible and efficient to obtain annotation data automatically via merely one query in Snat, with seldom need of further re-querying the original public databases manually, facilitating users to rapidly pinpoint these SNPs with functional significance. For further demonstrating the features of Snat, systematic comparisons between Snat and a published tool FunctSNP have been conducted in the following aspects.
First, FunctSNP can only recognize the cluster ID (rs#) or accurate coordinate of a SNP in dbSNP while Snat can deal with an ambiguous coordinate via searching for the nearest SNP or SNPs within a specified distance away from it (see the first example in Additional file 2).
Second, FunctSNP can only run locally and users should construct a local database before SNP annotation, while Snat provides a locally running command-line version as well as online annotation services. Furthermore, FunctSNP outputs annotation information step by step while Snat produces all output data in a single query (see the second example in Additional file 2).
Third, compared with FunctSNP, Snat integrates extra information of GeneRIFs and conserved elements. Additionally, Snat integrates more detailed information for terms of UniProt Protein, GO, KEGG Pathway and QTL than FunctSNP. Specifically, FunctSNP merely outputs the protein accession and name from UniProt while Snat can output its accession, name, length, function, post-translational modification, etc.
Online annotation works well for submitting dozens or hundreds of SNPs in a single query. However, when users need to annotate thousands of SNPs, it is more efficient to use the locally running command-line mode of Snat. Via downloading the integrated database from the Snat site as a SQL script, users can construct the local annotation database. Two Perl programs, named dbSNP.pl and anylocus.pl given by Snat website, can achieve the function of local running. The programs dbSNP.pl and anylocus.pl play the same roles with the modules of "Search SNPs" and "Any Locus" on the webpage, respectively. For demonstrating the application features of local running of Snat, three examples based on command-line mode for performing annotation are given below:
$ dbSNP.pl --db_name db_name --db_user db_user --db_password db_pw --rs_id rs109234250 --all ex1.raw
$ anylocus.pl -db_name name -db_user user -db_password pw -chr_pos 14 440000 -gene symbol name -uniprot acc function -go outputfile
$ dbSNP.pl -db_name name -db_user user -db_password pw -input inputfile -gene gene_id symbol name generifs -go -kegg -omia -option 30k outputfile
The command line options db_name and db_user, db_password denote the name, the user name and the password of the local annotation database respectively. Input data can be rs-identifier (for example, --rs_id rs109234250 in the first command), SNP coordinate (for example, -chr_pos 14 440000 in the second command) or an index file that contains numbers of SNP coordinates (for example, -input inputfile in the third command). Output reports are generated corresponding to the annotation options in the command. For examples, the option --all means that all annotation information should be included in the output report (see the first command); "-gene symbol name -uniprot acc function -go" means that the symbol and name of gene, the accession and function of UniProt protein and GO information should be included in the output report (see the second command).
For illustrating practical significance of Snat in facilitating functional analyses on genes/variants in the post-GWAS era for bovine, Snat was implemented to annotate 20 identified significant SNPs (Additional file 4) associated with milk yield in our recent GWAS . Various genomic information data (e.g., NCBI Gene, UniProt, QTLdb, etc.) related to the SNPs of interest as well as all other SNPs within 5k bp region away from these 20 SNPs were mined and integrated into a raw text (Additional file 5). The raw annotation results were then processed by the Perl program raw2TSV.pl to generate a summary report (see Additional files 6,7) and five subdata files (see Additional files 8,9,10,11,12). From Additional file 7, it can be found that within the 5K bp region from the loci Chr14,76703 and Chr14,443937, a number of novel SNPs (see Additional file 8) are present. Among these SNPs several are either missense mutations or harbored in UTR. Moreover, kinds of functional information about genes around the significant SNPs from UniProt, GO and KEGG Pathway further demonstrate that these statistically significant SNPs also have potentially functional significance. For example, with respect to the locus "Chr14,443937", the annotation data of corresponding protein Q8MK44 show "May be involved in VLDL (very low density lipoprotein) CC assembly" (see Additional file 9), and the information of the corresponding pathway bta00561 shows "Glycerolipid metabolism" (see Additional file 11). All the functional information suggests that the identified SNP "Chr14,443937" likely affects milk production traits and merits follow-up functional validation study. In addition, four QTLs (Additional file 12) related to the significant SNP are associated with milk yield, further showing consistent findings with previous QTL mapping studies.
So far, several dense SNP panels, including the Affymetrix Bovine 10K chip array and the Illumina 50K chip array, have been increasingly implemented to identify causal mutations for economic importance in bovine under the framework of GWAS [13, 14]. However, biological information related to these SNPs has not yet been provided by the chip manufacturers. It is quite daunting for researchers to conduct SNP annotation via querying various public resources manually SNP by SNP. To aid functional analyses and accurate evaluation on genes related to SNPs in GWAS, Snat was implemented herein to create SNP annotation files containing all SNPs corresponding to two commercial SNP chips. The biological information on each of these SNPs, including SNP information from dbSNP, gene information from Entrez Gene, protein features from UniProt, gene function annotations from GO, KEGG PATHWAY and OMIA, linkage information from AnimalQTLdb and conserved elements from UCSC Genome Browser Database, are integrated in these output files, which are publicly available for downloading from Snat site.
To explore the practical feasibility of Snat, several SNP queries under different scenarios are performed for testing running speed of online annotation.
Based on the above tests, the speed of online annotation is practically acceptable as running Snat.
We developed a SNP annotation tool which can provide comprehensive biological information related to arbitrary single nucleotide position across bovine genome. To our best knowledge, this is one of first tools focusing on SNP annotation for livestock except those for humans. Due to incorporating various sources of existing public data into a retrieved database, Snat provides an efficient and concise way for researchers to obtain annotated data of SNPs automatically via merely one query, with seldom need of further re-querying the original public databases manually.
As known to us, traditional GWAS focus on identification of statistical significance of SNPs with limited understanding of functional aspects of SNPs concerning complex traits. It is a pressing need to make an essential bridge between the outcomes from GWAS and the potential information on the function of genes and pathways. Snat offers an opportunity to achieve this goal. By Snat, researchers can sift out those SNPs indentified in GWAS with functional significance of the complex trait of interest in bovine.
We conclude that Snat can act as a complementary tool for aiding further functional analyses on genes/variants and facilitating follow-up replication studies in the post-GWAS era for bovine.
Availability and requirements
Project name: Snat
Project home page: http://animalgenetics.cau.edu.cn/snat/
Operating system(s): Any operating system supporting Perl and MySQL
Programming language: Perl, SQL
Other requirements: Perl DBI and DBD-mysql module
License: Free for non-commercial usage
List of abbreviations
Single Nucleotide Polymorphism
Common Gate Interface
Hypertext Markup Language
Tab Separated Values
Acknowledgements and Funding
We are grateful to the anonymous reviewers for their constructive comments and suggestions that greatly improved our manuscript. This work was supported by Key Development of New Transgenic Breeds Program (2009ZX08009-156B), the National Natural Science Foundations of China (Grant No. 30972092), Beijing Municipal Natural Science Foundation (Grant No. 6102016), New-Century Training Programme Foundation for the Talents by the State Education Commission of China (NETC-10-0783), Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry and the National High Technology Research and Development Program of China (863 Program 2011AA100302).
- Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic Acids Research. 2007, 35 (Database): D26-D31. 10.1093/nar/gkl993.PubMed CentralView ArticlePubMedGoogle Scholar
- Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M: UniProt: the Universal Protein knowledgebase. Nucleic Acids Res. 2004, 32 (Database): D115-119.PubMed CentralView ArticlePubMedGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25 (1): 25-29. 10.1038/75556.PubMed CentralView ArticlePubMedGoogle Scholar
- Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for deciphering the genome. Nucleic Acids Res. 2004, 32 (Database): D277-280.PubMed CentralView ArticlePubMedGoogle Scholar
- Hu ZL, Fritz ER, Reecy JM: AnimalQTLdb: a livestock QTL database tool set for positional QTL information mining and beyond. Nucleic Acids Research. 2007, 35 (Database): D604-D609. 10.1093/nar/gkl946.PubMed CentralView ArticlePubMedGoogle Scholar
- Shen TH, Carlson CS, Tarczy-Hornoch P: SNPit: A federated data integration system for the purpose of functional SNP annotation. Computer Methods and Programs in Biomedicine. 2009, 95 (2): 181-189. 10.1016/j.cmpb.2009.02.010.PubMed CentralView ArticlePubMedGoogle Scholar
- Chelala C, Khan A, Lemoine NR: SNPnexus: a web database for functional annotation of newly discovered and public domain single nucleotide polymorphisms. Bioinformatics. 2009, 25 (5): 655-661. 10.1093/bioinformatics/btn653.PubMed CentralView ArticlePubMedGoogle Scholar
- Goodswen SJ, Gondro C, Watson-Haigh NS, Kadarmideen HN: FunctSNP: an R package to link SNPs to functional knowledge and dbAutoMaker: a suite of Perl scripts to build SNP databases. BMC Bioinformatics. 2010, 11 (1): 311-10.1186/1471-2105-11-311.PubMed CentralView ArticlePubMedGoogle Scholar
- Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001, 29 (1): 308-311. 10.1093/nar/29.1.308.PubMed CentralView ArticlePubMedGoogle Scholar
- Siepel A: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research. 2005, 15 (8): 1034-1050. 10.1101/gr.3715005.PubMed CentralView ArticlePubMedGoogle Scholar
- Rhead B, Karolchik D, Kuhn RM, Hinrichs AS, Zweig AS, Fujita PA, Diekhans M, Smith KE, Rosenbloom KR, Raney BJ: The UCSC Genome Browser database: update 2010. Nucleic Acids Research. 2009, 38 (Database): D613-D619.PubMed CentralView ArticlePubMedGoogle Scholar
- Nicholas FW: Online Mendelian Inheritance in Animals (OMIA): a comparative knowledgebase of genetic disorders and other familial traits in non-laboratory animals. Nucleic Acids Research. 2003, 31 (1): 275-277. 10.1093/nar/gkg074.PubMed CentralView ArticlePubMedGoogle Scholar
- Jiang L, Liu J, Sun D, Ma P, Ding X, Yu Y, Zhang Q: Genome wide association studies for milk production traits in Chinese Holstein population. PLoS One. 2010, 5 (10): e13661-10.1371/journal.pone.0013661.PubMed CentralView ArticlePubMedGoogle Scholar
- Snelling WM, Allan MF, Keele JW, Kuehn LA, McDaneld T, Smith TPL, Sonstegard TS, Thallman RM, Bennett GL: Genome-wide association study of growth in crossbred beef cattle. Journal of Animal Science. 2009, 88 (3): 837-848.View ArticlePubMedGoogle Scholar