PoPoolation DB: a user-friendly web-based database for the retrieval of natural polymorphisms in Drosophila
© Pandey et al; licensee BioMed Central Ltd. 2011
Received: 1 November 2010
Accepted: 2 March 2011
Published: 2 March 2011
The enormous potential of natural variation for the functional characterization of genes has long been neglected. Only recently have functional geneticists started to account for natural variation in their analyses. With the new sequencing technologies it has become feasible to collect sequence information for multiple individuals on a genomic scale. In particular, sequencing pooled DNA samples has been shown to provide a cost-effective approach for characterizing variation in natural populations. While a range of software tools has been developed for mapping these reads onto a reference genome and extracting SNPs, linking this information to population genetic estimators and functional information still poses a major challenge to many researchers.
We developed PoPoolation DB, a user-friendly integrated database. PoPoolation DB links variation in natural populations with functional information, allowing a wide range of researchers to take advantage of population genetic data. PoPoolation DB provides the user with population genetic parameters (Watterson's θ or Tajima's π), Tajima's D, SNPs, allele frequencies and indels in regions of interest. The database can be queried by gene name, chromosomal position, or a user-provided query sequence or GTF file. We anticipate that PoPoolation DB will be a highly versatile tool for functional geneticists as well as evolutionary biologists.
PoPoolation DB, available at http://www.popoolation.at/pgt, provides an integrated platform for researchers to investigate natural polymorphism and associated functional annotations from the UCSC and FlyBase genome browsers, population genetic estimators and RNA-Seq information.
The functional implications of natural variation have been a long-standing interest of evolutionary biologists. Nevertheless, only recently have functional biologists started to recognize that natural variation could also provide important insights into the function of genes. Naturally occurring alleles could be viewed as the outcome of a large-scale mutagenesis experiment focusing on mutations with a smaller effect and/or different functionality. Some functional studies have successfully accounted for natural variation in their analyses [1, 2]. The new sequencing technologies provide an unprecedented opportunity to collect sequence information on a genomic scale for a large number of individuals. In particular, when pools of genomic DNA are sequenced, it has become feasible to collect population variation data on a genomic scale within the budget of a typical research grant. In the wake of the new sequencing technologies, new software has also been developed for mapping short sequence reads onto a reference genome and extracting SNPs. Nevertheless, linking this information with functional information (exons, introns, transcription levels etc.) still poses a major challenge to many researchers. Furthermore, it is difficult for non-experts to extract population genetic estimators from the vast amounts of sequencing data.
We developed PoPoolation DB (http://www.popoolation.at/pgt/), a user-friendly and integrated resource that links variation in natural populations with functional information, allowing a wide range of researchers to take advantage of population genetic data.
PoPoolation DB allows the retrieval of polymorphism data from pooled NGS sequence data using new statistical approaches to obtain population genetic parameters from pooled data. PoPoolation DB can be queried by gene name, chromosomal position, or a user-provided query sequence or GTF file.
Construction and content
Database and web interface development
Fly samples and Sequencing
Genomic DNA was extracted from a pool of flies consisting of five females from each of 113 isofemale lines collected in 2008 in Povoa de Varzim (Northern Portugal). DNA was extracted from homogenized pooled flies with the Qiagen DNeasy Blood and Tissue Kit (Qiagen, Hilden, Germany). Sequencing followed standard Illumina protocols using the Paired End Cluster Generation Kit v2 and Sequencing Kits v3 on a Genome Analyzer IIx. Image analysis was performed with the Firecrest, Bustard and Gerald modules of the Illumina pipeline v. 1.4.
Reads were trimmed on both ends for base quality 20 and minimum sequence length of 40 base pairs. The trimmed reads were mapped to the D. melanogaster reference genome (v. 5.18) using the global alignment algorithm implemented in bwa [5, 6]. Mapping parameters were: seeding disabled, error rate of 1% (-n 0.01), maximum of two gap openings (-o 2) and gap extension of a maximum of 12 bases (-e 12, -d 12). Paired-end reads were further processed with the sampe module in bwa enabling the Smith-Waterman algorithm for the unmapped mate and allowing a maximum of 500 bp between the read pairs. Reads left without a pair were removed. Only reads with a minimum mapping quality of 20 were used. The final pileup output was stored in PoPoolation DB.
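The mapping settings described above could be expressed as bwa invocations along the following lines. This is a minimal sketch that only assembles argument lists and runs nothing. The file names are placeholders, and the seed length of 200 used to disable seeding is an assumption (bwa disables seeding when the seed is longer than the read); the text gives only the other values. Smith-Waterman alignment of the unmapped mate is enabled by default in sampe, so no flag is needed for it.

```python
def bwa_commands(reference, reads_1, reads_2, seed_length=200):
    """Assemble bwa argument lists for paired-end mapping (nothing is executed)."""
    aln_options = [
        "-l", str(seed_length),  # assumed value: a seed longer than the reads disables seeding
        "-n", "0.01",            # error rate setting quoted in the text
        "-o", "2",               # at most two gap openings
        "-e", "12",              # gap extension settings quoted in the text
        "-d", "12",
    ]
    sai_1 = reads_1 + ".sai"     # placeholder output names
    sai_2 = reads_2 + ".sai"
    return [
        ["bwa", "aln", *aln_options, "-f", sai_1, reference, reads_1],
        ["bwa", "aln", *aln_options, "-f", sai_2, reference, reads_2],
        # -a 500: at most 500 bp between the reads of a pair
        ["bwa", "sampe", "-a", "500", "-f", "mapped.sam",
         reference, sai_1, sai_2, reads_1, reads_2],
    ]


for command in bwa_commands("dmel_reference.fasta", "reads_1.fastq", "reads_2.fastq"):
    print(" ".join(command))
```

Filtering for properly paired reads with a mapping quality of at least 20 and generating the pileup would follow as separate steps.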
In order to link natural polymorphism with the genomic annotation of the reference genome, we downloaded the Drosophila melanogaster reference genome FASTA sequences (release 5) and the GFF3 annotation file (v. 5.32) from the FlyBase database.
Utility and Discussion
Population Variation Parameters and Tajima's D
Natural variation is typically measured by pairwise sequence comparisons (Tajima's π) or the number of segregating sites in a sample of DNA sequences (Watterson's θ). Under neutrality and constant population size both estimators are unbiased. As selection and demographic events affect the two estimators differently, the weighted contrast between Tajima's π and Watterson's θ (Tajima's D) is frequently used to infer selection and/or population size changes. For any genomic region of interest, users can infer the two population genetic variation parameters Watterson's θ or Tajima's π as well as Tajima's D (Figure 3, Figure 4, Figure 5). As each of these measurements is calculated with sliding windows over the specified genomic region, the user can define parameter thresholds describing properties of the windows and of the data within each window. Furthermore, users can define whether they are interested in calculating the specified measure in the regions flanking the fragment of interest, e.g. 100 bp at the 5' end and 3' end of gene CG2714 (default: 0 bp). PoPoolation DB provides the user with Watterson's θ, Tajima's π and Tajima's D values pre-calculated using the default parameters.
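To make the relationship between the three statistics concrete, the sketch below computes the classical versions of π, Watterson's θ and Tajima's D for a sample of n individually sequenced chromosomes. It is an illustration only: PoPoolation DB relies on estimators adapted to pooled data, which apply corrections that are not reproduced here, and the function name and input layout are assumptions.

```python
import math


def classical_estimators(allele_counts, n):
    """Return (pi, theta_w, tajimas_d) for one window.

    allele_counts: count (1..n-1) of one allele at each biallelic segregating site
    n: number of sampled sequences
    """
    s = len(allele_counts)
    if s == 0:
        return 0.0, 0.0, float("nan")

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))

    # Average number of pairwise differences, summed over sites
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in allele_counts)
    # Watterson's estimator from the number of segregating sites
    theta_w = s / a1

    # Variance terms from Tajima (1989)
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    if variance <= 0:
        return pi, theta_w, float("nan")

    return pi, theta_w, (pi - theta_w) / math.sqrt(variance)


# An excess of rare alleles gives a negative D, intermediate frequencies a positive D
print(classical_estimators([1, 1, 1, 1], n=10))
print(classical_estimators([5, 5, 5, 5], n=10))
```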
Currently, the size of the fragment that can be queried is limited to 100,000 bp. PoPoolation DB also provides 95% and 99% quantiles for each population genetic estimator and chromosome. For computational efficiency, PoPoolation DB does not recalculate the quantiles each time a fragment is queried. Rather, pre-calculated quantiles for window sizes in a range between 500 bp and 50,000 bp are stored in the database. PoPoolation DB automatically displays the quantiles for the window size closest to the user-defined one.
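The lookup of the closest pre-calculated window size can be pictured as follows. This is a sketch under stated assumptions: the list of stored window sizes is hypothetical (the text gives only the range of 500 bp to 50,000 bp), and breaking ties towards the smaller window is an arbitrary choice.

```python
# Hypothetical set of window sizes with stored quantiles
PRECOMPUTED_WINDOW_SIZES = [500, 1000, 5000, 10000, 50000]


def closest_precomputed_window(requested, available=PRECOMPUTED_WINDOW_SIZES):
    """Return the stored window size nearest to the requested one.

    Ties are resolved in favour of the smaller window (an assumption).
    """
    return min(available, key=lambda size: (abs(size - requested), size))


print(closest_precomputed_window(1200))
```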
The following is a list and a description of the window parameter thresholds:
Window Size: Length in bp of the window within which the measurement will be calculated (default: 1,000 bp).
Step Size: Number of bp by which the window is moved along the chromosome, e.g. for windows of 1,000 bp a step size of 100 bp implies that two neighboring windows overlap by 900 bp (default: 100 bp).
Minimum Count: Every identified SNP requires at least two alleles where each allele has to occur at least 'Minimum Count' times (default: 2).
Minimum Quality Score: Minimum base quality required for a base to be considered in the analysis (default: 20).
Minimum Coverage: Minimum coverage threshold, below which the measurement is not calculated and no SNPs will be identified (default: 4).
Maximum Coverage: Maximum coverage threshold above which the measurement is not calculated and no SNPs will be identified (default: 300).
Minimum Covered Fraction: Proportion of the window that should have sufficient coverage, i.e. greater than or equal to the minimum coverage and less than or equal to the maximum coverage (default: 0.6).
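The way these thresholds interact can be sketched in a few lines. The helper names below are illustrative and are not taken from PoPoolation DB; emitting only full-length windows is an assumption, and base-quality filtering is assumed to have happened before the allele counts are passed in.

```python
def sliding_windows(start, end, window_size=1000, step_size=100):
    """Yield (window_start, window_end), 1-based inclusive, full windows only."""
    position = start
    while position + window_size - 1 <= end:
        yield position, position + window_size - 1
        position += step_size


def covered_fraction(coverages, min_coverage=4, max_coverage=300):
    """Fraction of positions whose coverage lies within [min_coverage, max_coverage]."""
    if not coverages:
        return 0.0
    usable = sum(1 for c in coverages if min_coverage <= c <= max_coverage)
    return usable / len(coverages)


def window_passes(coverages, min_covered_fraction=0.6, **coverage_limits):
    """True if enough of the window is sufficiently covered."""
    return covered_fraction(coverages, **coverage_limits) >= min_covered_fraction


def is_snp(allele_counts, min_count=2):
    """True if at least two alleles each occur at least min_count times.

    allele_counts: dict mapping base to the number of reads supporting it
    """
    return sum(1 for count in allele_counts.values() if count >= min_count) >= 2


print(list(sliding_windows(1, 1200)))
print(is_snp({"A": 10, "T": 2}))
```

With the defaults, the first two windows span positions 1 to 1,000 and 101 to 1,100, which share 900 positions, matching the Step Size example in the list.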
SNP and Codon Information
PoPoolation DB provides SNP and codon information for the genomic region of interest in display and download mode. The table can be downloaded in a tabular format. For each position in the queried region, PoPoolation DB prints the annotated features available for the fragment (e.g. introns, CDS) as stored in the GFF3 files in FlyBase. It also provides a hyperlink to the FlyBase entry of the gene in which the feature is located.
For each SNP in the region of interest, allele counts, coverage and the character state in the D. melanogaster reference genome are provided (Figure 6). The SNP table also contains information about the amino acid in the reference genome and whether a polymorphism is silent or alters the amino acid (e.g. position X-2582776: reference codon GAT, reference amino acid Asp; variant codon TAT, variant amino acid Tyr).
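The distinction between silent and amino-acid-altering polymorphisms can be illustrated by translating the reference and variant codons with the standard genetic code. The function name and the returned labels below are illustrative and not part of PoPoolation DB; the sketch returns one-letter amino acid codes, whereas the example above uses three-letter codes.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
_BASES = "TCAG"
_AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def classify_codon_change(reference_codon, variant_codon):
    """Return (reference_aa, variant_aa, label) for a pair of codons."""
    reference_aa = CODON_TABLE[reference_codon.upper()]
    variant_aa = CODON_TABLE[variant_codon.upper()]
    label = "synonymous" if reference_aa == variant_aa else "non-synonymous"
    return reference_aa, variant_aa, label


# The example from the text: GAT (Asp, D) to TAT (Tyr, Y)
print(classify_codon_change("GAT", "TAT"))
```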
PoPoolation DB also prints a table with the indel information of the region of interest (Figure 7). The table can be downloaded in tabular format. For each reference position having an indel, PoPoolation DB shows the frequency of the indel and the nucleotide sequence that is added or deleted. Deletions are marked with a minus (e.g. -3AGG) and insertions with a plus (e.g. +4ATCG). For positions with complex indels (e.g. several different insertions at the same reference position), PoPoolation DB prints a separate row for each type of change (e.g. rows 10 to 12 in Figure 7). The position of an indel refers to the position immediately before the indel in the reference sequence. The coverage of the indel is provided as the average coverage of the five neighboring nucleotides on each side of the indel.
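A downloaded indel table could be post-processed along the following lines. The parser handles the sign, length and sequence notation shown above; the coverage helper is one possible reading of the flanking average, and exactly which positions count as neighbors is an assumption. Both function names are hypothetical.

```python
import re

# Sign, length, then the inserted or deleted bases, e.g. -3AGG or +4ATCG
INDEL_PATTERN = re.compile(r"^([+-])(\d+)([ACGTNacgtn]+)$")


def parse_indel(token):
    """Return (kind, sequence) for a token such as '-3AGG' or '+4ATCG'."""
    match = INDEL_PATTERN.match(token)
    if match is None:
        raise ValueError(f"not an indel token: {token!r}")
    sign, length, sequence = match.groups()
    if int(length) != len(sequence):
        raise ValueError(f"declared length does not match sequence: {token!r}")
    kind = "insertion" if sign == "+" else "deletion"
    return kind, sequence.upper()


def flanking_coverage(coverage, index, flank=5):
    """Mean coverage of up to `flank` positions on each side of `index`.

    coverage: per-position coverage values; index: 0-based position of the indel
    """
    left = coverage[max(0, index - flank):index]
    right = coverage[index + 1:index + 1 + flank]
    neighbors = left + right
    if not neighbors:
        raise ValueError("no neighboring positions available")
    return sum(neighbors) / len(neighbors)


print(parse_indel("-3AGG"))
print(parse_indel("+4ATCG"))
```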
Evaluation of the database
Table: Benchmarks for processing time of PoPoolation DB. Columns: size of query; average processing time (minutes).
In order to increase transparency and reproducibility of the results, PoPoolation DB prints a log file with the parameters used for every query. Additionally, PoPoolation DB allows the user to download all the information it produces (SNP and codon table, indel table), which will be especially useful for downstream analysis. PoPoolation DB not only visualizes the population genetic estimators as tracks in the UCSC and FlyBase genome and RNA-Seq browsers, but also allows them to be downloaded in the widely used Wiggle file format.
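For readers unfamiliar with the Wiggle format, the sketch below writes per-window values as a variableStep track. Whether PoPoolation DB emits variableStep or fixedStep declarations is not stated in the text, so this layout is an assumption, as are the function name and the track name. Each value is attached to an interval of one step length, because Wiggle data elements must not overlap even when the sliding windows do.

```python
def to_wiggle(chrom, windows, span, name="estimator"):
    """Format window values as a variableStep Wiggle track.

    chrom: chromosome name as expected by the browser, e.g. 'chr2L'
    windows: iterable of (start_position, value); positions are 1-based,
             and windows whose value is None are skipped
    span: number of bases each value covers
    """
    lines = [
        f'track type=wiggle_0 name="{name}"',
        f"variableStep chrom={chrom} span={span}",
    ]
    for position, value in windows:
        if value is None:  # e.g. window with insufficient coverage
            continue
        lines.append(f"{position} {value}")
    return "\n".join(lines) + "\n"


print(to_wiggle("chr2L", [(1001, 0.5), (1101, None), (1201, -1.25)], span=100))
```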
PoPoolation DB is restricted to the analysis of Illumina sequence reads generated from pooled DNA samples. Currently, it is not possible to load sequence data obtained from sequencing individuals separately, as this requires an entirely different handling of data.
The current version of PoPoolation DB holds polymorphism information for one population of D. melanogaster. We plan to add more populations of D. melanogaster and D. simulans. Moreover, we will integrate tools to compare the polymorphism data from various populations.
PoPoolation DB is a user-friendly integrated database. It allows the retrieval of polymorphism data from pooled 2nd generation sequencing data using new statistical approaches to obtain population genetic parameters from pooled data. PoPoolation DB will enable researchers to identify natural polymorphisms, their frequencies, and associated functional annotations from the UCSC and FlyBase genome browsers. Furthermore, population genetic estimators and RNA-Seq information can be obtained for a genomic region of interest.
We anticipate that the database will not only be of interest for the identification of segregating functional variants, but also facilitate primer design and comparative analyses. Currently, PoPoolation DB provides polymorphism data from a single D. melanogaster population from Portugal, but additional populations and species will be uploaded as soon as they become publicly available.
Availability & requirements
PoPoolation DB is freely available to all non-commercial users at http://www.popoolation.at/pgt
We are grateful to all members of the Institute of Population Genetics, in particular A. Betancourt, and J.-M. Gibert for beta testing PoPoolation DB. We are thankful to one anonymous reviewer for spotting a problem in an earlier version of the database. This work has been supported by the Austrian Science Fund (FWF, P22019467) to CS.
- Hilscher J, Schlötterer C, Hauser MT: A single amino acid replacement in ETC2 acts as major modifier of trichome patterning in natural Arabidopsis populations. Current Biology. 2009, 19: 1747-1751. 10.1016/j.cub.2009.08.057.
- Rebeiz M, Pool JE, Kassner VA, Aquadro CF, Carroll SB: Stepwise modification of a modular enhancer underlies adaptation in a Drosophila population. Science. 2009, 326: 1663-1667. 10.1126/science.1178357.
- Futschik A, Schlötterer C: Massively parallel sequencing of pooled samples - the next generation of molecular markers. Genetics. 2010, 186 (1): 207-218. 10.1534/genetics.110.114397.
- Kofler R, Orozco-ter Wengel P, De Maio N, Pandey RV, Nolte V, Futschik A, Kosiol C, Schlötterer C: PoPoolation: a toolbox for population genetic analysis of 2nd generation sequencing data from pooled individuals. PLoS One. 2011, 6 (1): e15925. 10.1371/journal.pone.0015925.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25: 2078-2079. 10.1093/bioinformatics/btp352.
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25: 1754-1760. 10.1093/bioinformatics/btp324.
- Tweedie S, Ashburner M, Falls K, Leyland P, McQuilton P, Marygold S, Millburn G, Osumi-Sutherland D, Schroeder A, Seal R, Zhang H: FlyBase: enhancing Drosophila Gene Ontology annotations. Nucleic Acids Res. 2009, 37: D555-D559. 10.1093/nar/gkn788.
- Karolchik D, Hinrichs AS, Kent WJ: The UCSC Genome Browser. Curr Protoc Bioinformatics. 2009, Chapter 1: Unit 1.4.
- Tajima F: Evolutionary relationship of DNA sequences in finite populations. Genetics. 1983, 105: 437-460.
- Tajima F: Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics. 1989, 123: 585-595.
- Watterson GA: On the number of segregating sites in genetical models without recombination. Theoretical Population Biology. 1975, 7: 256-276. 10.1016/0040-5809(75)90020-9.
- Harr B, Kauer M, Schlötterer C: Hitchhiking mapping: A population-based fine-mapping strategy for adaptive mutations in Drosophila melanogaster. Proc Natl Acad Sci USA. 2002, 99: 12949-12954. 10.1073/pnas.202336899.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.