Single-feature polymorphism discovery by computing probe affinity shape powers
© Xu et al; licensee BioMed Central Ltd. 2009
Received: 27 January 2009
Accepted: 26 August 2009
Published: 26 August 2009
Single-feature polymorphism (SFP) discovery is a rapid and cost-effective approach to identify DNA polymorphisms. However, high false positive rates and/or low sensitivity are prevalent in previously described SFP detection methods. This work presents a new computing method for SFP discovery.
The probe affinity differences and the affinity shape powers formed by neighboring probes in each probe set were combined into SFP weight scores. This method was validated with known sequence information and was comprehensively compared with previously reported methods using the same datasets. A web application using this algorithm has been implemented for SFP detection. Using this method, we identified 364 SFPs in a barley near-isogenic line pair carrying either the wild type or the mutant uniculm2 (cul2) allele. Most of the SFPs were identified on chromosome 6H in the vicinity of the Cul2 locus.
This SFP discovery method exhibits better performance in specificity and sensitivity over previously-reported methods. It can be used for other organisms for which GeneChip technology is available. The web-based tool will facilitate SFP discovery. The 364 SFPs discovered in a barley near-isogenic line pair provide a set of genetic markers for fine mapping and future map-based cloning of the Cul2 locus.
Polymorphisms in DNA sequence between genotypes can be used as genetic markers for a variety of genetic studies. Directly sequencing genomes is one method of detecting these polymorphisms. For example, dideoxy sequencing led to the detection of single nucleotide polymorphisms (SNPs) in human, mouse, and Arabidopsis [1–3]. High-throughput next-generation DNA sequencing technologies reduced the cost and increased the efficiency of polymorphism detection [4–8]. High-density oligonucleotide resequencing arrays provide an alternative approach for polymorphism detection [9–11]. For example, the resequencing Genome-Wide Human SNP Array 6.0 contains 906,600 potential SNPs that can be used to detect polymorphisms in individuals. However, such oligonucleotide resequencing arrays can only be developed for species, such as human, mouse, Arabidopsis, and rice (O. sativa), whose genome sequences and SNP map information are known [9–13]. Because of technology gaps and cost, there is a lack of highly parallel, high-throughput platforms for directly detecting DNA polymorphisms in many other species.
A polymorphic sequence detected by a single probe on an oligonucleotide array is called a single-feature polymorphism (SFP). With the emergence of microarray expression data, gene transcript accumulation can be measured and SFPs detected at the same time. SFP discovery using microarray expression data is a rapid and cost-effective method for genetic marker development. Affymetrix® has developed more than 100 different types of commercially available GeneChip expression arrays from over 30 organisms, as well as additional customer-designed arrays. Typically, Affymetrix® GeneChips are designed with 11 perfect match (PM) and 11 mismatch (MM) probes (each a 25-mer) for each gene (probe set). Polymorphic nucleotides in the target transcript affect its binding to the probes, resulting in low hybridization signal intensity. Therefore, it is possible to identify SFPs between two genotypes by comparing the probes that target polymorphic regions. By the same principle, genomic DNA can also be applied to the oligonucleotide microarray for polymorphism detection [15, 16].
Several methods have been reported for SFP discovery in a variety of organisms. All are based on the idea that variation in a target sequence lowers the probe hybridization signal intensity on an array, but they differ in how this decreased intensity is detected. Winzeler et al. first reported SFP detection in yeast (Saccharomyces cerevisiae) genomic DNA hybridized on a high-density oligonucleotide expression array. This method used a linear regression model to fit probe log(PM) intensities and then an F test to detect the probes that have significant differences in intensities between two yeast strains. This method has been tested on genomic and transcriptome data from Plasmodium falciparum and barley (Hordeum vulgare L.), respectively [16–18]. Borevitz et al. developed a method similar to Significance Analysis of Microarray (SAM) to analyze those probes that have significant differences in intensity between genotypes and detected SFPs in Arabidopsis thaliana genomic DNAs hybridized on a GeneChip. Coram et al. recently applied a similar approach in wheat (Triticum aestivum) using the SAM method in the Bioconductor siggenes package, but tested the differences in probe intensities after subtracting the RMA (robust multichip analysis) normalized expression index. Rostoks et al. modified this method for SFP discovery in barley by fitting PM intensities to a linear model (PILM) and then detecting the significantly different probes using SAM in the Bioconductor package siggenes. This method has been used for SFP discovery in mosquito (Anopheles gambiae) and rice (Oryza sativa), mapping qualitative and quantitative traits in Arabidopsis thaliana [26–31], and estimating mutation and recombination parameters in Arabidopsis. Bischoff et al. recently adapted this linear model by using SAS/JMP (SAS Institute Inc. Cary, NC) instead of SAM for SFP discovery in the swine transcriptome. Ronald et al. implemented a positional-dependent-nearest-neighbor model (PDNN) to calculate probe intensity and then examined the intensity differences of single pairs of probes for SFP discovery in yeast. All of the above methods are based upon changes in the signal intensity of individual probes between two genotypes.
West et al. proposed a new method to detect SFPs in Arabidopsis thaliana by examining probe intensity along all 11 probes of a probe set in parallel and calculating a value called SFPdev, which is the hybridization signal difference between one probe and the average of the other 10 probes divided by the individual probe signal. Cui et al. developed a probe affinity outlier pursuit (PAOP) algorithm that applies probe affinity differences and compares the 11 probes of a probe set to find the outliers for barley SFP detection. This method accounts for multiple probes in parallel in the same probe set and uses probe affinity instead of probe intensity. It has been used for SFP discovery in cowpea (Vigna unguiculata L. Walp), mapping of translocation breakpoints in wheat, and genotyping in barley. It is widely agreed that a typical SFP has a "peak" profile shape, and almost all reports have shown such shape plots. However, none of the above methods were designed to directly detect and capture such an SFP "peak" shape.
Despite their ability to identify SFPs, the above-mentioned SFP detection methods show a high false positive rate and low sensitivity. For example, a 40% false discovery rate was estimated using the PILM method, and only "present" call probe sets were used in the PAOP method, suggesting a low sensitivity level. The objectives of our study were to: 1) develop and test a new SFP detection method that improves both specificity and sensitivity; and 2) perform SFP discovery in a barley near-isogenic line (NIL) pair using the newly developed method. We developed and tested our method on two previously published GeneChip datasets: a dataset derived from six tissue types from the barley cultivars Golden Promise and Morex, and a dataset of five genotypes (Barke, Morex, Steptoe, Oregon Wolfe Barley Dominant and Recessive). We also used this method to detect SFPs in a barley NIL pair carrying the mutant and wild type alleles at the Uniculm2 (Cul2) locus.
Results and Discussion
where S_tij is the raw PM intensity, I_ti is the RMA-normalized expression value of each gene, t represents the genotype, i the probe set, and j the probe. The error term E is independent and identically distributed with a mean of zero, and the probe affinity value (A) can be estimated by subtracting I from S.
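The model equation that these definitions refer to is not reproduced in this text. One form consistent with the definitions, assuming an additive model on the log2 scale used by RMA (the log transform and the hat notation are assumptions, not taken from the text), would be:

```latex
\log_2 S_{tij} = I_{ti} + A_{tij} + E_{tij},
\qquad
\hat{A}_{tij} = \log_2 S_{tij} - I_{ti}
```

Because E has mean zero, subtracting the expression value I from the (log) intensity S leaves an estimate of the affinity A, which is the quantity used in the rest of the method.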
We used the Bioconductor affy package for PM intensity extraction and RMA-normalized expression calculation. All PM intensities were background subtracted and quantile-normalized. Each probe feature affinity was computed by subtracting the RMA expression value of the probe set from the normalized probe intensity. A probe affinity matrix was created with samples in columns and probes (the number of probe sets multiplied by 11) in rows. The SAM method in the Bioconductor siggenes package was applied to detect significant differences of each probe affinity among all samples in the probe affinity matrix. The estimated prior probability (p0) that a feature does not bear a significant difference in probe affinity was set to 0.95. Because 250,811 probes on the Barley1 GeneChip are tested at the same time in the SAM method, an FDR was derived by applying multiple testing [19, 45] to the raw p value of each probe. We used the FDR instead of the p value as the cutoff. The FDR was set to 0.1 based on the delta value (D) in the SAM analysis. The FDR cutoff is arbitrary; 0.1 is commonly accepted, and more stringent cutoff values can be applied.
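The affinity computation and the layout of the probe affinity matrix can be sketched in a few lines. This is a minimal illustration, not the authors' R code: the function names and the example intensities are hypothetical, and it assumes that the normalized PM intensities are log2-transformed before the RMA expression value is subtracted.

```python
from math import log2

def probe_affinities(pm_intensities, expression_index):
    """Affinity per probe: log2 normalized PM intensity minus the
    probe-set expression value (log2 scale is an assumption here)."""
    return [log2(s) - expression_index for s in pm_intensities]

def affinity_matrix(samples):
    """Build a matrix with probes in rows and samples in columns.

    `samples` is a list of (pm_intensities, expression_index) pairs,
    one pair per hybridization, all for the same probes.
    """
    columns = [probe_affinities(pm, idx) for pm, idx in samples]
    return [list(row) for row in zip(*columns)]

# Hypothetical probe set: 11 normalized PM intensities in two samples
pm = [256.0, 300.0, 240.0, 512.0, 128.0, 256.0,
      260.0, 250.0, 270.0, 256.0, 64.0]
matrix = affinity_matrix([(pm, 8.0), (pm, 7.0)])
```

Each row of `matrix` then holds the affinities of one probe across samples, which is the shape that a per-probe significance test expects.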
where L is the probe affinity in the left position, and R is the probe affinity in the right position of the potential SFP.
In each genotype, if the median of all probe affinity values at the adjacent location (md(Ai-1) or md(Ai+1)) is higher than the median affinity at the potential SFP location i, a value of 1.0 was assigned to the appropriate index (L1, L2 or R1, R2) of the vector. If the median probe affinity at the adjacent location is lower than that at the SFP location, a value of -1.0 was assigned to the vector. We tested minimum values of 0.01, 0.05, 0.1, and 0.2 for the affinity difference, and found that these values did not produce a significant difference in the performance of SFP detection (Additional file 1). We chose the empirical value of 0.1 as a more reliable minimum value. If the affinity difference is less than or equal to 0.1, or if the adjacent location falls outside the probe set boundary because the potential SFP is at probe position 1 or 11, the corresponding entry of the vector was set to 0.
To take into account the direction variation of probe affinities in the median calculation, we computed the proportion (p) of all (n) individual probe affinity directions that have the same direction to the median affinity direction, and multiplied the median direction vector with this proportion.
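The direction assignment and the proportion weighting described above can be sketched as follows. This is a hedged illustration rather than the published implementation: the function names are hypothetical, and it assumes that each "individual direction" compares the neighboring probe with the candidate SFP probe within the same replicate.

```python
from statistics import median

MIN_DIFF = 0.1  # minimum affinity difference chosen in the text

def direction(neighbor, sfp, min_diff=MIN_DIFF):
    """+1.0 if the neighbor affinity is higher than the SFP probe affinity,
    -1.0 if it is lower, and 0.0 if the difference is within min_diff or
    the neighbor does not exist (SFP at probe position 1 or 11)."""
    if neighbor is None:
        return 0.0
    diff = neighbor - sfp
    if abs(diff) <= min_diff:
        return 0.0
    return 1.0 if diff > 0 else -1.0

def weighted_direction(neighbor_reps, sfp_reps):
    """Median direction scaled by the proportion p of replicates whose
    individual direction agrees with the median direction."""
    if neighbor_reps is None:  # no adjacent probe on this side
        return 0.0
    d_med = direction(median(neighbor_reps), median(sfp_reps))
    individual = [direction(n, s) for n, s in zip(neighbor_reps, sfp_reps)]
    p = sum(1 for d in individual if d == d_med) / len(individual)
    return d_med * p

# Hypothetical affinities for three replicates of one genotype
left = weighted_direction([0.5, 0.6, 0.4], [-1.0, -0.9, 0.45])
```

In this example the median direction is positive, but one of the three replicates shows no clear difference, so the entry is scaled down to two thirds rather than counted at full strength.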
A minimum value of 1.0 is assigned to the shape power so that a probe retains at least its original weight when it does not exhibit a typical affinity shape. The original weight is the SFP affinity difference (Ad) divided by the 30th percentile (30 pct) of the affinity differences of the 11 probes within the same probe set. We set a minimum value of 0.1 for the 30th percentile to avoid an extremely large weight score caused by division by an extremely small 30th percentile. The 30th percentile was chosen to allow multiple SFPs per probe set, even though the 30th percentile baseline performs only slightly better than other percentiles at a weight score cutoff of 2.5 (Additional file 2). Within the 11 probe affinity values of each probe set, the 30th percentile is the 4th lowest probe. Using the 4th lowest probe as the base allows capture of the remaining 7 probes on the same probe set. We found that if more than 8 probes were detected on the same probe set, they were most likely caused by differential expression instead of sequence variation (data not shown).
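The weighting can be illustrated with a short sketch. The function names are hypothetical, the affinity differences are assumed to be passed as absolute values, and the final combination (original weight multiplied by the shape power) is an assumption inferred from the description, since the full formula is not reproduced in this text.

```python
def percentile_30(values):
    """30th percentile by linear interpolation between order statistics.
    For 11 values this is exactly the 4th lowest value."""
    ordered = sorted(values)
    # rank = 0.3 * (n - 1), kept in integer arithmetic to avoid rounding
    lo, rem = divmod(3 * (len(ordered) - 1), 10)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (rem / 10) * (ordered[hi] - ordered[lo])

def weight_score(affinity_diffs, probe_index, shape_power):
    """Original weight (Ad / baseline) times the shape power.

    The baseline is floored at 0.1 and the shape power at 1.0, as in the
    text; multiplying the two terms is an assumption of this sketch.
    """
    baseline = max(percentile_30(affinity_diffs), 0.1)
    original_weight = affinity_diffs[probe_index] / baseline
    return original_weight * max(shape_power, 1.0)

# Hypothetical absolute affinity differences for the 11 probes of a probe set
diffs = [0.0625, 0.125, 0.25, 0.125, 0.25, 3.0,
         0.25, 0.125, 0.0625, 0.25, 0.125]
score = weight_score(diffs, probe_index=5, shape_power=2.0)
```

Flooring the shape power at 1.0 means a candidate without a clear peak shape keeps its original weight instead of being scaled toward zero, which matches the stated intent.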
Specificity and sensitivity are two important criteria in any detection method development. In the SFP discovery methods reported to date, low specificity and low sensitivity are the common problems. Even though PASP is closest to PAOP, the PAOP method has only been shown to function for probe sets with present calls . In general, the Barley1 GeneChip exhibits approximately 60% present calls. Some platforms have a low percentage of present call probe sets. For example, the Wheat GeneChip typically has about 45% present calls, while the Medicago GeneChip has approximately 40% present calls (data not shown). If a platform is used for cross species hybridization, the present calls will be even lower . Therefore, if only present calls are taken into account a lower sensitivity would result and many SFPs would escape detection. The "present/absent" call does not necessarily mean present/absent in terms of hybridization but an Affymetrix definition that a probe set is called absent if there is no significant difference with a p value cutoff of 0.04 between the 11 PM probes and 11 MM probes . In our method design, we first used probe affinity difference instead of probe signal intensity to increase the sensitivity, and then applied shape power scores to increase the specificity. In cases where both PM and MM show no significant difference but both have high intensities, PASP is still capable of detecting SFPs even with an absent call probe set, thus the sensitivity was enhanced. In the case of differentially-expressed genes, after subtracting the expression index (RMA), all the probe affinities are comparable between the two genotypes. If SFPs existed, their probe affinities would be captured. However, these SFPs might escape if only the probe intensities are detected or only present call probe sets are used.
The motivation of the affinity shape power is the following. The SFP probe has different affinity to different genotype targets, and this affinity can exhibit a sharp contrast in neighboring probes within the probe set that are not SFPs. If all sample replicates present this sharp contrast at this probe, this potential SFP bears more power in weight score. In cases that most or all probes within the probe set have different signal intensities between genotypes but do not form affinity contrast with neighboring non-SFP probes, these potential SFPs bear less or no power in the weight score since they might be caused by differential transcript accumulation. Thus, the probe affinity shape power captures the intrinsic feature of SFPs.
When genomic DNA instead of transcripts is hybridized to high-density GeneChip arrays, the probe intensities can also be summarized to an index by RMA, so the affinity shape powers can be computed by our method in the same way.
To validate our method, we tested previously-published Barley1 GeneChip expression data sets . Thirty-six GeneChip hybridizations (cel files) from two genotypes, Morex and Golden Promise, were downloaded from the public barley Natural variation web site . For each genotype, six tissue types (coleoptile, crown, embryo, leaf, radicle, and root) were examined. Three biological replicates were examined for each genotype/tissue combination except for root tissue which has two replicates of Morex and four replicates of Golden Promise. This data set was accompanied by 401 polymorphic and 2200 non-polymorphic probes previously sequence-verified between Golden Promise and Morex.
Table 1. Single-feature polymorphisms (SFPs) discovered in Golden Promise and Morex
After imposing a weight score cutoff of 2.5, the number of false positives decreased from 84 to 48, with only a loss from 305 to 284 true positive SFPs (Table 1). As a result, a total of 284 SFPs were detected from 401 known polymorphic probes, resulting in a sensitivity rate of 70.82%. A total of 48 false SFPs were called from 2,200 confirmed non-polymorphic probes for a false positive rate (FPR) of 2.18%. The FDR is 14.46%, which was calculated by the number of false positives divided by the sum of false positives and true positives, 48/(48+284).
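The three rates quoted above follow directly from the counts. A small sketch (the function name is hypothetical) that reproduces the arithmetic:

```python
def detection_metrics(tp, fp, n_polymorphic, n_non_polymorphic):
    """Return (sensitivity, false positive rate, false discovery rate)."""
    sensitivity = tp / n_polymorphic      # detected / known polymorphic probes
    fpr = fp / n_non_polymorphic          # false calls / known non-polymorphic probes
    fdr = fp / (fp + tp)                  # false calls / all calls
    return sensitivity, fpr, fdr

# Counts after the weight score cutoff of 2.5
sens, fpr, fdr = detection_metrics(tp=284, fp=48,
                                   n_polymorphic=401, n_non_polymorphic=2200)
print(f"sensitivity={sens:.2%}  FPR={fpr:.2%}  FDR={fdr:.2%}")

# Counts before the cutoff, for comparison
sens0, fpr0, fdr0 = detection_metrics(tp=305, fp=84,
                                      n_polymorphic=401, n_non_polymorphic=2200)
```

Note that the FDR is computed only over the sequence-verified probes, so it describes the labelled subset rather than every call on the array.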
SFP discovery method comparison
Several SFP discovery methods have been reported to date. Among them, the PILM method developed by Rostoks et al. and the PAOP method developed by Cui et al. have been repeatedly used [14, 20, 23–33, 37–40]. We compared the PASP method developed in this study with these two SFP discovery methods using the GeneChip dataset of two barley genotypes, Morex and Golden Promise, originally used by Rostoks et al. [23, 47].
We plotted precision-recall (PR) curves to compare the detection accuracy of the methods. The PR plot (Figure 2) shows that the PASP curve approaches the upper-right-hand corner, leaving little room for further improvement in SFP detection. The PAOP and PILM curves lie below the PASP curve. The areas under the PR curves for PASP, PILM, and PAOP are 0.80, 0.74, and 0.67, respectively (Figure 2, Additional file 4). These area under the curve (AUC) values represent the overall performance of the three methods. We also used Receiver Operator Characteristic (ROC) curves to compare the detection accuracy of these three methods (data not shown). Although PR and ROC curves convey equivalent information, the PR curve is more informative when dealing with highly skewed datasets. Thus, we show only the PR curve.
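For readers who want to reproduce this kind of comparison, a PR curve and its area can be computed from per-probe scores and sequence-verified labels as sketched below. The function names and the toy data are hypothetical, tied scores are not treated specially, and the trapezoidal rule is an assumption; the text does not say how the reported areas were integrated.

```python
def pr_curve(scores, labels):
    """(recall, precision) points obtained by lowering the score cutoff
    one probe at a time. labels: 1 = polymorphic, 0 = non-polymorphic."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(labels)
    tp = fp = 0
    points = []
    for _, is_sfp in ranked:
        if is_sfp:
            tp += 1
        else:
            fp += 1
        points.append((tp / total_pos, tp / (tp + fp)))
    return points

def area_under(points):
    """Trapezoidal area under the PR points, starting at recall 0 with the
    precision of the highest-scoring call."""
    area, prev_r, prev_p = 0.0, 0.0, points[0][1]
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2
        prev_r, prev_p = r, p
    return area

# Toy example: six probes with weight scores and verified labels
scores = [9.0, 7.5, 6.0, 4.0, 2.5, 1.0]
labels = [1, 1, 0, 1, 0, 0]
auc = area_under(pr_curve(scores, labels))
```

A method that ranks every true SFP above every non-polymorphic probe yields an area of 1.0, so values such as 0.80 versus 0.67 can be read as distance from that ideal ranking.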
We further compared PASP with PAOP and PILM by using the cutoff values that produce the best performance for each method. PILM exhibits its best performance at cutoff value of delta (D) 3.0 (red star on PILM line). We chose both the D cutoff of 3.0 (FDR 0%) and D 2.0 (FDR 0.1%, black star) for the PILM comparisons since D 2.0 was optimized in the original report . PAOP shows its best sensitivity and specificity at outlier score percentile of 0.15 (os0.15, red star) even though os0.05 (black star) was used in the original report .
Table 2. Single-feature polymorphism (SFP) discovery methods comparison
PASP: probe affinity, shape power
PILM (two cutoffs, D 3.0 and D 2.0): probe intensity, linear model
PAOP (two cutoffs, os0.15 and os0.05): probe affinity, outlier score
We computed a comparison score (compScore, see Materials and Methods) for these methods by combining the sensitivity, specificity, FDR, and number of discovered SFPs, each calculated at the cutoff value at which the method achieved its best performance. The comparison score expresses the improvement or reduction in performance relative to another method. A score of 1.0 means that the two methods have the same performance, and a score of less than 1.0 reflects reduced overall performance. As seen in Table 2, the PASP method has the best compScore of all methods tested. We adjusted the original cutoff value of the PAOP method from 0.05 to 0.15 to gain sensitivity at the expense of its original high specificity. At this cutoff, even though the PAOP method shows slightly better specificity (lower FDR and FPR) than PASP, it has lower sensitivity and discovers fewer SFPs than PASP. Its compScore is less than 1.0 (0.80, Table 2); therefore, the overall performance of PAOP is lower than that of PASP. Because the PILM method had a high FDR, it produces a very low compScore (0.21) even at its best performance cutoff (D 3.0), i.e., the overall performance of PILM in terms of specificity, sensitivity, and detected SFPs is only 21% of that of PASP. The compScore computed from the best point for each method shows that PASP outperforms the other two methods (Figure 2).
We also tested our method on the dataset published by Cui et al. using the same genotype pair comparison. In the previous study, a total of 2,007 SFPs were reported using a highly stringent cutoff value (5th percentile). Using our PASP method we detected substantially more SFPs (14,212). These SFPs covered 96.6% (1,939 of the 2,007) of the SFPs discovered by the PAOP method. This raises the question of how many of the remaining 12,273 SFPs identified using the PASP method are false positives. However, when we compared SFPs between Morex and Golden Promise, we showed that our PASP method using a weight score cutoff of 2.5 has a much higher sensitivity (70.82%) than PAOP (26.43%) using a 5th percentile cutoff at a similar FDR (14.46% versus 13.15%) (Additional file 4, Figure 2). In addition, the Cui et al. study pointed out that the five different genotypes (Barke, Morex, Steptoe, Oregon Wolfe Barley (OWB) Dominant and OWB Recessive) exhibit a high level of polymorphism and that possibly not all SFPs were detected. Thus, the PASP method captures more SFPs in addition to almost all of the SFPs previously identified using the PAOP method. There could still be more false positives among the 14,212 SFPs detected using the PASP method than among the 2,007 SFPs detected using the PAOP method, but the PASP method provides an additional set of SFPs to examine.
Since cRNA-based SFP detection is based on the probe hybridization signals, 100% sensitivity can never be reached because there are many genes that exhibit low transcript accumulation. The maximum sensitivity in SFP discovery is determined by the percentage of genes (probe sets) on the array that exhibit signal. The 70% sensitivity that we obtained may be the maximum in barley while maintaining a low FDR (Figure 2, Table 2).
We compared the PASP method with two commonly used methods because these methods have been extensively used in various studies [14, 20, 23–33, 37–40]. Other methods, including the PM-MM model method, the K-means clustering method, the method developed by Winzeler et al., and the method by Ronald et al., were compared in a study by Luo et al. using a barley data set. Each of these methods exhibited a high (~64%) FDR [17, 33] and a low sensitivity (27%–37%). West et al. described an SFPdev method by analyzing an Arabidopsis dataset, but these authors did not extensively test for FDR or sensitivity using sequence information. In contrast, the PASP method retained a high sensitivity while lowering the FDR, and therefore appears to perform better than previous methods.
Implementation of the Web-based SFP discovery tool
We implemented the PASP method as a public web application at https://dbw10.msi.umn.edu:8443/sfp. The web application makes SFP detection easy for potential users. Job overload on the server is a common issue in web application design, especially for high-throughput genomic analysis. Our SFP discovery tool uses a three-tier design consisting of a client, a web server, and an application server. The web server schedules jobs in a queue on the application server to avoid the overload issue. The interface has an add-more button that allows a flexible number of files to be loaded. An email address field is provided so that the user can be sent the URL for downloading the result files when the job is done.
Discovering SFPs in a near-isogenic line pair carrying mutant and wildtype alleles for uniculm2
Four tissue types (crown, embryos, immature inflorescence, and 3 day old seedlings) of the NIL pair carrying the wild type (Bowman) and mutant allele for cul2 (Bowman-cul2) were used in this study. RNA was extracted from these four tissues with three replicates and hybridized to the Barley1 GeneChip® Genome Array.
Single-feature polymorphisms (SFPs) discovered in Bowman and Bowman-cul2
# of SFPs discovered in individual tissues
More than half (187) of the SFPs were found as a single SFP in a probe set. An example of a single SFP in a probe set is shown in Figure 3C. This SFP (Contig16609_at probe10) was found in Bowman-cul2 and has an SFP shape with a weight score of 64 (Figure 3C). A maximum of six SFPs per probe set were detected (data not shown). In probe sets exhibiting multiple SFPs, some SFPs were present in one genotype only, and some probe sets possessed SFPs in both genotypes.
Validation of the SFPs discovered in Bowman and Bowman-cul2
PCR sequence verification for 16 single-feature polymorphisms (SFPs)
Map positions listed in the table (chromosome, arm, position): 6H,L, 67.7 cM; 6H,L, 67.7 cM; 6H,L, 70 cM; 6H,L, 70 cM; 6H,L, 71.1 cM; 6H,S, 49.4 cM; 6H,S, 49.4 cM; 6H,L, 71.1 cM; 6H,L, 60.2 cM; 3H,L, 55.6 cM
It is impractical to perform sequencing to validate every SFP call. To further determine the reliability of the SFPs discovered in the two genotypes, we examined other genotypes. We found that nearly one third (118) of the 364 SFPs we discovered between the Bowman and Bowman-cul2 genotypes also exist in Golden Promise, Morex, Barke, Steptoe, or the Oregon Wolfe barleys (Figure 4B).
Since true SFPs possess a typical shape, we reviewed the probe affinity shape plots of all SFPs to justify each SFP. A quantitative likelihood measure can also be assigned to each SFP discovered. West et al. reported an SFPdev value, calculated as the probe signal rate after subtraction from the average probe signal. Cui et al. reported probe outlier values. A statistical p value or FDR has also been used to rank SFPs [17, 23]. In this study, we present an SFP weight score. The weight score computed for each SFP represents how likely it is that the SFP is real, i.e., it provides an alternative confirmation of SFPs. In the published barley expression data set, a weight score cutoff of 40 resulted in a 0% false positive rate. For SFP detection in each tissue type of Bowman and Bowman-cul2, between 30 and 45 SFPs possessed a weight score of 20 or above, resulting in a total of 89 SFPs with weight scores of 20 or above among all 364 detected SFPs. About half of the SFPs had a score of 10 or above.
SFP-containing genes on chromosome 6H
There are two clusters of SFPs on chromosome 6H (Figure 5, black). Eight SFP-containing genes (10 SFPs) are clustered in a region of approximately 7 cM in length from 119 cM to 126.2 cM. The second cluster has 42 SFP-containing genes (68 SFPs) clustered in a region of approximately 39 cM in length from 35.8 cM to 75.2 cM. Among the 15 SFPs that were validated by PCR sequencing, nine SFPs are in this region. The nine SFPs are represented by seven genes (Contig4329_at 67.7 cM, Contig5339_s_at 70 cM, Contig7178_s_at 70 cM, Contig14687_at 71.1 cM, HVSMEg0016A12r2_s_at 49.4 cM, HVSMEn0016F09r2_s_at 71.1 cM, Contig2856_x_at 60.2 cM). Gene KFP128 (Contig2856_x_at) was found to be closely linked to Cul2 locus within 4.6 cM . The remaining six sequence-validated SFPs do not have map information. Interestingly, the only SFP that was not confirmed through sequencing was mapped on chromosome 3H.
We also found SFPs on other chromosomes such as chromosome 4 (Figure 5). This suggests that Bowman-cul2 may still carry regions of progenitor genome sequence even after five backcrosses of the cul2 mutant allele into Bowman. The SFPs found on the other chromosomes may be the result of genetic variation between Bowman and the progenitor sequences. Another possibility is that the SFPs with low weight scores include false positives. We tried a more stringent SFP weight score cutoff of 20.0, which resulted in 113 SFPs harbored in 65 probe sets that had mapping information. We plotted these 65 SFP-containing genes onto the barley chromosomes and found only one polymorphic cluster, with 33 SFP-containing genes representing 45 SFPs, on chromosome 6H Bin 6 in the vicinity of the Cul2 locus. The other chromosomes have only a few SFPs (Figure 5, red).
Barley plants that carry loss-of-function mutations in the Cul2 gene do not produce tillers (vegetative branches). The Cul2 gene has not yet been cloned, and the genetic components of tiller development in barley are unknown. The 68 SFPs in 42 genes discovered in the region from 35.8 cM to 75.2 cM on chromosome 6H provide a potential marker list for future map-based cloning of the Cul2 locus.
We developed a new robust method for SFP discovery and tested the method using Barley1 GeneChip datasets. This SFP discovery method can be used for other organisms for which GeneChip technology is available. Our results show that the new SFP discovery method outperforms previously reported methods in both sensitivity and specificity. The web implementation of this method will provide a resource for others to employ the PASP algorithm for SFP detection. The 364 SFPs discovered in this study between plants carrying the wild type and mutant allele for the Cul2 gene provide a potential marker list for fine mapping and future map-based cloning of the Cul2 locus.
The barley cultivar Bowman was a gift from J. Franckowiak, Department of Plant Sciences, North Dakota State University, Fargo, N.D. The Bowman-cul2 genetic stock carries a single-gene recessive mutation at the Cul2 locus. The cul2 mutation was backcrossed five times into the Bowman genetic background to create Bowman-cul2. The Bowman-cul2 stock (GSHO2038) was obtained from USDA-ARS, National Small Grain Germplasm Research Facility, Aberdeen, Idaho.
Experimental design for detecting SFPs in Bowman and Bowman-cul2
Four tissues were collected at different stages for Bowman and Bowman-cul2: whole seedlings at 2–3 days after germination at the growth stage "first leaf just emerging through the coleoptile" (GRO:0007059), crowns at the seedling growth stage of "first leaves unfolded" (GRO:0007060), immature inflorescences at the "third node detectable" (GRO:0007084), and embryo from the "coleoptilar stage" (PO:0001094). Total RNA was isolated from pooled tissue (10 plants) from each genotype/replication/tissue combination. The experiment was grown in a randomized complete block design with three replicates. RNA extraction, labelling, and hybridization were conducted as previously reported .
The quality of all cel files derived from the Bowman and Bowman-cul2 samples was ensured using GCOS v1.4 (Affymetrix, Inc. Santa Clara, CA 95051 USA). The criteria were: overall "Present" calls above 60%; average background values ranging from 40 to 80; average noise values below 5.0 in each array; and all spike controls called "Present" with consistent ranges among all arrays. The internal house-keeping control genes were "present", and the ratios of the 3' end to 5' end expression values of these genes were under 3.0. These criteria reflected the quality of the RNA as well as the sample processing. All GeneChip data were deposited at the Plant Expression Database http://www.plexdb.org/ with accession number BB47.
Barley1 GeneChip datasets
Thirty-six Barley1 GeneChip data cel files were retrieved from the public Barley Natural Variation web site. They represent six tissue types (col, coleoptile; cro, seedling crown; gem, embryo from germinating seed; lea, seedling leaf; rad, radicle; roo, seedling root) from two genotypes, Golden Promise and Morex, with three replicates for each tissue, except that seedling root tissue has two replicates of Morex and four replicates of Golden Promise. We also downloaded the 10,504 SFPs discovered by Rostoks et al. [23, 47] from this data set. In addition, twenty-one Barley1 GeneChip cel files were downloaded from the Gene Expression Omnibus (GEO) database (accession GSE3170), and the 2,007 SFPs discovered in this data set were also retrieved from the previous report. This dataset was derived from whole-seedling tissue of five genotypes: three replicates each of Barke, Morex, and Steptoe, and six replicates each of Oregon Wolfe Barley (OWB) Dominant and OWB Recessive.
Method comparison calculations
Because the known true positives (401 polymorphic probes) and true negatives (2,200 non-polymorphic probes) are only a sequence-verified subset of all probes on the array, the number of detected SFPs must also be included in the score calculation.
DNA fragments of Bowman and Bowman-cul2 were amplified from genomic DNA regions flanking the SFPs such that each amplified product was about 200–300 bp. The sequence information of features/probes was obtained from Affymetrix. Target sequences carrying probe sites with SFPs were PCR amplified from Bowman wild type and the Bowman-cul2 mutant. PCR was conducted using Takara Ex Taq polymerase (Takara Shuzo Co., Kyoto, Japan) and 50 ng of DNA sample in a final volume of 50 μL following the manufacturer's protocol.
PCR was performed as follows: initial denaturation at 94°C for 5 min; 9 cycles of touchdown PCR (denaturation at 94°C for 30 s, annealing for 30 s starting at 63°C and decreasing by 1°C per cycle down to 55°C, and extension at 72°C for 1 min); and an additional 30 cycles with denaturation at 94°C, annealing at 55°C, and extension at 72°C. PCR products were electrophoresed on a 1.5% agarose gel, and only single PCR products were excised from the gel for sequence validation. PCR products within the excised gel pieces were purified using the Montage gel extraction kit (Millipore, Bedford, MA) following the manufacturer's protocol. Ten ng of purified DNA was sequenced using an ABI PRISM® 3130xl Genetic Analyzer at the DNA Sequencing and Analysis Facility at the University of Minnesota.
HarvEST: barley software version 1.68, which contains Assembly #35, was downloaded from the HarvEST website. The map information of SFP probe sets was obtained by directly searching Affymetrix GeneChip identifiers against Assembly #35 and by BLAST searching against mapped EST contig sequences. An R script was used to plot SFP probe sets on chromosomes.
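The touchdown schedule described in the PCR protocol can be checked with a few lines of arithmetic. This is a small illustrative sketch; the function name and parameters are hypothetical. It confirms that stepping from 63°C to 55°C in 1°C decrements gives exactly 9 annealing temperatures, one per touchdown cycle.

```python
def touchdown_annealing(start_c=63, end_c=55, step_c=1):
    """Return the annealing temperature (°C) used in each touchdown cycle."""
    return list(range(start_c, end_c - 1, -step_c))


schedule = touchdown_annealing()
print(schedule)       # → [63, 62, 61, 60, 59, 58, 57, 56, 55]
print(len(schedule))  # → 9 touchdown cycles

# Touchdown cycles plus the additional fixed-temperature cycles.
total_cycles = len(schedule) + 30
print(total_cycles)   # → 39
```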
Availability and requirements
Project name: SFP discovery tool
Project home page: https://dbw10.msi.umn.edu:8443/sfp
Operating system(s): Platform independent
Programming language: Java, R
Other requirements: Java 1.5.0 or higher, Tomcat 4.0 or higher
Any restrictions to use by non-academics: licence needed
List of abbreviations
AUC: area under curve
cul2: uniculm2
FDR: false discovery rate
FPR: false positive rate
probe affinity shape power
PAOP: probe affinity outlier pursuit
probe intensity linear model
RMA: Robust Multichip Analysis
ROC: receiver operator characteristic
SAM: Significance Analysis of Microarray
We would like to thank Dr. X. Cui for providing the PAOP source code. We would also like to thank the two anonymous reviewers for their suggestions that have improved this paper. We are grateful for resources from the University of Minnesota Supercomputing Institute for Advanced Computational Research. This research was supported by grants to GJM from the United States Department of Agriculture-CSREES-NRI Plant Growth and Development program grant #2004-03440, from the U.S. Wheat and Barley Scab Initiative, from the Minnesota Small Grains Initiative, and from the Minnesota Soybean Research and Promotion Council. Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the U.S. Department of Agriculture.
- The International HapMap Consortium: The International HapMap Project. Nature. 2003, 426: 789-796. 10.1038/nature02168.
- Lindblad-Toh K, Winchester E, Daly MJ, Wang DG, Hirschhorn JN, Laviolette J, Ardlie K, Reich DE, Robinson E, Sklar P, Shah N, Thomas D, Fan J, Gingeras T, Warrington J, Patil N, Hudson TJ, Lander ES: Large-scale discovery and genotyping of single-nucleotide polymorphisms in the mouse. Nature Genetics. 2000, 24: 381-386. 10.1038/74215.
- The Arabidopsis Genome Initiative: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000, 408: 796-815. 10.1038/35048692.
- Holt KE, Parkhill J, Mazzoni CJ, Roumagnac P, Weill F, Goodhead I, Rance R, Baker S, Maskell DJ, Wain J, Dolecek C, Achtman M, Dougan G: High-throughput sequencing provides insights into genome variation and evolution in Salmonella Typhi. Nature Genetics. 2008, 40: 987-993. 10.1038/ng.195.
- Holt KE, Teo YY, Li H, Nair S, Dougan G, Wain J, Parkhill J: Detecting SNPs and estimating allele frequencies in clonal bacterial populations by sequencing pooled DNA. Bioinformatics. 2009, 25: 2074-2075. 10.1093/bioinformatics/btp344.
- Barbazuk WB, Emrich SJ, Chen HD, Li L, Schnable PS: SNP discovery via 454 transcriptome sequencing. The Plant Journal. 2007, 51: 910-918. 10.1111/j.1365-313X.2007.03193.x.
- Li R, Yu C, Li Y, Lam T, Yiu S, Kristiansen K, Wang J: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009, 25: 1966-1967. 10.1093/bioinformatics/btp336.
- Malhis N, Butterfield YSN, Ester M, Jones SJM: Slider – maximum use of probability information for alignment of short sequence reads and SNP detection. Bioinformatics. 2009, 25: 6-13. 10.1093/bioinformatics/btn565.
- Hinds DA, Stuve LL, Nilsen GB, Halperin E, Eskin E, Ballinger DG, Frazer KA, Cox DR: Whole-Genome Patterns of Common DNA Variation in Three Human Populations. Science. 2005, 307: 1072-1079. 10.1126/science.1105436.
- Frazer KA, Eskin E, Kang HM, Bogue MA, Hinds DA, Beilharz EJ, Gupta RV, Montgomery J, Morenzoni MM, Nilsen GB, Pethiyagoda CL, Stuve LL, Johnson FM, Daly MJ, Wade CM, Cox DR: A sequence-based variation map of 8.27 million SNPs in inbred mouse strains. Nature. 2007, 448: 1050-1053. 10.1038/nature06067.
- Clark RM, Schweikert G, Toomajian C, Ossowski S, Zeller G, Shinn P, Warthmann N, Hu TT, Fu G, Hinds DA, Chen H, Frazer KA, Huson DH, Schölkopf B, Nordborg M, Rätsch G, Ecker JR, Weigel D: Common Sequence Polymorphisms Shaping Genetic Diversity in Arabidopsis thaliana. Science. 2007, 317: 338-342. 10.1126/science.1138632.
- Affymetrix support. [http://www.affymetrix.com]
- McNally KL, Childs KL, Bohnert R, Davidson RM, Zhao K, Ulat VJ, Zeller G, Clark RM, Hoen DR, Bureau TE, Stokowski R, Ballinger DG, Frazer KA, Cox DR, Padhukasahasram B, Bustamante CD, Weigel D, Mackill DJ, Bruskiewich RM, Rätsch G, Buell CR, Leung H, Leach JE: Genomewide SNP variation reveals relationships among landraces and modern varieties of rice. Proc Natl Acad Sci USA. 2009, 106: 12273-12278. 10.1073/pnas.0900992106.
- Borevitz JO, Liang D, Plouffe D, Chang HS, Zhu T, Weigel D, Berry CC, Winzeler E, Chory J: Large-scale identification of single-feature polymorphisms in complex genomes. Genome Res. 2003, 13: 513-523. 10.1101/gr.541303.
- Winzeler EA, Richards DR, Conway AR, Goldstein AL, Kalman S, McCullough MJ, McCusker JH, Stevens DA, Wodicka L, Lockhart DJ, Davis RW: Direct Allelic Variation Scanning of the Yeast Genome. Science. 1998, 281: 1194-1197. 10.1126/science.281.5380.1194.
- Kidgell C, Volkman SK, Daily J, Borevitz JO, Plouffe D, Zhou Y, Johnson JR, Le Roch K, Sarr O, Ndir O, Mboup S, Batalov S, Wirth DF, Winzeler EA: A Systematic Map of Genetic Variation in Plasmodium falciparum. PLoS Pathog. 2006, 2 (6): 0562-0577.
- Luo ZW, Potokina E, Druka A, Wise R, Waugh R, Kearsey MJ: SFP Genotyping From Affymetrix Arrays Is Robust But Largely Detects cis-acting Expression Regulators. Genetics. 2007, 176: 789-800. 10.1534/genetics.106.067843.
- Jiang H, Yi M, Mu J, Zhang L, Ivens A, Klimczak LJ, Huyen Y, Stephens RM, Su X: Detection of genome-wide polymorphisms in the AT-rich Plasmodium falciparum genome using a high-density microarray. BMC Genomics. 2008, 9: 398. 10.1186/1471-2164-9-398.
- Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001, 98: 5116-5121. 10.1073/pnas.091062498.
- Coram TE, Settles ML, Wang M, Chen X: Surveying expression level polymorphism and single-feature polymorphism in near-isogenic wheat lines differing for the Yr5 stripe rust resistance locus. Theor Appl Genet. 2008, 117: 401-411. 10.1007/s00122-008-0784-5.
- Bioconductor. [http://bioconductor.org]
- Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 2003, 31: e15. 10.1093/nar/gng015.
- Rostoks N, Borevitz JO, Hedley PE, Russell J, Mudie S, Morris J, Cardle L, Marshall DF, Waugh R: Single-feature polymorphism discovery in the barley transcriptome. Genome Biology. 2005, 6: R54. 10.1186/gb-2005-6-6-r54.
- Turner TL, Hahn MW, Nuzhdin SV: Genomic islands of speciation in Anopheles gambiae. PLoS Biol. 2005, 3: e285. 10.1371/journal.pbio.0030285.
- Kumar R, Qiu J, Joshi T, Valliyodan B, Xu D, Nguyen HT: Single Feature Polymorphism Discovery in Rice. PLoS ONE. 2007, 2: e284. 10.1371/journal.pone.0000284.
- Kim S, Zhao K, Jiang R, Molitor J, Borevitz JO, Nordborg M, Marjoram P: Association Mapping With Single-Feature Polymorphisms. Genetics. 2006, 173: 1125-1133. 10.1534/genetics.105.052720.
- Rus A, Baxter I, Muthukumar B, Gustin J, Lahner B, Yakubova E, Salt DE: Natural variants of AtHKT1 enhance Na+ accumulation in two wild populations of Arabidopsis. PLoS Genet. 2006, 2: e210. 10.1371/journal.pgen.0020210.
- Hazen SP, Borevitz JO, Harmon FG, Pruneda-Paz JL, Schultz TF, Yanovsky MJ, Liljegren SJ, Ecker JR, Kay SA: Rapid array mapping of circadian clock and developmental mutations in Arabidopsis. Plant Physiol. 2005, 138: 990-997. 10.1104/pp.105.061408.
- Borevitz JO, Hazen SP, Michael TP, Morris GP, Baxter IR, Hu TT, Chen H, Werner JD, Nordborg M, Salt DE, Kay SA, Chory J, Weigel D, Jones JDG, Ecker JR: Genome-wide patterns of single-feature polymorphism in Arabidopsis thaliana. Proc Natl Acad Sci USA. 2007, 104: 12057-12062. 10.1073/pnas.0705323104.
- Werner JD, Borevitz JO, Warthmann N, Trainer GT, Ecker JR, Chory J, Weigel D: Quantitative trait locus mapping and DNA array hybridization identify an FLM deletion as a cause for natural flowering-time variation. Proc Natl Acad Sci USA. 2005, 102: 2460-2465. 10.1073/pnas.0409474102.
- Wolyn DJ, Borevitz JO, Loudet O, Schwartz C, Maloof J, Ecker JR, Berry CC, Chory J: Light-response quantitative trait loci identified with composite interval and eXtreme array mapping in Arabidopsis thaliana. Genetics. 2004, 167: 907-917. 10.1534/genetics.103.024810.
- Jiang R, Marjoram P, Borevitz JO, Tavaré S: Inferring Population Parameters from Single-Feature Polymorphism Data. Genetics. 2006, 173: 2257-2267. 10.1534/genetics.105.047472.
- Bischoff SR, Tsai S, Hardison NE, York AM, Freking BA, Nonneman D, Rohrer G, Piedrahita JA: Identification of SNPs and INDELS in swine transcribed sequences using short oligonucleotide microarrays. BMC Genomics. 2008, 9: 252. 10.1186/1471-2164-9-252.
- Ronald J, Akey JM, Whittle J, Smith EN, Yvert G, Kruglyak L: Simultaneous genotyping, gene-expression measurement, and detection of allele-specific expression with oligonucleotide arrays. Genome Res. 2005, 15: 284-291. 10.1101/gr.2850605.
- Zhang L, Miles MF, Aldape KD: A model of molecular interactions on short oligonucleotide microarrays. Nat Biotechnol. 2003, 21: 818-821. 10.1038/nbt836.
- West MA, van Leeuwen H, Kozik A, Kliebenstein DJ, Doerge RW, St Clair DA, Michelmore RW: High-density haplotyping with microarray-based expression. Genome Res. 2006, 16: 787-795. 10.1101/gr.5011206.
- Cui X, Xu J, Asghar R, Condamine P, Svensson JT, Wanamaker S, Stein N, Roose M, Close TJ: Detecting single-feature polymorphisms using oligonucleotide arrays and robustified projection pursuit. Bioinformatics. 2005, 21: 3852-3858. 10.1093/bioinformatics/bti640.
- Das S, Bhat PR, Sudhakar C, Ehlers JD, Wanamaker S, Roberts PA, Cui X, Close TJ: Detection and validation of single feature polymorphisms in cowpea (Vigna unguiculata L. Walp) using a soybean genome array. BMC Genomics. 2008, 9: 107. 10.1186/1471-2164-9-107.
- Bhat PR, Lukaszewski A, Cui X, Xu J, Svensson JT, Wanamaker S, Waines JG, Close TJ: Mapping translocation breakpoints using a wheat microarray. Nucleic Acids Research. 2007, 35: 2936-2943. 10.1093/nar/gkm148.
- Walia H, Wilson C, Condamine P, Ismail AM, Xu J, Cui X, Close TJ: Array-based genotyping and expression analysis of barley cv. Maythorpe and Golden Promise. BMC Genomics. 2007, 8: 87. 10.1186/1471-2164-8-87.
- Li C, Wong WH: Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci USA. 2001, 98: 31-36. 10.1073/pnas.011404098.
- Hubbell E, Liu WM, Mei R: Robust estimators for expression analysis. Bioinformatics. 2002, 18: 1585-1592. 10.1093/bioinformatics/18.12.1585.
- Wu Z, Irizarry RA, Gentleman R, Murillo FM, Spencer F: A model based background adjustment for oligonucleotide expression data. J Am Stat Assoc. 2004, 99: 909-917. 10.1198/016214504000000683.
- Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003, 19: 185-193. 10.1093/bioinformatics/19.2.185.
- Dudoit S, Shaffer JP, Boldrick JC: Multiple hypothesis testing in microarray experiments. Statistical Science. 2003, 18: 71-103. 10.1214/ss/1056397487.
- GeneChip expression analysis, data analysis fundamentals. [https://www.affymetrix.com/support/downloads/manuals/data_analysis_fundamentals_manual.pdf]
- Naturalvariation. [http://naturalvariation.org/barley]
- Davis J, Goadrich M: The relationship between Precision-Recall and ROC curves. Proceedings of the Twenty-Third International Conference on Machine Learning (ICML'06), Pittsburgh, PA. 2006.
- Babb S, Muehlbauer GJ: Genetic and morphological characterization of the barley uniculm2 (cul2) mutant. Theor Appl Genet. 2003, 106: 846-857.
- HarvEST: barley. [http://harvest.ucr.edu/]
- Rostoks N, Mudie S, Cardle L, Russell J, Ramsay L, Booth A, Svensson JT, Wanamaker SI, Walia H, Rodriguez EM, Hedley PE, Liu H, Morris J, Close TJ, Marshall DF, Waugh R: Genome-wide SNP discovery and linkage analysis in barley based on genes responsive to abiotic stress. Mol Gen Genomics. 2005, 274: 515-527. 10.1007/s00438-005-0046-z.
- Cho S, Garvin DF, Muehlbauer GJ: Transcriptome Analysis and Physical Mapping of Barley Genes in Wheat-Barley Chromosome Addition Lines. Genetics. 2006, 172: 1277-1285. 10.1534/genetics.105.049908.
- Gene Expression Omnibus (GEO) database. [http://www.ncbi.nlm.nih.gov/geo/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.