 Proceedings
 Open Access
 Published:
Local false discovery rate and minimum total error rate approaches to identifying interesting chromosomal regions
BMC Genetics volume 6, Article number: S23 (2005)
Abstract
The simultaneous testing of a large number of hypotheses in a genome scan, using individual thresholds for significance, inherently leads to inflated genomewide false positive rates. There exist various approaches to approximating the correct genomewide pvalues under various assumptions, either by way of asymptotics or simulations. We explore a philosophically different criterion, recently proposed in the literature, which controls the false discovery rate. The test statistics are assumed to arise from a mixture of distributions under the null and nonnull hypotheses. We fit the mixture distribution using both a nonparametric approach and commingling analysis, and then apply the local false discovery rate to select cutoff points for regions to be declared interesting. Another criterion, the minimum total error, is also explored. Both criteria seem to be sensible alternatives to controlling the classical type I and type II error rates.
Background
The increase in genomewide experiments and sequencing of multiple genomes has resulted in the analysis of large data sets that involve the simultaneous testing of statistical hypotheses on a large number of features in a genome. Traditionally, researchers have tackled the problem of multiple testing in linkage analysis by proposing a common threshold to control the familywise (i.e., genomewide) error rate. The simplest and frequently used Bonferroni correction is often conservative and is mainly useful when the number of tests involved is not very large. Recently, new criteria have been proposed, the false discovery rate [1] being one of them. The large number of hypotheses can be looked upon as coming from a mixture of null and nonnull hypotheses. Linkage analysis is then a matter of assigning categories after fitting a mixture to the linkage signals.
A nonparametric empirical Bayes approach that makes simultaneous inferences based on zvalues (standard normal deviates), converted from the pvalues of the test statistics has been introduced [2, 3]. The authors have explored a philosophically different approach that does not claim whether a test is significant or not, but rather whether a result is "interesting" or "uninteresting". The problem of multiple testing is addressed via the concept of the local false discovery rate (LFDR). Let the prior proportion of "uninteresting" hypotheses (e.g., absence of any major gene effect) be p_{o}, with corresponding density for the zvalues being f_{o}(z), and let the "interesting" hypotheses (e.g., presence of a major gene effect) have a prior proportion p_{1}, with corresponding density for the zvalues being f_{1}(z). Thus, f(z) = p_{o} f_{o}(z) + p_{1} f_{1}(z) gives the distribution of the zvalues under the mixture of the populations of interesting and uninteresting hypotheses. The posterior probability for a particular zvalue,z, to be from an uninteresting hypothesis, is p_{0}f_{0}(z)/f(z), and that to be from an interesting hypothesis is 1  p_{o} f_{o}(z)/f(z) The LFDR is defined to be fdr(z) = f_{0}(z)/f(z), which should be very close to the posterior probability P(hypothesis is uninteresting  z) = p_{0} f_{0}(z)/f(z) under the often valid assumption that p_{0} is close to 1. In Efron [2], all the zvalues for which fdr(z) ≤ 0.10 are declared to be from interesting hypotheses. A threshold of 0.10 or smaller is desirable in that the proportion of false discoveries will in this way be controlled. We have devised another criterion, the minimum total error (MTE), for identifying "interesting" regions in the genome. This method is explained in the methods section.
Efron [2] estimated the mixture distribution f(z) nonparametrically by fitting a Poisson regression to the zvalues. In addition to implementing this approach, we also estimate the mixture distribution parametrically. For the nonparametric approach, a natural spline, as proposed by Efron [2], is fitted to the zvalues resulting from the genome scan. The normal empirical null is then estimated from the central peak. The results of the analysis can change drastically depending on whether the null distribution is set to be the usual N(0, 1) or is estimated from the data. We use the notation NLFDR for the procedure of estimating the mixture nonparametrically via a spline and then applying the LFDR to determine the cutoff point.
We also propose, as an alternative, to assume that both f_{0}(z) and f_{1}(z) are normal distributions with common variances but a larger mean for f_{1}(z). This is a common practice in statistics as a first step in comparing two treatments, with the difference in the means of the normals being the treatment effect. After fitting the mixture distribution via commingling analysis as implemented in S.A.G.E. [4], both the LFDR and the MTE criteria were applied to determine the cutoff for claiming a chromosomal region as being interesting. These two approaches are referred to as parametriclocal false discover rate (PLFDR) and parametricminimum total error (PMTE), respectively, in what follows.
Methods
Data and linkage model
Linkage analysis was performed on the Kofendrerd Personality Disorder (KPD) data with disease being expressed as a binary trait. The true regions causing the disease were known. This study is based on the nuclear family data available from the Aipotu (AI), Karangar (KA), and Danacaa (DA) populations. The analysis was performed on both the microsatellite (MS) and the singlenucleotide polymorphism (SNP) data sets. For the parametric approaches, ten replicates (2, 14, 17, 23, 35, 42, 67, 84, 85, and 90), were randomly chosen and analyzed. Another ten replicates of the AI population (1, 26, 48, 50, 51, 60, 65, 88, 93, and 95) were used in calculations studying the effects of increasing sample size. For the NLFDR, we used all twenty of the above replicates. All methods of analysis were performed on each population, for the SNPs and the MS markers separately.
In each case, the HasemanElston [5] regression model as extended by Shete et al. [6] was fitted using the W4 option of SIBPAL in S.A.G.E. 4.5[4]. In this approach, the dependent variable is a weighted combination of the squared difference and squared meancorrected sum of the sibling trait values. The pvalues for a whole multipoint genome scan at 2cM intervals were converted to standard normal zvalues, that is, z = Φ^{1}(1p), where Φ(●) is the standard normal cumulative distribution function, so that interesting hypotheses are more likely to produce larger zvalues.
The NLFDR approach
For the NLFDR, the zvalues were plotted in a histogram of 60 equilength intervals, each of 0.1 unit. Then a natural spline of degree seven was fitted to the histogram (which has been shown in Efron [2] to be equivalent to fitting a Poisson regression to estimate the expected number of counts in each interval into which the data have been partitioned) to obtain , the estimate of the mixture distribution. A spline of degree seven has the same degrees of freedom as a polynomial of degree six. To fit a bimodal curve a polynomial of degree at least six is needed. Generalized crossvalidation [7] was used to evaluate the choice of the degree for fitting a spline. We applied this procedure to both the MS and SNP data of replicate 85. For the MS data, both AI and KA showed the optimal degrees to be approximately seven. The optimal degrees for the other data were higher than seven and were all different. For uniformity we decided to fit a spline of degree seven in every case. SPLUS was used to fit the mixture distribution.
The empirical null distribution, f_{0}(z) is assumed to be normal with mean δ_{0} and variance σ_{ 0 }^{2}. These parameters were estimated from the central peak of the observed mixture distribution. Specifically, the mean is estimated by the mode of the spline, = argmax . A potential estimate for the standard deviation is
However, this estimate is numerically unstable and a smoothing technique, as suggested by Efron [2], was used instead. A quadratic equation of the form (where x_{ k }is the median of the k^{th} of 30 0.1unit intervals within 1.5 units of δ_{0}) was fitted by least squares, and the estimated standard deviation is . Table 1 shows the parameter estimates of the null distributions for the different populations using SNPs and MSs from replicate 85. The NLFDR is estimated as and those chromosomal locations for which the NLFDR is less than or equal to 0.1 were deemed interesting.
The PLFDR and PMTE approaches
Commingling analysis was performed on the zvalues using the SEGREG program in S.A.G.E. 4.5 [4]. Initially, a BoxCox transformation [8] was attempted. However, this produced a density with a smaller mean than that of . Thus, zvalues less than – 2, which are classified as uninteresting, were trimmed off to allow the fitting. The percentage of zvalues below 2 was on average 1%. Promising results were then found after fitting a mixture of two normal distributions without any transformation and with both means estimated from the data. The variances were constrained to be equal to avoid numerical instability.
After fitting the mixture distribution, a criterion is called for to identify zvalues that are likely to be classified as interesting, i.e., locations that are potentially genetically linked to disease causing regions. Two criteria, the PLFDR and the PMTE, were explored for this purpose. The PMTE method sets the cutoff at the intersection of the fitted null and alternative distributions. Figure 1 illustrates this method. If x_{0} is the intersection point of and , then locations with zvalues > x_{0} are declared interesting. In Figure 1, x_{0} = 1.45. Note that the area under the null distribution curve to the right of this point is the probability of type I error, that is, we make an error by declaring an underlying uninteresting region as interesting. Similarly, the area under the alternative distribution curve to the left of this point is the probability of type II error, that is, genuine interesting regions are declared uninteresting. The PMTE method minimizes the total probability of these two types of errors. Note that the PMTE criterion implicitly assigns equal costs to committing either type of error, which may not be desirable. The PLFDR criterion, on the other hand, assumes most of the large number of hypotheses are uninteresting and controls the false discovery rate. As for NLFDR, we declared a region as interesting when fdr(z) < 0.1.
Results
For each population in each replicate, analysis yields pvalues at 1,921 locations on the ten chromosomes with the MS data, and at 2,380 locations with the SNP data. The pvalues were converted to zvalues by inverting the distribution function of a standard normal. Mixture distributions were fitted to the zvalues from each population, each replicate, separately for each type of marker data. Both the nonparametric and the parametric approaches were used in the fitting. The interesting regions obtained were compared with the true regions. The disease causing loci are D1, D2, D3, and D4 for the AI and KA populations and D1 and D2 for the DA population. The modifying loci, D5 and D6, were not considered as disease loci, as our focus is to compare the performance of the new criteria rather than to seek the best linkage analysis model. We defined a location as being "linked" to the disease if it is within 10 cM of a disease locus. We defined loci more than 20 cM from any disease locus as "unlinked". Chromosomal regions between 10 cM and 20 cM away from any disease locus were intentionally ignored to allow for a sharp distinction between "linked" and "unlinked" cases.
For NLFDR, where the mixture distribution is fitted by a spline, in all the replicates of the DA population the true regions were identified 39.52% of the time using the MSs and 46.67% of the times using the SNPs. Given that the SNPs are more than twice as dense as the MSs, the power of using either is comparable in our data. Figure 2 shows the zvalues on chromosome 1 for the MS and SNP data from the DA population in replicate 85. The thresholds for declaring interesting regions are different in the two datasets. Thus we shifted the zvalues for the MS data down by 1.1 to equate the thresholds presented in Figure 2. In this replicate the MS marker data were unable to identify D1. The figure shows that MSs and SNPs do not always yield concordant results. Also, it appears that the SNPs yield more accurate localizations than the MSs because the peak of the signal using MS markers is shifted to the left. However, the method was not very successful in identifying the true disease locations for the AI population, possibly due to a relatively more complicated disease model and the lack of covariates in the linkage model.
The parametric approach fitted the mixture distributions to the zvalues via commingling analysis. Both the PMTE and the PLFDR criteria were applied to the fitted mixture distributions to determine the threshold on the zvalues for declaring a genome location interesting. The results for the DA population with MS data are shown in Figure 1. The former gives a threshold of 1.45, while the latter gives one of 2.29.
For each of the three populations and each of the two types of marker data, the probability of committing each of the two types of errors, pooled over all replicates considered, has been calculated and is reported in Table 2. The PMTE method, as expected, minimizes the total probabilities of the two types of errors, while the PLFDR more effectively minimizes type I error. For both methods, the SNP data yielded a relatively smaller type II error rate and a larger type I error rate compared with those from the MS data. Overall, the DA population has the smallest error rates, which is not surprising because the genetic effects are better defined and more evident in this population.
We also investigated whether increasing the sample size will provide better results. For this, we merged replicates to form larger samples and repeated the analysis on these samples. The results on the AI population are reported in Tables 3 and 4. Specifically, there are 20 replicates, each of 100 pedigrees. To make samples of 200 pedigrees, pairs of replicates are merged to yield 10 such samples. Five 400pedigree samples are obtained by merging 4 replicates. Similar analysis was performed each time on the merged datasets.
The results for NLFDR are provided in Table 3. With increasing sample size, the true regions are discovered with higher probabilities. The method also then controlled the false positive rates considerably. The false positive rate for the data with 400 pedigrees appears to be higher than that for samples with 200 pedigrees (0.29% versus 0.47%). However, not much confidence can be placed on estimates of small probabilities for a binomial distribution in small samples, as the standard deviation is several times the magnitude of the estimate. What is clear is that the type I error rates are quite small for all three sample sizes.
Results for the parametric approach are presented in Table 4. Type I error rates were reduced considerably for both PMTE and PLFDR when the sample size increased from 100 to 200, and the improvement is more dramatic when the sample size increased from 200 to 400. Power improves considerably as sample size increases from 100 to 200, but does not see an improvement when the sample size is further increased to 400. This again may be due to the fact that we have only five 400pedigree samples.
Discussion
Appropriate control of various types of error rates in multiple testing scenarios has long been an intriguing research problem. However, despite the immense efforts spent on this subject, satisfactory solutions are not available. For example, the application of the genomewide significance criterion suggested by Lander and Kruglyak [9] to the simulated dataset yielded extremely conservative results.
In this article, we explored the control of false discovery rates, a philosophically different approach, instead of the classical type I and type II error rates in multiple testing problems. Individual test statistics, after appropriate scaling, are viewed as having arisen from a mixture of null and nonnull hypotheses. The ratio of the null density versus the mixture density at a given test statistic provides a measure of the LFDR. A procedure that places a cutoff on this ratio controls the FDR.
The mixture distributions were fitted using two different procedures: by fitting 1) a spline (NLFDR), and 2) a mixture of normal distributions with differing means (PMTE and PLFDR). A direct comparison of these approaches with methods in the literature that control the classical types of error rates is not necessarily meaningful because of the philosophical difference between them. Both the local FDR and the MTE criteria have intuitively simple and appealing interpretations.
PMTE and PLFDR were compared in terms of finding genuine regions containing disease genes. We observed that the PLFDR method is more conservative than the PMTE method in declaring a location "interesting". Therefore, while fewer type I errors are incurred, more truly interesting locations are missed. Table 2 indicates that the SNP data produced greater type I error rates and smaller type II error rates compared with the MS data using the same linkage analysis method. This might be due to that SNPs and MS data have different information content.
The DA population had a simpler genetic model and hence the associated genes were identified with lower type I and type II error rates than in the other populations, as expected. As the sample size increases (Tables 3 and 4) the power increases, which we would expect from any good criterion and analysis method. In NLFDR we also observe that the type I error remains the same for samples of 100 pedigrees and 200 pedigrees, but there is an increase in the type I error in the samples of 400 pedigrees. This may be due to the fact that there were only five samples of 400 pedigrees. With increasing sample size in the parametric approach both types of error rate decrease, confirming that the two distributions in the mixture separate further with increasing sample size. Note that we are estimating small binomial probabilities, which have standard errors many times the mean and hence these estimates should not be trusted when the sample size is small.
Finally, although both methods for fitting the mixture distributions implicitly assume that the test statistics are independent, which is surely violated in a true multipoint genome scan, the estimate of the mixture distribution is nevertheless consistent and requires a large sample size to be accurate.
Abbreviations
 KPD:

Kofendrerd Personality Disorder
 LFDR:

Local false discovery rate
 MS:

Microsatellite
 MTE:

Minimum total error
 PLFDR:

Parametriclocal false discover rate
 PMTE:

Parametricminimum total error
 SNP:

Singlenucleotide polymorphism
References
 1.
Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B. 1995, 57: 289300.
 2.
Efron B: Largescale simultaneous hypothesis testing: the choice of a null hypothesis. J Am Stat Assoc. 2004, 99: 96104. 10.1198/016214504000000089.
 3.
Efron B, Tibshirani R, Storey J, Tusher V: Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc. 2001, 96: 11511160. 10.1198/016214501753382129.
 4.
Department of Epidemiology and Biostatistics Rammelkamp Center for Education and Research MetroHealth Campus, Case Western Reserve University: S.A.G.E.: Statistical Analysis for Genetic Epidemiology. [http://darwin.epbi.cwru.edu/sage/]
 5.
Haseman JK, Elston RC: The investigation of linkage between a quantitative trait and a marker locus. Behav Genet. 1972, 2: 319. 10.1007/BF01066731.
 6.
Shete S, Jacobs KB, Elston RC: Adding further power to the Haseman and Elston method for detecting linkage in larger sibships: weighting sums and differences. Hum Hered. 2003, 55: 7985. 10.1159/000072312.
 7.
Loader C: Local Regression and Likelihood. 1999, New York: Springer
 8.
Box GEP, Cox DR: An analysis of transformations. J Roy Stat Soc B. 1964, 26: 211252.
 9.
Lander E, Kruglyak L: Genetic dissection of complex traits guidelines for interpreting and reporting linkage results. Nat Genet. 1995, 11: 241247. 10.1038/ng1195241.
Acknowledgements
This work was supported in part by U.S. Public Health Service Resource Grant RR03655 from the National Center for Research Resources, Research Grant GM28356 from the National Institute of General Medical Sciences, and Training Grant HL07567 from the National Heart, Lung and Blood Institute.
Author information
Additional information
Authors' contributions
RS, MS, and GM are equally involved in the statistical analysis, interpretation, preparation of the draft and in the final revision of the manuscript. RCE and YL contributed to the initial conception and design, and critical revision of the manuscript. All authors approved of the manuscript.
Ritwik Sinha, Moumita Sinha, George Mathew contributed equally to this work.
Rights and permissions
About this article
Cite this article
Sinha, R., Sinha, M., Mathew, G. et al. Local false discovery rate and minimum total error rate approaches to identifying interesting chromosomal regions. BMC Genet 6, S23 (2005) doi:10.1186/147121566S1S23
Published
DOI
Keywords
 Mixture Distribution
 True Region
 Natural Spline
 Major Gene Effect
 Local False Discovery Rate