Volume 6 Supplement 1
Local false discovery rate and minimum total error rate approaches to identifying interesting chromosomal regions
© Sinha et al; licensee BioMed Central Ltd 2005
Published: 30 December 2005
The simultaneous testing of a large number of hypotheses in a genome scan, using individual thresholds for significance, inherently leads to inflated genome-wide false positive rates. There exist various approaches to approximating the correct genomewide p-values under various assumptions, either by way of asymptotics or simulations. We explore a philosophically different criterion, recently proposed in the literature, which controls the false discovery rate. The test statistics are assumed to arise from a mixture of distributions under the null and non-null hypotheses. We fit the mixture distribution using both a nonparametric approach and commingling analysis, and then apply the local false discovery rate to select cut-off points for regions to be declared interesting. Another criterion, the minimum total error, is also explored. Both criteria seem to be sensible alternatives to controlling the classical type I and type II error rates.
The increase in genome-wide experiments and sequencing of multiple genomes has resulted in the analysis of large data sets that involve the simultaneous testing of statistical hypotheses on a large number of features in a genome. Traditionally, researchers have tackled the problem of multiple testing in linkage analysis by proposing a common threshold to control the family-wise (i.e., genome-wide) error rate. The simplest and frequently used Bonferroni correction is often conservative and is mainly useful when the number of tests involved is not very large. Recently, new criteria have been proposed, the false discovery rate  being one of them. The large number of hypotheses can be looked upon as coming from a mixture of null and non-null hypotheses. Linkage analysis is then a matter of assigning categories after fitting a mixture to the linkage signals.
A nonparametric empirical Bayes approach that makes simultaneous inferences based on z-values (standard normal deviates), converted from the p-values of the test statistics has been introduced [2, 3]. The authors have explored a philosophically different approach that does not claim whether a test is significant or not, but rather whether a result is "interesting" or "uninteresting". The problem of multiple testing is addressed via the concept of the local false discovery rate (LFDR). Let the prior proportion of "uninteresting" hypotheses (e.g., absence of any major gene effect) be po, with corresponding density for the z-values being fo(z), and let the "interesting" hypotheses (e.g., presence of a major gene effect) have a prior proportion p1, with corresponding density for the z-values being f1(z). Thus, f(z) = po fo(z) + p1 f1(z) gives the distribution of the z-values under the mixture of the populations of interesting and uninteresting hypotheses. The posterior probability for a particular z-value,z, to be from an uninteresting hypothesis, is p0f0(z)/f(z), and that to be from an interesting hypothesis is 1 - po fo(z)/f(z) The LFDR is defined to be fdr(z) = f0(z)/f(z), which should be very close to the posterior probability P(hypothesis is uninteresting | z) = p0 f0(z)/f(z) under the often valid assumption that p0 is close to 1. In Efron , all the z-values for which fdr(z) ≤ 0.10 are declared to be from interesting hypotheses. A threshold of 0.10 or smaller is desirable in that the proportion of false discoveries will in this way be controlled. We have devised another criterion, the minimum total error (MTE), for identifying "interesting" regions in the genome. This method is explained in the methods section.
Efron  estimated the mixture distribution f(z) nonparametrically by fitting a Poisson regression to the z-values. In addition to implementing this approach, we also estimate the mixture distribution parametrically. For the nonparametric approach, a natural spline, as proposed by Efron , is fitted to the z-values resulting from the genome scan. The normal empirical null is then estimated from the central peak. The results of the analysis can change drastically depending on whether the null distribution is set to be the usual N(0, 1) or is estimated from the data. We use the notation N-LFDR for the procedure of estimating the mixture nonparametrically via a spline and then applying the LFDR to determine the cut-off point.
We also propose, as an alternative, to assume that both f0(z) and f1(z) are normal distributions with common variances but a larger mean for f1(z). This is a common practice in statistics as a first step in comparing two treatments, with the difference in the means of the normals being the treatment effect. After fitting the mixture distribution via commingling analysis as implemented in S.A.G.E. , both the LFDR and the MTE criteria were applied to determine the cut-off for claiming a chromosomal region as being interesting. These two approaches are referred to as parametric-local false discover rate (P-LFDR) and parametric-minimum total error (P-MTE), respectively, in what follows.
Data and linkage model
Linkage analysis was performed on the Kofendrerd Personality Disorder (KPD) data with disease being expressed as a binary trait. The true regions causing the disease were known. This study is based on the nuclear family data available from the Aipotu (AI), Karangar (KA), and Danacaa (DA) populations. The analysis was performed on both the microsatellite (MS) and the single-nucleotide polymorphism (SNP) data sets. For the parametric approaches, ten replicates (2, 14, 17, 23, 35, 42, 67, 84, 85, and 90), were randomly chosen and analyzed. Another ten replicates of the AI population (1, 26, 48, 50, 51, 60, 65, 88, 93, and 95) were used in calculations studying the effects of increasing sample size. For the N-LFDR, we used all twenty of the above replicates. All methods of analysis were performed on each population, for the SNPs and the MS markers separately.
In each case, the Haseman-Elston  regression model as extended by Shete et al.  was fitted using the W4 option of SIBPAL in S.A.G.E. 4.5. In this approach, the dependent variable is a weighted combination of the squared difference and squared mean-corrected sum of the sibling trait values. The p-values for a whole multipoint genome scan at 2-cM intervals were converted to standard normal z-values, that is, z = Φ-1(1-p), where Φ(●) is the standard normal cumulative distribution function, so that interesting hypotheses are more likely to produce larger z-values.
The N-LFDR approach
For the N-LFDR, the z-values were plotted in a histogram of 60 equi-length intervals, each of 0.1 unit. Then a natural spline of degree seven was fitted to the histogram (which has been shown in Efron  to be equivalent to fitting a Poisson regression to estimate the expected number of counts in each interval into which the data have been partitioned) to obtain , the estimate of the mixture distribution. A spline of degree seven has the same degrees of freedom as a polynomial of degree six. To fit a bimodal curve a polynomial of degree at least six is needed. Generalized cross-validation  was used to evaluate the choice of the degree for fitting a spline. We applied this procedure to both the MS and SNP data of replicate 85. For the MS data, both AI and KA showed the optimal degrees to be approximately seven. The optimal degrees for the other data were higher than seven and were all different. For uniformity we decided to fit a spline of degree seven in every case. SPLUS was used to fit the mixture distribution.
The empirical null distribution, f0(z) is assumed to be normal with mean δ0 and variance σ 0 2 . These parameters were estimated from the central peak of the observed mixture distribution. Specifically, the mean is estimated by the mode of the spline, = argmax . A potential estimate for the standard deviation is
Estimates of means and standard deviations of the empirical null distributions (replicate 85, N-LFDR).
The P-LFDR and P-MTE approaches
Commingling analysis was performed on the z-values using the SEGREG program in S.A.G.E. 4.5 . Initially, a Box-Cox transformation  was attempted. However, this produced a density with a smaller mean than that of . Thus, z-values less than – 2, which are classified as uninteresting, were trimmed off to allow the fitting. The percentage of z-values below -2 was on average 1%. Promising results were then found after fitting a mixture of two normal distributions without any transformation and with both means estimated from the data. The variances were constrained to be equal to avoid numerical instability.
For each population in each replicate, analysis yields p-values at 1,921 locations on the ten chromosomes with the MS data, and at 2,380 locations with the SNP data. The p-values were converted to z-values by inverting the distribution function of a standard normal. Mixture distributions were fitted to the z-values from each population, each replicate, separately for each type of marker data. Both the nonparametric and the parametric approaches were used in the fitting. The interesting regions obtained were compared with the true regions. The disease causing loci are D1, D2, D3, and D4 for the AI and KA populations and D1 and D2 for the DA population. The modifying loci, D5 and D6, were not considered as disease loci, as our focus is to compare the performance of the new criteria rather than to seek the best linkage analysis model. We defined a location as being "linked" to the disease if it is within 10 cM of a disease locus. We defined loci more than 20 cM from any disease locus as "unlinked". Chromosomal regions between 10 cM and 20 cM away from any disease locus were intentionally ignored to allow for a sharp distinction between "linked" and "unlinked" cases.
The parametric approach fitted the mixture distributions to the z-values via commingling analysis. Both the P-MTE and the P-LFDR criteria were applied to the fitted mixture distributions to determine the threshold on the z-values for declaring a genome location interesting. The results for the DA population with MS data are shown in Figure 1. The former gives a threshold of 1.45, while the latter gives one of 2.29.
Estimated probabilities of the two types of errors, with criteria P-MTE and P-LFDR.
Type I error rate
Type II error rate
Type I error rate
Type II error rate
Efficiency of N-LFDR versus sample size (AI population, MS markers, N-LFDR).
Number of pedigrees per sample
Type I error rate (%)
1- Type II error rate (power) (%)
Effect of sample size on the error rates (AI population, MS marker, parametric approaches with criteria P-MTE and P-LFDR).
Type I error rate
1-Type II error rate (power)
Type I error rate
1-Type II error rate (power)
The results for N-LFDR are provided in Table 3. With increasing sample size, the true regions are discovered with higher probabilities. The method also then controlled the false positive rates considerably. The false positive rate for the data with 400 pedigrees appears to be higher than that for samples with 200 pedigrees (0.29% versus 0.47%). However, not much confidence can be placed on estimates of small probabilities for a binomial distribution in small samples, as the standard deviation is several times the magnitude of the estimate. What is clear is that the type I error rates are quite small for all three sample sizes.
Results for the parametric approach are presented in Table 4. Type I error rates were reduced considerably for both P-MTE and P-LFDR when the sample size increased from 100 to 200, and the improvement is more dramatic when the sample size increased from 200 to 400. Power improves considerably as sample size increases from 100 to 200, but does not see an improvement when the sample size is further increased to 400. This again may be due to the fact that we have only five 400-pedigree samples.
Appropriate control of various types of error rates in multiple testing scenarios has long been an intriguing research problem. However, despite the immense efforts spent on this subject, satisfactory solutions are not available. For example, the application of the genome-wide significance criterion suggested by Lander and Kruglyak  to the simulated dataset yielded extremely conservative results.
In this article, we explored the control of false discovery rates, a philosophically different approach, instead of the classical type I and type II error rates in multiple testing problems. Individual test statistics, after appropriate scaling, are viewed as having arisen from a mixture of null and non-null hypotheses. The ratio of the null density versus the mixture density at a given test statistic provides a measure of the LFDR. A procedure that places a cut-off on this ratio controls the FDR.
The mixture distributions were fitted using two different procedures: by fitting 1) a spline (N-LFDR), and 2) a mixture of normal distributions with differing means (P-MTE and P-LFDR). A direct comparison of these approaches with methods in the literature that control the classical types of error rates is not necessarily meaningful because of the philosophical difference between them. Both the local FDR and the MTE criteria have intuitively simple and appealing interpretations.
P-MTE and P-LFDR were compared in terms of finding genuine regions containing disease genes. We observed that the P-LFDR method is more conservative than the P-MTE method in declaring a location "interesting". Therefore, while fewer type I errors are incurred, more truly interesting locations are missed. Table 2 indicates that the SNP data produced greater type I error rates and smaller type II error rates compared with the MS data using the same linkage analysis method. This might be due to that SNPs and MS data have different information content.
The DA population had a simpler genetic model and hence the associated genes were identified with lower type I and type II error rates than in the other populations, as expected. As the sample size increases (Tables 3 and 4) the power increases, which we would expect from any good criterion and analysis method. In N-LFDR we also observe that the type I error remains the same for samples of 100 pedigrees and 200 pedigrees, but there is an increase in the type I error in the samples of 400 pedigrees. This may be due to the fact that there were only five samples of 400 pedigrees. With increasing sample size in the parametric approach both types of error rate decrease, confirming that the two distributions in the mixture separate further with increasing sample size. Note that we are estimating small binomial probabilities, which have standard errors many times the mean and hence these estimates should not be trusted when the sample size is small.
Finally, although both methods for fitting the mixture distributions implicitly assume that the test statistics are independent, which is surely violated in a true multipoint genome scan, the estimate of the mixture distribution is nevertheless consistent and requires a large sample size to be accurate.
Kofendrerd Personality Disorder
Local false discovery rate
Minimum total error
Parametric-local false discover rate
Parametric-minimum total error
This work was supported in part by U.S. Public Health Service Resource Grant RR03655 from the National Center for Research Resources, Research Grant GM28356 from the National Institute of General Medical Sciences, and Training Grant HL07567 from the National Heart, Lung and Blood Institute.
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B. 1995, 57: 289-300.Google Scholar
- Efron B: Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. J Am Stat Assoc. 2004, 99: 96-104. 10.1198/016214504000000089.View ArticleGoogle Scholar
- Efron B, Tibshirani R, Storey J, Tusher V: Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc. 2001, 96: 1151-1160. 10.1198/016214501753382129.View ArticleGoogle Scholar
- Department of Epidemiology and Biostatistics Rammelkamp Center for Education and Research MetroHealth Campus, Case Western Reserve University: S.A.G.E.: Statistical Analysis for Genetic Epidemiology. [http://darwin.epbi.cwru.edu/sage/]
- Haseman JK, Elston RC: The investigation of linkage between a quantitative trait and a marker locus. Behav Genet. 1972, 2: 3-19. 10.1007/BF01066731.View ArticlePubMedGoogle Scholar
- Shete S, Jacobs KB, Elston RC: Adding further power to the Haseman and Elston method for detecting linkage in larger sibships: weighting sums and differences. Hum Hered. 2003, 55: 79-85. 10.1159/000072312.View ArticlePubMedGoogle Scholar
- Loader C: Local Regression and Likelihood. 1999, New York: SpringerGoogle Scholar
- Box GEP, Cox DR: An analysis of transformations. J Roy Stat Soc B. 1964, 26: 211-252.Google Scholar
- Lander E, Kruglyak L: Genetic dissection of complex traits guidelines for interpreting and reporting linkage results. Nat Genet. 1995, 11: 241-247. 10.1038/ng1195-241.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.