Testing allele homogeneity: the problem of nested hypotheses
© Izbicki et al.; licensee BioMed Central Ltd. 2012
Received: 20 June 2012
Accepted: 16 October 2012
Published: 23 November 2012
The evaluation of associations between genotypes and diseases in a case-control framework plays an important role in genetic epidemiology. This paper focuses on the evaluation of the homogeneity of both genotypic and allelic frequencies. The traditional test that is used to check allelic homogeneity is known to be valid only under Hardy-Weinberg equilibrium, a property that may not hold in practice.
We first describe the flaws of the traditional (chi-squared) tests for both allelic and genotypic homogeneity. Besides the known problem of the allelic procedure, we show that whenever these tests are used, an incoherence may arise: sometimes the genotypic homogeneity hypothesis is not rejected, but the allelic hypothesis is. As we argue, this is logically impossible. Some recently proposed methods implicitly rely on the idea that this does not happen. In an attempt to correct this incoherence, we describe an alternative frequentist approach that is appropriate even when Hardy-Weinberg equilibrium does not hold. It is then shown that the problem remains and is intrinsic to frequentist procedures. Finally, we introduce the Full Bayesian Significance Test to test both hypotheses and prove that the incoherence cannot happen with these new tests. To illustrate this, all five tests are applied to real and simulated datasets. Using power analysis, we show that the Bayesian method is comparable to the frequentist one and has the advantage of being coherent.
Contrary to more traditional approaches, the Full Bayesian Significance Test for association studies provides a simple, coherent and powerful tool for detecting associations.
One of the main goals in genetic epidemiology is the evaluation of associations between specific genotypes or alleles and a certain disease. Association studies are usually performed in a case-control framework in which one or several polymorphisms of candidate genes are evaluated in a group of cases (that is, patients that have a disease) and in a group of controls from the same population (that is, healthy individuals) . The frequencies of each of the genotypes are then computed so that statistical tests that aim at checking for associations between genes and the disease can be performed. The population studied usually must be homogeneous regarding ethnicity, gender distribution and other factors that may bias the results and produce false-positive associations. See  for a nontechnical summary of reasons that may lead to false discoveries in case-control studies and  for a theoretical analysis of the consequences of population stratification. For more on case-control studies, the reader is referred to .
Several statistical tests are usually employed in this scenario. These include the Cochran-Armitage test for trends , homogeneity chi-square tests for contingency tables of both genotypic and allelic frequencies , likelihood ratio tests and Wald tests . See, for example,  and  for a summary of these tests. Some of these statistics are specifically designed to work under assumptions such as dominance models, recessive models or Hardy-Weinberg Equilibrium (HWE). However, considerable attention is now being given to new methods that are robust to model misspecification, mainly because power is usually small when the model is wrong and type-I error rates are usually incorrect (see e.g. [7, 10, 11]).
HWE plays an important role in genetic studies, in particular when testing for allelic homogeneity . The main reason is that the traditional test for allelic homogeneity fails when HWE does not hold, a point to which we will return later. In words, HWE is a constraint on the genotypic proportions that implies, under some assumptions, stability of the different genotypes over the generations of the population (see e.g.  and ). These assumptions include, for example, random mating between individuals. For many diseases, random mating is not expected to be satisfied. The same holds for other conditions required for HWE, which may be unrealistic in practice. In fact, as stated by , “a population will never be exactly in HWE”. Hence, the need to design tests that are robust to departures from HWE is evident. A common practice in such problems is to first test HWE, discarding genes that are not in equilibrium. This is done in an attempt to identify genotyping errors. Such an approach should be avoided, as discussed by . The main reason is that the two-step procedure alters type-I error rates. They also emphasize that the correct way to deal with this problem is to inherently account for deviations from HWE with adjusted tests, the approach we take here.
The aims of this paper are four-fold: 1 - to describe how the analysis of such data is usually conducted and to emphasize its known flaw (namely lack of robustness to departures from HWE); 2 - to describe one exact frequentist approach which is correct from a classical point of view; 3 - to present a Bayesian method to deal with the problem, and 4 - to advocate the use of the Bayesian solution by demonstrating why this is the best solution compared to the others. The main argument is based on an undesirable logical inconsistency that can happen whenever p-values are used to test nested hypotheses. We prove that this does not happen when using the Bayesian method proposed. We also show that the Bayesian and the correct frequentist solutions have comparable power. Simulations and analyses of real data are shown in order to illustrate the problem.
The paper is organized as follows. Section Methods contains three subsections: Usual Procedures, which introduces the notation that is used throughout the paper, discusses the usual methods to deal with the problem and argues why the test for allelic homogeneity is wrong when there are departures from HWE; A Different Frequentist Test, which introduces a frequentist test that works even when departures from HWE occur; and Bayesian Solution, which introduces the FBST approach to solve the problem. Section Results and Discussion first focuses on the logical incoherence that arises when using the frequentist procedures discussed in the paper and shows that the same does not happen with the Bayesian method FBST. A brief discussion on Bayes factors is also provided. Finally, we address the question of whether the Bayesian method has good frequentist properties. Section Conclusions summarizes the findings of the paper.
Here, we formally describe three different approaches to deal with the problem described: the usual procedure, a correct frequentist proposal and a Bayesian solution.
We begin by describing the statistical model that is used to deal with the problem approached in this paper (namely, product of multinomials) and also how the hypotheses of interest are usually tested in genetic literature. For more details, see .
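Population genotypic frequencies (sample counts, with the corresponding population frequencies in parentheses)
Genotype: AA, AB, BB
First group: x AA (γ AA), x AB (γ AB), x BB (γ BB)
Second group: y AA (Π AA), y AB (Π AB), y BB (Π BB)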
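Under this model, the likelihood of the observed counts is proportional to $\gamma_{AA}^{x_{AA}}\,\gamma_{AB}^{x_{AB}}\,\gamma_{BB}^{x_{BB}}\,\Pi_{AA}^{y_{AA}}\,\Pi_{AB}^{y_{AB}}\,\Pi_{BB}^{y_{BB}}$, which is the product of two multinomial distributions.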
where the estimate of the genotypic frequency i is its maximum likelihood estimator under the hypothesis of genotypic homogeneity. Under this hypothesis, Q G has an asymptotic chi-square distribution with 2 degrees of freedom. Using this fact, it is possible to calculate an asymptotic p-value. If one prefers exact tests, Monte Carlo methods can also be used. To sum up, in order to test the first hypothesis, one usually performs a traditional chi-square test of homogeneity on Table 3.
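As an illustration of the genotypic test just described, the following minimal sketch computes a Pearson chi-square statistic for a 2x3 genotype table and its asymptotic p-value with 2 degrees of freedom. It assumes that Q G is the usual Pearson statistic; the function name and the counts are hypothetical and are not the data analysed in the paper.

```python
import math

def genotypic_chi_square(x, y):
    """Pearson chi-square test of homogeneity for a 2x3 genotype table.

    x, y: counts (AA, AB, BB) in the two groups.
    Returns (statistic, asymptotic p-value with 2 degrees of freedom).
    """
    n_x, n_y = sum(x), sum(y)
    n = n_x + n_y
    q = 0.0
    for x_i, y_i in zip(x, y):
        # Pooled proportion: MLE of the common genotypic frequency under homogeneity.
        pooled = (x_i + y_i) / n
        for observed, size in ((x_i, n_x), (y_i, n_y)):
            expected = size * pooled
            if expected > 0:
                q += (observed - expected) ** 2 / expected
    # The survival function of a chi-square with 2 degrees of freedom is exp(-q / 2).
    return q, math.exp(-q / 2.0)

# Hypothetical counts (AA, AB, BB), for illustration only.
q_g, p_g = genotypic_chi_square((30, 50, 20), (20, 50, 30))
print(f"Q_G = {q_g:.3f}, asymptotic p-value = {p_g:.4f}")
```

The closed-form survival function avoids any dependency beyond the standard library; for other degrees of freedom a statistical library would be needed.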
Population allelic frequencies (allele counts derived from the genotype counts)
Allele: A, B
First group: x A = 2x AA + x AB, x B = 2x BB + x AB
Second group: y A = 2y AA + y AB, y B = 2y BB + y AB
where the estimate of the allelic frequency i is its maximum likelihood estimator under the hypothesis that allelic frequencies are the same in both groups. This statistic is then compared to a chi-square distribution with 1 degree of freedom, or sampled using a Monte Carlo method, to calculate the p-value. However, in this scenario, the distribution of the test statistic under the null hypothesis is not chi-square unless alleles are statistically independent. In other words, the distribution is chi-square only if a product multinomial model can be applied to Table 4. Essentially, this independence corresponds to the HWE. In fact,  formally proves that this is a valid test if, and only if, both groups, case and control, are under HWE. Otherwise, this test is biased: the nominal level of significance is different from the real one .  also shows how deviations from HWE alter type-I error rates, a point that will also be illustrated in Section Results and Discussion. Therefore, this test should not be used. It is important to note that, despite being wrong, it is still widely used in the genetic literature (see e.g. , which also discusses some aspects of the lack of robustness of this test). This leads to more false conclusions than the nominal error rates of the procedures suggest.
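For concreteness, the sketch below collapses genotype counts into allele counts and applies the traditional 2x2 chi-square test with 1 degree of freedom. It illustrates the procedure that the text argues against, under the assumption that the statistic is the usual Pearson one; the function names and the counts are hypothetical.

```python
import math

def allele_counts(genotype_counts):
    """Collapse (AA, AB, BB) genotype counts into (A, B) allele counts."""
    n_aa, n_ab, n_bb = genotype_counts
    return 2 * n_aa + n_ab, 2 * n_bb + n_ab

def traditional_allelic_test(x, y):
    """Pearson chi-square test on the 2x2 allele table (valid only under HWE).

    Returns (statistic, asymptotic p-value with 1 degree of freedom).
    """
    x_a, x_b = allele_counts(x)
    y_a, y_b = allele_counts(y)
    n_x, n_y = x_a + x_b, y_a + y_b
    n = n_x + n_y
    q = 0.0
    for obs_x, obs_y in ((x_a, y_a), (x_b, y_b)):
        pooled = (obs_x + obs_y) / n  # common allelic frequency under the null
        for observed, size in ((obs_x, n_x), (obs_y, n_y)):
            expected = size * pooled
            if expected > 0:
                q += (observed - expected) ** 2 / expected
    # Survival function of a chi-square with 1 degree of freedom.
    return q, math.erfc(math.sqrt(q / 2.0))

# Hypothetical counts (AA, AB, BB), for illustration only.
q_a, p_a = traditional_allelic_test((30, 50, 20), (20, 50, 30))
print(f"allele counts: {allele_counts((30, 50, 20))}, Q_A = {q_a:.3f}, p = {p_a:.4f}")
```

Note that the allele table treats each individual as two independent observations, which is exactly the step that requires HWE.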
Applying the traditional tests to data from Table 1, one gets a p-value of 0.152 for genotypic association and of 0.049 for allelic association. This means that the evidence we have that the two groups are in genotypic homogeneity is larger than the evidence we have that they are in allelic homogeneity. However, if genotypic proportions are the same, allelic proportions must also be the same. This implication will be made formal in Section Results and Discussion. In practice, the first p-value being larger than the second implies that one can accept the hypothesis of genotypic homogeneity while rejecting allelic homogeneity, which is a contradiction. For instance, this is the case when the level of significance is 10%, as 0.152 > 0.1 but 0.049 < 0.1. To summarize: we are testing two nested hypotheses, that is, the nature of the problem is such that the first hypothesis implies the second. However, even though we reject the second, we do not reject the first. Does the contradiction happen because the allelic test is wrong? The next section answers this question by presenting an exact test for allelic homogeneity that is valid even if HWE does not hold.
Some attempts to correct the above-mentioned allelic test so that it works even when the HWE assumption is not met are considered by [8, 12, 17, 19]. See  for a summary of these. Also, see , which proves that the test proposed by  to correct for departures from HWE is equivalent to the one proposed by , which is Armitage’s trend test .
Note that this formulation is always true, regardless of the Hardy-Weinberg equilibrium restriction, and does not involve changing either the sample space or the parametric space.
Maximization of Equation (3) can be done efficiently by using numerical methods such as Newton’s method , which are already implemented in most statistical and mathematical software such as R and MATLAB. To calculate p-values, the statistic Q A∗ can then be compared to a chi-square distribution with 1 degree of freedom or, if one wishes to perform an exact test (the approach we take here), sampled using Monte Carlo methods. That is, one can generate several values of Q A∗ under the null hypothesis and compute the proportion of these that are larger than the statistic observed in the sample. This is the (estimate of the) exact p-value. Confidence intervals can be obtained for it by using a normal approximation to the binomial distribution. Note that the dimension of the parametric space is 4 and under the null hypothesis it becomes 3. Hence, the number of degrees of freedom of the distribution of the chi-square statistic is 4 − 3 = 1. This is also the number of degrees of freedom of the chi-square for the allelic test described before.
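The Monte Carlo step described above can be sketched as follows. This is a minimal illustration, not the authors' routine: it takes the observed statistic and a list of statistics already simulated under the null hypothesis, and returns the estimated p-value with a normal-approximation confidence interval. The function name is hypothetical, and counting ties as exceedances (using >=) is an assumption of this sketch.

```python
import math

def monte_carlo_p_value(observed, simulated, z=1.96):
    """Estimate a p-value from statistics simulated under the null hypothesis.

    Returns (p_hat, lower, upper), where the interval uses the normal
    approximation to the binomial distribution (z = 1.96 gives about 95%).
    """
    n_sim = len(simulated)
    if n_sim == 0:
        raise ValueError("at least one simulated statistic is required")
    # Assumption: simulated values equal to the observed one count as exceedances.
    exceed = sum(1 for s in simulated if s >= observed)
    p_hat = exceed / n_sim
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n_sim)
    return p_hat, max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

# Toy input, for illustration only.
print(monte_carlo_p_value(2.0, [0.0, 1.0, 2.0, 3.0]))
```

In practice the number of simulated statistics would be in the thousands, so that the interval is narrow enough to compare the estimate with the chosen significance level.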
This test is very similar to the ones recently introduced by , except that the statistics used are different (Wald statistic, score statistic and maximum profile likelihood ratio), and results are asymptotic: chi squared approximation is used. Even though these tests are asymptotically equivalent, in order to illustrate our points it is important to have exact tests here.
The allelic p-value for data from Table 3 is 0.069. It is surprising that, although this test is correct, this p-value is still smaller than 0.152 - the p-value for genotypic association. Hence, incoherence remains even when correcting the traditional allelic test. We note that the p-value found by  for this same data set using a corrected allelic test is 0.066, which also does not remove the contradiction. Note that here we use the exact test, hence this is not a problem of using an approximation. In Section Results and Discussion we present other data sets in which this incoherence happens, showing that this problem is not unique to the particular data we chose to illustrate the point. The next section presents a framework in which this kind of contradiction does not happen.
Bayesian methods are the alternative inductive way to deal with such a problem. These methods are widely used nowadays because they allow prior knowledge from the researcher and scientific community to be incorporated into the analysis (see  for applied examples of these methods in genetics) and, contrary to usual classical procedures, they do not require large samples for the analysis to be correct. That is, optimality of the procedure does not rely on asymptotic considerations. Many Bayesian methods designed to deal with precise hypotheses, i.e., hypotheses which have lower dimension than the parametric space, have been developed. Precise hypotheses must have a different treatment in Bayesian statistics: in general, they have zero posterior probability, so that they would always be rejected when using traditional methods. One way to deal with this problem is to assign a positive prior probability to null hypothesis , but this may seem a rather ad hoc solution and may lead to some inconsistencies . Another approach is to use Bayes factors , a point to which we will get back later in the paper (see Section Results and Discussion).
In words, the e-value is the posterior probability of the subset of the parametric space consisting of points with lower posterior density than the maximum achieved under H. It is interesting to note the duality between p-values and e-values: while the former are tail areas of the sampling distribution, computed from the observed values under the null hypothesis, the latter are tail areas of the posterior distribution, computed from the sharp hypothesis. E-values are easy to calculate, and successful papers that use the FBST procedure in genetics include [14, 28, 29]. For more on e-values, the intuition behind them, asymptotic consistency results, and decision-theoretic considerations see . High e-values indicate strong evidence in favor of the hypothesis, while low e-values indicate evidence against it.
Implementation of the FBST procedure requires two simple steps, which can be performed numerically:
Optimization - Finding the supremum of the posterior density under the null hypothesis. This is usually done by using built-in functions from statistical packages such as R.
Integration - Integrating the posterior density over the Tangential Set. This step can be done by sampling from the posterior distribution by using methods such as MCMC. For the problem considered here, a usual Monte Carlo method is enough to efficiently sample from the posterior.
More details on the implementation of the FBST procedure can be found in . To perform the complete FBST procedure one also needs to set a cut-off point, that is, one must say what a “small” e-value means. Several approaches are available:
Empirical power analysis 
Reference sensitivity analysis and paraconsistent logic .
 relate e-values to p-values.
Bayesian decision-theoretic approach , by the specification of a loss function that gives origin to FBST procedure.
Note that in this case the posterior distribution is also the product of two independent Dirichlet distributions (since they are conjugate with the multinomial distribution). Their parameters are (x AA + a AA ,x AB + a AB ,x BB + a BB ) and (y AA + b AA ,y AB + b AB ,y BB + b BB ) respectively. Simulation of the Dirichlet distribution can be efficiently done by sampling from Gamma distributions; see  for details. Note that in the case where all (hyper)parameters (a i and b i ) are equal to 1, θ is uniformly distributed a priori.
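The two FBST steps (optimization and integration) can be sketched for the genotypic homogeneity hypothesis as follows. This is a minimal sketch under stated assumptions, not the authors' R routine: the posterior density is taken with respect to the usual flat reference on the parameter space, Dirichlet draws are obtained by normalising Gamma draws, and the optimisation step uses the closed form available when both groups share one frequency vector. All function names and the counts are hypothetical.

```python
import math
import random

def dirichlet_sample(alpha, rng):
    """Draw from a Dirichlet distribution by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def log_kernel(theta, alpha):
    """Unnormalised Dirichlet log-density: sum of (alpha_i - 1) * log(theta_i)."""
    return sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta) if a != 1.0)

def fbst_genotypic_e_value(x, y, a=(1, 1, 1), b=(1, 1, 1), n_draws=20000, seed=0):
    """Monte Carlo e-value for genotypic homogeneity (both groups share theta)."""
    rng = random.Random(seed)
    alpha = [xi + ai for xi, ai in zip(x, a)]   # posterior parameters, first group
    beta = [yi + bi for yi, bi in zip(y, b)]    # posterior parameters, second group
    # Optimisation step: under the hypothesis the kernel is
    # prod theta_i ** (alpha_i + beta_i - 2), maximised at theta_i
    # proportional to its exponent (assumed strictly positive here).
    exponents = [ai + bi - 2.0 for ai, bi in zip(alpha, beta)]
    if min(exponents) <= 0:
        raise ValueError("this sketch needs alpha_i + beta_i > 2 for every genotype")
    total = sum(exponents)
    theta_star = [e / total for e in exponents]
    sup_h = log_kernel(theta_star, alpha) + log_kernel(theta_star, beta)
    # Integration step: posterior mass of points whose density does not exceed
    # the supremum under the hypothesis (the complement of the tangential set).
    not_above = 0
    for _ in range(n_draws):
        gamma_draw = dirichlet_sample(alpha, rng)
        pi_draw = dirichlet_sample(beta, rng)
        if log_kernel(gamma_draw, alpha) + log_kernel(pi_draw, beta) <= sup_h:
            not_above += 1
    return not_above / n_draws

# Hypothetical counts (AA, AB, BB) with uniform priors, for illustration only.
e_value = fbst_genotypic_e_value((30, 50, 20), (20, 50, 30))
print(f"e-value for genotypic homogeneity: {e_value:.3f}")
```

The normalising constants of the two Dirichlet densities cancel in the comparison, which is why only the kernels are evaluated. For the allelic hypothesis the supremum has no such simple closed form and would need a numerical optimiser.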
We see that while both groups from Figure 1 seem to be under HWE (in this case, tangential sets have small probabilities, and therefore e-values are large), the ones from Figure 2 seem to be far from the equilibrium (in this case, tangential sets have large probabilities, and therefore e-values are small).
When testing genotypic and allelic homogeneity using FBST and uniform priors (a i = b i = 1 for all i’s in the Dirichlet distribution), we obtain e-values of 0.434 and 0.493 respectively. Hence, contrary to what happens with p-values, there is more evidence in favor of the allelic homogeneity hypothesis than there is in favor of the genotypic homogeneity hypothesis. Therefore the contradiction of not rejecting the first hypothesis while rejecting the second one cannot happen for any cutoff that is chosen. In fact, as we will show in the next section, this is a property of the FBST procedure: the undesirable contradiction can never happen.
Analysis of real data
Analysis of simulated data
Hence, it is reasonable to expect that p-values, as well as any other measure of evidence, should be at least as large for allelic homogeneity as for genotypic homogeneity. To sum up, there should be at least as much evidence in favor of allelic homogeneity as in favor of genotypic homogeneity. In fact, this is what motivates the tests proposed by . More generally, if we have two nested hypotheses, A ⊆ B ⊆ Θ, it would be desirable to have p(B) ≥ p(A). That is, one should always believe that B is at least as plausible as A. It is worth noting that this inequality must hold if one wants to guarantee that for any significance level α the rejection of B will imply the rejection of A. In other words, p(B) should always be at least as large as p(A) so that one will never conclude that A holds but B does not, which is, as we showed, logically impossible.
Even though this logical coherence is desirable, the analysis of data presented by  (Table 5) shows that this property is achieved neither when using the traditional p-value for allelic frequencies nor when using the alternative test presented here. Hence, depending on the level of significance used (for example, 10%), one can conclude that genotypic homogeneity holds, but allelic homogeneity does not. This leads to a logical contradiction that may be embarrassing for researchers when presenting their results to the scientific community. Some authors (e.g. [36–39]) have already noticed that p-values cannot be used as a measure of evidence because they do not respect this property. Attempts to correct frequentist tests so that they are coherent have been made in some specific situations such as Analysis of Variance , but no general procedure could be obtained.
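The nesting of the two hypotheses can be written out explicitly. In the sketch below, H_G and H_A are labels introduced here for genotypic and allelic homogeneity, and the frequency of allele A in each group is obtained from the genotypic frequencies by counting each heterozygote as half.

```latex
% Frequency of allele A in each group, from the genotypic frequencies
p_A^{(x)} = \gamma_{AA} + \tfrac{1}{2}\,\gamma_{AB}, \qquad
p_A^{(y)} = \Pi_{AA} + \tfrac{1}{2}\,\Pi_{AB}

% Equal genotypic frequencies force equal allelic frequencies
H_G:\ (\gamma_{AA},\gamma_{AB},\gamma_{BB}) = (\Pi_{AA},\Pi_{AB},\Pi_{BB})
\quad\Longrightarrow\quad
H_A:\ \gamma_{AA} + \tfrac{1}{2}\,\gamma_{AB} = \Pi_{AA} + \tfrac{1}{2}\,\Pi_{AB}

% Hence H_G \subseteq H_A \subseteq \Theta, and a coherent measure of evidence
% should never favor H_G more than H_A.
```

The converse does not hold: two groups can share allelic frequencies while differing in how the alleles are paired into genotypes.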
For the problem considered here, this means that the e-value for allelic homogeneity will be at least as large as the e-value for genotypic homogeneity for all datasets. Hence, one will always have at least as much evidence in favor of allelic homogeneity as in favor of genotypic homogeneity, and therefore when performing the FBST procedure (that is, comparing the e-values with a given cutoff) one will never fall into the logical contradiction of rejecting allelic homogeneity while not rejecting genotypic homogeneity. Equation 4 proves that the incoherence can never happen when using the FBST. Table 5 shows that this inequality indeed holds for the data presented. It is also interesting to note that in the case of nested hypotheses, FBST provides an intrinsic penalty that can be used for model selection .
In Table 6, one can find similar results on simulated data. Data were simulated under three different conditions: 1 - under genotypic homogeneity (and, therefore, allelic homogeneity), 2 - under only allelic homogeneity and 3 - under neither allelic nor genotypic homogeneity. Bold p-values indicate situations in which there is incoherence in the sense described here. Note that, as expected from the proof given above, none of the samples shows incoherence when e-values are used.
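The reasoning behind this monotonicity can be sketched in a few lines. This is a standard argument for nested hypotheses, written in notation chosen for this sketch (f is the posterior density, T_H the tangential set of hypothesis H, and ev the e-value); it is not a transcription of the paper's own equation.

```latex
A \subseteq B
\;\Longrightarrow\;
\sup_{\theta \in A} f(\theta \mid x) \;\le\; \sup_{\theta \in B} f(\theta \mid x)

% Tangential set of a hypothesis H: points with density above the supremum over H
T_H = \Bigl\{\theta \in \Theta : f(\theta \mid x) > \sup_{\theta' \in H} f(\theta' \mid x)\Bigr\}
\;\Longrightarrow\; T_B \subseteq T_A

% The e-value is the posterior mass outside the tangential set
ev(B) = 1 - P(T_B \mid x) \;\ge\; 1 - P(T_A \mid x) = ev(A)
```

Taking A to be genotypic homogeneity and B to be allelic homogeneity gives the inequality stated in the text for every dataset and every prior.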
An important question is why we use FBST methodology rather than standard Bayes factors, the traditional Bayesian procedure to test sharp hypotheses . The reason is that, contrary to e-values, Bayes factors are also not monotonic when dealing with sharp hypotheses as we will show here. In order to calculate Bayes factors, one must first assign a probability distribution for the parameters under each of the hypotheses of interest. In the problem we deal with, this means it is necessary to assign probabilities for θ under Θ, and . The Bayes factor for hypothesis H is then defined to be . For the real dataset presented in  (Table 1), when using uniform probabilities for θ in Θ, and we have a Bayes factor of 6.63 in favor of , while of 0.28 in favor of , so that lack of monotonicity remains. The main reason for this is that it is not necessarily true that . See  for a different example where this happens. An informal explanation of the lack of monotonicity is given by : “What the Bayes factor actually measures is the change in odds in favor of the hypothesis when going from the prior to the posterior”. Note that even though they are not monotonic, Bayes factors provide a great tool for model selection , a point which we further discuss in the conclusions. One may also argue about the merits of using FBST as a genuine Bayesian procedure rather than traditional Bayes factors. We advocate that while Bayes factors are primarily motivated by the epistemological framework of Decision Theory and p-values are supported by Popperian falsificationism, e-values and FBST are supported by the framework of Cognitive Constructivism. The reader is referred to [43–45] for more epistemological considerations and comparisons of these methods. It is also interesting that FBST can be justified as a minimization procedure of a loss function, as shown by . This makes the e-value also compatible with standard Decision Theory and therefore traditional Bayesian statistics.
We emphasize that whenever hypotheses are not sharp, posterior probabilities are usually more adequate.
Although the traditional approach of doubling the sample size to test the allelic homogeneity hypothesis was already shown to be incorrect when Hardy-Weinberg equilibrium is not met, many recent articles in biology still use it. As Figure 3 illustrates by using power functions, the nominal level of significance for the usual allelic test is not attained: at zero on the x-axis, the power is larger than 5%, contrary to the alternative tests. We have shown in this paper that a logical inconsistency that happens when using this procedure remains even when using adjusted frequentist tests. The main point of this inconsistency is the fact that if two vectors are equal, any function of them must maintain the equality. The fact that incoherence remains even when using an exact approach hints that the problem is the change of dimension when going from global homogeneity to partial homogeneity: genotypic homogeneity is in dimension 2 (two degrees of freedom) and allelic homogeneity is in dimension 1 (one degree of freedom). As Wald Tests, Likelihood Ratio Tests, and Chi-Square tests are asymptotically equivalent, it is also expected that contradictions may happen with all of them.
Similar incoherences of p-values in other situations have already been reported in the literature. As a simple ANOVA-like example, suppose we wish to compare the means of independent random variables from 3 different groups, μ 1,μ 2 and μ 3. If we assume their distribution is normal with variance 1 and the sample means in each group (sufficient statistics) are −0.192,0.015 and 0.017, the likelihood ratio p-value for the hypothesis μ 1 = μ 2 is 0.037. On the other hand, when testing μ 1 = μ 2 = μ 3 we get a p-value of 0.054. Hence, at the level of 5%, the first hypothesis is rejected, but the second one is not. This makes it debatable whether it is reasonable to use p-values as measures of evidence . On the other hand, if we use the improper prior f(μ 1,μ 2,μ 3)∝1, the e-values are 0.232 and 0.121, respectively. Hence the contradiction cannot happen for any cutoff.
As probabilities are monotonic, traditional Bayesian tests based on posterior probability calculations do enjoy the monotonicity property. However, using them here may be problematic because the hypotheses of interest are sharp: mixed continuous-discrete distributions are needed in this case. Bayes factors, on the other hand, were shown not to be monotonic. This does not invalidate their use: in fact, as pointed out by  and , Bayes factors provide a great tool for model selection. One of the reasons for this is that parsimonious models can have better predictive power than complex models .
The FBST computation is always performed in the full space, which has dimension 4. Hence subhypotheses should coherently follow the orientation of the main hypothesis. Moreover, there is no need to specify special priors for each of the null hypotheses, only for the whole parametric space Θ. It can also be easily implemented. The problem with the FBST is that the values of the significance index, “e”, are related to the dimension and increase as the dimension increases. However, in  it is shown how “e” relates to “p”. This allows one to look for the e-value corresponding to a 5% significance level, for instance. Another point in favor of the FBST is that its power is almost the same as that of the best frequentist test. Moreover, it is correct even when HWE does not hold. It is important to remember that e-values are probabilities of subsets of the parameter space, whereas p-values are probabilities of sets (tails) of the sample space. On the other hand, one must understand that hypotheses are statements about points of the parameter space and not of the sample space: may this explain why e-values, contrary to p-values, are coherent in all situations?
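The ANOVA-like example can be sketched numerically. The text above does not state the number of observations per group, so the value n = 200 used below is an assumption of this sketch and the resulting numbers need not match the reported ones exactly. The likelihood ratio statistics follow from the normal model with known variance 1, and the e-values use the fact that, under the flat prior, the posterior of the three means is normal around the sample means, so the e-value is a chi-square tail with 3 degrees of freedom evaluated at the same statistic. The function names are hypothetical.

```python
import math

def chi2_sf(q, df):
    """Survival function of a chi-square variable, for df in {1, 2, 3} only."""
    if df == 1:
        return math.erfc(math.sqrt(q / 2.0))
    if df == 2:
        return math.exp(-q / 2.0)
    if df == 3:
        return (math.erfc(math.sqrt(q / 2.0))
                + math.sqrt(2.0 * q / math.pi) * math.exp(-q / 2.0))
    raise ValueError("only df = 1, 2, 3 are implemented in this sketch")

def nested_mean_tests(means, n):
    """p-values and e-values for mu1 = mu2 and for mu1 = mu2 = mu3.

    means: the three sample means; n: observations per group (variance 1).
    """
    m1, m2, m3 = means
    # Likelihood ratio statistic for mu1 = mu2 (1 restriction).
    q_12 = n * (m1 - m2) ** 2 / 2.0
    # Likelihood ratio statistic for mu1 = mu2 = mu3 (2 restrictions).
    grand = (m1 + m2 + m3) / 3.0
    q_123 = n * sum((m - grand) ** 2 for m in means)
    return {
        "p_12": chi2_sf(q_12, 1),
        "p_123": chi2_sf(q_123, 2),
        # e-values: posterior mass outside the tangential set, a chi-square(3) tail.
        "e_12": chi2_sf(q_12, 3),
        "e_123": chi2_sf(q_123, 3),
    }

# n = 200 per group is a hypothetical choice; the sample means are those quoted above.
results = nested_mean_tests((-0.192, 0.015, 0.017), n=200)
for name, value in results.items():
    print(f"{name} = {value:.3f}")
```

Because both e-values are tails of the same 3-dimensional posterior, the larger hypothesis always receives the larger e-value, whereas the two p-values come from reference distributions with different degrees of freedom and can cross.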
A routine written in R that performs all the tests considered in this paper can be downloaded from http://www.ime.usp.br/~cpereira/programs/nested.r
The authors are grateful to Luís Gustavo Esteves, Julio Stern, Marcelo Lauretto, Rafael Bassi Stern and Sergio Wechsler for having discussed all the methods of FBST used in this paper. We also thank them for their patience and painstaking reading. We thank the anonymous referees for their comments, which much improved the quality of the paper. This work was supported by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior; Conselho Nacional de Desenvolvimento Científico e Tecnológico; and Fundação de Amparo à Pesquisa do Estado de São Paulo.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.