Robust tests for matched case-control genetic association studies

  Yong Zang and Wing Kam Fung

      BMC Genetics 2010, 11:91

      DOI: 10.1186/1471-2156-11-91

      Received: 30 March 2010

      Accepted: 12 October 2010

      Published: 12 October 2010

      Abstract

      Background

      The Cochran-Armitage trend test (CATT) is powerful in detecting association between a susceptible marker and a disease. This test, however, may suffer from a substantial loss of power when the underlying genetic model is unknown and incorrectly specified. Thus, it is useful to derive tests obtaining the plausible power against all common genetic models. For this purpose, the genetic model selection (GMS) and genetic model exclusion (GME) methods were proposed recently. Simulation results showed that GMS and GME can obtain the plausible power against three common genetic models while the overall type I error is well controlled.

      Results

      Although GMS and GME are powerful statistically, they could be seriously affected by known confounding factors such as gender, age and race. Therefore, in this paper, via comparing the difference of Hardy-Weinberg disequilibrium coefficients between the cases and the controls within each sub-population, we propose the stratified genetic model selection (SGMS) and exclusion (SGME) methods which could eliminate the effect of confounding factors by adopting a matching framework. Our goal in this paper is to investigate the robustness of the proposed statistics and compare them with other commonly used efficiency robust tests such as MAX3 and χ 2 with 2 degrees of freedom (df) test in matched case-control association designs through simulation studies.

      Conclusion

      Simulation results showed that if the mean genetic effect of the heterozygous genotype is between those of the two homozygous genotypes, then the proposed tests and MAX3 are preferred. Otherwise, χ 2 with 2 df test may be used. To illustrate the robust procedures, the proposed tests are applied to a real matched pair case-control etiologic study of sarcoidosis.

      Background

      The population-based case-control association study is a powerful approach for detecting association between a candidate marker and a disease. Compared with the family-based association study, which recruits samples from family members, the case-control study is more cost-effective because cases and controls are unrelated and hence easy to recruit from the population. To test for genetic association using the case-control design, the genotypic data for a bi-allelic marker are usually described by a 2 × 3 table in which rows represent the disease status and columns represent the genotypic counts. Hence, testing for genetic association is equivalent to testing for association between the rows and the columns. Generally, Pearson's χ 2 test with 2 df can be used to detect such an association. In addition, if a linear trend in risk across the three genotype columns can be assumed, a more powerful test based on the score test for a logistic regression can be obtained. This score test is known as the Cochran-Armitage trend test (CATT) [1-3].

      To apply the CATT, increasing scores are specified a priori for the underlying genetic model. A genetic model refers to the model of inheritance, which defines a relationship among the risks of having the disease given the different genotypes. The common genetic models include, but are not limited to, the recessive (REC), additive (ADD) and dominant (DOM) models. If the underlying genetic model is known, the asymptotically optimal CATT can be used. Otherwise, the CATT is not robust when the scores are misspecified [4]. Unfortunately, the underlying genetic model is usually unknown in practice, and an incorrect choice of the genetic model may result in a substantial loss of power for the CATT. Thus, a robust method which does not assume prior knowledge of the underlying genetic model is often useful.
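As an illustration of how the score x enters the trend test, the following is a minimal sketch of the standard CATT for an unmatched 2 × 3 table, written with the usual textbook variance estimate under the null. The function name `catt` and the example counts are hypothetical and are not taken from the paper.

```python
from math import sqrt


def catt(cases, controls, x):
    """Cochran-Armitage trend test for a 2 x 3 genotype table.

    cases, controls: counts for the genotypes (dd, Dd, DD).
    x: score given to the heterozygote (0 = REC, 0.5 = ADD, 1 = DOM).
    Returns the standardized statistic, approximately N(0, 1) under the null.
    """
    scores = (0.0, x, 1.0)
    r, s = sum(cases), sum(controls)
    n = r + s
    totals = [ri + si for ri, si in zip(cases, controls)]
    # Score-weighted contrast between case and control genotype counts.
    u = sum(w * (s * ri - r * si) for w, ri, si in zip(scores, cases, controls))
    # Variance term computed from the pooled genotype counts (null hypothesis).
    v = (n * sum(w * w * ni for w, ni in zip(scores, totals))
         - sum(w * ni for w, ni in zip(scores, totals)) ** 2)
    return sqrt(n) * u / sqrt(r * s * v)


# Hypothetical counts: cases carry more copies of D than controls.
z_rec = catt((40, 90, 70), (70, 90, 40), 0.0)
z_add = catt((40, 90, 70), (70, 90, 40), 0.5)
z_dom = catt((40, 90, 70), (70, 90, 40), 1.0)
```

The three calls differ only in the heterozygote score, which is exactly the quantity that has to be guessed when the genetic model is unknown.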

      Methods that are robust across a variety of underlying models of inheritance have recently become an important area of research. Pearson's χ 2 test with 2 df does not assume any structure for the genetic model, so it is robust against the genetic model. Moreover, the maximin efficiency robust test (MERT) and the MAX method, which uses the maximum of the CATTs optimal for REC, ADD and DOM respectively, have been extensively studied [5-10]. Recently, Zheng and Ng [11] proposed the genetic model selection (GMS) method to test for genetic association. Unlike other robust tests, the GMS approach is a two-phase analysis: it uses the Hardy-Weinberg disequilibrium trend test (HWDTT) [12] to choose the most suitable genetic model in the first phase, followed by the CATT optimal for the selected genetic model to detect the association in the second phase. Since the same data are used twice in the analysis, the nominal type I error for the second test needs to be adjusted so that the GMS attains the correct size. The GMS assumes that the marker allele associated with the disease allele is known. Such an assumption can be difficult to justify, for instance, in many complex diseases. To remove this restriction, Joo et al. [13] proposed to use the CATT optimal for the ADD model to detect the risk allele. After the risk allele is determined, the GMS corresponding to the detected risk allele is carried out. As a result, their GMS does not require the risk allele to be known. Besides the modified GMS, Joo et al. [13] developed another two-phase test, called the genetic model exclusion (GME), which excludes the most unlikely genetic model rather than selecting the most likely one. They also showed that when the genetic relative risks (GRRs) are small, the GME is more efficiency robust than the GMS.
Besides the frequentist analysis, a Bayesian hierarchical model which regards the genetic model parameter as a fixed effect has been proposed by Minelli et al. [14]. If expert opinion or external evidence is available, an informative prior distribution of the genetic model parameter could be adopted; otherwise, a vague prior distribution should be used to avoid the undue influence on the posterior distribution.

      Although the population based case-control study is powerful and feasible to implement, spurious association may arise due to known confounding factors such as gender, age and race. Intuitively, the GMS and GME do not work in the presence of confounding factors. One of the reasons is that when the samples are divided into several sub-populations via the confounding factors, the Hardy-Weinberg equilibrium (HWE) assumption needed in the first phase of the GMS and GME does not hold any more. Besides, the CATTs used in the second phase of the GMS and GME do not control the size well due to the confounding factors.

      Typically, when the confounding factors can be observed, they can be treated as covariates and incorporated in the logistic regression. However, the further calculation needed to adjust for the covariates may complicate the trend test. Alternatively, matching is frequently used as a much simpler way to control potential confounding factors in epidemiological studies. Specifically, each case is matched with a certain number of controls based on the confounding factors, forming a matched set. A conditional logistic regression analysis is then normally used to fit the matched data. Recently, an increasing number of studies have either adopted the matched design [15-18] or developed statistical procedures [19-24] for matched genetic association studies.

      Similar to the unmatched case-control association study, when the underlying genetic model is unknown, the robustness of the statistics for the matched case-control design is also worth studying. Zheng and Tian [21] proposed the MAX3 test based on the matching trend test (MTT) derived from a conditional logistic regression. However, to our knowledge, this is the only paper discussing the robust tests in matched case-control design, compared to the large amount of literature in the unmatched design. Thus, in this paper, we start by developing the stratified genetic model selection (SGMS) and exclusion (SGME) methods for matched case-control association, then we study the robustness of the test statistics. The performance of the robust tests and MTTs is compared by simulation for a wide range of scenarios. Finally, the tests are applied to a real matched pair case-control etiologic study of sarcoidosis.

      Methods

      Genetic model selection and exclusion

      When the genetic model is unknown, the T HWDTT test proposed by Song and Elston [12] can be used to detect the latent genetic model. Zheng and Ng [11] demonstrated that under Hardy-Weinberg equilibrium (HWE) and when the allele investigated is the risk allele (denoted as D), T HWDTT > 0 under the REC model and T HWDTT < 0 under the DOM model. Denoting T 0, T 0.5 and T 1 as the CATTs optimal for REC, ADD and DOM respectively, Zheng and Ng [11] proposed to use T 0 if T HWDTT > c, T 1 if T HWDTT < -c, and T 0.5 otherwise to test for genetic association, where c is a pre-specified threshold.

      Note that for the original GMS mentioned above, the risk allele is assumed to be known. However, if the risk allele cannot be correctly specified, such GMS may have some problems. Specifically, consider a bi-allelic marker with alleles D and d and assume D is the risk allele. So if D is really the risk allele, T 0, T 0.5 and T 1 are optimal for the REC, ADD and DOM models respectively. On the other hand, if d is the true risk allele, then -T 1, -T 0.5 and -T 0 are optimal for the REC, ADD and DOM models respectively. Joo et al. [13] proposed to use T 0.5 to decide which one is the risk allele followed by the corresponding GMS which depends on the determined risk allele. Joo et al. [13] suggested the modified GMS which can be written as
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_Equ1_HTML.gif
      (1)

      where I(.) is an indicator function.

      When the GRRs are small, Joo et al. [13] found that the probability of selecting the true genetic model by using T HWDTT becomes small. On the other hand, the probability of correctly excluding the most unlikely genetic model remains high against the GRRs. Furthermore, when the most unlikely genetic model is excluded, the simple MERT [5] can be carried out to build a robust test against the remaining models. These facts inspired Joo et al. [13] to develop the genetic model exclusion (GME) approach. Specifically, first denote http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq1_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq2_HTML.gif where http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq3_HTML.gif is an estimate of the correlation between http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq4_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq5_HTML.gif under the null hypothesis of no association, then one can obtain the GME statistic from the GMS test by replacing T 0, T 0.5 and T 1 in (1) by http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq6_HTML.gif , http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq7_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq8_HTML.gif respectively. Since the GMS and GME are two stage tests and the same data set is used twice, the critical values of the tests in the second stage need to be adjusted to control the overall type I error rates; see Zheng and Ng [11] and Joo et al. [13].

      Although GMS and GME are efficiency robust tests, they could be seriously affected by confounding factors. In the presence of sub-populations, GMS and GME may not keep the correct size. Therefore, to overcome this limitation, we propose the stratified genetic model selection (SGMS) and exclusion (SGME) approaches in the following.

      Notation

      Consider a bi-allelic marker with alleles d and D and assume D is the risk allele. Denote the three genotypes of this marker as G 0 = dd, G 1 = Dd and G 2 = DD. Suppose that the confounding factors define L strata, denoted by C l , l = 1,..., L. In the lth stratum, r l cases are drawn from the population and m controls are matched to each case. Thus, the total number of controls in the lth stratum is s l = mr l for l = 1,..., L. The genotype counts for (G 0, G 1, G 2) in cases and controls in the lth stratum are denoted by (r 0l , r 1l , r 2l ) and (s 0l , s 1l , s 2l ), respectively. Hence, r l = r 0l + r 1l + r 2l and s l = s 0l + s 1l + s 2l . The total number of cases is r = ∑ l r l and the total number of controls is s = mr. The total sample size is then n = (m + 1)r.

      In the lth stratum (l = 1,..., L), denote the penetrance by f il = Pr(case|G i , C l ) for i = 0, 1, 2, the disease prevalence by k l = Pr(case|C l ) = ∑ i f il Pr(G i |C l ), and the genotype frequencies in cases and controls by p il = Pr(G i |case, C l ) = f il Pr(G i |C l )/k l and q il = Pr(G i |control, C l ) = (1 - f il )Pr(G i |C l )/(1 - k l ), respectively. Define the GRRs in the lth stratum as λ 1l = f 1l /f 0l and λ 2l = f 2l /f 0l (f 0l > 0). A genetic model is REC, ADD or DOM if λ 1l = 1, λ 1l = (λ 2l + 1)/2 or λ 1l = λ 2l , respectively. We assume that HWE holds in each stratum. Thus, Pr(G 0|C l ) = q l ^2, Pr(G 1|C l ) = 2p l q l and Pr(G 2|C l ) = p l ^2, where p l is the frequency of allele D in the lth stratum and q l = 1 - p l .
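The definitions above can be checked numerically. The sketch below computes the prevalence and the case and control genotype frequencies for a single stratum from the allele frequency, the baseline penetrance and the two GRRs, assuming HWE in the stratum. The function name `genotype_freqs` and the parameter values are illustrative assumptions.

```python
def genotype_freqs(p, f0, lam1, lam2):
    """Prevalence and genotype frequencies in cases and controls for one stratum.

    p: frequency of the risk allele D; f0: penetrance of genotype dd;
    lam1, lam2: genotype relative risks of Dd and DD relative to dd.
    Genotypes are ordered (dd, Dd, DD) and HWE is assumed in the stratum.
    """
    q = 1.0 - p
    g = (q * q, 2.0 * p * q, p * p)            # Pr(G_i | C_l) under HWE
    f = (f0, lam1 * f0, lam2 * f0)             # penetrances f_il
    k = sum(fi * gi for fi, gi in zip(f, g))   # prevalence k_l
    cases = tuple(fi * gi / k for fi, gi in zip(f, g))
    controls = tuple((1.0 - fi) * gi / (1.0 - k) for fi, gi in zip(f, g))
    return k, cases, controls


# Recessive model (lam1 = 1) with hypothetical parameter values.
k, p_cases, q_controls = genotype_freqs(p=0.3, f0=0.05, lam1=1.0, lam2=2.0)
```

Under the null (both GRRs equal to 1) the case and control frequencies returned by this function coincide with the HWE proportions.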

      Stratified genetic model selection and exclusion

      Let X 1lj and X 2ljk denote the genotypic scores for the jth case and the kth control matched with the jth case in the lth stratum, j = 1,..., r l , k = 1,..., m and l = 1,..., L. Each score takes one of the three possible values: 0, x or 1 for the genotypes G 0, G 1 or G 2 respectively, where x is 0, 0.5 or 1 for the REC, ADD or DOM model. Following Day and Byar [25] and Zheng and Tian [21], the likelihood function conditional on the outcomes of cases and matched controls for the candidate marker can be written as
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_Equ2_HTML.gif
      (2)
      The null hypothesis of no association H 0 : β = 0 can be tested by the score statistic given by http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq13_HTML.gif . Using the matched case-control data, the closed form of the matching trend test (MTT) can be written as [21]
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_Equ3_HTML.gif
      (3)

      Obviously, http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq14_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq15_HTML.gif . Under the null hypothesis of no association, Z MTT(x) follows N(0,1).

      Suppose a family of scientifically plausible models is defined. Similar to the CATTs, an asymptotically optimal, normally distributed MTT can be obtained for each model. For example, Z MTT(0), Z MTT(0.5) and Z MTT(1) are optimal for the REC, ADD and DOM models respectively. When the genetic model is uncertain, a pre-specified test from this family is not fully efficient; hence, using an MTT directly is not recommended when the underlying genetic model is unknown. The underlying genetic model, however, can be ascertained using the Hardy-Weinberg disequilibrium (HWD) coefficient, which is denoted as Δ = Pr(DD) - [Pr(DD) + Pr(Dd)/2]^2. In the unmatched study, denote the HWD coefficients in the case group and the control group as Δ p = Pr(DD|case) - [Pr(DD|case) + Pr(Dd|case)/2]^2 and Δ q = Pr(DD|control) - [Pr(DD|control) + Pr(Dd|control)/2]^2. Zheng and Ng [11] showed that Δ p - Δ q > 0 under REC and Δ p - Δ q < 0 under DOM. Using the matched design described above, we denote Δ pl and Δ ql as the HWD coefficients in the case group and the control group of the lth sub-population respectively, l = 1,..., L. Similar to the unmatched counterpart, we have Δ pl - Δ ql > 0 for each l, l = 1,..., L, and thus http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq16_HTML.gif under REC, and Δ pl - Δ ql < 0 for each l, l = 1,..., L, and thus http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq17_HTML.gif under DOM.
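The sign pattern of the HWD coefficients can be verified numerically for a single stratum. The sketch below evaluates Δ in cases and in controls from population-level genotype frequencies implied by HWE and the penetrances; the helper names and the parameter values (p = 0.3, f 0 = 0.05, λ 2 = 2) are illustrative assumptions rather than settings from the paper.

```python
def hwd(freqs):
    """HWD coefficient: Pr(DD) - [Pr(DD) + Pr(Dd)/2]^2, for freqs = (dd, Dd, DD)."""
    _, het, hom = freqs
    return hom - (hom + het / 2.0) ** 2


def case_control_freqs(p, f0, lam1, lam2):
    """Genotype frequencies (dd, Dd, DD) in cases and controls under HWE."""
    q = 1.0 - p
    g = (q * q, 2.0 * p * q, p * p)
    f = (f0, lam1 * f0, lam2 * f0)
    k = sum(fi * gi for fi, gi in zip(f, g))
    cases = tuple(fi * gi / k for fi, gi in zip(f, g))
    controls = tuple((1.0 - fi) * gi / (1.0 - k) for fi, gi in zip(f, g))
    return cases, controls


def hwd_difference(p, f0, lam1, lam2):
    """Delta_p - Delta_q for one stratum."""
    cases, controls = case_control_freqs(p, f0, lam1, lam2)
    return hwd(cases) - hwd(controls)


diff_rec = hwd_difference(0.3, 0.05, 1.0, 2.0)   # REC: lam1 = 1
diff_dom = hwd_difference(0.3, 0.05, 2.0, 2.0)   # DOM: lam1 = lam2
diff_null = hwd_difference(0.3, 0.05, 1.0, 1.0)  # no association
```

With these values the difference is positive under REC, negative under DOM and zero under the null, which is the behaviour the model selection step relies on.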

      Denote http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq18_HTML.gif where http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq19_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq20_HTML.gif for i = 0, 1, 2 and l = 1,..., L. Under the null hypothesis of no association and assuming HWE in each stratum, simple algebra gives http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq21_HTML.gif where http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq22_HTML.gif . Thus, using the same motivation as the Cochran-Mantel-Haenszel (CMH) statistic [1, 25, 26], we can construct the stratified model reduction test (SMRT):
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_Equ4_HTML.gif
      (4)
      Notice that the denominator of Z SMRT is estimated under the null hypothesis, so Z SMRT is a score test [27]. We may also use the Wald test or the likelihood ratio test. However, the Wald statistic is much more complex, and the likelihood ratio statistic cannot be expressed explicitly, which makes it hard to derive the correlations between the two stage tests and to calculate the p-value of the overall test. For these reasons, the score test is adopted. Under the null hypothesis, Z SMRT asymptotically follows a standard normal distribution N(0, 1). Z SMRT tends to be large and positive if the true genetic model is REC and large and negative if the true genetic model is DOM. Hence, with a pre-specified threshold c > 0 (set to be Φ^-1(0.95)), we classify the underlying genetic model as REC if Z SMRT > c, DOM if Z SMRT < -c and ADD otherwise. Once the underlying genetic model is decided, the Z MTT(x) optimal for that model can be used to test for association. Notice that in the above discussion, we assume that D is the risk allele. If D is indeed the risk allele, Z MTT(0) and Z MTT(1) are optimal for the REC and DOM models respectively. On the other hand, if d is the risk allele, then Z MTT(0) and Z MTT(1) are optimal for the DOM and REC models respectively, and their expected values are negative. Similar to Joo et al. [13], we use Z MTT(0.5) to determine the risk allele. That is, if Z MTT(0.5) > 0, then Z MTT(0), Z MTT(0.5) and Z MTT(1) are optimal for the REC, ADD and DOM models; if Z MTT(0.5) ≤ 0, then -Z MTT(1), -Z MTT(0.5) and -Z MTT(0) are optimal for the REC, ADD and DOM models. Hence, the stratified genetic model selection (SGMS) test is proposed as
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_Equ5_HTML.gif
      (5)
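The selection rule described in the preceding paragraph can be sketched as follows. This is an illustration assembled from the prose (classification by Z SMRT, risk allele decided by the sign of Z MTT(0.5)), not a transcription of equation (5); the function names and the input values are hypothetical, and the test statistics themselves are assumed to have been computed elsewhere.

```python
from statistics import NormalDist

# Threshold c = Phi^{-1}(0.95), as used in the text (about 1.645).
C = NormalDist().inv_cdf(0.95)


def classify_model(z_smrt, c=C):
    """Classify the genetic model from the stratified model reduction test."""
    if z_smrt > c:
        return "REC"
    if z_smrt < -c:
        return "DOM"
    return "ADD"


def sgms(z_mtt, z_smrt, c=C):
    """Select the matching trend test for the classified model.

    z_mtt: dict mapping the score x in {0, 0.5, 1} to Z_MTT(x).
    """
    model = classify_model(z_smrt, c)
    if z_mtt[0.5] > 0:
        # D is treated as the risk allele.
        choice = {"REC": z_mtt[0], "ADD": z_mtt[0.5], "DOM": z_mtt[1]}
    else:
        # d is treated as the risk allele: roles of x = 0 and x = 1 swap, signs flip.
        choice = {"REC": -z_mtt[1], "ADD": -z_mtt[0.5], "DOM": -z_mtt[0]}
    return choice[model]


# Hypothetical values of the three MTTs and of Z_SMRT.
z = sgms({0: 2.5, 0.5: 2.0, 1: 1.2}, z_smrt=2.0)
```

Because the same data drive both the selection and the final test, the returned value cannot be compared with the usual N(0, 1) critical value; the adjusted p-value in the text is needed.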

      Under the null hypothesis of no association, we show that (Z MTT(0.5), Z SMRT, Z MTT(x)) asymptotically follows a multivariate normal distribution N(0, ∑ x ) where

      http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq23_HTML.gif

      x = 0, 1. In addition, Z MTT(0.5) and Z SMRT are asymptotically independent. A detailed proof and the forms of ρ x and ρ x ,0.5, as well as their consistent estimates, are given in the Appendix. Define φ x (z 1, z 2, z 3) as the density function of N(0, ∑ x ) and φ(z) as the density function of the standard normal distribution. Let t > 0 be the observed value of Z SGMS; the corresponding p-value is obtained as
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_Equ6_HTML.gif
      (6)

      With a pre-specified significance level ζ, we declare a significant association if PV s <ζ.

      Although Z SMRT can be used to determine the underlying genetic model, the probability of selecting the correct genetic model is low when the GRRs are small or moderate. On the other hand, the probability of correctly excluding the most unlikely genetic model remains high even when the GRRs are very small. That is, when Z SMRT > c, the underlying genetic model is likely to be either REC or ADD rather than just REC, so excluding the DOM model is more reasonable than selecting the REC model. Similarly, when Z SMRT < -c, excluding REC is more reasonable than selecting the DOM model. Therefore, when the GRRs are low, excluding the most unlikely genetic model is preferable to selecting the most suitable one.

      Similar to Joo et al. [13], we define the MERT-type statistic named the matching averaged test (MAT) as http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq24_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq25_HTML.gif . The definition of the MATs indicates that Z MAT(0) is suited to either REC or ADD and Z MAT(1) is suited to either DOM or ADD, while Z MTT(0.5) remains optimal for ADD alone. Under the stratified genetic model exclusion (SGME) strategy, we use Z MAT(0) to test for association if Z SMRT > c, so that DOM is excluded; Z MAT(1) if Z SMRT < -c, so that REC is excluded; and Z MTT(0.5) otherwise. In addition, as in SGMS, Z MTT(0.5) is used at the beginning to determine the risk allele. Hence, the statistic for the SGME approach can be written as
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_Equ7_HTML.gif
      (7)
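The exclusion rule can be sketched in the same way as the selection rule. The MAT definitions appear above only as images, so the sketch assumes the usual MERT form for two correlated standard normal statistics, (Z a + Z b)/sqrt(2(1 + ρ)); that form, the handling of the case Z MTT(0.5) ≤ 0 by analogy with SGMS, and all function names and input values are assumptions and not a transcription of equation (7).

```python
from math import sqrt
from statistics import NormalDist

C = NormalDist().inv_cdf(0.95)  # threshold c = Phi^{-1}(0.95)


def mat(z_a, z_b, rho):
    """Assumed MERT-type average of two standard normal tests with correlation rho."""
    return (z_a + z_b) / sqrt(2.0 * (1.0 + rho))


def sgme(z0, z05, z1, z_smrt, rho_0_05, rho_05_1, c=C):
    """Exclusion-based statistic sketched from the prose description.

    z0, z05, z1: Z_MTT(0), Z_MTT(0.5), Z_MTT(1);
    rho_0_05, rho_05_1: null correlations of Z_MTT(0.5) with Z_MTT(0) and Z_MTT(1).
    """
    z_mat0 = mat(z0, z05, rho_0_05)   # suited to REC or ADD
    z_mat1 = mat(z05, z1, rho_05_1)   # suited to ADD or DOM
    if z05 > 0:                       # D treated as the risk allele
        if z_smrt > c:
            return z_mat0             # DOM excluded
        if z_smrt < -c:
            return z_mat1             # REC excluded
        return z05
    # d treated as the risk allele: roles swap and signs flip (assumed by analogy).
    if z_smrt > c:
        return -z_mat1
    if z_smrt < -c:
        return -z_mat0
    return -z05


# Hypothetical inputs.
z = sgme(2.4, 2.0, 1.0, z_smrt=2.0, rho_0_05=0.8, rho_05_1=0.8)
```

Averaging two neighbouring tests gives up a little power when the selected model is exactly right, in exchange for protection when Z SMRT points to the wrong one of the two remaining models.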
      Under the null hypothesis of no association, we obtain that (Z MTT(0.5), Z SMRT, Z MAT(x)) asymptotically follows a multivariate normal distribution http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq26_HTML.gif where
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_Equa_HTML.gif
      x = 0, 1. Define http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq27_HTML.gif as the density function of http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq26_HTML.gif . Similar to the test based on Z SGMS, the p-value of Z SGME can be derived as
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_Equ8_HTML.gif
      (8)

      We declare a significant association if PV e <ζ where ζ is the pre-specified significance level.

      Other robust procedures

      In equation (2), we use one indicator to code three genotypes. On the other hand, if we define two dummy variables ((X 1lj1, X 1lj2) for the cases and (X 2ljk1, X 2ljk2) for the controls) taking values (0,0), (0,1) and (1,1) to code the three genotypes G 0, G 1 and G 2, the conditional likelihood function becomes [10]
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_Equ9_HTML.gif
      (9)

      The score test derived from equation (9), denoted by http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq28_HTML.gif with 2 df, has an asymptotic χ 2 distribution with 2 df under H 0 : β 1 = β 2 = 0. Note that http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq28_HTML.gif with 2 df does not rely on any information of the underlying genetic model so it is a robust test against model of inheritance.

      Another robust test is MAX3, which was also proposed as an efficiency robust test for unmatched genetic association studies [7, 28]. Analogous to the unmatched counterpart, Zheng and Tian [21] proposed the MAX3 statistic for the matched case-control association study, which is defined as
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_Equb_HTML.gif

      Compared with the optimal MTTs and χ 2 test with 2 df, MAX3 has the largest minimum power across the three genetic models [21, 29]. As mentioned in Zheng and Tian [21] and Joo et al. [10], the null distribution of Z MAX3 can be approximated by Monte-Carlo simulation. In addition, the p-value of Z MAX3 can also be obtained according to the asymptotic formula given by Zang et al. [29].
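A Monte-Carlo approximation of the MAX3 null distribution can be sketched as below, taking MAX3 as the maximum of the absolute values of the three MTTs and drawing the three statistics from a trivariate normal distribution with a given null correlation matrix. The correlation values are hypothetical placeholders (in practice they would be estimated from the data), the matrix is assumed to be positive definite so that a Cholesky factor exists, and the function names are illustrative.

```python
import random
from math import sqrt


def cholesky3(a):
    """Lower-triangular Cholesky factor of a 3 x 3 positive-definite matrix."""
    low = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(i + 1):
            acc = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = sqrt(acc) if i == j else acc / low[j][j]
    return low


def max3(z0, z05, z1):
    """MAX3: largest absolute value among the three matching trend tests."""
    return max(abs(z0), abs(z05), abs(z1))


def max3_pvalue(observed, corr, n_sim=20000, seed=1):
    """Monte-Carlo p-value of MAX3 under the null for a given correlation matrix."""
    rng = random.Random(seed)
    low = cholesky3(corr)
    hits = 0
    for _ in range(n_sim):
        e = [rng.gauss(0.0, 1.0) for _ in range(3)]
        # Correlated standard normals: z = L e.
        z = [sum(low[i][k] * e[k] for k in range(i + 1)) for i in range(3)]
        if max3(*z) >= observed:
            hits += 1
    return hits / n_sim


# Hypothetical null correlations among Z_MTT(0), Z_MTT(0.5), Z_MTT(1).
corr = [[1.0, 0.8, 0.3],
        [0.8, 1.0, 0.8],
        [0.3, 0.8, 1.0]]
p_value = max3_pvalue(max3(2.5, 2.0, 1.2), corr)
```

The simulated p-value is larger than the two-sided normal p-value of the largest single statistic, which reflects the price paid for taking a maximum over three tests.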

      Results

      Simulation

      To check whether GMS and GME can keep the correct size in the presence of confounding factors, we carried out simulation studies to examine their performance in the presence of sub-populations. The nominal level was set at 0.05. We assumed that, due to confounding factors, each of the case and control populations was divided into two sub-populations with equal probability. The simulation results are summarized in Table 1. When there are no confounding factors (Scenario 1), GMS and GME control the size well. On the other hand, when confounding factors are present and the matched design is adopted within each sub-population, GMS and GME are conservative (Scenarios 2, 3 and 4). Furthermore, without the matched design, the type I error rates of GMS and GME are seriously inflated (Scenarios 5 to 8). The simulation results show that in the presence of sub-populations, GMS and GME cannot keep the correct size whether or not the matched design is used.
      Table 1

      Type I error rates of GMS and GME based on 10,000 replicates without confounding (Scenario 1) and in the presence of confounding factors (Scenarios 2-8), with the significance level 0.05 using r l cases and s l controls; p l is the risk allele frequency and k l is the prevalence, l = 1,2.

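      | Scenario | r 1 | r 2 | s 1 | s 2 | p 1  | p 2 | k 1  | k 2  | GMS    | GME    |
      |----------|-----|-----|-----|-----|------|-----|------|------|--------|--------|
      | 1        | 250 | 250 | 250 | 250 | 0.3  | 0.3 | 0.05 | 0.05 | 0.0510 | 0.0502 |
      | 2        | 250 | 250 | 250 | 250 | 0.05 | 0.5 | 0.01 | 0.1  | 0.0199 | 0.0141 |
      | 3        | 250 | 250 | 250 | 250 | 0.1  | 0.5 | 0.01 | 0.1  | 0.0190 | 0.0167 |
      | 4        | 250 | 250 | 250 | 250 | 0.2  | 0.4 | 0.03 | 0.07 | 0.0391 | 0.0384 |
      | 5        | 300 | 200 | 200 | 300 | 0.2  | 0.4 | 0.03 | 0.07 | 0.3923 | 0.4403 |
      | 6        | 325 | 175 | 175 | 325 | 0.2  | 0.4 | 0.03 | 0.07 | 0.7337 | 0.7880 |
      | 7        | 350 | 150 | 150 | 350 | 0.2  | 0.4 | 0.03 | 0.07 | 0.9077 | 0.9567 |
      | 8        | 375 | 125 | 125 | 375 | 0.2  | 0.4 | 0.03 | 0.07 | 0.9625 | 0.9954 |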
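A short calculation helps to see why pooling the strata causes trouble in the scenarios of Table 1. The sketch below computes, under the null, the pooled risk allele frequency in cases and in controls and the HWD coefficient of the pooled sample, using the stratum sizes and allele frequencies of Scenarios 4 and 5. The function names are illustrative, and the calculation is an explanatory aid rather than part of the simulation study.

```python
def pooled_allele_freq(sizes, freqs):
    """Risk allele frequency after pooling strata of the given sizes."""
    return sum(n * p for n, p in zip(sizes, freqs)) / sum(sizes)


def pooled_hwd(sizes, freqs):
    """HWD coefficient of the pooled sample when each stratum is in HWE."""
    total = sum(sizes)
    freq_dd = sum(n * p * p for n, p in zip(sizes, freqs)) / total
    return freq_dd - pooled_allele_freq(sizes, freqs) ** 2


p = (0.2, 0.4)  # risk allele frequencies in the two strata

# Scenario 5 (unmatched): cases and controls are spread differently over strata.
freq_cases = pooled_allele_freq((300, 200), p)
freq_controls = pooled_allele_freq((200, 300), p)

# Scenario 4 (matched): equal spread, but the pooled sample is not in HWE.
delta_pooled = pooled_hwd((250, 250), p)
```

In Scenario 5 the pooled allele frequencies differ between cases and controls even though there is no association within either stratum, and in Scenario 4 the pooled sample has a non-zero HWD coefficient even though each stratum is in HWE.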

      To check whether the ability of Z SMRT to select the correct genetic model is low when the GRRs are small, we conducted a simulation to compare the selection procedure with the exclusion procedure. We considered 300 cases with 600 matched controls, divided into 3 sub-populations with proportions 0.3, 0.3 and 0.4 respectively. We set the MAFs and the penetrances in the three strata as (p 1, p 2, p 3) = (0.1, 0.3, 0.5) and (f 01, f 02, f 03) = (0.01, 0.05, 0.02). The threshold c was fixed at Φ^-1(0.95), and GRR2 = λ 2l was increased from 1.1 to 2.0 in increments of 0.1, l = 1,..., L.
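The simulation settings described above can be written down as a small configuration sketch. It derives the stratum sizes from the stated totals and proportions, builds the grid of GRR2 values, and obtains λ 1 from λ 2 under each genetic model using the definitions in the Notation section. The variable and function names are illustrative, and the matching ratio m = 2 is inferred from 600 controls for 300 cases.

```python
def lambda1(model, lam2):
    """GRR of the heterozygote implied by lam2 under each genetic model."""
    return {"REC": 1.0, "ADD": (lam2 + 1.0) / 2.0, "DOM": lam2}[model]


total_cases, m = 300, 2            # m = 2 matched controls per case (inferred)
proportions = (0.3, 0.3, 0.4)      # shares of the three sub-populations
mafs = (0.1, 0.3, 0.5)             # (p_1, p_2, p_3)
baseline = (0.01, 0.05, 0.02)      # (f_01, f_02, f_03)

cases = [round(total_cases * w) for w in proportions]
controls = [m * r for r in cases]

# GRR2 from 1.1 to 2.0 in steps of 0.1.
grr2_grid = [round(1.1 + 0.1 * i, 1) for i in range(10)]

# Penetrance triples (f_0l, f_1l, f_2l) per stratum for one model and one GRR2.
penetrances_rec = [(f0, lambda1("REC", 1.5) * f0, 1.5 * f0) for f0 in baseline]
```

Each replicate of the simulation would then draw genotypes for the cases and controls of every stratum from the frequencies implied by these penetrances.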

      The results are summarized in Figure 1, with circles representing the probabilities of selecting the correct genetic models and triangles representing the probabilities of correctly excluding the most unlikely genetic models. From Figure 1 we can find that under REC and DOM the triangles are always higher than 90%, whereas the circles can be less than 20% when the GRR2 is small. However, when ADD is the true genetic model, the circles coincide with the triangles. This is because under ADD, both REC and DOM are the most unlikely models thus the selection procedure is just the same as the exclusion procedure.
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_Fig1_HTML.jpg
      Figure 1

      The probabilities of correctly selecting the genetic models and of correctly excluding the most unlikely genetic models based on 10,000 replicates.

      Next, we performed simulations with no disease association and under various genetic models to evaluate the performance of the proposed robust methods. We also considered the MTTs optimal for the REC, ADD and DOM models, i.e. Z MTT(0), Z MTT(0.5) and Z MTT(1) respectively. Let R, S, F i , K and P denote the vectors of r l , s l , f il , k l and p l respectively across sub-populations, l = 1,..., L, i = 0, 1, 2. Each sub-population was in HWE. We first examined the type I error rates of these tests under the null hypothesis of no association, with nominal levels of 0.05 and 0.01. The results are summarized in Table 2.
      Table 2

      Type I error rates of Z MTT(0), Z MTT(0.5), Z MTT(1), Z SGMS, Z SGME, Z MAX3 and http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq28_HTML.gif with 2 df based on 10,000 replicates in the presence of confounding factors with the significance level α using R cases and S controls.

      | Scenario | α    | Z MTT(0) | Z MTT(0.5) | Z MTT(1) | Z SGMS | Z SGME | Z MAX3 | χ 2 (2 df) |
      |----------|------|----------|------------|----------|--------|--------|--------|------------|
      | A        | 0.05 | 0.0527   | 0.0498     | 0.0518   | 0.0501 | 0.0490 | 0.0527 | 0.0531     |
      | B        |      | 0.0487   | 0.0503     | 0.0493   | 0.0494 | 0.0502 | 0.0515 | 0.0481     |
      | C        |      | 0.0510   | 0.0512     | 0.0509   | 0.0506 | 0.0510 | 0.0513 | 0.0490     |
      | D        |      | 0.0526   | 0.0524     | 0.0537   | 0.0529 | 0.0528 | 0.0516 | 0.0484     |
      | E        |      | 0.0519   | 0.0512     | 0.0534   | 0.0536 | 0.0526 | 0.0501 | 0.0507     |
      | F        |      | 0.0485   | 0.0486     | 0.0479   | 0.0488 | 0.0467 | 0.0501 | 0.0481     |
      | G        |      | 0.0493   | 0.0497     | 0.0490   | 0.0492 | 0.0498 | 0.0457 | 0.0491     |
      | H        |      | 0.0522   | 0.0493     | 0.0480   | 0.0522 | 0.0521 | 0.0525 | 0.0522     |
      | A        | 0.01 | 0.0092   | 0.0081     | 0.0100   | 0.0083 | 0.0081 | 0.0075 | 0.0084     |
      | B        |      | 0.0096   | 0.0091     | 0.0106   | 0.0106 | 0.0103 | 0.0096 | 0.0101     |
      | C        |      | 0.0093   | 0.0101     | 0.0109   | 0.0108 | 0.0104 | 0.0101 | 0.0109     |
      | D        |      | 0.0121   | 0.0101     | 0.0098   | 0.0093 | 0.0093 | 0.0095 | 0.0105     |
      | E        |      | 0.0098   | 0.0094     | 0.0101   | 0.0093 | 0.0092 | 0.0083 | 0.0114     |
      | F        |      | 0.0109   | 0.0095     | 0.0106   | 0.0109 | 0.0102 | 0.0102 | 0.0109     |
      | G        |      | 0.0100   | 0.0094     | 0.0105   | 0.0114 | 0.0103 | 0.0111 | 0.0093     |
      | H        |      | 0.0087   | 0.0111     | 0.0104   | 0.0099 | 0.0103 | 0.0117 | 0.0107     |

      A : R = (150, 150, 200), S = (300, 300, 400), P = (0.1, 0.3, 0.5), K = (0.01, 0.05, 0.02)

      B : R = (100, 300, 100), S = (200, 600, 200), P = (0.1, 0.3, 0.5), K = (0.01, 0.05, 0.02)

      C : R = (300, 300, 400), S = (300, 300, 400), P = (0.1, 0.3, 0.5), K = (0.01, 0.05, 0.02)

      D : R = (200, 600, 200), S = (200, 600, 200), P = (0.1, 0.3, 0.5), K = (0.01, 0.05, 0.02)

      E : R = (250, 250), S = (500, 500), P = (0.2, 0.4), K = (0.01, 0.02)

      F : R = (150, 350), S = (300, 700), P = (0.2, 0.4), K = (0.01, 0.02)

      G : R = (500, 500), S = (500, 500), P = (0.2, 0.4), K = (0.01, 0.02)

      H : R = (300, 700), S = (300, 700), P = (0.2, 0.4), K = (0.01, 0.02)

      We considered eight scenarios (A to H) with different numbers of cases and controls, risk allele frequencies and disease prevalences. For example, in scenario A the case group comprised 150, 150 and 200 cases from 3 different sub-populations, and each case was matched with 2 controls within the same sub-population. The risk allele frequencies of the 3 sub-populations were 0.1, 0.3 and 0.5, and the disease prevalences were 0.01, 0.05 and 0.02, respectively. Table 2 shows that the type I error rates of all the tests are close to the nominal levels, so the robust tests and MTTs control the size well. Moreover, although we assume that HWE holds in each sub-population, a moderate departure from HWE has little impact on the sizes of SGMS and SGME (results omitted for brevity).
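To judge how close "close to the nominal levels" is, it can help to compute the Monte Carlo standard error of an empirical rejection rate. The sketch below assumes each of the 10,000 replicates is an independent Bernoulli trial with rejection probability α; the function names are illustrative and the band is only a rough yardstick, not part of the original analysis.

```python
import math


def mc_standard_error(alpha, n_replicates):
    """Standard error of an empirical rejection rate under the binomial model."""
    return math.sqrt(alpha * (1.0 - alpha) / n_replicates)


def mc_band(alpha, n_replicates, z=1.96):
    """Approximate 95% band for the empirical rate if the true size equals alpha."""
    se = mc_standard_error(alpha, n_replicates)
    return alpha - z * se, alpha + z * se


for alpha in (0.05, 0.01):
    low, high = mc_band(alpha, 10_000)
    se = mc_standard_error(alpha, 10_000)
    print(f"alpha={alpha}: SE={se:.5f}, band=({low:.4f}, {high:.4f})")
```

Because the table contains more than a hundred entries, a few values slightly outside such a band would still be compatible with simulation noise alone.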

      We also conducted simulations to investigate the performance of the proposed tests for small samples, where the number of cases is at most 100. The results are summarized in Table 3. The settings were the same as those in Table 2 except that the sample sizes were only 10% of those in Table 2. The results show that the proposed tests keep the size reasonably well even for small samples.
      Table 3

      Type I error rates of Z MTT(0), Z MTT(0.5), Z MTT(1), Z SGMS, Z SGME, Z MAX3 and χ² with 2 df for small sample sizes.

| Scenario | α | Z MTT(0) | Z MTT(0.5) | Z MTT(1) | Z SGMS | Z SGME | Z MAX3 | χ² (2 df) |
|---|---|---|---|---|---|---|---|---|
| A* | 0.05 | 0.0474 | 0.0501 | 0.0487 | 0.0493 | 0.0495 | 0.0452 | 0.0502 |
| B* | 0.05 | 0.0531 | 0.0479 | 0.0497 | 0.0480 | 0.0485 | 0.0462 | 0.0503 |
| C* | 0.05 | 0.0569 | 0.0526 | 0.0489 | 0.0507 | 0.0526 | 0.0516 | 0.0488 |
| D* | 0.05 | 0.0470 | 0.0480 | 0.0516 | 0.0482 | 0.0484 | 0.0488 | 0.0503 |
| E* | 0.05 | 0.0492 | 0.0498 | 0.0489 | 0.0518 | 0.0511 | 0.0535 | 0.0498 |
| F* | 0.05 | 0.0519 | 0.0503 | 0.0514 | 0.0535 | 0.0537 | 0.0486 | 0.0489 |
| G* | 0.05 | 0.0484 | 0.0505 | 0.0526 | 0.0483 | 0.0466 | 0.0551 | 0.0502 |
| H* | 0.05 | 0.0504 | 0.0453 | 0.0451 | 0.0451 | 0.0456 | 0.0484 | 0.0504 |
| A* | 0.01 | 0.0076 | 0.0083 | 0.0092 | 0.0102 | 0.0089 | 0.0108 | 0.0126 |
| B* | 0.01 | 0.0075 | 0.0091 | 0.0097 | 0.0078 | 0.0088 | 0.0086 | 0.0081 |
| C* | 0.01 | 0.0078 | 0.0089 | 0.0096 | 0.0082 | 0.0084 | 0.0126 | 0.0092 |
| D* | 0.01 | 0.0080 | 0.0095 | 0.0116 | 0.0093 | 0.0092 | 0.0111 | 0.0099 |
| E* | 0.01 | 0.0072 | 0.0093 | 0.0098 | 0.0099 | 0.0092 | 0.0091 | 0.0120 |
| F* | 0.01 | 0.0087 | 0.0081 | 0.0089 | 0.0085 | 0.0081 | 0.0091 | 0.0116 |
| G* | 0.01 | 0.0077 | 0.0120 | 0.0120 | 0.0113 | 0.0125 | 0.0081 | 0.0102 |
| H* | 0.01 | 0.0079 | 0.0087 | 0.0088 | 0.0073 | 0.0081 | 0.0098 | 0.0095 |

      The results are simulated based on 10,000 replicates in the presence of confounding factors with the significance level α using R cases and S controls.

      A*: R = (15, 15, 20), S = (30, 30, 40), P = (0.1, 0.3, 0.5), K = (0.01, 0.05, 0.02)

      B*: R = (10, 30, 10), S = (20, 60, 20), P = (0.1, 0.3, 0.5), K = (0.01, 0.05, 0.02)

      C*: R = (30, 30, 40), S = (30, 30, 40), P = (0.1, 0.3, 0.5), K = (0.01, 0.05, 0.02)

      D*: R = (20, 60, 20), S = (20, 60, 20), P = (0.1, 0.3, 0.5), K = (0.01, 0.05, 0.02)

      E*: R = (25, 25), S = (50, 50), P = (0.2, 0.4), K = (0.01, 0.02)

      F*: R = (15, 35), S = (30, 70), P = (0.2, 0.4), K = (0.01, 0.02)

      G*: R = (50, 50), S = (50, 50), P = (0.2, 0.4), K = (0.01, 0.02)

      H*: R = (30, 70), S = (30, 70), P = (0.2, 0.4), K = (0.01, 0.02)

      The powers of the MTTs and robust tests were compared under three genetic models (REC, ADD and DOM). The settings were the same as those in Table 2 except that the nominal level was set to be 0.05 and the GRR was determined so that the optimal MTT has the maximum power of about 80%. The results are summarized in Table 4. In each row, the power of the robust test which performs best among the four robust tests considered in Table 4 is bold-faced.
      Table 4

      Empirical powers of Z MTT(0), Z MTT(0.5), Z MTT(1), Z SGMS, Z SGME, Z MAX3 and χ² with 2 df based on 10,000 replicates.

| Scenario | Model | Z MTT(0) | Z MTT(0.5) | Z MTT(1) | Z SGMS | Z SGME | Z MAX3 | χ² (2 df) | ρ* |
|---|---|---|---|---|---|---|---|---|---|
| A | REC | 0.8059 | 0.5590 | 0.1369 | 0.6981 | 0.6760 | **0.7340** | 0.7154 | 0.3267 |
| A | ADD | 0.4890 | 0.7998 | 0.7126 | 0.7594 | **0.7896** | 0.7629 | 0.7188 | 0.7623 |
| A | DOM | 0.1237 | 0.6818 | 0.8040 | 0.7142 | 0.7155 | **0.7300** | 0.7158 | 0.2639 |
| B | REC | 0.8073 | 0.5367 | 0.1356 | 0.6725 | 0.6497 | **0.7229** | 0.7147 | 0.3423 |
| B | ADD | 0.4637 | 0.7977 | 0.7258 | 0.7646 | **0.7908** | 0.7502 | 0.7140 | 0.7445 |
| B | DOM | 0.1287 | 0.7011 | 0.8054 | 0.7168 | 0.7259 | **0.7383** | 0.7199 | 0.2756 |
| C | REC | 0.8057 | 0.5503 | 0.1330 | 0.6970 | 0.6691 | **0.7244** | 0.7153 | 0.3038 |
| C | ADD | 0.4896 | 0.8052 | 0.7139 | 0.7654 | **0.7952** | 0.7648 | 0.7094 | 0.7295 |
| C | DOM | 0.1193 | 0.6877 | 0.8062 | 0.7153 | 0.7210 | **0.7445** | 0.7112 | 0.2710 |
| D | REC | 0.7978 | 0.5235 | 0.1400 | 0.6655 | 0.6433 | **0.7177** | 0.7144 | 0.3124 |
| D | ADD | 0.4639 | 0.8045 | 0.7308 | 0.7654 | **0.7934** | 0.7499 | 0.7090 | 0.7037 |
| D | DOM | 0.1204 | 0.7024 | 0.8071 | 0.7144 | 0.7225 | **0.7250** | 0.7033 | 0.2792 |
| E | REC | 0.7974 | 0.5276 | 0.1453 | 0.7166 | 0.6782 | **0.7288** | 0.7218 | 0.3492 |
| E | ADD | 0.4667 | 0.8014 | 0.7294 | 0.7554 | **0.7884** | 0.7517 | 0.7131 | 0.7396 |
| E | DOM | 0.1261 | 0.6989 | 0.8027 | 0.7135 | **0.7234** | 0.7161 | 0.7077 | 0.2795 |
| F | REC | 0.8056 | 0.5547 | 0.1535 | 0.6991 | 0.6742 | **0.7317** | 0.7173 | 0.3561 |
| F | ADD | 0.5014 | 0.8068 | 0.7195 | 0.7650 | **0.7955** | 0.7524 | 0.7112 | 0.7661 |
| F | DOM | 0.1389 | 0.6933 | 0.8003 | 0.7167 | 0.7231 | **0.7291** | 0.7141 | 0.2887 |
| G | REC | 0.8045 | 0.5241 | 0.1499 | 0.7233 | 0.6786 | **0.7316** | 0.7082 | 0.3172 |
| G | ADD | 0.4562 | 0.8020 | 0.7313 | 0.7522 | **0.7883** | 0.7560 | 0.7068 | 0.6970 |
| G | DOM | 0.1247 | 0.7091 | 0.8004 | 0.7167 | 0.7297 | **0.7468** | 0.7099 | 0.2825 |
| H | REC | 0.7967 | 0.5481 | 0.1467 | 0.6950 | 0.6651 | **0.7203** | 0.7136 | 0.3279 |
| H | ADD | 0.4885 | 0.8017 | 0.7154 | 0.7610 | **0.7911** | 0.7529 | 0.7009 | 0.7272 |
| H | DOM | 0.1254 | 0.6944 | 0.8055 | 0.7183 | 0.7248 | **0.7366** | 0.7091 | 0.2945 |

      The settings are the same as those in Table 2 except that the GRRs are determined so that the optimal MTT has the maximum power of about 80%. The significance level is 0.05. ρ* is the minimum correlation of the optimal tests.

      Table 4 shows that although each MTT attains the highest power when its genetic model is correctly specified, the minimum powers of Z MTT(0) and Z MTT(1) are below 20% and the minimum powers of Z MTT(0.5) are between 50% and 60%. In contrast, the minimum powers of the robust tests are about 65% across all genetic models. Table 4 thus shows the advantage of the robust tests: when the genetic model is unknown, they are preferable to the MTTs. Moreover, if only the REC, ADD and DOM models are considered, Z SGME and Z MAX3 perform better than the other two robust tests, and Z MAX3 always dominates χ² with 2 df in these situations. Table 4 also reports ρ*, which is defined as the minimum correlation of the optimal tests [5]. For example, when REC is the true model, ρ* = min(corr(Z MTT(0), Z MTT(0.5)), corr(Z MTT(0), Z MTT(1))). Here ρ* serves as a guideline for choosing between the efficiency robust tests Z SGME and Z MAX3. From Table 4 we find that when ρ* is small (around 0.3), Z MAX3 is more powerful than, or at least as powerful as, Z SGME. However, when ρ* is large (around 0.7), Z SGME is a better choice. This finding is similar to the property of the efficiency robust procedures in survival data analysis studied by Freidlin et al. [30], who also suggest using the MAX-type statistic if ρ* is less than 0.6 or 0.7.
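The minimum-power comparison made above can be reproduced directly from the reported powers. The short Python sketch below uses only the scenario A rows of Table 4; the data layout and function name are illustrative choices.

```python
TESTS = ["MTT(0)", "MTT(0.5)", "MTT(1)", "SGMS", "SGME", "MAX3", "chi2_2df"]

# Empirical powers for scenario A, copied from Table 4 (one row per true model).
POWER_A = {
    "REC": [0.8059, 0.5590, 0.1369, 0.6981, 0.6760, 0.7340, 0.7154],
    "ADD": [0.4890, 0.7998, 0.7126, 0.7594, 0.7896, 0.7629, 0.7188],
    "DOM": [0.1237, 0.6818, 0.8040, 0.7142, 0.7155, 0.7300, 0.7158],
}


def minimum_power(power_by_model, tests=TESTS):
    """Worst-case power of each test over the candidate genetic models."""
    columns = zip(*power_by_model.values())
    return {test: min(column) for test, column in zip(tests, columns)}


worst_case = minimum_power(POWER_A)
for test, power in worst_case.items():
    print(f"{test:>9}: minimum power = {power:.4f}")
```

A test with a high worst-case power is efficiency robust in the sense used here: it gives up a little power under the best-case model in exchange for protection against a badly misspecified one.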

      We further compared Z SGMS, Z SGME, Z MAX3 and χ² with 2 df under different genetic models. The parameter settings were the same as those of scenario A in Table 2 except that λ 2 increased from 1.1 to 2.0 in increments of 0.1 and λ 1l = 1 + x(λ 2l - 1). The results are summarized in Figures 2, 3, 4 and 5, with panels a, b, c, d, e, f, g representing x = -0.25, 0, 0.25, 0.5, 0.75, 1, 1.25 respectively.
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_Fig2_HTML.jpg
      Figure 2

      Empirical powers of Z SGMS, Z SGME, Z MAX3 and χ² with 2 df under genetic model a. The significance level is 0.05.

      http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_Fig3_HTML.jpg
      Figure 3

      Empirical powers of Z SGMS, Z SGME, Z MAX3 and χ² with 2 df under genetic models b and c. The significance level is 0.05.

      http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_Fig4_HTML.jpg
      Figure 4

      Empirical powers of Z SGMS, Z SGME, Z MAX3 and χ² with 2 df under genetic models d and e. The significance level is 0.05.

      http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_Fig5_HTML.jpg
      Figure 5

      Empirical powers of Z SGMS, Z SGME, Z MAX3 and χ² with 2 df under genetic models f and g. The significance level is 0.05.

      Notice that x = 0, 0.5, 1 (panels b, d, f) correspond to the REC, ADD and DOM models respectively. Under these three commonly used genetic models, Z SGMS, Z SGME and Z MAX3 have comparable powers, although Z MAX3 may be slightly more powerful than the other two tests under the REC and DOM models, and Z SGME may dominate Z MAX3 and Z SGMS under the ADD model. The χ² test with 2 df has the least power among all the tests considered here. x = 0.25 (panel c) indicates a genetic model between REC and ADD, and x = 0.75 (panel e) corresponds to a genetic model between ADD and DOM. The performance of the robust tests under these two genetic models is similar to that under ADD: Z SGME is slightly more powerful than Z SGMS and Z MAX3, and χ² with 2 df still has the least power.

      x = -0.25 and 1.25 indicate two less plausible models, the under-recessive model (panel a) and the over-dominant model (panel g). Under the under-recessive model, where f 1l < f 0l , χ² with 2 df is the most powerful test, followed by Z SGMS and Z MAX3; Z SGME performs the worst in this situation. Under the over-dominant model, where f 1l > f 2l , all the robust tests perform very similarly.
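The family of genetic models indexed by x can be made concrete with a small Python sketch. It assumes the usual definitions, which are not restated in this section: the GRRs are λ 1 = f 1 / f 0 and λ 2 = f 2 / f 0 , genotype frequencies follow HWE with risk allele frequency p, and the prevalence satisfies k = Σ g i f i . The function names and the example values p = 0.1, k = 0.01, λ 2 = 1.5 are illustrative.

```python
def heterozygote_grr(x, lam2):
    """lambda_1 = 1 + x * (lambda_2 - 1), the parametrization used in the text."""
    return 1.0 + x * (lam2 - 1.0)


def penetrances(p, k, x, lam2):
    """Return (f0, f1, f2) for allele frequency p, prevalence k and model index x.

    Assumes HWE and k = g0*f0 + g1*f1 + g2*f2 with f1 = lam1*f0, f2 = lam2*f0.
    """
    lam1 = heterozygote_grr(x, lam2)
    q = 1.0 - p
    g0, g1, g2 = q * q, 2.0 * p * q, p * p
    f0 = k / (g0 + lam1 * g1 + lam2 * g2)
    return f0, lam1 * f0, lam2 * f0


for x in (-0.25, 0.0, 0.25, 0.5, 0.75, 1.0, 1.25):
    f0, f1, f2 = penetrances(p=0.1, k=0.01, x=x, lam2=1.5)
    print(f"x={x:5.2f}: f0={f0:.5f}, f1={f1:.5f}, f2={f2:.5f}")
```

With λ 2 > 1, x = -0.25 gives f 1 < f 0 (under-recessive) and x = 1.25 gives f 1 > f 2 (over-dominant), while 0 ≤ x ≤ 1 keeps the heterozygote penetrance between the two homozygote penetrances.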

      To summarize, if the mean genetic effect of the heterozygous genotype is between those of the two homozygous genotypes, then we suggest Z MAX3, Z SGMS and Z SGME. On the other hand, if the genetic effects are not ranked in accordance with the genotypes, then χ² with 2 df is preferred. This is reasonable because χ² with 2 df does not take the order of the genetic effects into account, so it should perform well when the genetic effects are not ranked in accordance with the genotypes.

      Notice that in our simulation we consider the common disease common variant (CDCV) hypothesis, which is currently the most popular theory underlying complex disease etiology. However, if the common disease rare variant (CDRV) assumption holds, which implies that the disease etiology is caused collectively by multiple rare variants with moderate to high penetrances, the proposed tests are conservative and underpowered for detecting association [31]. In this case, the combined multivariate and collapsing (CMC) method proposed by Li and Leal [31] may be used to increase the power of the proposed tests.

      An application

      We applied the MTTs and the robust tests to a matched pair case-control etiologic study of sarcoidosis (ACCESS) [15]. In this study, a total of 497 case-control pairs, matched on age (within 5 years), race (Caucasian and African-American) and gender, were recruited to test for association between immunoglobulin gene polymorphism and sarcoidosis. A subset containing 219 African-American matched pairs was used by Zheng and Tian [21]. We consider the KM(1,3) polymorphism as the candidate marker. After estimating the risk allele frequencies in controls of the matched sets defined by the two confounding factors (gender and race), we find three sub-populations, namely Caucasian, Female African-American and Male African-American. The details of the matched data and sub-structure information are summarized in Table 5.
      Table 5

      The pair-matched case-control study of ACCESS.

        

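Rows give the genotype of the case and columns the genotype of the matched control.

| Sub-population | Case genotype | Control '11' | Control '13' | Control '33' | Total |
|---|---|---|---|---|---|
| Caucasian | '11' | 0 | 0 | 1 | 1 |
| Caucasian | '13' | 0 | 9 | 36 | 45 |
| Caucasian | '33' | 2 | 29 | 201 | 232 |
| Caucasian | Total | 2 | 38 | 238 | 278 |
| Female/African-American | '11' | 1 | 11 | 8 | 20 |
| Female/African-American | '13' | 8 | 26 | 40 | 74 |
| Female/African-American | '33' | 4 | 34 | 24 | 62 |
| Female/African-American | Total | 13 | 71 | 72 | 156 |
| Male/African-American | '11' | 1 | 2 | 5 | 8 |
| Male/African-American | '13' | 1 | 14 | 17 | 32 |
| Male/African-American | '33' | 1 | 11 | 11 | 23 |
| Male/African-American | Total | 3 | 27 | 33 | 63 |
| Combined | '11' | 2 | 13 | 14 | 29 |
| Combined | '13' | 9 | 49 | 93 | 151 |
| Combined | '33' | 7 | 74 | 236 | 317 |
| Combined | Total | 18 | 136 | 343 | 497 |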
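The sub-structure in the ACCESS data can be seen by estimating an allele frequency among controls in each stratum from the column totals of the table above. The Python sketch below computes the frequency of allele '1' by allele counting; which allele is treated as the risk allele is not stated in this section, so that choice, like the function name, is an assumption made only for illustration.

```python
# Control genotype totals ('11', '13', '33') per stratum, from the ACCESS table.
CONTROL_TOTALS = {
    "Caucasian": (2, 38, 238),
    "Female African-American": (13, 71, 72),
    "Male African-American": (3, 27, 33),
}


def allele1_frequency(n11, n13, n33):
    """Frequency of allele '1' by allele counting: (2*n11 + n13) / (2*n)."""
    n = n11 + n13 + n33
    return (2 * n11 + n13) / (2 * n)


frequencies = {
    stratum: allele1_frequency(*counts)
    for stratum, counts in CONTROL_TOTALS.items()
}
for stratum, freq in frequencies.items():
    print(f"{stratum}: allele '1' frequency among controls = {freq:.3f}")
```

The large gap between the Caucasian and African-American estimates illustrates why pooling the strata without adjustment could confound the association test.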

      First we applied the MTTs optimal for the REC, ADD and DOM models to the data set and obtained p-values of 0.058, 0.025 and 0.093 for Z MTT(0), Z MTT(0.5) and Z MTT(1) respectively. Thus, whether there is a significant association at the nominal level of 0.05 is unclear, because different genetic models give different answers.

      Then we applied χ² with 2 df and Z MAX3 to the data set and obtained p-values of 0.076 and 0.056 respectively, which again do not give a conclusive finding at a significance level of 0.05. Note that the p-value of Z MAX3 was calculated according to the asymptotic formula obtained by Zang et al. [29]. Thereafter we applied Z SGMS and Z SGME to the same data. We obtained Z SMRT = 0.124, which falls in the interval [-1.645, 1.645] and strongly suggests an ADD model. Thus, for SGMS we select ADD and for SGME we exclude REC and DOM. Using formulas (6) and (8) we obtained p-values of 0.0398 for Z SGMS and 0.0310 for Z SGME, both suggesting a marginally significant association. According to our simulation, Z SGME is the most powerful robust test under the ADD model. We also obtained the minimum correlation of the optimal tests ρ* = 0.603, which indicates that Z SGME is a better choice than Z MAX3 according to our previous discussion of Table 4. Our results are thus consistent with the findings in that discussion. To sum up, we observe that there is some association between the candidate marker and sarcoidosis.
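The model-selection step used in this analysis can be sketched as a simple threshold rule. Only the middle branch is confirmed by the text above (Z SMRT inside [-1.645, 1.645] selects ADD); the labels returned for the two tails follow the usual GMS convention and are assumptions that should be checked against the definition of Z SMRT given earlier in the paper. The function name and parameters are hypothetical.

```python
def select_genetic_model(z_smrt, c=1.645, if_above="REC", if_below="DOM"):
    """Choose a genetic model from the model-selection statistic Z SMRT.

    Stated in the text: |z_smrt| <= c selects the additive (ADD) model.
    ASSUMPTION: the tail labels (if_above, if_below) follow the common GMS
    convention; the sign convention is not given in this section.
    """
    if z_smrt > c:
        return if_above
    if z_smrt < -c:
        return if_below
    return "ADD"


# Value reported for the ACCESS data.
print(select_genetic_model(0.124))
```

Once a model is selected, the matching MTT is applied and its p-value is corrected for the selection step, which is what formulas (6) and (8) referenced above are used for.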

      Discussion

      In this paper, we extended the GMS [11] and GME [13] methods to the matched case-control association study and proposed the SGMS and SGME methods so that they can be used when there are confounding factors in the recruited samples. We showed that the p-values of both tests can be determined analytically based on the asymptotic tri-variate normal distributions. We also reviewed some other robust tests for matched case-control association studies, such as the MAX3 test and the χ² test with 2 df. Simulations were carried out to examine the robustness of all these tests. The tests were also used to analyze a real pair-matched data set of sarcoidosis. Simulation results indicate that when the genetic model is unknown, a mis-specification of the genetic model may result in a substantial loss of power for the MTTs. In this situation, robust tests are preferred. Further comparisons among the robust tests were also conducted. According to our simulation, when the genetic effects are ordered in accordance with their genotypes, MAX3, SGMS and SGME are preferred. On the other hand, if the less plausible genetic models such as the over-dominant and under-recessive models cannot be excluded, then the χ² test with 2 df is a good choice.

      We adopted the matching framework at the stage of recruiting samples, so our study is a pre-matched case-control association study. In practice, even in the unmatched case-control design, matching is still an important tool to eliminate the effect of latent confounding factors such as population stratification and cryptic relatedness. For example, Guan et al. [24] recently proposed a matched design in an unmatched case-control study. They post-matched individuals by their genotypes, followed by a conditional matching analysis, to correct for population stratification in genome-wide association studies. In fact, after applying their method or the principal components method [32] and its extension [33] to classify the latent population structure, all the robust tests discussed in this paper can be used as robust approaches that also correct for latent population stratification in unmatched case-control or genome-wide association studies. The regression approach is also suggested in the literature to adjust for confounding factors other than markers. However, if the whole population has many subpopulations due to confounding, the performance of the regression method could be affected because too many nuisance parameters need to be estimated. Furthermore, how to derive the variance-covariance matrices of the distribution of the robust tests in this case is still uncertain. Further research in this area is needed.

      Conclusion

      Simulation results and real data analysis show that SGMS and SGME maintain a correct type I error rate for stratified data while having good efficiency robustness against genetic model uncertainty. Moreover, the formulas proposed in this paper can easily be used to calculate the corresponding p-values. Thus, SGMS and SGME are useful for the analysis of genetic data from matched case-control designs.

      Appendix

      First we derive the correlation ρ x between Z SMRT and Z MTT(x). Define U(x) as the numerator of Z MTT(x). Under the null hypothesis,
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_Equc_HTML.gif
      Following Zheng and Ng [11], we can obtain that
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_Equd_HTML.gif
      When n → ∞, http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq29_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq30_HTML.gif . Hence, we have
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_Eque_HTML.gif

      Substituting http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq31_HTML.gif for p l , we obtain the estimate http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq32_HTML.gif for ρ x (x = 0, 0.5, 1).

      Next we report the correlation ρ x ,0.5 between Z MTT(x) and Z MTT(0.5) (x = 0, 1). Under the null hypothesis,
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_Equf_HTML.gif
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_Equg_HTML.gif
      Since http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq33_HTML.gif , after simple algebra, we have,
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_Equh_HTML.gif

      Substituting http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq34_HTML.gif for p il (i = 0,1,2), we obtain the estimate http://static-content.springer.com/image/art%3A10.1186%2F1471-2156-11-91/MediaObjects/12863_2010_Article_838_IEq35_HTML.gif for ρ x, 0.5 (x = 0,1).

      Declarations

      Acknowledgements

      The research of Y. Zang was partially supported by the China Natural Science Foundation grant 10701067 and the research of W. K. Fung was partially supported by the HKU Research Output Prize Funding.

      Authors’ Affiliations

      (1)
      Department of Statistics and Actuarial Science, The University of Hong Kong

      References

      1. Cochran WG: Some methods for strengthening the common chi-square test. Biometrics 1954, 10:417–451.
      2. Armitage P: Test for linear trends in proportions and frequencies. Biometrics 1955, 11:375–386.
      3. Sasieni PD: From genotypes to genes: doubling the sample size. Biometrics 1997, 53:1253–1261.
      4. Zheng G, Freidlin B, Li Z, Gastwirth JL: Choice of scores in trend tests for case-control studies of candidate-gene associations. Biometrical Journal 2003, 45:335–348.
      5. Gastwirth JL: On robust procedures. Journal of the American Statistical Association 1966, 61:929–948.
      6. Gastwirth JL: The use of maximin efficiency robust tests in combining contingency tables and survival analysis. Journal of the American Statistical Association 1985, 80:380–384.
      7. Freidlin B, Zheng G, Li Z, Gastwirth JL: Trend tests for case-control studies of genetic markers: power, sample size and robustness. Human Heredity 2002, 53:146–152.
      8. Gonzalez JR, Carrasco JL, Dudbridge F, Armengol L, Estivill X, Moreno V: Maximizing association statistics over genetic models. Genetic Epidemiology 2008, 32:246–254.
      9. Li Q, Zheng G, Li Z, Yu K: Efficient approximation of p-value of the maximum of correlated tests, with applications to genome-wide association studies. Annals of Human Genetics 2008, 72:397–406.
      10. Joo J, Kwak M, Chen Z, Zheng G: Tutorial in biostatistics: Efficiency robust statistics for genetic linkage and association studies under genetic model uncertainty. Statistics in Medicine 2010, 29:158–180.
      11. Zheng G, Ng HKT: Genetic model selection in two-phase analysis for case-control association studies. Biostatistics 2008, 9:391–399.
      12. Song K, Elston RC: A powerful method of combining measures of association and Hardy-Weinberg disequilibrium for fine-mapping in case-control studies. Statistics in Medicine 2006, 25:105–126.
      13. Joo J, Kwak M, Zheng G: Improving power for testing genetic association in case-control studies by reducing alternative space. Biometrics 2010, 66:266–276.
      14. Minelli C, Thompson JR, Abrams KR, Lambert PC: Bayesian implementation of a genetic model-free approach to the meta-analysis of genetic association studies. Statistics in Medicine 2005, 24:3845–3861.
      15. ACCESS Research Group: Design of a case control etiologic study of sarcoidosis (ACCESS). Journal of Clinical Epidemiology 1999, 52:1173–1186.
      16. Manusirivithaya S, Siriaunkgul S, Khunamornpong S, Sripramote M, Sampatanukul P, Tangjitgamol S, Srisomboon J: Association between Bcl-2 expression and tumor recurrence in cervical cancer: A matched case-control study. Gynecologic Oncology 2006, 102:263–269.
      17. Suzuki H, Li YN, Dong XQ, Hassan MM, Abbruzzese JL, Li DH: Effect of insulin-like growth factor gene polymorphisms alone or in interaction with diabetes on the risk of pancreatic cancer. Cancer Epidemiology, Biomarkers & Prevention 2008, 17:3467–3473.
      18. Yin JY, Vogel U, Ma Y, Qi R, Wang HW: Association of DNA repair gene XRCC1 and lung cancer susceptibility among nonsmoking Chinese women. Cancer Genetics and Cytogenetics 2009, 188:26–31.
      19. Lee WC: Case-control association studies with matching and genomic controlling. Genetic Epidemiology 2004, 27:1–13.
      20. Kraft P, Cox DG, Paynter RA, Hunter D, De Vivo I: Accounting for haplotype uncertainty in matched association studies: A comparison of simple and flexible techniques. Genetic Epidemiology 2005, 28:261–272.
      21. Zheng G, Tian X: Robust trend tests for genetic association using matched case-control design. Statistics in Medicine 2006, 25:3160–3173.
      22. Zhang H, Zhang H, Li Z, Zheng G: Statistical methods for haplotype-based matched case-control association studies. Genetic Epidemiology 2007, 31:316–326.
      23. Chen J, Rodriguez C: Conditional likelihood methods for haplotype-based association analysis using matched case-control data. Biometrics 2007, 63:1099–1107.
      24. Guan W, Liang L, Boehnke M, Abecasis GR: Genotype-based matching to correct for population stratification in large-scale case-control genetic association studies. Genetic Epidemiology 2009, 33:508–517.
      25. Day NE, Byar DP: Testing hypotheses in case-control studies-equivalence of Mantel-Haenszel statistics and logit score tests. Biometrics 1979, 35:623–630.
      26. Mantel N, Haenszel W: Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute 1959, 22:719–748.
      27. Agresti A: Categorical data analysis. Second edition. John Wiley & Sons, Inc; 2002.
      28. Zheng G, Chen Z: Comparison of maximum statistics for hypothesis testing when nuisance parameter is present only under the alternative. Biometrics 2005, 61:254–258.
      29. Zang Y, Fung WK, Zheng G: Asymptotic powers for matched trend tests and robust matched trend tests in case-control genetic association studies. Computational Statistics and Data Analysis 2010, 54:65–77.
      30. Freidlin B, Podgor MJ, Gastwirth JL: Efficiency robust tests for survival or ordered categorical data. Biometrics 1999, 55:883–886.
      31. Li B, Leal SM: Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. American Journal of Human Genetics 2008, 83:311–321.
      32. Price AL, Patterson NJ, Plenge RM, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 2006, 38:904–909.
      33. Li Q, Wacholder S, Hunter DJ, Hoover RC, Chanock S, Thomas G, Yu K: Genetic background comparison using distance-based regression with application in population stratification evaluation and adjustment. Genetic Epidemiology 2009, 33:432–441.

      Copyright

      © Zang and Fung. 2010

      This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
