Assessing the joint effect of population stratification and sample selection in studies of gene-gene (environment) interactions

Background: It is well known that population stratification (PS) may cause the usual test in case-control studies to produce spurious gene-disease associations. However, the joint impact of PS and sample selection (SS) is less well understood. In this paper, we provide a systematic study of the joint effect of PS and SS under a more general risk model containing genetic and environmental factors. We provide simulation results to show the magnitude of the bias and its impact on the type I error rate of the usual chi-square test under a wide range of PS levels and selection bias.
Results: The biases in the estimation of the main and interaction effects are quantified and their bounds derived. The estimated bounds can be used to compute conservative p-values for the association test. If the conservative p-value is smaller than the significance level, we can safely claim that the association test is significant regardless of whether PS or selection bias is present. We also identify conditions under which the bias is null. The bias depends on the allele frequencies, exposure rates, gene-environment odds ratios and disease risks across subpopulations, and on the sampling of the cases and controls.
Conclusion: Our results show that the bias cannot be ignored even if the case and control data are matched in ethnicity. A real example is given to illustrate the application of the conservative p-value. These results are useful for genetic association studies of main and interaction effects.


Background
In the search for causative agents of human disease, both environmental and genetic risk factors have been identified. There is substantial evidence that relatively common polymorphisms in a wide spectrum of genes may modify the effect of environmental agents [1,2]. Several studies have also demonstrated the presence of gene-gene interaction in complex human diseases [3][4][5][6][7]. Gene-gene interaction, or epistasis, is also a basic genetic concept that has long been used by biologists [8].
Many association designs have been proposed for studying gene-environment or gene-gene interactions. Recently, Wang and Zhao [9] found that in the study of gene-gene interactions, the unmatched case-control association design is more powerful than both the matched case-control design and the case-parents design. They also found that when a logistic regression model is fitted to assess gene-environment interactions based on a case-parents sample, the approach may be susceptible to PS bias [10]. However, the case-control design is also well known to be susceptible to PS bias in the study of genetic effects if the gene under study shows marked variation in allele frequency across subgroups of the population and these subgroups also differ in their baseline disease risks [11][12][13][14][15][16][17]. Wang et al. [18] recently provided numerical examples showing that when the correlation between genetic and environmental factors is small or the linkage disequilibrium is weak, and the case-control data are collected according to a simple random sampling (SRS) scheme (that is, with no selection bias), the PS bias in testing a null interaction odds ratio is also small. However, selection bias often occurs in case-control studies, and more studies are needed to better understand the joint impact of PS and SS.
In this paper, we investigate the joint effect of population stratification and sample selection in testing null main or interaction effects. Under general sampling, we quantify the magnitude of the PS-SS bias in terms of the baseline disease risks, genotype frequencies, exposure rates, their odds ratios (linkage disequilibrium coefficients), and the effect sizes of the risk factors. Based on this result, we find that matching in ethnicity cannot eliminate bias in association studies. From the bias expression, we also derive conditions under which the bias is null.
The PS-SS bias cannot be estimated, since we do not know how many subpopulations are involved in the studied population and/or to which subpopulation a person belongs. Although adjusting for covariates such as principal components can be used to account for PS in genome-wide association studies [19], it is not clear whether the same approach can be applied in studies of interaction, since, for example, the bias level also depends on the effect size of the environmental factor. In this paper, we also derive useful bounds to measure the maximal impact of the bias. Sometimes these bounds can be estimated, so that tests robust to the joint effect of PS and SS can be derived; see Lee and Wang [20] for a similar suggestion in studies of gene-disease association. We use theoretical formulas and simulation results to show the general properties of the usual association test in the presence of PS or selection bias. We also provide a real example to demonstrate the computation of a conservative p-value in studying the interaction effect of maternal smoking and a GSTT1 variant on the risk of orofacial cleft.

The Magnitude of the Bias
We begin this section with the notation that will be used throughout this work. Disease status is denoted by D, with levels D = 1 and D = 0 indicating the presence and absence of the disease, respectively. Let G = 1 (0) represent the presence (absence) of the genotype of interest, and let H = 1 (0) represent the presence (absence) of the environmental exposure or another genotype of interest. Although we focus only on the 2 × 2 × 2 table, all results can be extended to any number of risk factors or any number of levels. We also assume that the population under study consists of K subpopulations and denote by S the stratification variable, taking values s = 1,..., K. However, K is unknown and S is not observable in our discussion of the PS effect.
To quantify the PS effect, we assume a risk model for P(D = 1|G = g, H = h, S = s) that is linear on the log-odds scale, with parameters α_s, β, γ and δ, where the genetic and environmental data are obtained from subpopulation s. As usual, we use s = 1, g = 0, and h = 0 to represent the referent subpopulation, genotype and environmental exposure, respectively. For the purpose of identifiability, we define α_1 = 0. The α_s, s = 1,..., K, are subpopulation-specific parameters representing the potential heterogeneity of disease risk across subpopulations. In this model, the log-odds-ratio β measures the association between the genotype and the risk of disease, and the log-odds-ratio γ measures the association between the environmental exposure (or another genotype) and the risk of disease. The multiplicative interaction δ measures the change in the disease-genotype log-odds-ratio across levels of the risk factor H. Similar risk models for studying genetic effects under PS can be found in Satten et al. [21] and Cheng and Lin [17], for example. For subpopulation s, we use OR_s to represent the baseline G-H odds ratio (given D = 0). We define G_s as the baseline G-frequency odds; the baseline H-frequency odds H_s is similarly defined. We also define D_s as the baseline disease frequency odds. In the discussion of the PS effect, one often assumes that case and control data are sampled according to the SRS design. Let P(S = s|D = 1) and P(S = s|D = 0) represent the corresponding proportions of subpopulation s in the cases and controls, respectively. However, in real applications, selection bias often occurs and sampling may not follow the SRS scheme for various reasons. Let the true proportion of subjects in the cases (controls) that are from subpopulation s be denoted by P#(S = s|D = 1) (P#(S = s|D = 0)). We use DS_s = P#(S = s|D = 1) [...] exp(β*), exp(γ*) and exp(δ*) are the bias levels. We note that if D_s DS_s is constant with respect to s, then K(g, h) is also constant and there is no bias of any kind.
A sufficient condition for this to hold is that the baseline disease risk is identical across all subpopulations and the sampling follows an SRS design. Further, if the cases and controls are matched in ethnicity and P(D = 1|S = s) ≈ P(D = 1|G = H = 0, S = s) for all subpopulations, then the bias should be very small. However, this approximation often fails when environmental factors, such as smoking, contribute to the disease risk. In this scenario, even if the cases and controls are perfectly matched, the bias can still be large. This conclusion differs from that for gene-disease association studies; see, for example, Cheng, Lee and Chen [22]. We discuss this issue further in later sections.
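The risk model described above can be sketched as follows. This is a reconstruction from the stated definitions rather than the authors' displayed equation: the common intercept \alpha_0 is a symbol introduced here for illustration (the text only states the constraint \alpha_1 = 0), and the expression for D_s assumes that "baseline" refers to the referent levels g = 0 and h = 0.

```latex
\begin{align*}
\operatorname{logit} P(D = 1 \mid G = g, H = h, S = s)
  &= \alpha_0 + \alpha_s + \beta g + \gamma h + \delta g h,
  \qquad \alpha_1 = 0, \\
D_s = \frac{P(D = 1 \mid G = 0, H = 0, S = s)}{P(D = 0 \mid G = 0, H = 0, S = s)}
  &= \exp(\alpha_0 + \alpha_s).
\end{align*}
```

Under this form, heterogeneity in the baseline disease odds across subpopulations is carried entirely by \alpha_s, while \beta, \gamma and \delta are shared by all subpopulations.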

Maximal bias and conditions for the null bias
Here, we give conditions for the null bias and bounds for the bias. The bias exp(β*) in the estimation of the genetic main effect depends on the variation of the genotype frequencies measured by G† = max [...] Note that the bias β* depends only on K(g, 0). We first present some conditions for the null bias β* = 0 when the true genetic main effect is null: (1) if the baseline genotype frequency is constant across subpopulations, then the bias β* is zero (this can be proved using equation (1) in the Methods section); (2) if the sample selection follows an SRS scheme (DS† = 1) and the disease risk is constant across subpopulations, then the bias is also null (however, if the sampling is not SRS, the bias may be non-null; see Tables 1 and 2); (3) if the case and control data are matched in ethnicity and γ = δ = 0 (both the H-main effect and the interaction are null), then the bias is null.
When the interaction effect is null, some conditions for the null bias δ* = 0 are: (1) if the baseline G-H odds ratios and the G (or H) frequency odds are constant across subpopulations, then the bias δ* is null (this can be proved using equation (2) in the Methods section); (2) if the sample selection follows an SRS scheme and the disease risk is constant across subpopulations, then the bias δ* is also null. However, see Tables 1 and 2 for the presence of bias when the SRS condition fails.
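The null-bias conditions above can be explored numerically. The following is a minimal sketch, not the authors' derivation: it assumes a logistic risk model, so that within each subpopulation the case distribution of (G, H) is the control distribution reweighted by exp(βg + γh + δgh), and it assumes that selection depends only on subpopulation and disease status. All function names and the example frequencies and sampling proportions are hypothetical.

```python
from itertools import product
from math import exp

CELLS = list(product((0, 1), repeat=2))  # (g, h) combinations

def independent(p_g, p_h):
    """Baseline (control) distribution with G and H independent in one subpopulation."""
    return {(g, h): (p_g if g else 1 - p_g) * (p_h if h else 1 - p_h)
            for g, h in CELLS}

def case_dist(ctrl, beta, gamma, delta):
    """Case distribution of (G, H) implied by Bayes' rule under a logistic model."""
    w = {(g, h): ctrl[(g, h)] * exp(beta * g + gamma * h + delta * g * h)
         for g, h in CELLS}
    total = sum(w.values())
    return {cell: v / total for cell, v in w.items()}

def pool(dists, props):
    """Mix subpopulation-specific distributions using sampling proportions."""
    return {cell: sum(p * d[cell] for p, d in zip(props, dists)) for cell in CELLS}

def gh_odds_ratio(table):
    return table[(1, 1)] * table[(0, 0)] / (table[(1, 0)] * table[(0, 1)])

def pooled_odds_ratios(ctrl_dists, case_props, ctrl_props,
                       beta=0.0, gamma=0.0, delta=0.0):
    """Return (G main-effect OR at H = 0, G x H interaction OR) in the pooled sample.

    With beta = delta = 0, any departure from 1 is bias caused by pooling.
    """
    cases = pool([case_dist(c, beta, gamma, delta) for c in ctrl_dists], case_props)
    ctrls = pool(ctrl_dists, ctrl_props)
    main = (cases[(1, 0)] / cases[(0, 0)]) / (ctrls[(1, 0)] / ctrls[(0, 0)])
    inter = gh_odds_ratio(cases) / gh_odds_ratio(ctrls)
    return main, inter

# Hypothetical two-subpopulation setting with all true effects null.
strata = [independent(0.51, 0.19), independent(0.80, 0.50)]
print(pooled_odds_ratios(strata, (0.7, 0.3), (0.95, 0.05)))           # unmatched
print(pooled_odds_ratios(strata, (0.7, 0.3), (0.7, 0.3)))             # matched
print(pooled_odds_ratios(strata, (0.7, 0.3), (0.7, 0.3), gamma=1.0))  # matched, H effect
```

In this sketch, matching with γ = δ = 0 returns odds ratios of 1, whereas matching with a non-null H-main effect leaves a pooled main-effect odds ratio different from 1, in line with condition (3) for β*.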
Next, we present bounds on the largest bias in the estimation of the main effect. In the Methods section, we show that the bias exp(β*) can be expressed as [...], where the w_s are constants satisfying 0 ≤ w_s ≤ 1 and Σ_{s=1}^{K} w_s = 1. The bias is greatest when the number of subpopulations is 2. The bias is also bounded below by L_β ≡ U_β^{-1}. These bounds give the maximal impact of the bias in making inference about the genetic main effect. For a rare disease, the background disease rate is approximately equal to the background disease odds. We find that the bound under SRS (DS† = 1) is similar to that given by Lee and Wang [20]. However, our result is more general in the sense that their risk model is a special case of ours and selection bias was not considered in their paper.
In the Methods section, we also show that under SRS, the bias exp(δ*) is bounded above by U [...] These are the same bounds derived by Wang et al. [18]. Unfortunately, these bounds are not valid when there is selection bias. Under general sample selection, we show that the bias exp(δ*) is bounded above by U_δ^(2) and bounded below by 1/U_δ^(2) ≡ L_δ^(2). Using these bounds, we can conclude that if the genetic factors are in linkage equilibrium within each subpopulation and the variation of the G (or H) frequency odds is small, then the bias is also expected to be small.
Table 1 Biases and the true type I errors of the chi-square tests when G† = 5 and LD = (0,0). PM means that perfect matching P#(S = s|D = 1) = P#(S = s|D = 0) is satisfied.

True type I errors
In case-control studies, one often expects that the type I errors of the association tests can be approximately controlled at some predetermined level. However, in the presence of PS or selection bias, the usual test statistic does not have a chi-square distribution under the null hypothesis. Instead, it has a non-central chi-square distribution, with non-centrality parameter depending on the level of the bias. Thus, the usual chi-square test tends to have inflated type I errors.
Suppose that the intended type I error rate of the chi-square test is α and let χ²_{1;1−α} represent the 100(1 − α) percentile of the chi-square distribution with one degree of freedom. Let χ²_1(Δ) represent a non-central chi-square random variable with one degree of freedom and non-centrality parameter Δ. In the case of testing null interaction, the non-centrality parameter Δ_δ is given by [...].
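The inflation of the type I error can be illustrated with a short calculation. The sketch below takes the non-centrality parameter as an input and evaluates the probability that a non-central chi-square variable with one degree of freedom exceeds the usual critical value, using the identity χ²_1(Δ) = (Z + √Δ)² for a standard normal Z. The helper `wald_noncentrality` is an assumption added for illustration: it uses the standard Wald form, squared log-bias divided by the sum of reciprocal cell counts, which is not necessarily the expression used in the paper.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def ncx2_sf_1df(x, noncentrality):
    """P(chi2_1(noncentrality) > x), via chi2_1(nc) = (Z + sqrt(nc))**2."""
    root_x, root_nc = sqrt(x), sqrt(noncentrality)
    upper = _STD_NORMAL.cdf(root_nc - root_x)   # P(Z + root_nc >  root_x)
    lower = _STD_NORMAL.cdf(-root_x - root_nc)  # P(Z + root_nc < -root_x)
    return upper + lower

def true_type_one_error(noncentrality, alpha=0.05):
    """Rejection probability of the usual 1-df chi-square test under the null."""
    critical = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0) ** 2  # chi2_{1;1-alpha}
    return ncx2_sf_1df(critical, noncentrality)

def wald_noncentrality(log_bias, cell_counts):
    """Assumed Wald form: log_bias**2 / sum(1/n) over the contingency-table cells."""
    return log_bias ** 2 / sum(1.0 / n for n in cell_counts)

# Hypothetical values: the rejection rate grows with the non-centrality parameter.
for nc in (0.0, 0.5, 1.0, 4.0):
    print(nc, true_type_one_error(nc))
```

With a zero non-centrality parameter the function returns the nominal level α, and the rejection probability increases monotonically as the non-centrality parameter grows.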

Conservative p-values
In most practical applications, one does not know the true value of the non-centrality parameter, and it is therefore difficult to calculate the true p-value of the chi-square test when PS is present and/or there is selection bias. However, we are able to develop a bound for the non-centrality parameter, and this bound may be estimable in many cases. Define Δ*_δ (Δ*_β) as Δ_δ (Δ_β) but with δ* (β*) replaced by its upper bound log U_δ^(2) (log U_β). Let χ²_δ (χ²_β) be the usual statistic for testing null interaction (main effect). Then, following Cheng, Lee and Chen [22], a conservative p-value of the chi-square test is given by P(χ²_1(Δ*_δ) > χ²_δ) (P(χ²_1(Δ*_β) > χ²_β)). By the properties of the non-central chi-square distribution, the test based on the conservative p-value always has a true type I error rate smaller than or equal to the significance level, which in turn is always smaller than or equal to the true type I error rate of the usual chi-square test. If a test has a conservative p-value less than or equal to the designated significance level, it is significant even if there is PS or selection bias.
Tables 1 and 2 show some values of the biases β* and δ* and the true type I error rates α_β and α_δ of the usual chi-square tests when the significance level is 0.05. We assumed that there are two subpopulations (K = 2), β = δ = 0, and γ = 0 or 1. The G- (H-) frequency of the first subpopulation was P(G = 1|S = 1) = 0.51 (P(H = 1|S = 1) = 0.19), the disease risk in the first subpopulation was P(D = 1|S = 1) = 0.05, the proportion of subpopulation 1 in the overall population was 0.7, and the case and control sample sizes were both n = 500. We defined LD = (LD_1, LD_2), where LD_s was the linkage disequilibrium coefficient between loci G and H in subpopulation s, and considered LD_s = 0 or 0.05. We also assumed that the sampling proportions of the cases followed SRS but those of the controls might not.
The rest of the parameter values were determined from the values of the variations G†, H†, D† and DS† given in the tables, with the assumption that subpopulation 2 has the maximal baseline G (or H) frequency odds, disease risk, and sampling deviation (this implies that P#(S = 2|D = 0) ranges from 0.0585 to 0.7163). Finally, we note that in computing the non-centrality parameters, the sample frequencies n_dgh were replaced by n × P(G = g, H = h|D = d). The simulation results for G† = 5 are given in Tables 1 and 2, and those for G† = 3 can be found in Tables S1 and S2 in Additional file 1.

Examples of true biases and type I error rates
According to the results in Table 1, the true type I error α_β ranges from 0.05 to 0.9998 under linkage equilibrium. If the SRS condition holds and γ = 0, the true type I error α_β ranges from 0.05 to 0.9602 with mean 0.4377 and standard error 0.3298. Under the same conditions but with γ = 1, the corresponding range becomes (0.05, 0.9326) with mean 0.3822 and standard error 0.2969. On the other hand, if the sampling is not SRS (DS† = 3 or 5) and γ = 0, the range of α_β is (0.05, 0.9998) with mean 0.6871 and standard error 0.317. Under non-SRS sampling with γ = 1, the corresponding range becomes (0.05, 0.9992) with mean 0.6291 and standard error 0.3117. These results indicate that the bias can be quite large and that its level may be modified by the sample selection and the level of the H-main effect. We also observe that the bias β* may be nonzero under perfect matching. For example, if matching is perfect and the H-main effect is γ = 1, the largest true type I error is 0.1064, which occurs when G† = H† = D† = 5. This is contrary to the usual belief that matching cases and controls in ethnicity can eliminate the PS bias. However, except in some special cases, the biases under the perfect matching design are smaller than those under other sampling designs.
Wang et al. [18] suggested that the bias δ* in the interaction effect is small when the linkage disequilibrium coefficient is small and the sampling is SRS. Our Table 1 also shows that under the same conditions, the true type I error α_δ in testing null interaction ranges from 0.05 to 0.0659. This agrees with their finding. However, if there is selection bias (DS† = 3 or 5), the true type I error rate α_δ has range (0.05, 0.2656), mean 0.101, and standard error 0.056 when γ = 0, and range (0.05, 0.2750), mean 0.1053, and standard error 0.0597 when γ = 1. The means and standard errors given here and later were computed from the results shown in Tables 1 and 2, and Tables S1 and S2 in Additional file 1. These results indicate that PS and SS can also cause serious bias problems in case-control studies of gene-gene interactions even when the two genes are in linkage equilibrium. In this scenario, the best way of reducing the bias is to match cases and controls in ethnicity. We note that under perfect matching and linkage equilibrium, α_δ ranges only from 0.05 to 0.0541.
Linkage disequilibrium between two genes, or correlation between genetic and environmental factors, plays an important role in determining the bias level in studies of interaction. According to the results presented in Table 2, the bias in the estimation of the genetic main effect becomes smaller when the linkage disequilibrium coefficient increases from 0 to 0.05. When γ = 0, the mean of α_β is 0.3377 under SRS and 0.5514 under non-SRS sampling (selection bias), and when γ = 1 the mean becomes 0.2716 and 0.4597 under SRS and non-SRS sampling, respectively. In contrast, the bias in the estimation of the interaction effect increases when the linkage disequilibrium coefficient increases from 0 to 0.05. Our results show that when γ = 0, the mean of α_δ is 0.1642 under SRS and 0.5512 under non-SRS sampling. When γ = 1, the mean becomes 0.1706 and 0.5555 under SRS and non-SRS sampling, respectively. Overall, the bias δ* seems to become larger as the linkage disequilibrium coefficient gets larger. Under stronger linkage disequilibrium, the true type I error α_δ can be as large as 0.1101 even if the cases and controls are perfectly matched.

An application
Shi et al. [23] studied the interaction effects of maternal smoking and maternal or fetal pharmacogenetic variants on the risk of orofacial cleft based on 1244 subjects with facial clefting from Denmark and Iowa, USA, and 4183 parents, siblings or unrelated population controls. We considered the combined Denmark and Iowa case-control data with H = 1 if the mother smoked (0 if not) and G = 1 if the GSTT1 genotype was null (0 if the genotype was not null); see Table A6 of [23]. Based on these data, we found that the estimated G × H interaction odds ratio was 3.2499 and the chi-square test had a p-value of 5.5676 × 10^-4, indicating a strong interaction effect. Also, from [24] we found that the GSTT1 genotype frequencies of the Caucasian populations were between 0.129 and 0.276, giving the variation of the genotype frequencies G† = 4.8762. The maternal smoking rate ranged between 0.101 and 0.244 (see [25][26][27]), giving the variation of exposure rates H† = 1.968. Since maternal smoking and GSTT1 appeared to be independent in the unrelated control population (the p-values of the independence test for the Denmark data and the Iowa data were 0.0942 and 0.0976, respectively), our upper bound for the bias exp(δ*) (see equation 2) equals 1.6149, leading to a conservative p-value of 2.0353 × 10^-2. This suggests that the effect of maternal smoking on the cleft risk can be modified by the GSTT1 genotype even if population stratification and selection bias are both present in the study.
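The conservative p-value in this example can be approximated from the reported summary figures alone. The sketch below is illustrative and rests on two assumptions that are not stated in the text: the usual statistic is treated as a one-degree-of-freedom Wald statistic, so that its value and the standard error of the log odds ratio can be backed out from the reported p-value and interaction odds ratio, and the bound on the non-centrality parameter is taken as (log U / SE)². The function names are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def ncx2_sf_1df(x, noncentrality):
    """P(chi2_1(noncentrality) > x), via chi2_1(nc) = (Z + sqrt(nc))**2."""
    root_x, root_nc = sqrt(x), sqrt(noncentrality)
    return _STD_NORMAL.cdf(root_nc - root_x) + _STD_NORMAL.cdf(-root_x - root_nc)

def conservative_p_value(chi2_obs, log_bias_bound, se_log_or):
    """Tail probability of the observed statistic under the bounded non-centrality."""
    nc_bound = (log_bias_bound / se_log_or) ** 2  # assumed Wald form
    return ncx2_sf_1df(chi2_obs, nc_bound)

# Figures reported in the text for the maternal smoking x GSTT1 example.
interaction_or = 3.2499
usual_p = 5.5676e-4
bias_bound = 1.6149

# Assumption: back out the 1-df statistic and the standard error from the p-value.
chi2_obs = _STD_NORMAL.inv_cdf(1.0 - usual_p / 2.0) ** 2
se_log_or = log(interaction_or) / sqrt(chi2_obs)

p_conservative = conservative_p_value(chi2_obs, log(bias_bound), se_log_or)
print(p_conservative)
```

Under these assumptions the result is close to 0.02, in line with the reported conservative p-value of 2.0353 × 10^-2; small differences are to be expected because the exact statistic and cell counts are not used here.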

Discussion
The impact of population stratification is considered by many to be important in case-control studies of gene-disease association. Many authors have suggested quantitative methods to control the type I errors of the usual association test. The most popular treatments include the "genomic control" method [28][29][30][31][32][33] and the "structured association" method [34][35][36][37]. Each of the proposed methods requires typing extra polymorphic markers to generate an estimate of PS that can be used to adjust the test statistic. The impact of PS in case-control studies of gene-gene (environment) interaction is considered to be less important when the genes under study are in linkage equilibrium or when the gene-environment correlation is weak [18,38]. However, this conclusion holds only when the sampling of the case and control data follows an SRS design, that is, when there is no selection bias. Unfortunately, there is no formal method for testing the validity of the SRS condition when PS is present.
In practical applications, selection bias is not unusual. For example, the SRS condition may fail when hospital-based cases (controls) are used in the study and are not representative of the population-based cases (controls), when there is substantial non-response among the cases and/or controls, or when there is self-selection. In this paper, we show that under slight selection bias (DS† = 3), the bias in the estimation of the main or interaction effect may become unacceptable. Our suggestion is that the bias should be treated seriously, even when the genetic factors are in linkage equilibrium or the genetic and environmental factors are uncorrelated. Large correlation or strong linkage disequilibrium could make the bias even larger. Also, small variation in disease risk cannot guarantee small bias unless the selection bias is also small. In applications, it is important to be able to measure the impact of the bias. In this paper, we derive some bounds for the bias. If these bounds are estimable, then they can be used to make conservative inference. We show in one real example that a conservative p-value for testing null interaction can be computed and a conclusion of significance can be reached even if there is bias. Genotype frequencies of SNPs and their LDs are readily available from the International HapMap Project. Further, disease prevalence is available from many nations or from the World Health Organization, for example. This information allows us to easily compute bounds and then conservative p-values.
We note that matching cases and controls in ethnicity has been suggested by epidemiologists as an effective method to control the PS bias in case-control gene-disease association studies. However, in a more complicated risk model such as the one discussed here, the bias β* (see equation 1) in the genetic main effect also depends on the effect sizes of other risk factors. We found that if γ = δ = 0, then the residual bias after matching is small. However, if γ = 1 and δ = 0, the residual bias after matching is still quite substantial. A sufficient condition to ensure that the bias β* = 0 under perfect matching is γ = δ = 0. Tables 1 and 2 also show that matching cannot remove the bias in the estimation of the interaction effect.
Since the presence of PS and selection bias may cause unacceptable bias in the usual interaction analysis, it is important to have an efficient method to control the bias. Unfortunately, no effective method exists so far. The major difficulty is that the level of the bias depends on the effect sizes of other related factors, which are in general unknown or not estimable under PS. However, in some special cases, for example, when the genetic main effects are null (or weak) and testing gene-gene interaction is the main focus, one may follow the idea of genomic control: type extra pairs of null markers and apply the computed interaction levels to control the bias. In principle, if the candidate markers are in linkage equilibrium, the selected pairs of null markers also need to be in linkage equilibrium so that the important characteristics of the bias can be captured. On the other hand, if the candidate markers are in linkage disequilibrium, the paired null markers also need to be correlated. We are currently working to solve this important problem. Another approach for reducing bias is to match the cases and controls in ethnicity. According to our simulations, under perfect matching and weak linkage disequilibrium, the bias in the estimation of the interaction effect is small. However, more study is needed to understand the impact of the residual bias when the matching is not perfect.

Conclusions
In this paper, the biases in the estimation of the genetic main and interaction effects are quantified and their bounds are derived. We find that if there is an environmental effect or interaction, the bias in the genetic main effect cannot be ignored even if cases and controls are matched in ethnicity. The bias in the estimation of the interaction effect has the same problem. The estimated bound can be used to compute a conservative p-value for the association test. The computation of the conservative p-value does not require knowledge of the number of subpopulations involved in the study or of the subpopulation membership of each study subject. In real applications, it is usually not clear whether there is PS, selection bias, or both. However, if appropriate information such as the variation of genotype frequencies is known, we can always compute the conservative p-value. If the conservative p-value is smaller than the designated significance level, we can safely claim that the test is significant regardless of the presence of PS or non-SRS sampling.

Methods
Following the usual Bayesian argument, the disease-risk model implies that On the other hand, the joint frequency distribution of G and H in the control population is given by