Retrospective analysis of main and interaction effects in genetic association studies of human complex traits

Background The etiology of multifactorial human diseases involves complex interactions between numerous environmental factors and alleles of many genes. Efficient statistical tools are demanded in identifying the genetic and environmental variants that affect the risk of disease development. This paper introduces a retrospective polytomous logistic regression model to measure both the main and interaction effects in genetic association studies of human discrete and continuous complex traits. In this model, combinations of genotypes at two interacting loci or of environmental exposure and genotypes at one locus are treated as nominal outcomes of which the proportions are modeled as a function of the disease trait assigning both main and interaction effects and with no assumption of normality in the trait distribution. Performance of our method in detecting interaction effect is compared with that of the case-only model. Results Results from our simulation study indicate that our retrospective model exhibits high power in capturing even relatively small effect with reasonable sample sizes. Application of our method to data from an association study on the catalase -262C/T promoter polymorphism and aging phenotypes detected significant main and interaction effects for age-group and allele T on individual's cognitive functioning and produced consistent results in estimating the interaction effect as compared with the popular case-only model. Conclusion The retrospective polytomous logistic regression model can be used as a convenient tool for assessing both main and interaction effects in genetic association studies of human multifactorial diseases involving genetic and non-genetic factors as well as categorical or continuous traits.


Background
The changing disease pattern has brought complex diseases as one of the significant challenges for the 21 st cen-tury medicine. As the etiology of complex diseases involves both multiple genetic and environmental factors combined with their interactions, statistical methods for efficiently measuring the main and interaction effects are demanded. In the literature of genetic epidemiology, the linkage disequilibrium (LD) based genetic association study, advantaged by the recent development of highthroughput SNP genotyping technology, has been the workhorse and holds the promise of mapping out susceptibility genes to complex diseases [1]. The case-control design, a retrospective design by nature, has been popular in establishing the genetic associations in single locus and haplotype analyses [2] as well as in assessing gene-environment interactions [3,4]. Recently, this approach has been extended to handle both dichotomous and continuous traits by introducing the retrospective logistic regression model [5] that treats alleles or genotypes as dependent variables. For example, the idea has been used by Waldman et al. [6] to model the probability of allele transmission as a function of offspring's trait value in family-based transmission disequilibrium test (TDT). The same idea has been used for single locus analysis in both unmatched and matched case-control studies [7] and for haplotype analysis [8].
Another contribution to genetic epidemiology by the retrospective case-control design is the introduction of nontraditional case-only design [9] for assessing geneenvironment and gene-gene interactions. Measuring the interaction effects in complex disease study is important because many of the susceptibility genes act through modification of disease risk associated with other genes or environmental factors. Unfortunately, application of the case-only method is restricted to dichotomous or binary traits. Otherwise one needs to set up a cut-off on a continuous trait to define cases [10]. Although other statistical models for measuring interactions [11][12][13] exist, Glaser et al. [14] recently reported that different methods can give different results for the same data due to underlying assumptions.
In this paper, we introduce a retrospective polytomous logistic regression model to measure both the main and the interaction effects in genetic association studies of human complex traits which can be discrete or continuous. In this model, combinations of genotypes at two interacting loci or of environmental exposure and genotypes at one locus are treated as nominal outcomes of which the proportions are modeled as a function of the disease trait assigning both main and interaction effects. The performance of our method in detecting interaction effect is compared with that of the case-only model. A limited simulation study is performed to assess the power and type I error rate for given parameters settings. Application of our method is exemplified using data from our association study on catalase -262C/T promoter polymorphism and aging phenotypes [15] to look for both main and interaction effects of genetic variation and aging in affecting cognitive function in the Danish population.

Methods
Suppose we are interested in assessing the main and interaction effects of one genetic variant G (allele or genotype, G = 1 for carriers and 0 for non-carriers) and environmental exposure E (E = 0 for non-exposed and 1 for exposed). The combination of G and E leads to four nominal categories. The purpose of our retrospective polytomous logistic regression model is to model the proportion of each of the categories, p, as a function of the disease trait by treating the proportions as responses for given trait value x, i.e. we model where a and b are the intercept and slope parameters assigned to G and E respectively. In (1), the association of G and E with trait x are measured by the slope parameters with b G and b E for the main effects from G and E and b G×E for their interaction effect. Rewriting (1) as the conditional probability of each outcome category given the trait value x, we have When I = J = 0, the numerator of (2) becomes 1 so that we have Since (3) is for the group of individuals who are neither carriers of the genetic variant nor exposed, it can serve as the reference or baseline. With that, we are able to derive the relative risks (RR) for the main and interaction effects at a given trait value x and then define the relative risk ratios (RRR) for comparing RR at two given trait values x 1 and x 2 . To obtain RR for the main genetic effect, we set I = 1 and J = 0 so that (2) becomes The RR for the main genetic effect is then calculated as Based on (5), we can calculate RRR for comparing RRs at two given trait values x 1 and x 2 as Note that, when k = 1 such as in a case-control study, we have RRR G = exp(b G ). In the same manner, we obtain RRR E = exp(kb E ).
In order to estimate the risk of interaction effect, we set I = J = 1 so that (2) becomes The RR for comparing exposed carriers to unexposed noncarriers at x is From (9) we obtain RRR G×E as the departure from the multiplicative effects of RRR G RRR E , i.e.RRR G×E = exp(kb G×E ).
In order to estimate the parameters, we construct the following likelihood function [16] Here, n denotes individual observations from 1 to N, is an indicator function.
Since our interest is only in the slope parameters, the two intercept parameters are just nuisance parameters. To obtain an overall significance of both main and interaction effects, we use the log-likelihood ratio test with df = 3, where L(a G ,a E ) is the likelihood of the intercept only model. Statistical test on single slope parameters can be done likewise or by introducing the Wald statistic [16].

Simulation
In order to examine the performance of our method, we perform a limited computer simulation study to assess the power and type I error rate (α) when given different parameter settings. The data were simulated using a linear Here y i is a continuous phenotype value for individual i. G and E are the indictors for the genetic (1 for carriers and 0 for non-carriers, we set frequency of carriers to 0.2) and environmental (1 for exposed and 0 for nonexposed, exposure rate set to 0.3) variants. β 0 is the intercept which we set to 0.1. β 1 , β 2 and β 3 are the slope parameters for the genetic, environmental and their interaction effects respectively. For simplicity, we assume that there is no main effect in the model, but there is an interaction effect that lowers the phenotype value for carriers for the genetic variant and who are exposed to the environment. The power for capturing the interaction effect is estimated as the frequency for rejecting a null hypothesis of β 3 = 0 for a given type I error rate (we set α = 0.05). The last term e i is the error part for individual i which follows a standard normal distribution N(0,1). To assess the model performance, we specify different sample sizes and assign different values for β 3 . We set β 3 to -0.65, -0.95 and -1.2 so that the interaction effect accounts for about 5, 10 and 15 percent of the total variance in the data.
In Table 1, we show the power estimates using 500 replicates for given type I error rate of α = 0.05. It can be seen that for an interaction effect that explains only 5 percent  of the total variance (β 3 = -0.65), a sample size of above 400 is needed in order to achieve reasonable power. For an effect responsible for 15 percent of the overall variance (β 3 = -1.2), a sample size of 150 will give acceptable power (about 0.8). By setting β 3 to zero, we further our simulation study to assess the type I error rate for a given nominal α = 0.05 using again 500 replicates. The estimated empirical type I error rate is shown in the right most column in Table 1. It can be seen that, although there is a slight fluctuation, the estimates of empirical type I error rate are centered at the nominal α of 0.05. Overall, results from our simulation study indicate that our retrospective model exhibits high power in capturing even relatively small interaction effect with reasonable sample sizes.

Results
The effect of catalase -262C/T promoter polymorphism on human aging phenotypes (cognitive and physical functioning) has been investigated by Christiansen et al. [15]. In this study, a modest protective effect of the T allele on cognitive and physical function was observed although a statistical significance was not reached. Here we apply our retrospective logistic regression model to the data to look for both main and interaction effects on individual's cognitive score (a continuous trait measuring fluency, forand backward digit span and a modified 12-word learning test) by the genetic variant (T allele carrier = 1, non-carrier = 0) and age-group (equal or above age 65 = 1, below age 65 = 0) in males (N = 789). A combination of the two variants forms four nominal response categories among which non-carriers below age 65 serve as the reference group. In Table 2, we show the parameter estimates for the main and interaction effects by our logistic regression model. The model identified a highly significant effect of age-group that is negatively correlated with individual's cognitive function (RRR = 0.630, p-value = 0.001). Moreover, we found a modest main effect of the T allele (RRR = 0.948, p-value = 0.037) and a modest interaction effect between the T allele and age-group (RRR = 1.083, p-value = 0.033). It is interesting to see that, although the overall effect of allele T reduces carrier's cognitive score, the interaction effect indicates that the effect of the allele is agedependent which means that the T allele conveys beneficial effect that improves carries' cognitive performances at old ages.
By dichotomizing the cognitive score it is possible to apply the case-only model to assess the interaction effect of allele T and aging on cognitive functioning. To do that, we selected all individuals with cognitive score above 4 (about 24% of the top scores) and defined them as cases (186 individuals). The case-only model gave an odds ratio of 2.386 with a p-value of 0.008 indicating that allele T significantly enhances carrier's cognitive function at old ages. For comparison, we applied our retrospective logistic regression model to the dichotomized cognitive score. Parameter estimates in Table 2 also reveal the negative association with cognitive functioning by aging (RRR = 0.060, p = 0.000) and allele T (RRR = 0.663, p = 0.041). Meanwhile our model also reports a highly significant interaction effect even with exactly the same estimate of the risk parameter (RRR = 2.386, p = 0.009) as from the case-only model (OR = 2.386, p = 0.008) meaning that our retrospective logistic regression model yields valid estimate of the interaction effect. Consistent estimates on the interaction effect by the case-only and our models were also obtained when varying the cut-off for dichotomizing the cognitive score. This is understandable since the case-only model measures the deviation from the multiplication of main effects [9] which is exactly the definition of interaction effect in our model. However, since the maximum likelihood from the dichotomized trait is lower than the continuous trait (Table 2), the model using cognitive score as a continuous trait should be preferred.

Discussion
The etiology of multifactorial human disease involves complex interactions between numerous environmental factors and alleles of many genes. Efficient statistical tools are demanded for identification of the genetic and environmental variants that affect the risk of disease development. Through example application, we have shown that our retrospective polytomous logistic regression model can capture both main and interaction effects and produce consistent results in estimating the interaction effect as compared with the popular case-only model. The distinct feature in our model is that the disease trait is treated as an independent variable so that our model is capable of accommodating both categorical and continuous traits. Different from the existing models [12,13], no assumption of normal distribution of the trait value is needed in our method. Furthermore, genotype or allele effects can be easily estimated by coding 1 and 0 to carriers and noncarriers to assess dominant or recessive effects.
Since our relative risk ratio is estimated from a retrospective model, it is necessary to study its connection with the relative risk parameter in a general prospective model. In additional file-1, we derive the relationship between the risk parameters in the retrospective model and that in the prospective model when studying a binary disease trait. It is shown that, when the disease is rare, the relative risks in a prospective model can be approximated by the relative risk ratios estimated from our model. This is important because, as long as the disease incidence is low in the population, our model estimates the risk parameters that can be interpreted in terms of trait penetrance as in a prospective model. As shown by equation (6), testing the null hypothesis of b = 0 is equivalent to testing H o : RRR = 1. This is also shown by the 95% confidence intervals for the estimated RRRs in Table 2. Since the slope parameters for the main and interaction effects are all statistically different from zero, none of the 95% confidence intervals of RRR covers the null risk of one.
It is necessary to point out that, as in any interaction model, it is critical that the interacting variants be independent. By independent we mean that the interacting variants are not correlated or in association. This is especially relevant in studying gene by gene interactions. It is important to make sure that the two loci under testing are not in LD if they reside on the same chromosome. In case of LD between the two genetic variants, a haplotype-based analysis is more appropriate [8]. Our experience showed that violation of independence can result in unreliable estimates on the risk parameters. Independence between interacting variables is also required by the case-only model to ensure reliable estimates [17]. For case-control studies, if the main interest is interaction effect, the caseonly model should be preferred because it is more effi-cient than the traditional case-control model [18]. Note also that our model is limited to discrete exposure variables when applied to gene-environment interaction although it is no longer a problem for measuring gene by gene interactions because all genotypes are discrete.
Finally, although our model is proposed for genetic association studies (gene by gene or gene by environment), the same model can be applied to study the main and interaction effects of non-genetic variants. Perhaps the biggest advantage of our approach is that it can easily be implemented by using any programming statistical package to fit the multinomial logistic regression model. Considering all these advantages, we hope that our proposed method can be of use for epidemiologists who are interested in studying multifactorial or complex human diseases.

Conclusion
Our proposed retrospective polytomous logistic regression model can be used as a convenient tool for assessing both main and interaction effects in genetic association studies of human complex diseases involving both genetic and non-genetic factors.