 Methodology article
 Open Access
 Published:
A nonparametric approach for detecting genegene interactions associated with ageatonset outcomes
BMC Geneticsvolume 15, Article number: 79 (2014)
Abstract
Background
Coxregressionbased methods have been commonly used for the analyses of survival outcomes, such as ageatdiseaseonset. These methods generally assume the hazard functions are proportional among various risk groups. However, such an assumption may not be valid in genetic association studies, especially when complex interactions are involved. In addition, genetic association studies commonly adopt casecontrol designs. Direct use of Cox regression to casecontrol data may yield biased estimators and incorrect statistical inference.
Results
We propose a nonparametric approach, the weighted NelsonAalen (WNA) approach, for detecting genetic variants that are associated with agedependent outcomes. The proposed approach can be directly applied to prospective cohort studies, and can be easily extended for populationbased casecontrol studies. Moreover, it does not rely on any assumptions of the disease inheritance models, and is able to capture highorder genegene interactions. Through simulations, we show the proposed approach outperforms Coxregressionbased methods in various scenarios. We also conduct an empirical study of progression of nicotine dependence by applying the WNA approach to three independent datasets from the Study of Addiction: Genetics and Environment. In the initial dataset, two SNPs, rs6570989 and rs2930357, located in genes GRIK2 and CSMD1, are found to be significantly associated with the progression of nicotine dependence (ND). The joint association is further replicated in two independent datasets. Further analysis suggests that these two genes may interact and be associated with the progression of ND.
Conclusions
As demonstrated by the simulation studies and real data analysis, the proposed approach provides an efficient tool for detecting genetic interactions associated with ageatonset outcomes.
Background
For most common complex diseases, if not all, the currently identified genetic loci only explain a small percentage of the disease heritability [1, 2]. The search for genetic variants underlying complex diseases remains to be a major goal and challenge for the coming decades. While genetic variants, such as rare variants and structure variation, may contribute to the remaining heritability, part of the missing heritability could be explained by the interplay of genetic variants through complicated mechanisms [3, 4]. The study of genegene interaction may reveal novel findings explaining missing heritability, and shed light on the underlying etiological and pathophysiological processes that result in complex diseases. Although substantial efforts are given to identification of genegene (GG) interactions related to binary disease outcomes, less attention has been given to GG interaction research on other clinical features, such as ages at disease onset, which may more closely reflect the dynamic process of disease development [5]. A number of studies show that the intensity of natural selection on a gene declines with the age at which it is expressed [6–8]. In addition, practical evidence suggests that many complex diseases, such as Alzheimer’s diseases and bipolar diseases, have separate pools of genetic variants that contribute to the earlyonset and lateonset cases, indicating distinctive biological pathways involved in the disease development process [9, 10]. Though GG interactions are ubiquitous in various biological pathways related with the disease development process [3], identification of these GG interactions presents continuing challenges.
Coxregression is a powerful tool for the genetic analysis of ageatdiseaseonset outcomes. It has been used in genetic association studies for detecting single genetic variant associated with disease progression [11–13]. To better characterize the association in a candidate gene/region, Coxregression has also been extended to handle multiple loci via haplotype analysis [14–16]. However, Coxregression is less suitable for the analysis of a large number of genetic variants and possible GG interactions, because of the rapidly increasing number of parameters. In addition, Cox regression assumes the hazard rates are proportional among various risk groups. Such an assumption may be questionable in genetic studies, especially when complex interactions are involved. Further, most of the current genetic association studies adopt casecontrol designs, by which cases are usually oversampled from the source population. Applying Cox regression to casecontrol studies may raise another issue of biased estimation of hazard ratios [17, 18]. To address this issue, Nan et al. proposed to analyze ageatdiseaseonset for casecontrol studies using a modified casecohort approach (MCC) [19]. It was shown that this approach had very small bias when the disease prevalence was low. However, the bias would increase with the disease prevalence, which may limit its application to studies of common diseases.
As a nonparametric alternative to Cox regression, NelsonAalen estimator has been widely used to analyze survival outcomes. It was first introduced by Nelson and was later on rediscovered by Aalen, who derived the estimator using the modern counting process techniques [20, 21]. It was shown to have a number of nice properties, such as requiring less assumptions and better smallsamplesize performance than other standard approaches [22]. Considering these advantages, we propose a weighted NelsonAalen (WNA) approach for detecting genetic variants associated with ageatdiseaseonset, considering possible interactions. The proposed approach searches for the best combination of loci forwardly, and tests their joint association with the ageatdiseaseonset. Our approach has the following advantages. First, it is a multilocus approach that is applicable to a set of genetic variants (i.e. from a number of candidate genes or a genetic pathway) with the consideration of highorder interaction. Second, it is a nonparametric approach, which does not make any assumption of the disease models. Third, it can be directly applied to prospective cohort studies, and can be easily extended for populationbased casecontrol studies. Through simulations, we compare the performance of the proposed approach with Coxregressionbased approaches under both perspective cohort and casecontrol study designs. We further illustrate our approach with a real data application to smokers’ progression to nicotine dependence (ND), using three independent datasets from the Study of Addiction: Genetics and Environment (SAGE).
Methods
Suppose we have a study population of n subjects, which is a subcohort from the source population. We denote T_{ i }, δ_{ i } and X_{ i } as the observed age, the disease status and the genetic markers for subject i, respectively. Let ${\mathit{T}}_{\mathit{i}}=min\left({\mathit{T}}_{\mathit{i}}^{\mathit{O}},{\mathit{T}}_{\mathit{i}}^{\mathit{S}}\right)$, where ${\mathit{T}}_{\mathit{i}}^{\mathit{O}}$ and ${\mathit{T}}_{\mathit{i}}^{\mathit{S}}$ are the ageatdiseaseonset and ageatsurvey for subject i. Then ${\mathit{T}}_{\mathit{i}}^{\mathit{O}}$ is either observed or rightcensored, and we assume the cause of censoring is independent with either ageatdiseaseonset or any genetic variants. Further denote ${\mathrm{\delta}}_{\mathit{i}}=\mathit{I}\left({\mathit{T}}_{\mathit{i}}^{\mathit{O}}\le {\mathit{T}}_{\mathit{i}}^{\mathit{S}}\right)$, where I(.) is an indicator function. Without loss of generality, we assume K genetic markers, X_{i} = (X_{i1}, X_{i2}......, X_{ iK }), are single nucleotide polymorphisms (SNP) with three possible genotypes, X_{ ij } ∈ {AA, Aa, aa}, 1 ≤ j ≤ K. Our hypothesis is that the K SNPs and their possible interactions may be associated with the ageatdiseaseonset outcome. In the following, we introduce the WNA approach for perspective cohort studies, and then extend it for populationbased casecontrol studies.
Weighted nelsonAalen for prospective cohort studies
a. Association testing of multiple genotype groups with ageatdiseaseonset
Assume k diseasesusceptibility SNPs are associated with ages of disease onset, by forming L GG groups, G_{1}, G_{2},......,G_{ L }, each representing a different hazard rate for the disease onset. Given these GG groups, we can partition all subjects into L groups, S_{1}, S_{2} ......,S_{ L }, where S_{1} = {iX_{ i } = G_{ l }}, 1 ≤ l ≤ L. Suppose the onsets of a disease are observed at M distinct ages, t_{1} < t_{2} < ...... < t_{ M }. Let Y (t) be the number of subjects who are at risk at age t, and N(t) be the number of subjects who have diseaseonset by age t. The cumulative hazard function H and survival function S are estimated as,
In a similar manner, we can also define the groupspecific cumulative hazard functions H_{ l } (t), 1 ≤ l ≤ L, based on the subjects within group S_{ l }. To examine the joint association of k SNPs with the ageatdiseaseonset, we test the following hypothesis: H_{0} : H_{1}(t) = H_{2}(t)=...... = H_{ L }(t) for all t ≤ τ versus H_{ A } : at least one H_{ l }(t) is different for some t ≤ τ.
Here τ is the largest observed age of onset in the study.
Let Y_{ l } (t) be the number of subjects in group S_{ l } who are at risk at age t; N_{ l } (t) be the number of subjects in group S_{ l } who have the diseaseonset by age t. The test statistic can be formed based on Z = (Z_{1}, Z_{2},...... Z_{ L }), where
In Equation (2), W (t) is a weight function for the ages at disease onset. A variety of weight functions have been proposed in the literature. For example, W (t) = 1 for any t, would lead to the widely used logrank test [23]; W (t) = Y(t) would lead to MannWhitneyWilcoxon test and the generalization of KruskalWallis test [24, 25]. In our study, we suggest using the weight function,
which has a form similar to the KaplanMeier estimator in the entire study population, and gives the most weight to early disease onset. This weight function was first proposed by Peto et al.[26], and was also suggested in a series of articles [27, 28].
The variance of Z_{ l } and the covariance between Z_{ l } and Z_{ l }′ can then be calculated as,
The components of Z = (Z_{1}, Z_{2},......, Z_{ L }) are linearly dependent because $\sum _{\mathit{l}=1}^{\mathit{L}}{\mathit{Z}}_{\mathit{l}}=0$. Therefore, the test statistics can be calculated based on any L  1 components of Z, such as Z_{1}, Z_{2},......, Z_{L1}. The test statistic can be formed as,
where Σ is the variancecovariance matrix for (Z_{1}, Z_{2},......, Z_{L1}). Under the null hypothesis of no association, the above test statistic has asymptotically a Chisquare distribution with L  1 degrees of freedom. The theoretical details of the test can be found elsewhere [29].
b. Selection of multiSNP combinations by recursive partitioning
In genetic association studies with a set of SNPs, it is expected that only a small subset of SNPs are associated with the disease. To determine the diseasesusceptibility SNPs and the associated GG groups, we adopt a recursive partitioning algorithm. The algorithm starts with a null model by treating all study samples as one group. In each of the following steps, it gradually selects diseasesusceptibility markers, and then divides samples into different GG groups. In step one, we search among all available SNPs for a single SNP that can best divide samples into two groups. Assuming the minor allele a is the risk allele of jth SNP, we consider three partitioning strategies:

1)
Dominant effect:
$$\left\{{\mathit{S}}_{1}^{\mathit{j}}=\left\{\mathit{i}{\mathit{X}}_{\mathit{i},\mathit{j}}=\mathrm{AA}\right\},{\mathit{S}}_{2}^{\mathit{j}}=\left\{\mathit{i}{\mathit{X}}_{\mathit{i},\mathit{j}}=\mathrm{Aa}\phantom{\rule{0.12em}{0ex}}\mathrm{or}\phantom{\rule{0.12em}{0ex}}\mathrm{aa}\right\}\right\}$$;

2)
Recessive effect:
$$\left\{{\mathit{S}}_{1}^{\mathit{j}}=\left\{\mathit{i}{\mathit{X}}_{\mathit{i},\mathit{j}}=\mathrm{aa}\right\},{\mathit{S}}_{2}^{\mathit{j}}=\left\{\mathit{i}{\mathit{X}}_{\mathit{i},\mathit{j}}=\mathrm{AA}\phantom{\rule{0.12em}{0ex}}\mathrm{or}\phantom{\rule{0.12em}{0ex}}\mathrm{Aa}\right\}\right\}$$;

3)
Heterozygote effect:
$$\left\{{\mathit{S}}_{1}^{\mathit{j}}=\left\{\mathit{i}{\mathit{X}}_{\mathit{i},\mathit{j}}=\mathrm{Aa}\right\},{\mathit{S}}_{2}^{\mathit{j}}=\left\{\mathit{i}{\mathit{X}}_{\mathit{i},\mathit{j}}=\mathrm{AA}\phantom{\rule{0.12em}{0ex}}\mathrm{or}\phantom{\rule{0.12em}{0ex}}\mathrm{aa}\right\}\right\}$$.
For each partitioning strategy, we calculate Δ_{WNA} and its corresponding pvalue. We repeat the partitioning process for all SNPs, and choose the most significant group partitioning for step one, denoted it as $\left\{{\mathit{S}}_{1}^{\left(1\right)},{\mathit{S}}_{2}^{\left(1\right)}\right\}$. In step two, a second SNP j′ is considered to further partition the existing two groups into four GG groups, denoted as $\left\{{\mathit{S}}_{1}^{\left(2\right)}={\mathit{S}}_{1}^{\left(1\right)}\cap {\mathit{S}}_{1}^{{\mathit{j}}^{\prime}},{\mathit{S}}_{2}^{\left(2\right)}={\mathit{S}}_{1}^{\left(1\right)}\cap {\mathit{S}}_{2}^{{\mathit{j}}^{\prime}},{\mathit{S}}_{3}^{\left(2\right)}={\mathit{S}}_{2}^{\left(1\right)}\cap {\mathit{S}}_{1}^{{\mathit{j}}^{\prime}},{\mathit{S}}_{4}^{\left(2\right)}={\mathit{S}}_{2}^{\left(1\right)}\cap {\mathit{S}}_{2}^{{\mathit{j}}^{\prime}}\right\}$. Again, the group partitioning that most significantly related to the ageatdiseaseonset is selected in the step two. In a similar fashion, the diseasesusceptibility SNPs can be selected forwardly into the model to partition samples into different GG groups. Tenfold crossvalidation (CV) is then used to determine the most parsimonious model with an optimal number of GG groups. In this procedure, all the subjects are randomly divided into 10 subsets. Then 9 of the 10 subsets are used as the training set, while the remaining one is used as the testing set. The process is repeated 10 times to make sure all subsets have served as a testing set. In each testing set, a test statistic is calculated based on the GG groups selected from the corresponding training set. The final model with an optimal number of GG groups is the chosen to be the one that attains the highest significance level of the averaged testing statistic from 10 testing sets. After the final model is determined, an overall test statistic based on entire samples, Δ_{WNA}, is obtained, and is used to evaluate the association of the selected GG groups with the ageat diseaseonset. In order to account for the inflated Type I error due to selection of GG groups, a permutation test is used to assess the significance level. In the permutation process, the outcomes of individuals (i.e. the ageofonset and censoring status outcomes) in each permutation replicate are simultaneously permuted. The forward selection algorithm is then applied to the permuted data to choose the best GG group and calculate Δ_{WNA}. By repeating the same process on a large number of permutation replicates (e.g. 1,000), the empirical null distribution of Δ_{WNA} is generated and an empirical pvalue can be obtained. In a replication study where the GG group is predetermined, the asymptotic test based on the Chisquare distribution can be used.
Modification for populationbased casecontrol studies
Most of existing and ongoing genetic association studies adopt casecontrol design, where controls are not matched to the cases by age. To facilitate the survival analysis of genetic data from casecontrol studies, we also propose a modified WNA for casecontrol data. Suppose the study includes n subjects with n_{1} cases and n_{0} controls (n = n_{1} + n_{0}). Assuming the disease has a prevalence of ρ in the general population, we modify the hazard function by adjusting the number of subjects at risk, Y (t):
Correspondingly, we modify the groupwise hazard function by adjusting the number of subjects in group S_{ l },
where f_{ l } denotes the frequency of genotype G_{ l } among controls, and ${\sum}_{\mathit{l}=1}^{\mathit{L}}{\mathit{f}}_{l}}=1$. With this adjustment, we expect to retrieve a pseudocohort population with a number of unobserved controls, who are expected to be at risk throughout the study.
Results
Simulation studies
In the simulation studies, we evaluated the performance of the proposed WNA approach and compared it with those of Coxregressionbased approaches. Two series of simulations were conducted for perspective cohort studies and casecontrol studies, respectively. In the simulations, we assumed a subject’s ageatsurvey, ${\mathit{T}}_{\mathit{i}}^{\mathit{S}}$, followed a normal distribution, N(60,10^{2}), and its ageatdiseaseonset, ${\mathit{T}}_{\mathit{i}}^{\mathit{O}}$, might follow various disease models described below. Each subject had an observed age ${\mathit{T}}_{\mathit{i}}=min\left({\mathit{T}}_{\mathit{i}}^{\mathit{O}},{\mathit{T}}_{\mathit{i}}^{\mathit{S}}\right)$, and a censoring status, δ_{ i }, determined by ${\mathit{\delta}}_{\mathit{i}}=\mathit{I}\left({\mathit{T}}_{\mathit{i}}^{\mathit{O}}\le {\mathit{T}}_{\mathit{i}}^{\mathit{S}}\right)$. Two causal SNPs with an interaction effect were simulated for each disease scenario. We also assumed each SNP had two alleles, A and a, and the minor allele a had a frequency of 0.3 leading to an early onset of the disease. In addition, eight noise SNPs were also simulated with minor allele frequencies sampled from a uniform distribution, Unif[0.1, 0.5]. For each disease model, we simulated one million subjects as the source population, and the disease prevalence was calculated by $\mathit{\rho}={\displaystyle \sum _{\mathit{i}}{\mathit{\delta}}_{\mathit{i}}}/{10}^{6}$. The parameters were chosen to ensure the disease prevalence was within the range of [0.15, 0.25] in the source population. For prospective cohort studies, 1,000 subjects were selected as the study population, while 500 cases (δ_{ i } = 1) and 500 controls (δ_{ i } = 0) were selected for casecontrol studies.
Disease model 1
The underlying disease model was simulated to mimic an ideal scenario where the proportional hazard (PH) assumption was satisfied and the hazard ratio was linear. In the model, we assumed the hazard function h_{ i }(t) for an individual i had a semiparametric form,
where
and
In the simulation, we specified the baseline hazard h_{0}(t) by a Weibull distribution with a shape parameter θ = 3 and a scale parameter λ = 100.
Disease model 2
The underlying disease model was simulated to mimic a scenario where the PH assumption was met, but the hazard ratio was nonlinear. For this model, we assumed the hazard function for an individual i had the form:
In such a model, we assumed individuals with risk alleles at both loci had a high risk of disease than the remaining individuals.
Disease model 3
This disease model mimicked a scenario where the PH assumption was violated. In this model, we assumed the ageatdiseaseonset for a subject with a GG combination of (x_{i1}, x_{i2}) followed a Weibull distribution W(λ,θ_{xi1}, _{xi12}), where the scale parameter λ was fixed at 100 and the shape parameter varied by the GG combinations of two causal SNPs. In such a model, we assumed the hazard functions increased over time for all GG combinations (i.e.,θ > 1).
Disease model 4
This disease model also assumed the PH assumption was violated. Different from Disease Model 3, we assumed the hazard functions may remained constant or decreased overtime for certain GG combinations (i.e. θ ≤ 1), mimicking an early onset disease scenario.
Simulation results
Simulations were conducted to compare WNA with the conventional Cox regression (COX) approach under the perspective cohort studies. Additional simulations were also perform to evaluate WNA with the modified case cohort (MCC) approach proposed by Nan et al (Nan and Lin, 2008) under casecontrol studies. To be consistent with WNA, we also adopted a forward selection strategy for the COX and MCC, and used the Akaike information criterion (AIC) as the criteria for model selection. For each disease model, we started the analysis with two causal SNPs, and gradually added noise SNPs into the analysis. The simulation was repeated for 1,000 times for each disease model. In each replicate, a final model was selected by each approach, and was then evaluated on an independent subcohort of 1,000 subjects from the source population. Because an independent dataset was used for validation in each replicate, the asymptotic test based on the Chisquare distribution was used. The Type I error and power were thus calculated as the probability for the selected final model to have a pvalue less than 0.05 on the independent subcohort of 1,000 subjects.
We first evaluated the Type I errors for both approaches. In this case, we evaluated a null model with ten SNPs, simulated independently from the ageatdiseaseonset outcome. As shown in Table 1, the Type I errors were well controlled for COX/MCC approach, and reasonably controlled for WNA. We then evaluated the power and the sensitivity (specificity) of both approaches. The sensitivity was calculated as the probability of selecting a causal SNP in the final model, while the specificity was calculated as the probability of not selecting a noise SNP in the final model. The results for perspective cohort studies were summarized in Table 2. Under the Disease Model 1, while COX had a higher power than WNA (i.e., 0.702 vs. 0.738) when only causal SNPs were considered, its power decreased much faster than that of WNA as the number of noise SNPs increased. When the noise SNPs reaches 8, WNA attained a higher power than COX (i.e., 0.567 vs. 0.427). When the hazard ratio was nonlinear (Disease Model 2) or PH assumption was violated (Disease Models 3 and 4), WNA had a consistently higher power than COX. WNA showed the most advantages over COX under Disease Model 4 when the hazard functions do not follow monotonic patterns. In all scenarios, COX tended to have a higher sensitivity, but a lower specificity than WNA. This indicated COX tended to include more noise SNPs in the model than WNA, which partially explained its relatively lower power with the increasing number of noise SNPs. Compared to COX, WNA had relatively more robust performance, especially when the PH assumption was violated.
The simulation results for casecontrol studies were summarized in Table 3. WNA attained a higher power than MCC under all simulated models, which could be explained by the biased estimation of MCC under common disease scenarios. The power of MCC also decreased more rapidly as the number of noise SNPs increased, compared to the power of COX in cohort studies. Similar with COX, MCC tended to have a higher sensitivity, but a lower specificity than WNA.
For casecontrol studies, we assumed that the disease prevalence was accurately estimated. However, in practical, the prevalence of a disease might be estimated inaccurately. Therefore, additional simulation was conducted to evaluate the performance of WNA when disease prevalence was estimated inaccurately. For simplicity, we used the Disease Model 2 as an example, and considered both accurate and inaccurately estimated disease prevalence values, including ρ, ρ ± 5%, ρ ± 10%. The results were summarized in Table 4. The power of WNA increased slightly as the disease prevalence decreased. However, the type I errors also increased as the disease prevalence decreased, and would be inflated when disease prevalence was underestimated. Further, the underestimation of disease prevalence appeared to increase the sensitivity of SNP selection, but reduced the specificity. By adjusting the disease prevalence, we expect to retrieve a pseudocohort population with a number of unobserved controls, who are expected to be at risk throughout the study. Underestimating the disease prevalence would artificially increase the number of controls in the study, and thus cause bias in the followup studies with an independent subcohort data from the source population.
Application to study the progression to nicotine dependence (ND)
Previous studies have indicated that the progression to ND could be influenced by the interplay of genetic variants [30, 31]. Detecting GG interactions contributing to the development of ND would help to understand the transition process from first cigarette use to nicotine dependence, and to promote the development of early prevention and intervention strategies. For such a purpose, we initiated an interaction search among known NDassociated genetic variants by applying the WNA approach to the Study of Addiction: Genetics and Environment (SAGE) GWAS dataset. The participants of the SAGE were unrelated individuals selected from three large, complementary casecontrol studies: the Family Study of Cocaine Dependence (FSCD), the Collaborative Study on the Genetics of Alcoholism (COGA), and the Collaborative Genetic Study of Nicotine Dependence (COGEND) [32]. The SAGE included standardized diagnostic assessments of ND by Diagnostic and Statistical Manual of Mental Disorders (DSM) IV, and its assessment plans for ageatonset variables were also guided by standardized interview protocols and assessments, as described in prior SAGE publications [33, 34]. We considered two ageatonset variables, ageatonset of ND and ageatinitiation of tobacco uses, and defined progression to ND as their difference. The study subjects under the investigation were limited to those who ever smoked cigarettes daily for a month or more. For nonND subjects, ageatsurvey was used as rightcensoring values for ageatonset of ND. After removing the subjects with missing outcomes, there were 706, 727, and 1,232 subjects in FSCD, COGA and COGEND, respectively. From the literature, we selected 150 SNPs across 64 candidate genes that have been reported for potential association with ND. Among those 150 SNPs, genotypes for 124 SNPs were available in the SAGE dataset, while genotypes for the remaining 26 SNPs were imputed by using PLINK [35]. The HapMap phase III founders of the CEU and ASW populations were used in the imputation as the reference panels for the white and black subjects [36].
We applied WNA to FSCD for an initial GG interaction search and then replicated the initial findings in COGA and COGEND. While applying WNA, we fixed the disease prevalence at ρ = 0.24, which was estimated according to national survey [37]. Two SNPs, rs6570989 (A/G) and rs2930357 (C/T), located in gene GRIK2 and CSMD1, were identified in the initial search to be jointly associated with progression of ND with a nominal pvalue of 9.68e13. Permutation test was then conducted to estimate the empirical pvalue, accounting for overestimation due to the model selection. The empirical pvalue obtained from permutation test indicated a significant association (i.e., pvalue < 0.001). Further validation of the finding in COGA (pvalue = 0.034) and COGEND (pvalue = 7.85e04) showed the association remained significant at 5% level (Table 5). Survival curves in Figure 1 showed that the effect of rs6570989 was modified by the genotypes of rs2930357 in FSCD, which indicated a possible GG interaction between two SNPs (Figure 1 A1A2). Similar patterns were also observed in COGA (Figure 1 B1B2) and COGEND (Figure 1 C1C2). To account for the possible bias estimation of disease prevalence, we further examined the joint association for the identified two SNPs with the disease prevalence rates of 0.19 and 0.29 (i.e. 0.24 ± 0.05). The results showed that the significance level decreased as the disease prevalence increased (Table 6), but all joint association remained at least marginally significant with the disease prevalence of 0.29 (pvalues were 4.08e12, 0.054 and 2.00e03 in FSCA, COGA and COGEND, respectively).
We also applied the Cox regression method to the same datasets. Only pairwise interactions among SNPs were considered in the selection. The final model was determined by forward selection to minimize the AIC value. In FSCD, Cox regression with forward selection picked up nine SNPs involving a complicated model with a total number of 45 parameters. This association could not be replicated in either COGA (pvalue = 0.703) or COGEND (pvalue = 0.218).
Discussion
Complex diseases, manifesting with various clinical features, are believed to be caused by the joint action of multiple genetic variants through distinctive biological pathways. If two genes are jointly involved in producing the variability of a disease feature, whether additively or not, biological interaction between them is involved [38]. Although there is growing interest in detecting genetic variants that characterize disease progression, relatively few approaches have been proposed to evaluate the interaction among multiple genetic variants. In this article, we have proposed a nonparametric approach, referred to as WNA, for testing the joint association of multiple genetic variants with the ageat onset outcomes, taking possible GG interactions into account. The approach can be applied to both prospective cohort studies and casecontrol studies. Through simulations and an empirical study, we have shown that our approach had a comparable or better performance than Cox regression under various scenarios. The outperformance of WNA over Cox regression can be explained by the following reasons: 1) WNA does not assume any patterns of the hazard functions among GG groups, which makes it more robust under various disease scenarios. While we expect Cox regression to have a better performance than WNA when the underlying disease model is known, in reality our understanding of the mode of inheritance for complex diseases is very limited. In such a case, nonparametric approaches, such as WNA, would have more advantages for the search of genegene interactions. 2) When a set of SNPs are involved, Cox regression tends to select highly complex genetic models with a large number of parameters. Unlike Cox regression, WNA is a nonparametric approach, and does not assume any parametric model for the selected GG combinations. When the assumptions are violated, WNA likely captures the underlying GG combinations, which can be replicated in independent studies. 3) Given the disease prevalence in the source population, WNA can be easily extended to studies with casecontrol designs. For casecontrols studies, it has been suggested that Cox regression should be used with great caution [17, 18], due to the biased estimation of the effect size and incorrect statistical inference.
In the simulation, one of our aims is to examine robustness of the proposed WNA approach under casecontrol studies of common diseases, where the COX/MCC approach has biased estimates. Therefore, we have evaluated the performance of two approaches with an independent subcohort dataset from the source population, mimicking an independent followup study in a real application. The results have shown WNA approach had an improved power by adjusting for the disease prevalence in casecontrol studies, but in the meantime, underestimating disease prevalence may lead to an inflated type I error for WNA. Caution should be taken in specifying the disease prevalence in real applications.
Another important consideration is the choice of weight function in WNA. The weight function in this study gives the most weight to departures of hazard functions at early ages. Alternatively, we could also adopt a general class of weight functions based on the KaplanMeier estimator [28]. The weight function has the form,
where $\widehat{\mathit{S}}\left(.\right)$ is the KaplanMeier estimator of the survival function. The values of p and q should be chosen appropriately according to the hypothesis of interest. For instance, when p = 0, q > 0, the above weight gives more consideration to departures of hazard functions at late ages.
In the real data application, we identified two SNPs, rs6570989 and rs2930357, jointly associated with ND. In a recent GWAS of 3497 Dutch subjects, both of these two SNP were found to be significantly associated with current smoking [39]. These two SNPs are located in gene GRIK2 and CSMD1, respectively. Both GRIK2 and CSMD1 have been suggested to be functionally related with ND. GRIK2 belongs to the kainate family of glutamate receptors, which are actively involved in a variety of neurophysiologic processes [40, 41]. GRIK2 has also been reported to be associated with smoking cessation [42]. Gene CSMD1 was shown to be highly expressed in the central nerve system [43], and to be related to smoking cessation [44]. A number of studies have also suggested that early smoking initiation and the development of nicotine dependence are associated with greater difficulty to quit smoking [45–47]. Nonetheless, relatively few studies have been conducted to evaluate ND ageat onset outcomes, and our knowledge regarding the genetic contribution to the progression of ND is still lacking. While it is biologically plausible that the two identified genes may have a joint association contributed to the progression of ND, further studies are required to replicate this result.
We are aware that the proposed approach has some limitations. First, the test statistic of WNA follows an asymptotic Chisquare distribution when evaluating common genetic variants. However, if a genetic variant has a very low minor allele frequency, it may form certain GG groups with a small number of subjects. In such a case, the asymptotic property of the test may not hold [48, 49]. Therefore, for the rare variants, we suggest that an exact test be used to evaluate the significance [50]. Second, the proposed approach used a forward selection strategy, and we expect the power to decrease if none of the genetic variants has any marginal effect. In this specific case, exhaustive selection will be needed to detect a GG interaction, but at a much higher computation cost. Third, the proposed approach is implemented in R with model selection, crossvalidation and permutation procedures. It is less computationally efficient than applying a Cox regression model available in R. On a dual core 3.20GHz desktop, the average computation time for applying the WNA approach and Cox regression were 23.2 second and 0.26 second, respectively. On replication datasets when the optimal GG model was predetermined, the computation time for applying the WNA approach was significantly reduced to 0.028 second, which was comparable to those for Cox regression (0.031 second).
One major advantage of the proposed WNA approach is its capability of handling multiple genetic variants with the consideration of possible highorder GG interactions [51]. It is worthwhile to note that WNA is a nonparametric approach developed for both cohort studies and casecontrol studies, which differs from other approaches, such as the kernelmachine based approach [52]. Further, we limited the application of WNA approach to populationbased casecontrol studies in which the cases and controls were not matched by age. If controls are matched to cases and are randomly selected from all those at risk at the ageofonset of the cases, Cox regression can estimate the effect size of by a conditional likelihood method without bias [53, 54].
Conclusions
We have proposed a statistical approach for detecting genetic interactions associated with ageatonset outcomes. The approach is able to capture highorder genegene interactions, and can be applied to both prospective cohort studies and casecontrol studies. Through simulations, we showed that the new approach had comparable or better performance than the conventional Coxregressionbased methods. The empirical data applications to nicotine dependence also identified two genes, GRIK2 and CSMD1, joint associated with the progression of nicotine dependence. In addition to conventional statistical approaches for survival outcomes, the new approach provides an alternative way to model genetic interactions related to survival outcomes.
References
 1.
Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Whittemore AS, Boehnke M, Clark AG, Eichler EE, Gibson G, Haines JL, Mackay TF, McCarroll SA, Visscher PM: Finding the missing heritability of complex diseases. Nature. 2009, 461 (7265): 747753. 10.1038/nature08494.
 2.
So HC, Gui AH, Cherny SS, Sham PC: Evaluating the heritability explained by known susceptibility variants: a survey of ten complex diseases. Genet Epidemiol. 2011, 35 (5): 310317. 10.1002/gepi.20579.
 3.
Moore JH: The ubiquitous nature of epistasis in determining susceptibility to common human diseases. Hum Hered. 2003, 56 (1–3): 7382.
 4.
Nagel RL: Epistasis and the genetics of human diseases. C R Biol. 2005, 328 (7): 606615. 10.1016/j.crvi.2005.05.003.
 5.
Pendergrass SA, BrownGentry K, Dudek SM, Torstenson ES, Ambite JL, Avery CL, Buyske S, Cai C, Fesinmeyer MD, Haiman C, Heiss G, Hindorff LA, Hsu CN, Jackson RD, Kooperberg C, Le Marchand L, Lin Y, Matise TC, Moreland L, Monroe K, Reiner AP, Wallace R, Wilkens LR, Crawford DC, Ritchie MD: The use of phenomewide association studies (PheWAS) for exploration of novel genotypephenotype relationships and pleiotropy discovery. Genet Epidemiol. 2011, 35 (5): 410422. 10.1002/gepi.20589.
 6.
Fisher RA: The genetical theory of nature selection. 1930, Oxford: The Clarendon Press
 7.
Partridge L, Gems D: Mechanisms of ageing: public or private?. Nat Rev Genet. 2002, 3 (3): 165175. 10.1038/nrg753.
 8.
Wright A, Charlesworth B, Rudan I, Carothers A, Campbell H: A polygenic basis for lateonset disease. Trends Genet. 2003, 19 (2): 97106. 10.1016/S01689525(02)000331.
 9.
Lin PI, McInnis MG, Potash JB, Willour VL, Mackinnon DF, Miao K, Depaulo JR, Zandi PP: Assessment of the effect of age at onset on linkage to bipolar disorder: evidence on chromosomes 18p and 21q. Am J Hum Genet. 2005, 77 (4): 545555. 10.1086/491602.
 10.
Price DL, Sisodia SS, Borchelt DR: Alzheimer disease–when and why?. Nat Genet. 1998, 19 (4): 314316. 10.1038/1196.
 11.
Azzato EM, Pharoah PD, Harrington P, Easton DF, Greenberg D, Caporaso NE, Chanock SJ, Hoover RN, Thomas G, Hunter DJ, Kraft P: A genomewide association study of prognosis in breast cancer. Cancer Epidemiol Biomarkers Prev. 2010, 19 (4): 11401143. 10.1158/10559965.EPI100085.
 12.
Pillas D, Hoggart CJ, Evans DM, O'Reilly PF, Sipila K, Lahdesmaki R, Millwood IY, Kaakinen M, Netuveli G, Blane D, Charoen P, Sovio U, Pouta A, Freimer N, Hartikainen AL, Laitinen J, Vaara S, Glaser B, Crawford P, Timpson NJ, Ring SM, Deng G, Zhang W, McCarthy MI, Deloukas P, Peltonen L, Elliott P, Coin LJ, Smith GD, Jarvelin MR: Genomewide association study reveals multiple loci associated with primary tooth development during infancy. PLoS Genet. 2010, 6 (2): e100085610.1371/journal.pgen.1000856.
 13.
van Manen D, Delaneau O, Kootstra NA, BoeserNunnink BD, Limou S, Bol SM, Burger JA, Zwinderman AH, Moerland PD, van't Slot R, Zagury JF, Wout AB V 't, Schuitemaker H: Genomewide association scan in HIV1infected individuals identifying variants influencing disease course. PLoS One. 2011, 6 (7): e2220810.1371/journal.pone.0022208.
 14.
Scheike TH, Martinussen T, Silver JD: Estimating haplotype effects for survival data. Biometrics. 2010, 66 (3): 705715. 10.1111/j.15410420.2009.01329.x.
 15.
Souverein OW, Zwinderman AH, Jukema JW, Tanck MW: Estimating effects of rare haplotypes on failure time using a penalized Cox proportional hazards regression model. BMC Genet. 2008, 9: 9
 16.
Tregouet DA, Tiret L: Cox proportional hazards survival regression in haplotypebased association analysis using the StochasticEM algorithm. Eur J Hum Genet. 2004, 12 (11): 971974. 10.1038/sj.ejhg.5201238.
 17.
Lubin JH, Gail MH: Biased selection of controls for casecontrol analyses of cohort studies. Biometrics. 1984, 40 (1): 6375. 10.2307/2530744.
 18.
Robins JM, Gail MH, Lubin JH: More on "Biased selection of controls for casecontrol analyses of cohort studies". Biometrics. 1986, 42 (2): 293299. 10.2307/2531050.
 19.
Nan B, Lin X: Analysis of casecontrol ageatonset data using a modified casecohort method. Biom J. 2008, 50 (2): 311320. 10.1002/bimj.200710406.
 20.
Nelson W: Theory and applications of hazard plotting for censored failure data. Technometrics. 1972, 14: 945965. 10.1080/00401706.1972.10488991.
 21.
Aalen OO: Nonparametric inference for a family of counting process. Ann Stat. 1978, 6: 701726. 10.1214/aos/1176344247.
 22.
Pena EA, Rohatgi VK: Small sample and efficiency results for the NelsonAalen estimator. J Stat Plann Infer. 1993, 37: 193202. 10.1016/03783758(93)90088N.
 23.
Mantel N: Evaluation of survival data and two new rank order statistics arising in its consideration. Cancer Chemother Rep. 1966, 50 (3): 163170.
 24.
Breslow NE: A generalized KruskalWallis test for comparing K samples subject to unequal patterns of censorship. Biometrika. 1970, 57: 579594. 10.1093/biomet/57.3.579.
 25.
Gehan EA: A generalized Wilcoxon test for comparing arbitrarily singly consored samples. Biometrika. 1965, 53: 203223.
 26.
Peto R, Peto J: Asymptotically efficient rank invariant test procedures. J Roy Stat Soc Ser Gen. 1972, 135: 18510.2307/2344317.
 27.
Andersen PK: Testing goodness of fit of cox regression and life model. Biometrics. 1982, 38: 6777. 10.2307/2530289.
 28.
Fleming TR, Harrington DP: A Class of Hypothesis Tests for One and Two Samples of Censored Survival Data. Comm Stat. 1981, 10: 763794. 10.1080/03610928108828073.
 29.
Klein JP, Moeschberger ML: Survival Analysis: Techniques for Censored and Truncated Data. 2003, New York: Springer
 30.
Grucza RA, Johnson EO, Krueger RF, Breslau N, Saccone NL, Chen LS, Derringer J, Agrawal A, Lynskey M, Bierut LJ: Incorporating age at onset of smoking into genetic models for nicotine dependence: evidence for interaction with multiple genes. Addict Biol. 2010, 15 (3): 346357. 10.1111/j.13691600.2010.00220.x.
 31.
LessovSchlaggar CN, Kristjansson SD, Bucholz KK, Heath AC, Madden PA: Genetic influences on developmental smoking trajectories. Addiction. 2012, 107 (9): 16961704. 10.1111/j.13600443.2012.03871.x.
 32.
Bierut LJ, Agrawal A, Bucholz KK, Doheny KF, Laurie C, Pugh E, Fisher S, Fox L, Howells W, Bertelsen S, Hinrichs AL, Almasy L, Breslau N, Culverhouse RC, Dick DM, Edenberg HJ, Foroud T, Grucza RA, Hatsukami D, Hesselbrock V, Johnson EO, Kramer J, Krueger RF, Kuperman S, Lynskey M, Mann K, Neuman RJ, Nöthen MM, Nurnberger JI, Porjesz B, et al: A genomewide association study of alcohol dependence. Proc Natl Acad Sci U S A. 2010, 107 (11): 50825087. 10.1073/pnas.0911109107.
 33.
Bierut LJ, Strickland JR, Thompson JR, Afful SE, Cottler LB: Drug use and dependence in cocaine dependent subjects, communitybased individuals, and their siblings. Drug Alcohol Depend. 2008, 95 (1–2): 1422.
 34.
Johnson C, Drgon T, Liu QR, Zhang PW, Walther D, Li CY, Anthony JC, Ding Y, Eaton WW, Uhl GR: Genome wide association for substance dependence: convergent results from epidemiologic and research volunteer samples. BMC Med Genet. 2008, 9: 11310.1186/147123509113.
 35.
Purcell S, Neale B, ToddBrown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: a tool set for wholegenome association and populationbased linkage analyses. Am J Hum Genet. 2007, 81 (3): 559575. 10.1086/519795.
 36.
Altshuler DM, Gibbs RA, Peltonen L, Dermitzakis E, Schaffner SF, Yu F, Bonnen PE, de Bakker PI, Deloukas P, Gabriel SB, Gwilliam R, Hunt S, Inouye M, Jia X, Palotie A, Parkin M, Whittaker P, Yu F, Chang K, Hawes A, Lewis LR, Ren Y, Wheeler D, Gibbs RA, Muzny DM, Barnes C, Darvishi K, Hurles M, Korn JM, Kristiansson K, et al: Integrating common and rare genetic variation in diverse human populations. Nature. 2010, 467 (7311): 5258. 10.1038/nature09298.
 37.
Breslau N, Johnson EO, Hiripi E, Kessler R: Nicotine dependence in the United States: prevalence, trends, and smoking persistence. Arch Gen Psychiatry. 2001, 58 (9): 810816. 10.1001/archpsyc.58.9.810.
 38.
Wang X, Elston RC, Zhu X: The meaning of interaction. Hum Hered. 2010, 70 (4): 269277. 10.1159/000321967.
 39.
Vink JM, Smit AB, de Geus EJ, Sullivan P, Willemsen G, Hottenga JJ, Smit JH, Hoogendijk WJ, Zitman FG, Peltonen L, Kaprio J, Pedersen NL, Magnusson PK, Spector TD, Kyvik KO, Morley KI, Heath AC, Martin NG, Westendorp RG, Slagboom PE, Tiemeier H, Hofman A, Uitterlinden AG, Aulchenko YS, Amin N, van Duijn C, Penninx BW, Boomsma DI: Genomewide association study of smoking initiation and current smoking. Am J Hum Genet. 2009, 84 (3): 367379. 10.1016/j.ajhg.2009.02.001.
 40.
Tzschentke TM, Schmidt WJ: Glutamatergic mechanisms in addiction. Mol Psychiatry. 2003, 8 (4): 373382. 10.1038/sj.mp.4001269.
 41.
O'Donnell CJ, Cupples LA, D'Agostino RB, Fox CS, Hoffmann U, Hwang SJ, Ingellson E, Liu C, Murabito JM, Polak JF, Wolf PA, Demissie S: Genomewide association study for subclinical atherosclerosis in major arterial territories in the NHLBI's Framingham Heart Study. BMC Med Genet. 2007, 8 (Suppl 1): S410.1186/147123508S1S4.
 42.
Uhl GR, Liu QR, Drgon T, Johnson C, Walther D, Rose JE, David SP, Niaura R, Lerman C: Molecular genetics of successful smoking cessation: convergent genomewide association study results. Arch Gen Psychiatry. 2008, 65 (6): 683693. 10.1001/archpsyc.65.6.683.
 43.
Kraus DM, Elliott GS, Chute H, Horan T, Pfenninger KH, Sanford SD, Foster S, Scully S, Welcher AA, Holers VM: CSMD1 is a novel multiple domain complementregulatory protein highly expressed in the central nervous system and epithelial tissues. J Immunol. 2006, 176 (7): 44194430. 10.4049/jimmunol.176.7.4419.
 44.
Drgon T, Montoya I, Johnson C, Liu QR, Walther D, Hamer D, Uhl GR: Genomewide association for nicotine dependence and smoking cessation success in NIH research volunteers. Mol Med. 2009, 15 (1–2): 2127.
 45.
Breslau N, Peterson EL: Smoking cessation in young adults: age at initiation of cigarette smoking and other suspected influences. Am J Public Health. 1996, 86 (2): 214220. 10.2105/AJPH.86.2.214.
 46.
Chen J, Millar WJ: Age of smoking initiation: implications for quitting. Health Rep. 1998, 9 (4): 3946. Eng); 3948(Fre
 47.
Kandel DB, Hu MC, Griesler PC, Schaffran C: On the development of nicotine dependence in adolescence. Drug Alcohol Depend. 2007, 91 (1): 2639. 10.1016/j.drugalcdep.2007.04.011.
 48.
Kellerer AM, Chmelevsky D: Smallsample properties of censoreddata rank tests. Biometrics. 1983, 39: 675682. 10.2307/2531095.
 49.
Latta RB: A Monte Carlo study of some two sample rank tests with censored data. J Am Stat Assoc. 1981, 76: 713719. 10.1080/01621459.1981.10477710.
 50.
Heinze G, Gnant M, Schemper M: Exact logrank tests for unequal followup. Biometrics. 2003, 59 (4): 11511157. 10.1111/j.0006341X.2003.00132.x.
 51.
Wei C, Schaid DJ, Lu Q: Trees Assembling MannWhitney approach for detecting genomewide joint association among lowmarginaleffect loci. Genet Epidemiol. 2013, 37 (1): 8491. 10.1002/gepi.21693.
 52.
Lin X, Cai T, Wu MC, Zhou Q, Liu G, Christiani DC: Kernel machine SNPset analysis for censored survival outcomes in genomewide association studies. Genet Epidemiol. 2011, 35 (7): 620631. 10.1002/gepi.20610.
 53.
Prentice RL, Breslow N: Retrospective studies and failure time models. Biometrika. 1978, 65: 153158. 10.1093/biomet/65.1.153.
 54.
Wacholder S: Bias in full cohort and nested casecontrol studies?. Epidemiology. 2009, 20 (3): 339340. 10.1097/EDE.0b013e31819ec966.
Acknowledgements
We would like to thank three anonymous reviewers for their critical comments that improved this manuscript. This work was supported, in part, by the University of Arkansas for Medical Sciences College of Medicine Children’s University Medical Group Fund Grant Program, the National Institute on Drug Abuse under Award Number K01DA033346 and K05DA015799, and the National Institute of Dental & Craniofacial Research under Award Number R03DE022379. Funding support for the Study of Addiction: Genetics and Environment (SAGE) was provided through the NIH Genes, Environment and Health Initiative [GEI] (U01 HG004422). The SAGE datasets used for the analyses were obtained from dbGaP at http://www.ncbi.nlm.nih.gov/projects/gap/cgibin/study.cgi?study_id=phs000092.v1.p1 through dbGaP accession number phs000092.v1.p1.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
ML designed the study, performed the analysis and wrote the manuscript; JCG participated in methodology development and wrote the manuscript; NB participated in data interpretation and wrote the manuscript; JCA participated in data interpretation and wrote the manuscript; QL conceived the idea, designed the study and wrote the manuscript. All authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
About this article
Received
Accepted
Published
DOI
Keywords
 Weighted NelsonAalen
 Cox regression
 Progression of nicotine dependence
 Joint association