Neural networks for modeling gene-gene interactions in association studies
 Frauke Günther^{1},
 Nina Wawro^{1} and
 Karin Bammann^{1}
DOI: 10.1186/1471-2156-10-87
© Günther et al; licensee BioMed Central Ltd. 2009
Received: 4 March 2009
Accepted: 23 December 2009
Published: 23 December 2009
Abstract
Background
Our aim is to investigate the ability of neural networks to model different two-locus disease models. We conduct a simulation study to compare neural networks with two standard methods, namely logistic regression models and multifactor dimensionality reduction. One hundred data sets are generated for each of six two-locus disease models, which are considered in a low and in a high risk scenario. Two models represent independence, one is a multiplicative model, and three models are epistatic. For each data set, six neural networks (with up to five hidden neurons) and five logistic regression models (the null model, three main effect models, and the full model) with two different codings for the genotype information are fitted. Additionally, the multifactor dimensionality reduction approach is applied.
Results
The results show that neural networks are more successful in modeling the structure of the underlying disease model than logistic regression models in most of the investigated situations. In our simulation study, neither logistic regression nor multifactor dimensionality reduction is able to correctly identify biological interaction.
Conclusions
Neural networks are a promising tool to handle complex data situations. However, further research is necessary concerning the interpretation of their parameters.
Background
The investigation of complex diseases plays an important role in genetic epidemiology where the identification of genetic risk factors is of great interest. Besides the study of main effects, the interplay of two or more genetic risk factors is gaining more and more attention. The identification of such a biological interaction or epistasis, however, poses new challenges for statistical methods. A major problem is the discrepancy between statistical and biological interaction. Statistical interaction is commonly defined as the deviation from an additive effect of single risk factors on the outcome or on the transformed outcome. In logistic regression models, for example, a multiplicative structural model is applied and an additive effect on the logit-transformed outcome implies a multiplicative effect on the untransformed outcome. Therefore, statistical interaction in a logistic regression model is understood as deviation from a multiplicative effect.
In contrast, biological interaction is present if one gene influences the effect of another one [1]. The two terms do not coincide, as was shown for example by North et al. [2] or Foraita et al. [3]. Nevertheless, a meaningful interpretation of genetic studies requires the detection of biological interaction with statistical methods (cf. [4, 5]).
A variety of parametric and nonparametric methods has been proposed for modeling and detecting gene-gene interaction, e.g. support-vector machines [6], random forests [7, 8], multifactor dimensionality reduction (MDR, [9, 10]), combinatorial partitioning methods [11], focused interaction testing framework [12], classification and regression trees (CART, [13]), logic regression [14], and lasso regression [15]. A useful classification is given by Musani et al. [16], who distinguish between regression-based methods, data reduction-based methods, and pattern recognition methods in their overview.
Despite the wealth of these approaches, none of the proposed methods is optimal for all two-locus disease models (see e.g. [17–19]). Consequently, there is no established method for analyzing gene-gene interactions so far [20]. Since parametric methods have problems detecting interaction in the absence of main effects and nonparametric approaches are ineffective when main effects are present [16, 21], it might well be that there is no single approach appropriate for all types of biological interaction. Currently, generalized linear models, and here logistic regression models, as well as MDR are predominantly applied (see e.g. [22–27]). Another tool that has been employed in genetic epidemiology during the last 15 years is the neural network approach (see e.g. [28–32]). Neural networks are a flexible statistical tool to model any functional relationship between covariates and response variables. Therefore, they represent a promising approach to deal with the difficulties associated with modeling biological gene-gene interactions. They have also been applied successfully for variable selection, for example with genetic programming neural networks (GPNN, [33–36]) or grammatical evolution neural networks (GENN, [37, 38]). Both approaches were developed to identify an optimal network topology. Motsinger et al. [39] successfully applied GENN to simulated genome-wide association data with 500,000 single nucleotide polymorphisms (SNPs), showing the general ability of neural networks to handle such large data sets. However, variable selection is not the focus of this paper.
The aim of this paper is to explore the ability of neural networks to model different types of biological gene-gene interactions. For this purpose, a simulation study is conducted to investigate the behavior of neural networks in various situations. We assume a case-control study with equal numbers of cases and controls. Following the scenarios of Risch [40] and the concept of epistatic models as classified by Li and Reich [41], different theoretical types of gene-gene interactions are studied. There are exactly two loci involved, i.e. variable selection is not a problem. The results are compared with those of logistic regression models and those of MDR analyses. Finally, the advantages and disadvantages of using a neural network approach are discussed.
Methods
Neural networks
A feedforward multilayer perceptron (MLP) is chosen as neural network [42]. The general idea of an MLP is to approximate arbitrary functional relationships between covariates and response variables.
Data pass through the neural network as signals. These signals travel along the synapses and pass through the neurons, where they are processed. All incoming signals are added and the activation function σ is applied to the resulting sum. Additionally, a weight is attached to each of the synapses. A positive weight indicates an amplifying effect on the signal, a negative weight a repressing effect. During the training process, the weights are modified by a learning algorithm. The learning algorithm minimizes an error function that depends on the difference between the given output and the output estimated by the neural network. In general, the strength of the modification depends on a specified learning rate.
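An MLP without a hidden layer and with the logistic function as activation function σ is equivalent to a logistic regression model. In this case, all weights w_{i} of the MLP correspond to the regression coefficients β_{i} of the logistic regression model.
An MLP with a hidden layer is more flexible and is capable of modeling any piecewise continuous function [44]. Here, however, the parameters lack a direct interpretation.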
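The relationship between the two kinds of MLP can be illustrated with a small sketch. It assumes the logistic function as activation function in both the hidden and the output layer, and the function names `mlp_output` and `n_parameters` are hypothetical helpers introduced only for this illustration. Without a hidden layer the output has the form of a logistic regression model; with hidden neurons, each hidden neuron adds one intercept and one weight per covariate, plus one weight into the output neuron.

```python
import math


def logistic(z):
    """Logistic activation function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def mlp_output(x, w_out, w_hidden=None):
    """Output of an MLP with at most one hidden layer.

    x        : list of covariates (e.g. the two genotypes)
    w_out    : [intercept, weight_1, ...] of the output neuron
    w_hidden : list with one [intercept, weight_1, ...] per hidden neuron,
               or None for a network without hidden layer
    """
    if w_hidden:
        # Each hidden neuron sums its weighted inputs and applies sigma.
        signals = [
            logistic(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
            for w in w_hidden
        ]
    else:
        # No hidden layer: the covariates feed the output neuron directly,
        # which gives the structural form of a logistic regression model.
        signals = x
    return logistic(w_out[0] + sum(wi * si for wi, si in zip(w_out[1:], signals)))


def n_parameters(n_inputs, n_hidden):
    """Number of weights (including intercepts) of the network."""
    if n_hidden == 0:
        return n_inputs + 1
    return n_hidden * (n_inputs + 1) + (n_hidden + 1)


# Parameter counts for two covariates and zero to five hidden neurons.
print([n_parameters(2, h) for h in range(6)])  # [3, 5, 9, 13, 17, 21]
```

For two covariates, the counts agree with the numbers of parameters listed for the six network topologies in the simulation study.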
In the present paper, we investigate MLPs with at most one hidden layer. Resilient backpropagation [45] and cross entropy are chosen as learning algorithm and error function, respectively. The latter choice guarantees equivalence of the trained weights to maximum-likelihood estimation (see e.g. [46]). The employment of resilient backpropagation as learning algorithm does not require a transformation of continuous data. It solves the problem of choosing an appropriate learning rate for each data situation.
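The idea behind resilient backpropagation can be sketched as follows. This is a simplified variant written for illustration, not necessarily the one implemented in the software used for the study: each weight has its own step size, only the sign of the gradient is used, and the step size grows while the sign stays the same and shrinks after a sign change. The network has no hidden layer, the error function is the summed cross entropy, and the data and all tuning constants (`eta_plus`, `eta_minus`, the step bounds) are hypothetical.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear_predictor(weights, x):
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))


def cross_entropy(weights, X, y):
    """Summed cross entropy, i.e. the negative log-likelihood."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = logistic(linear_predictor(weights, xi))
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total


def gradient(weights, X, y):
    """Gradient of the summed cross entropy with respect to the weights."""
    grad = [0.0] * len(weights)
    for xi, yi in zip(X, y):
        err = logistic(linear_predictor(weights, xi)) - yi
        grad[0] += err
        for k, v in enumerate(xi):
            grad[k + 1] += err * v
    return grad


def rprop(weights, X, y, n_iter=200, eta_plus=1.2, eta_minus=0.5,
          step_init=0.1, step_min=1e-6, step_max=1.0):
    """Sign-based weight updates with an individual step size per weight."""
    weights = list(weights)
    steps = [step_init] * len(weights)
    prev_grad = [0.0] * len(weights)
    for _ in range(n_iter):
        grad = gradient(weights, X, y)
        for k in range(len(weights)):
            if grad[k] * prev_grad[k] > 0:
                # Same sign as before: accelerate.
                steps[k] = min(steps[k] * eta_plus, step_max)
            elif grad[k] * prev_grad[k] < 0:
                # Sign change: the minimum was overshot, so slow down
                # and skip the update for this weight in this iteration.
                steps[k] = max(steps[k] * eta_minus, step_min)
                grad[k] = 0.0
            if grad[k] > 0:
                weights[k] -= steps[k]
            elif grad[k] < 0:
                weights[k] += steps[k]
            prev_grad[k] = grad[k]
    return weights


# Hypothetical toy data: two genotypes per subject and a binary outcome.
X = [[0, 0], [0, 0], [0, 0], [2, 0], [2, 0], [2, 0], [0, 2], [0, 2]]
y = [0, 0, 1, 0, 1, 1, 0, 1]
start = [0.0, 0.0, 0.0]
fitted = rprop(start, X, y)
print(cross_entropy(start, X, y), cross_entropy(fitted, X, y))
```

Because only the sign of the gradient enters the update, no global learning rate has to be tuned to the scale of the data.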
Design of the simulation study
We conduct a simulation study, where neural network models are used to fit different two-locus disease models in a case-control design. For each of these models, one low risk and one high risk scenario is simulated. Unconditional logistic regression models are fitted to the same data sets to compare the results with an established method. For judging the ability to model the underlying disease model, the estimated penetrance matrices are compared to the theoretical penetrance matrices.
Two-locus disease models
Six different two-locus disease models are considered: three models introduced by Risch [40] and three different epistatic models. They can be distinguished by the structure of their penetrance matrices f = [f_{ij}]_{i, j}, where i, j ∈ {0, 1, 2} represent the genotype at the two loci.
The additivity model (ADD) is defined by f_{ij} = P(Y = 1 | G_{A} = i, G_{B} = j) = a_{i} + b_{j}, where Y denotes the case-control status and G_{A} and G_{B}, G_{A}, G_{B} ∈ {0, 1, 2}, the genotypes at the two involved loci. The penetrance terms a_{i} and b_{j} are restricted to 0 ≤ a_{i}, b_{j} ≤ 1 and a_{i} + b_{j} ≤ 1. This model represents biological independence of both loci.
The heterogeneity model (HET) is defined by f_{ij} = a_{i} + b_{j} − a_{i}·b_{j}. Like the additivity model, the heterogeneity model describes a model of biological independence for 0 ≤ a_{i}, b_{j} ≤ 1. However, in this case no further constraints on the penetrance terms are necessary.
The multiplicative model (MULT), f_{ij} = a_{i}·b_{j}, represents biological interaction.
In the recessive epistatic model (EPI RR), f_{ij} = r·c if i = j = 2 and f_{ij} = c otherwise, where the constant term c denotes the baseline risk of getting the disease and r the risk increase or decrease. This model assumes that both genes have a recessive effect on the disease, since there is only an increased or decreased risk if both loci carry two mutated alleles.
In the dominant epistatic model (EPI DD), both loci are assumed to be dominant. In this setting, an increased or decreased risk is only observed if both loci carry at least one mutated allele.
In the mixed epistatic model (EPI RD), one gene (A) has a recessive and one gene (B) has a dominant effect on the disease.
All epistatic models represent gene-gene interaction. By choosing the parameters r, r_{1}, r_{2} and the ratios a_{1}/a_{0}, a_{2}/a_{0}, b_{1}/b_{0}, and b_{2}/b_{0}, respectively, different risk scenarios can be generated.
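The structure of some of these penetrance matrices can be sketched in a few lines. The sketch uses the standard forms of the three Risch models (f_{ij} = a_{i} + b_{j}, f_{ij} = a_{i} + b_{j} − a_{i}·b_{j}, and f_{ij} = a_{i}·b_{j}) and the recessive epistatic model as described in the text. The baseline values a_{0} = b_{0} = c = 0.01 are hypothetical and chosen only for illustration; the ratios follow the low risk scenario. The dominant and mixed epistatic models are left out because their matrices are not spelled out here.

```python
def additive(a, b):
    """ADD: f_ij = a_i + b_j (requires a_i + b_j <= 1)."""
    return [[a[i] + b[j] for j in range(3)] for i in range(3)]


def heterogeneity(a, b):
    """HET: f_ij = a_i + b_j - a_i * b_j."""
    return [[a[i] + b[j] - a[i] * b[j] for j in range(3)] for i in range(3)]


def multiplicative(a, b):
    """MULT: f_ij = a_i * b_j."""
    return [[a[i] * b[j] for j in range(3)] for i in range(3)]


def epistatic_recessive(c, r):
    """EPI RR: baseline risk c, changed by factor r only for genotype (2, 2)."""
    return [[r * c if (i == 2 and j == 2) else c for j in range(3)]
            for i in range(3)]


# Hypothetical baseline values; ratios as in the low risk scenario.
a0, b0, c = 0.01, 0.01, 0.01
a = [a0, 2 * a0, 4 * a0]
b = [b0, 5 * b0, 10 * b0]

for name, matrix in [("ADD", additive(a, b)),
                     ("HET", heterogeneity(a, b)),
                     ("MULT", multiplicative(a, b)),
                     ("EPI RR", epistatic_recessive(c, r=5))]:
    print(name)
    for row in matrix:
        print("  ", [round(value, 4) for value in row])
```

Rows index the genotype at locus A and columns the genotype at locus B.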
Data generation
Risk scenarios.
Two-locus disease model | Low risk scenario | High risk scenario
ADD, HET, MULT | a_{1} = 2·a_{0}; a_{2} = 4·a_{0}; b_{1} = 5·b_{0}; b_{2} = 10·b_{0} | a_{1} = 5·a_{0}; a_{2} = 10·a_{0}; b_{1} = 5·b_{0}; b_{2} = 10·b_{0}
EPI RR | r = 5 | r = 10
EPI DD, EPI RD | r_{1} = 2; r_{2} = 4 | r_{1} = 5; r_{2} = 10
As a second step, 100 case-control samples with 1,000 cases and 1,000 controls are drawn randomly from each basic population, i.e. each combination of two-locus disease model and risk scenario. Overall, this results in 12 times 100 case-control samples that will be analyzed.
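Drawing one such case-control sample can be sketched as follows. The sketch rests on several assumptions that are not taken from the study: genotype frequencies follow Hardy-Weinberg proportions with hypothetical allele frequencies `q_a` and `q_b`, the two loci are independent in the population, and genotypes are drawn from the theoretical genotype distributions of cases and controls rather than from a finite basic population. The penetrance matrix in the example is hypothetical as well.

```python
import random


def hardy_weinberg(q):
    """Genotype frequencies for 0, 1 and 2 mutated alleles."""
    return [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]


def genotype_distributions(f, q_a, q_b):
    """Joint genotype distributions among cases and among controls."""
    p_a, p_b = hardy_weinberg(q_a), hardy_weinberg(q_b)
    joint = {(i, j): p_a[i] * p_b[j] for i in range(3) for j in range(3)}
    prevalence = sum(joint[g] * f[g[0]][g[1]] for g in joint)
    # Bayes' theorem: P(G | Y = 1) and P(G | Y = 0).
    cases = {g: joint[g] * f[g[0]][g[1]] / prevalence for g in joint}
    controls = {g: joint[g] * (1 - f[g[0]][g[1]]) / (1 - prevalence)
                for g in joint}
    return cases, controls


def draw_sample(f, q_a, q_b, n_cases=1000, n_controls=1000, seed=0):
    """Return a list of (genotype A, genotype B, case-control status)."""
    rng = random.Random(seed)
    cases, controls = genotype_distributions(f, q_a, q_b)
    genotypes = list(cases)
    drawn_cases = rng.choices(
        genotypes, weights=[cases[g] for g in genotypes], k=n_cases)
    drawn_controls = rng.choices(
        genotypes, weights=[controls[g] for g in genotypes], k=n_controls)
    return ([(i, j, 1) for i, j in drawn_cases]
            + [(i, j, 0) for i, j in drawn_controls])


# Hypothetical additive penetrance matrix and allele frequencies.
f = [[0.02, 0.06, 0.11], [0.03, 0.07, 0.12], [0.05, 0.09, 0.15]]
sample = draw_sample(f, q_a=0.3, q_b=0.2)
print(len(sample), sum(status for _, _, status in sample))
```

Repeating the call with 100 different seeds would give the 100 samples of one combination of disease model and risk scenario.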
Modeling the data
Model building with neural networks is done using six different network topologies, from zero neurons in the hidden layer (i.e. no hidden layer) up to five neurons in the hidden layer. Each topology is trained five times with synaptic weights initialized with random numbers drawn from a standard normal distribution to avoid local minima. From these fitted models, the best model for each data set, i.e. the network topology, is chosen using Akaike's Information Criterion (AIC, [47]).
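The selection step can be sketched as follows, assuming that the cross entropy is the summed negative log-likelihood, so that AIC = 2·(cross entropy) + 2·(number of parameters). The helper names and all error values in the example are hypothetical. For each topology the best of the repeated training runs is kept, and the topology with the smallest AIC is selected.

```python
def aic(cross_entropy, n_params):
    """AIC = -2 log-likelihood + 2 k, with cross entropy = -log-likelihood."""
    return 2.0 * cross_entropy + 2.0 * n_params


def n_parameters(n_inputs, n_hidden):
    """Number of weights of an MLP with at most one hidden layer."""
    if n_hidden == 0:
        return n_inputs + 1
    return n_hidden * (n_inputs + 1) + (n_hidden + 1)


def select_topology(errors_by_topology, n_inputs=2):
    """errors_by_topology maps the number of hidden neurons to the
    cross entropies reached in the repeated training runs."""
    best = None
    for n_hidden, errors in errors_by_topology.items():
        score = aic(min(errors), n_parameters(n_inputs, n_hidden))
        if best is None or score < best[1]:
            best = (n_hidden, score)
    return best


# Hypothetical cross entropies of five runs for three topologies.
errors = {
    0: [1350.0, 1350.0, 1350.0, 1350.0, 1350.0],
    1: [1340.0, 1345.0, 1340.0, 1342.5, 1340.0],
    2: [1339.5, 1341.0, 1339.5, 1343.0, 1340.2],
}
print(select_topology(errors))  # (1, 2690.0)
```

In this hypothetical example the network with two hidden neurons reaches the lowest error, but its four additional parameters are not compensated by the small improvement in fit.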
Number of parameters.
Neural network | Number of parameters
0 hidden neurons | 3
1 hidden neuron | 5
2 hidden neurons | 9
3 hidden neurons | 13
4 hidden neurons | 17
5 hidden neurons | 21
Statistical model | Logistic regression | Logistic regression (DV)
Null model (NM) | 1 | 1
One main effect (SiA/SiB) | 2 | 3
Both main effects (ME) | 3 | 5
Full model (FM) | 4 | 9
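For each entry of the penetrance matrix, the mean absolute difference over the 100 samples is calculated as E_{ij} = (1/100) ∑_{k=1}^{100} |f_{ij} − \hat{f}_{ij}^{(k)}|, where i, j ∈ {0, 1, 2}, and f_{ij} and \hat{f}_{ij}^{(k)} denote the entries of the theoretical and estimated penetrance matrix of the k-th sample, respectively. Furthermore, the sum of the mean absolute differences ∑_{i, j}E_{ij} is considered.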
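A minimal sketch of this comparison, assuming that the mean absolute difference is the average over all samples of the absolute deviation between theoretical and estimated penetrance; the matrices in the example are hypothetical.

```python
def mean_absolute_difference(theoretical, estimates):
    """Entry-wise mean absolute difference E between one theoretical
    3 x 3 penetrance matrix and a list of estimated 3 x 3 matrices."""
    n = len(estimates)
    return [
        [sum(abs(theoretical[i][j] - est[i][j]) for est in estimates) / n
         for j in range(3)]
        for i in range(3)
    ]


def total_difference(E):
    """Sum of the mean absolute differences over all nine entries."""
    return sum(sum(row) for row in E)


# Hypothetical theoretical matrix and two estimated matrices.
theoretical = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, 0.75]]
estimates = [
    [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]],
    [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, 1.0]],
]
E = mean_absolute_difference(theoretical, estimates)
print(E[2][2], total_difference(E))  # 0.25 0.25
```

Averaging absolute rather than signed differences prevents over- and underestimation in different samples from cancelling out.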
The data generation and the statistical analyses for neural network and logistic regression are performed using R [48]. The package for the MLP, neuralnet, was newly implemented by our group and is published on CRAN [49].
Additionally, the MDR approach is applied to the data. The analyses are conducted by the Java-based open source software MDR release 1.2.5 with default configurations [50]. In particular, analysis configurations are specified as follows: the random seed is set to zero, the attribute count maximum is set to two and the cross-validation count to ten. The MDR identifies a set of functional variables that is best for classifying cases and controls. Due to the number of simulated loci, the software can only select one of three sets: either locus A or locus B only or both loci. Additionally, it provides a dendrogram to distinguish between redundant and synergistic variables based on information theory [51].
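The core classification idea of MDR can be illustrated with a short sketch. It is not the MDR software and omits cross-validation and the dendrogram: each genotype combination of the chosen loci is labelled high risk if its ratio of cases to controls reaches a threshold (1 for a balanced sample), and subjects are classified according to the label of their cell. The function names and the toy data are hypothetical.

```python
from collections import Counter


def high_risk_cells(sample, loci, threshold=1.0):
    """sample: list of (genotype A, genotype B, status);
    loci: positions of the loci to use, e.g. (0,), (1,) or (0, 1)."""
    cases, controls = Counter(), Counter()
    for row in sample:
        cell = tuple(row[k] for k in loci)
        if row[2] == 1:
            cases[cell] += 1
        else:
            controls[cell] += 1
    # A cell is high risk if cases / controls >= threshold.
    return {cell for cell in set(cases) | set(controls)
            if cases[cell] >= threshold * controls[cell]}


def accuracy(sample, loci, high_risk):
    """Share of subjects whose status matches the label of their cell."""
    correct = sum(
        (tuple(row[k] for k in loci) in high_risk) == (row[2] == 1)
        for row in sample
    )
    return correct / len(sample)


# Hypothetical toy sample.
sample = [(2, 2, 1), (2, 2, 1), (2, 2, 0), (0, 0, 0),
          (0, 0, 0), (0, 0, 1), (2, 0, 0), (0, 2, 0)]
for loci in [(0,), (1,), (0, 1)]:
    cells = high_risk_cells(sample, loci)
    print(loci, sorted(cells), accuracy(sample, loci, cells))
```

Because the accuracy on the data used for labelling can only improve when loci are added, the choice between the candidate sets needs held-out data, which is what the cross-validation in the MDR software provides.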
Results
Additive model (ADD). Sum of the mean absolute differences E between theoretical and estimated penetrance matrices. [Theoretical penetrance matrices and entry-wise mean absolute differences E not available.]
Risk scenario | Low risk | High risk
Penetrance terms | a_{1} = 2·a_{0}; a_{2} = 4·a_{0}; b_{1} = 5·b_{0}; b_{2} = 10·b_{0} | a_{1} = 5·a_{0}; a_{2} = 10·a_{0}; b_{1} = 5·b_{0}; b_{2} = 10·b_{0}
Neural network, sum | 0.2313 | 0.2059
Logistic regression, sum | 0.2530 | 0.2544
Logistic regression (design variables), sum | 0.2897 | 0.2804
Multiplicative model (MULT). Sum of the mean absolute differences E between theoretical and estimated penetrance matrices. [Theoretical penetrance matrices and entry-wise mean absolute differences E not available.]
Risk scenario | Low risk | High risk
Penetrance terms | a_{1} = 2·a_{0}; a_{2} = 4·a_{0}; b_{1} = 5·b_{0}; b_{2} = 10·b_{0} | a_{1} = 5·a_{0}; a_{2} = 10·a_{0}; b_{1} = 5·b_{0}; b_{2} = 10·b_{0}
Neural network, sum | 0.2428 | 0.2178
Logistic regression, sum | 0.3965 | 0.4887
Logistic regression (design variables), sum | 0.1637 | 0.1833
Epistatic model, recessive (EPI RR). Sum of the mean absolute differences E between theoretical and estimated penetrance matrices. [Theoretical penetrance matrices and entry-wise mean absolute differences E not available.]
Risk scenario | Low risk | High risk
Risk parameter | r = 5 | r = 10
Neural network, sum | 0.2071 | 0.1410
Logistic regression, sum | 0.4849 | 0.6150
Logistic regression (design variables), sum | 0.3503 | 0.2755
Epistatic model, dominant (EPI DD). Sum of the mean absolute differences E between theoretical and estimated penetrance matrices. [Theoretical penetrance matrices and entry-wise mean absolute differences E not available.]
Risk scenario | Low risk | High risk
Risk parameters | r_{1} = 2; r_{2} = 4 | r_{1} = 5; r_{2} = 10
Neural network, sum | 0.3095 | 0.2524
Logistic regression, sum | 0.3132 | 0.6528
Logistic regression (design variables), sum | 0.3071 | 0.2648
Epistatic model, mixed (EPI RD). Sum of the mean absolute differences E between theoretical and estimated penetrance matrices. [Theoretical penetrance matrices and entry-wise mean absolute differences E not available.]
Risk scenario | Low risk | High risk
Risk parameters | r_{1} = 2; r_{2} = 4 | r_{1} = 5; r_{2} = 10
Neural network, sum | 0.2239 | 0.1563
Logistic regression, sum | 0.5105 | 0.8658
Logistic regression (design variables), sum | 0.2799 | 0.2329
The results for the epistatic model with two dominant loci are different for the two risk scenarios (see Table 6). In the low risk scenario, none of the three statistical approaches is able to satisfactorily estimate the theoretical penetrance matrix of the disease model. The sum of the mean absolute differences ranges from ∑E_{ij} = 0.3071 to ∑E_{ij} = 0.3132 for the three approaches. In the high risk scenario, neural networks slightly outperform the logistic regression models with design variables, whereas the regression models without design variables completely fail to detect the characteristic structure of the underlying penetrance matrix (∑E_{ij} = 0.2524 for neural networks versus ∑E_{ij} = 0.2648 and ∑E_{ij} = 0.6528 for logistic regression models with and without design variables, respectively). The better fit of neural networks and logistic regression models with design variables comes at the cost of a high number of parameters: both approaches need on average about 9 parameters (results not shown).
The structure of the theoretical penetrance matrices given by the mixed epistatic model with one dominant and one recessive locus is again best modeled by neural networks (see Table 7). This can be observed for the sum and for the single entries of the mean absolute differences between the theoretical and the estimated penetrance matrices in both risk scenarios. The logistic regression models without design variables are again not able to identify this structure. Their mean absolute differences are much higher than those of the other approaches (e.g. ∑E_{ij} = 0.8658 and ∑E_{ij} = 0.2329 for logistic regression models without and with design variables, respectively, and ∑E_{ij} = 0.1563 for neural networks in the high risk scenario).
Selected logistic regression models (LRM).
LRM with design variables  

Statistical model (# parameters)  
Two-locus disease model  Risk scenario  NM (1)  SiA (3)  SiB (3)  ME (5)  FM (9)  ∑ 
ADD  low  1  39  60  100  
high  7  93  100  
HET  low  55  45  100  
high  10  90  100  
MULT  low  90  10  100  
high  88  12  100  
EPI RR  low  6  94  100  
high  100  100  
EPI DD  low  3  97  100  
high  100  100  
EPI RD  low  61  19  20  100  
high  57  14  29  100  
LRM without design variables  
Statistical model (# parameters)  
NM (1)  SiA (2)  SiB (2)  ME (3)  FM (4)  ∑  
ADD  low  1  27  72  100  
high  4  96  100  
HET  low  30  70  100  
high  6  94  100  
MULT  low  72  28  100  
high  54  46  100  
EPI RR  low  7  6  9  3  75  100 
high  100  100  
EPI DD  low  2  98  100  
high  100  100  
EPI RD  low  60  19  21  100  
high  38  23  39  100 
Different two-locus disease models representing gene-gene interaction lead to varying results when logistic regression models are applied. The logistic regression models do not include an interaction term in most replications when the multiplicative model is the underlying disease model. That means that the logistic regression models fail to detect the underlying biological interaction. The recessive and the dominant epistatic model are correctly represented by the full model in most situations. Only in the low risk scenario of the recessive epistatic model do the logistic regression models without design variables choose a broad variety of models in a quarter of the replications. For the mixed epistatic models, the logistic regression models perform poorly: since model SiA is mostly selected, the main effect for the (dominant) locus B is not detected in more than half of the replications and the interaction effect is included only in about 20% of the replications.
MDR analyses: selected variables and identification as redundant or synergistic behavior.
MDR analyses  

Redundant  Synergistic  
Two-locus disease model  Risk scenario  Only A  Only B  Both  Only A  Only B  Both  ∑ 
ADD  low  82  18  100  
high  7  93  100  
HET  low  68  32  100  
high  1  6  93  100  
MULT  low  7  93  100  
high  100  100  
EPI RR  low  10  22  39  2  4  23  100 
high  18  17  59  1  2  3  100  
EPI DD  low  12  1  87  100  
high  18  82  100  
EPI RD  low  63  34  3  100  
high  97  3  100 
Additionally, the provided dendrogram can be applied to distinguish between redundancy and synergism. These concepts are related to independence and interaction in our context [52]. Both loci are categorized as redundant for most of the investigated populations. Only the dominant epistatic model is correctly identified as a synergistic model for the majority of the data sets.
No similar statement about the agreement of disease and statistical model can be made for neural networks since there is no equivalent to the concept of interaction terms. Neural networks with one or two neurons in the hidden layer (i.e. models with five or nine parameters) are the most frequent models selected in the simulation study.
Discussion
In our simulation study, we investigated whether neural networks are able to model different types of gene-gene interaction in case-control data. For this purpose, we analyzed simulated data of six different two-locus disease models in two different risk scenarios with neural networks and compared the results to logistic regression models using two different approaches for coding the genotype information. Additionally, we investigated whether logistic regression models or the MDR approach, which are two widely used methods in applications, are suitable to identify biological interaction.
For the majority of the investigated situations, the theoretical penetrance matrix is estimated more accurately by neural networks than by logistic regression models. The exceptions are the multiplicative model in both risk scenarios and the dominant epistatic model in the low risk scenario. Although neural networks use two neurons in the hidden layer, i.e. nine parameters, in most replications of these situations, they are not able to exploit this flexibility to correctly represent the disease model. For the logistic regression models, it can be stated that the disease models of independence are better represented by a logistic regression model without design variables and the disease models of interaction are better represented by a logistic regression model with design variables. In situations where interaction is present, using a logistic regression model without design variables might lead to wrong results. Since the underlying disease model is usually not known beforehand, no recommendation can be given whether to employ design variables or not. Both logistic regression models mostly select a main effect model to represent the multiplicative model. The inclusion of interaction terms signifies deviations from the structural model rather than from the disease model representing independence. Consequently, the underlying biological interaction represented by the multiplicative and the epistatic models cannot be read off the fitted logistic regression models. The same holds true for the MDR approach. It is not possible to correctly identify biological interaction based on the sets of selected variables or based on the dendrograms, since the additive and the heterogeneity model as independence models cannot be distinguished from the four models representing biological interaction with either of these two criteria.
The results confirm previous studies that demonstrate the excellent modeling capacities of neural networks [32]. We investigated whether the weaker performance of the neural network, especially for the multiplicative model, might be due to a wrong model selection criterion. As an alternative to the AIC, we calculated the Bayes Information Criterion (BIC, see [53]) for all models (results not shown). However, employing the BIC for model selection does not improve the performance of the neural network relative to the logistic regression models. In fact, the stronger performance of the logistic regression model is presumably due to the fact that the multiplicative model exactly corresponds to the structural model of the logistic regression model.
It might be disputed whether the applied risk scenarios feature genotype relative risks that are too large to be meaningful for real-data applications. For the recessive epistatic model as the most extreme situation, alternative scenarios were investigated employing smaller risks. All investigated approaches have difficulties detecting these smaller risks. For the logistic regression models, the null model is mostly chosen, thus neglecting the elevated penetrance when both loci carry two mutated alleles.
Neural networks do not explicitly use interaction terms for modeling data. Unlike in logistic regression models, where an interaction term might become significant or not, there is no easy way to assess whether interaction is present using a neural network. Moreover, in models with one or more hidden layers there is no direct interpretation of the estimated parameters and the MLP is generally considered a black-box approach. This can be seen as the biggest drawback when employing neural networks for data analyses where interpretation is a major concern. However, the modeling capacities of a neural network allow it to adapt to practically any given data structure, including any interaction structure, which makes it an extremely powerful statistical tool. This advantage might be even more pronounced when modeling continuous variables, for example when modeling gene-environment interactions.
The use of neural networks in applications is currently still limited because of existing research gaps. In particular, the estimated weights are not yet interpretable. Nevertheless, neural networks offer a promising tool for exploratory analyses in candidate gene studies. For instance, they can well be applied when one is interested in odds ratios for single SNPs. The estimated odds ratios are more realistic than those estimated by logistic regression models in many situations, since the estimated output of neural networks better represents the underlying population. As initially stated, we did not explore the ability of neural networks for variable selection, which is a key problem in genome-wide association (GWA) studies.
Conclusions
We explored the ability of neural networks to model different types of biological gene-gene interactions and compared them to logistic regression models and the MDR approach. The latter methods do not allow reading off the underlying two-locus disease models. Neural networks do not explicitly include an interaction term but they are able to model any data structure. Even though the estimated weights are not interpretable, this makes them a powerful statistical tool. Further research should be devoted to developing a framework for interpreting the parameters estimated by a neural network to allow a broader use of these tools.
Appendix
In a case-control sample, the penetrances differ from those in the basic population. By Bayes' theorem, P^{s}(Y = 1 | G_{A} = i, G_{B} = j) = P^{s}(G_{A} = i, G_{B} = j | Y = 1) · P^{s}(Y = 1) / P^{s}(G_{A} = i, G_{B} = j), where P^{s} indicates a probability in a case-control sample. There are only changes in the joint probabilities of the genotypes P^{s}(G_{A} = i, G_{B} = j) because of the change of prevalence: P^{s}(Y = 1) = P^{s}(Y = 0) = 0.5.
This theoretical penetrance matrix of the sample is compared to the predicted penetrance matrices generated by the different models to judge the ability of neural networks and logistic regression models to model different two-locus disease models.
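The conversion from population penetrances to the penetrances expected in a balanced case-control sample can be sketched as follows. The sketch assumes that the genotype distributions among cases and among controls are the same in the sample as in the population and that only the prevalence changes to 0.5. The function name, the genotype frequencies and the penetrance values are hypothetical.

```python
def sample_penetrance(f, genotype_freq):
    """Penetrance matrix in a case-control sample with P(Y = 1) = 0.5.

    f             : 3 x 3 population penetrance matrix
    genotype_freq : 3 x 3 matrix of population genotype frequencies
    """
    cells = [(i, j) for i in range(3) for j in range(3)]
    prevalence = sum(genotype_freq[i][j] * f[i][j] for i, j in cells)
    result = [[0.0] * 3 for _ in range(3)]
    for i, j in cells:
        # Genotype probabilities given status, unchanged by the sampling.
        given_case = genotype_freq[i][j] * f[i][j] / prevalence
        given_control = (genotype_freq[i][j] * (1 - f[i][j])
                         / (1 - prevalence))
        # Bayes' theorem with equal shares of cases and controls.
        result[i][j] = (0.5 * given_case
                        / (0.5 * given_case + 0.5 * given_control))
    return result


# Hypothetical example: recessive epistatic model with c = 0.01, r = 5
# and equal frequencies for all nine genotype combinations.
f = [[0.05 if (i == 2 and j == 2) else 0.01 for j in range(3)]
     for i in range(3)]
freq = [[1 / 9] * 3 for _ in range(3)]
for row in sample_penetrance(f, freq):
    print([round(value, 4) for value in row])
```

Since cases are strongly over-represented relative to the population, the sample penetrances are much larger than the population penetrances, while their ordering across genotypes is preserved.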
Declarations
Acknowledgements
The authors thank Iris Pigeot for reading preliminary versions of the paper and for giving helpful comments and kind support. Additionally, we thank five anonymous reviewers for their valuable suggestions and remarks.
We gratefully acknowledge the financial support of this research by the grant PI 345/3-1 from the German Research Foundation (DFG).
Authors’ Affiliations
References
1. Cordell HJ: Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans. Hum Mol Gen. 2002, 11 (20): 2463-2468. 10.1093/hmg/11.20.2463.
2. North B, Curtis D, Sham PC: Application of logistic regression to case-control association studies involving two causative loci. Hum Hered. 2005, 59 (2): 79-87. 10.1159/000085222.
3. Foraita R, Bammann K, Pigeot I: Modeling gene-gene-interactions using graphical chain models. Hum Hered. 2008, 65: 47-56. 10.1159/000106061.
4. Wade MJ, Winther RG, Agrawal AF, Goodnight CJ: Alternative definitions of epistasis: dependence and interaction. Trends Ecol Evol. 2001, 16: 498-504. 10.1016/S0169-5347(01)02213-3.
5. Moore JH, Williams SM: Traversing the conceptual divide between biological and statistical epistasis: systems biology and a more modern synthesis. Bioessays. 2005, 27: 637-646. 10.1002/bies.20236.
6. Chen SH, Sun J, Dimitrov L, Turner AR, Adams TS, Meyers DA, Chang BL, Zheng SL, Grönberg H, Xu J, Hsu FC: A support vector machine approach for detecting gene-gene interaction. Genet Epidemiol. 2008, 32: 152-167. 10.1002/gepi.20272.
7. Amit Y, Geman D: Shape quantization and recognition with randomized trees. Neural Comput. 1997, 9: 1545-1588. 10.1162/neco.1997.9.7.1545.
8. Breiman L: Random forests. Mach Learn. 2001, 45: 5-32. 10.1023/A:1010933404324.
9. Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH: Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001, 69: 138-147. 10.1086/321276.
10. Hahn LW, Ritchie MD, Moore JH: Multifactor dimensionality reduction for detecting gene-gene and gene-environment interactions. Bioinformatics. 2003, 19: 376-382. 10.1093/bioinformatics/btf869.
11. Nelson MR, Kardia SLR, Ferrell RE, Sing CF: A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Res. 2001, 11: 458-470. 10.1101/gr.172901.
12. Millstein J, Conti DV, Gilliland FD, Gauderman WJ: A testing framework for identifying susceptibility genes in the presence of epistasis. Am J Hum Genet. 2006, 78: 15-27. 10.1086/498850.
13. Cook NR, Zee RYL, Ridker PM: Tree and spline based association analysis of gene-gene interaction models for ischemic stroke. Stat Med. 2004, 23: 1439-1453. 10.1002/sim.1749.
14. Ruczinski I, Kooperberg C, LeBlanc M: Logic regression. J Comput Graph Stat. 2003, 12 (3): 475-511. 10.1198/1061860032238.
15. Tibshirani R: Regression shrinkage and selection via the lasso. J Roy Stat Soc B. 1996, 58: 267-288.
16. Musani SK, Shriner D, Liu N, Feng R, Coffey CS, Yi N, Tiwari HK, Allison DB: Detection of gene × gene interactions in genome-wide association studies of human population data. Hum Hered. 2007, 63: 67-84. 10.1159/000099179.
17. Heidema AG, Boer JMA, Nagelkerke N, Mariman ECM, van der A DL, Feskens EJM: The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases. BMC Genet. 2006, 7: 23. 10.1186/1471-2156-7-23.
18. Briollais L, Wang Y, Rajendram I, Onay V, Shi E, Knight J, Ozcelik H: Methodological issues in detecting gene-gene interaction in breast cancer susceptibility: a population-based study in Ontario. BMC Med. 2007, 5: 22. 10.1186/1741-7015-5-22.
19. Milne RL, Fagerholm R, Nevanlinna H, Benítez J: The importance of replication in gene-gene interaction studies: multifactor dimensionality reduction applied to a two-stage breast cancer case-control study. Carcinogenesis. 2008, 29 (6): 1215-1218. 10.1093/carcin/bgn120.
20. Lanktree MB, Hegele RA: Gene-gene and gene-environment interactions: new insights into the prevention, detection and management of coronary artery disease. Genome Med. 2009, 1: 28. 10.1186/gm28.
21. Motsinger-Reif AA, Reif DM, Fanelli TJ, Ritchie MD: A comparison of analytical methods for genetic association studies. Genet Epidemiol. 2008, 32: 767-778. 10.1002/gepi.20345.
22. Sáez ME, Grilo A, Morón FJ, Manzano L, Martínez-Larrad MT, González-Pérez A, Serrano-Hernando J, Ruiz A, Ramírez-Lorca R, Serrano-Ríos M: Interaction between Calpain 5, Peroxisome proliferator-activated receptor-gamma and Peroxisome proliferator-activated receptor-delta genes: a polygenic approach to obesity. Cardiovasc Diabetol. 2008, 7: 23. 10.1186/1475-2840-7-23.
23. Branicki W, Brudnik U, Wojas-Pelc A: Interactions between HERC2, OCA2 and MC1R may influence human pigmentation phenotype. Ann Hum Genet. 2009, 73: 160-170. 10.1111/j.1469-1809.2009.00504.x.
24. Liu J, Sun K, Bai Y, Zhang W, Wang X, Wang Y, Wang H, Chen J, Song X, Xin Y, Liu Z, Hui R: Association of three-gene interaction among MTHFR, ALOX5AP and NOTCH3 with thrombotic stroke: a multicenter case-control study. Hum Genet. 2009, 125: 649-656. 10.1007/s00439-009-0659-0.
25. Qi Y, Niu WQ, Zhu TC, Liu JL, Dong WY, Xu Y, Ding SQ, Cui CB, Pan YJ, Yu GS, Zhou WY, Qiu CC: Genetic interaction of Hsp70 family genes polymorphisms with high-altitude pulmonary edema among Chinese railway constructors at altitudes exceeding 4000 meters. Clin Chim Acta. 2009, 405: 17-22. 10.1016/j.cca.2009.03.056.
26. Broberg K, Huynh E, Schläwicke Engström K, Björk J, Albin M, Ingvar C, Olsson H, Höglund M: Association between polymorphisms in RMI1, TOP3A, and BLM and risk of cancer, a case-control study. BMC Cancer. 2009, 9: 140. 10.1186/1471-2407-9-140.
27. Tang X, Guo S, Sun H, Song X, Jiang Z, Sheng L, Zhou D, Hu Y, Chen D: Gene-gene interactions of CYP2A6 and MAOA polymorphisms on smoking behavior in Chinese male population. Pharmacogenet Genomics. 2009, 19 (5): 345-352. 10.1097/FPC.0b013e328329893c.
28. Lucek PR, Ott J: Neural network analysis of complex traits. Genet Epidemiol. 1997, 14: 1101-1106. 10.1002/(SICI)1098-2272(1997)14:6<1101::AID-GEPI90>3.0.CO;2-K.
29. Ott J: Neural networks and disease association studies. Am J Med Genet. 2001, 105: 60-61. 10.1002/1096-8628(20010108)105:1<60::AID-AJMG1062>3.0.CO;2-L.
30. Flouris AD, Duffy J: Applications of artificial intelligence systems in the analysis of epidemiological data. Eur J Epidemiol. 2006, 21: 167-170. 10.1007/s10654-006-0005-y.
31. McKinney BA, Reif DM, Ritchie MD, Moore JH: Machine learning for detecting gene-gene interactions. Appl Bioinformatics. 2006, 5 (2): 77-88. 10.2165/00822942-200605020-00002.
32. Motsinger-Reif AA, Ritchie MD: Neural networks for genetic epidemiology: past, present, and future. BioData Min. 2008, 1: 3. 10.1186/1756-0381-1-3.
33. Koza JR, Rice JP: Genetic generation of both the weights and architecture for a neural network. Proc Int Joint Conf Neural Netw. 1991, IEEE Press, II: 397-404.
34. Ritchie MD, White BC, Parker JS, Hahn LW, Moore JH: Optimization of neural network architecture using genetic programming improves detection and modeling of gene-gene interactions in studies of human diseases. BMC Bioinformatics. 2003, 4: 28. 10.1186/1471-2105-4-28.
 Bush WS, Motsinger AA, Dudek SM, Ritchie MD: Can neural network constraints in GP provide power to detect genes associated with human disease?. Lect Notes Comput Sc. 2005, 3449: 4453.View ArticleGoogle Scholar
 Motsinger AA, Lee SL, Mellick G, Ritchie MD: GPNN: Power studies and applications of a neural network method for detecting genegene interactions in studies of human disease. BMC Bioinformatics. 2006, 7: 3910.1186/14712105739.PubMed CentralView ArticlePubMedGoogle Scholar
 Motsinger AA, Dudek SM, Hahn LW, Ritchie MD: Comparison of neural network optimization approaches for studies of human genetics. Lect Notes Comput Sc. 2006, 3907: 103114. full_text.View ArticleGoogle Scholar
 MotsingerReif AA, Fanelli TJ, Davis AC, Ritchie MD: Power of grammatical evolution neural networks to detect genegene interactions in the presence of error. BMC Res Notes. 2008, 1: 6510.1186/17560500165.PubMed CentralView ArticlePubMedGoogle Scholar
 MotsingerReif AA, Dudek SM, Hahn LW, Ritchie MD: Comparison of approaches for machinelearning optimization of neural networks for detecting genegene interactions in genetic epidemiology. Genet Epidemiol. 2008, 32: 325340. 10.1002/gepi.20307.View ArticlePubMedGoogle Scholar
 Risch N: Linkage strategies for genetically complex traits. I. Multilocus models. Am J Hum Genet. 1990, 46: 222228.PubMed CentralPubMedGoogle Scholar
 Li W, Reich J: A complete enumeration and classification of twolocus disease models. Hum Hered. 2000, 50: 334349. 10.1159/000022939.View ArticlePubMedGoogle Scholar
 Bishop CM: Neural networks for pattern recognition. 1995, New York: Oxford University PressGoogle Scholar
 McCullagh P, Nelder JM: Generalized linear models. 1983, London: Chapman and HallView ArticleGoogle Scholar
 HechtNielsen R: Neurocomputing. 1990, Reading: AddisonWesleyGoogle Scholar
 Riedmiller M: Advanced supervised learning in multilayer perceptrons  from backpropagation to adaptive learning algorithms. Int J Comput Stand Interf. 1994, 16: 265275. 10.1016/09205489(94)900175.View ArticleGoogle Scholar
 Bammann K: Auswertung von epidemiologischen FallKontrollStudien mit künstlichen neuronalen Netzen. PhD thesis. 2001, University of BremenGoogle Scholar
 Akaike H: Information theory and an extension of the maximum likelihood principle. Second international symposium on information theory. Edited by: Petrov BN, Csaki BF. 1973, Budapest: Academiai Kiado, 267281.Google Scholar
 R Development Core Team: R: A language and environment for statistical computing. 2008, R Foundation for Statistical Computing, Vienna, Austria, [ISBN 3900051070], [http://www.Rproject.org]Google Scholar
 Fritsch S, Günther F: neuralnet: Training of neural networks. 2008, [R package version 1.2], [http://cran.rproject.org/web/packages/neuralnet/index.html]Google Scholar
 Computational Genetics Laboratory: NorrisCotton Cancer Center and Dartmouth Medical School, Lebanon, New Hampshire, [http://www.epistasis.org/]
 Jakulin A, Bratko I: Analyzing attribute dependencies. Lect Notes Comput Sc. 2003, 2838: 229240.View ArticleGoogle Scholar
 Moore JH, Gilberta JC, Tsaif CT, Chiangf FT, Holdena T, Barneya N, Whitea BC: A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J Theor Biol. 2006, 241: 252261. 10.1016/j.jtbi.2005.11.036.View ArticlePubMedGoogle Scholar
 Schwarz G: Estimating the dimension of a model. Ann Stat. 1978, 6: 461464. 10.1214/aos/1176344136.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.