The complete compositional epistasis detection in genomewide association studies
 Xiang Wan^{1}Email author,
 Can Yang^{2},
 Qiang Yang^{3},
 Hongyu Zhao^{2} and
 Weichuan Yu^{4}
DOI: 10.1186/14712156147
© Wan et al.; licensee BioMed Central Ltd. 2013
Received: 14 November 2012
Accepted: 6 February 2013
Published: 19 February 2013
Abstract
Background
The detection of epistasis among genetic markers is of great interest in genomewide association studies (GWAS). In recent years, much research has been devoted to find diseaseassociated epistasis in GWAS. However, due to the high computational cost involved, most methods focus on specific epistasis models, making the potential loss of power when the underlying epistasis models are not examined in these analyses.
Results
In this work, we propose a computational efficient approach based on complete enumeration of twolocus epistasis models. This approach uses a twostage (screening and testing) search strategy and guarantees the enumeration of all epistasis patterns. The implementation is done on graphic processing units (GPU), which can finish the analysis on a GWAS data (with around 5,000 subjects and around 350,000 markers) within two hours. Source code is available at http://bioinformatics.ust.hk/BOOST.htmlâˆ–#GBOOST.
Conclusions
This work demonstrates that the complete compositional epistasis detection is computationally feasible in GWAS.
Keywords
Compositional epistasis SNP Genomewide association study GPUBackground
The concept of epistasis was first introduced in 1909 by Bateson and Mendel [1] to describe the masking effect of one locus over another locus. Today, it is broadly referred to as joint effects across different genes on phenotypes. The identifications of epistasis between two loci can offer insights on the complex biological pathways underlying human diseases [2]. With genomewide genotyping microarrays, it is possible to evaluate epistasis at the genomic level through the analysis of genomewide association studies (GWAS) where hundreds or thousands of subjects are genotyped at up to millions of single nuclear polymorphisms (SNPs). Because epistasis can involve markers with or without significant marginal effects [3–5], a comprehensive investigation of epistasis is a necessary step following the traditional single marker analysis in finding susceptibility markers of complex diseases. However, hundreds of billions of SNP pairs need to be considered if an exhaustive search is conducted and the significant computational cost has restrained researchers from conducting a full investigation of epistasis in GWAS.
Researchers generally distinguish three types of epistasis: functional epistasis, statistical epistasis, and compositional epistasis [6, 7]. Functional epistasis indicates molecular interactions in the biological context. Statistical epistasis [8] defines the joint behavior of two loci as the statistical deviation from their additive effects. Compositional epistasis maintains the original concepts given by Bateson and Mendel [1], which can be interpreted as twolocus epistasis models (see details in the Methods Section).
Estimating statistical epistasis between two loci requires the estimation of their additive main effects, which involves iterations (see details in the Methods Section). Because hundreds of billions of SNP pairs need to be measured for epistasis in a standard GWAS, any extra time spent on analyzing each pair will significantly increase the computational cost. To tackle this computational problem, many earlier methods [9–12] used a heuristic procedure that first removes all significant loci based on singlelocus tests and then computes the statistical epistasis of two loci with the sum of individual effects and interaction effects. Recently, [3] developed a noniterative method to approximate the likelihood ratio statistic, which make the detection of pure statistical epistasis (only the interaction effect) computationally feasible in GWAS. However, all the methods mentioned above may suffer from the issue where the underlying degree of freedom is lower than the one assumed in their statistical tests. This issue is mainly caused by the low minor allele frequency (MAF) of loci, which leads to the sparse contingency table in the test. To solve this issue, one solution is to test the compositional epistasis.
It has been argued that compositional epistasis is closer to the biological understanding of genegene interactions than statistical epistasis [6]. However, for each pair of loci, there are 512 epistatic patterns defined by compositional epistasis. There is a heavy computational burden in analyzing GWAS data if all these patterns are considered. To our knowledge, there is no method to find compositional epistasis in GWAS data.
In this article, we propose a fast approach to enable exhaustive search of compositional epistasis in GWAS. The proposed approach uses a twostage (screening and testing) search strategy. In the screening stage, only a limited number of epistatic patterns are evaluated for each pair of SNPs and those passing a specified threshold are selected. All nonsignificant pairs are filtered out and those pairs, which are significant in the test of compositional epistasis, will be kept in the remaining set. In the testing stage, we evaluate all epistatic patterns for each remaining pair. The implementation is done on graphic processing units (GPU), where the analysis of one GWAS data set (with around 5,000 subjects and around 350,000 markers) can be finished within a few hours.
Methods
SNPs are mostly biallelic genetic markers. In general, we use capital letters (e.g., A, B, ⋯) to denote the major alleles and lowercase letters (e.g., a, b, ⋯) to denote the minor alleles. For each SNP, there are three genotypes: the homozygous reference genotype (AA), the heterozygous genotype (Aa), and the homozygous variant genotype (aa). The popular way of coding the genotype is to use {1, 2, 3} to represent {A A,A a,a a}, respectively.
Epistasis tests
The statistical epistasis and the compositional epistasis are two major types of epistasis that have been considered in the literature. The statistical epistasis is defined as the statistical deviation from the additive effects of two loci on the phenotype [8]. One popular way to test the statistical epistasis is to use the likelihood ratio test. Given two SNPs X_{ p } and X_{ q }, there are three steps in such a procedure:

Fit the logistic regression model for only individual effect terms and obtain the MLE ${\widehat{L}}_{M}$$log\frac{P(Y=1{X}_{p},{X}_{q})}{P(Y=2{X}_{p},{X}_{q})}={\beta}_{0}+{\beta}_{i}^{{X}_{p}}+{\beta}_{j}^{{X}_{q}}.$(1)

Fit the logistic regression model for both individual effect terms and interaction terms and obtain the MLE ${\widehat{L}}_{F}$$log\frac{P(Y=1{X}_{p},{X}_{q})}{P(Y=2{X}_{p},{X}_{q})}={\beta}_{0}+{\beta}_{i}^{{X}_{p}}+{\beta}_{j}^{{X}_{q}}+{\beta}_{\mathit{\text{ij}}}^{{X}_{p}{X}_{q}}.$(2)

Conduct the χ^{2} test on $2\xb7({\widehat{L}}_{F}{\widehat{L}}_{M})$ with df = 4.
We call this test as interaction test. However, estimating the MLE ${\widehat{L}}_{M}$ involves iterations (the estimation of the MLE ${\widehat{L}}_{F}$has the closedform solution), which is computationally very expensive to evaluate hundreds of billions of pairs in GWAS. Therefore, many methods use a different procedure to estimate the epistasis.

Remove all significant SNPs based on the singlelocus test with a given threshold.

For every pair (X_{ p },X_{ q }) in the remaining SNPs,

Compute the loglikelihoods L_{ ∅ } of the null logistic regression model, defined as

Compute the loglikelihoods L_{ F } of the full logistic regression model in Eq.(2).

Conduct χ^{2} tests on 2·(L_{ F }−L_{ ∅ }) with 8 degrees of freedom.
$log\frac{p}{1p}={\beta}_{0}.$(3) 
We call the test with 8 degrees of freedom as full association test. In the full association test, a threshold is required to filter out the significant SNPs. Otherwise, it will produce many false epistasis involving one marginally significant SNP with an irrelevant one.
The full association test is totally different from the interaction test. It measures the sum of individual effects and interaction effects and thus its degrees of freedom is 8 while the interaction test only only measures the interaction effect with 4 degrees of freedom. Both tests have their pros and cons. In the full association test, it is very difficult to decide the threshold to filter out the significant SNPs. For a stringent threshold, many SNPs below the threshold may produce strong associations in the full model with a little interaction effect. For a loose threshold, some SNPs involved in true epistasis may be filtered out. In the interaction test, those epistasis involving SNPs having medium individual effects and meanwhile having medium interaction effect will be ignored. Most importantly, they all suffer from the issue where the underlying degree of freedom is lower than the one assumed in their statistical tests, which is caused by the low MAF. The relatively robust solution to tackle this issue is to use the test of compositional epistasis.
The definition of twolocus compositional epistasis
Two locus penetrance table
S N P_{2}=1  S N P_{2}=2  S N P_{2}=3  

S N P_{1}=1  p _{11}  p _{12}  p _{13} 
S N P_{1}=2  p _{21}  p _{22}  p _{23} 
S N P_{1}=3  p _{31}  p _{32}  p _{33} 
The dominant epistasis model
S N P_{2}=1  S N P_{2}=2  S N P_{2}=3  

S N P_{1}=1  0  0  0 
S N P_{1}=2  0  1  1 
S N P_{1}=3  0  1  1 
The trivial p_{ i j } in Table 1 will decrease the power of both the full association test and the interaction test. The compositional epistasis can solve this issue by reducing the 3×3 penetrance table into the 2×2 risk table according to the model definition.
The test of twolocus compositional epistasis
The genotype counts in controls ( Y = 0 ) and cases ( Y = 1 )
Y= 0  S N P_{ j }= 1  S N P_{ j }= 2  S N P_{ j }= 3  Y= 1  S N P_{ j }= 1  S N P_{ j }= 2  S N P_{ j }= 3 

S N P_{ i }= 1  n _{110}  n _{120}  n _{130}  S N P_{ i }= 1  n _{111}  n _{121}  n _{131} 
S N P_{ i }= 2  n _{210}  n _{220}  n _{230}  S N P_{ i }= 2  n _{211}  n _{221}  n _{231} 
S N P_{ i }= 3  n _{310}  n _{320}  n _{330}  S N P_{ i }= 3  n _{311}  n _{321}  n _{331} 
Risk table for testing the fit of an epistasis model
Low risk  High risk  

Control (Y=0)  a  b 
Case (Y=1)  c  d 
For S N P_{ i } and S N P_{ j } and each of 51 possible compositional epistasis models, the chisquare test statistic is calculated using Eq.(4). Those models with test statistics passing a given significance threshold will be considered as the possible interaction patterns of S N P_{ i } and S N P_{ j }.
Compositional epistasis detection in GWAS
In a typical GWAS, there are hundreds of billions of pairs of SNPs to be tested. It is computationally expensive to evaluate every possible compositional epistasis for all pairs of SNPs. However, it is widely believed that among the very large number of SNP pairs, only a small portion may be relevant with the disease trait. Therefore, it is a huge waste to test all SNP pairs to find significant compositional epistasis. If we can quickly compute the best fit of compositional epistasis model given the observed data for a SNP pair, we can first remove those pairs unlikely to be significant and then focus on evaluating all possible compositional epistasis model for the remaining SNP pairs. By doing so, the entire process will be substantially sped up. The approach in selecting the best splits for classification trees with categorical variables provides a solution to identify the compositional epistasis model best fitting the observe data.
In classification trees, leaves represent class labels, internal nodes represent features and branches represent conjunctions of features that induce class labels. In this work, class labels are phenotypes and features are genotypes. To construct a binary classification tree, a typical method iteratively searches all features for the best split. If the feature is categorical with M items, the number of all the possible splits is 2^{M−1}. However, for a twoclass classification problem, [14] proved the following theorem that reduces the search complexity into O(M).
Theorem 1. Suppose there is a categorical variable X taking categorical values from {1,2,⋯,M} in two classes, class Y = 0 and class Y= 1. The categories are arranged in the ascending order of P(Y = 1X = i). Then one of M − 1 splits, L = {1,⋯,m} and R = {m+1,⋯,M} where 1 ≤ m < M, minimize the misclassification rate.
Theorem 1 only holds for the twoclass problem. Some extensions to the multiclass problem have been proposed on the basis of Theorem 1 but they are only locally optimal.
The sorted ratio table for finding the maximum of chisquare statistics in the test of compositional epistasis
Y=0  s _{1}  s _{2}  s _{3}  s _{4}  s _{5}  s _{6}  s _{7}  s _{8}  s _{9} 
Y=1  t _{1}  t _{2}  t _{3}  t _{4}  t _{5}  t _{6}  t _{7}  t _{8}  t _{9} 
Ratio  r _{1}  r _{2}  r _{3}  r _{4}  r _{5}  r _{6}  r _{7}  r _{8}  r _{9} 
Based on Theorem 1, we propose a twostage (screening and testing) search method to find compositional epistasis in GWAS data.

In the screening stage, the method evaluates all SNP pairs by checking 8 splits to find an upper bound and remove pairs with the upper bound less than τ. The threshold τ corresponds to the significant threshold (with the Bonferroni correction) specified by users. Because the Bonferroni correction tends to be conservative, a smaller threshold can be used to put more SNP pairs into the testing stage. We set τ=20 in our method, which corresponds to the unadjusted pvalue 7.744×10^{−6}, which is a relatively liberal significance level for a genomewide study.

In the testing stage, the method checks each selected pair using all nonredundant compositional epistasis models. The p value for each model tested is adjusted by the Bonferroni correction, with the number of tests L(L−1)/2 where L is the total number of SNPs before screening.
GPU implementation
To accelerate the analysis process in GWAS, the proposed method is implemented using the parallel computation of graphical processing units (GPUs) (http://docs.nvidia.com/cuda/). The development of GPUs enables modern display cards to have hundreds of core at a low price, which can be easily set up for the largescale data analysis. To achieve a good speedup, our GPU implementation maximizes the coalesced memory access and makes full use of the texture memory. The coalesced memory access groups 16 consecutive global memory transactions into a single memory transaction. It is the key technique to save memory access time in CUDAenabled GPU. The texture memory is used for tasks with random memory access to improve the memory access speed. Our GPU implementation chooses the bit data structure and then fits the entire data into the GPU memory, which minimizes the overhead between the device and the host. The kernel program in our GPU implementation is designed with only a few registers being used and allows for a large number of concurrent threads. Without using GPU computing, our method needs around 120 hours to finish the genomewide compositional epistasis analysis of a typical data set (with around 5,000 subjects and around 350,000 markers) on a single workstation. The GPU enabled implementation can finish the same analysis in two hours.
Results
The compositional epistasis and statistical epistasis are two most commonly considered epistasis. In general, there are two types of statistical epistasis, named ‘Interaction’ and ‘Full Association’. In this section, we will evaluate these three types of epistasis using both simulated data and real data. To compare the statistical power among them, we have another issue of multiple test correction to consider. For each pair of SNPs, both the interaction test and the full association test compute one statistic and conduct the hypothesis test with the corresponding degrees of freedom. In the test of compositional epistasis, each SNP pair is associated with multiple epistatic patterns and thus with multiple statistics. In our comparison experiments, we choose the maximum one. Since we need to check 8 patterns to get the maximum statistic (see Theorem 1), we need to multiply the computed Pvalue with 8.
Simulation 1: epistasis with main effects
Data generation
The odds tables for four epistasis models
model 1  BB  Bb  bb  model 2  BB  Bb  bb 

AA  α  α  α  AA  α  α(1+θ)  α(1+θ) 
Aa  α  α(1+θ)  α(1+θ)^{2}  Aa  α(1+θ)  α  α 
aa  α  α(1+θ)^{2}  α(1+θ)^{4}  aa  α(1+θ)  α  α 
model 3  BB  Bb  bb  model 4  BB  Bb  bb 
AA  α  α  α(1+θ)  AA  α  α(1+θ)  α 
Aa  α  α(1+θ)  α  Aa  α(1+θ)  α  α(1+θ) 
aa  α(1+θ)  α  α  aa  α  α(1+θ)  α 
Performance comparison
Simulation 2: epistasis without main effects
Simulation 3: type1 error rate
Experiments on seven data sets from WTCCC
The number of SNP pairs identified from the WTCCC data sets of seven diseases under different tests
BD  CAD  CD  HT  RA  T1D  T2D  

Compositional Epistasis  0  0  17  0  47  234  3 
Interaction  0  0  1  0  0  317  0 
Full Association  0  0  0  0  10  346  0 
T1D
RA
For RA, the test of compositional epistasis reports 47 pairs, which includes the 10 pairs reported by the test of full association. The test of interaction does not report any significant pairs. A careful inspection of these pairs reveals that the epistatic effect of these pairs consists of partial individual effects and partial interaction effects. Among 47 reported pairs, 43 pairs involve SNP rs2107191 and the paired SNPs are all located in a very generich region (the genome location is from 29,778,109 to 30,363,351). There are about 31 pairs involving SNP rs2107191 displaying a recessiveinterference pattern (M2) [13]. The SNP rs2107191 is located very closely with gene OR2H1, which has been reported as a susceptibility locus for RA [18]. The bottom panel of Figure 6 provides all identified compositional epistasis patterns in RA. It can be observed that T1D and RA have different epistasis patterns. A further investigation on these patterns may reveal a new direction on the study of the etiology of RA and T1D.
Discussions
In this work, we have focused on the genomewide casecontrol studies; i.e., the disease phenotype can be represented as a binary variable. In its current testing, the compositional epistasis can not be easily extended to consider continuous phenotypes. Moreover, the current work only detect twoway compositional epistasis. However, we note that there is no widely accepted definition of highorder compositional epistasis. These issues are worth pursuing in the future.
Conclusions
Studying the epistasis between two loci is a natural step following traditional and wellestablished single locus analysis. In this paper, we have proposed a computationally efficient and statistically sound method to test compositional epistasis in GWAS data. The method is applicable to casecontrol studies and consists of a twostep (screening and testing) process. In the screening stage, only a limited number of epistatic patterns are evaluated for each pair of SNPs and those passing a specified threshold are selected to be more thoroughly studied in the testing stage, where all epistatic patterns for each selected pair are evaluated. The method is implemented using the parallel computational capability of commercially available GPUs to greatly reduce the computation time involved. We have successfully applied our method to analyze seven data sets from the WTCCC. Our experimental results demonstrate that the complete compositional epistasis detection is computationally feasible in GWAS.
Declarations
Acknowledgements
This work was partially supported with grants RPC10EG04 from the Hong Kong University of Science and Technology.
Authors’ Affiliations
References
 Bateson W, Mendel G: Mendel’s Principles of Heredity. 1909, Cambridge: Cambridge University PressView Article
 Carlborg Ö, Haley C: Epistasis: too often neglected in complex trait studies?. Nat Rev Genet. 2004, 5 (8): 618625. 10.1038/nrg1407.View ArticlePubMed
 Wan X, Yang C, Yang Q, Xue H, Fan X, Tang N, Yu W: BOOST: a boolean representationbased method for detecting SNPSNP interactions in genomewide association studies. Am J Human Genet. 2010, 87 (3): 325340. 10.1016/j.ajhg.2010.07.021.View Article
 Ritchie M, Hahn L, Roodi N, Bailey L, Dupont W, Parl F, Moore J: Multifactordimensionality reduction reveals highorder interactions among estrogenmetabolism genes in sporadic breast cancer. Am J Human Genet. 2001, 69: 138—147View Article
 Ritchie M, Hahn L, Moore J: Power of multifactor dimensionality reduction for detecting genegene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet Epidemiol. 2003, 24 (2): 150—157View ArticlePubMed
 Phillips PC: Epistasisthe essential role of gene interactions in the structure and evolution of genetic systems. Nat Rev Genet. 2008, 9 (11): 855867. 10.1038/nrg2452.PubMed CentralView ArticlePubMed
 Wan X, Yang C, Yu W: Comments on ’An empirical comparison of several recent epistatic interaction detection methods’. Bioinformatics. 2012, 28: 145146. 10.1093/bioinformatics/btr596.View ArticlePubMed
 Fisher RA: The correlations between relatives on the supposition of Mendelian inheritance. Philos Trans R Soc Edinb. 1918, 52: 399433.View Article
 Zhang Y, Liu J: Bayesian inference of epistatic interactions in casecontrol studies. Nat Genet. 2007, 39: 11671173. 10.1038/ng2110.View ArticlePubMed
 Schwarz D, König I, Ziegler A: On safari to random jungle: a fast implementation of random forests for highdimensional data. Bioinformatics. 2010, 26 (14): 17521758. 10.1093/bioinformatics/btq257.PubMed CentralView ArticlePubMed
 Zhang X, Huang S, Zou F, Wang W: TEAM: efficient twolocus epistasis tests in human genomewide association study. Bioinformatics. 2010, 26 (12): 217227. 10.1093/bioinformatics/btq186. [http://bioinformatics.oxfordjournals.org/content/26/12/i217.abstract]View Article
 Wu J, Devlin B, Ringquist S, Trucco M, Roeder K: Screen and clean: a tool for identifying interactions in genomewide association studies. Genet Epidemiol. 2010, 34 (3): 275285.PubMed CentralPubMed
 Li W, Reich J: A complete enumeration and classification of twolocus disease models. Hum Hered. 2000, 50: 334349. 10.1159/000022939.View ArticlePubMed
 Breiman L, Friedman J, Olshen R, Stone C: Classification and Regression Trees. 1984, Belmont, CA: Wadsworth & Brooks
 Shih YS: Families of splitting criteria for classification trees. Stat Comput. 1999, 9: 309315. 10.1023/A:1008920224518.View Article
 Lechler R, Warrens A: HLA in Health and Disease. 2000, San Diego, CA: Academic Press
 The Wellcome Trust Case Control Consortium: Genomewide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nat. 2007, 447 (7145): 661678. 10.1038/nature05911.View Article
 Orozco G, Barton A, Eyre S, Ding B, Worthington J, Ke X, Thomson W: HLADPB1COL11A2 and three additional xMHC loci are independently associated with RA in a UK cohort. Genes Immun. 2011, 12 (3): 169175. 10.1038/gene.2010.57.View ArticlePubMed
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.