 Methodology article
 Open Access
 Published:
Identifying and exploiting genepathway interactions from RNAseq data for binary phenotype
BMC Genetics volume 20, Article number: 36 (2019)
Abstract
Background
RNA sequencing (RNAseq) technology has identified multiple differentially expressed (DE) genes associated to complex disease, however, these genes only explain a modest part of variance. Omnigenic model assumes that disease may be driven by genes with indirect relevance to disease and be propagated by functional pathways. Here, we focus on identifying the interactions between the external genes and functional pathways, referring to genepathway interactions (GPIs). Specifically, relying on the relationship between the garrote kernel machine (GKM) and variance component test and permutations for the empirical distributions of score statistics, we propose an efficient analysis procedure as Permutation based gEnepAthway interaction identification in binary phenotype (PEA).
Results
Various simulations show that PEA has wellcalibrated type I error rates and higher power than the traditional likelihood ratio test (LRT). In addition, we perform the gene set enrichment algorithms and PEA to identifying the GPIs from a pancancer data (GES68086). These GPIs and genes possibly further illustrate the potential etiology of cancers, most of which are identified and some external genes and significant pathways are consistent with previous studies.
Conclusions
PEA is an efficient tool for identifying the GPIs from RNAseq data. It can be further extended to identify the interactions between one variable and one functional set of other omics data for binary phenotypes.
Background
RNA sequencing (RNAseq) technology has identified amounts of significant genes and given some evidence for the diagnosis and treatment of complex disease, especially cancers [1, 2]. Most of existing statistical methods focus on identifying the differentially expressed (DE) genes and heritability estimation by the RNAseq count data [3,4,5,6,7]. However, with the assumption that only minority of genes associate with phenotypes, these models inevitably lose the regulation information from DE genes, thus are hard to elucidate the etiology and mechanism [8,9,10]. Systematic characterization of the biological function of genes represents an important step for investigating the molecular mechanisms underlying the identified disease associations. Enrichment analysis methods are based on different ideas: some only including genes participating in pathways and some considering the regulations between genes in networks [11,12,13,14].
Furthermore, omnigenic model assumes that disease may be driven by genes with indirect relevance to disease and be propagated by functional pathways. These external genes may cause the disease by distantly regulating significant pathways and they may explain most heritability [15]. For transcriptome data, one common sense is that the core gene effects can be understood by their interactions within underlying pathways or any expressed gene [16, 17]. As a result, identifying the interactions between external genes and significant pathways (GPIs) holds for further understanding of etiology and improving the prediction ability [18].
Considering the potential importance of interactions in defining the genetic architecture of complex traits and inefficiency of traditional methods in highdimensional data, emerging statistical methods have been implemented to identify interactions with low calculation resources [19]. Different algorithms use different ideas, such as set tests [20, 21] and searching algorithms (exhaustive searching and prioritization based on the gene set) [22, 23]. Due to lacking confidence and biological priority and highdimensional searching spaces, these methods may lose power.
Moreover, the omnibus test is widely used to identify the sets from both single and multiple levels, even from different datasets [24,25,26,27,28,29]. The joint test is more efficient and scalable because of low computational consumption, reduction of the degree of freedom and no estimation of variance components. On the other hand, kernelbased methods have been proposed to estimate association of genetic variants with complex traits [20, 28, 30,31,32]. A general kernel machine method can account for complex nonlinear genes and interactions effects. Though the application of kernelbased methods in genome wide association studies (GWASs) has been reported in the literature, our method applies the idea to identify GPIs of the transcriptome data [32, 33].
Here, noting the similarity between the mixed model and kernel function, we develop a statistical test to identify the GPIs for binary phenotypes. The model possibly solves the two challenges. First, our model is testing the GPIs in the binary phenotype framework. To do so, we firstly use two enrichment analysis of RNAseq data, including gene set enrichment analysis (GSEA), DAVID and MinePath [11,12,13]. Second, the model is quite similar to the garrote kernel machine (GKM), but the parameter estimation procedure is quite different [20]. We refer to the statistical method as the Permutation based gEnepAthway interaction identification in binary phenotypes (PEA). We provide a method overview of PEA, including the parameter estimation and hypothesis testing. Extensive simulations show that compared to the traditional likelihood ratio test (LRT) for generalized linear models, PEA has higher areas under curve (AUCs) with controllable type I error rates. In addition, the parameter estimation is more accurate. We apply our method to analyze platelet RNAseq data from a casecontrol study (GSE68086) [1]. PEA can also be applied to analyze other interactions in binary phenotypes, such as pathwayenvironment interactions. PEA is implemented as a Rcpp function, freely available at https://github.com/biostat0903/RNAseqDataAnalysis.
Methods
Model
We model GPIs in binary phenotypes as following:
where y indicates the binary phenotypes (i = 1, … , N), X, an N × m matrix, is supposed as the covariates, P, an N × P matrix, is assumed as the expression levels in a significant pathway, which can be calculated from some gene enrichment analyses, P_{j} ∙ G indicate the GPIs (Fig. 1). We also suppose γ = P(y = 1). β_{C}, β_{P}, β_{G} and ξ_{G} are the coefficients of the covariates, functional pathway genes, external gene and GPIs, respectively.
LRT based on chisquared statistics is a traditional method for testing of interactions for generalized linear models. The chisquared statistic is the multiplication of − 2 by the logarithm of the ratio of likelihood of the full model and that of the model without interactions. Unfortunately, as the highdimensional and complicated relation of the variables, traditional methods are always inefficient.
Garrote kernel machine (GKM)
A kernel function is suitable to suggest the complicated relationship, including both linear and nonlinear relations, between genes and phenotypes. Here, we extend the linear GKM for the binary phenotype to identify the GPIs, although many other kernel functions can be selected. The kernel function is shown as following:
where Z_{k} = (G_{k}, P_{k}), K(Z_{k}, Z_{l}) is the kernel matrix of kth and lth individuals. We then test for the effect of the GPIs by considering the null hypothesis H_{0} : δ = 0.
Parameter estimation
With the kernel function, the Eq. (1) can be rewritten as a semiparametric model as follows:
where h = (h_{1}, h_{2}, … , h_{N})^{T} is an unknown centered smooth function vector. h can be parameterized for different forms of GPIs, such as the Gaussian kernel and d th ploynomial kernel. As the similarity between the semiparametric model and mixed effect model, the h can be assumed as the random effects following a multivariate normal distribution \( \mathcal{N}\left(0,\tau \mathbf{K}\left(\updelta \right)\right) \). The relationship between the unknown function and the kernel function is as follows:
where \( {\kappa}_i^T=\left(\mathbf{K}\left({\mathbf{Z}}_i,{\mathbf{Z}}_1\right),\mathbf{K}\left({\mathbf{Z}}_i,{\mathbf{Z}}_2\right),\dots, \mathbf{K}\left({\mathbf{Z}}_i,{\mathbf{Z}}_N\right)\right) \) and α is an unknown scale parameter vector.
Integrating the kernel function and logistic regression, the loglikelihood function is as following:
Furthermore, as the loss of the degree of freedom due to the maximum likelihood estimation of fixed effects, the estimate of the variance component τ is obtained by optimizing the restricted maximum loglikelihood (REML) function as following:
where \( \overset{\sim }{\mathbf{y}}=\mathbf{X}{\boldsymbol{\upbeta}}_{\boldsymbol{C}}+\mathbf{K}\boldsymbol{\upalpha } +{\mathbf{D}}^{\mathbf{1}}\left(\mathbf{y}\boldsymbol{\upgamma} \right) \) and V = D^{−1} + τK. τ is estimated by the NewtonRaphson algorithm with damping factor ω = 0.5. The iteration formula is as following:
where Q = V^{−1} − V^{−1}X(X^{T}V^{−1}X)^{−1}X^{T}V^{−1} and t is the iteration time.
α and β_{C} is estimated by the NewtonRaphson algorithm with normal equations as following:
Hypothesis testing
As the similarity between the mixed model and semiparametric model, we propose a score test statistic U to test δ:
As the \( \frac{\partial \mathbf{V}}{\mathrm{\partial \updelta }} \) is not a semidefinite matrix, U is not identically greater than zero. Its distribution is hard to use the mixed chisquared distribution to approximate. Therefore, we propose a permutation test to obtain the empirical distribution of U. Since the null hypothesis is δ = 0, we permute the expression level of the external gene and calculate the test statistic without reestimation of the β_{C}, α and τ. The computation consumption is not large, although we use a permutation test.
Simulation
Compared to the traditional LRT, simulations investigate the statistical property of U and the estimation of β_{C}. Type I error rates and AUCs show the statistical property of statistics. Means and standard errors indicate the estimation property. The expressed levels of the functional pathway are generated from correlated uniform distributions by a Copula function, and that of the external gene is generated from a uniform distribution (\( \mathcal{U}\left(0,1\right) \)). We study five parameters in simulations: sample size (N), number of genes in the functional pathway (v), correlation between genes in the pathway (c), proportion of interaction genes (p) and odds ratio (OR) of the external gene (s). Details are shown in the Table 1. The interaction function h for the i th individual is defined by a function g as following:
h can be defined as linear and nonlinear settings: h(Z_{i}) = g(Z_{i}) and h(Z_{i}) = g(Z_{i})^{2}.
We simulate 1000 times at the null hypothesis (s = 0) and 100 times at the alternative hypothesis (s ≠ 0). Combing the 900 P values randomly selected from the null and 100 P values at the alternative, we can calculate the AUC. For each loop, we permute the label 50,000 times to obtain robust results.
Real data analysis
We use our method to analyze RNAseq data of 283 platelet samples (55 healthy donors and 228 tumor samples) [1]. The tumor data is collected from six different cancers, including breast cancer (BrCa, n = 35), nonsmall cell lung carcinoma (NSCLC, n = 60), glioblastoma (GBM, n = 39), colorectal cancer (CRC, n = 41), pancreatic cancer (PAAD, n = 35) and hepatobiliary cancer (HBC, n = 14). The covariates are gender and age. We delete the individuals with missing data, and the final sample size is 274. The tumor samples are regarded as cases to perform pancancer analysis [1].
qTo ensure validity and reasonability, we follow the data processing steps of the original paper. First, we exclude the genes with total counts less than 5 and with logarithmic counts per million (LogCPM) less than 3. We only select 5003 genes for subsequent analysis. Second, we use the trimmed mean of Mvalue (TMM) normalization data for following analysis. Final, edgeR with tagwise and common dispersion is applied to select the external genes with the threshold of false discovery rate (FDR) < 0.001. We can obtain 2500 DE genes, including 1231 deregulated and 1269 upregulated genes.
After the filtering and normalizing, we perform three gene enrichment methods to obtain the functional pathways, including DAVID, GSEA and MinePath. DAVID and GSEA are performed by both Gene Ontology (GO) dataset and Kyoto Encyclopedia of Genes and Genomes (KEGG) dataset [34, 35]. MinePath is only based on KEGG dataset. The significant pathways are defined with FDR < 0.001 of DAVID, FDR < 0.25 of GSEA and FDR < 0.05 of MinePath. biomaRt package is used to map Ensembl ID to corresponding Entez ID and gene symbol [36]. For the PEA model, we normalize RNASeq data to the numbers between 0 and 1.
Results
Simulations
Simulations evaluate the performances of PEA and traditional LRT by type I error rates and AUCs. In the type I error simulations, we use four different settings: different interaction function settings, N, v and c. In the power simulations, we add s and p. For the type I error simulations, the response variable is not affected by the effects of the external gene and its interactions. All the results are shown in Figs. 2, 3, 4 and 5 and Additional files 1 and 2. The significant level is set to be 0.05.
As expected, PEA controls type I error rates in all parameter settings, but the traditional LRT fails. Especially, in the nonlinear setting, the type I error rates of PEA are close to the significant level, which shows this method properly controls the type I error rates. When v = 30, the traditional LRT is inefficient because of the unbalance between the number of variables and the sample size setting.
As the traditional method is uncontrollable for type I error rates, AUC is used to evaluate the performances of the two methods in the alternative simulations. When v = 10, it is clear that PEA is better than LRT in each setting, especially for the nonlinear scenarios. When v = 20 and v = 30, the traditional LRT is better than PEA with the nonlinear settings and N = 100. Increasing s and N improves power of PEA, but increasing p and c decreases power of PEA. The linear settings possibly lead to higher power. The minimum sample size is suggested as 100 by the different N settings.
Real data analysis
After the enrichment analyses of DAVID and GSEA, we obtain 67 functional pathways (Additional file 3). We identify the GPIs between 2500 DE genes and four biological pathways: RNA processing (GO:0006396, size = 285), RNA splicing (GO:0008380, size = 127), cytosolic ribosome (GO:0022626, size = 80) and cytoplasmic translation (GO:0002181, size = 25). We identified 116, 109, 89 and 124 significant external genes for the different pathways at the nominal level 0.05, respectively. After filtering by the FDR < 0.2, the significant gene numbers are 10, 17, 54 and 83.
After the enrichment analyses of DAVID, GSEA and MinePath, we obtain 2 functional pathways (Additional file 4). We identify the GPIs between 2500 DE genes and two biological pathways: Platelet activation (hsa04611, size = 70) and Fc gamma Rmediated phagocytosis (hsa04666, size = 49). We identified 468 and 475 significant external genes for the two pathways at nominal level 0.05, respectively. After filtering by the FDR < 0.2, the significant gene numbers are 119 and 120.
As the different sizes of the four pathways, the power is different for the four pathways. The result of FDR controlling is similar to that of the simulations. Interestingly, we define multiple same genes for the four GO pathways, such as TUBB, BPI and CA1. Two KEGG pathways also interact with 11 common genes, such as SDCBP, TRAT1 and FABP4. The summary of the result is shown in Table 2.
Discussion
Here, we present an effective semiparametric generalized linear model, together with a computationally efficient parameter estimation method and software implementation of PEA, for identifying potential GPIs of RNAseq data in binary phenotypes. PEA models the complicated relationship between gene expression and traits using the kernel function. Because the kernel machine can be adaptive for both linear and nonlinear interactions, PEA controls type I error rates in the presence of individual relatedness, and PEA achieves higher power than the traditional method, LRT, across a range of settings. In addition, PEA is available to other interactions of different molecules, such as methylation and gene expression interactions, and biological or technical covariates interactions. We have demonstrated the benefits of PEA using both simulations and applications to recently published RNAseq datasets.
While PEA is an extension of the GKM, we note that PEA exploits the GPIs in binary phenotypes and estimates the parameters using the damping NewtonRaphson algorithm. As most of medical studies are casecontrol design, PEA identifies the GPIs for the binary traits. For example, samples are collected from tumor tissue and normal tissue, along with some covariates, such as age, gender and so on. Two NewtonRaphson iterations accurately estimate the coefficients of covariates (Fig. 5). GKM estimates τ by the optimx function in R using the NelderMead method, which is not suitable for PEA.
In the real data analysis, the result of PEA can support the assumption of omnigenic model. Although PEA goes beyond the scope of enrichment analysis, efficient enrichment analysis methods, such as MinePath, can essentially provide the robust and reliable pathways before conducting PEA. PEA identify not only some significant GPIs, but also the external genes for different pathways. Some significant genes are verified by multiple biological studies. For example, using the data from The Cancer Genome Atlas (TCGA), TUBB expression level influences the survival time of renal and liver cancer [37]. Kelly et al. demonstrate that the mutation of TUBB possibly cause the tumor cell growth and taxane resistance for the patients with NSCLC [38]. From the result of PEA, TUBB might be an important external gene, which associates cancers by the interactions with amount significant pathways.
Currently, despite the newly developed computationally efficient statistical testing method, applications of PEA can still be limited by its relatively heavy computational cost of permutations. Traditional permutation tests increase the computation consumption with the shuffling times, but the permutations of PEA are faster than the standard permutation test because of no parameter reestimation in each shuffling. PEA can still take close to 7 h with 5 CPUs to analyze a dataset of the size of the GSE68086 which we considered here (274 individuals ~ 2500 genes for one pathway). Therefore, PEA will be used to analyze other datasets that have large sizes.
Conclusions
PEA is an efficient and powerful statistical method for identifying the GPIs from RNAseq data. It can be further extended to identify the interactions between one variable and one functional set of other omics data for binary phenotypes. Further work is needed to make its widely use in more highdimensional genomics data analysis practice.
Abbreviations
 AUC:

Area under curve
 BrCa:

Breast cancer
 CRC:

Colorectal cancer
 DE:

Differentially expressed
 FDR:

False discovery rate
 GBM:

Glioblastoma
 GKM:

Garrote kernel machine
 GO:

Gene Ontology
 GPI:

Genepathway interaction
 GSEA:

Gene set enrichment analysis
 GWAS:

Genome wide association study
 HBC:

Hepatobiliary cancer
 KEGG:

Kyoto Encyclopedia of Genes and Genomes
 LogCPM:

Logarithmic counts per million
 LRT:

Likelihood ratio test
 NSCLC:

Nonsmall cell lung carcinoma
 PAAD:

Pancreatic cancer
 PEA:

Permutation based gEnepAthway interaction identification for binary phenotypes
 REML:

Restricted maximum likelihood
 RNAseq:

RNA sequencing
 TCGA:

The Cancer Genome Atlas
 TMM:

Trimmed mean of Mvalue
References
 1.
Best MG, Sol N, Kooi I, Tannous J, Westerman BA, Rustenburg F, Schellen P, Verschueren H, Post E, Koster J. RNASeq of tumoreducated platelets enables bloodbased pancancer, multiclass, and molecular pathway cancer diagnostics. Cancer Cell. 2015;28(5):666–76.
 2.
Chen H, Li C, Peng X, Zhou Z, Weinstein JN, Liang H, Network CGAR. A pancancer analysis of enhancer expression in nearly 9000 patient samples. Cell. 2018;173(2):386–99 e312.
 3.
Sun S, Hood M, Scott L, Peng Q, Mukherjee S, Tung J, Zhou X. Differential expression analysis for RNAseq using Poisson mixed models. Nucleic Acids Res. 2017;45(11):e106.
 4.
Sun S, Zhu J, Mozaffari S, Ober C, Chen M, Zhou X. Heritability estimation and differential analysis of count data with generalized linear mixed models in genomic sequencing studies. Bioinformatics. 2018;35(3):487–96.
 5.
Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40.
 6.
Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11(10):R106.
 7.
Yang S, Shao F, Duan W, Zhao Y, Chen F. Variance component testing for identifying differentially expressed genes in RNAseq data. PeerJ. 2017;5:e3797.
 8.
Barabási AL, Gulbahce N, Loscalzo J. Network medicine: a networkbased approach to human disease. Nat Rev Genet. 2011;12(1):56.
 9.
Yuan Z, Ji J, Zhang X, Xu J, Ma D, Xue F. A powerful weighted statistic for detecting group differences of directed biological networks. Sci Rep. 2016;6:34159.
 10.
Yuan Z, Ji J, Zhang T, Liu Y, Zhang X, Chen W, Xue F. A novel chisquare statistic for detecting group differences between pathways in systems epidemiology. Stat Med. 2016;35(29):5512–24.
 11.
Koumakis L, Kanterakis A, Kartsaki E, Chatzimina M, Zervakis M, Tsiknakis M, Vassou D, Kafetzopoulos D, Marias K, Moustakis V. MinePath: mining for phenotype differential subpaths in molecular pathways. PLoS Comput Biol. 2016;12(11):e1005187.
 12.
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES. Gene set enrichment analysis: a knowledgebased approach for interpreting genomewide expression profiles. Proc Natl Acad Sci. 2005;102(43):15545–50.
 13.
Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2008;4(1):44.
 14.
Tarca AL, Draghici S, Khatri P, Hassan SS, Mittal P, Kim Js, Kim CJ, Kusanovic JP, Romero R. A novel signaling pathway impact analysis. Bioinformatics. 2008;25(1):75–82.
 15.
Boyle EA, Li YI, Pritchard JK. An expanded view of complex traits: from polygenic to omnigenic. Cell. 2017;169(7):1177–86.
 16.
Ideker T, Nussinov R. Network approaches and applications in biology. PLoS Comput Biol. 2017;13(10):e1005771.
 17.
Peedicayil J, Grayson DR. An epigenetic basis for an omnigenic model of psychiatric disorders. J Theor Biol. 2018;443:52.
 18.
Visscher PM. Challenges in understanding common disease. Genome Med. 2017;9(1):112.
 19.
Crawford L, Zeng P, Mukherjee S, Zhou X. Detecting epistasis with the marginal epistasis test in genetic mapping studies of quantitative traits. PLoS Genet. 2017;13(7):e1006869.
 20.
Maity A, Lin X. Powerful tests for detecting a gene effect in the presence of possible gene–gene interactions using garrote kernel machines. Biometrics. 2011;67(4):1271–84.
 21.
Ma L, Clark AG, Keinan A. Genebased testing of interactions in association studies of quantitative traits. PLoS Genet. 2013;9(2):e1003321.
 22.
Zhang X, Huang S, Zou F, Wang W. TEAM: efficient twolocus epistasis tests in human genomewide association study. Bioinformatics. 2010;26(12):i217–27.
 23.
Lewinger JP, Morrison JL, Thomas DC, Murcray CE, Conti DV, Li D, Gauderman WJ. Efficient twostep testing of genegene interactions in genomewide association studies. Genet Epidemiol. 2013;37(5):440–51.
 24.
Huang YT. Integrative modeling of multiple genomic data from different types of genetic association studies. Biostatistics. 2014;15(4):587–602.
 25.
Huang YT, Liang L, Moffatt MF, Cookson WO, Lin X. iGWAS: integrative genomewide association studies of genetic and genomic data for disease susceptibility using mediation analysis. Genet Epidemiol. 2015;39(5):347–56.
 26.
Broadaway KA, Duncan R, Conneely KN, Almli LM, Bradley B, Ressler KJ, Epstein MP. Kernel approach for modeling interaction effects in genetic association studies of complex quantitative traits. Genet Epidemiol. 2015;39(5):366–75.
 27.
Kwee LC, Liu D, Lin X, Ghosh D, Epstein MP. A powerful and flexible multilocus association test for quantitative traits. Am J Hum Genet. 2008;82(2):386–97.
 28.
Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rarevariant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89(1):82–93.
 29.
Huang YT, VanderWeele TJ, Lin X. Joint analysis of SNP and gene expression data in genetic association studies of complex diseases. Ann Appl Stat. 2014;8(1):352.
 30.
Schaid DJ. Genomic similarity and kernel methods I: advancements by building on mathematical and statistical foundations. Hum Hered. 2010;70(2):109–31.
 31.
Schaid DJ. Genomic similarity and kernel methods II: methods for genomic information. Hum Hered. 2010;70(2):132–40.
 32.
Li S, Cui Y. Genecentric gene–gene interaction: a modelbased kernel machine method. Ann Appl Stat. 2012;6(3):1134–61.
 33.
Rhodes DH, Hoffmann L, Rooney WL, Herald TJ, Bean S, Boyles R, Brenton ZW, Kresovich S. Genetic architecture of kernel composition in global sorghum germplasm. BMC Genomics. 2017;18(1):15.
 34.
Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27–30.
 35.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT. Gene ontology: tool for the unification of biology. Nat Genet. 2000;25(1):25.
 36.
Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A, Huber W. BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics. 2005;21(16):3439–40.
 37.
Uhlén M, Fagerberg L, Hallström BM, Lindskog C, Oksvold P, Mardinoglu A, Sivertsson Å, Kampf C, Sjöstedt E, Asplund A. Tissuebased map of the human proteome. Science. 2015;347(6220):1260419.
 38.
Kelley MJ, Li S, Harpole DH. Genetic analysis of the βtubulin gene, TUBB, in nonsmallcell lung cancer. J Natl Cancer Inst. 2001;93(24):1886–8.
Acknowledgements
We are grateful to the associate editor and the reviewers for their valuable suggestions and comments which have improved our manuscript. We thank all group members of pancancer study (GSE68086) for making the clinical and RNAseq data publicly available.
Funding
This work was supported by the National Natural Science Foundation of China (Nos. 81703321, 81502888 and 81872709), the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (17KJB330005), the Jiangsu Shuangchuang Plan, and the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD). None of the funding bodies played any part in the design of the study or the collection, analysis, or interpretation of data, nor in writing of the manuscript.
Availability of data and materials
The pancancer data is publicly available from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE68086. The PEA software is available from https://github.com/biostat0903/RNAseqDataAnalysis.
Author information
Affiliations
Contributions
FS, YW, YZ and SY conceived and designed the experiment, FS and SY built the model and wrote the manuscript, SY implemented the software, YW tested the software and prepared the real data. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Sheng Yang.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional files
Additional file 1:
AUCs of PEA and traditional LRT in different interaction function settings, N, c, s, p and v = 20. (A) linear interaction function settings with p = 0.8 (left) and p = 1 (right); (B) nonlinear interaction function settings with p = 0.8 (left) and p = 1 (right). (PDF 27 kb)
Additional file 2:
AUCs of PEA and traditional LRT in different interaction function settings, N, c, s, p and v = 30. (A) linear interaction function settings with p = 0.8 (left) and p = 1 (right); (B) nonlinear interaction function settings with p = 0.8 (left) and p = 1 (right). (PDF 27 kb)
Additional file 3:
The enriched GO pathways from both GSEA and DAVID. (XLSX 13 kb)
Additional file 4:
The enriched KEGG pathways from three algorithms. (XLSX 9 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Shao, F., Wang, Y., Zhao, Y. et al. Identifying and exploiting genepathway interactions from RNAseq data for binary phenotype. BMC Genet 20, 36 (2019) doi:10.1186/s1286301907397
Received:
Accepted:
Published:
Keywords
 Genepathway interaction
 Garrote kernel machine
 Variance component testing
 Binary phenotype