 Research article
 Open Access
 Published:
Integrative Bayesian variable selection with genebased informative priors for genomewide association studies
BMC Geneticsvolume 15, Article number: 130 (2014)
Abstract
Background
Genomewide Association Studies (GWAS) are typically designed to identify phenotypeassociated single nucleotide polymorphisms (SNPs) individually using univariate analysis methods. Though providing valuable insights into genetic risks of common diseases, the genetic variants identified by GWAS generally account for only a small proportion of the total heritability for complex diseases. To solve this “missing heritability” problem, we implemented a strategy called integrative Bayesian Variable Selection (iBVS), which is based on a hierarchical model that incorporates an informative prior by considering the gene interrelationship as a network. It was applied here to both simulated and real data sets.
Results
Simulation studies indicated that the iBVS method was advantageous in its performance with highest AUC in both variable selection and outcome prediction, when compared to Stepwise and LASSO based strategies. In an analysis of a leprosy case–control study, iBVS selected 94 SNPs as predictors, while LASSO selected 100 SNPs. The Stepwise regression yielded a more parsimonious model with only 3 SNPs. The prediction results demonstrated that the iBVS method had comparable performance with that of LASSO, but better than Stepwise strategies.
Conclusions
The proposed iBVS strategy is a novel and valid method for Genomewide Association Studies, with the additional advantage in that it produces more interpretable posterior probabilities for each variable unlike LASSO and other penalized regression methods.
Background
Over the last decade, Genomewide Association Studies (GWAS) have identified genetic loci associated for a variety of diseases [1][5]. Most studies aim to identify single nucleotide polymorphisms (SNPs) individually using univariate analysis methods [6]. Although current GWAS analyses have provided valuable insights into genetic risks of common diseases, the genetic variants identified by GWAS generally only account for a small proportion of the total heritability of complex diseases, illustrating a problem commonly referred to as “missing heritability” [7],[8]. Potential explanations for this problem include the underestimation of the effects of alleles identified, the existence of genegene joint effects, the contribution of rare variation, the possibility that inherited epigenetic factors lead to correlated phenotypes between relatives, and the possible overestimation of heritability of the complex traits [7],[9],[10]. Many diseases or phenotypes are likely caused by or associated with multiple SNPs, each having small effects individually, but collectively contributing a more significant genetic effect [11]. Therefore, multilocus SNP models would offer one appealing solution in capturing the underlying genotypicphenotypic relationship [12][14].
A typical GWAS study measures thousands or millions of SNPs, but the number of subjects is usually much smaller. This is known as the P > > N problem [15],[16]. One solution to this problem resorts to dimension reduction by identification of the optimal subset of predictors associated with the outcome variable. Determining the best model or selecting a subset of variables becomes an important statistical task for this method. Bayesian variable selection (BVS) provides a natural approach to solve this problem [17],[18]. Unlike penalized regression approaches, BVS naturally produces easily interpretable measures of confidence for variable selection, i.e., posterior selection probabilities. This is an appealing advantage in GWAS because the primary goal of the analysis is to identify the joint effect of SNPs, and to utilize these identifications to learn about underlying biology. BVS has been successfully applied to GWAS data; see the discussion by Guan [19]. The flexibility of the BVS approach allows for straightforward extensions to analyze both quantitative and qualitative data [20][23]. Alternative techniques such as the singleSNP test, Stepwise regression, and LASSO (Least Absolute Shrinkage and Selection Operator) were developed to address this statistical challenge. LASSO is a regression method that involves penalizing the absolute size of the regression coefficients, and Stepwise is a classic scheme for sequentially adding to or removing variables from the model. Many studies show that BVS has better performance than these alternative methods in other contexts [19],[23][25].
Diseases with complex inheritance may be influenced by multiple genes that interplay within genetic networks or pathways [26],[27]. Gene products interact with one another and work collaboratively within interconnected pathways explaining or associating with certain diseases. This idea led to the concept of networkbased molecular biomarkers. Stingo et al. [28] proposed a Bayesian modeling strategy that addressed this concept by incorporating biological information, which was based on the structure or topology of regulatory genegene networks in the analysis of DNA microarray data. The method was further generalized into a 2step framework, STS (screening, then selection) by our research team [29] where standard methods were applied in the screening step to identify a set of candidate genes which were further explored in the selection step using the BVS strategy. In addition to these two coherent steps, our strategy involves the mapping of genotype data to genegene networks constructed from various sources such as proteinprotein interaction networks [30],[31]. We call this strategy of Bayesian biomarker discovery “integrated BVS” or iBVS. A partial least squares (PLS) gprior for regression coefficients is also incorporated to solve the problem of nonpositive deterministic covariance matrix when the sample size is smaller than the number of genes selected.
In this paper, we develop the strategy of iBVS for analyzing high dimensional GWAS data sets. The strategy is built upon a threelevel hierarchical model as seen in Figure 1, where at the top level the PLS method is used to summarize the joint effect of selected SNPs within each gene. In the middle level, Markov Random Field (MRF) is employed to model the selection of genes in prediction of association with disease status. A focus of this article is on discovering SNPs within specific genes incorporating gene network information in GWAS under case–control design. Identification and prediction performance of this iBVS approach are then compared with those of LASSO and Stepwise selection strategies through simulation studies. We then apply iBVS to a GWAS data set for the prediction of leprosy, a skin disease, among Han Chinese.
Methods
We denote Y = (Y _{1}, ⋯ Y _{ n }) ' as independently observed binary outcomes in a GWAS data set, where n is the number of subjects and ${Y}_{i}=\left\{\begin{array}{c}\hfill 1\hfill \\ \hfill 0\hfill \end{array}\right.\begin{array}{c}\hfill \mathrm{if}\phantom{\rule{0.2em}{0ex}}i\mathrm{t}\mathrm{h}\phantom{\rule{0.2em}{0ex}}\mathrm{subject}\phantom{\rule{0.2em}{0ex}}\mathrm{has}\phantom{\rule{0.25em}{0ex}}\mathrm{target}\phantom{\rule{0.25em}{0ex}}\mathrm{disease}\phantom{\rule{0.1em}{0ex}}\hfill \\ \hfill \mathrm{otherwise}\hfill \end{array}$. Each outcome is associated with a set of predictor variables, which correspond to the genotype data. We denote x _{ ijk } as the genotype of the k ^{th} SNP of the j ^{th} gene on the i ^{th} subject, for i = 1, ⋯, n, j = 1, ⋯, J, k = 1, ⋯, P _{ j }, where P _{ j } is the number of SNPs mapped to the j ^{th} gene, and $P={\displaystyle \sum _{j=1}^{J}{P}_{j}}$ denotes the total number of SNPs in the GWAS data set. Let A and a be the major and minor alleles at a SNP. The genotype may be coded according to different types of genetic effects: additive with 0, 1, 2 coding for the genotypes AA, Aa/aA, aa; dominant (with respect to the minor allele) with 0, 1, 1 coding for AA, Aa/aA, aa; Recessive (with respect to the minor allele) with 0, 0, 1 coding for AA, Aa/aA, aa.
iBVS with hierarchical modeling for GWAS data
Figure 1 shows the proposed threelevel hierarchical model structure. iBVS for binary phenotypes is accomplished by introducing the latent variable vector Z to the linear regression model. Each component Z _{ i } ~ N(0, 1) is defined such that
In order to select genetic variants in both gene and SNP level simultaneously, we introduce two binary vectors, ξ = (ξ _{1}, ⋯, ξ _{ J }) and γ = (γ _{1}, ⋯, γ _{ P }), to indicate the selection of genes and SNPs respectively into a model for predicting Z _{ i }, i. e,
and
For GWAS data with iBVS analysis, we propose the following hierarchical model,
where ${\beta}_{\left(\xi ,\gamma \right)}=\left({\beta}_{{J}_{1},\gamma},\cdots ,{\beta}_{{J}_{\left\xi \right},\gamma}\right),{T}_{\left(\xi ,\gamma \right)}=\left({T}_{\left({J}_{1},\gamma \right)},\cdots ,{T}_{\left({J}_{\left\xi \right},\gamma \right)}\right),\left\xi \right$ denotes the number of selected genes in predicting ${Z}_{i},{T}_{\left({J}_{l},\gamma \right)}$ denotes the vector of the first PLS component of ${X}_{{J}_{l},\gamma}\text{,}$ and J _{ l }(l = 1, ⋯, ξ) indexes the selected gene. Note that ${X}_{{J}_{l},\gamma}$ is a submatrix of X, consisting of only the columns that correspond to selected SNPs in the selected genes.
Prior specification
The indicator γ for SNP selection is assumed to follow an independent Bernoulli prior distributions with the same parameter π over all the γ _{ i } values.
Choice of π reflects a user’s prior belief in terms of the numbers of causal SNPs out of P candidates.
Zellner’s gprior is commonly used for the regression coefficient parameters β [32]. Yang and Song [33] generalized the gprior by modifying the matrix inverse to the MoorePenrose generalized matrix inverse. Since the multicollinearity problem is commonly encountered in GWAS data because of strong linkage disequilibrium between SNPs, we took a similar prior as that of Yang and Song,
where ${\left({T}_{\left(\xi ,\gamma \right)}^{\prime}{T}_{\left(\xi ,\gamma \right)}\right)}^{+}$ denotes the MoorePenrose generalized inverse of ${T}_{\left(\xi ,\gamma \right)}^{\prime}{T}_{\left(\xi ,\gamma \right)}$. This idea was first adopted by our research team for microarray gene expression data [29].
To take into account the pathway membership information for each gene as well as the biological relationships between genes within pathways, we follow Li and Zhang [34] and Stingo et al. [28] to use an MRF to describe the prior distribution on each component of the gene selection indicator, i.e.,
where μ and η are tuning parameters, and Nb _{ j } is the set of neighbors of gene j within the selected pathway. The corresponding multivariate form is given by:
where ${1}_{J}^{\prime}$ is the vector consisting of J 1’s. We denote matrix R to reflect genegene network topological structure, where elements R _{ ij } = 1 if there is a direct edge between the i ^{th} and j ^{th} genes, and R _{ ij } = 0 otherwise.
Posterior distributions
The joint posterior distribution of θ = (γ, ξ, β _{(ξ,γ)}, Z) given (Y, X) is
where I(A _{ i }) is the indicator function and A _{ i } is either equal to {Z _{ i } : Z _{ i } > 0} or {Z _{ i } : Z _{ i } ≤ 0} corresponding to Y _{ i } = 1 or Y _{ i } = 0, respectively, and ${\lambda}_{1},\cdots {\lambda}_{{m}_{\xi}}\left({m}_{\xi}\le J\right)$ are the nonzero eigenvalues of ${\left({T}_{\left(\xi ,\gamma \right)}^{\prime}{T}_{\left(\xi ,\gamma \right)}\right)}^{+}$.
Since β is a nuisance parameter, we can integrate it out to obtain the joint posterior distribution of (Z, ξ, γY, X) as follows:
with ${\Sigma}_{\left(\xi ,\gamma \right)}^{1}=c{T}_{\left(\xi ,\gamma \right)}{\left({T}_{\left(\xi ,\gamma \right)}^{\prime}{T}_{\left(\xi ,\gamma \right)}\right)}^{+}{T}_{\left(\xi ,\gamma \right)}^{\prime}+{I}_{n}$. From this posterior joint distribution, we can derive the posterior conditional distributions
which is a multivariate truncated normal distribution, and
Posterior inference via MCMC sampling
Markov chain Monte Carlo (MCMC) sampling is used to generate samples for the posterior distribution of the model parameters. The MCMC sampling procedure consists of the following two steps:

I.
Sample ξ and γ from P(ξ, γY, T, Z): the selection parameters (ξ, γ) are updated using a MetropolisHastings algorithm which is modified from Stingo’s method [28] (details of the MCMC moves to update (ξ, γ) are given in Additional file 1). The method consists of randomly picking one of the following moves: (1) change the inclusion status of SNP and gene by randomly choosing from adding or removing a gene and a SNP at the same time; (2) change the inclusion status of SNP only by randomly choosing from adding or removing a SNP.

II.
Sample Z from P(ZY, T, γ, ξ): directly sampling from this distribution is known to be difficult. In this article, we follow the method given in Devroye [35] to simulate samples from the univariate truncated normal distribution P(Z _{ i }Z _{− i}, Y, T, γ, ξ), where Z _{− i} is the vector of Z without the i ^{th} element.
Simulation
Simulation studies were conducted to assess the performances of iBVS, LASSO regression, and Stepwise regression using a proprietary set of MatLab codes, an R package glmnet, and the R package lars. We simulated three scenarios of varying different proportion of variance of Z explained by the genetic factors, labeled as follows: (1) H70: genes with network, 70% explained variance; (2) H50: genes with network, 50% explained variance; and (3) H30: genes with network, 30% explained variance.
For each scenario, 50 sets of genotypes without disease status were simulated using software Hapgen2 [36] based on the genotype data from Hapmap project (http://hapmap.ncbi.nlm.nih.gov/). We subsequently generated phenotypes corresponding to each scenario, with 400 individuals and 300 SNPs assorted into 22 genes for each data set. The first 200 individuals were used to fit the iBVS hierarchical model and to evaluate the performance of identifying causal SNPs of the three methods, while the other 200 individuals were used to assess the prediction performance of each method. All the simulations were run under the additive and dominant genetic model respectively to check the model flexibility of the proposed iBVS.
We simulated sets of phenotypes in the following way: First, we specified 10 SNPs x _{ j }(j = 1 … 10) as causal SNPs, which were positioned within 5 genes. In order to add network information to gene levels, a network was simulated between the genes. We then conducted a precision matrix Ω which contains the network relationship between the genes. If there is an edge between p ^{th} gene and q ^{th} gene in the network, ω _{ pq } and ω _{ qp } would be assigned with a nonzero value, otherwise 0. The vector t _{ i } was generated from a multivariate normal distribution with zero mean vector and covariance matrix Σ = Ω ^{− 1}. Then the causal gene score T _{ k } (k = 1…5) was calculated by considering both genotype data and gene network information, ${T}_{ik}={\displaystyle \sum _{SNP\phantom{\rule{0.1em}{0ex}}j\in \mathit{gene}\phantom{\rule{0.2em}{0ex}}k}{b}_{j}{x}_{kj}+}\phantom{\rule{0.1em}{0ex}}{t}_{ik}\text{,}$ where b _{ j }’s were carefully chosen to indicate different PLS configurations. We subsequently simulated the latent phenotypes score for each individual using z _{ i } = ∑ϕ _{ j }T _{ ik } + ε _{ i }, ε _{ i } ~ N(0, 1). Finally, binary phenotypes Y _{ i } for each individual was generated using ${Y}_{i}=\left\{\begin{array}{c}\hfill 1\hfill \\ \hfill 0\hfill \end{array}\right.\begin{array}{c}\hfill \mathrm{if}\phantom{\rule{0.2em}{0ex}}{z}_{i}\ge 0\phantom{\rule{0.1em}{0ex}}\hfill \\ \hfill \mathrm{if}\phantom{\rule{0.25em}{0ex}}{z}_{i}<0\hfill \end{array}$.
Application
We applied iBVS, LASSO and Stepwise approaches to analyze a GWAS data set designed to identify genetic variants associated with leprosy [37]. The genotype data consisted of 492,109 SNPs from 706 cases and 514 controls after removing genetically unmatched controls, to obviate the need for correction for population stratification. All subjects were Han Chinese from eastern China. In order to select variables and assess the performance of the three variable selection strategies, we randomly divided the data set into two parts: a training set and a testing set, each with 610 samples. The training set was used for SNP selection, while the testing set for validating the selection results and comparing the three methods. The genotype was first coded under the additive genetic model, and we recoded the genotypes following dominant genetic components and reanalyzed this real data set to check the model flexibility under different genetic effects.
Results
Simulation studies
Performance of variable selection
The average area under the curve (AUC) was calculated to evaluate the performance of casual SNP identification in each scenario. For SSVS, the AUC is calculated using the following formula [38],[39]. $AUC=\frac{1}{{n}^{D}{n}^{C}}{\displaystyle \sum _{i\in D,j\in C}I}\left\{{\gamma}_{i}>{\gamma}_{j}\right\}\text{,}$ where D is the set of the causal SNPs and C is the set of the noncausal SNPs; n ^{D} and n ^{C} are the number of causal and noncausal SNPs, respectively.
For the LASSO method, a simple modification of the Least Angle Regression (LAR) algorithm was implemented that calculates the entire LASSO path, which is an efficient way of computing the solution to any LASSO problem especially when P ≫ N [40]. Using the modified LARS algorithm, one may generate all LASSO solutions corresponding to different values of the penalty parameter. Selecting the active model at a given iteration would give one LASSO solution corresponding to a particular value of the penalty parameter. Hence, one can control the penalty parameter using a cutoff for the number of iterations [38]. For LASSO and Stepwise approaches, the following formula was used to calculate the AUC: $AUC=\frac{1}{{n}^{D}{n}^{C}}{\displaystyle \sum _{i\in D,j\in C}I}\left\{{s}_{i}<{s}_{j}\right\}\text{,}$ where s is the iteration at which the i ^{th} marker enters the model [38].
Table 1 shows the averaged AUC of the three methods in variable selection under the additive genetic model. It can be seen that the AUC of the three methods all increase monotonically by the proportion of variance of Z explained by the genetic factors. Obviously, under scenario H70 and H50, the iBVS has the highest averaged AUC (0.911 and 0.894) followed by Lasso (0.891 and 0.882), while the AUC of Stepwise is relative low (0.869 and 0.853). The AUC of iBVS drops to 0.792 with low explained variance (H30), with LASSO and Stepwise both approximating 0.77. The results revealed that iBVS has superior performance compared with that of LASSO and Stepwise regression. Similar trends can also be found under the dominant genetic model (Additional file 1: Table S2).
Performance of outcome prediction
We subsequently assessed the prediction performance of the three methods using the remaining 200 individuals. Prediction performances were evaluated by considering correctly/incorrectly predicted positive/negative outcomes. We calculated specificity and sensitivity for thresholds from 0.01 to 0.99, with steps of 0.01. Then the ROC was plotted using mean specificity and sensitivity for a given threshold. For iBVS, we use a twostage strategy. First the posterior probabilities of all the SNPs were estimated by iBVS. The top i SNPs (i = 1…300) were subsequently fitted into a PLS logistic regression model, and 10fold crossvalidation was conducted to choose the optimal number of predictors i. For LASSO and Stepwise approaches, the optimal model was determined by a 10fold crossvalidation.
Figure 2 demonstrates the ROC of the three methods in scenarios H70 and H50. One can see that the prediction performance of iBVS was slightly greater than that of LASSO, and had an obvious superiority to Stepwise regression.
Application
We first conducted screening of all SNPs, one by one by fitting the singleSNP logistic regression model with additive coding. By sorting on the pvalues from the univariate analysis, we identified 100 genes, each containing at least one SNP with Pvalue smaller than 0.0004. We subsequently extracted all of the SNPs in each significant gene to comprise the joint effect of SNPs per one gene. The 100 genes selected above contained a total of 3,388 SNPs.
iBVS was applied using the above 3,388 SNPs. First, we constructed the R matrix using the KEGG database (details please see the Additional file 1). We subsequently specified prior distributions as described in the Methods Section, with hyper parameters set as: π = 0.01, μ = − 2, η = 0.8. Finally, the MCMC was conducted to make posterior inferences with 10,000 iterations as burnin and 50,000 additional runs. Figure 3a shows the posterior SNP selection probabilities via our iBVS strategy with integrated biological priors.
A tenfold crossvalidation approach was employed to set a cutoff for determining the optimal prediction model. The top i SNPs were added into a PLS logistic model one by one, to estimate the crossvalidation error, shown in Figure 3b. It can be seen that the smallest classification error appeared when the top 94 SNPs were selected as predictors. The classification error stabilizes after 94 SNPs were selected into the PLS logistic model, with some slight increase shown. Therefore the top 94 SNPs were selected as ‘significant’ predictors whose information was listed in (see the Additional file 1: Table S1).
In addition, we ran LASSO and Stepwise regression on this leprosy GWAS dataset, with the optimal model determined by 10fold crossvalidation. The LASSO selected 100 SNPs, which only included 24 SNPs selected by iBVS. The Stepwise regression approach yielded a more parsimonious model with only 3 SNPs. Table 2 shows the detailed information of 24 common SNPs selected by both iBVS and Lasso. Specifically, the three SNPs selected by Stepwise also had high corresponding posterior probability in iBVS and relative large coefficient in LASSO, and SNP rs9270984 is most significant in all three methods. Finally, we assessed the ability of the three methods to correctly predict binary responses (case versus control) of the test data set. Figure 4a shows the ROC curves of the three selected models under the additive genetic model, while Figure 4b demonstrates that under the dominant model. This indicates that iBVS has comparable performance to the LASSO model, but has an performance advantage over the Stepwise regression method no matter what the genetic model is.
Discussion
GWAS analyses typically approach data as a list of single SNPs, a strategy which has yielded a catalog of susceptibility loci for complex diseases. However, the statistically significant variants detected so far account for only a small proportion of the total phenotypic variation. Genebased tests for association are increasingly being seen as useful complements to GWASs, demonstrating several attractive features compared with traditional SNPbased analysis [12][14]. Beyond genebased methods, there is an increasing recognition of the potential contributions of pathwaybased analyses, in which variants in groups of genes within specific pathways are considered together to predict the phenotype [41][43]. In this paper, we followed an integrative biomarker identification scheme to conduct a novel hierarchical model incorporating a genegene network or pathway information for SNP identification in GWAS via the Bayesian inference paradigm.
Three scenarios of data sets were simulated, each considering different proportions of variance of outcome explained by genetic factors. Simulation results show our iBVS method outperformed the LASSO and Stepwise methods in identifying causal SNPs in each scenario (Table 1). In addition, we also evaluate the prediction ability of the three methods using additional data sets, and show that iBVS had advantageous performance (Figure 2). The advantages of the proposed iBVS strategy is verified when the real network is known and explicitly employed through a prior specification with the MRF distribution.
After applying iBVS to an actual GWAS dataset, a panel of 94 SNPs were selected as predictors of leprosy. The LASSO method selected 100 SNPs, which included only 24 SNPs in common with iBVS. The Stepwise regression yielded a very parsimonious model with only 3 SNPs. The results indicate that the iBVS method has comparable prediction performance with LASSO, and advantageous with Stepwise. Moreover, all the results are quite stable under the different genetic models. Stepwise regression searches the model space by adding or removing one SNP at a time and therefore the searching is partial, leading to convergence at a local optimum. The reason iBVS does not outperform LASSO may be due to deficiencies in pathway information from existing databases that do not reflect complete signaling pathways. The performance of iBVS can be improved by developing a stochastic inference of the genegene networks from the data and merging it into the current iBVS MCMC algorithm, which remains a future goal. Compared with LASSO and other penalized regression methods, which lack appropriate interpretation, iBVS has an additional advantage in that it produces posterior probabilities for each variable. This is a particularly important advantage in GWAS because the primary goal of the analysis is to identify the effect of SNPs. Comparing to metrics like pvalues, posterior probabilities have clear interpretation. A reviewer of this article suggested that many important traits are generally quantitative and are often controlled by multiple genes in shared biology pathways; our model could be naturally extended to analyze quantitative data by removing the latent variable Z from the hierarchical model.
Efficiency of stochastic algorithms often diminishes as the total number of variables increases [19],[21]. It would be appealing to remove noisy data points or those with lower quality before using a stochastic search. Therefore, we first screened the total number of SNPs using a conventional SNPbased model to filter the number of SNPs included in the Bayesian hierarchical model. This set of candidate SNPs and their associated genes is called the ‘signature set’ in the sense that they are possibly signaling SNPs/genes (in other words, causal or marker SNPs/genes). We extracted all the SNPs in each significant gene to comprise the joint effect of a gene, leaving the weaker candidate genes out of the iBVS. The screening step should be viewed as a general step not only for dimension reduction but for constructing a functional context before conducting BVS.
A few issues regarding our model choices and computation should be highlighted. We mainly adopted the perspective of subjective Bayesian analysis due to the fact that we want to incorporate informative priors from relevant scientific sources. Choosing an objective prior that satisfies some fundamental principles as summarized in Bayarri et al. [44] would be theoretically appealing in future work. Another issue concerns computational burden. With a large number of parameters in the model, the inference is mainly based on Monte Carlo simulation, which may take a prolonged time. Running over a single computer with 3.3GH CPU computer and 8GB memory, 6 days were required to finish the leprosy data analysis. With the advent of highspeed cluster computers and the existence of cloud computing technologies, it is becoming increasingly feasible to apply full iBVS methods for biomarker identification.
Conclusions
We proposed an iBVS method to analyze high dimensional GWAS data sets based on a hierarchical model that incorporates an informative prior on networked gene interrelationships. Simulation studies showed that our iBVS method had better performance in both biomarker identification and disease prediction than LASSO and Stepwise models. A leprosy GWAS analysis showed iBVS method demonstrated a comparable performance with LASSO, and better than Stepwise methods. iBVS did not outperform LASSO, which may be due to deficiencies in existing signaling pathway databases that are likely to be improved as the knowledge base increases. In summary, the proposed iBVS strategy is a valid method for GWAS, having an additional advantage in the production of posterior probabilities for each variable that are again subject to continued refinement.
Additional file
Abbreviations
 iBVS:

Integrative Bayesian variable selection
 LASSO:

Least absolute shrinkage and selection operator
 STS:

Screening, then selection
 AUC:

The average area under the curve
References
 1.
Kettunen J, Tukiainen T, Sarin AP, OrtegaAlonso A, Tikkanen E, Lyytikäinen LP, Kangas AJ, Soininen P, Würtz P, Silander K, Dick DM, Rose RJ, Savolainen MJ, Viikari J, Kähönen M, Lehtimäki T, Pietiläinen KH, Inouye M, McCarthy MI, Jula A, Eriksson J, Raitakari OT, Salomaa V, Kaprio J, Järvelin MR, Peltonen L, Perola M, Freimer NB, AlaKorpela M, Palotie A, et al: Genomewide association study identifies multiple loci influencing human serum metabolite levels. Nat Genet. 2012, 44 (3): 269276. 10.1038/ng.1073.
 2.
Chasman DI, Schürks M, Anttila V, de Vries B, Schminke U, Launer LJ, Terwindt GM, van den Maagdenberg AM, Fendrich K, Völzke H, Ernst F, Griffiths LR, Buring JE, Kallela M, Freilinger T, Kubisch C, Ridker PM, Palotie A, Ferrari MD, Hoffmann W, Zee RY, Kurth T: Genomewide association study reveals three susceptibility loci for common migraine in the general population. Nat Genet. 2011, 43 (7): 695698. 10.1038/ng.856.
 3.
Goode EL, ChenevixTrench G, Song H, Ramus SJ, Notaridou M, Lawrenson K, Widschwendter M, Vierkant RA, Larson MC, Kjaer SK, Birrer MJ, Berchuck A, Schildkraut J, Tomlinson I, Kiemeney LA, Cook LS, Gronwald J, GarciaClosas M, Gore ME, Campbell I, Whittemore AS, Sutphen R, Phelan C, AntonCulver H, Pearce CL, Lambrechts D, Rossing MA, ChangClaude J, Moysich KB, Goodman MT, et al: A genomewide association study identifies susceptibility loci for ovarian cancer at 2q31 and 8q24. Nat Genet. 2010, 42 (10): 874879. 10.1038/ng.668.
 4.
Smyth DJ, Cooper JD, Bailey R, Field S, Burren O, Smink LJ, Guja C, IonescuTirgoviste C, Widmer B, Dunger DB, Savage DA, Walker NM, Clayton DG, Todd JA: A genomewide association study of nonsynonymous SNPs identifies a type 1 diabetes locus in the interferoninduced helicase (IFIH1) region. Nat Genet. 2006, 38 (6): 617619. 10.1038/ng1800.
 5.
Cooper JD, Smyth DJ, Smiles AM, Plagnol V, Walker NM, Allen JE, Downes K, Barrett JC, Healy BC, Mychaleckyj JC, Warram JH, Todd JA: Metaanalysis of genomewide association study data identifies additional type 1 diabetes risk loci. Nat Genet. 2008, 40 (12): 13991401. 10.1038/ng.249.
 6.
McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JPA, Hirschhorn JN: Genomewide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet. 2008, 9 (5): 356369. 10.1038/nrg2344.
 7.
Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Whittemore AS, Boehnke M, Clark AG, Eichler EE, Gibson G, Haines JL, Mackay TF, McCarroll SA, Visscher PM: Finding the missing heritability of complex diseases. Nature. 2009, 461 (7265): 747753. 10.1038/nature08494.
 8.
Visscher PM: Sizing up human height variation. Nat Genet. 2008, 40 (5): 489490. 10.1038/ng0508489.
 9.
Gibson G: Hints of hidden heritability in GWAS. Nat Genet. 2010, 42 (7): 558560. 10.1038/ng0710558.
 10.
Eichler EE, Flint J, Gibson G, Kong A, Leal SM, Moore JH, Nadeau JH: Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet. 2010, 11 (6): 446450. 10.1038/nrg2809.
 11.
Stranger BE, Stahl EA, Raj T: Progress and promise of genomewide association studies for human complex trait genetics. Genetics. 2011, 187 (2): 367383. 10.1534/genetics.110.120907.
 12.
Liu JZ, McRae AF, Nyholt DR, Medland SE, Wray NR, Brown KM, Investigators AMFS, Hayward NK, Montgomery GW, Visscher PM, Martin NG, Macgregor S: A versatile genebased test for genomewide association studies. Am J Hum Genet. 2010, 87 (1): 139145. 10.1016/j.ajhg.2010.06.009.
 13.
Yang J, Ferreira T, Morris AP, Medland SE, Madden PA, Heath AC, Martin NG, Montgomery GW, Weedon MN, Loos RJ, Frayling TM, McCarthy MI, Hirschhorn JN, Goddard ME, Visscher PM, Genetic Investigation of ANthropometric Traits (GIANT) Consortium, DIAbetes Genetics Replication And Metaanalysis (DIAGRAM) Consortium: Conditional and joint multipleSNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat Genet. 2012, 44 (4): 369375. 10.1038/ng.2213. S1S3
 14.
Li M, Gui H, Kwan JS, Sham PC: GATES: a rapid and powerful genebased association test using extended Simes procedure. Am J Hum Genet. 2011, 88 (3): 283293. 10.1016/j.ajhg.2011.01.019.
 15.
Prentice RL, Lihong QI: Aspects of the design and analysis of highdimensional SNP studies for disease risk estimation. Biostatistics. 2006, 7 (3): 339354. 10.1093/biostatistics/kxj020.
 16.
Sölkner J: Very many variables and limited numbers of observations; The p>> n problem in current statistical applications. Information Technology Interfaces (ITI). Proceedings of the ITI 2012 34th International Conference 2528 June 2012. 2012, 1314.
 17.
Tadesse MG, Sha N, Vannucci M: Bayesian variable selection in clustering highdimensional data. J Am Stat Assoc. 2005, 100 (470): 602617. 10.1198/016214504000001565.
 18.
Mitchell TJ, Beauchamp JJ: Bayesian variable selection in linear regression. J Am Stat Assoc. 1988, 83 (404): 10231032. 10.1080/01621459.1988.10478694.
 19.
Guan Y, Stephens M: Bayesian variable selection regression for genomewide association studies and other largescale problems. Ann Appl Stat. 2011, 5 (3): 17801815. 10.1214/11AOAS455.
 20.
Fridley BL: Bayesian variable and model selection methods for genetic association studies. Genet Epidemiol. 2008, 33 (1): 2737. 10.1002/gepi.20353.
 21.
Wilson MA, Iversen ES, Clyde MA, Schmidler SC, Schildkraut JM: Bayesian model search and multilevel inference for SNP association studies. Ann Appl Stat. 2010, 4 (3): 134210.1214/09AOAS322.
 22.
Banerjee S, Yandell BS, Yi N: Bayesian quantitative trait loci mapping for multiple traits. Genetics. 2008, 179 (4): 22752289. 10.1534/genetics.108.088427.
 23.
Russu A, Malovini A, Puca AA, Bellazzi R: Stochastic model search with binary outcomes for genomewide association studies. J Am Med Inform Assn. 2012, 19 (e1): e13e20. 10.1136/amiajnl2011000741.
 24.
Hoggart CJ, Whittaker JC, De Iorio M, Balding DJ: Simultaneous analysis of all SNPs in genomewide and resequencing association studies. PLoS Genet. 2008, 4 (7): e100013010.1371/journal.pgen.1000130.
 25.
Kwon S, Wang D, Guo X: Application of an iterative Bayesian variable selection method in a genomewide association study of rheumatoid arthritis. BMC Proc. 2007, 1 (Suppl 1): S10910.1186/175365611s1s109.
 26.
Torkamani A, Schork NJ: Pathway and network analysis with highdensity allelic association data. Methods Mol Biol. 2009, 563: 289301. 10.1007/9781607611752_16.
 27.
Baranzini SE, Galwey NW, Wang J, Khankhanian P, Lindberg R, Pelletier D, Wu W, Uitdehaag BMJ, Kappos L, Polman CH: Pathway and networkbased analysis of genomewide association studies in multiple sclerosis. Hum Mol Genet. 2009, 18 (11): 20782090. 10.1093/hmg/ddp120.
 28.
Stingo FC, Chen YA, Tadesse MG, Vannucci M: Incorporating biological information into linear models: a Bayesian approach to the selection of pathways and genes. Ann Appl Stat. 2011, 5 (3): 19782002. 10.1214/11AOAS463.
 29.
Peng B, Zhu D, Ander BP, Zhang X, Xue F, Sharp FR, Yang X: An Integrative Framework for Bayesian variable selection with informative priors for identifying genes and pathways. PLoS One. 2013, 8 (7): e6767210.1371/journal.pone.0067672.
 30.
Chuang H, Lee E, Liu Y, Lee D, Ideker T: Networkbased classification of breast cancer metastasis. Mol Syst Biol. 2007, 3: 14010.1038/msb4100180.
 31.
Lee E, Chuang H, Kim J, Ideker T, Lee D: Inferring pathway activity toward precise disease classification. PLoS Comput Biol. 2008, 4 (11): e100021710.1371/journal.pcbi.1000217.
 32.
Zellner A: On assessing prior distributions and Bayesian regression analysis with gprior distributions. Bayesian Inference Decision Techniques. 1986, 6: 233243.
 33.
AiJun Y, XinYuan S: Bayesian variable selection for disease classification using gene expression data. Bioinformatics. 2010, 26 (2): 215222. 10.1093/bioinformatics/btp638.
 34.
Li F, Zhang NR: Bayesian variable selection in structured highdimensional covariate spaces with applications in genomics. J Am Stat Assoc. 2010, 105 (491): 12021214. 10.1198/jasa.2010.tm08177.
 35.
Devroye L: Samplebased Nonuniform random variate generation. Proceedings of the 18th conference on Winter simulation. ACM. 1986, 260265. 10.1145/318242.318443.
 36.
Su Z, Marchini J, Donnelly P: HAPGEN2: simulation of multiple disease SNPs. Bioinformatics. 2011, 27 (16): 23042305. 10.1093/bioinformatics/btr341.
 37.
Zhang FR, Huang W, Chen SM, Sun LD, Liu H, Li Y, Cui Y, Yan XX, Yang HT, Yang RD: Genomewide association study of leprosy. New Engl J Med. 2009, 361 (27): 26092618. 10.1056/NEJMoa0903753.
 38.
Srivastava S, Chen L: Comparison between the stochastic search variable selection and the least absolute shrinkage and selection operator for genomewide association studies of rheumatoid arthritis. BMC Proc. 2009, 3 (Suppl 7): S2110.1186/175365613s7s21.
 39.
Ma S, Huang J: Combining multiple markers for classification using ROC. Biometrics. 2007, 63 (3): 751757. 10.1111/j.15410420.2006.00731.x.
 40.
Efron B, Hastie T, Johnstone I, Tibshirani R: Least angle regression. Ann Appl Stat. 2004, 32 (2): 407499. 10.1214/009053604000000067.
 41.
Ramanan VK, Shen L, Moore JH, Saykin AJ: Pathway analysis of genomic data: concepts, methods, and prospects for future development. Trends Genet. 2012, 28 (7): 323332. 10.1016/j.tig.2012.03.004.
 42.
Consortium IMSG: Networkbased multiple sclerosis pathway analysis with GWAS data from 15,000 cases and 30,000 controls. Am J Hum Genet. 2013, 92 (6): 85410.1016/j.ajhg.2013.04.019.
 43.
Mukherjee S, Kim S, Ramanan VK, Gibbons LE, Nho K, Glymour MM, ErtekinTaner N, Montine TJ, Saykin AJ, Crane PK: Genebased GWAS and biological pathway analysis of the resilience of executive functioning. Brain Imaging Behav. 2014, 8 (1): 110118. 10.1007/s1168201392597.
 44.
Bayarri MJ, Berger JO, Forte A, GarcíaDonato G: Criteria for Bayesian model choice with application to variable selection. Ann Appl Stat. 2012, 40 (3): 15501577. 10.1214/12AOS1013.
Acknowledgments
This research was supported by Eunice Kennedy Shriver National Institute of Child Health & Human Development, NIH, USA [5R01HD061404]; National Institute on Drug Abuse, NIH, USA [R44DA026683]; National Natural Science Foundation of China, China [81273177]. The authors would like to thank the editor and reviewers for their valuable comments and suggestions to improve the quality of the paper.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
XSZ, FZX, DWZ, BP and XWY conceptualized the study, analyzed the data and prepared for the manuscript. HL acquired and interpreted the data. JLW contributed on the manuscript revision and thoroughly copyedited the manuscript. All authors approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
About this article
Received
Accepted
Published
DOI
Keywords
 Biomarker discovery
 Bayesian hierarchical modeling
 Genebased biomarkers
 Bayesian variable selection
 Integrative biomarker identification