- Methodology article
- Open Access
Empirical Bayesian LASSO-logistic regression for multiple binary trait locus mapping
BMC Genetics volume 14, Article number: 5 (2013)
Complex binary traits are influenced by many factors including the main effects of many quantitative trait loci (QTLs), the epistatic effects involving more than one QTLs, environmental effects and the effects of gene-environment interactions. Although a number of QTL mapping methods for binary traits have been developed, there still lacks an efficient and powerful method that can handle both main and epistatic effects of a relatively large number of possible QTLs.
In this paper, we use a Bayesian logistic regression model as the QTL model for binary traits that includes both main and epistatic effects. Our logistic regression model employs hierarchical priors for regression coefficients similar to the ones used in the Bayesian LASSO linear model for multiple QTL mapping for continuous traits. We develop efficient empirical Bayesian algorithms to infer the logistic regression model. Our simulation study shows that our algorithms can easily handle a QTL model with a large number of main and epistatic effects on a personal computer, and outperform five other methods examined including the LASSO, HyperLasso, BhGLM, RVM and the single-QTL mapping method based on logistic regression in terms of power of detection and false positive rate. The utility of our algorithms is also demonstrated through analysis of a real data set. A software package implementing the empirical Bayesian algorithms in this paper is freely available upon request.
The EBLASSO logistic regression method can handle a large number of effects possibly including the main and epistatic QTL effects, environmental effects and the effects of gene-environment interactions. It will be a very useful tool for multiple QTLs mapping for complex binary traits.
Quantitative traits are usually influenced by multiple quantitative trait loci (QTLs), environmental factors and their interactions . Although complex binary traits only show binary phenotypic variation different from the continuous variation in quantitative traits, they do not follow a simple Mendelian pattern of inheritance and also have a polygenic basis similar to that of quantitative traits. Therefore, like QTL mapping for quantitative traits, mapping for complex binary traits aims to identify multiple genomic loci that are associated with the trait and to estimate the genetic effects of these loci possibly including any of the following effects: main effects, gene-gene interactions (epistatic effects) and effects of gene-environment interactions.
A number of statistical methods have been developed to identify QTLs for binary traits in experimental crosses. Single-QTL mapping methods [2–8] analyze the association between each individual genetic locus and the trait independently. However, single-QTL mapping method can only find the main effect of a QTL and cannot detect epistatic effects involving more than one locus. Moreover, it has been shown both theoretically and empirically that multiple-QTL methods can improve power in detecting QTLs and eliminate possible biases in the estimates of QTL locations and genetic effects introduced by a single-QTL model [9, 10]. Therefore, several multiple QTL mapping for binary traits have been developed. These include Bayesian methods [11–16] that rely on Markov Chain Monte Carlo (MCMC) simulation to infer a multiple QTL threshold model for binary, ordinal or longitudinal traits, the multiple-interval mapping (MIM) methods [17, 18] that use the expectation-maximization (EM) algorithm to infer the threshold model or a generalized linear model for binary or ordinal traits, and a method  that employs a probability model based on classical transmission genetics to identify binary trait loci. However, all these methods require large computation when the QTL model includes a relatively large number of loci.
Since hundreds or thousands of genomic loci or markers are usually genotyped and involved in QTL mapping studies, including all these markers and their possible interactions in a single model for multiple-QTL mapping leads to a huge number of model variables, typically much larger than the sample size. This not only entails huge computation that is not affordable to existing QTL mapping methods mentioned earlier but also may reduce power of detection and/or increase false discover rate. Two techniques have been proposed to handle the second problem: variable or model section and shrinkage. More specifically, model selection using the Akaike or Bayesian information criterion (AIC or BIC) and variable selection based on stepwise logistic regression and the BIC have been proposed in  and , respectively, to restrict the model space and to reduce the number of variables included in the model. Bayesian shrinkage method that was first applied to QTL mapping for continuous traits [15, 20–24] has also been used in QTL mapping for binary traits [15, 22], which employed MCMC for the inference of the QTL model. To reduce computational burden, more efficient methods [25, 26] were developed to infer Bayesian QTL models for binary traits. Another well-known method for shrinking variables is the least absolute shrinkage and selection operator (LASSO) , which has been applied to QTL mapping for continuous traits  and investigated for genome-wide association study (GWAS) of complex diseases [28, 29].
Recently, we developed an efficient empirical Bayesian LASSO (EBLASSO) algorithm for multiple-QTL mapping for continuous traits, which is capable of handling a large number of markers and their interactions simultaneously . In this paper, we extend the linear Bayesian LASSO model [23, 30, 31] to logistic regression to map multiple QTLs for binary traits. We consider a three-level and a two-level hierarchical model for the prior distributions of the regression coefficients. Building on the EBLASSO algorithm , we develop efficient empirical Bayesian algorithms to infer the Bayesian LASSO logistic regression model with two different priors. We then use simulations to compare the performance of our EBLASSO with that of five other QTL mapping methods for binary traits, that include the LASSO-logistic regression [27, 28], the HyperLasso , the Bayesian hierarchical generalized linear models (BhGLM) , the relevant vector machine (RVM) [32, 33], and the single-QTL mapping method based on logistic regression. Simulation results show that our EBLASSO offers best overall performance among the examined methods in terms of power of detection and false positive rate. Analysis of a real dataset with our algorithms also identifies several QTLs.
Logistic regression model
Let y i = 0 or 1 denote the binary trait of the ith sample of n individuals in a study. Let us define y = [y 1 , y 2 , ···, y n ]T as the binary phenotypes for all the n individuals. The probability of observing y i = 1 is written as p i = Pr(y i = 1), i = 1, ···, n, which are further collected into a vector p = [p 1 , p 2 , ···, p n ]T. Suppose that m genetic markers of these n individuals are genotyped and the genotype of marker j of individual i is x ij . Taking main and epistatic effects of all markers into consideration, we have the following logistic regression model for multiple-QTL mapping:
where β j and β jj’ are regression coefficients, and logit(p i ) is defined as:
The widely adopted Cockerham genetic model  will be used in this paper. For a back-cross design, the Cockerham model assigns −0.5 and 0.5 to x ij for two possible genotypes at marker j. For an intercross (F2) design, there are two possible main effects named additive and dominance effects. The Cockerham model defines the values of the additive effect as −1, 0 and 1 for the three genotypes and the values of the dominance effect as −0.5 and 0.5 for homozygotes and heterozygotes, respectively. For simplicity, we only consider additive effects in (1), although the methods developed in this paper are also applicable to the model with dominance effects.
Let us define x Gi = [xi1, xi2, ⋯, x im ]T, and . Let x GGi be a m(m − 1)/2 × 1 vector containing , and β GG be a vector consisting of corresponding regression coefficients. We further define x i = [1, x Gi T, x GGi T]T and , then (1) can be written in a more compact form:
From (3), we can express p i as follows:
Note that there are k = 1 + m(m + 1)/2 unknown regression coefficients in (1) or (3). Typically, we have k ≫ n. If dominance effects of the markers are considered, k is even larger. Simultaneously estimation of all possible genetic effects with k ≫ n is a challenging problem. However, we would expect that most elements of β are zeros and thus we have a sparse model. We will exploit this sparsity to develop efficient methods to infer β.
Prior distributions for the regression coefficients
We assign a noninformative uniform prior to β0, i.e., p(β0) ∝ 1. For β i , i = 1, 2,···, k, we will consider two hierarchical models for the prior distribution. The first model is the one used in both the linear Bayesian LASSO model [23, 30, 31] and the HyperLasso logistic regression [25, 35], which has three levels. At the first level, β i , i = 1, 2,···, k, follows an independent normal distribution with mean zero and variance σ i 2: β i ∼ N(0, σ i 2). At the second level, σ i 2, i = 1, 2, ···, k, follows an independent exponential distribution with probability density function p(σ i 2) = λexp(−λσ i 2), with parameter λ > 0. At the third level, λ follows a gamma distribution gamma(a, b) with a shape parameter a and an inverse scale parameter b. We name the logistic regression model with this normal-exponential-gamma (NEG) prior as BLASSO-NEG.
The three-level hierarchical model has two hyperparameters a and b. While these two parameters give much flexibility of adjusting the degree of shrinkage, significant computation is required for cross validation to properly choose their values. To reduce the computational burden of cross validation, we will also consider a model with only the first two levels in the BLASSO-NEG. This two level hierarchical model only has one hyperparameter λ to be adjusted, and thus, it requires less computation. We name the logistic regression model with the two-level prior as BLASSO-NE. We will next develop two empirical Bayes (EB) algorithms to infer these two models.
Empirical Bayesian algorithm for the BLASSO-NEG model (EBLASSO-NEG)
Using the EB approach , we will first estimate σ i 2, i = 1, 2,⋯, k, from the data, and then find the posterior distribution of β based on the estimated σ i 2. As shown in , the prior distribution of σ i 2 can be found as
Let us define and y = y 1 , y 2 , ···, y n T, then the posterior distribution of β and σ2 is given by:
where and is a normal distribution. Since it is difficult to integrate out β in (6) to get the marginal posterior distribution of σ2, it is difficult to estimate σ2 directly by maximizing its posterior function. To overcome this problem, we will employ an iterative approach that relies on Laplace approximation of the posterior distribution of β[32, 37, 38].
Let α i = 1/σ i 2, i = 1, 2,⋯, k, and collect them in a vector . Then, we have , where A is a diagonal matrix With α on its diagonal. Suppose in the (i-1)th iteration, we have estimated A as Â. Given y and A = Â, the posterior distribution of β can be approximated by a Gaussian distribution found with the Laplace approximation approach  as follows. Since the posterior distribution of β is given by , we have:
The gradient g and Hessian matrix H of are given by and H = − (XTBX + A), respectively, where p = p 1 , p 2 ,⋯, p n T, X = x1T, x2T, ⋯, x n TT and B is a diagonal matrix with the diagonal entries p1(1 − p1), ⋅ ⋅⋅, p n (1 − p n ). With g and H available, we can use the Newton–Raphson method to find the maximum a posterior (MAP) estimate or the mode of β, by maximizing , which is denoted as . Then the Laplace approximation of is a normal distribution , with mean and covariance given by :
where B MAP is obtained with .
If we postulate a linear model ε, where with p MAP being obtained with , and , we can show that is the posterior distribution of β in this linear model as follows. The linear model implies that the posterior distribution of β is normal with mean and covariance Σ. Hence, we need to prove . Since the gradient g = 0 at β MAP , we have From (8), we get Therefore, we have , which implies that . This leads to as desired.
Therefore, in the ith iteration we form the following linear model:
where ŷ and the distribution of ε are defined earlier, but β follows the three-level BLASSO-NEG model. Then based on the linear model (9), we use the EBLASSO algorithm  to get a new estimate of A, which replaces Â obtained in the (i-1)th iteration. The iteration process goes on until certain convergence criterion is satisfied, which gives the final estimate of A and Laplace approximation of the posterior distribution of β. The iterative process needs to be initialized. Similar to the EBLASSO , we assume that only one regression coefficient is initially nonzero or equivalently only one α i is finite. The EBLASSO-NEG algorithm is summarized as follows:
Algorithm 1 (EBLASSO-NEG)
Initialization: choose a > −1.5 and b > 0. The first basis in the model, x j , is identified by , with p 0 being the proportion of y i = 1 in the dataset. Compute with , and set and all α i , i ≠ j notionally to infinity
While the convergence criteria are not satisfied
Given Â, use the Newton–Raphson method to find
Apply the EBLASSO algorithm  to linear model (9) to update Â.
Output and covariance Σ.
Note that the algorithm starts with a logistic regression model with only one variable and then iteratively adds variables with a finite α i to the model. The number of variables in the model k m is typically much smaller than the total number of possible variables k. Since we only need to calculate the gradient g and the Hessian matrix H for the k m variables in the model, the computation required in step 3 and in the calculation of Σ in (8) is relatively small. Moreover, the EBLASSO algorithm in step 5 is very efficient due to the fact that the variance components can be estimated in a closed form and other algorithmic techniques as discussed in .
The convergence criteria in Algorithm 1 are defined as: 1) no effect can be added to or delete from the model, 2) the likelihood change between two consecutive iterations is less than a pre-specified small value and 3) the total change of α between two consecutive iterations is less than a pre-specified small value. In step 5, the EBLASSO algorithm  needs to be modified slightly. First, since the noise covariance is known as B-1, we do not estimate noise covariance. Second, since the mean of ŷ in (9) is zero, we do not need to estimate it. Third, we use the formula Σ = (XTBX + A)− 1 to update the covariance of β.
The values of hyperparameters (a, b) are determined by cross validation as will be described in the Result section. The first basis in the initialization step is determined using the method in LASSO-logistic regression . The initial value of the first basis is calculated from a linear regression model that uses y as the response variable , and the variance σ j 2 is approximated as , resulting in .
Suppose that output from the EBLASSO-NEG algorithm contains k n entries. Then the posterior distribution of the corresponding k n × 1 regression coefficient can be approximated by a normal distribution with mean and covariance Σ. For the ith entry of , we can calculate a t-statistics , and then use the student’s distribution to calculate a p-value for . Markers that correspond to those with a p-value less than a threshold, say 0.05, are then identified as QTLs and the corresponding entries of are estimated effect sizes of the QTLs.
We next derive an efficient EB algorithm for the BLASSO-NE model, which simplifies the hyperparameter selection, since we only need to determine the optimal value of one hyperparameter λ using cross validation.
Empirical Bayesian algorithm for the BLASSO-NE model (EBLASSO-NE)
The prior distribution of σ i 2, i = 1, 2, …, k or equivalently α i , i = 1, 2, …, k is used only in step 5 of the EBLASSO-NEG algorithm. Because the EBLASSO-NE model uses a different prior distribution from the one used in the BLASSO-NEG model, we will derive a new formula for estimating A in each iteration. Then the EBLASSO-NE algorithm uses the same steps in the EBLASSO-NEG except that the new formula is used in step 5 to find Â.
Suppose that the postulated linear model after step 4 is . Following the derivation in , the log marginal posterior distribution of α can be found as:
where is the covariance matrix of ŷ in the linear model. Each α i will be estimated iteratively by maximizing the log marginal posterior distribution L(α) with the other parameters fixed. Specifically, L(α) can be written as , where L(α) is a function of α i and is a function of the remaining parameters. By defining , we can write L(α) as:
where and . In the Additional file 1, we show that L(α) has a unique global maximum and that the optimal α i maximizing L(α) is given by:
where , and Δ = s i 2 + 8λq i 2.
The EBLASSO-NE algorithm has the same steps as those in Algorithm 1 but with the following two modifications. First, we need to choose a value for parameter λ instead of a and b in step 1, which can be done using cross validation. In the LASSO-logistic regression , an upper bound of λ was estimated to be . In our EBLASSO-NE, we suggest the maximum value of λ to be λmax= 1.5λ lasso based on our simulations showing that this maximum value usually only gives one nonzero regression coefficient. Second, when applying the EBLASSO algorithm in step 5 of Algorithm 1, the EBLASSO algorithm uses equation (12) instead of equation (8) in  to estimate A.
Note that since the EBLASSO-NEG model for logistic regression considered in this paper uses the same prior as the one used by the EBLASSO for linear regression considered in , the EBLASSO-NEG algorithm is essentially a combination of Laplace approximation and the EBLASSO algorithm . The EBLASSO-NE model considered here however uses a prior different from the one used by the EBLASSO . Therefore, we employ (12) to modify the EBLASSO algorithm in , and then the EBLASSO-NE algorithm is a combination of Laplace approximation and the modified EBLASSO algorithm.
A total of m = 481 genetic markers were simulated to be evenly spaced on a large chromosome of 2400 centi-Morgan (cM) with an interval of d = 5 cM, which gives rise to a correlation of R = e-2d = 0.9048 since the Haldane map function  was assumed. The simulated population was an F2 family derived from cross of two inbred lines. The dummy variable for the three genotypes, A 1 A 1 , A 1 A 2 and A 2 A 2 of individual i at marker j was defined as x ij = 1, 0, -1, respectively. Two simulation setups were employed. In the first setup, 20 markers were QTLs with main effects but without interactions. In the second setup, 10 main and 10 epistatic effects were simulated; a marker could have both main and epistatic effects, while two markers involving in an interaction effect did not necessarily have main effects. The QTLs were selected randomly with varying distances (5 cM - 500 cM) and effect sizes (in the range between −1.28 and 2.19). Note that QTLs were assumed to be coincided with markers in both simulation setups. If QTLs are not on markers, they may still be detected since correlation between a QTL and a nearby marker is high, although a slightly larger sample size may be needed to give the same power of detection.
EBLASSO-NEG and EBLASSO-NE algorithms were implemented in C and could be called from the R environment , and thus QTL mapping with these two algorithms were carried out in R. To compare the performance of our algorithms with that of other relevant algorithms, we also analyzed the simulated data with the following five QTL mapping methods: the LASSO regularized logistic regression implemented in the glmnet package , the HyperLasso , the BhGLM method , the RVM [32, 33], and the single QTL logistic regression test using the glm procedure in R. Simulation results from these methods are presented next.
Simulation result for the dataset with only main effects
The genotypes of m = 481 markers of n = 500 individuals were generated using the procedure described earlier. Twenty markers were chosen as QTLs; their IDs and effect sizes are given in Table 1. Let x i be a 20 × 1 vector containing the genotypes of 20 QTLs of individual i, and β contain corresponding effect sizes. Then probability p i was calculated from , and y i was generated from a Bernoulli random variable with parameter p i . Therefore, a simulated data set included a 500 × 1 vector y and a 500 × 481 design matrix X.
The average log likelihood (denoted as logL) from ten-fold cross validation was used to select the optimal value of the hyperparameter(s) in the EBLASSO-NE, the EBLASSO-NEG, the LASSO. Specifically, the dataset was first divided into 10 subsets. Nine subsets were used as the training data to estimate model parameters and the log likelihoods of the remaining testing data were calculated using the estimated parameters. This process was repeated ten times until every subset had been tested. The logL was the average of all the likelihoods obtained from 10 testing datasets.
For the EBLASSO-NE, we first calculated λmax as described earlier. We then chose a set of values for λ decreasing from λmax to 0.001 at a step of 0.35 on the logarithmic scale. Ten-fold cross validation using this set of values identified an optimal value denoted as λ1. We next zoomed in the interval of length 0.01 centering at λ1, and performed cross validation using ten more values equally spaced in the interval. This procedure identified the largest logL at the optimal λ = 0.050. A summary of results for several values of λ and the corresponding logL was given in Table 2. Using the optimal values of λ, the EBLASSO-NE detected 11 true and 2 false positive effects that have p-value ≤ 0.05. The estimated sizes of the true effects and their standard errors were given in Table 1.
The optimal values of parameters a and b in the EBLASSO-NEG were obtained with cross validation in three steps. In the first step, a = b = 0.001, 0.01, 0.1, 1 were examined and a pair (a1, b1) corresponding to the largest logL was obtained. In the second step, b was fixed at b1 and a was chosen from the set [−0.5, -0.4, -0.3, -0.2, -0.1, -0.01, 0.01, 0.05, 0.1, 0.5, 1], which yielded an a2 corresponding to the largest logL. Note that when fixing one of the two parameters, the degree of shrinkage is a monotonic function of the other parameter. In the third step, a = a2 was fixed and b was varied from 0.01 to 10 with a step size of one for b > 1 and a step size of one on the logarithmic scale for b < 1. This three-step procedure identified the optimal pair of parameters that maximized logL at (a, b) = (0.01, 6). A summary of results for several pairs of a and b and the corresponding logL was given in Table 2. The whole dataset was then analyzed with the optimal parameters, which identified 11 true and 1 false positive effects that have p-value ≤ 0.05. The estimated sizes of true effects and their standard errors were depicted in Table 1.
For the LASSO-logistic regression, the cross validation procedure in the glmnet package  automatically identified a maximum value λmax for λ that gave one nonzero effect and then λ was decreased from λmax to λmin = 0.001λmax with a step size of [log(λmax) − log(λmin)]/100 on the logarithmic scale. The largest λ that yielded a logL within one standard error of the maximum logL was determined as the optimal value. This gave the optimal values λ = 0.0257. The whole dataset was then analyzed using the optimal λ to estimate QTL effects, which identified 40 markers with nonzero regression coefficients. The LASSO only estimates regression coefficients without giving a confident interval or a p-value for the estimates. If we counted all nonzero coefficients as detected effects and considered them as one effect if they were in 20 cM from a QTL, the LASSO yielded 14 true positive effects and 12 false positive effects. Moreover, the LASSO typically gives a biased estimate of non-zero coefficient toward zero. To reduce the false positive rate and the bias, we refitted an ordinary logistic regression model with the markers selected by the LASSO. Among those markers with a p-value ≤ 0.05 in the refitted model, 6 markers were true positive effects 4 were false positive effects. A summary of results from cross validation for the EBLASSO-NE, the EBLASSO-NEG and LASSO was given in Table 2. The estimated sizes of true effects and their standard errors were depicted along with all 20 true effects in Table 1.
The HyperLasso employed the same Bayesian NEG hierarchical prior for the marker effects and estimated the posterior modes using a numerical algorithm . Hoggart et al. did not propose a cross validation procedure to determine the values of a and b but suggested a range from 0.01 to 10 for a and gave a formula to calculate b from the level of the type-I error controlled at α. In our simulations, we used three values for a and four values for α as listed in Table 3. The values of b calculated from α using the method in  is also included in Table 3. Similar to the LASSO, the HyperLasso outputs a point estimate of β without a confidence interval or a p-value. Therefore, we refitted the markers selected by the HyperLasso with ordinary logistic regression and identified markers with a p-values ≤ 0.05 as QTLs. The number of effects identified with different values of a and b are presented in Table 3. The best results in Table 3 include 10 true and 1 false positive effects. We would emphasize here that these best results may not be achievable in the analysis of real data because the optimal values of a and b cannot be determined. The estimated sizes of true effects and their standard errors for the best results in Table 3 were depicted along with all 20 true effects in Table 1.
The BhGLM method  employed a two-level Bayesian hierarchical prior for the marker effects: the ith entry of β follows a normal distribution N(0, σ i 2) and σ i 2 obeys an inverse- χ2 distribution Inv - χ2(υ i , τ i 2), and it used the EM algorithm to estimate the posterior mode of the Bayesian QTL model. The default value for hyperparameters ν i and τ i are 0.01 and 10-4, respectively. The variance of regression coefficients in BhGLM method was treated as missing data and estimated in the E-step of the EM algorithm. The p-value of each nonzero effect was calculated from the t-distribution of one degree of freedom using the t-statisrics calculated from the estimated regression coefficient and corresponding variance. We examined 15 different pairs of values for ν i and τ i as listed in Table 4. The values ν i = 10-3, 10-4 and 10-5 all gave the best result which includes 9 true and 0 false positive effects. The corresponding estimated sizes of the true effects and their standard errors were depicted in Table 1.
The RVM for classification  assumed a uniform prior for the variance of regression coefficients, and thus it did not involve parameter selection. We utilized the MATLAB implementation of the RVM from the authors  to analyze the datasets and identified 35 nonzero effects with a p-value ≤ 0.05. Among these effects, we have 17 true and 18 false positives. The estimated sizes of true effects and their standard errors were depicted in Table 1. The false positive rates were much higher that the EBLASSO-NEG, EBLASSO-NE, LASSO, and HyperLasso. When we reduced the number of false positive effects to three, at a level slightly higher than that of our EBLASSO-NEG and EBLASSO-NE, by reducing the cutoff for the p-value, we obtained 10 true positive effects, which were smaller than the number of true effects identified with our EBLASSO-NEG and EBLASSO-NE.
Single QTL mapping with logistic regression was performed using the glm procedure in R . After Bonferroni correction, effects with a p-value ≤ 0.05/481 = 1.04 × 10-4 were considered as significant, which identified 8 true and 25 false positive effects. The estimated sizes of true effects and their standard errors were depicted in Table 1. The small p-value cutoff used by Bonferroni correction was expected to yield a small false positive rate. However, the single QTL mapping method with Bonferroni correction still gave much more false positive effects than other methods. If we had used another popular permutation technique for multiple test correction, we effectively employed a larger p-value cutoff. Although this could increase power of detection, it would also increase the false positive rate. To see this, we increased the cutoff for the p-value to 6 × 10-4 so that the number of true positive effects detected was increased to 11, at a level same as that of our EBLASSO-NEG and EBLASSO-NE methods, but then the number of false positive effects was increased to 27.
As shown in Table 1, the EBLASSO-NE and the EBLASSO-NEG identified more true effects than other four methods except RVM, and yielded a number of false positive effects comparable to those of the LASSO, the HyperLasso and the BhGLM but much smaller than those of the RVM and the single-QTL mapping method. Note that the false negative rate can be easily calculated from 20 simulated true effects and the true positive effects detected by each method. While the EBLASSO-NE and the EBLASSO-NEG offered similar performance in terms of power of detection and the false positive rate, several true effects were detected by either of them, which implies that the power of detection could be improved if the results of two methods were combined. Similar observations can be seen from the simulation results for two more independent replicates described in Additional file 1: Table S1 and Table S2.
It is well known that the LASSO typically selects only one variable among a set of highly correlated variables. This phenomenon is indeed observed from the results in Tables 1, Additional file 1: Tables S1 and S2 for two pairs of highly correlated markers, (72,73) and (181,182). It turns out that all methods compared except the single QTL mapping method have the same problem, although the problem with the EBLASSO-NE tends to be less severe. The LASSO is also known to bias the regression coefficients toward zero. This is observed from the results in Table 1, Additional file 1: Tables S1 and S2. Since the EBLASSO-NE uses the same prior distribution as the LASSO, it exhibits the same trend. However, the RVM inflated all the detected effects likely due to its small degree of shrinkage. On the other hand, the EBLASSO-NEG, the HyperLasso and the BhGLM tend to detect only one of the two highly correlated effects with an inflated effect size.
All simulations were performed on a personal computer (PC) with a 3.4 GHz Intel PentiumD CPU and 2Gb memory running Windows XP, except that the HyperLasso was ran on an IBM BladeCenter cluster including computing nodes with 2.6 GHz Xeon or 2.2 GHz Opteron CPU running Linux. The speeds of the EBLASSO-NEG, the LASSO and the HyperLasso are comparable and faster than the other methods. The speeds of the EBLASSO-NE, the BhGLM and the single-QTL mapping method are comparable.
Simulation results for the model with main and epistatic effects
In the second simulation setup, 10 main and 10 epistatic effects were simulated. The genotypes of m = 481 markers of n = 1,000 individuals were generated using the procedure described earlier. Locations and effects of the QTLs and QTL pairs are shown in Table 5. Some of the markers had only main or epistatic effect, while the others had both main and epistatic effects. The status of the binary trait of each individual was generated using the logistic regression model as described in the first simulation setup. The QTL model contained a total of possible effects, a number about 116 times of the sample size, and the design matrix X was of size 1000 × 115,921. QTL mapping was performed with all methods described earlier. However, the BhGLM method did not converge and the RVM ran out of memory due to a large number of nonzero effects included in the model, and thus, they did not yield any result.
The same cross-validation procedures described earlier were performed to choose the optimal values of the hyperparameters for the EBLASSO-NE, the EBLASSO-NEG and LASSO, and the results for several values of hyperparameters are presented in Table 6. Optimal values of the hyperparameters were then used to analyze the whole dataset. The number of true and false positive effects identified and the estimated effect sizes of the detected true effects are given in Table 5.
For the HyperLasso, we again examined 12 pairs of values for hyperparameters a and b as listed in Table 7, and identified a = 0.1 and b = 4.9 × 10-4 as the optimal values that gave best tradeoff between the true and false positive effects. We also used a two-locus logistic regression model logit(p i ) = β0 + β1x i + β2x j + β3x i ⋅ x j , i = 1, ⋯, m-1, j > i, to test the epistatic effect of locus i and j. The logistic regression model was fitted with the glm procedure in R . Effects with a p-value ≤ 0.05/k = 4.31 × 10-7 were considered as significant. Detailed QTL mapping results for the HyperLasso and the two-locus test are also given in Table 5.
As shown in Table 5, the EBLASSO-NE and the EBLASSO-NEG detected the same or a larger number of true effects but a smaller number of false positive than the other methods, which clearly demonstrates that our EBLASSO-NE and EBLASSO-NEG offer the best overall performance in terms of power of detection and the false positive rate. The LASSO is the fastest, while the EBLASSO-NEG and the HyperLasso are the second and the third fastest. Similar observations were obtained from two more independent simulations, as shown in the results for replicates 2 and 3 whose details were presented in the Additional file 1.
Real data analysis
We used a mouse data published by Masinde et al. as an example to test our methods. The trait was wound healing speed of mice measured as a binary trait, fast healer denoted by 1 and slow healer denoted by 0. There were 633 F2 mice derived from the cross of a faster healer inbred line (MRL/MPj) and a slow healer inbred line (SJL/J). At age 3 weeks, each F2 mouse was punched a 2-mm hole in the lower cartilaginous part of each ear using a metal ear puncher. The fast healer mice completely healed in 21 days after ear punch (complete closure of the holes) while the slow healer mice remained open for the holes after 21 days of ear punch. Some of the F2 mice healed partially and these mice were phenotypically coded as 1 if the holes were < 0.7 mm and 0 if the holes were > 0.7 mm. This dataset consisted of the genotypes of 119 markers across the mouse genome from 633 samples. Samples with more than 10% of missing markers or with missing phenotype were removed, resulting in a 532 × 119 genotype matrix with 3.28% missing values. These missing genotypes were inferred from neighboring markers. Total number of possible effects is k = 7141.
We carried out QTL mapping for this dataset using the EBLASSO-NE, the EBLASSO-NEG, the LASSO and the HyperLasso, since simulation results presented earlier show these four methods offer better performance than the other methods. Ten-fold cross validation for the EBLASSO-NE, the EBLASSO-NEG and the LASSO were performed with the same procedures used in simulation studies to obtain optimal values of the hyperparameters. For the EBLASSO-NE, λ = 0.4 was determined as the optimal value, which resulted in 7 main and 4 epistatic effects with a p-value ≤ 0.05. For the EBLASSO-NEG, (a, b) = (0.01, 0.5) were the optimal values which resulted in 4 main and 4 epistatic effects with a p-value ≤ 0.05. For the LASSO, λ = 0.0715 was the optimal value, which yielded 13 non-zero effects. Refitting an ordinary logistic regression model with these 13 effects to the data identified 7 main and 3 epistatic effects with a p-value ≤ 0.05. When α = 0.05/k and 0.01/k, the HyperLasso only identified 1 or 2 main effects; when α = 0.05, it identified more than 40 effects. On the other hand, when a = 0.1 and α = 0.01, it identified 8 main and 17 epistatic effects, all having a p-value ≤ 0.05, which seemed a more reasonable result than the ones obtained with other values of a and α. Therefore, these 25 effects were regarded as the effects identified by the HyperLasso. The effects identified by at least 3 of the 4 methods are listed in Table 8, which contain 7 main and 3 epistatic effects.
Masinde et al. previously identified 10 main effects using a single-QTL mapping method, and 8 epistatic effects using two-way ANOVA. Among the 7 main effects and 3 epistatic effects identified in our analysis, 6 of the main effects are identical to those identified by Masinde et al. but all 3 epistatic effects are different from the ones identified by Masinde et al. One of the markers involved in an epistatic effect (D4mit31) identified in our analysis was a main effect identified by Masinde et al. Since our QTL model considers the main and epistatic effects jointly, it may account for both effects more reliably, comparing with the single-QTL mapping approach and the two way ANOVA used by Masinde et al. Therefore, the three epistatic effects identified in our analysis may be novel effects that are worth further experimental investigation.
Some of the identified QTLs are in positions close to the genes that are up-regulated in expression profiles obtained during the inflammation stage of wound healing . Loci D3mit217 and D9mit270 were identified as main effects in both our analysis and the study of Masinde et al. It turns out that D3mit217 (chr3, 34.7 cM) is close to genes calgranulin A (chr3, 43.6 cM), CD53 (chr3, 50.5 cM), and small proline-rich protein 1A (chr3, 45.2 cM), and that D9mit270 (chr9, 41.5 cM) is close to gene annexin A2 (chr9, 37.0 cM). Locus D9mit182 was identified as a main effect in our study but not identified as an effect in the study of Masinde et al. It was found that D9mit182 (chr9, 53.6 cM) is close to chemokine receptor 2 (chr9, 71.9 cM). Among the loci in the three epistatic effects we identified, D11mit242 (chr11, 31.7 cM) is close to chemokine (C-C motif) ligase 4 (chr11, 47.6 cM) and chemokine (C-C motif) ligase 6 (chr11, 41.5 cM). Genes related to growth factors are known to play an important role in wound healing [43, 45]. D7mit246 and D17mit176 are involved in the epistatic effects we identified; and D7mit246 is 5.0 cM away from the fibroblast growth factor receptor 2 (FGFR 2), and D17mit176 is 12.2 cM away from the vascular endothelial growth factor (VEGF).
Our EBLASSO-NEG algorithm is based on a Bayesian logistic regression model that uses the same three-level hierarchical prior for the regression coefficients as the one used in the Bayesian LASSO linear regression model [23, 30, 31], the Bayesian hyper-Lasso linear regression model  and the HyperLasso logistic regression model . The HyperLasso of Hoggart et al. uses a numerical algorithm to estimate the mode of the posterior distribution, whereas our EBLASSO-NEG first estimates the variance of the regression coefficients and then find an approximation of the posterior distribution of the regression coefficients based on the estimated variance. As shown in our simulation study, our EBLASSO-NEG offers better performance than the HyperLasso of Hoggart et al. in terms of power of detection, false positive rate and speed, especially when the number of possible effects is very large in QTL models with both main and epistatic effects.
The LASSO-logistic regression was applied to GWAS to identify genomic loci associated with complex disease [28, 29], and it can be directly employed in QTL mapping for binary traits as shown in our simulation study and real data analysis. The LASSO-logistic regression particularly implemented with the glmnet algorithm  is very efficient. Our EBLASSO-NE algorithm and the LASSO-logistic regression essentially employ the same two-level prior for the regression coefficients. However, the major difference is that the LASSO-logistic regression estimates the mode of the posterior distribution, whereas our EBLASSO-NE algorithm first estimates the variances of the regression coefficients and then finds the posterior distributions of the regression coefficients. In our simulation study, we demonstrated that both our EBLASSO-NE and EBLASSO-NEG algorithms outperform the LASSO-logistic regression in terms of power of detection and false positive rate, although their speed is lower than that of the LASSO. The good performance of our EBLASSO-NE and EBLASSO-NEG may be due to the fact that our model inference using the variance and posterior distribution of the regression coefficients provides more information than the point estimation of the regression coefficients yielded by the LASSO-logistic regression.
Another prior distribution commonly employed in Bayesian shrinkage is the mixture of normal and inverse-χ2 distributions as used in the Bayesian linear regression model for continuous traits [23, 24, 46] and the generalized linear model for continuous or binary traits in the BhGLM method . The BhGLM method uses the EM algorithm to estimate the mode of the posterior distribution treating the variance of regression coefficients as missing data. As shown in our simulations for QTL models with only main effects, our EBLASSO-NEG and EBLASSO-NE offer power of detection better than and a false positive rate comparable to the BhGLM method. The speed of the EBLASSO-NEG is much higher than that of the BhGLM method, while the speed of the EBLASSO-NE is comparable to that of the BhGLM method. However, the BhGLM method was not able to handle a QTL model with 115,921 variables and 1,000 samples in our simulation, whereas our EBLASSO-NEG and EBLASSO-NE completed the analysis of this model within 5 min and 35 min, respectively.
Our EBLASSO-NEG and EBLASSO-NE use the same Laplace’s method originally proposed in  as the one used in the RVM for classification [32, 33]. However the prior distributions used by three methods are different. Although both the uniform prior and the inverse gamma prior for the variance of regression coefficients were considered in [32, 33], the more efficient RVM algorithm  employs the uniform prior. The uniform prior does not have any hyperparameter and lacks flexibility of adjusting the degree of shrinkage that our EBLASSO-NEG and EBLASSO-NE enjoy. The uniform prior of the RVM causes at least two problems. First, it often does not provide sufficient degree of shrinkage in multiple QTL mapping that includes a very large number of possible effects, which results in a large number of false positive, as observed in our simulations in this paper and in the simulation study for QTL mapping for continuous traits in [30, 46]. Second, because of the relatively low degree of shrinkage, the regression model usually contains a relatively large number of nonzero regression coefficients, which reduces the speed of the algorithm, as seen in our simulations.
Our EBLASSO-NEG and EBLASSO-NE estimate the variance of regression coefficients iteratively. In each iteration, Laplace’s method is first used to obtain an approximation of the posterior distribution, which results in an equivalent linear regression model. Then, the EBLASSO-NEG uses the EBLASSO algorithm we developed in  to estimate the variance. Therefore, the EBLASSO-NEG essentially is a combination of Laplace’s method and the EBLASSO algorithm. However, the EBLASSO does not consider the prior of the EBLASSO-NE, and thus, we derive a novel formula in (12) and modify the EBLASSO algorithm to estimate the variance for EBLASSO-NE. The EBLASSO-NEG has two hyperparameters, whereas the EBLASSO-NE has only one hyperparameter, which simplifies cross-validation for selecting the optimal value of the hyperparameter. Moreover, simulation studies demonstrated that the EBLASSO-NE identified some QTLs that were not detected by the EBLASSO-NEG, suggesting that combination of the two methods can lead to increased detection power.
The full Bayesian methods [11–13, 15, 47] use the threshold model that employs a latent liability variable to link the binary trait with the QTLs and then apply MCMC simulation to infer the model. It is well known that MCMC simulation requires very large computation for the model with a relatively large number of variables. Therefore, these fully Bayesian methods may not be computationally efficient for QTL mapping with both main and epistatic effects from a relatively large number of QTLs.
The MIM methods [17, 18] for mapping binary trait loci either does not employ any variable selection technique  or uses step-wise logistic regression and BIC to select variables . Hence, it is difficult for them to deal with a QTL model with a relatively large number of loci and epistatic effects due to their demanding computation burden and possible large false discovery rate. The probability model in  for binary trait mapping entails a model space exponentially increasing with the number of loci. Although the BIC is used for model selection, its computational complexity increases dramatically with the number of loci.
We demonstrated that our EBLASSO-NE and EBLASSO-NEG can easily handle a model with 115,922 variables. If much more variables are involved, in e.g. GWAS or QTL mapping with high order interactions, our methods can be combined with a variable screening method such as the sure independence screening (SIS) [48, 49] to facilitate computation. It is not difficult to extend our EBLASSO-NE and EBLASSO-NEG algorithms to QTL mapping for ordinal traits. We can first apply Laplace’s method to get an approximately equivalent linear model by deriving the gradient and the Hessian matrix of the posterior distribution and then apply our efficient EBLASSO algorithm to infer the linear model. Since the multinomial logistic regression model includes more variables than the binary logistic regression model, it will require more computation.
We have developed two algorithms named EBLASSO-NEG and EBLASSO-NE for the inference of Bayesian logistic regression models for multiple QTL mapping. While the EBLASSO-NEG is an extension of the EBLASSO  to handle binary traits, the simulations demonstrate that the EBLASSO-NEG algorithm provides superior performance in terms of power of detection and false positive rate, comparing with five other existing methods. Moreover, the EBLASSO-NEG is faster than four other methods tested but only slower than the LASSO algorithm. The hierarchical prior in the EBLASSO-NE in this paper was not considered in the EBLASSO . Here we derive a novel formula given in equation (12) to estimate the variance of regression coefficients. Our simulations show that the EBLASSO-NE provides comparable or better power of detection and false positive rate comparing with five other existing methods, and the power of detection could be improved if the results are combined with that of the EBLASSO-NEG. In summary, our EBLASSO-NE and EBLASSO-NEG algorithms provide an efficient tool for QTL mapping for binary traits involving a large number of both main and epistatic effects, and they can be extended to QTL mapping for ordinal traits.
Akaike information criterion
Bayesian information criterion
Bayesian hierarchical generalized linear model
Empirical Bayesian LASSO
Genome-wide association study
Least absolute shrinkage and selection operator
Maximum a posterior
Markov Chain Monte Carlo
Normal exponential prior
Normal exponential gamma prior
Quantitative trait locus
Relevant vector machine.
Falconer DS, Mackay TFC: Introduction to Quantitative Genetics. 1996, Boston: Addison-Wesley, 4
Hackett CA, Weller JI: Genetic mapping of quantitative trait loci for traits with ordinal distributions. Biometrics. 1995, 51 (4): 1252-1263. 10.2307/2533257.
Xu S, Atchley WR: Mapping quantitative trait loci for complex binary diseases using line crosses. Genetics. 1996, 143 (3): 1417-1424.
Rao S, Xu S: Mapping quantitative trait loci for ordered categorical traits in four-way crosses. Heredity. 1998, 81 (2): 214-224. 10.1046/j.1365-2540.1998.00378.x.
Xu S, Yi N, Burke D, Galecki A, Miller RA: An EM algorithm for mapping binary disease loci: application to fibrosarcoma in a four-way cross mouse family. Genet Res. 2003, 82 (2): 127-138. 10.1017/S0016672303006414.
Xu C, Zhang YM, Xu S: An EM algorithm for mapping quantitative resistance loci. Heredity. 2004, 94 (1): 119-128.
Xu C, Li Z, Xu S: Joint mapping of quantitative trait loci for multiple binary characters. Genetics. 2005, 169 (2): 1045-1059. 10.1534/genetics.103.019406.
Deng W, Chen H, Li Z: A logistic regression mixture model for interval mapping of genetic trait loci affecting binary phenotypes. Genetics. 2006, 172 (2): 1349-1358.
Haley CS, Knott SA: A simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity. 1992, 69 (4): 315-324. 10.1038/hdy.1992.131.
Martínez O, Curnow RN: Estimating the locations and the sizes of the effects of quantitative trait loci using flanking markers. Theor Appl Genet. 1992, 85 (4): 480-488.
Yi N, Xu S: Bayesian mapping of quantitative trait loci for complex binary traits. Genetics. 2000, 155 (3): 1391-1403.
Yi N, Xu S: Mapping quantitative trait loci with epistatic effects. Genet Res. 2002, 79 (2): 185-198.
Yi N, Xu S, George V, Allison DB: Mapping multiple quantitative trait loci for ordinal traits. Behav Genet. 2004, 34 (1): 3-15.
Yi N, Banerjee S, Pomp D, Yandell BS: Bayesian mapping of genomewide interacting quantitative trait loci for ordinal traits. Genetics. 2007, 176 (3): 1855-1864. 10.1534/genetics.107.071142.
Huang H, Eversley CD, Threadgill DW, Zou F: Bayesian multiple quantitative trait loci mapping for complex traits using markers of the entire genome. Genetics. 2007, 176 (4): 2529-2540. 10.1534/genetics.106.064980.
Yang R, Li J, Wang X, Zhou X: Bayesian functional mapping of dynamic quantitative traits. Theor Appl Genet. 2011, 123 (3): 483-492. 10.1007/s00122-011-1601-0.
Chen Z, Liu J: Mixture generalized linear models for multiple interval mapping of quantitative trait loci in experimental crosses. Biometrics. 2009, 65 (2): 470-477. 10.1111/j.1541-0420.2008.01100.x.
Li J, Wang S, Zeng ZB: Multiple-interval mapping for ordinal traits. Genetics. 2006, 173 (3): 1649-1663. 10.1534/genetics.105.054619.
Coffman CJ, Doerge RW, Simonsen KL, Nichols KM, Duarte CK, Wolfinger RD, McIntyre LM: Model selection in binary trait locus mapping. Genetics. 2005, 170 (3): 1281-1297. 10.1534/genetics.104.033910.
Xu S: Estimating polygenic effects using markers of the entire genome. Genetics. 2003, 163 (2): 789-801.
Wang H, Zhang YM, Li X, Masinde GL, Mohan S, Baylink DJ, Xu S: Bayesian shrinkage estimation of quantitative trait loci parameters. Genetics. 2005, 170: 465-480. 10.1534/genetics.104.039354.
Hoti F, Sillanpää MJ: Bayesian mapping of genotype x expression interactions in quantitative and qualitative traits. Heredity. 2006, 97 (1): 4-18. 10.1038/sj.hdy.6800817.
Yi N, Xu S: Bayesian LASSO for quantitative trait loci mapping. Genetics. 2008, 179 (2): 1045-1055. 10.1534/genetics.107.085589.
Xu S: An empirical Bayes method for estimating epistatic effects of quantitative trait loci. Biometrics. 2007, 63 (2): 513-521. 10.1111/j.1541-0420.2006.00711.x.
Hoggart CJ, Whittaker JC, De Iorio M, Balding DJ: Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genet. 2008, 4 (7): e1000130-10.1371/journal.pgen.1000130.
Yi N, Banerjee S: Hierachical generalized linear models for multiple quantitative trait locus mapping. Genetics. 2009, 181: 1101-1133. 10.1534/genetics.108.099556.
Tibshirani R: Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol. 1996, 58 (1): 267-288.
Wu TT, Chen YF, Hastie T, Sobel E, Lange K: Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics. 2009, 25: 714-721. 10.1093/bioinformatics/btp041.
Ayers KL, Cordell HJ: SNP selection in genome-wide and candidate gene studies via penalized logistic regression. Genet Epidemiol. 2010, 34 (8): 879-891. 10.1002/gepi.20543.
Cai X, Huang A, Xu S: Fast empirical Bayesian LASSO for multiple quantitative trait locus mapping. BMC Bioinformatics. 2011, 12 (1): 211-10.1186/1471-2105-12-211.
Park T, Casella G: The Bayesian lasso. J Am Stat Assoc. 2008, 103 (482): 681-686. 10.1198/016214508000000337.
Tipping ME: Sparse Bayesian learning and the relevance vector machine. J Mach Learn Res. 2001, 1 (3): 211-244.
Tipping ME, Faul AC: Fast marginal likelihood maximisation for sparse Bayesian models. 2003, Key West, FL: Proc 9th International Workshop on Artificial Intelligence and Statistics
Cockerham CC: An extension of the concept of partitioning hereditary variance for analysis of covariances among relatives when epistasis is present. Genetics. 1954, 39 (6): 859-882.
Griffin JE, Brown PJ: Bayesian hyper-lassos with non-convex penalization. Aust N Z J Stat. 2011, 53 (4): 423-442. 10.1111/j.1467-842X.2011.00641.x.
Carlin BP, Louis TA: Bayesian methods for data analysis. 2008, London/New York: Chapman & Hall/CRC, 3
Bishop CM: Pattern recognition and machine learning. 2006, New York: Springer
MacKay DJC: The evidence framework applied to classification networks. Neural Comput. 1992, 4 (5): 720-736. 10.1162/neco.19220.127.116.110.
Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining, inference, and prediction. 2009, New York: Springer, 2
Wu R, Ma CX, Casella G: Statistical genetics of quantitative traits: linkage, maps, and QTL. 2007, LLC: Springer Science + Business Media
R Development Core Team: A language and environment for statistical computing. 2012, Vienna, Austria: R Foundation for Statistical Computing
Friedman J, Hastie T, Tibshirani R: Regularization paths for generalized linear models via coordinate descent. J Stat software. 2010, 33 (1): 1-22.
Masinde GL, Li X, Gu W, Davidson H, Mohan S, Baylink DJ: Identification of wound healing/regeneration Quantitative Trait Loci (QTL) at multiple time points that explain seventy percent of variance in (MRL/MpJ and SJL/J) mice F2 population. Genome Res. 2001, 11 (12): 2027-2033. 10.1101/gr.203701.
Li X, Mohan S, Gu W, Baylink DJ: Analysis of gene expression in the wound repair/regeneration process. Mamm Genome. 2001, 12 (1): 52-59. 10.1007/s003350010230.
Kunimoto BT: Growth factors in wound healing: the next great innovation?. Ostomy Wound Manage. 1999, 45 (8): 56-64.
Xu S: An expectation maximization algorithm for the Lasso estimation of quantitative trait locus effects. Heredity. 2010, 2010: 1-12.
Yi N, Shriner D, Banerjee S, Mehta T, Pomp D, Yandell BS: An efficient Bayesian model selection approach for interacting quantitative trait loci models with many effects. Genetics. 2007, 176 (3): 1865-1877. 10.1534/genetics.107.071365.
Fan J, Song R: Sure independence screening in generalized linear models with NP-dimensionality. Ann Stat. 2010, 38 (6): 3567-3604. 10.1214/10-AOS798.
Fan J, Lv J: Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Series B Stat Methodol. 2008, 70 (5): 849-911. 10.1111/j.1467-9868.2008.00674.x.
This work was supported by the National Science Foundation (NSF) under NSF CAREER Award no. 0746882 to XC and by the Agriculture and Food Research Initiative (AFRI) of the USDA National Institute of Food and Agriculture under the Plant Genome, Genetics and Breeding Program 2007-35300-18285 to SX.
The authors declare that they have no competing interests.
AH participated in the design of the algorithms, developed the computer programs, performed the simulations and data analyses, and drafted the manuscript. SX participated in the development of the algorithms, designed simulation study, participated in the data analyses and helped to draft the manuscript. XC conceived of and coordinated the study, developed the algorithms, participated in the data analyses and drafted the manuscript. All authors read and approved the final manuscript.