Multitrait analysis of quantitative trait loci using Bayesian composite space approach

Background Multitrait analysis of quantitative trait loci can capture the maximum information of experiment. The maximum-likelihood approach and the least-square approach have been developed to jointly analyze multiple traits, but it is difficult for them to include multiple QTL simultaneously into one model. Results In this article, we have successfully extended Bayesian composite space approach, which is an efficient model selection method that can easily handle multiple QTL, to multitrait mapping of QTL. There are many statistical innovations of the proposed method compared with Bayesian single trait analysis. The first is that the parameters for all traits are updated jointly by vector or matrix; secondly, for QTL in the same interval that control different traits, the correlation between QTL genotypes is taken into account; thirdly, the information about the relationship of residual error between the traits is also made good use of. The superiority of the new method over separate analysis was demonstrated by both simulated and real data. The computing program was written in FORTRAN and it can be available for request. Conclusion The results suggest that the developed new method is more powerful than separate analysis.


Background
Multitrait analysis is defined as a method that includes all traits simultaneously in a single model [1], and can take into account the correlation among all traits. Many methods have been developed for mapping QTL by combining information of multiple traits. Jiang and Zeng [2] pro-posed a maximum likelihood approach, and concluded that joint analysis could improve the precision of parameter estimates and had higher QTL detecting power than separate analysis. A multitrait least-square approach was proposed by Knott and Haley [3] to detect QTL. It is a method that programs easily and computes fast, and com-pared with separate analysis of each trait, can increase the power to detect a pleiotropic QTL and improve the precision of the location estimate. Xu et al. [1] developed a maximum likelihood approach for jointly mapping multiple binary traits, which is implemented via EM algorithm. They found that the QTL detecting power of joint analysis was higher than the sum of those of separate analysis. But after the QTL detecting power for separate analysis was redefined more reasonably by a combined power (see also [1]), the power of joint analysis was almost equal to the combined power, that is, joint analysis had almost the same power as separate analysis. For QTL parameter estimation, joint analysis can improve the precision of the QTL position estimates, but the QTL effects and their standard deviations have no obvious difference. Another class of approaches for multitrait analysis that use a dimension reduction technique was proposed by Korol et al. [4]. Mangin et al. [5] used this technique to analyze independent PCA (principal components analysis) trait, and used the PCA test values to detect QTL, which was proved to be asymptotically equivalent to the multivariate maximum-likelihood ratio test. However, the parameters of this kind of methods are often too difficult to interpret biologically. A maximum-likelihood method for multitrait mapping of QTL under outbred population was developed by Eaves et al. [6], which based on identity-bydescent (IBD) variance components model approach, and QTL effects were treated as random.
All the joint mapping approaches mentioned above were based on one-QTL model. Recently, Bayesian methodology has been used for mapping QTL [7][8][9][10][11][12][13][14][15][16][17], and the main advantage is that it can easily handle multiple QTL simultaneously. Currently, Bayesian reversible jump MCMC (RJMCMC) has become a usual method for mapping multiple QTL. Liu et al. [7] applied the method to multitrait mapping of QTL in outbred population under random effect model. However, because the dimension of RJM-CMC is variable, it is always subject to poor mixing and hard to converge. Godsill [18] developed an effective Bayesian composite space method for model selection which keeps the model dimension fixed in each round of updating, and therefore it converges faster and is much easier to program. Yi et al. [15][16][17] successfully applied the novel approach to map QTL. In this article, we extend Bayesian composite space approach to multitrait analysis under inbred line crosses, and use both simulated data and real data to demonstrate the advantages and disadvantages of the proposed method.

Simulation Study
We simulated 200 backcross individuals, and each has marker information and phenotypic records for three traits. One chromosome with length of 600 cM was inves-tigated. Twenty-one markers were put on the genome with an average distance of 20 cM. Marker genotypes were observed for all the individuals. Thirteen QTL were added onto the genome, of which locus 96, 423, 487 and 584 had pleiotropic effects, and locus 250, 253 and 256, and locus 535 and 537 were closely linked and controlled different traits respectively. The positions and the effects of QTL for each trait are listed in Table 1. The population means for all traits were set to zero. The residual (co)variances are listed in Table 2. The heritability of each trait can be calculated as 0.728 for trait 1, 0.691 for trait 2 and 0.598 for trait 3.
In order to investigate the performance of our approach, two methods were used to analyze the simulated data. The first method was the proposed multitrait analysis; the second is single-trait analysis. In single-trait analysis, we use the method 1 of [16], for the proposed method was a direct extension from it. In both multitrait analysis and single-trait analysis, the prior variance and degree of freedom of the residual error was set to zero, because no prior information was available. The prior expected number of QTL l k was 3 and the maximum number of QTL L k equaled to the number of marker intervals (30). Therefore, the prior inclusion probability of the model indicator variable equaled to 0.1. For both methods, the MCMC ran for 1000 cycles as burn-in period (deleted) and then for additional 20,000 cycles after the burn-in. The chain was then thinned to reduce serial correlation by one observation saved every 10 cycles. The posterior sample contained 2000 (20, 000/10 = 2000) observations for the post-MCMC analysis.
The estimates of the QTL parameters for multitrait analysis and separate analysis are listed in Table 1 and Table 2. The results showed that there were no clear differences of the two methods in the estimates of the QTL positions, QTL effects and the corresponding standard deviation. Both methods can estimate QTL positions and effects, all closed to the true values. Figure 1 and 2 respectively show the profiles of the posterior probability of the QTL positions and the 2log e BF statistic for multitrait analysis, and Figure 3 and 4 for separate analysis. From these figures, we found that both profiles of the posterior probability of QTL positions and the 2log e BF statistic for multitrait analysis are generally higher than those for separate analysis. Moreover, two additional QTL located at 483 and 245 were detected by multitrait analysis. These suggested that multitrait analysis may be more powerful than separate analysis.

Real data analysis
We applied the new method to analyze the data from the North American Barley Genome Mapping Project [22].
The DH population included 150 lines (n = 150), each of which was genotyped for 223 codominant markers. These markers covered ~1500 cM of the genome along seven linkage groups with an average marker interval of ~7 cM. Eight traits, grain yield, lodging, height, heading data, grain protein, alpha amylase, diastatic power, and malt extract, were investigated in this project. Agronomic traits were measured in 16 areas, and malting quality traits in 9 areas. In our research, only three traits were studied, grain yield, height, and alpha amylase, and only the records in Crookston and Minnesota were used.
In the analysis, the prior expected number of QTL was taken as 3 for each trait, then the maximum number of QTL was calculated as L k ≈ 3 + 3· or L k = 8. Therefore, the prior inclusion probability of the model indicator variable equals to 0.375. To reduce the model space, we assumed each chromosome contain at most one QTL, except that the 7th was divided into two parts at the middle point and each part contains one QTL, for the results of other analysis (IM, CIM) always show signals of two QTL on 7th chromosome for some traits. Also two methods, multitrait analysis and Bayesian single-trait analysis (method 1 in [16]), were used to analyze the real data. The MCMC ran for 5 × 10 4 cycles after the first 2000 was discarded. The chain was thinned by every 10 cycles one observation being saved, which yielded 5000 samples for posterior Bayesian analysis. Figure 5 and Figure 6 show the profiles of 2log e BF statistic with real data by multitrait analysis and separate analysis. The profiles of Figure 5 are generally higher than that of Figure 6. For trait 1 (grain yield), no QTL was detected by separate analysis (Figure 6a), while eight QTL were detected by multitrait analysis (Figure 5a); for trait 2 (height), three QTL located on chromosomes 1, 2, and 7 were detected by separate analysis, however by multitrait l k  analysis, not only much stronger signals of these three QTL, but also four additional QTL on chromosome 3, 4, 5 and 6 were detected; for trait 3 (alpha amylase), two additional QTL located on chromosome 1, 3 were detected by multitrait analysis. The results of real data analysis also supported the conclusion that multitrait analysis was more powerful than separate analysis.

Discussion
The selection of hyper-parameter of the QTL effect is important in Bayesian analysis, which can influence the efficiency of the model selection. For example, with Bayesian shrinkage method [14], the hyper-parameter is a variable and assigned a special distribution so that no model selection is need. In Bayesian composite space approach, the updating of model indicator variables is closely dependent on QTL effects, but the selection of hyper-parameter is not much strict as Bayesian shrinkage analysis. Many approaches have been proposed for selection of hyper-parameter, and our method is only an extension of the approach of Yi et al. [15]. Moreover, we followed the approaches developed by Yi et al. [15] to obtain the prior probability for model indicator variables. However, we didn't investigate the influence of different prior probability on the results, because the proposed method is very computationally intensive. In addition, we suggested to use CIM-based multitrait analysis [2] to obtain the prior of variance-covariance of residual, but if prior information is not indeed known, we may take the noninformative prior [19], . In this simulation study, the noninformative prior is used and proved to be able to bring a precise estimate for variance-covariance of residual error.
The profiles of the posterior probability for multitrait analy-sis using the simulated data   The proposed multitrait analysis is based on Bayesian composite space approach, while other popular model selection approaches such as Bayesian shrinkage method [14] and Bayesian SSVS method [23] are also very easily extended, and the details will be demonstrated in another paper. We used BC and DH population as examples to demonstrate the efficiency of the method. The new method can be modified to be applied to other experiment designs, such as RIL, F2 design, etc. In addition, we only take the main effect into account, while the epistatic effect also can be included into the model. In that case, the model should be written as: , where q is main effect, q 1 and q 2 is two interacting QTL, and is (1 × m) column vectors of epistatic effect between QTL q 1 and q 2 . Certainly, the implementation will be complicated and quite time-consuming, but nevertheless, the extension is feasible and expected to be very efficient for mapping interacting QTL.
In this paper, we have not given a test procedure to distinguish closely linked and pleiotropic QTL which cause the genetic correlations between each trait. There have been some of literatures about it, and generally, the likelihood ratio (LR) statistic [1,2] and Bayesian factor (BF) statistic [7] always have been used to solve the problem [7]. In our multitrait analysis, although the LR testing procedure in [2] is completely applicable, it is not optimal, because it is based on single-QTL model. Also Bayesian approach can be used for such testing, but the computing time is a big factor of concern. Hopefully, an efficient and fast approach will be developed that could solve the problem nicely.
The profiles of the posterior probability for single trait analy-sis using the simulated data

Conclusion
Bayesian composite space approach [18] is an effective method for model selection. Yi [16] firstly used it for QTL mapping and proved it to be effective for mapping multiple QTL. In this article, we extended this novel statistical method to multitrait mapping of QTL. Compared with separate analysis, joint analysis is optimal, because the parameters are updated by vector or matrix and the correlation information between multiple traits can be made good use of. The powerful of the proposed multitrait method also be proved by both simulation experiments and real data analysis, and they all showed that the multitrait analysis tends to give higher statistical power than the single trait analysis.

Multivariate linear model
Consider n individuals derived from a backcross population crossed from two inbred lines with observations on some densely distributed codominant markers and on m quantitative traits. Supposed that the maximum number of QTL is p, the phenotypic value y ki of individual i for kth The profiles of Bayes factors for multitrait trait analysis using real data The profiles of Bayes factors for single trait analysis using real data (1) or excluded (0) from the model; b k0 is population mean; b kj is QTL effect; x kij is QTL genotype, if QTL genotype is homozygote x kij = 1, otherwise -1; e ki is residual error and assumed to follow multivariate normal distribution. If we denote equation (1)

Prior specification
The prior distribution of each QTL effect vector b j is mul- is the hyper-parameter, and We take , which is simply an extension from Bayesian single trait analysis [15]. The importance of the choice of the hyper-parameter will be discussed later.
In a large backcross population and under the definition of x mij (-1 or 1), can be simplified as = Σ e . The prior of the covariance matrix of residual error follows where, v e and are prior degree of freedom and covariance matrix of residual error, respectively, and can be obtained from other method, such as CIM based multitrait analysis [2], etc. The prior distribution of population mean b 0 is normal distribution with mean and variance equal to those calculated by phenotypic values. The prior probability distribution of QTL position λ kj is uniform distribution with bounds of two flanking markers, p(λ kj ) = 1/ d j , where d j is length of the interval where jth QTL is confined. Assuming that epistatic effect is absent, the prior inclusion probability for jth effect can be expressed as p(γ kj = 1) = 1 -l k /L k ] 1/N (see also [15]), where l k is the prior expected number of main-effect QTL, and could be roughly estimated with the use of standard genome scans; N is the number of possible main effects for each QTL and equal to 1 in BC family [15]; L k is the upper bound of QTL number, and equals to the number of marker interval in our simulation study, while in another approach suggested by Yi [15]L k is taken as 3 + 3· , which causes the model space to reduce dramatically [15].

Joint posterior density
The observable variables include phenotypic values, and marker information, . The where, p(y, m|Θ) is the likelihood and can be written as: where p(y|Θ) is multivariate normal density, and p(m|Θ) can be derived from a Markov model [14].   f. Update the model indicator variable matrices .

MCMC sampling
The conditional posterior distribution of the population mean b 0 is multivariate normal with mean and variance-covariance matrix The conditional posterior distribution of the QTL effect b j is sampled from multivariate normal distribution with mean and variance-covariance matrix The posterior distribution of the residual error follows inverted Wishart distribution, where and df e = n.
In step e, the QTL locations and QTL genotype matrices are updated jointly. For locus j, we can firstly sample a new QTL position for each trait from their prior distribu-tion (described later), then sample the QTL genotype matrices on the new position using equation (15), and finally, they are updated by the efficient Metropolis-Hastings algorithm [20,21]. Because the sampling of X ij is too complicate and we are going to firstly describe it.
Due to the QTL genotype x kij has two possible values (-1 or 1) in BC line, if m traits are investigated jointly, X ij has 2 m kinds of possible formations, and the general pattern of X ij can be written as: where, z 1 , z 2 , ..., z m ∈ {-1, 1}. For clarity, we omit the subscript ij from and present formulas to denote the genotype matrix of ith individual and jth loci. Because the QTL genotypes x kij of ith individual in the jth interval for all traits may be correlated, the joint prior probability of the genotype matrix X ij can't be simply expressed by the following equation: Instead, it can be derived from the Markov model (see Equation 14), assuming that the order of markers and QTL is M j Q 1 Q 2 ... Q m M j+1 (see Figure 7), where, Q 1 , Q 2 , ..., and Q m denote the QTL respectively affecting trait 1, trait 2, ..., and trait m in jth marker interval. Indicator variables x 1ij , x 2ij , ..., and x mij denote the genotypes of these QTL.
If no segregation interference is considered, the joint prior probability can be factorized into equation (14), and each term in equation (14) can be derived from Haldane map function. Only the first term in equation (14) is conditional on two flanking markers; others are not only conditional on two flanking markers but also on the genotypes of all the QTL prior to the interested one. If double recom- bination is ignored [2], each term in equation (14) can be inferred only by the genotype of the left nearest loci (marker or QTL) and the right marker, then equation (14) can be simplified as: Each term in equation (15) can be easily inferred.
It is worth mentioning that we assume the sequence of markers and QTL is M j Q 1 Q 2 ... Q m M j+1 , and in fact, the sequence of QTL may be variable in each round of updating. Therefore, we should firstly ascertain the sequence in each round, and then construct the appropriate formula to calculate the joint prior probability of the QTL genotype p(X ij = |m i,j ,λj,m i,j+1 ) according above rules.
For clarity, we take an example to demonstrate it. Consider 3 QTL Q 1 , Q 2 , and Q 3 that affect 3 traits respectively in an interval. Assuming that in a certain round the sequence of markers and QTL is M j Q 3 Q 1 Q 2 M j+1 , then the formula for calculating the joint prior probability of the QTL genotype can be written as: Once we obtain the joint prior probability of the QTL genotype, the joint conditional posterior probability of X ij can be expressed as: where is likelihood, and follows multivariable normal distribution, Once we have calculated 2 m possible posterior probabilities for the corresponding QTL genotype matrices, we are going to sample one genotype matrix according to their posterior probabilities. We firstly constructed the cumulative probability function F(d) by accumulating the 2 m probabilities in an arbitrary sequence for d = 1, 2, ..., 2 m and F(0) = 0, which is a discrete distribution; then sampled a random number from uniform distribution, u Ũ is the prior probability of QTL genotype conditional on new and old position, which has been described detailed previously; and The positions of markers and QTL and their sequence ranged on a certain marker interval Figure 7 The positions of markers and QTL and their sequence ranged on a certain marker interval.
In step f, block sampling of the indicator variable matrix Φ j is expected to have a better performance than separately updating each γ kj in Φ j . Due to there are two possible values (0 or 1) for each model indicator γ kj , if m traits are investigated jointly, each model indicator matrix Φ j has 2 m kinds of formations. The general formula of it can be written as: where, w k ∈ {0,1}, for k = 1, 2, ..., m. Because the prior probability of each γ kj is independent, the joint prior probability for all possible formations can be written as . Then the conditional posterior probability of Φ j can be written as The approach to sample Φ j is similar to QTL genotypes sampling previously mentioned.

Post-MCMC analysis
For summarizing the posterior sample, we use the mean of the posterior sample to estimate the QTL effect and the residual (co)variance, and the mode of the posterior probability or the peak of the 2log e BF statistic to localize QTL. 2log e BF statistic was introduced by Yi et al. [17] into QTL mapping, and BF statistic is defined as the ratio of the posterior odds to the prior odds for inclusion against exclusion of the locus [24]. The critical value of BF is 3 or 2log e BF = 2.1 for declaring the existence of a QTL.
In single-trait analysis, we can pick the QTL by plotting the profile of the posterior probability or 2log e BF statistic against the genome. In multitrait analysis, if only two traits are considered jointly, we can use a three-dimension graph to summarize the statistic for all traits jointly (e.g., Figure 2 in [19]). However, if the number of trait is greater than 2, we can't plot them in one graph. Instead, we can solve the problem by plotting the marginal posterior probability distribution. If we divide the genome into H bins, and denote each bin of kth trait with ζ kg , for g = 1,2, ..., H, then the marginal posterior probability distribution of ζ kg is defined as p(ζ kg |y) = p[(ζ kg = λ kq ) ∩ (γ kq = 1)], where, q indicates the qth interval that locus ζ kg resides in.

Then
, which can be calculated at each possible locus for each trait, respectively.

Authors' contributions
MF coordinated the study, developed the foundational principle of the method and wrote the computing program and the paper. Others were responsible for the simulation experiment, carried out the analysis of results and helped to consummate the whole paper.