A Bayesian genome screening of maximum number of drinks as an alcoholism phenotype with the new Haseman-Elston method

Common human disorders, such as alcoholism, may be the result of interactions of many genes as well as environmental risk factors. Therefore, it is important to incorporate gene × gene and gene × environment interactions in complex disease gene mapping. In this study, we applied a robust Bayesian genome screening method that can incorporate interaction effects to map genes underlying alcoholism through its application to the data of the Collaborative Studies on Genetics of Alcoholism provided by Genetic Analysis Workshop 14. Our Bayesian genome screening method uses the regression-based stochastic variable selection, coupled with the new Haseman-Elston method to identify markers linked to phenotypes of interest. Compared to traditional linkage methods based on single-gene disease models, our method allows for multilocus disease models for simultaneous screening including both main and interaction (epistatic) effects. It is conceptually simple and computationally efficient through the use of Gibbs sampler. We conducted genome-wide analysis and comparison between scans based on microsatellites and single-nucleotide polymorphisms. A total of 328 microsatellites and 11,560 single-nucleotide polymorphisms (by Affymetrix) on 22 autosomal chromosomes and sex chromosome were used.


Background
Alcohol dependence is a complex disorder that is influenced by many genetic and environmental factors. Identifying genes associated with alcohol dependence is critical to understand its etiology and to develop efficient methods for prevention and treatment. However, this effort has been hampered by the complexity underlying alcohol dependence: rather than there being one or a few major genes affecting alcohol dependence, it is likely that multiple genes interact with each other, together with environ-mental factors, to affect susceptibility to alcohol dependence. In this paper, we describe analyses of the Collaborative Study on the Genetics of Alcoholism (COGA) data (Problem 1), using self-reported "maximum number of drinks consumed in a 24-hour period" (denoted by M) as a quantitative trait, to map genes underlying alcohol dependence. The measure M is closely related to alcoholism diagnosis and provides a quantitative measure for alcohol dependence. For genome screens for this trait, we use the modified Haseman-Elston regres-sion method [1] along with the Bayesian variable selection methods developed by Oh [2] to locate susceptibility genes for alcohol dependence and assess their epistatic effects.
The Haseman-Elston method and its derivatives allow one to apply linear regression methods in linkage analysis. For each sibling pair, these methods use the number of alleles identical by descent (IBD) at each marker as the explanatory variable and a statistic measuring similarity of the quantitative traits in the sibling pair, squared difference, or cross-product, as the response variable.
In practice, the number of markers and their possible epistatic effects are often larger than the number of observations (patients or sib-pairs), where the design model is referred as being "supersaturated". As we often have hundreds of markers to consider, we must deal with the problem of multiple testing in this context. Besides, if one would like to take epistasis into account, the number of tests can easily exceed tens of thousands. Performing hypothesis tests for linkage for all of these possibilities without appropriate adjustment of multiple comparisons can lead to the identification of spurious genetic effects or the masking of real effects. The supersaturated nature of the design model also makes the conventional best subset model selection methods [3] practically infeasible. To overcome this infeasibility, efficient and robust Bayesian variable selection methods for multiple regressions were proposed by Oh [2] utilize the stochastic search variable selection (SSVS) methodology developed by George and McCulloch [4] in gene mapping. SSVS uses Gibbs sampling to sample from the posterior distribution of the model space and obtain the "best" subsets from the sampled posteriors. Therefore, it is capable of considering a large number of candidate variables without evaluating all possible models. Oh et al. [5] applied this approach to genetic linkage studies and successfully estimated main and epistatic effects. George and McCulloch's method [4], yet, is known to be sensitive to the choice of the priors. Oh [2] modified and extended SSVS, focusing only on statistics robust to the choice of the prior. These methods were shown to be computationally more efficient than the original SSVS [4] in handling many hundreds of thousands epistatic effects and gene × gender interactions through the simulation study by Oh [2].
In this study, we apply the method developed by Oh [2] to the COGA data of Genetic Analysis Workshop 14 (GAW14) to screen disease susceptibility genes related to the quantitative phenotype, "maximum number of drinks consumed in a 24-hour period", and compare the results from single-nucleotide polymorphisms (SNPs) with those from microsatellites. Because SNPs offer denser and more automated genotyping than microsatellites, we expect that SNP-based linkage analysis may have better performance in the mapping of disease loci.

Haseman-Elston method
The original Haseman-Elston method [6] is a general model-free method for testing linkage between candidate markers and quantitative trait loci on a sample of sibpairs. This method involves the regression of the squared trait difference D 2 = (X 1 -X 2 ) 2 ) in pairs of siblings on the number of alleles shared IBD between each sib pair at a given marker. Although the original Haseman-Elston method is simple, robust, and computational inexpensive, it may ignore information contained in the observed bivariate data. In fact, the squared mean corrected trait sum of sib pairs (S 2 = (X 1 + X 2 -2µ) 2 ) may provide additional information on the genetic effect [7]. This observation has led to a number of methods to modify the original Haseman-Elston method by combining both D 2 and S 2 in linkage analysis to improve statistical power. One of the simplest methods was proposed by Elston et al. [1], which uses CP = (S 2 -D 2 )/4 = (X 1 -µ)(X 2 -µ) as the response variable in the regression. It essentially averages both regression coefficient estimates (of D 2 and S 2 ) with equal weights. An additional advantage of using CP as the response variable is that it may be more normally distributed than D 2 and S 2 .

Bayesian genome screening
Assume that we observe m markers along the genome. Among these m markers, some may display some evidence of tight linkage to genes with large effects and others only with weak effects. They may exhibit main effects and/or epistatic effects. In our regression setup, the CPs are treated as the responses and the IBD status at each marker X j is the candidate explanatory variable. Then the observed CP value of sib-pair i can be described by the following linear model where n is the number of sib pairs and the errors ε~N(0, The choice of the values, (p 00 ,p 01 ,p 10 ,p 11 ) represents the belief of the relationship between factors (main effects), and (epistatic effects). If we believe that an epistatic effect without main effects is quite unlikely, we can simply set the prior as (p 00 ,p 01 ,p 10 ,p 11 ) = (0,0,0,p).
With an appropriate prior distribution on σ 2 , one can obtain the posterior distribution of γ using Gibbs sampling. Then, by examining the posterior distribution of γ, one can identify the optimal model with the highest posterior. However, since our interest is to find the markers having evidence for linkage, we proceeded to obtain the marginal posterior probabilities of different factors and rank their relative frequencies. That is, the marginal posteriors of all main effects and epistatic effects are obtained and used to rank these effects. We view the ranking as a measure of relative importance of all the factors, i.e., the degree of evidence for these factors linked to the disease genes. Our previous studies showed that the rankings of these factors are very robust [2], especially for those with the highest marginal probabilities, which are the focus of our study. Probabilities much less than 0.5 for both main and epistatic effects represent the belief that relatively few terms are active. This assumption is quite reasonable for our applications of gene screening, because there are only a few markers linked to the disease genes out of hundreds or thousands of markers. Even though the choice of p may be arbitrary, the results are robust to its setting. We should note that the priors for p are chosen partly to facilitate the computing.
In microsatellites and SNPs, there are 328 and 7,826 markers considered, respectively. Therefore, to consider epistatic effects, we needed to include 53,957 factors for microsatellites and about 30 millions factors for SNPs in the models. We set the prior as p = 0.05 for microsatellies and p = 0.005 for SNPs. There are about 1,499 sib pairs in the overall sample. We use 1,433 sib pairs for microsatellites analysis and 1,322 sib pairs for SNPs analysis in our study. The number of alleles shared IBD is obtained using GENEHUNTER (v.2.1) for microsatellites and MERLIN for SNPs for each sib pair at each marker. Because it is impractical to track the complete posterior of γ, only the marginal posterior of each marker is obtained. In our analysis, we use (p 00 ,p 01 ,p 10 ,p 11 ) = (0,p,p,p) to give all the factors an equal opportunity to be included in the final model as long as one of main effects is in the model.

Results
In the full sample of cases, the quantitative trait M ranges from 0 to 160. Five individuals are in the highest threshold class (M > 128 drinks). The highest reported M is 160. The use of the log transformation minimizes their impact on the regression analysis, which can be inflated by selfreport [9]. As reported in Saccone [9], there is a close relationship between diagnosis and M.
In each analysis, the Markov chain Monte Carlo (MCMC) sampler was run for 100,000 cycles after discarding the first 2,000 cycles for the burn-in period. Because MCMC samplers arise from recursive draws, they produce correlated samplers from the posteriors. Therefore, the chains are thinned (one iteration in every 10 cycles is saved) to reduce serial correlation in the stored samples. The total number of samples kept in the post-Bayesian analysis is 10,000. It takes ~2 hours for microsatellites and ~6 hours for SNPs to generate each sample with JAVA programs on a Linux cluster using 2.4-GHz Intel processors. Table 1 displays the rankings of the marginal posteriors for the main effect and epistatic effect screening obtained from microsatellites and SNPs.  quencies of the marginal posteriors. Figure 1 shows the comparisons of SNPs vs. microsatellites in the regions having evidence of linkage, and they have similar patterns. For gene × gene interaction, microsatellites are able to locate epistatic effects between chromosomes 8 and 15, and between chromosomes 10 and 17, whereas SNPs locate epistatic effects only between sex and markers on chromosome 23 (sex chromosome). However, the marginal posterior probabilities are not strong enough to support the evidence of epistatic effects.

Discussion
In this study, we have compared the genome-wide linkage analyses based on microsatellites and SNPs. Our methods located the main effects of markers both from microsatellites and SNPs and produced similar patterns between them. However, the results for epistatic effect screening are less consistent and revealing. This might be purely because these epistatic effects are weak in nature and further research in this area is warranted.

Conclusion
Bayesian genome screening methods provide a powerful and efficient tool in identifying potential markers and their epistatic effects. They are very effective because they are able to conduct searches over the entire model space; while the frequentist's best subset model selection proce-dure is constrained by computing power required to examine all candidate models. In addition, Bayesian genome screening methods can work on problems with many more candidate variables, which is essential to consider when epistatic effects are studied. When one tries to locate the epistatic effects, the number of covariates (factors) easily far outnumbers the sample size. Most traditional linkage methods do not work under this condition because they often assume a single-gene model and test effects one at a time. By using the prior structures that reflect the relationship among the candidate variables, our general approach can accommodate a large number of candidate markers as well as their epistatic effects by evaluating all factors simultaneously. We were able to locate markers on chromosome 4 that show the strong evidence of linkage with alcoholism related to quantitative phenotype, "maximum number of drinks consumed in a 24hour period", both from microsatellite and SNP scans and weak evidence for epistatic effects.

Authors' contributions
CO participated in the design of the study, performed the analysis, and drafted the manuscript. SW helped to obtain IBD values for linkage analysis. SW, NL, LC, and HZ participated in the design and the discussion of the study, and the preparation of the manuscript. All authors read and approved the final manuscript.
Comparisons between microsatellites and SNPs on chromo-some 4 Figure 1 Comparisons between microsatellites and SNPs on chromosome 4. Both results from microsatellites and SNPs show similar patterns for markers having the evidence being linked to the disease genes.