# Regression-based approach for testing the association between multi-region haplotype configuration and complex trait

- Yanling Hu
^{1}, - Sinnwell Jason
^{2}, - Qishan Wang
^{1}, - Yuchun Pan
^{1}Email author, - Xiangzhe Zhang
^{1}, - Hongbo Zhao
^{1}, - Changlong Li
^{3}and - Libin Sun
^{4}

**10**:56

https://doi.org/10.1186/1471-2156-10-56

© Hu et al; licensee BioMed Central Ltd. 2009

**Received: **20 January 2009

**Accepted: **17 September 2009

**Published: **17 September 2009

## Abstract

### Background

It is quite common that the genetic architecture of complex traits involves many genes and their interactions. Therefore, dealing with multiple unlinked genomic regions simultaneously is desirable.

### Results

In this paper we develop a regression-based approach to assess the interactions of haplotypes that belong to different unlinked regions, and we use score statistics to test the null hypothesis of non-genetic association. Additionally, multiple marker combinations at each unlinked region are considered. The multiple tests are settled via the *minP* approach. The *P* value of the "best" multi-region multi-marker configuration is corrected via Monte-Carlo simulations. Through simulation studies, we assess the performance of the proposed approach and demonstrate its validity and power in testing for haplotype interaction association.

### Conclusion

Our simulations showed that, for binary trait without covariates, our proposed methods prove to be equal and even more powerful than htr and hapcc which are part of the FAMHAP program. Additionally, our model can be applied to a wider variety of traits and allow adjustment for other covariates. To test the validity, our methods are applied to analyze the association between four unlinked candidate genes and pig meat quality.

## Background

Haplotypes, the linear arrangement of alleles on the same chromosome inherited as a unit, provide a natural framework for testing the association between genetic markers and complex traits more efficiently than separate marker analysis[1]. There is strong evidence that several mutations in *cis* position within a single gene can interact to create a "super allele" that has a large effect on the observed phenotype. The biological explanation for these haplotype effects is that several mutations in a gene cause several amino acid changes in the ultimate protein product, and the joint effect of these amino acid changes can have a much larger influence on the function of the protein product than any single amino acid change. This emphasizes the importance of examining candidate genes by SNP haplotyping. Some studies focus on haplotypes within a given genomic region [2–7]. Because complex traits are presumed to be the results of interaction by a set of genes which may be located in different regions, some methods aim to test gene-gene interaction, and interactions of single markers from different unlinked regions [8–11]. Specifically, Becker *et al*[12] reported a method to deal with haplotype interaction in unlinked regions for a binomial trait. They find the best haplotype combination from the unlinked regions by permutation, which is a modification of Ge *et al*[13]. However, this method could only be applied to case-control association testing, and could not include other covariates.

Generalized Linear Model (GLM) is an extension of the general linear modeling process that allows models to be fitted for several kinds of traits, such as Gaussian, Poisson, Binomial, etc., and allows various covariates. Schaid *et al*[5] introduced score statistics, which are receiving increased attention because they require only computation of the null estimates and are asymptotically equivalent to Wald and likelihood ratio statistics under both null and Pitman alternative hypotheses. Some methods that use score tests based on GLM to test haplotype-trait association have demonstrated the validity and power of this statistic[6]. However, these methods only considered one genomic region. If considering multi-region multi-marker haplotype configurations, a severe multiple-testing problem will occur. To obtain uncorrected *P*-values for a specific marker combination, we use an unnested simulation introduced by Becker and Knapp[2], which is based on the algorithm proposed by Ge *et al*[13].

We propose an alternative approach that uses score statistics based on GLM to build the statistic *T* over which some of the unlinked regions are considered and some markers are chosen at the selected regions. Since the distribution of *T* is generally unknown and is generally not comparable, we replace *T* with *P*^{
min
}which is inherited from the algorithm of Becker and Knapp[2]. This simulation method has already been validated by Manly[14] and Hoh *et al*[15], and has systematically been applied to some genetic data[2, 12, 16, 17].

## Simulation study

### Simulation schemes

We conduct a simulation study to evaluate the power and type I error of the association test and to compare our approach with others. The haplotype data are generated in a way similar to that of Roeder *et al*.[18] and Tzeng *et al*[6]. In every simulation scheme, we consider two unlinked regions. We consider that markers are in strong linkage disequilibrium within each region, but markers from different regions are in linkage equilibrium. Therefore, we separately produce two regions by using a modified Hudson's MS program[19]. This program generates data under a coalescent model in which the recombination occurs uniformly over the region. The 4 samples sizes are 50, 100, 200 and 400, respectively. The scaled recombination rate, *ρ* = *N*_{
e
}*δ*/*bp*, is set to 4 × 10^{-3} for the recombination cold spots, and 100 times greater in the hot spots, with the effective population size *N*_{
e
}is 1 × 10^{4}. The scaled mutation for the entire region, 4*N*_{
e
}*μ*/*bp*, is set to be 6 × 10^{-4}. Once the haplotypes have been generated, the first step is to restrict the disease or minor allele frequency. In this simulation, we set the allele frequency as 3 levels: 0.1, 0.3, and 0.5. We assume that the middle locus in every gene is the liability locus. Once a liability locus is chosen, a haplotype is defined as a segment of three adjacent SNPs in which the second SNP is the liability locus.

After randomly pairing haplotypes to form individual genotypes, we generate both continuous and binary trait values.

#### Continuous traits

For the Type I error test, we consider two simple models of quantitative traits simulated independently of the liability locus. Let model 1 include only an environmental effect *e*: Y= e. Let model 2 additionally incorporate a covariate *Z*: Y= *γ* × Z+ e. In the models, *e* follows a standard normal distribution with mean 0 and variance 1, and *Z* is generated from a standard normal distribution. For assessing power and the effective selection of the best combination of markers, we also consider two models of quantitative traits simulated in association with the liability locus. Let model 3 decompose the trait value into MRHC effect and environmental effect *e*: *Y* = *g* + *e*, and model 4 additionally incorporate a covariate *Z*: *Y* = *g* + *γ* × *Z* + *e*. In these models, *g* is the sum of all considered genes' effects. For the *i* th gene, *g*_{
i
}has a discrete distribution and equals
,
,
with probabilities *q*^{2}, 2*q*(1-*q*) and (1-*q*)^{2}, respectively. As in models 1 and 2, *e* follows a normal distribution with mean *ε* and variance
, and *Z* is generated from a standard normal distribution. For simplicity, we set
, *ε* = 0, and *γ* = 1. The trait values are generated using the normal penetrance function.
for the first model and
for the second model, where *m* is the number of the genes. We determine
through the heritability *h*^{2} of all liability loci, which we set at 0.4.

#### Binary traits

We generate binomial phenotypes on the basis of the above continuous traits, and consider four models where the disease prevalence is set to 0.10. If the values of the above continuous traits are more than a given threshold, we set the traits as disease cases, otherwise we set them as control. Let model 5 be the binary trait created from a continuous trait simulated as in model 1, model 6 from model 2, model 7 from model 3, and model 8 from model 4. Binary traits are simulated until an equal number of cases and controls are reached.

the 8 models in the trait producing in simulation

Model name | Trait type | Factor consider |
---|---|---|

Model 1 | Continuous traits | include only an environmental effect |

Model 2 | Continuous traits | include an environmental effect |

Model 3 | Continuous traits | include MRHC effect and environmental effect |

Model 4 | Continuous traits | include MRHC effect and environmental effect |

Model 5 | Binary Traits | Produce from model 1 above a given threshold |

Model 6 | Binary Traits | Produce from model 2 above a given threshold |

Model 7 | Binary Traits | Produce from model 3 above a given threshold |

Model 8 | Binary Traits | Produce from model 4 above a given threshold |

Under all scenarios, we compute the global *P* value with 1500 permutation replicates for each simulated data set. Empirical significance levels and power were computed as the portion of simulated data sets for which the global *P* value was less than or equal to 0.05.

## Results

### Comparison of three models htr, hapcc and HAPGLM

In order to check the validity and the accuracy of our HAPGLM approach, we first carry out simulations under the null hypothesis and compare it with hapcc and htr, which were implemented in the beta version of FAMHAP. htr performs a haplotype trend regression test proposed by Zaykin *et al*[7], and hapcc performs a *χ*^{2} test for haplotypes proposed by Becker *et al*[12, 16]. Here, we use model 5 and 7 to simulate the trait. Haplotypes and trait values are compared according to the frequency (*q* = 0.1, 0.3, 0.5) of the disease allele, and sample size (*n* = 50, 100, 200, 400).

*hapcc*. The reason is that the disease individual prevalence is 0.10, and the percent for the disease is somewhat small, making it difficult to find the significant difference. However, when the sample size is more than 100, the power is near one for the three models. There are no significant differences among hapcc, htr and our method.

Type I error of three models via 1500 simulations at *α* = 0.05

Sample Size | Minor Allele | Model Type | ||
---|---|---|---|---|

hapcc | htr | HAPGLM | ||

50 | q = 0.1 | 0.097(0.079-0.117) | 0.048(0.036-0.063) | 0.048(0.036-0.063) |

q = 0.3 | 0.032(0.022-0.045) | 0.039(0.028-0.053) | 0.044(0.032-0.059) | |

q = 0.5 | 0.029(0.020-0.041) | 0.041(0.030-0.055) | 0.047(0.042-0.071) | |

100 | q = 0.1 | 0.036(0.025-0.049) | 0.035(0.024-0.048) | 0.047(0.035-0.062) |

q = 0.3 | 0.031(0.021-0.044) | 0.072(0.057-0.090) | 0.043(0.031-0.057) | |

q = 0.5 | 0.042(0.030-0.056) | 0.047(0.035-0.062) | 0.051(0.038-0.067) | |

200 | q = 0.1 | 0.031(0.021-0.044) | 0.026(0.017-0.038) | 0.055(0.042-0.071) |

q = 0.3 | 0.051(0.038-0.067) | 0.033(0.023-0.046) | 0.047(0.035-0.062) | |

q = 0.5 | 0.040(0.029-0.054) | 0.067(0.052-0.084) | 0.037(0.026-0.051) | |

400 | q = 0.1 | 0.035(0.024-0.048) | 0.037(0.026-0.051) | 0.042(0.030-0.056) |

q = 0.3 | 0.047(0.035-0.062) | 0.028(0.019-0.040) | 0.041(0.030-0.055) | |

q = 0.5 | 0.028(0.019-0.040) | 0.041(0.030-0.055) | 0.042(0.030-0.056) |

Power of three models via 1500 simulations at *α* = 0.05

Sample Size | Disease Allele Frequency | Model Type | ||
---|---|---|---|---|

hapcc | htr | HAPGLM | ||

50 | q = 0.1 | 0.947 | 0.947 | 0.947 |

q = 0.3 | 0.866 | 0.907 | 0.968 | |

q = 0.5 | 0.666 | 0.391 | 0.596 | |

100 | q = 0.1 | 0.977 | 0.971 | 0.952 |

q = 0.3 | 0.963 | 0.927 | 0.941 | |

q = 0.5 | 0.834 | 0.728 | 0.917 | |

200 | q = 0.1 | 0.980 | 0.980 | 0.968 |

q = 0.3 | 0.981 | 0.981 | 0.968 | |

q = 0.5 | 0.961 | 0.912 | 0.954 | |

400 | q = 0.1 | 0.967 | 0.981 | 0.948 |

q = 0.3 | 0.981 | 0.983 | 0.974 | |

q = 0.5 | 0.934 | 0.925 | 0.967 |

### Three factors analysis for global test

Type I error of global test via 1500 simulations at *α* = 0.05

Sample Size | Minor Allele Frequency | Model Type | |||
---|---|---|---|---|---|

Model 1 | Model 2 | Model 5 | Model 6 | ||

50 | q = 0.1 | 0.034(0.024-0.047) | 0.031(0.021-0.044) | 0.048(0.036-0.063) | 0.039(0.028-0.053) |

q = 0.3 | 0.050(0.037-0.065) | 0.041(0.030-0.055) | 0.044(0.032-0.059) | 0.044(0.032-0.059) | |

q = 0.5 | 0.052(0.039-0.068) | 0.062(0.048-0.079) | 0.047(0.035-0.062) | 0.056(0.043-0.072) | |

100 | q = 0.1 | 0.040(0.029-0.054) | 0.032(0.022-0.045) | 0.047(0.035-0.062) | 0.037(0.026-0.051) |

q = 0.3 | 0.038(0.027-0.052) | 0.044(0.032-0.059) | 0.043(0.031-0.059) | 0.037(0.026-0.051) | |

q = 0.5 | 0.048(0.036-0.063) | 0.044(0.032-0.059) | 0.051(0.038-0.067) | 0.052(0.039-0.068) | |

200 | q = 0.1 | 0.045(0.033-0.060) | 0.041(0.030-0.055) | 0.055(0.042-0.071) | 0.031(0.021-0.044) |

q = 0.3 | 0.048(0.036-0.063) | 0.041(0.030-0.055) | 0.047(0.035-0.062) | 0.042(0.030-0.056) | |

q = 0.5 | 0.047(0.035-0.062) | 0.049(0.036-0.064) | 0.037(0.026-0.051) | 0.041(0.030-0.055) | |

400 | q = 0.1 | 0.027(0.018-0.039) | 0.040(0.029-0.054) | 0.042(0.030-0.056) | 0.038(0.027-0.052) |

q = 0.3 | 0.043(0.031-0.059) | 0.037(0.026-0.051) | 0.041(0.030-0.055) | 0.044(0.032-0.059) | |

q = 0.5 | 0.041(0.030-0.055) | 0.052(0.039-0.068) | 0.042(0.030-0.056) | 0.048(0.036-0.063) |

Power of global test via 1500 simulations at *α* = 0.05

Sample Size | Disease Allele Frequency | Model Type | |||
---|---|---|---|---|---|

Model 3 | Model 4 | Model 7 | Model 8 | ||

50 | q = 0.1 | 0.977 | 0.824 | 0.947 | 0.478 |

q = 0.3 | 0.964 | 0.920 | 0.968 | 0.891 | |

q = 0.5 | 0.947 | 0.625 | 0.596 | 0.741 | |

100 | q = 0.1 | 0.934 | 0.936 | 0.952 | 0.937 |

q = 0.3 | 0.979 | 0.957 | 0.941 | 0.889 | |

q = 0.5 | 0.965 | 0.836 | 0.917 | 0.887 | |

200 | q = 0.1 | 0.978 | 0.963 | 0.968 | 0.916 |

q = 0.3 | 0.968 | 0.972 | 0.968 | 0.967 | |

q = 0.5 | 0.967 | 0.971 | 0.954 | 0.960 | |

400 | q = 0.1 | 0.968 | 0.967 | 0.948 | 0.944 |

q = 0.3 | 0.971 | 0.947 | 0.974 | 0.920 | |

q = 0.5 | 0.948 | 0.958 | 0.967 | 0.962 |

### Recombination analysis for global test

Type I error of global test for different recombination at *α* = 0.05

Haplotype Diversity | Minor Allele Frequency | Model Type | |||
---|---|---|---|---|---|

Model 1 | Model 2 | Model 5 | Model 6 | ||

High | 0.1 | 0.046(0.034-0.061) | 0.053(0.040-0.069) | 0.045(0.033-0.060) | 0.043(0.031-0.057) |

0.3 | 0.051(0.038-0.067) | 0.043(0.031-0.057) | 0.047(0.035-0.062) | 0.038(0.027-0.052) | |

0.5 | 0.058(0.044-0.074) | 0.052(0.039-0.068) | 0.042(0.030-0.056) | 0.042(0.030-0.056) | |

Low | 0.1 | 0.047(0.035-0.062) | 0.041(0.030-0.055) | 0.042(0.030-0.056) | 0.043(0.031-0.057) |

0.3 | 0.039(0.028-0.053) | 0.044(0.032-0.059) | 0.037(0.026-0.051) | 0.047(0.035-0.062) | |

0.5 | 0.040(0.029-0.054) | 0.051(0.038-0.057) | 0.044(0.032-0.059) | 0.050(0.037-0.065) |

Power of global for different recombination level at *α* = 0.05

Haplotype Diversity | Disease Allele Frequency | Model Type | |||
---|---|---|---|---|---|

Model 3 | Model 4 | Model 7 | Model 8 | ||

high | 0.1 | 0.971 | 0.860 | 0.955 | 0.957 |

0.3 | 0.967 | 0.962 | 0.961 | 0.961 | |

0.5 | 0.948 | 0.968 | 0.953 | 0.965 | |

low | 0.1 | 0.973 | 0.977 | 0.972 | 0.972 |

0.3 | 0.963 | 0.944 | 0.951 | 0.958 | |

0.5 | 0.977 | 0.933 | 0.948 | 0.953 |

### Analysis for specific MRHC

Power for the specific MRHC at *α* = 0.05

Disease Allele Frequency | Model Type | |||
---|---|---|---|---|

Model 3 | Model 4 | Model 7 | Model 8 | |

0.1 | 0.981 | 0.856 | 0.978 | 0.946 |

0.3 | 0.943 | 0.962 | 0.916 | 0.937 |

0.5 | 0.933 | 0.954 | 0.946 | 0.952 |

### Select the best combination of markers

*T*is the largest and the

*P*value is the smallest. These are presented in table 9. The combination without these two loci presents low

*T*and high

*P*value.

The best ten combinations in two regions with six markers

Model | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|

Model 5 | 2|5 | 12|5 | 23|5 | 123|5 | 245 | 234|5 | 123|45 | 12|56 | 123|56 | 12|456 |

Model 6 | 5 | 2 | 2|5 | 3|5 | 2|6 | 23 | 12 | 56 | 2|4 | 45 |

Model 7 | 2|5 | 12|45 | 13|456 | 123|45 | 2|45 | 23|45 | 123|56 | 12|456 | 123|45 | 12|56 |

Model 8 | 12|5 | 2|5 | 23|5 | 2|45 | 12|45 | 2|56 | 123|5 | 23|56 | 123|45 | 12|56 |

### Application to a pig meat quality dataset

Meat quality is very important in the pig meat production industry. Many candidate genes have been identified that could be used to improve this trait through marker assisted selection (MAS)[20]. The Heart Fatty Acid-Binding (*H-FABP*) gene encodes a type of cytosol protein that transports fatty acids from the cell membrane to other sites where 3-acyl-glyceride and phospholipids are synthesized and fatty acids are oxidized. Gerbens [21–23] discovered *Msp* I, *Hae* II and *Hinf* I polymorphisms of the *H-FABP* gene that is related to intramuscular fat content. Melanocortin-4 Receptor (MC4R) is believed to be a link between feed intake and body weight[24].

Polymorphism of the *MC4R* gene has been reported to be associated with back fat thickness[25]. Adipocyte Determination and Differentiation factor-1 (*ADD1*) can activate or restrain some genes in fat and glucose metabolism. Research has suggested that the *ADD1* gene can be used as a candidate gene for pork quality[26]. Calpastatin (*CAST*), which is an endogenous inhibitor (Ca^{2+} dependent cysteine proteinase), plays a central role in the regulation of calpain activity in cellsand is considered to be one of the major modulators of the calpains[27, 28]. The *CAST* gene represents an excellent candidate gene for studying variation in pork quality. We aim to find association between multi-region haplotype effects from these candidate regions and meat quality.

*ADD1*, 1 in

*H-FABP*, 1 in

*MC4R*and 4 in

*CAST*. We code these polymorphic markers as (A1, A2, H1, M1, C1, C2, C3, and C4). The

*χ*

^{2}test of these polymorphic markers show that there are significant differences in 5 polymorphic markers (except for A1, A2, C4) between the five populations. We set up single locus models including sex and breed as environmental covariates, for every polymorphic marker, and use statistical software SAS macro GLM for calculations. The results show significant effects at 0.05 level for: A1, H1, C2, C3 and C4 in back fat thickness (BK); A1, C1 and C3 in meat color (MC); A1, H1, C1 and C4 in intramuscular fat content (IMF), and A1 in protein content. To apply our methods, we also incorporate two environmental covariates (sex and breed), and use back fat thickness, tenderness, drip loss, meat color, intramuscular fat content, pH 1 hour after slaughter, pH 24 hours after slaughter, and the content of protein as the dependent variable in the regression model. Table 10 illustrates the marker combination in which raw

*P*values are lower at 0.01 level, Compared to single locus model, MRHC analysis can detect more markers which are significantly associated with traits.

All marker combinations with raw *P* values less than 0.01

Trait Type | Number | Marker Combination |
---|---|---|

BK | 15 | 3|67, 1|3|568, 1|3|7, 3|58, 2|3|58, 3|568, 3|57, 1|3|5, 1|3|57, 3|567, 1|3|56, 2|3|68, 1|3|78, 2|3|78, 3|578 |

Tend | 34 | 4|68, 4|56, 578, 568, 4|568, 1|4|58, 1|4|68, 4, 3|56, 1|2|57, 4|567, 3|58, 3|4|5, 1|57, 3|4|58, 3|568, 4|67, 5|67, 4|78, 58, 5678 4|678, 4|58, 4|57, 3|4|7, 3|4|6, 3|578, 1|3|4|5, 1|4|5, 1|3|678, 3|4|578, 1|3|4|8, 1|3|56,1|3|5 |

DL | 4 | 1|3|678, 3|4|578, 567, 58 |

MC | 10 | 1|3|4|5, 3|4|678, 1|3|57, 1|568, 1|4|567, 1|4|57, 1|4|578, 1|56. 1|3|56, 1|3|67 |

IMF | 4 | 1|3|4|5, 1|4|56, 4, 568 |

pH1 | 32 | 1|3|4|5, 1|3|5, 1|3|56, 1|3|57, 1|3|6, 1|3|67, 1|3|7, 1|4|567, 1|4|57, 1|4|578, 1|4|58, 1|4|6, 1|56, 1|578, 3|4|56, 3|4|6, 3|4|678,3|4|7, 3|4|8, 3|56, 3|57, 3|568, 3|57, 3|58, 3|6, 3|67, 3|678, 3|7, 3|78, 4|56, 4|58, 4|6 |

pH24 | 5 | 1|4|8, 3|4|8, 4|6, 58, 8 |

Protein | 8 | 1|3|6, 5, 1|5, 12|3, 1|4|8, 3|58, 1|3|8, 12|3|4|5 |

The specific MRHC of 4 regions of haplotype interaction

Trait Type | The specfic haplotype configurations in 4 genes |
---|---|

BK | AG|A|A|AGGA, AG|A|G|GGGA, AG|A|G|AGGA, AG|G|G|AGAA, AG|G|G|AGAA |

Tend | AG|A|G|AGGA, AG|A|G|AGGA, AG|A|G|AGGA, GG|A|A|AAAA |

DL | AG|A|G|AGAA, AG|A|G|AGGA, AG|G|G|AGAA, GG|G|A|AAAA |

MC | AG|A|A|AGGA, AG|A|G|AAGA, AG|A|G|AGAA, AG|A|G|AGGA, AG|G|G|AGAA |

IMF | AG|A|G|AGAA, AG|A|G|AGGA, AA|G|G|AAAG |

pH1 | AG|A|G|AGAA, AG|A|G|AGGA, AG|G|G|AGAA, AG|G|G|AAAA |

pH24 | AG|A|G|AGAA, AG|A|G|AGGA, AG|G|G|AGAA, AG|A|A|AGGA |

Protein | AG|A|G|AGAA, AG|A|G|AGGA, AG|A|A|AGGA, AG|A|A|AGGA |

## Discussion

Presently many publications have proven that the genetic dissection of complex traits depends not only on the identification of genes involved in disease susceptibility but also on the elucidation of the synergistic role that genes play with other genes and with environmental factors[8–11, 31–33]. Therefore, considering unlinked genomic regions simultaneously is desirable. There are two models hapcc and htr in FAMHAP program which can compute the haplotype interaction in unlinked region. For hapcc, Becker *et al*.[12] chose the usual *χ*^{2} test statistic for contingency tables which can be applied only to case-control traits. For htr, the haplotype trend regression test proposed by Zaykin *et al*[7], chose *F* statistic and could be used for qualitative and quantitative traits, but won't allow other covariates in the model. Our proposed methods are based on score equations for GLMs which allow adjustment of covariates and can model qualitative and quantitative traits. For binary trait without covariate, the type I error and power comparison show that our model has the same power as hapcc and htr, and type I error is as expected. For a small number of markers, the run times for hapcc, htr and HAPGLM are approximately equal. Additionally, our model has obvious advantages. First, our model can be applied to analyze haplotype association across independent regions with adjusting of covariates for a wider variety of traits. Second, the score statistic (
) of the individual MRHCs can be easily computed.

Our model adopted the simulation method proposed by Becker *et al*.[2, 12, 17], which can be computationally feasible to deal with the multiple marker combination. For our program, the evaluation of a single simulated data set with 15 markers in 3 regions will take no more than 10 seconds on average on two nodes with 3.0 GHz Intel with 512 MB main memory. In general, it will be possible to simultaneously consider about 600 to 1,200 hypotheses on a standard PC. Our program is very flexible to allow selection of loci and genes for analysis. However, in regions with too many possible haplotype combinations, our program runs out of memory. We need to consider more aggressive trimming parameters or other haplotype estimation algorithms. For example, we will improve our program by using the haplotype ancestral cluster idea to cluster rare haplotypes with similar ancestral haplotypes, which was used by Tzeng *et al*.[34]. Haplotypes of the entire block can be represented by a smaller set of SNPs which are referred to as tag SNPs[35]. In order to analyze more markers, it will be helpful to select tag markers at each region and to carry out the analysis on the set of these markers. Tag SNPs selection will save run-time, and we plan our further research along this path.

In this research, we proposed markers from different regions which are proposed to be in linkage equilibrium. Markers from different regions can be in linkage disequilibrium, but the methods can allow such markers as if they were in the same region.

Since the current model assumes that the subjects are independent of each other (i.e., unrelated), it is critical to extend the current approach to account for the correlations between subjects, given their family data. Therefore, further studies are needed to address the impact and modeling strategies with regard to the assumptions in the model. This model is restricted to outcomes that can be placed in the generalized linear model framework. An extension to failure-time data could also be placed in the framework.

## Conclusion

It is quite common that the genetic architecture of complex traits involves many genes and their interactions. Therefore, dealing with multiple unlinked genomic regions simultaneously is desirable. We developed a regression-based approach which can be applied to a wider variety of traits and allow adjustment for other covariates to assess the interactions of haplotypes that belong to different unlinked regions. Multiple marker combinations at each unlinked region are also considered. In addition, HAPGLM can be downloaded for free at: ftp://public.sjtu.edu.cn/, user: ylhu0323, password: public.

## Methods

### Contribution to multi-region haplotype configurations

Consider *R* unlinked genomic regions, and *m*_{
r
}observed markers for each of *n* unrelated individuals in region *r*. Further, we suppose that markers within each region are in strong linkage disequilibrium, while markers from different regions are in linkage equilibrium. What we wish to investigate is whether some of the genomic regions are associated with the phenotype of interest via some of the markers from each region.

Let *G*^{
r
}be the multi-locus genotype of an individual at region *r*, and *h*^{
r
}be a haplotype of region *r*. If the haplotypes
and
are compatible with *G*^{
r
},
is then called a haplotype explanation of *G*^{
r
}. Obviously, for a given *G*^{
r
}, there may be several haplotype explanations.

where *G*^{
r
}is the multilocus genotype of a fixed individual at region *r*. Let
be the set of unordered haplotype explanation, which are compatible with *G*^{
r
}. let
be the estimated frequency of haplotype
, the sum in the denominator runs over all possible haplotype explanations, and the Kronecker symbol *δ* is defined as *δ*_{j, k}= 1 if *j* = *k*, and *δ*_{j, k}= 0 if *j* ≠ *k*, where *j* and *k* is the pair of haplotypes at region *r*. The "~*r*" is all the pair haplotypes in region *r* for each individual.

*G*= (

*G*

^{1},

*G*

^{2}, and,

*G*

^{ R }) be the multi-region genotype of an individual and be a possible multi-region haplotype explanations (MRHEs) that are compatible with (

*G*

^{1},

*G*

^{2}, and,

*G*

^{ R }). Let denote a multi-region haplotype configuration (MRHC). An MRHE is formed by two MRHCs from the two gametes, but there may be 2

^{R}MRHCs for a given MRHE, some of which may be the same. We construct an

*n*×

*h*matrix with rows referring to the

*n*individuals and columns referring to the different MRHCs. Cell

*x*

_{ ef }of this matrix denotes the contribution of individual

*e*to MRHC

*f*, and can be calculated according to the following equation,

### Regression model with MRHC

*y*denote an

*n*×

*1*vector of measured phenotypes of a trait,

*α*denote an

*h*×

*1*vector of the effects for the MRHCs,

*β*denote the regression parameters for the intercept and environmental factors, be the contribution matrix obtained above, and denote the design matrix corresponding to measured environmental factors. Then we have the following generalized linear model (GLM):

*Z*=

*X*

_{ e }|

*X*

_{ g }and

*γ*= (

*α*|

*β*). Then, the likelihood of trait

*y*

_{ i }for subject

*i*, given the vector

*Z*

_{ i }, can be expressed as a GLM for exponential family data[36] according to

where *a*, *b*, and *c* are known functions, and *ϕ* is the dispersion parameter. To implement the score statistics for different types of traits, we need only assume a distribution for the trait and to make the appropriate substitutions for the expected value of the trait,
, the dispersion parameter *a*(*ϕ*), the ratio *b*"(*η*)/*a*(*ϕ*) and the link function. (see table 1 of [5]).

### Score test for incorporating contribution of MRHC

*H*

_{0}:

*α*= 0. Let

*ζ*denote the vector of nuisance parameters (

*β*,

*μ*,

*ϕ*). The likelihood function for (

*α*,

*ζ*) on the basis of the data according to Tzeng

*et al*[6] is

Where *P*(*x*_{g, i}) is the contribution of individual *i* to MRHCs.

*α*is the partial derivative of the likelihood equation (5), with respect to

*α*. The resulting score statistic, denoted by

*S*

_{ α }, is the score function evaluated at the restricted maximum-likelihood estimate under the null hypothesis.

*S*

_{ α }is the statistic that we use to test MRHC effect; in appendix A, we show the following result:

where
and
are the restricted maximum-likelihood estimated under the null hypothesis, and *E*(*X*_{g, i}|*G*_{
i
}) is the contribution of individual *i* to MRHC under the observed multi-region genotypes, *G*.

*S*

_{ α }under the null hypothesis

*H*

_{0}:

*α*=

*0*. Under

*H*

_{0},

*S*

_{ α }is asymptotically distributed as multivariate normal[36]. We consider the generalized score test, which would ensure the asymptotic null

*χ*

^{2}distribution even under model misspecification[5]. Define

*θ*= (

*α*,

*ζ*) and let

*V*

_{ α }denote the variance of

*S*

_{ α }, the equation can be expressed according to Louis[37] and Tzeng

*et al*[6] as:

In appendix B, we show the above result.

The score statistic is distributed asymptotically as *χ*^{2} with degrees of freedom equal to the rank of *V*_{
α
}.

Schaid *et al*[5] proved that the score function for *α* and the score function for haplotype probabilities are independent under the null hypothesis, so that the covariance between the two score functions is zero. Since the contribution to each MRHC is estimated from the haplotype frequency that is used to calculate the score statistic *S*_{
α
}, the variance of the score statistic is not penalized by the use of estimated haplotype frequencies.

where *z*_{
k
}follows *χ*^{2}(1) under the null hypothesis *H*_{0}: *α* = 0.

The *P* value
is assessed via simulation. In each replicate of this simulation, a sample is constructed in which the sample trait and environmental covariate of each individual are randomly permuted at the same time, and the score test statistic is computed again. Let
denote the value of the test statistic obtained for the *i* th replicate. Then
is the fraction of permutation replicates resulting in a test statistic greater than or equal to the test statistic of the real data, i.e.,
, with *t* denoting the number of permutation replicates and |
| denoting the number of elements of
.

### Testing more than one hypothesis

*m*markers in several genes, there would be 2

^{m}-1 marker combinations. To test 2

^{m}-1 combinations with associated raw

*P*values, and declare the global

*P*value the significance level for our analysis would lead to another multiple-testing problem. In order to avoid nested simulation, we use the method which Becker and Knapp[2] adapted from Ge

*et al*[13]. The basic idea is that, to test B = 2

^{m}-1 marker combinations, global

*P*is estimated by the proportion of permutation samples with min

_{ B }P

_{ t }smaller than that in the observed data, where

*t*is the simulation time. For each marker combination

*B*∈

*B*and for each permutation replicate

*i*= 1,...,

*t*, the raw

*P*value of the

*i*th permutation replicate is calculated as

*i*> 0, is the minimum of the uncorrected

*P*values over all MRHC in the

*i*th permutation replicate. So the

*P*value for the global hypothesis

*H*

_{0}is calculated as:

This permutation method is explained in more detail in[2].

## Appendix A

*S*

_{ α }(

*Y*,

*G*,

*X*

_{ e },

*α*,

*ζ*) denote the score function of the data (

*Y*,

*G*,

*X*

_{ e }) for

*α*. As set forth by[37],

*S*

_{ α }(

*Y*,

*G*,

*X*

_{ e },

*α*,

*ζ*) is the expectation of the complete-data score function given the observed data--that is,

## Appendix B

## Declarations

### Acknowledgements

In this research, we had much help. During writing and running the program, Ning Xu, Haitao Wang, Bofei Xiao, etc. gave much help. Becker, Jing Li, Hongyu Zhao, etc. gave us many helpful suggestions in our study. This work is supported by the National Natural Science Foundation of China (grant no.30671492), the National High Technology Research and Development Program of China (863 project) (grant no. 2006AA10Z1E3, 2008AA101002) and the National 973 Key Basic Research Program (grant no, 2006CB102102, 2004CB117500).

## Authors’ Affiliations

## References

- Dawson E, Abecasis GR, Bumpstead S, Chen Y, Hunt S, Beare DM, Pabial J, Dibling T, Tinsley E, Kirby S: A first-generation linkage disequilibrium map of human chromosome 22. Nature. 2002, 418 (6897): 544-548. 10.1038/nature00864.View ArticlePubMedGoogle Scholar
- Becker T, Knapp M: A Powerful Strategy to Account for Multiple Testing in the Context of Haplotype Analysis. Am J Hum Genet. 2004, 75 (4): 561-570. 10.1086/424390.PubMed CentralView ArticlePubMedGoogle Scholar
- Bell JT, Wallace C, Dobson R, Wiltshire S, Mein C, Pembroke J, Brown M, Clayton D, Samani N, Dominiczak A: Two-dimensional genome-scan identifies novel epistatic loci for essential hypertension. Hum Mol Genet. 2006, 15 (8): 1365-1374. 10.1093/hmg/ddl058.View ArticlePubMedGoogle Scholar
- Carlborg O, Haley CS: Epistasis: too often neglected in complex trait studies. Nat Rev Genet. 2004, 5 (8): 618-625. 10.1038/nrg1407.View ArticlePubMedGoogle Scholar
- Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA: Score Tests for Association between Traits and Haplotypes when Linkage Phase Is Ambiguous. Am J Hum Genet. 2002, 70 (2): 425-434. 10.1086/338688.PubMed CentralView ArticlePubMedGoogle Scholar
- Tzeng JY, Wang CH, Kao JT, Hsiao CK: Regression-Based Association Analysis with Clustered Haplotypes through Use of Genotypes. Am J Hum Genet. 2006, 78 (2): 231-242. 10.1086/500025.PubMed CentralView ArticlePubMedGoogle Scholar
- Zaykin DV, Westfall PH, Young SS, Karnoub MA, Wagner MJ, Ehm MG: Testing Association of Statistically Inferred Haplotypes with Discrete and Continuous Traits in Samples of Unrelated Individuals. Hum Hered. 2002, 53 (2): 79-91. 10.1159/000057986.View ArticlePubMedGoogle Scholar
- Cordell HJ, Barratt BJ, Clayton DG: Case/pseudocontrol analysis in genetic association studies: a unified framework for detection of genotype and haplotype associations, gene-gene and gene-environment interactions, and parent-of-origin effects. Genet Epidemiol. 2004, 26 (3): 167-185. 10.1002/gepi.10307.View ArticlePubMedGoogle Scholar
- Devlin B, Roeder K, Wasserman L: Analysis of multilocus models of association. Genet Epidemiol. 2003, 25 (1): 36-47. 10.1002/gepi.10237.View ArticlePubMedGoogle Scholar
- Marchini J, Donnelly P, Cardon LR: Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet. 2005, 37 (4): 413-417. 10.1038/ng1537.View ArticlePubMedGoogle Scholar
- Schaid DJ, McDonnell SK, Hebbring SJ, Cunningham JM, Thibodeau SN: Nonparametric Tests of Association of Multiple Genes with Human Disease. Am J Hum Genet. 2005, 76 (5): 780-793. 10.1086/429838.PubMed CentralView ArticlePubMedGoogle Scholar
- Becker T, Schumacher J, Cichon S, Baur MP, Knapp M: Haplotype Interaction Analysis of Unlinked Regions. Genet Epidemiol. 2005, 29 (4): 313-10.1002/gepi.20096.View ArticlePubMedGoogle Scholar
- Ge Y, Dudoit S, Speed TP: Resampling-based multiple testing for microarray data analysis. Test. 2003, 12 (1): 1-77. 10.1007/BF02595811.View ArticleGoogle Scholar
- Manly BFJ: Randomization, Bootstrap And Monte Carlo Methods in Biology. 2007, New York: Chapman & Hall/CRCGoogle Scholar
- Hoh J, Wille A, Ott J: Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res. 2001, 11 (12): 2115-2119. 10.1101/gr.204001.PubMed CentralView ArticlePubMedGoogle Scholar
- Becker T, Cichon S, Jonson E, Knapp M: Multiple Testing in the Context of Haplotype Analysis Revisited: Application to Case-Control Data. Ann Hum Genet. 2005, 69 (6): 747-756. 10.1111/j.1529-8817.2005.00198.x.View ArticlePubMedGoogle Scholar
- Becker T, Herold C: Joint analysis of tightly linked SNPs in screening step of genome-wide association studies leads to increased power. Eur J Hum Genet. 2009, 17 (8): 1043-1049. 10.1038/ejhg.2009.7.PubMed CentralView ArticlePubMedGoogle Scholar
- Roeder K, Bacanu SA, Sonpar V, Zhang X, Devlin B: Analysis of single-locus tests to detect gene/disease associations. Genet Epidemiol. 2005, 28 (3): 207-219. 10.1002/gepi.20050.View ArticlePubMedGoogle Scholar
- Hudson RR: Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002, 18 (2): 337-338. 10.1093/bioinformatics/18.2.337.View ArticlePubMedGoogle Scholar
- Dekkers JCM: Commercial application of marker-and gene-assisted selection in livestock: Strategies and lessons. J Anim Sci. 2004, 82: E313-328.PubMedGoogle Scholar
- Gerbens F: Characterization, chromosomal localization, and genetic variation of the porcine heart fatty acid-binding protein gene. Mamm Genome. 1997, 8 (5): 328-332. 10.1007/s003359900433.View ArticlePubMedGoogle Scholar
- Gerbens F, de Koning DJ, Harders FL, Meuwissen TH, Janss LL, Groenen MA, Veerkamp JH, Van Arendonk JA, te Pas MF: The effect of adipocyte and heart fatty acid-binding protein genes on intramuscular fat and backfat content in Meishan crossbred pigs. J Anim Sci. 2000, 78 (3): 552-559.PubMedGoogle Scholar
- Gerbens F, van Erp AJ, Harders FL, Verburg FJ, Meuwissen TH, Veerkamp JH, te Pas MF: Effect of genetic variants of the heart fatty acid-binding protein gene on intramuscular fat and performance traits in pigs. Am Soc Animal Sci. 1999, 77: 846-852.Google Scholar
- Seeley RJ, Yagaloff KA, Fisher SL, Burn P, Thiele TE, van Dijk G, Baskin DG, Schwartz MW: Melanocortin receptors in leptin effects. Nature. 1997, 390 (6658): 349-10.1038/37016.View ArticlePubMedGoogle Scholar
- Kim KS, Larsen N, Short T, Plastow G, Rothschild MF: A missense variant of the porcine melanocortin-4 receptor (MC4R) gene is associated with fatness, growth, and feed intake traits. Mamm Genome. 2000, 11 (2): 131-135. 10.1007/s003350010025.View ArticlePubMedGoogle Scholar
- Foretz M, Guichard P, Ferre P, Foufelle F: SREBP-1c is a major mediator of insulin action on the hepatic expression of gluckinase and lipogenesis related genes. Proc Natl Acad Sci USA. 1999, 96: 12737-12742. 10.1073/pnas.96.22.12737.PubMed CentralView ArticlePubMedGoogle Scholar
- Forsberg NE, Ilian MA, Ali-Bar A, Cheeke PR, Wehr NB: Effects of cimaterol on rabbit growth and myofibrillar protein degradation and on calcium-dependent proteinase and calpastatin activities in skeletal muscle. J Anim Sci. 1986, 67 (12): 3313-3321.Google Scholar
- Murachi T, Tanaka K, Hatanaka M, Murakami T: Intracellular Ca2+-dependent protease (calpain) and its high-molecular-weight endogenous inhibitor (calpastatin). Adv Enzyme Regul. 1980, 19: 407-424. 10.1016/0065-2571(81)90026-1.View ArticlePubMedGoogle Scholar
- Li CL, Pan YC, Me H: Polymorphism of the H-FABP, MC4R and ADD1 genes in the Meishan and four other pig populations in China. S Afr J Anim Sci. 2006, 36 (1): 1-View ArticleGoogle Scholar
- Wang QS, Pan YC, Sun LB, Meng H: Polymorphisms of the CAST gene in the Meishan and five other pig populations in China: short communication. S Afr J Anim Sci. 2007, 37 (1): 27-30.Google Scholar
- Lake SL, Lyon H, Tantisira K, Silverman EK, Weiss ST, Laird NM, Schaid DJ: Estimation and Tests of Haplotype-Environment Interaction when Linkage Phase Is Ambiguous. Hum Hered. 2003, 55 (1): 56-65. 10.1159/000071811.View ArticlePubMedGoogle Scholar
- Lobach I, Carroll RJ, Spinka C, Gail MH, Chatterjee N: Haplotype-based regression analysis and inference of case-control studies with unphased genotypes and measurement errors in environmental exposures. Biometrics. 2008, 64 (3): 673-684. 10.1111/j.1541-0420.2007.00930.x.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhou W, Thurston SW, Liu G, Xu LL, Miller DP, Wain JC, Lynch TJ, Su L, Christiani DC: The Interaction between Microsomal Epoxide Hydrolase Polymorphisms and Cumulative Cigarette Smoking in Different Histological Subtypes of Lung Cancer 1. Cancer Epidemiol Biomarkers Prev. 2001, 10 (5): 461-466.PubMedGoogle Scholar
- Tzeng JY: Evolutionary-based grouping of haplotypes in association analysis. Genet Epidemiol. 2005, 28 (3): 220-231. 10.1002/gepi.20063.View ArticlePubMedGoogle Scholar
- Patil N, Berno AJ, Hinds DA: Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science. 2001, 294 (5547): 1719-1723. 10.1126/science.1065573.View ArticlePubMedGoogle Scholar
- McCullagh P, Nelder JA: Generalized Linear Models. 1989, London: Chapman & Hall/CRCView ArticleGoogle Scholar
- Louis TA: Finding the observed information matrix when using the EM algorithm. J Roy Stat Soc B Stat Meth. 1982, 44 (2): 226-233.Google Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.