We focus on qualitative traits only in this study, although the approach can be extended to other types of traits through a generalized linear model. Different variants and collapsing strategies are considered within the framework of logistic regression. We also compared some recently proposed methods, namely the SSU tests [15], adaptive tests [17], ORWSS [13] and the Logistic Kernel-Machine Test [18], in our simulation study. The goal of this work is to detect any association between the trait and a given genetic region that includes both common and rare variants.

Consider an association study with *N* samples in a genetic region with *K* variants. Let *Y*_{i} denote the coded trait for the *i*th sample, 0 for controls and 1 for cases. The variants are coded under an additive genetic model: *X*_{ik} takes the value 0, 1, or 2 as the genotype score for the *k*th marker of the *i*th sample, where *i* = 1, . . . , *N*, and *k* = 1, . . . , *K*. The variants are divided into common variants and rare variants based on a chosen threshold. For example, SNPs with minor allele frequencies less than 0.01 are considered rare variants.
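As an illustration of the coding and the frequency threshold, the following is a minimal Python sketch that splits the columns of a genotype matrix into common and rare variants. The function names and the toy data are hypothetical, and the sketch assumes that the minor allele frequency is estimated from the additive 0/1/2 scores as the allele count divided by 2*N*.

```python
def minor_allele_frequency(column):
    """Minor allele frequency of one variant from additive 0/1/2 genotype scores."""
    freq = sum(column) / (2 * len(column))  # two alleles per sample
    return min(freq, 1 - freq)


def split_by_maf(genotypes, threshold=0.01):
    """Return (common, rare) lists of column indices for an N x K genotype matrix."""
    common, rare = [], []
    for k in range(len(genotypes[0])):
        maf = minor_allele_frequency([row[k] for row in genotypes])
        (rare if maf < threshold else common).append(k)
    return common, rare


# Toy data (hypothetical): 100 samples, 2 variants.
# Variant 0 has one heterozygote (MAF 0.005); variant 1 has 30 (MAF 0.15).
X = [[1 if i == 0 else 0, 1 if i < 30 else 0] for i in range(100)]
common_idx, rare_idx = split_by_maf(X)
print(common_idx, rare_idx)  # [1] [0]
```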

### Collapsing Methods and Logistic Regression

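Collapsing approaches have been previously proposed using either an indicator function or a sum (proportion) function [11, 14]. Let *S*_{i} denote the collapsed score of the *i*th sample for a genetic region. The indicator-function-based collapsing method sets *S*_{i} to 1 if the *i*th sample carries a minor allele at any rare variant in the region and to 0 otherwise, and the sum (proportion) function-based collapsing method sets *S*_{i} to the sum (or proportion) of the rare-variant genotype scores of the *i*th sample.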
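The two collapsing functions can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in the study: the function names are hypothetical, and the proportion version assumes that the sum is divided by the number of rare variants, which is one possible normalization.

```python
def collapse_indicator(genotypes, rare_idx):
    """1 if the sample carries a minor allele at any rare variant, else 0."""
    return [1 if any(row[k] > 0 for k in rare_idx) else 0 for row in genotypes]


def collapse_sum(genotypes, rare_idx, proportion=False):
    """Sum of rare-variant genotype scores per sample.

    With proportion=True the sum is divided by the number of rare variants
    (an assumed normalization, chosen here for illustration only).
    """
    scores = [sum(row[k] for k in rare_idx) for row in genotypes]
    if proportion:
        return [s / len(rare_idx) for s in scores]
    return scores


# Toy genotype matrix: 4 samples x 3 variants, coded 0/1/2.
X = [[0, 1, 2],
     [0, 1, 0],
     [2, 0, 0],
     [0, 0, 0]]
rare = [1, 2]  # hypothetical: columns 1 and 2 are treated as rare
print(collapse_indicator(X, rare))  # [1, 1, 0, 0]
print(collapse_sum(X, rare))        # [3, 1, 0, 0]
```

Note that the third sample is homozygous for the minor allele at the common variant only, so both collapsed scores are 0 for it.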

In a case-control study, it is natural to consider the logistic regression model for testing, and the collapsing methods can be implemented through the model logit Pr(*Y*_{i} = 1) = *β*_{0} + *β*_{1}*S*_{i}. The null hypothesis of no genetic effect is H_{0} : *β*_{1} = 0. In a candidate gene study, we employed the likelihood ratio test. Because the score test is computationally faster than the likelihood ratio test, we use the score test for genome-wide association studies. The score statistic is computed from the trait values and the collapsed scores under the null hypothesis, and it has an asymptotic *χ*^{2} distribution with one degree of freedom.
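For reference, the score test of H_{0} : *β*_{1} = 0 in a logistic model with an intercept and a single covariate has a standard closed form, sketched below in Python. The function name is hypothetical, and the formula shown is the textbook version (score *U* = Σ(*Y*_{i} − *Ȳ*)*S*_{i} with null variance *Ȳ*(1 − *Ȳ*)Σ(*S*_{i} − *S̄*)^{2}), which may differ in notation from the statistic used in the study.

```python
import math


def score_test(y, s):
    """Score test of beta_1 = 0 in logit Pr(Y = 1) = beta_0 + beta_1 * S.

    Returns the statistic T = U^2 / V and its chi-square(1) p-value.
    """
    n = len(y)
    y_bar = sum(y) / n
    s_bar = sum(s) / n
    # Score for beta_1 evaluated at the null fit (all fitted probabilities = y_bar).
    u = sum((yi - y_bar) * si for yi, si in zip(y, s))
    # Variance of the score under the null hypothesis.
    v = y_bar * (1 - y_bar) * sum((si - s_bar) ** 2 for si in s)
    if v == 0:
        raise ValueError("score variance is zero (constant trait or constant score)")
    t = u * u / v
    # Upper tail of chi-square with 1 df: P(Z^2 > t) = erfc(sqrt(t / 2)).
    p_value = math.erfc(math.sqrt(t / 2))
    return t, p_value


# Toy data (hypothetical): 8 samples, binary collapsed score.
y = [1, 1, 1, 0, 0, 0, 1, 0]
s = [1, 1, 0, 0, 0, 0, 1, 0]
t, p = score_test(y, s)
print(round(t, 4), round(p, 4))
```

No model fitting is needed because only the null model is evaluated, which is why the score test is cheaper than the likelihood ratio test at genome-wide scale.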

A limitation of the current collapsing approaches is that they consider only rare variants. When common variants contribute to heritable variability that is not detectable by traditional common-SNP approaches, ignoring them reduces the power of the tests.

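The Combined Multivariate and Collapsing (CMC) method [11] addresses this problem by treating the collapsed score as a common SNP and performing a Hotelling's *T*^{2} test on multiple markers. To place this method within our logistic regression framework, we consider a multivariate logistic regression model in which the common SNPs and the collapsed score of the rare variants enter jointly as covariates. The null hypothesis of no genetic effect is that all regression coefficients other than the intercept are equal to zero.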

Another collapsing method uses a data-driven weight and considers both common and rare variants. The collapsed score is a weighted sum of the genotype scores, where the weight of each variant is calculated from the data and *N*_{0} is the number of controls in the study [12]. By using a weight, the collapsed score amplifies the contribution of rare variants. The test statistic can be derived from logistic regression as before. Because the weights are data-dependent, a permutation test is employed to obtain P-values.
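A generic permutation procedure of this kind can be sketched in Python as follows. The sketch is illustrative only: the function names are hypothetical, the `statistic` callable stands in for whatever test statistic is being used, and the add-one correction in the returned P-value is an assumed convention rather than a detail taken from [12]. The key point is that any data-dependent weights must be recomputed inside `statistic` for every permuted trait vector.

```python
import random


def permutation_pvalue(y, x, statistic, n_perm=1000, seed=0):
    """Permutation P-value for a statistic where larger values are more extreme.

    `statistic(y, x)` must recompute any data-dependent weights itself, so that
    each permuted data set is treated exactly like the observed one.
    """
    rng = random.Random(seed)
    observed = statistic(y, x)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break the trait-genotype link, keep case/control counts
        if statistic(y_perm, x) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # assumed add-one correction


def mean_difference(y, s):
    """Toy statistic: absolute difference in mean score between cases and controls."""
    cases = [si for yi, si in zip(y, s) if yi == 1]
    controls = [si for yi, si in zip(y, s) if yi == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


# Toy data (hypothetical): 10 cases followed by 10 controls.
y = [1] * 10 + [0] * 10
s = [2, 1, 1, 2, 1, 0, 1, 2, 1, 1] + [0] * 9 + [1]
print(permutation_pvalue(y, s, mean_difference, n_perm=999))
```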

For a region with both common and rare variants, the above two approaches use all the genetic information. However, it is unlikely that all variants in the region contribute to the heritable variability; more likely, only some of them are causal. If many of the rare variants are non-causal, collapsing will inevitably introduce noise and reduce the power of the test.

A covering method called RareCover [21] has recently been proposed to determine a collapsing subset from all the variants in a region using a forward selection procedure. For the purpose of comparison, we also placed this strategy in our logistic regression framework. Instead of Pearson's *χ*^{2}, which was used by the original authors, we used the squared correlation coefficient *R*^{2} as the screening test statistic. Starting from a score without any rare variants, each rare variant is examined in turn, and the variant that improves the test statistic the most is added to the score. The forward selection procedure thus yields a subset that achieves the highest squared correlation between the collapsed score and the trait. The test statistic can then be derived from a logistic regression model between the trait and the collapsed score as before, and the P-value can be found by permutation. However, this method does not use genetic information from the common variants in the region, and it ignores the direction of the rare-variant effects because it relies on either the squared correlation coefficient *R*^{2} or Pearson's *χ*^{2}.

### Recently proposed multi-marker tests

We also compared some recently proposed methods, namely the SSU tests [15], adaptive tests [17], ORWSS [13] and the Logistic Kernel-Machine Test [18], in our simulation studies. We briefly review these methods here. The SSU and SSUw tests are defined as follows.

Let the score vector be *U* = (*U*_{1}, . . . , *U*_{K}), where each component *U*_{k} = Σ_{i} (*Y*_{i} − *Ȳ*)*X*_{ik}, and *Ȳ* is the sample mean of the phenotype.

*SSU* = *U*'*U* and *SSUw* = *U*'*Diag*(*I*_{f})^{-1}*U*, where *I*_{f} = *Cov*(*U*) is the expected Fisher information matrix. The asymptotic distributions of the two test statistics are scaled *χ*^{2} distributions [15].
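The two statistics can be computed as in the Python sketch below. It is a minimal illustration under stated assumptions: the function names are hypothetical, the score components are taken to be Σ_{i}(*Y*_{i} − *Ȳ*)*X*_{ik}, the diagonal of *Cov*(*U*) is estimated by *Ȳ*(1 − *Ȳ*)Σ_{i}(*X*_{ik} − *X̄*_{k})^{2}, and monomorphic variants (zero variance) are skipped in SSUw. The scaled *χ*^{2} P-values are not computed here.

```python
def score_vector(y, X):
    """Score vector U with components U_k = sum_i (Y_i - Ybar) * X_ik."""
    n = len(y)
    y_bar = sum(y) / n
    return [sum((y[i] - y_bar) * X[i][k] for i in range(n)) for k in range(len(X[0]))]


def score_variances(y, X):
    """Diagonal of Cov(U) under the null: Ybar(1 - Ybar) * sum_i (X_ik - Xbar_k)^2."""
    n = len(y)
    y_bar = sum(y) / n
    variances = []
    for k in range(len(X[0])):
        x_bar = sum(row[k] for row in X) / n
        variances.append(y_bar * (1 - y_bar) * sum((row[k] - x_bar) ** 2 for row in X))
    return variances


def ssu(y, X):
    """SSU = U'U, the sum of squared score components."""
    return sum(u * u for u in score_vector(y, X))


def ssuw(y, X):
    """SSUw = U' Diag(Cov(U))^{-1} U; zero-variance variants are skipped (assumption)."""
    pairs = zip(score_vector(y, X), score_variances(y, X))
    return sum(u * u / v for u, v in pairs if v > 0)


# Toy data (hypothetical): 4 samples, 2 variants.
y = [1, 1, 0, 0]
X = [[1, 0], [2, 1], [0, 0], [0, 1]]
print(ssu(y, X), ssuw(y, X))
```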

For the adaptive test, suppose that *U*_{m} = (*U*_{1}, . . . , *U*_{m}), where *m* < *K*, is the vector containing the first *m* components of *U*. The adaptive test statistic *aT* is the minimum over *m* of *Pval*(*T*(*U*_{m})), where *Pval*(*T*(*U*_{m})) is the p-value of the test statistic *T* computed from *U*_{m}. For the adaptive test, we used SSU and SSUw as the test statistic *T*; the resulting adaptive tests are called the aSSU and aSSUw tests. More generally, one can order the SNPs based on their single-marker test statistics and repeat the adaptive test process, resulting in the aSSU-Ord and aSSUw-Ord tests. The P-value of *aT* is calculated by a permutation procedure.
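One way to carry out such a minimum-P-value test with a single layer of permutations is sketched below in Python, using SSU as the statistic *T*. This is an illustrative sketch, not the implementation of [17]: the function names are hypothetical, the nested sets are taken as *m* = 1, . . . , *K* for simplicity, and the P-values of *T*(*U*_{m}) are estimated by ranking the observed and permuted statistics in one pooled table.

```python
import random


def nested_ssu(y, X):
    """SSU statistic T(U_m) on the first m score components, for m = 1..K."""
    n = len(y)
    y_bar = sum(y) / n
    stats, running = [], 0.0
    for k in range(len(X[0])):
        u_k = sum((y[i] - y_bar) * X[i][k] for i in range(n))
        running += u_k * u_k
        stats.append(running)
    return stats


def adaptive_pvalue(y, X, n_perm=200, seed=0):
    """Permutation P-value of aT = min over m of Pval(T(U_m))."""
    rng = random.Random(seed)
    y_perm = list(y)
    # Row 0 holds the observed statistics; rows 1..n_perm hold permuted ones.
    table = [nested_ssu(y, X)]
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        table.append(nested_ssu(y_perm, X))
    total = len(table)
    n_sets = len(table[0])

    def min_pvalue(row):
        # Pval(T(U_m)) is estimated by the rank of row[m] within the pooled table.
        return min(
            sum(other[m] >= row[m] for other in table) / total
            for m in range(n_sets)
        )

    a_t = [min_pvalue(row) for row in table]
    # P-value of aT: fraction of rows whose aT is at least as small as the observed one.
    return sum(value <= a_t[0] for value in a_t) / total


# Toy data (hypothetical): 8 samples, 3 variants.
y = [1, 1, 1, 1, 0, 0, 0, 0]
X = [[1, 0, 0], [1, 1, 0], [2, 0, 1], [1, 0, 0],
     [0, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
print(adaptive_pvalue(y, X, n_perm=99))
```

Reusing the same permutations for both the inner P-values and the null distribution of *aT* avoids a nested (double) permutation loop.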

For the ORWSS test, the score is constructed in the same way as in other weighted sum tests, but the weight is calculated as follows. The amended estimator of the odds ratio is computed by adding 0.5 to each cell of the 2 × 2 table for case-control studies. Define *γ*_{k} = log(*OR*_{k}), where *OR*_{k} is the odds ratio for the *k*th marker. The weight of the *k*th marker is then determined from *γ*_{k}, where *σ* is the standard deviation calculated from *γ*_{k}, *k* = 1, . . . , *K*, *c* is a parameter, and *γ̄* is the mean of the log odds ratios [13]. In the simulation study, because the number of variants is small, we use the logarithm of the odds ratio directly as the weight for each SNP without classification.
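The amended log odds ratio can be computed as in the following Python sketch. The function name is hypothetical, and the sketch assumes that the 2 × 2 table cross-classifies case-control status against carrier status (at least one minor allele); the table in [13] may be built differently, for example from allele counts.

```python
import math


def amended_log_odds_ratio(y, x):
    """Log odds ratio for one marker, with 0.5 added to each cell of the 2 x 2 table.

    y: trait (1 = case, 0 = control); x: genotype scores (0/1/2).
    Carrier status (x > 0) is an assumed way of forming the table.
    """
    a = sum(1 for yi, xi in zip(y, x) if yi == 1 and xi > 0)   # case carriers
    b = sum(1 for yi, xi in zip(y, x) if yi == 1 and xi == 0)  # case non-carriers
    c = sum(1 for yi, xi in zip(y, x) if yi == 0 and xi > 0)   # control carriers
    d = sum(1 for yi, xi in zip(y, x) if yi == 0 and xi == 0)  # control non-carriers
    odds_ratio = ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))
    return math.log(odds_ratio)


# Toy data (hypothetical): no control carries the variant, so the
# unamended odds ratio would be undefined (division by zero).
y = [1, 1, 1, 0, 0, 0]
x = [1, 1, 0, 0, 0, 0]
gamma = amended_log_odds_ratio(y, x)
print(round(gamma, 4))
```

The 0.5 correction keeps the estimate finite when a cell is empty, which is common for rare variants.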

The test statistic is then computed from the weighted scores, and the P-value of ORWSS is calculated by a permutation procedure.

For the Logistic Kernel-Machine Test, the test statistic is based on a logistic regression model in which the effect of the SNPs on the trait is modeled by a function *h*(*X*) specified through a kernel function. Some commonly used kernels include the linear, identity-by-state (IBS) and quadratic kernels. We consider only the linear kernel here. To test whether there is a true genetic effect, the null hypothesis is H_{0} : *h*(*X*) = 0. A score-type test statistic has been developed for this hypothesis, and it follows a scaled *χ*^{2} distribution [18].

All the tests above were applied to both common and rare variants, since our aim is a robust strategy for detecting association between complex traits and genetic regions that contain both.

### Weighted Selective Collapsing Strategy

We now propose a new collapsing strategy that uses genetic information from both common and rare variants. The strategy aims to remove the noise generated by non-causal variants and to improve power by considering both the deleterious and the protective components of a region. In brief, the strategy is as follows. We define rare variants as SNPs with minor allele frequencies less than 0.01 and all others as common variants. Starting from a null model without any variants, common SNPs are first selectively collapsed, by a forward selection procedure, into two components that serve as bases for the rare variants. One is a deleterious component with an extremely positive correlation coefficient with the trait; the other is a protective component with an extremely negative correlation coefficient. Because rare variants have high genetic effects, they are added into the collapsed set one at a time by a weighted sum function until either no variants remain or the correlation coefficient no longer improves. Repeating the forward selection procedure without the common variants as the basis generates two more components. Finally, the collapsed score is obtained from the four components according to their squared correlation coefficients with the trait. The test statistic can then be derived from a logistic regression model between the trait and the collapsed score as before, and P-values can be computed by permutation.

We now describe the procedure in detail. Assume there are *J* common variants and *K* rare variants within a predefined genomic region. The genotype vectors across all samples are indexed by *j* = 1, . . . , *J* for the common variants and by *k* = 1, . . . , *K* for the rare variants, with the two classes defined by the threshold MAF = 0.01. Let *S*_{+} denote the deleterious component, which is a vector collapsed from the subset of SNPs that achieves an extremely positive correlation with the trait. Let *S*_{-} denote the protective component, which is a vector collapsed from the subset of SNPs that achieves an extremely negative correlation with the trait.

Step 1: Forward selection on common SNPs with sum collapsing.

a) Calculate the correlation coefficient *R* with the trait for each common SNP. The common SNP with the largest correlation coefficient is added into *S*_{+}, while the common SNP with the lowest correlation coefficient is added into *S*_{-}. In each case, the candidate score is the sum of the current component vector (*S*_{+} or *S*_{-}) and the genotype vector of the *j*th common SNP, for *j* = 1, . . . , *J*.

b) Update *S*_{+} and *S*_{-} with the selected candidate scores. Let *j* take values only from the remaining common SNPs. Repeat a) until either all common variants are collapsed into the components or there is no improvement in the correlation coefficient of each component.
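The forward selection in Step 1 can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, both components start from a zero vector with a baseline correlation of 0, a SNP added to one component is removed from the pool for both, *S*_{+} is updated before *S*_{-} within an iteration, and a constant vector is given a correlation of 0.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient; returns 0.0 if either vector is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (z - mean_b) for x, z in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((z - mean_b) ** 2 for z in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def add(u, v):
    """Element-wise sum of two vectors."""
    return [x + z for x, z in zip(u, v)]


def select_components(y, columns):
    """Greedy sum collapsing of SNP columns into S+ (max R) and S- (min R)."""
    n = len(y)
    s_pos, s_neg = [0] * n, [0] * n
    r_pos, r_neg = 0.0, 0.0  # assumed baselines for the empty components
    pool = set(range(len(columns)))
    while pool:
        improved = False
        # Candidate that makes R(S+ + X_j, Y) largest.
        j = max(sorted(pool), key=lambda c: pearson(add(s_pos, columns[c]), y))
        r = pearson(add(s_pos, columns[j]), y)
        if r > r_pos:
            s_pos, r_pos, improved = add(s_pos, columns[j]), r, True
            pool.discard(j)
        if pool:
            # Candidate that makes R(S- + X_j, Y) smallest.
            j = min(sorted(pool), key=lambda c: pearson(add(s_neg, columns[c]), y))
            r = pearson(add(s_neg, columns[j]), y)
            if r < r_neg:
                s_neg, r_neg, improved = add(s_neg, columns[j]), r, True
                pool.discard(j)
        if not improved:
            break  # neither component can be improved any further
    return s_pos, s_neg, r_pos, r_neg


# Toy data (hypothetical): 6 samples, 3 common SNPs given as columns.
y = [1, 1, 1, 0, 0, 0]
snps = [[1, 1, 0, 0, 0, 0],   # enriched in cases
        [0, 0, 0, 1, 1, 0],   # enriched in controls
        [0, 0, 1, 0, 0, 0]]   # enriched in cases
s_pos, s_neg, r_pos, r_neg = select_components(y, snps)
print(s_pos, s_neg)  # [1, 1, 1, 0, 0, 0] [0, 0, 0, 1, 1, 0]
```

In the toy example the first and third SNPs are collapsed into the deleterious component and the second SNP forms the protective component.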

Step 2: Forward selection on rare SNPs with weighted sum collapsing.

a) Because rare variants have high genetic effects, a data-driven weight *w*_{k} is derived to favor the rare variants with high genetic effects in either the deleterious or the protective direction. The weight is based on an indicator of whether the *i*th sample carries a mutation at the *k*th rare variant. *p*_{k} is the empirical estimate of the probability that an individual with the mutation will have the disease. *w*_{k} is adjusted based on *p*_{k} under the constraint that the sum of the weights equals the number of rare variants.

b) Calculate the correlation coefficient *R* with the trait for each rare SNP. The rare SNP with the largest correlation coefficient is added into *S*_{+}, while the rare SNP with the lowest correlation coefficient is added into *S*_{-}. In each case, the candidate score is the sum of the current component vector (*S*_{+} or *S*_{-}) and the weighted genotype vector of the *k*th rare SNP, for *k* = 1, . . . , *K*.

c) Update *S*_{+} and *S*_{-} with the selected candidate scores. Let *k* take values only from the remaining rare SNPs. Repeat b) until either all rare variants are collapsed into the components or there is no improvement in the correlation coefficient of each component. The whole procedure generates two collapsed scores, representing the deleterious and protective components, respectively, for rare variants built on the common-variant bases.

Step 3: Construct the final collapsed score. Repeat Step 2 using rare variants only, without the bases from common variants, so that the test remains robust when common SNPs are not associated with the trait. This generates another two components, one deleterious and one protective. The final collapsed score *S*_{wSC} is then derived from the four components according to their squared correlation coefficients with the trait.

The test statistic (wSC) can be derived from a logistic regression model between the trait and the collapsed score as before. P-values can be computed by permutation.

*S*_{wSC} is constructed by comparing the potential effects of the components in different directions. As an alternative, we also propose a method (wSCd) that detects the genetic effects and is robust when the effects are in different directions. To obtain wSCd, we follow the same steps described above for deriving wSC, but the final collapsed score is

where