Properties of permutationbased gene tests and controlling type 1 error using a summary statistic based gene test
 David M Swanson^{1}Email author,
 Deborah Blacker^{2},
 Taofik AlChawa^{3},
 Kerstin U Ludwig^{3, 4},
 Elisabeth Mangold^{3} and
 Christoph Lange^{1, 5}
DOI: 10.1186/1471215614108
© Swanson et al.; licensee BioMed Central Ltd. 2013
Received: 11 June 2013
Accepted: 18 October 2013
Published: 7 November 2013
Abstract
Background
The advent of genomewide association studies has led to many novel diseaseSNP associations, opening the door to focused study on their biological underpinnings. Because of the importance of analyzing these associations, numerous statistical methods have been devoted to them. However, fewer methods have attempted to associate entire genes or genomic regions with outcomes, which is potentially more useful knowledge from a biological perspective and those methods currently implemented are often permutationbased.
Results
One property of some permutationbased tests is that their power varies as a function of whether significant markers are in regions of linkage disequilibrium (LD) or not, which we show from a theoretical perspective. We therefore develop two methods for quantifying the degree of association between a genomic region and outcome, both of whose power does not vary as a function of LD structure. One method uses dimension reduction to “filter” redundant information when significant LD exists in the region, while the other, called the summarystatistic test, controls for LD by scaling marker Zstatistics using knowledge of the correlation matrix of markers. An advantage of this latter test is that it does not require the original data, but only their Zstatistics from univariate regressions and an estimate of the correlation structure of markers, and we show how to modify the test to protect the type 1 error rate when the correlation structure of markers is misspecified. We apply these methods to sequence data of oral cleft and compare our results to previously proposed gene tests, in particular permutationbased ones. We evaluate the versatility of the modification of the summarystatistic test since the specification of correlation structure between markers can be inaccurate.
Conclusion
We find a significant association in the sequence data between the 8q24 region and oral cleft using our dimension reduction approach and a borderline significant association using the summarystatistic based approach. We also implement the summarystatistic test using Zstatistics from an alreadypublished GWAS of Chronic Obstructive Pulmonary Disorder (COPD) and correlation structure obtained from HapMap. We experiment with the modification of this test because the correlation structure is assumed imperfectly known.
Keywords
Dimension reduction Eigenvector Genebased testing Permutation testsBackground
The focus in genetic association studies has been on uncovering loci that are risk factors for an outcome, be it binary or continuous, or markers in linkage disequilibrium (LD) with those causal loci. Increasingly, however, genebased tests are coming to the forefront, especially as sequencing technologies mature and grow cheaper [1–3]. Genebased tests are useful to provide insight into whether a region of the genome has a significant association with some outcome and for intergene significance comparisons, despite differences in the size of genes [4, 5]. Development of such tests is difficult, however, as markers are usually correlated with one another and have highly variable minor allele frequencies [6]. As a result, tests have often been born more out of practicality or computational ease. Some genebased tests take the smallest pvalue over all the markers in the region. Others, such as that implemented in PLINK, take a more sophisticated approach, converting pvalues of markers to ${\chi}_{1}^{2}$ test statistics, averaging those tests statistics, then comparing it to a null distribution generated from permutations of the outcome under the null [1, 7]. Liu et al. (2010) use a similar, though more efficient method, in which they again convert marker pvalues to ${\chi}_{1}^{2}$ test statistics, take the sum of those test statistics, then generate a null distribution by sampling from sums of correlated ${\chi}_{1}^{2}$ random variables. Both approaches are intuitive and valuable ways to assess gene significance, though in both cases the power for detection of a gene becomes not only a function of the effect size of the individual markers, but the degree to which markers are in LD with one another. For example, assuming only one marker in the region has a truly nonzero effect size, power for detecting that effect will be higher if the marker is in high LD with other markers than if it is independent of them. Moskvina et al. (2012) independently made this observation, having noticed that the significance of regions they tested changed according to how much they pruned markers in high LD with one another [8].
One way to think about why this phenomenon occurs is that, rather than transforming the test statistic so that markers highly correlated with one another “mean less” because they do not contribute independent information, they transform the null distribution for certain markers under the null to “mean more.” As a result, the type 1 error is maintained, but power varies as a function of the correlation between the marker and surrounding markers. Intuitively, this is not a desirable property because it will lead to a systematic underdetection of those loci that happen to be independent of proximal markers even though they are inherently no less important in predicting the outcome. This issue becomes particularly problematic for sequence data since there would generally be even more correlation. However, the issue is a zerosum tradeoff; what results in less power for detection of single nucleotide polymorphisms (SNPs) in low LD translates to more power for detection of SNPs in high LD. Though, if there is an underlying common function or characteristic of those genomic regions whose significant SNPs are not in high LD, perhaps due to when they first occurred in evolutionary history, such regions will likely be missed in association analyses so that potentially key regions will not be studied in greater depth. As a result of this shortcoming, which may be more or less important depending on the specific LD structure of the genomic region under study, we propose two methods, one of which transforms summary Zstatistics from univariate regressions of markers so that it follows a standard parametric distribution under the null hypothesis and power does not vary with the LD structure, and the second of which uses an Eigen decomposition of the information matrix to find the “effective” amount of information in the region and increases power by performing a more parsimonious test. If the information matrix is evaluated under the null, this latter test is essentially a dimensionreduced score test analogue to a method described in [2, 3], which finds the principal components of the data matrix. Specifically, for the first approach we propose, we find Zstatistics associated with each marker in our region and the correlation matrix of the markers and perform a χ^{2} test, an approach similar to that proposed by Yang et al. (2012) [9]. In case the correlation matrix is imperfectly known, we propose a modification of this test that adjusts the correlation structure to protect the type 1 error. In the latter approach, we calculate the eigenvectors associated with the information matrix to obtain a most powerful linear combination of the scores, on which we again perform a χ^{2} test after having normalized by the variance of the loadings. These two approaches are proposed for different situations: the dimension reduction approach for when there is correlation between markers and the analyst has access to the original data, and the summarystatistic test for when the analyst only has access to Zstatistics from univariate regressions of the outcome on the marker along with an estimate of the correlation matrix of the markers in the population from which they come. Moskvina et al. (2012) also propose solutions, one of which is based on Hotelling’s T^{2} test, while another is based on multivariate logistic regression, though concludes that both perform similarly. We compare these various approaches under different structures of LD and effect size. We apply our methods to a casecontrol sequence data set of oral cleft and an alreadypublished GWAS study of Chronic Obstructive Pulmonary Disorder (COPD) [10].
Methods
Description of permutationbased gene tests
First we show how power differs for permutationbased gene tests as a function of linkage disequilibrium from a theoretical perspective. When we refer to permutationbased gene tests, we mean genebased tests in which the sum of the χ^{2} statistics for markers is taken and then an empirical pvalue is calculated by permuting casecontrol status to generate a null distribution. By Imhof (1961), in connection with Liu et al. (2010), we know that the null distribution of the permutationbased test is $\sum _{i=1}^{q}{\lambda}_{i}{\chi}_{1}^{2}$, where ${\chi}_{1}^{2}$ is a chisquared random variable on 1 d.f., λ_{ i } is the i^{ t h } eigenvalue of Σ, the q×q correlation matrix of the SNPs comprising the gene to be tested. Under the alternative, the distribution is approximately $\sum _{i=1}^{q}{\lambda}_{i}{\chi}_{1}^{2}\left({\delta}_{i}^{2}\right)$, where the noncentrality parameter δ_{ i } is calculated ${\delta}_{i}={v}_{i}^{t}\xb7\mu /\sqrt{{\lambda}_{i}}$, where v_{ i } is the eigenvector of Σ corresponding to λ_{ i }, and μ is the qdimensional mean of the multivariate normal distribution of Zstatistics calculated for univariate regressions of each SNP. μ is a function of the power for detection of each SNP in the gene. The distribution under the alternative is approximately $\sum _{i=1}^{q}{\lambda}_{i}{\chi}_{1}^{2}\left({\delta}_{i}^{2}\right)$, rather than exactly, because the correlation matrix of marker Zstatistics coming from univariate regressions diverges from the correlation structure of the covariates when under the alternative. However, so long as there is not significant variation between observations in the true probability of being a case, this divergence will not be relevant. Since the relative risk of disease conferred by most minor alleles is small, it is likely that the approximation is valid in most studies.
As described in the Additional file 1 section of the manuscript, the authors have posted an R script online that allows one to see power variation for permutationbased gene tests as a function of the correlation structure, power to detect the causal marker (in a univariate regression), and its location relative to the other SNPs comprising the gene.
Description of summary statistic based test
We first describe a simple solution to the problem of how LD structure can affect the power to detect genomic regions in which there are significant SNPs. Our solution is based on the Zstatistics associated with each marker and the correlation matrix of the SNPs. Since we propose this test as one that can be used without a full data set, we propose a modification of it in case the true correlation structure is not perfectly known or it is believed that study participants are not reflective of the population from which the correlation of SNPs are calculated (such as with HapMap reference panels).
One solution to the problem of underdetection of SNPs in low LD posed by permutation tests is transformation of the genebased test statistic so that under the null it follows a standard parametric distribution, rather than creating a nonstandard null distribution through permutations. One way to accomplish this task, and one in which it is unnecessary to reanalyze data, is to perform a joint test on the Zstatistics coming from a univariate regression model for each marker. It is a an approach similar to that described by Yang et al. (2012), though uses summary statistics directly rather than estimated model coefficients. Since the estimated covariance structure of these statistics under the null is the correlation of the markers themselves [14], one can use the data to estimate the covariation of the Zstatistics or an online database of LD or correlation structure of SNPs. The intuition behind this result is that if two markers are highly correlated, then when by chance under the null, one marker is significant (or insignificant), the other marker will similarly be significant (or insignificant). However, if two markers are not correlated, then the chance significance or insignificance of one marker will not inform the significance of the other marker. And since Zstatistics have variance 1 by definition, their covariance matrix is identical to their correlation matrix. Thus, supposing we have q markers, which, from previous studies, are known to have Zstatistics of Z=(Z_{1},…,Z_{ q })^{ T }, and which have correlation structure V, then under the null hypothesis of no marker being associated with the outcome, $T\equiv {\mathbf{\text{Z}}}^{T}{V}^{1}\mathbf{\text{Z}}\sim {X}_{q}^{2}$. One then rejects the null of no association between the region composing the q markers and the outcome for an extreme value of the test statistic T using a predetermined α level.
If one is not confident that V accurately reflects the correlation of the SNPs in the data matrix and therefore Z under the null hypothesis (because V is an estimate possibly coming from a reference panel if the original data is not available and the analysis is performed only with access to summary Zstatistics), it is possible to construct a more conservative test by shrinking the offdiagonal elements of V towards 0. Thus, if V is an estimate of the covariance of the SNPs, one can compute ${V}_{\gamma}^{\ast}\equiv \mathrm{\gamma V}+(1\gamma ){\mathbf{\text{I}}}_{q}$, where I_{ q } is a q×q identity matrix and 0≤γ≤1.
Again using [15], if Z∼M V N(0,V) but we use ${V}_{\gamma}^{\ast}$ as an estimate of the correlation structure in the gene based test, then ${Z}^{T}{V}_{\gamma}^{\ast 1}Z\sim \sum _{i=1}^{q}{\lambda}_{i}\xb7{\chi}_{1}^{2}$ where q is the dimension of vector Z, and where λ_{ i } is the i^{ t h } eigenvalue of $\mathit{V}{V}_{\gamma}^{\ast 1}$. By construction of ${V}_{\gamma}^{\ast 1}$, $\sum _{i=1}^{q}{\lambda}_{i}<dim\left(Z\right)$ for 0<γ<1, where dim(·) the dimension of the vector argument. This fact in itself does not not necessarily imply a more conservative test for all size α tests because when eigenvalues are not equal to one another as is the case for the decomposition of $\mathit{V}{V}_{\gamma}^{\ast 1}$ with $\mathit{V}\ne {V}_{\gamma}^{\ast 1}$, $\sum _{i=1}^{k}{\lambda}_{i}<dim\left(Z\right)$ can be true, but ${Z}^{T}{V}_{\gamma}^{\ast 1}Z$ is not stochastically less than ${\chi}_{dim\left(Z\right)}^{2}$, the null distribution of the test statistic when the correlation structure is correctly known. However, for modest values of γ (i.e., 0.81.0, where 1.0 corresponds to no transformation), the test using the adjusted correlation matrix will generally be more conservative. It is difficult to obtain simple solutions for how much conservative a test will be using this modification since it will depend on the quantile corresponding to the intended type 1 error and the specific $\mathit{V}{V}_{\gamma}^{\ast 1}$.
Thus, to give a practical sense of useful values of γ, we borrowed a correlation structure of eight SNPs in the INSIGF2 gene of Chromosome 11 from the CEU reference panel in one case and the CHB+JPT reference panel in the other case [13]. If the true, underlying population giving rise to the SNPs was more reflective of the CEU reference panel, but the analyst incorrectly guessed the correlation structure to be that of the CHB+JPT reference panel when performing the summary statistic genebased test,the type 1 error rate for a nominal 0.05 size test would in fact be a highly inflated 0.61. Similarly, if the true, underlying population giving rise to the SNPs was more reflective of the CHB+JPT reference panel, but the analyst incorrectly guessed the correlation structure to be that of the CEU reference panel for the summary statistic genebased test, the type 1 error rate for a nominal 0.05 size test would be 0.69. If the type 1 error is inflated in one scenario, there is no implication that it will be deflated in the ’inverse’ scenario.
In the scenario where the underlying population was more reflective of the CEU panel, the type 1 error using our modified summary statistic test with adjusted correlation matrix and γ’s of 0.9, 0.5, and 0.3 led to reduced error rates of 0.36, 0.11, and 0.09, respectively, instead of 0.61. When the underlying population was CHB+JPT, the type 1 error using our adjustment correlation matrix and γ’s of 0.9, 0.5, and 0.3 led to reduced error rates of 0.45, 0.10, and 0.07, respectively, instead of 0.69.
Type 1 error rates were calculated by sampling from an eightdimensional multivariate normal distribution with mean vector 0 (i.e., under the null) and unit variance for all elements, but whose correlation structure corresponded to the “true” correlation structure of the population in any one of these two situations described, be it that of the CEU or CHB+JPT panel. We then multiplied each sample, Z_{ 1 },…Z_{ n }, with n=2000, by the estimated correlation matrix, V_{ γ }, by taking ${Z}_{i}^{T}{V}_{\gamma}^{1}{Z}_{i}$, where V was taken from the panel unreflective of the truth (e.g., the CEU correlation structure was used for an underlying, true correlation structure corresponding to the CHB+JPT panel and vice versa) and possibly modified with the γ parameter. To find the error rate, we calculated the probability mass beyond the 0.95 quantile of ${\chi}_{8}^{2}$, the distribution ${Z}_{i}^{T}{V}_{\gamma}^{1}{Z}_{i}$ would follow if V a r(Z_{ i })=V_{ γ } (i.e., assuming the true correlation of Z_{ i } was known).
While in all cases, the nominal size of the test is not quite achieved, type 1 error is greatly reduced, and in some cases will be achieved when divergence between correlation structures of the true and hypothesized populations are not as great as that in these scenarios. The greatest reduction in type 1 error occurs with initial deviation of γ from 1; i.e., a movement of γ from 1 (indicating an unadjusted correlation matrix) to 0.9 will reduce type 1 error more than a movement of 0.6 to 0.5. And, as mentioned, with very small values of γ, there is not necessarily a guarantee of continued reduction in type 1 error for some nominal α level tests, nor should such γ values be used if indicative of no confidence in one’s estimated correlation matrix.
To simulate a less drastic divergence between true and estimated correlation matrices and assess error rates and the proposed adjustment method in that context, we generated correlation matrices whose entries were betadistributed random variables with means corresponding to the entries in the CHB+JPT reference panel and standard deviation approximately 0.030.04 (approximately because standard deviation is partly a function of the mean). With a population whose underlying correlation structure was in truth reflective of the CHB+JPT panel, but using the generated correlation matrices in our calculations of the test statistic, the average type 1 error rate was 0.19. Adjusting the generated correlation matrices according to our method and with a γ value of 0.95, the error rate was reduced to 0.05. Adjustment of the generated correlation matrices with a γ value of 0.90 led to a type 1 error rate of 0.03.
The summary statistic based test we have proposed is a viable way of performing genebased testing when one does not want power to vary as a function of the correlation structure of the SNPs composing the gene. A weakness of such an approach is an inability to know the underlying correlation structure of the SNPs used in the univariate regression analyses giving rise to the Zstatistics used in the summary statistic test. We have shown that incorrect guesses of the underlying correlation structure can lead to a significant increase in the type 1 error rate and therefore have proposed an adjustment method which can lead to achievement of error rates in line with the nominal size of the test. However, since by supposition of this setting the correlation structure of SNPs is never known, it is impossible to know the needed value of γ. As a result, it may be best to perform one’s summary statistic based test with γ values ranging from 0.81.0 as a sensitivity analysis to see how one’s conclusions change based on different values. Values of γ smaller than 0.8 probably reflect little confidence in the estimated correlation structure, in which case feasibility of the analysis in the first place should be reassessed.
Description of Eigen decompositionbased test
The above approach controls for the LD structure of the region under study by transforming the test statistic so that LD no longer affects the power to detect significant regions. However, there are other ways one can make use of the LD structure to construct more powerful tests, such as by dimension reduction. Consider an extreme example where an investigator is interested in a region with d SNPs, and these SNPs are in nearly perfect LD so that a correlation matrix of them has offdiagonal entries close to 1. Because they are highly correlated, the association between any SNP and the outcome adds little information on top of that between any other SNP and the outcome. As a result, intuition may tell us that using a dd.f. test on the region after having properly accounted for the underlying LD structure is not the most powerful approach since there is essentially the information of 1 SNP contained in the entire region. On the other hand, it is difficult to justify focusing on any one SNP over another as one might do when “tagging” the region. Also, while no additional SNP contributes much information over another, there is still some amount of additional information contained in each one that, ideally, would not be ignored.
Finding the eigenvectors and values of the information matrix is one way to approach this scenario. Unlike the previously proposed summary statistic test, this analysis approach requires the original SNP data. It gleans the essential information from the LD block, thus stripping away extraneous information that dilutes the power of proposed tests while avoiding the arbitrariness of pruning the number of SNPs being examined. It is an approach similar to finding the principal components of the data matrix and then regressing the outcome on those components if the information matrix is evaluated under the null [2, 3] and may even be thought of as a score test analogue to it. If certain covariates have been shown to control for population stratification, it also may be fitting for the matrix to be evaluated under the alternative using the estimated effect sizes of those covariates. Also, as simulations demonstrate, there may be power gains under certain correlation structures or when effect or sample sizes are small. Since the information matrix is the covariation of the scores associated with each marker and since score functions of highly correlated markers are correlated as well, identifying the chief axes of the covarying scores is synonymous with finding the eigenvectors of the information matrix. One can then detect small deviations from the mean under the null hypothesis, the 0 vector, by performing a parsimonious test. Additionally, if we are considering the underlying model to be that of logistic regression, both the information matrix and the score have simple forms and are computationally tractable.
where, as is consistent from our definition of I_{j,l}, I_{l,j}=I_{j,l}. Also, I is (q+1)×(q+1) because there are q markers and 1 intercept available for use in the model.
Now define I_{ q } to be the q×q information matrix for the covariates, not including the intercept. That is, I_{ q } is I_{ q+1 } without the first column and first row of I_{ q+1 }. I_{ q } can be decomposed into E W E^{ 1 }, where E≡(e_{ 1 }, e_{ 2 }, ⋯,e_{ q }) is a matrix of the q eigenvectors e_{ i }, 1≤i≤q, and W≡diag(λ_{1}, λ_{2} ⋯,λ_{ q }) is a diagonal matrix of the corresponding eigenvalues λ_{ i }, 1≤i≤q. While one can use a variable number of eigenvectors in the analysis, if we suppose that we are in the situation described above where all d markers are highly correlated, then making use of just the first component may be sufficient to adequately encompass the information contained in the genomic region. More generally, a systematic criterion for deciding which eigenvectors to use is employing all those whose associated eigenvalues are larger than the average eigenvalue.
For the sake of explanation, we suppose first that we will construct the test using only the eigenvector associated withe largest eigenvalue and then generalize later. Denote e_{ 1 } the first column of E and vector associated with the largest eigenvalue (assume the columns of E are ordered according to decreasing eigenvalue). The interpretation of e_{ 1 } is the axis of maximum variation of the distribution whose covariance matrix is I_{ q }, and λ_{1}, the associated eigenvalue, can be interpreted as the variation along that axis. Since I_{ q } is q×q, e_{ 1 } is a (q×1) unit eigenvector. Define a new 2×2 information matrix I^{ ∗ } as I^{ ∗ }_{(1,1)}≡I_{q+1 (1,1)}, I^{ ∗ }_{(2,2)}≡λ_{1}, and I^{ ∗ }_{2,1}=I^{ ∗ }_{1,2}=e_{ 1 }^{ T }·I_{ q (,1) }, where I_{ q (,1) } is the first column of I_{ q }, v^{ T } denotes the transpose of vector v, and e_{ 1 }^{ T }·I_{ q (,1) } denotes the dot product of vectors e_{ 1 } and I_{ q (,1) }. The test statistic for the 1 d.f. score test analogue of the method described in [2, 3] is then (S^{ t }·e_{ 1 })^{2}·[(I^{∗ 1})_{(2,2)}], where S is the qdimensional vector of scores associated with the q markers, which follows a ${\chi}_{1}^{2}$ distribution under the null hypothesis of no geneoutcome association.
Define ${{I}^{\ast \ast}}_{p\times p}^{1}$ as the lowerright p×p submatrix of I^{ ∗∗ }^{1}. The test statistic is ${\mathit{S}}^{T}\xb7({e}_{1}\phantom{\rule{0.3em}{0ex}}\dots \phantom{\rule{0.3em}{0ex}}{e}_{p})\xb7{{I}^{\ast \ast}}_{p\times p}^{1}\xb7{({e}_{1}\phantom{\rule{0.3em}{0ex}}\dots \phantom{\rule{0.3em}{0ex}}{e}_{p})}^{T}\xb7\mathit{S}$, which follows a ${\chi}_{p}^{2}$ distribution under the null hypothesis of no geneoutcome association, where again S is a vector of scores of length q.
Analogous to the test statistic defined in the previous test, define ${{I}^{\ast \ast \ast}}_{{p}^{\prime}\times {p}^{\prime}}^{1}$ as the lowerright (p^{′}×p^{′}) submatrix of I^{ ∗∗∗ }^{1} and S^{ ′ } as the vector of scores associated with the (q1) markers whose dimension we reduce. The test statistic is ${{S}^{\prime}}^{T}\xb7({e}_{1}^{\prime}\phantom{\rule{0.3em}{0ex}}\dots \phantom{\rule{0.3em}{0ex}}{e}_{{p}^{\prime}}^{\prime})\xb7{{I}^{\ast \ast \ast}}_{{p}^{\prime}\times {p}^{\prime}}^{1}\xb7{({e}_{1}^{\prime}\phantom{\rule{0.3em}{0ex}}\dots \phantom{\rule{0.3em}{0ex}}{e}_{{p}^{\prime}}^{\prime})}^{T}\xb7{S}^{\prime}$, which follows a ${\chi}_{{p}^{\prime}}^{2}$ distribution under the null hypothesis of no geneoutcome association.
While there was extended discussion of the type 1 error rate for the previous, summary statistic, test, and how it varies as a function of the shrinkage parameter, γ, since V was considered possibly misspecified, there is no such necessary discussion of type 1 error for the Eigen decomposition test; since we assume perfectly known data for the Eigen decomposition test, it is a relatively standard, parametric, hypothesis test so that asymptotic results hold and type 1 error rates correspond to the nominal size of the test.
Results and discussion
Simulation results for summary statistic based test and comparison with Hotelling’s T^{ 2 }
Simulations were generated under the same framework as we used with the permutation test simulation above. Covariates were generated with a minor allele frequency of approximately 0.3, and, within any LD block, correlation between SNPs was again approximately 0.65, whereas SNPs not in the LD block were independent of one another. We assumed HardyWeinberg equilibrium. The gene consisted of 20 SNPs, and there were 600 subjects with an equal number of cases and controls. Power calculations were based on 1000 iterations at each effect size (e.g., Figure 5) or LD block size (e.g., Figure 6). Lastly, binary outcomes were generated assuming a logistic regression model, where presence of the causal SNP determined the probability of being a case or control.
Simulation results for Eigen decompositionbased test
The simulation framework for the Eigen decomposition test differed slightly from that of the permutationbased test and the summary statistic test. The gene consisted of 15 SNPs, and there were 800 subjects with an equal number of cases and controls. Power calculations were based on 2000 iterations at each of effect size (Figures 7 and 8). We calculated power at 11 different effect sizes, with the effect size ranging from a log OR of 0 to 1.0. As described in the previous paragraph, correlation scenarios varied between independence of SNPs (Figure 7) to mutual correlation of SNPs (Figure 8). Minor allele frequency again fell in the range of “common” variants at 0.3 with HardyWeinberg equilibrium assumed. Binary outcomes were generated assuming a logistic regression model, and presence of the causal SNP and the effect size determined probability of being a case or control.
Data analysis of oral cleft sequence data and COPD summary statistics
We analyze a sequence data set composed of 192 cases of cleft lip and 192 controls, on whom we have data for 14 SNPs. The data come from a GWAS in which a candidate gene was identified and then sequenced [16]. We prune the data set so that any observations with missing values or deletions are excluded, giving 172 cases and 176 controls. We also prune SNPs so that any SNP with a MAF less than 0.02 among either cases or controls is excluded, leaving 8 SNPs. We calculate the correlation matrix of SNPs by pooling cases and controls. Using the summary statistic based test, we find that the region composed of the 8 SNPs is associated with cleft lip (p=0.06). Using the Eigen decomposition based test with the two eigenvectors whose associated eigenvalue is bigger than the average eigenvalue, we calculate a pvalue of 0.017; using 3 eigenvectors such that more than 80% of the variation in scores is explained, we calculate a pvalue of 0.016. Thus, as is consistent with the potential power gains posed by dimension reduction, this latter test shows a stronger association between the region of 8 SNPs and cleft lip. For comparison, we also calculated a permutation test pvalue, giving 0.008 (5000 permutations), and a Hotelling’s T^{2} pvalue, giving 0.056 (nonparametric, permutationbased pvalue for this test gives 0.057). Assuming little correlation among SNPs, one would expect the permutation test pvalue to give a pvalue similar to that of the summary statistic based test. The greater significance of the permutation test pvalue suggests that a significant SNP is in LD with other SNPs and examination of the data matrix confirms this idea; a SNP whose pvalue is 0.018 using a univariate logistic regression model is highly correlated with one SNP (r=0.71) and moderately correlated with another SNP (r=0.43). Since only 8 SNPs are being analyzed, these two SNPs in LD with the significant SNP may be driving the significance of the permutation test.
We also apply the summarystatistic based test to results borrowed from an alreadypublished GWAS along with information on the correlation of markers taken from HapMap [10, 13]. Pillai et al. (2009) identified 5 SNPs in the CDKAL1 gene on Chromosome 6 to be associated with Chronic Obstructive Pulmonary Disorder (COPD). We run our summary statistic based test on their results. Since the results come from a study of Norwegians, we use the (CEU) reference panel from HapMap as an estimate of the correlation structure of SNPs. The underlying population is unlikely to be identical, however, and so we adjust the correlation matrix, shrinking the offdiagonal elements toward 0 as described in the modification of the summary statistic based test to preserve the type 1 error rate. We do so with γ values of 1 (i.e., assuming the correlation structure is correct), 0.9, and 0.8, and corresponding pvalues for the 5 d.f. test are 0.0066, 0.003, and 0.001. While the summary statistics we use are borrowed from the 100 most significant SNPs of their analysis [10], the high level of significance for tests corresponding to all γ values and nonarbitrary choice all SNPs in the chosen gene suggest that there is likely some association between the CDKAL1 gene and COPD. Since the test statistics are themselves random variables, specific realizations of them are not necessarily associated with increasing pvalues as one would anticipate with decreases in γ. However, work above has shown that, in general, modest decreases in γ should help preserve type 1 error.
Conclusion
With the availability of sequence data and GWAS, the importance of statistical analysis is shifting from singlelocus tests to multiloci tests that can cover genomic regions, e.g., genes or even pathways. The motivation for this development is to test a hypothesis more grounded in biology and, at the same time, to reduce the multiple testing problem and allow for many SNPs with a small effect size to increase the power of the test by their combined inclusion in the model. One of the theoretical issues that has so far not been addressed adequately is the impact of LD on the power of the test statistic in permutationbased gene tests. Controlling for the LD between loci is important to assess the relative importance of the different regions that are tested, especially when LD heterogeneity between regions is significant. In this paper, we have proposed 2 approaches that address this issue.
While our summary statistic based test may give one similar results to a Hotelling’s T^{2} based test, the summary statistic test does not require the original marker data from which Zstatistics are calculated. This unique advantage opens up the possibility for more indepth analysis of previously published studies, and, with sufficient methodological development, could even suggest summary statistic based pathway analyses when combined with summary statistics from expression analyses. It also opens up the possibility of crossstudy genebased tests, where Zstatistics from the same markers are combined across previously published GWAS to reap power gains. A shortcoming of our summary statistic based test is that if the estimated correlation structure used in the test is not reflective of the underlying population, the test may suffer from inflated type 1 error. We therefore proposed an modification of the test by adjusting the estimated correlation matrix, which, in general, should help control the error rate. If there is insufficient justification for why the estimated correlation matrix is representative of the underlying population, the test should not be used even with correlation matrix adjustment. If one has the original SNP data, one can perform the dimensionreduced score test proposed in this paper, reducing the dimension to the number of eigenvectors that explains some predetermined proportion of variation in the data.
Both of the proposed genebased tests in this paper fail to describe the direction of association between the gene and outcome, instead describing only significance of association. Direction of association is a difficult concept to interpret when a gene is composed of multiple SNPs, with some alleles protective and others a risk factor for the outcome. One goal in genebased testing might be to gain an understanding of such a concept. Additionally and with regard to dimension reduction approaches, if alleles in a dimensionreduced block of SNPs are both protective and harmful, there could be a loss of power using a dimensionreduction genebased test. A test that used a priori analyses to decide whether alleles are protective or harmful and, in turn, used that information to inform the dimension reduction process might be another valuable area of research in genebased testing.
Abbreviations
 OR:

Odds Ratio
 COPD:

Chronic Obstructive Pulmonary Disorder
 LD:

Linkage Disequilibrium
 SNP:

Single Nucleotide Polymorphism.
Declarations
Acknowledgements
This work was supported by the National Institutes of Health [Training Grant T32 NS048005].
Authors’ Affiliations
References
 Liu J, Mcrae A, Nyholt D, Medland S, Wray N, Brown K, Hayward N, Montgomery G, Visscher P, Martin N, et al: A versatile genebased test for genomewide association studies. Am J Hum Genet. 2010, 87: 139145. 10.1016/j.ajhg.2010.06.009.PubMed CentralView ArticlePubMedGoogle Scholar
 Gauderman W, Murcray C, Gilliland F, Conti D: Testing association between disease and multiple SNPs in a candidate gene. Genet Epidemiol. 2007, 31 (5): 383395. 10.1002/gepi.20219. http://dx.doi.org/10.1002/gepi.20219,View ArticlePubMedGoogle Scholar
 Wang K, Abbott D: A principal components regression approach to multilocus genetic association studies. Genet Epidemiol. 2008, 32 (2): 108118. 10.1002/gepi.20266. http://dx.doi.org/10.1002/gepi.20266,View ArticlePubMedGoogle Scholar
 Li M, Gui H, Kwan J, Sham P: GATES: a rapid and powerful genebased association test using extended Simes procedure. Am J Hum Genet. 2011, 88 (3): 283293. 10.1016/j.ajhg.2011.01.019. http://dx.doi.org/10.1016/j.ajhg.2011.01.019,PubMed CentralView ArticlePubMedGoogle Scholar
 Li M, Wang K, Grant S, Hakonarson H, Li C: ATOM: a powerful genebased association test by combining optimally weighted markers. Bioinformatics (Oxford, England). 2009, 25 (4): 497503. 10.1093/bioinformatics/btn641. http://dx.doi.org/10.1093/bioinformatics/btn641,View ArticleGoogle Scholar
 Wang T, Elston R: Improved power by use of a weighted score test for linkage disequilibrium mapping. Am J Hum Genet. 2007, 80 (2): 353360. 10.1086/511312. http://dx.doi.org/10.1086/511312,PubMed CentralView ArticlePubMedGoogle Scholar
 Purcell S, Neale B, ToddBrown K, Thomas L, Ferreira M, Bender D, Maller J, Sklar P, De Bakker P, Daly M, et al: PLINK: a tool set for wholegenome association and populationbased linkage analyses. Am J Hum Genet. 2007, 81 (3): 559575. 10.1086/519795.PubMed CentralView ArticlePubMedGoogle Scholar
 Moskvina V, Schmidt K, Vedernikov A, Owen M, Craddock N, Holmans P, O’Donovan M: Permutationbased approaches do not adequately allow for linkage disequilibrium in genewide multilocus association analysis. Eur J Hum Genet. 2012, 20,Google Scholar
 Yang J, Ferreira T, Morris A, Medland S, Replication DG, Madden P, Heath A, Martin N, Montgomery G, Weedon M, Loos R, Frayling T, Mark M, Hirschhorn J, Goddard M, Visscher P, of ANthropometric Traits (GIANT) Consortium GI: Conditional and joint multipleSNP, analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nature Genet. 2012, 44 (4): 36975 S1. 10.1038/ng.2213. http://dx.doi.org/10.1038/ng.2213,PubMed CentralView ArticlePubMedGoogle Scholar
 Pillai SG, Ge D, Zhu G, Kong X, Shianna KV, Need AC, Feng S, Hersh CP, Bakke P, Gulsvik A, et al: A genomewide association study in chronic obstructive pulmonary disease (COPD): identification of two major susceptibility loci. PLoS Genet. 2009, 5 (3): e100042110.1371/journal.pgen.1000421.PubMed CentralView ArticlePubMedGoogle Scholar
 Stefanski L, Carroll R: Score tests in generalized linear measurement error models. J R Stat Soc. Series B Methodological. 1990, 345359.Google Scholar
 Lagakos S: Effects of mismodelling and mismeasuring explanatory variables on tests of their association with a response variable. Stat Med. 1988, 7 (12): 257274. 10.1002/sim.4780070126. http://dx.doi.org/10.1002/sim.4780070126,View ArticlePubMedGoogle Scholar
 The International HapMap Consortium: The international HapMap project. Nature. 2003, 426: 789796. 10.1038/nature02168.View ArticleGoogle Scholar
 Conneely KN, Boehnke M: So many correlated tests, so little time! rapid adjustment of Pvalues for multiple correlated tests. Am J Hum Genet. 2007, 81 (6): 11581168. 10.1086/522036.PubMed CentralView ArticlePubMedGoogle Scholar
 Imhof J: Computing the distribution of quadratic forms in normal variables. Biometrika. 1961, 48 (3/4): 419426. 10.2307/2332763.View ArticleGoogle Scholar
 Mangold E, Ludwig KU, Birnbaum S, Baluardo C, Ferrian M, Herms S, Reutter H, de Assis NA, Al Chawa T, Mattheisen M, et al: Genomewide association study identifies two susceptibility loci for nonsyndromic cleft lip with or without cleft palate. Nat Genet. 2009, 42: 2426.View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.