Improving power in genetic-association studies via wavelet transformation
© Jiang et al. 2009
Received: 29 April 2009
Accepted: 11 September 2009
Published: 11 September 2009
A key to increasing the power of multilocus association tests is to reduce the number of degrees of freedom by suppressing noise from data. One of the difficulties is to decide how much noise to suppress. An often overlooked problem is that commonly used association tests based on genotype data cannot utilize the genetic information contained in spatial ordering of SNPs (see proof in the Appendix), which may prevent them from achieving higher power.
We develop a score test based on the wavelet transform with empirical Bayesian thresholding. Extensive simulation studies are carried out under various LD structures as well as using HapMap data from many different chromosomes, for both qualitative and quantitative traits. Simulation results show that the proposed test automatically adjusts the level of noise suppression according to LD structures, and that it consistently achieves power higher than or similar to that of many commonly used association tests, including the principal component regression method (PCReg).
The wavelet-based score test automatically suppresses the right amount of noise and uses the information contained in spatial ordering of SNPs to achieve higher power.
In a genome-wide association study (GWAS), if a SNP has strong LD with a disease locus, single marker methods should have more power than multiple marker methods. However, if several SNPs have moderate associations with disease genes, multiple marker methods (such as Hotelling's T 2 test [1–4] or multiple logistic regression) can provide higher power. One of the problems of multiple marker methods is their large number of degrees of freedom, which in turn may lead to low power. Therefore, reducing the number of degrees of freedom becomes a key issue in gaining power for a multilocus method. For example, in haplotype association studies, tests based on haplotype sharing [6, 7] have fewer degrees of freedom and higher power than tests based on haplotype frequencies. Another common approach to reducing the number of degrees of freedom is PCReg [8, 9]. Projections of genotype data onto a few principal component directions of the variance-covariance matrix can often capture a majority of the genotypic variance with fewer degrees of freedom, so tests based on the projected genotype data can potentially have higher power. A weighted score test based on the Fourier transform was also introduced to reduce the number of degrees of freedom. The basic idea behind this test is to transform genotype data into Fourier coefficients and form the test based on those coefficients. Noise reduction is done by putting high weights on low frequency components and low weights on high frequency components. The rationale is that, assuming the signal function belongs to a certain smoothness class, high frequency components are mostly noise and should therefore be suppressed. This method works well if genotypic variation is spread across all SNPs.
Under the common disease and rare variant hypothesis, a collapsing method and a combined multivariate and collapsing method were proposed , in which the genotypes of rare variants are collapsed to reduce the number of degrees of freedom and to increase power.
Reducing the number of degrees of freedom in genetic studies is similar to suppressing noisy data; the difficulty is to know how much of the data should be suppressed. A good test should adapt to LD structures, which means it can automatically decide the amount of noise to be suppressed. In a window of SNPs, if most of the SNPs are not associated with the disease, then most genotypic variations are noise, which should all be suppressed. On the other hand, if most of the SNPs are in fact associated with the disease, then these genotypic variations are true signals and therefore should be kept. An ideal noise suppression (de-noising) process should automatically choose an optimal suppression level to maximize the signal to noise ratio. We propose a novel test, which is able to automatically choose an optimal suppression level.
Many tests based on genotype data, for example, Hotelling's T 2 test [1–4], logistic regression test and PCReg , do not take the spatial order of the SNPs on a chromosome into consideration (see proof in the Appendix). As a result, interchanging the relative positions of two SNPs does not have an effect on the tests (see proof in the Appendix). The test we proposed in this paper takes the order of the SNPs on a chromosome into consideration, and it gains power by doing so. Tests based on haplotype data also consider the order of SNPs on a chromosome. For example, the ordering of SNPs is essential for tests based on haplotype similarity (the longest continuous interval of matching alleles between haplotypes). If the relative order of SNPs is changed, the shared length between two haplotypes will be reduced and the disease association cannot be detected. If we assume affected individuals share some ancestral haplotypes, the relative positions of SNPs on a chromosome contain genetic information associated with a disease, and ignoring this information may reduce the power of genetic association tests. The proposed test is a compromise between these genotype-based and haplotype-based tests in the sense that the ordering of SNPs is considered but ambiguous haplotypes do not need to be inferred.
Wavelet transformation is a method for decomposing data into different frequency components. It is an effective noise suppressing (de-noising) method, and it is designed to deal with choppy signals. Because of its adaptability to jumps, small wiggles, and other unusual features in the target function, it has become an important tool to replace the Fourier transform in many practical situations. The introduction of the wavelet method in statistics began more than a decade ago . Since then it has been applied to many areas of statistics including nonparametric regression, time series analysis, nonparametric density estimation, and contingency table cell probability estimation .
A function f can be expanded in a wavelet series

f(x) = Σ_j c_{0,j} φ_{0,j}(x) + Σ_i Σ_j c_{i,j} ψ_{i,j}(x),

where the c_{i,j} are the wavelet coefficients, and the φ_{i,j} and ψ_{i,j} are the wavelet basis functions generated from a father wavelet φ and a mother wavelet ψ. A discrete wavelet transform changes discrete data into wavelet coefficients. There are many choices for the father and mother wavelets, and they are chosen to give the wavelet transformation desired properties: to suppress noise more effectively, to adapt easily to dense or sparse signals, and to deal with unsmooth functions with unusual features. Noise suppression is achieved by removing terms from the above summation (setting some wavelet coefficients to zero), while the main features of the target function are still kept. Computing discrete wavelet coefficients is also faster than computing Fourier coefficients: the time required for the discrete wavelet coefficients of n data points is O(n), versus O(n log n) for Fourier coefficients.
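The O(n) pyramid algorithm can be illustrated with the orthonormal Haar wavelet, the simplest father/mother pair. This is a stdlib-only sketch with function names of our own choosing, not code from the paper, which does not specify its wavelet family here:

```python
import math

def haar_dwt(x):
    """Orthonormal Haar wavelet transform of a sequence whose length is a
    power of two. The pyramid algorithm runs in O(n): each pass halves
    the data, so the total work is n + n/2 + n/4 + ... < 2n."""
    x = list(x)
    details = []  # detail coefficients, finest level first
    while len(x) > 1:
        approx, detail = [], []
        for i in range(0, len(x), 2):
            approx.append((x[i] + x[i + 1]) / math.sqrt(2))
            detail.append((x[i] - x[i + 1]) / math.sqrt(2))
        details.append(detail)
        x = approx
    return x[0], details  # overall scaling coefficient plus detail levels

def haar_idwt(scaling, details):
    """Invert haar_dwt: rebuild the data level by level."""
    x = [scaling]
    for detail in reversed(details):
        nxt = []
        for a, d in zip(x, detail):
            nxt.append((a + d) / math.sqrt(2))
            nxt.append((a - d) / math.sqrt(2))
        x = nxt
    return x
```

Because the transform is orthonormal, the inverse reconstructs the original data exactly; setting some detail coefficients to zero before inverting is precisely the noise suppression described above.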
There are three steps in a wavelet transformation. Step one transforms genotype scores into wavelet coefficients. Step two decides which wavelet coefficients should be removed or shrunk; this process is called thresholding. Step three transforms the modified wavelet coefficients back into modified genotype scores. The key to reducing the number of degrees of freedom and increasing the power lies in step two, thresholding. A wavelet coefficient is kept if it is greater than a high threshold, dropped if it is smaller than a low threshold, and shrunk if it lies between the two thresholds. For a data set, one can choose a single threshold for all wavelet coefficients, or multiple thresholds to handle wavelet coefficients at different resolution levels and with different frequencies. One can also choose thresholds manually, but this will not work for a GWAS; the scale of a GWAS calls for a data-dependent choice of thresholds. The simplest data-dependent threshold is the universal single threshold, given by σ√(2 log n), where σ is the standard deviation of the noise and n is the sample size. It is generally believed that at high resolution levels the wavelet coefficients of the signal are more sparse than at low resolution levels; therefore, a level-dependent choice of thresholds is better than the universal threshold. A false discovery rate (FDR) method has been used to choose level-dependent thresholds. While it works well for sparse signals, it does not adapt very well to dense signals. Many other thresholding methods have been proposed, and this remains an active research area.
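The universal threshold and the shrinkage rule in step two can be sketched as follows; this is an illustrative stdlib-only sketch (soft thresholding with the Donoho-Johnstone universal threshold), not the empirical Bayesian rule the paper adopts:

```python
import math

def universal_threshold(noise_sd, n):
    """Donoho-Johnstone universal threshold: sigma * sqrt(2 log n)."""
    return noise_sd * math.sqrt(2 * math.log(n))

def soft_threshold(coeffs, t):
    """Soft thresholding: coefficients with |c| <= t are set to zero,
    the rest are shrunk toward zero by t."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]
```

For example, soft_threshold([5.0, -4.0, 0.5], 2.0) returns [3.0, -2.0, 0.0]: the small coefficient is dropped entirely and the large ones are shrunk, which is how the number of effective degrees of freedom is reduced.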
We chose empirical Bayesian thresholding. The advantages of this approach include its superior adaptability to sparse and dense signals, and it is a data-dependent, automatic procedure. Let C_i = μ_i + ϵ_i be the sample wavelet coefficients at a certain resolution level, where ϵ_i represents noise. Suppose the signal μ_i has the prior distribution (1 - w)δ_0 + wγ, where δ_0 is an atom of probability at zero and γ is a unimodal symmetric density. Let μ̂(c, w) be the median of the posterior distribution of μ given C = c. There exists t(w) such that μ̂(c, w) = 0 if and only if |c| ≤ t(w). If ŵ is the marginal maximum likelihood estimator of w, then the empirical Bayesian threshold is t(ŵ).
The empirical p-values are calculated by permutations. The permutation approach may have limitations for a GWAS. However, when we applied this method to a data set of about 2,000 individuals with 500,000 SNPs, it took less than a week to run on a cluster with 23 nodes and 65 CPUs, which is still manageable.
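The permutation scheme can be sketched generically; the statistic argument below stands in for any test statistic (the paper's wavelet score statistic is not reproduced here), and the helper name and defaults are our own:

```python
import random

def permutation_pvalue(statistic, phenotypes, genotypes, n_perm=999, seed=1):
    """Empirical p-value: the proportion of phenotype permutations whose
    statistic is at least as large as the observed one, with the usual
    +1 correction so the p-value is never exactly zero."""
    rng = random.Random(seed)
    observed = statistic(phenotypes, genotypes)
    labels = list(phenotypes)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any phenotype-genotype association
        if statistic(labels, genotypes) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)
```

Shuffling phenotype labels rather than genotypes preserves the LD structure among SNPs, which is what makes the permutation null valid for multilocus statistics.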
For a case-control design, we compare the proposed score test based on the wavelet transform (T w ) with the following commonly used tests: the score test based on the Fourier transform (T f ); a test obtained by fitting a regression function with one SNP, followed by Bonferroni correction to find the global p-value (T b ); and a likelihood-ratio test based on logistic regression (T l ). For quantitative traits, we compare T w with T f , T b , and PCReg (T p ). The reasons for these choices are the following: we chose T f because both T w and T f are affected by the order of the SNPs, and we want to show that T w adapts better than T f to sparse and dense data; we chose T b because it is a common and very effective test when one SNP has a strong association with the disease. If several SNPs each carry small to moderate information, a regression with multiple SNPs may have advantages, which is the reason behind our choice of T l . Neither T b nor T l is affected by the order of SNPs, and they do not suppress data to reduce the degrees of freedom. The strength of T p lies in data suppression. A comparison of T w with T f , T b , T l , and T p demonstrates that T w achieves higher power by effectively suppressing noise and by using the extra information in the ordering of the SNPs.
The type I error rates and the powers of T w , T f , T b , and T l were analyzed. In our simulation study, there are 200 cases and 200 controls, and the number of SNPs is m + 1, with the SNP at the center being an unobserved disease SNP. The allele frequencies of the m non-disease SNPs are obtained from a uniform distribution between 0.2 and 0.8, and the allele frequency of the disease SNP is p. The haplotypes of the m + 1 SNPs are generated from a multivariate normal distribution with variance-covariance matrix (ρ_ij). The m + 1 allele frequencies give the cutoff points, which translate a multivariate normal vector into a haplotype. The sum of two haplotypes is a genotype vector. We can only observe the genotype data of cases and controls on the m SNPs; the genotype of the disease SNP is not observable.
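The haplotype-generation step can be sketched as follows. As an illustrative assumption, the latent Gaussian vector is given an AR(1) correlation structure, corr(z_i, z_j) = ρ^|i-j|, as one concrete choice of (ρ_ij); the paper's actual LD structures A1 through A6 are not reproduced here:

```python
import math
import random
from statistics import NormalDist

def simulate_haplotype(freqs, rho, rng):
    """One haplotype: a latent AR(1) Gaussian vector z with
    corr(z_i, z_j) = rho**|i - j|, dichotomized at the normal quantile
    of each allele frequency, so marker i carries allele 1 with
    probability freqs[i]."""
    z = rng.gauss(0.0, 1.0)
    hap = []
    for p in freqs:
        cutoff = NormalDist().inv_cdf(p)  # allele frequency -> cutoff
        hap.append(1 if z < cutoff else 0)
        z = rho * z + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    return hap

def simulate_genotype(freqs, rho, rng):
    """Genotype vector = sum of two independent haplotypes (coded 0/1/2)."""
    h1 = simulate_haplotype(freqs, rho, rng)
    h2 = simulate_haplotype(freqs, rho, rng)
    return [a + b for a, b in zip(h1, h2)]
```

Dichotomizing a correlated Gaussian vector at frequency-matched cutoffs is a standard way to induce LD between neighboring markers while controlling each marker's allele frequency.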
Next we applied the wavelet-based test to three genes, CHI3L2, IL21R, and CTLA4, which have also been analyzed in other simulation studies [8, 10, 17]. We downloaded genotype data of 60 individuals (parents) from CEPH (Utah residents with ancestry from northern and western Europe) in the HapMap. In the following selections, we only chose SNPs with missing genotype rates below 5%.
In a 20 kb region around CHI3L2, we chose SNPs with minor allele frequencies greater than 0.26. This leads to 17 SNPs (rs755467, rs2255089, rs2494004, rs1325284, rs2251715, rs961364, rs2251608, rs2764543, rs2477574, rs2477578, rs942694, rs942693, rs2182114, rs5003369, rs3934922, rs3934923, and rs8535). During simulation studies, we randomly picked one SNP as the disease locus and assumed it was not observed. We sampled a genotype at the disease locus from the 60 individuals, and assigned its trait value (0 or 1) according to its genotype using a multiplicative disease model. A sample of 200 cases and 200 controls was then obtained. Next we generated genotypes of each case and control at the other 16 SNPs by sampling from individuals with the same genotype at the disease locus. Suppose we only observed genotype data at these 16 loci. If the disease SNP were among the observed SNPs, single locus methods should have higher power than multilocus methods. The proposed test is not intended to replace single locus methods in that situation. We only claim that the new test has higher power when no SNP in the window has a strong association with the disease by itself, in which case single locus methods may fail to detect signals. In a region of 20 kb around IL21R, we chose SNPs with minor allele frequencies greater than 0.25. This yields 21 SNPs, from which we chose the largest block, containing nine SNPs (rs179766, rs7203086, rs2040790, rs11074861, rs7199138, rs8057551, rs8061992, rs9930086, and rs8049804). A data set similar to the one above was generated from these nine SNPs.
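The case-control sampling under a multiplicative model can be sketched as below. This is a simplified sketch: it resamples whole individuals rather than first drawing the disease genotype and then drawing flanking genotypes conditional on it, as the paper describes, and the parameter names f0 (baseline penetrance) and r (per-allele risk ratio) are ours:

```python
import random

def sample_case_control(genotypes, disease_idx, f0, r, n_cases, n_controls, rng):
    """Resample multilocus genotype vectors (e.g. from the 60 HapMap
    parents) and assign affection status under a multiplicative model:
    penetrance f0 * r**k for k copies of the risk allele at disease_idx
    (choose f0 and r so that f0 * r**2 <= 1). The disease SNP is masked
    from the returned genotypes, mimicking an unobserved causal locus."""
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        g = rng.choice(genotypes)
        k = g[disease_idx]
        affected = rng.random() < f0 * (r ** k)
        masked = g[:disease_idx] + g[disease_idx + 1:]
        if affected and len(cases) < n_cases:
            cases.append(masked)
        elif not affected and len(controls) < n_controls:
            controls.append(masked)
    return cases, controls
```

Rejection sampling until both quotas are filled reproduces the retrospective case-control design: the genotype distribution among cases is tilted toward risk genotypes exactly as the penetrance model dictates.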
Positions of 12 sites on 12 chromosomes.
The new test can also be applied to quantitative traits. We compared the power and the type I error rates of T w with three commonly used tests T f , T b , and T p . The model used to generate trait values is y = μ(x 1 + βx 2) + e, where e is a standard normal random variable; x 1 = 1, 0, -1 and x 2 = 0, 1, 0 for genotypes DD, Dd, dd, respectively; and β = -1, 0, 1 for recessive, additive, and dominant models, respectively. We calculated μ from the heritability, which ranges from 4% to 10%.
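The quantitative-trait model above can be sketched directly; the helper for deriving μ from heritability assumes, as the model states, that the noise e has variance 1, and var_g denotes the variance of x 1 + βx 2 across genotypes (its value depends on the allele frequency and disease model):

```python
import math
import random

def mu_from_heritability(h2, var_g):
    """Solve h2 = mu^2 * var_g / (mu^2 * var_g + 1) for mu, where var_g
    is the variance of x1 + beta * x2 and the noise e ~ N(0, 1)."""
    return math.sqrt(h2 / ((1.0 - h2) * var_g))

def trait_value(genotype, mu, beta, rng):
    """y = mu * (x1 + beta * x2) + e for one individual; genotype is the
    count of allele D: 2 = DD, 1 = Dd, 0 = dd."""
    x1 = {2: 1, 1: 0, 0: -1}[genotype]
    x2 = {2: 0, 1: 1, 0: 0}[genotype]
    return mu * (x1 + beta * x2) + rng.gauss(0.0, 1.0)
```

With beta = 1 (dominant model), DD and Dd receive the same genetic value mu, while dd receives -mu, matching the coding in the text.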
PCReg  is described as follows. Let G be the matrix of genotypes, and let y be the vector of trait values. Suppose y and the columns of G are all centered such that their means are zeros. Let the columns of A be the first several characteristic vectors of G T G such that they can explain more than 80% of the total variation in G. Let G 1 = GA be the projections of the genotype data onto these principal directions. The regression model is y = G 1 b 1 + ϵ, and T p is the regression F -statistic. We used permutation to calculate the empirical p-value of T p .
In Table 2, the results show that the wavelet-based test T w has the correct type I error rates.
[Table: Type I error rates for tests with qualitative traits, under LD structures A1 through A6.]
Several remarks about the above simulations are in order. First, the causal locus is randomly chosen among all SNPs around a gene, so it is free to vary across simulations. Second, the causal locus is not observed in any simulation, because the intent of the proposed test is to detect small to moderate associations in a window of SNPs when none of them has strong LD with the disease. We expect that if the causal locus were observed, the power of single locus tests would increase more than that of multilocus tests. Third, the window sizes in our simulations are eight and 16. When the window size is eight, we can recode genotypes to obtain as many positive pairwise correlations as possible; for larger window sizes this is not feasible because of the computational burden. When genotypes are recoded, it is done for all individuals regardless of their phenotypes.
[Table: Type I error rates for tests with quantitative traits, under LD structures A1 through A6.]
Simulation studies show that the proposed test achieves higher power than other commonly used tests. The improved power comes from three sources. The first is the wavelet transformation of the genotype data. The wavelet transform is designed to deal with unsmooth signals with jumps and small wiggles; genetic data are neither smooth nor periodic, so they are naturally handled by the wavelet transform. Using the transformed data instead of the original data enables us to view the signals at different frequencies and at different resolution levels separately. The second is the choice of thresholds in the wavelet transformation. The wavelet transformation decomposes data into coefficients corresponding to different frequencies and resolution levels. It is generally believed that a low frequency signal is more likely to be a true signal than a high frequency one, and that a true signal is more sparse at a fine resolution level than at a coarser level. Suppressing wavelet coefficients at different frequencies and resolution levels in different ways increases the effectiveness of the noise suppression, which means that the data can be represented using fewer wavelet coefficients. Empirical Bayesian thresholding automatically decides, from the data, how much noise to suppress at each level. A wavelet transform with empirical Bayesian thresholding gives the proposed test its ability to adapt to LD structures: it suppresses more if many SNPs under consideration are unrelated to the disease, and less if most SNPs are in fact associated with the disease.
The third reason for the improvement in power comes from taking the relative positions of SNPs on a chromosome into consideration. An important difference between wavelet-based tests and PCReg is that PCReg does not consider the relative positions of SNPs: it views a multilocus genotype as a vector. The wavelet-based test treats a multilocus genotype as a discretized function instead of a vector. The difference is that viewing the genotype as a function allows us to take the order of SNPs on a chromosome into account. The importance of the ordering of SNPs can be illustrated by the following simple example. Suppose in a GWAS one finds two locations. At one location, the multilocus genotype 1#1#1# appears frequently among cases; at another, ###111 appears often among cases, where # represents noise and 1 represents a heterozygous genotype. The question is which location is more likely to be a true signal and which is more likely to be noise. For PCReg, the two locations have the same importance: projecting onto the first, third, and fifth dimensions is the same as projecting onto the fourth, fifth, and sixth dimensions. For a wavelet-based test, however, they are different. A wavelet-based test considers multilocus genotypes as discretized functions; ###111 represents a low frequency function, while 1#1#1# represents a high frequency function. All else being equal, a low frequency function is more likely to be a true signal, and a high frequency function is more likely to be noise. It is worth noting that for tests based on haplotype sharing, ###111 is also more important than 1#1#1#, because the former increases the shared length of haplotypes. In the Appendix, we prove that many genotype-based tests and some haplotype-based tests do not utilize the information contained in the spatial order of SNPs.
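The 1#1#1# versus ###111 example can be made concrete with the finest-level Haar detail coefficients. As illustrative assumptions, the patterns are padded to length 8 and # is coded as 0:

```python
import math

def finest_haar_details(x):
    """Finest-level orthonormal Haar detail coefficients:
    (x[i] - x[i+1]) / sqrt(2) for each adjacent pair."""
    return [(x[i] - x[i + 1]) / math.sqrt(2) for i in range(0, len(x), 2)]

low_freq  = [0, 0, 0, 0, 1, 1, 1, 1]   # like ###111: a single smooth step
high_freq = [1, 0, 1, 0, 1, 0, 1, 0]   # like 1#1#1#: alternating pattern
```

The smooth step pattern has all finest-level details equal to zero, so thresholding leaves it intact, while the alternating pattern puts all of its energy at the finest level, exactly where coefficients are most aggressively suppressed as noise.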
We propose a score test based on a wavelet transformation. The goal is to increase power by suppressing noise and thereby reducing the number of degrees of freedom. The adaptability of the empirical Bayesian thresholding gives the test the ability to automatically suppress the right amount of noise, which is shown by simulation studies using HapMap data. Whether or not the window contains SNPs related to the disease, the proposed test always has the highest power compared with the single marker test with Bonferroni correction and the likelihood-ratio test based on logistic regression. This shows the effectiveness of the noise suppression by the proposed test. The second advantage of the proposed test is that it takes the order of SNPs on a chromosome into consideration. In this sense, it is a compromise between genotype-based tests and haplotype-based tests: since it considers the order of SNPs, it uses more information than other genotype-based methods, while avoiding the need to infer unobserved haplotypes or their frequencies as in haplotype-based tests. Simulation studies show that the proposed test consistently has higher power than PCReg. The proposed test and PCReg both suppress data to reduce the number of degrees of freedom and increase power; the major difference between them is that the proposed test takes the order of SNPs into consideration while PCReg does not. The difference between their powers shows that considering the order of SNPs does increase the power of the tests. The proposed test and the test based on the Fourier transform have similar powers when all SNPs in a window are related to the disease, and the wavelet-based test has higher power when some of the SNPs are not related to the disease. This demonstrates the advantage of the wavelet transform over the Fourier transform: since genotype data are neither smooth nor periodic, they are naturally better handled by the wavelet transform.
Although the proposed test has many advantages, it is certainly not universally better than other tests. For example, if a window contains a SNP strongly associated with the disease, a single locus method with Bonferroni correction should be better than any multilocus method, including the new test. In a GWAS, researchers usually apply single locus methods first and report significant findings if there are any. Only after that initial step are multilocus methods used to identify information missed by the single locus methods. The proposed test should be used with this in mind.
If population stratification is a concern, we suggest applying EIGENSTRAT to the data to obtain several large principal components. These principal components are used to adjust genotypes and phenotypes, as suggested by Price et al. The wavelet-based test is then calculated using the adjusted genotypes and phenotypes. The genomic inflation factor λ GC can be used as a criterion to determine whether population structure is present. The new test has been successfully applied to a GWAS of the North American Rheumatoid Arthritis Consortium data from Genetic Analysis Workshop 16.
wavelet-based score test, http://www.math.mtu.edu/~rjiang/preprint/wavelet.score.test.R
In the appendix we prove that T p , the Hotelling's T 2 test, T b , and T l are not affected by permuting SNPs. Let G = (g_ij) be an n × m genotype matrix, where g_ij is the genotype of the ith individual at the jth marker. Subtract the mean of the jth column from g_ij so that the mean of each column of G is 0. The sample covariance matrix of the genotypes is A = G^T G/(n - 1). Suppose that λ_1 ≥ λ_2 ≥ ... ≥ λ_m are the eigenvalues of A, and v_1, v_2, ..., v_m are the corresponding eigenvectors. Let D be a diagonal matrix with λ_1, λ_2, ..., λ_m as diagonal entries, and let V be an m × m matrix with v_1, v_2, ..., v_m as columns. Then A = VDV^(-1). Note that v_i is the ith principal component, i = 1, 2, ..., m. Write V = [V_1 V_2], where V_1 contains the principal components that are used, and V_2 contains those that are discarded. In PCReg, the regression model is y = GV_1 b + ϵ.
Suppose the spatial order of SNPs is permuted; then the columns of G are also permuted accordingly. Suppose the new genotype matrix is G̃; then G̃ = GP, where P is obtained by applying the same permutation to the columns of the identity matrix. Assume that G̃ is also centered so that the mean of each column is 0. Let Ã = G̃^T G̃/(n - 1). Since P is an orthogonal matrix, P^T = P^(-1). Therefore, Ã = P^T G^T GP/(n - 1) = P^T VDV^(-1) P = (P^T V)D(P^T V)^(-1). Note that the diagonal matrix of the eigenvalues of Ã is still D, and the matrix of eigenvectors of Ã is P^T V. Since A and Ã have the same eigenvalues, the used principal components are the columns of P^T V_1 and the discarded principal components are the columns of P^T V_2. In PCReg, the regression model after permutation is y = G̃(P^T V_1)b + ϵ = GPP^T V_1 b + ϵ = GV_1 b + ϵ, which is the same as before permuting SNPs.
The same arguments can be applied to prove that the haplotype T 2 statistic is not affected by permuting SNPs either. It has been proved that the multilocus T 2 and the haplotype T 2 statistics have the same power. Usually a haplotype-based test has a higher, or at least a different, power than a genotype-based test. It is an interesting fact that the multilocus T 2 and the haplotype T 2 have the same power, and that neither uses the information contained in the spatial order of SNPs.
Recall that T b is obtained by fitting a regression function with one SNP, followed by Bonferroni correction to find the global p-value. Permuting SNPs does not change its results.
The likelihood-ratio test based on logistic regression T l is also not affected by permuting SNPs.
This research was partially supported by a National Institutes of Health grant GM069940-01A2.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.