Volume 6 Supplement 1
Genetic Analysis Workshop 14: Microsatellite and single-nucleotide polymorphism
Selection of single-nucleotide polymorphisms in disease association data
 Jungnam Joo^{1},
 Xin Tian^{1},
 Gang Zheng^{1},
 Jing-Ping Lin^{1} and
 Nancy L Geller^{1}
DOI: 10.1186/1471-2156-6-S1-S93
© Joo et al; licensee BioMed Central Ltd 2005
Published: 30 December 2005
Abstract
We studied several methods for selecting single-nucleotide polymorphisms (SNPs) in a disease association study. The two major analytical strategies are the univariate and the set selection approaches. The univariate approach evaluates each SNP marker one at a time, while the set selection approach tests the disease association of a set of SNP markers simultaneously. We examined various test statistics that can be used to test disease association and reviewed several multiple testing procedures that can properly control the family-wise error rate when the univariate approach is applied to multiple markers. The set association methods were then briefly reviewed. Finally, we applied these methods to the data from the Collaborative Study on the Genetics of Alcoholism (COGA).
Background
Due to the abundance and utility of single-nucleotide polymorphism (SNP) markers in the fine-mapping of complex traits, a growing amount of current genetic research focuses on the analysis of SNP data. Such analyses typically involve association, in which differences in allele or genotype frequencies of SNPs near or within candidate genes between affected and unaffected individuals are tested. To localize disease susceptibility genes (loci), thousands of SNPs are usually investigated, and the main question is how to identify disease-associated SNP markers among a large pool.
A simple and commonly used approach is to evaluate one SNP at a time. In this analytical strategy, each SNP is tested with an appropriate testing procedure, such as Pearson's chi-square test or the Cochran-Armitage (CA) trend test, and those SNP markers with a significant disease association are identified. Current technology, however, can genotype on the order of 100,000 SNPs at a time. Even when a preliminary genome scan, such as linkage analysis, restricts the chromosomal region and thereby reduces the number of SNPs for investigation, a large number of SNPs are often tested simultaneously. Therefore, investigators are at great risk of false-positive findings. Various methods for marker selection with consideration of multiple comparisons are available. Dudoit et al. [1, 2] summarized a number of procedures that control different type I error rates, such as the family-wise error rate (FWER) and the false discovery rate (FDR) [3].
For a complex trait, however, several markers, each with a rather small effect, might act together to contribute to disease susceptibility. In this case, marker-by-marker approaches often fail to find significance. Recently, several investigators incorporated the multigenic nature of complex traits in selecting SNPs for association [4, 5]. One promising approach has been proposed by Hoh et al. [6], which performs a simultaneous significance test on a set of possibly interacting SNP markers while controlling the genome-wide significance level via permutation procedures.
In this study, we describe different strategies for selecting SNPs in a disease association study and apply them to the Collaborative Study on the Genetics of Alcoholism (COGA) data.
Methods
Measures for disease association
Allelic association and Hardy-Weinberg disequilibrium
To measure the extent of the association for a given SNP, Hoh et al. [6] proposed a statistic that combines several sources of information, such as allelic association (AA) and Hardy-Weinberg disequilibrium (HWD). In a 2 × 2 table with rows corresponding to cases and controls, and columns corresponding to SNP alleles, the χ^{2} statistic can be used as a measure of AA. HWD can also be computed using the χ^{2} statistic for deviation from Hardy-Weinberg equilibrium based on the affected individuals only. Let a_{i} and u_{i} be the AA statistic and the HWD statistic for association of the i^{th} SNP, respectively. The product of these two statistics, a_{i} × u_{i}, is used to measure the combined effects of AA and HWD for association. We denote this test statistic as AA × HWD. Hoh et al. [6] used trimming for markers with extremely high values of HWD. They first find the number d of largest HWD values (for example, those above the 99^{th} percentile of the χ^{2} distribution) based on control individuals, and these d HWD values are set to zero in further analysis.
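As a rough illustration of how the AA × HWD statistic could be computed for a single SNP, the Python sketch below derives the allelic χ^{2} from the 2 × 2 allele table and the HWD χ^{2} from the genotype counts of the cases, then multiplies the two. The function names and the genotype counts are hypothetical, no continuity correction is applied, and the trimming step of Hoh et al. [6] is omitted.

```python
def allelic_chi2(case_geno, control_geno):
    """Pearson chi-square (no continuity correction) on the 2 x 2 allele table.

    Each argument is a tuple of genotype counts (n_AA, n_Aa, n_aa).
    """
    a = 2 * case_geno[0] + case_geno[1]        # A alleles in cases
    b = case_geno[1] + 2 * case_geno[2]        # a alleles in cases
    c = 2 * control_geno[0] + control_geno[1]  # A alleles in controls
    d = control_geno[1] + 2 * control_geno[2]  # a alleles in controls
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a monomorphic SNP carries no association signal
    return (a + b + c + d) * (a * d - b * c) ** 2 / denom


def hwd_chi2(geno):
    """Chi-square for deviation from Hardy-Weinberg proportions in one group."""
    n = sum(geno)
    p = (2 * geno[0] + geno[1]) / (2 * n)  # estimated frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    return sum((o - e) ** 2 / e for o, e in zip(geno, expected) if e > 0)


def aa_times_hwd(case_geno, control_geno):
    """Product a_i x u_i, with HWD computed from the cases only."""
    return allelic_chi2(case_geno, control_geno) * hwd_chi2(case_geno)


# Hypothetical genotype counts, for illustration only.
cases = (50, 0, 50)
controls = (0, 0, 100)
print(aa_times_hwd(cases, controls))
```

Because the product is zero whenever either component is zero, a SNP in perfect Hardy-Weinberg proportions among the cases contributes nothing under this statistic, however strong its allelic association.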
Robust linear trend tests: MERT and MAX
Two robust tests, the maximin efficiency robust test (MERT) and the maximal test (MAX), are useful in detecting disease-associated markers when the underlying genetic model is unknown. Suppose we have a family of optimal test statistics {Z_{i}: i ∈ Λ}, where Λ = {1, 2, ..., k} indexes the k underlying models. For example, using the CA trend test, Z_{x}, x = 0, 1/2, 1, are optimal test statistics for the recessive, additive, and dominant models, respectively [7]. Assume that under the null hypothesis of no disease association, each Z_{i} asymptotically follows a standard normal distribution and that their correlation matrix is given by {ρ_{i,j}}. Closed forms of the test statistics and correlations for the CA trend test in case-control studies can be found in Freidlin et al. [8]. From Gastwirth [9], the MERT can be written as a linear combination of the two tests with the minimum correlation. Suppose that the minimum correlation, ρ_{0}, is reached at the two tests Z_{i_{1}} and Z_{i_{2}}, i_{1}, i_{2} ∈ Λ. Then the MERT is the linear combination of this extreme pair given by
Z_{MERT} = (Z_{i_{1}} + Z_{i_{2}}) / [2(1 + ρ_{0})]^{1/2},
which asymptotically follows a standard normal distribution under the null hypothesis.
When the minimum correlation ρ_{0} is small, MERT may not be powerful. Freidlin et al. [10] suggested the use of a maximal statistic (MAX) when ρ_{0} < 0.50 and showed that MAX and MERT have similar power when ρ_{0} ≥ 0.75. Several versions of MAX tests are possible, but here we focus on Z_{MAX} = max(Z_{i_{1}}, Z_{MERT}, Z_{i_{2}}) for a one-sided test and Z_{MAX} = max(|Z_{i_{1}}|, |Z_{MERT}|, |Z_{i_{2}}|) for a two-sided test.
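The following Python sketch shows one way the trend statistics, MERT, and the two-sided MAX might be computed from a 2 × 3 genotype table. It rests on several assumptions that are not stated in the text: the CA trend statistic is written in its usual form with scores (0, x, 1), the extreme pair is taken to be the recessive (x = 0) and dominant (x = 1) tests, and their null correlation is derived from the pooled genotype frequencies rather than copied from the closed forms in [8]. All function names and counts are hypothetical, and every genotype is assumed to be observed so that the variances are positive.

```python
import math


def ca_trend_z(cases, controls, x):
    """CA trend statistic with scores (0, x, 1).

    cases and controls hold genotype counts for 0, 1 and 2 copies of the
    putative risk allele.
    """
    scores = (0.0, x, 1.0)
    R, S = sum(cases), sum(controls)
    N = R + S
    totals = [r + s for r, s in zip(cases, controls)]
    u = sum(w * (S * r - R * s) for w, r, s in zip(scores, cases, controls))
    var = R * S * (N * sum(w * w * n for w, n in zip(scores, totals))
                   - sum(w * n for w, n in zip(scores, totals)) ** 2)
    return math.sqrt(N) * u / math.sqrt(var)


def null_corr_rec_dom(cases, controls):
    """Null correlation between the x = 0 and x = 1 trend statistics.

    Derived from the pooled genotype frequencies p0 and p2 (an assumption
    of this sketch).
    """
    totals = [r + s for r, s in zip(cases, controls)]
    N = sum(totals)
    p0, p2 = totals[0] / N, totals[2] / N
    return math.sqrt(p0 * p2 / ((1 - p0) * (1 - p2)))


def mert_and_max(cases, controls):
    """Return (Z_MERT, two-sided Z_MAX) for the recessive/dominant pair."""
    z_rec = ca_trend_z(cases, controls, 0.0)
    z_dom = ca_trend_z(cases, controls, 1.0)
    rho0 = null_corr_rec_dom(cases, controls)
    z_mert = (z_rec + z_dom) / math.sqrt(2 * (1 + rho0))
    z_max = max(abs(z_rec), abs(z_mert), abs(z_dom))
    return z_mert, z_max


# Hypothetical genotype counts, for illustration only.
cases = (20, 50, 30)
controls = (40, 45, 15)
print(mert_and_max(cases, controls))
```

With x = 1 the squared trend statistic coincides with the Pearson χ^{2} of the 2 × 2 table obtained by pooling carriers of the risk allele, which gives a convenient check on the formula.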
Multiple testing
Dudoit et al. [1] provided multiple testing procedures which strongly control the FWER for gene expression data and which are directly applicable to disease association data with multiple markers. The Bonferroni single-step adjusted p-value is a well-known procedure for dealing with multiple testing. While it is easy to calculate, this method is extremely conservative. An improvement in power can be achieved by stepwise procedures such as Holm's procedure. To take into account the dependence structure between test statistics, Westfall and Young's [11] step-down minP or step-down maxT adjusted p-values are useful. Since the joint distribution of the test statistics is usually unknown, resampling methods can be used to estimate these adjusted p-values.
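A minimal Python sketch of these adjustments is given below. The Bonferroni and Holm functions follow the standard definitions; the step-down maxT function is our reading of the resampling algorithm described by Dudoit et al. [1] and assumes that the permuted statistics have already been computed, with larger values indicating stronger evidence. The function names and the small permutation matrix are hypothetical; the four unadjusted p-values are the χ^{2} p-values reported for the four SNPs in the Results.

```python
def bonferroni(pvals):
    """Single-step Bonferroni adjusted p-values."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]


def holm(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted, running = [0.0] * m, 0.0
    for rank, i in enumerate(order):
        # Multiply by the number of hypotheses still in play, then keep
        # the adjusted values monotone along the ordering.
        running = max(running, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running
    return adjusted


def stepdown_max_t(observed, permuted):
    """Step-down maxT adjusted p-values from precomputed permutations.

    observed: list of m statistics; permuted: list of B lists of m
    statistics, one list per permutation of the case/control labels.
    """
    m, B = len(observed), len(permuted)
    order = sorted(range(m), key=lambda i: -observed[i])
    counts = [0] * m
    for row in permuted:
        running = float("-inf")
        for j in reversed(range(m)):
            # Successive maxima over the less significant hypotheses.
            running = max(running, row[order[j]])
            if running >= observed[order[j]]:
                counts[j] += 1
    adjusted, previous = [0.0] * m, 0.0
    for j in range(m):
        previous = max(previous, counts[j] / B)  # enforce monotonicity
        adjusted[order[j]] = previous
    return adjusted


chi2_p = [0.010, 0.734, 0.734, 0.086]  # rs1037475, rs1491233, rs749407, rs980972
print([round(p, 3) for p in bonferroni(chi2_p)])
print([round(p, 3) for p in holm(chi2_p)])
```

Holm's adjustment never exceeds the Bonferroni adjustment, which is why it can only gain power. The maxT version additionally borrows the dependence between markers from the permutations.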
Set association approach
Hoh et al. [6] provided a method that tests the disease association of a set of markers instead of testing each SNP separately. In their method, the sum of test statistics over a suitable set of markers is first formed to combine the evidence for association. Permutation procedures are then used to evaluate the p-value associated with each sum and the overall type I error. The following summarizes the set association approach of Hoh et al. [6].
1) Order test statistics t_{ i }, i = 1, ..., m, so that t_{(1)} ≥ t_{(2)} ≥ ... ≥ t_{(m)}.
2) For a fixed N ≤ m, take sums with an increasing number of terms, starting with the most significant markers, such that S(n = 1) = t_{(1)}, S(n = 2) = t_{(1)} + t_{(2)}, ..., S(n = N) = t_{(1)} + ... + t_{(N)}.
3) Generate the permutation samples from the original sample (permuting labels of cases and controls) under the null hypothesis of no association and evaluate the p-value of each sum. Take the minimum p-value (minP).
4) Generate other permutation samples from the original sample under the null hypothesis of no association. To obtain the minP corresponding to each permutation sample, repeat the above 3 steps by regarding each permutation sample as the original.
5) Evaluate the overall significance level of (minP).
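The five steps above can be sketched in Python as a nested permutation procedure. This is only an illustration under stated assumptions: the per-SNP statistic is the allelic χ^{2} (any of the statistics described earlier could be substituted), genotypes are coded as 0, 1, or 2 copies of one allele, labels are 1 for cases and 0 for controls, and permutation p-values use the (count + 1)/(B + 1) convention. The function names, the toy data, and the small numbers of permutations are hypothetical.

```python
import random


def allelic_chi2(genotypes, labels):
    """Allelic 2 x 2 chi-square for one SNP (genotypes coded 0/1/2)."""
    n_case = sum(labels)
    a = sum(g for g, y in zip(genotypes, labels) if y == 1)
    c = sum(g for g, y in zip(genotypes, labels) if y == 0)
    b = 2 * n_case - a
    d = 2 * (len(labels) - n_case) - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else (a + b + c + d) * (a * d - b * c) ** 2 / denom


def ordered_sums(stats, n_max):
    """Steps 1-2: sums S(1), ..., S(n_max) of the largest statistics."""
    sums, total = [], 0.0
    for t in sorted(stats, reverse=True)[:n_max]:
        total += t
        sums.append(total)
    return sums


def min_p(snps, labels, n_max, n_perm, rng):
    """Step 3: permutation p-value of each sum; return (minP, best n)."""
    obs = ordered_sums([allelic_chi2(g, labels) for g in snps], n_max)
    exceed = [0] * len(obs)
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        perm = ordered_sums([allelic_chi2(g, shuffled) for g in snps], n_max)
        for k in range(len(obs)):
            if perm[k] >= obs[k]:
                exceed[k] += 1
    pvals = [(e + 1) / (n_perm + 1) for e in exceed]
    best = min(range(len(pvals)), key=lambda k: pvals[k])
    return pvals[best], best + 1


def set_association(snps, labels, n_max, n_perm, n_outer, seed=0):
    """Steps 4-5: overall significance of the observed minP."""
    rng = random.Random(seed)
    obs_min_p, n_best = min_p(snps, labels, n_max, n_perm, rng)
    hits = 0
    shuffled = list(labels)
    for _ in range(n_outer):
        rng.shuffle(shuffled)
        # Treat the permuted sample as if it were the original data.
        perm_min_p, _best = min_p(snps, shuffled, n_max, n_perm, rng)
        if perm_min_p <= obs_min_p:
            hits += 1
    return obs_min_p, n_best, (hits + 1) / (n_outer + 1)


# Toy data: 3 SNPs (rows) genotyped in 6 cases followed by 6 controls.
labels = [1] * 6 + [0] * 6
snps = [
    [2, 2, 1, 2, 1, 2, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 2, 0, 1, 1, 0, 2, 1, 0, 1],
    [0, 1, 1, 0, 2, 1, 1, 1, 0, 2, 0, 1],
]
print(set_association(snps, labels, n_max=3, n_perm=19, n_outer=19))
```

The nested loop costs n_outer × n_perm evaluations of every marker, which is why the procedure becomes computationally demanding as the number of SNPs grows.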
Study subjects and genetic markers
The COGA data provide alcoholism diagnosis on 1,614 individuals from 143 families. We focus on two categories of the alcoholism diagnosis (aldx1), "affected" as a case and "purely unaffected" as a control, and we used all 609 cases and 261 controls whose SNP data were available. From the preliminary genome scan by linkage analysis (Lin and Wu [12]), one candidate gene cluster, alcohol dehydrogenase, on chromosome 4 was identified. Alcohol dehydrogenase catalyzes the rate-determining reaction in ethanol metabolism. Genetic studies of diverse ethnic groups have firmly demonstrated significant allelic associations between alcohol dehydrogenase genes and alcoholism. Therefore, we restrict our analysis to SNPs located near this gene cluster. Because the SNPs are evenly distributed across the entire genome rather than densely genotyped near any particular gene, we found only two SNPs (rs749407, rs980972) within the cluster, and we selected two additional SNPs (rs1037475, rs1491233) flanking the cluster on each side from the Illumina SNP data.
Results
Results from the univariate methods
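| SNP | Row | χ^{2} | AA × HWD | Z^{2}_{MERT} | Z^{2}_{MAX} |
|---|---|---|---|---|---|
| rs1037475 | Test statistic | 9.299 | 19.685 | 4.234 | 8.842 |
| rs1037475 | p-value^{1} | 0.010 | 0.002 | 0.040 | 0.007 |
| rs1037475 | p-value^{2}, Bon^{3} | 0.040 | 0.008 | 0.160 | 0.028 |
| rs1037475 | p-value^{2}, Holm | 0.040 | 0.008 | 0.160 | 0.028 |
| rs1037475 | p-value^{2}, wy^{4} | 0.037 | 0.076 | 0.141 | 0.025 |
| rs1491233 | Test statistic | 0.620 | 1.301 | 0.380 | 0.616 |
| rs1491233 | p-value^{1} | 0.734 | 0.587 | 0.537 | 0.674 |
| rs1491233 | p-value^{2}, Bon^{3} | 1.000 | 1.000 | 1.000 | 1.000 |
| rs1491233 | p-value^{2}, Holm | 1.000 | 0.742 | 0.888 | 1.000 |
| rs1491233 | p-value^{2}, wy^{4} | 0.920 | 0.680 | 0.542 | 0.886 |
| rs749407 | Test statistic | 0.619 | 0.929 | 0.586 | 0.586 |
| rs749407 | p-value^{1} | 0.734 | 0.371 | 0.444 | 0.684 |
| rs749407 | p-value^{2}, Bon^{3} | 1.000 | 1.000 | 1.000 | 1.000 |
| rs749407 | p-value^{2}, Holm | 1.000 | 0.742 | 0.888 | 1.000 |
| rs749407 | p-value^{2}, wy^{4} | 0.730 | 0.526 | 0.670 | 0.689 |
| rs980972 | Test statistic | 4.900 | 5.244 | 3.848 | 4.820 |
| rs980972 | p-value^{1} | 0.086 | 0.136 | 0.050 | 0.060 |
| rs980972 | p-value^{2}, Bon^{3} | 0.344 | 0.544 | 0.200 | 0.240 |
| rs980972 | p-value^{2}, Holm | 0.258 | 0.408 | 0.160 | 0.180 |
| rs980972 | p-value^{2}, wy^{4} | 0.234 | 0.224 | 0.135 | 0.156 |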
We carried out an additional analysis on a total of 8 SNPs in the nearest area, including the above four SNPs. Using the univariate method with Bonferroni and Holm's methods, only AA × HWD found rs1037475 to be significant. None of the methods found significant markers based on Westfall and Young's method. In the set association approach, the smallest p-values were reached at S(n = 1) using χ^{2} and AA × HWD, and at S(n = 2) using MERT and MAX, where S(n = 1) corresponds to rs1037475 and S(n = 2) is the sum of rs1037475 and rs980972. The overall significance levels of these smallest p-values were 0.094, 0.022, 0.226, and 0.074, respectively. Again, only AA × HWD reached overall significance at α = 0.05. When we included more SNPs in the analysis (a total of 28), none of the methods found significant markers. By adding SNPs which may not be in linkage disequilibrium with the mutation, the method became extremely conservative.
Conclusion
In this paper, we studied different strategies to select disease-associated SNP markers when multiple markers are tested. Various test statistics can be used to measure the degree of individual association, and using these statistics, the univariate approach combined with an appropriate correction for multiple testing can identify significant markers. However, if several markers act together to contribute to the susceptibility to the disease, the set association approach may be useful. In the application to the COGA data, we observed different results using the univariate and set association approaches; that is, a SNP marker with a rather negligible effect under the univariate approach is picked up by the set association approach. An added advantage of the set association methods is their ability to detect interacting loci, though we do not investigate that property here. For a rigorous comparison of the performance of the different approaches, further investigation with simulated data would be necessary.
We used only four SNPs in our analysis. In principle, these procedures can also be applied to testing thousands of SNPs, as in a genome-wide association study. However, for a very large number of SNPs, these procedures can be extremely conservative and computationally intensive. As we include more SNPs in the analysis, the methods tend to become very conservative and fail to find any significance. Reducing the number of tests by restricting the areas of investigation is one common approach to the multiple testing problem in genome-wide association studies, and the methods described here may be optimal with the reduced data. To take full advantage of the abundant information from a genome-wide SNP map, alternative approaches such as a method for controlling the FDR and a sequential type of analysis [13] are possible.
The choice of test statistic has a great impact on the testing results. The CA trend test is usually preferable to the χ^{2} test [14, 15], and two robust tests, MERT and MAX, provide protection against model misspecification [7, 8]. AA × HWD [6] showed quite consistent results across the different numbers of SNPs in the analysis. However, its performance was unstable in the permutation procedure. The properties of these test statistics under a variety of genetic models may need further investigation.
The case-control dataset used in this study is a family dataset in which cases and controls could be biologically correlated. The positive correlation between family members inflates the variance of the test statistics. Therefore, if this factor is not considered, inflated type I error rates may result. In one of our studies using the same dataset [16], we applied the method of Slager and Schaid [17] with modification, in which the correlations of related individuals are incorporated into the CA trend test. While adjusting for the correlations is desirable, we found that the variance inflation is rather minor, and thus in this study we ignored family structure. The test statistics which incorporate the correlations between family members can also be used in the univariate and set association approaches described in this study.
Abbreviations
AA: Allelic association
CA: Cochran-Armitage
COGA: Collaborative Study on the Genetics of Alcoholism
FDR: False discovery rate
FWER: Family-wise error rate
HWD: Hardy-Weinberg disequilibrium
MAX: Maximal test
MERT: Maximin efficiency robust test
SNP: Single-nucleotide polymorphism
References
1. Dudoit S, Yang YW, Callow MJ, Speed TP: Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat Sinica. 2002, 12: 111-139.
2. Dudoit S, Shaffer JP, Boldrick JC: Multiple hypothesis testing in microarray experiments. Stat Sci. 2003, 18: 71-103. 10.1214/ss/1056397487.
3. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B Met. 1995, 57: 289-300.
4. Stoesz MR, Cohen JC, Marcovina S, Guerra R: Extension of the Haseman-Elston method to multiple alleles and multiple loci: theory and practice for candidate genes. Ann Hum Genet. 1997, 61: 263-274. 10.1017/S0003480097006179.
5. Nelson MR, Kardia SLR, Ferrell RE, Sing CF: A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Res. 2001, 11: 458-470. 10.1101/gr.172901.
6. Hoh J, Wille A, Ott J: Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res. 2001, 11: 2115-2119. 10.1101/gr.204001.
7. Zheng G, Freidlin B, Li Z, Gastwirth JL: Choice of scores in trend tests for case-control studies of candidate-gene associations. Biometrical J. 2003, 45: 335-348. 10.1002/bimj.200390016.
8. Freidlin B, Zheng G, Li Z, Gastwirth JL: Trend tests for case-control studies of genetic markers: power, sample size and robustness. Hum Hered. 2002, 53: 146-152. 10.1159/000064976.
9. Gastwirth JL: The use of maximin efficiency robust tests in combining contingency tables and survival analysis. J Am Stat Assoc. 1985, 80: 380-384. 10.2307/2287901.
10. Freidlin B, Podgor MJ, Gastwirth JL: Efficiency robust tests for survival or ordered categorical data. Biometrics. 1999, 55: 883-886. 10.1111/j.0006-341X.1999.00264.x.
11. Westfall PH, Young SS: Resampling-based Multiple Testing. 1993, New York: John Wiley & Sons
12. Lin JP, Wu C: Bivariate genome scans incorporating factor and principal component analyses to identify common genetic components of alcoholism, event-related potential, and electroencephalogram phenotypes. BMC Genet. 2005, 6 (Suppl 1): S114. 10.1186/1471-2156-6-S1-S114.
13. Province M: A single, sequential, genome-wide test to identify simultaneously all promising areas in a linkage scan. Genet Epidemiol. 2000, 19: 301-322. 10.1002/1098-2272(200012)19:4<301::AID-GEPI3>3.0.CO;2-G.
14. Sasieni PD: From genotypes to genes: doubling the sample size. Biometrics. 1997, 53: 1253-1261. 10.2307/2533494.
15. Slager SL, Schaid DJ: Case-control studies of genetic markers: power and sample size approximations for Armitage's test for trend. Hum Hered. 2001, 52: 149-153. 10.1159/000053370.
16. Tian X, Joo J, Zheng G, Lin JP: Robust trend tests for association in case-control studies using family data. BMC Genet. 2005, 6 (Suppl 1): S107. 10.1186/1471-2156-6-S1-S107.
17. Slager SL, Schaid DJ: Evaluation of candidate genes in case-control studies: a statistical method to account for related subjects. Am J Hum Genet. 2001, 68: 1457-1462. 10.1086/320608.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.