A powerful hybrid approach to select top single-nucleotide polymorphisms for genome-wide association study
© Wang and Shete; licensee BioMed Central Ltd. 2011
Received: 9 July 2010
Accepted: 6 January 2011
Published: 6 January 2011
Genome-wide association (GWA) study has recently become a powerful approach for detecting genetic variants for common diseases without prior knowledge of the variant's location or function. Generally, in GWA studies, the most significant single-nucleotide polymorphisms (SNPs) associated with top-ranked p values are selected in stage one, with follow-up in stage two. The value of selecting SNPs based on statistically significant p values is obvious. However, when minor allele frequencies (MAFs) are relatively low, less-significant p values can still correspond to higher odds ratios (ORs), which might be more useful for prediction of disease status. Therefore, if SNPs are selected using an approach based only on significant p values, some important genetic variants might be missed. We proposed a hybrid approach for selecting candidate SNPs from the discovery stage of GWA study, based on both p values and ORs, and conducted a simulation study to demonstrate the performance of our approach.
The simulation results showed that our hybrid ranking approach was more powerful than the existing ranked p value approach for identifying relatively less-common SNPs. Meanwhile, the type I error probabilities of the hybrid approach are well controlled at the end of the second stage of the two-stage GWA study.
In GWA studies, SNPs should be considered for inclusion based not only on ranked p values but also on ranked ORs.
Genome-wide association (GWA) study has recently become a powerful approach for detecting genetic variants for common diseases without prior knowledge of the variant's location or function [1–4]. Currently, almost all the GWA studies are conducted in two stages: certain numbers (10-50) of the most significant single-nucleotide polymorphisms (SNPs) associated with top-ranked p values are selected in stage one, and follow-up is performed in stage two.
This two-stage approach has been widely used and has successfully identified novel susceptibility SNPs for different complex diseases such as lung [5–7], prostate [8–11], and breast cancers [12–15]; glioma [16, 17]; and type 2 diabetes [18–24]. The published GWA studies show that different numbers of the most significant SNPs were selected at stage one for follow-up. For example, in the GWA study of lung cancer, 10 top-ranked SNPs were selected in stage one, whereas in the GWA study of type 2 diabetes, 59 SNPs were selected in stage one. While the p value can indicate when the association of an SNP with a disease is statistically significant, it does not consider the associated odds ratio (OR). The rationale for using OR as a selection criterion is that when minor allele frequencies (MAFs) of causal SNPs are relatively low, much less-significant p values would be observed, even if they could correspond to higher ORs, which might be more useful for prediction of disease status; nevertheless, the less significant p value may limit the inclusion of these important variants in a replication study. Therefore, selecting a certain number of significant SNPs based only on p values might overlook some important genetic variants that could have an even greater effect on disease causation.
In this paper, we propose a hybrid approach for selecting candidate SNPs from stage one of a GWA study that is based on both ranked p values and ranked ORs. In simulation studies, the power comparison showed that our hybrid ranking approach was more powerful than the existing ranked p value approach for identifying relatively less-common SNPs. We performed additional simulation studies to investigate the type I error probabilities of the proposed hybrid approach and found that the type I errors are well controlled at the end of the second stage of the two-stage GWA study.
For the hybrid approach proposed in this paper, ORs are considered in addition to p values when selecting SNPs in stage one. For stage one, we selected a set of SNPs: half were selected based on ranked p values, and the other half were selected based on ranked ORs.
Table: Parameters for Simulation under Alternative Hypothesis (columns: Disease Locus 1, Disease Locus 2)
In stage one, we first simulated genotypes for the disease-causing loci D1 and D2 for each individual. Given the simulated genotypes at D1 and D2, we randomly generated a disease status for each individual using the logistic model. Then, conditioned on the data for D1, genotypes for marker locus M1 were simulated using an r² value of 0.8. Similarly, genotypes for marker M2 were simulated conditioned on the data for D2 with the same r² value. Because the markers M1 and M2 were in high LD with the disease loci D1 and D2, respectively, they were also associated with the simulated disease. In this way, we simulated a large amount of data for the population of interest and then randomly sampled 1,000 cases and 1,000 normal controls from this population. In this study, unless otherwise specified, we employed logistic regression to obtain ORs and used Wald's test to assess significance.
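The case-control sampling step described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes Hardy-Weinberg genotype frequencies and an additive logistic model, it draws individuals one at a time until the case and control quotas are filled (which gives the same sampling distribution as drawing at random from a large simulated population), and it omits the simulation of the marker loci in LD with D1 and D2. The function names and the coefficient values used in the usage note are hypothetical.

```python
import math
import random

def simulate_genotype(maf, rng):
    """Draw a genotype (0, 1 or 2 minor alleles) under Hardy-Weinberg equilibrium (assumed)."""
    return (rng.random() < maf) + (rng.random() < maf)

def simulate_status(g1, g2, beta0, beta1, beta2, rng):
    """Draw disease status (1 = affected) from an additive logistic model (assumed form)."""
    logit = beta0 + beta1 * g1 + beta2 * g2
    p = 1.0 / (1.0 + math.exp(-logit))
    return int(rng.random() < p)

def sample_case_control(n_cases, n_controls, maf1, maf2, beta0, beta1, beta2, seed=0):
    """Generate individuals until the requested numbers of cases and controls are filled."""
    rng = random.Random(seed)
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        g1 = simulate_genotype(maf1, rng)
        g2 = simulate_genotype(maf2, rng)
        if simulate_status(g1, g2, beta0, beta1, beta2, rng):
            if len(cases) < n_cases:
                cases.append((g1, g2))
        elif len(controls) < n_controls:
            controls.append((g1, g2))
    return cases, controls
```

For example, `sample_case_control(1000, 1000, 0.1, 0.4, -3.0, math.log(2.0), math.log(1.3))` would return 1,000 case and 1,000 control genotype pairs; the intercept and the two log odds ratios here are placeholders chosen only for illustration.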
To investigate the power for each SNP to be selected in stage one and followed up in stage two, we generated 1,000,000 replicates of a single SNP under the null hypothesis of no association between the SNP and the disease, each replicate with 1,000 cases and 1,000 controls. To mimic realistic patterns of LD, we applied a forward-time population simulation software program (genomeSIMLA) to generate the 1,000,000 unassociated SNPs. We used information about markers on human chromosome 2, including rs numbers, allele frequencies, recombination fractions, and positions, to seed the initial population. The initial population was then advanced through 1,000 generations of mating to create a pool of individuals. The case and control status for individuals was assigned based on a penetrance function assuming one disease-causal SNP. We then randomly permuted the case-control statuses of individuals to break the association between the SNP and the disease of interest. Therefore, all the simulated SNPs were unassociated with the disease of interest. The p values and ORs of all simulated markers were evaluated using PLINK. For the purpose of our simulation, we considered markers with a range of MAFs (10% to 40%) that covered the MAFs defined for the disease loci.
The p values and ORs obtained under the null hypothesis were used to determine the thresholds for selecting SNPs in stage one. We studied selections of 10, 20, 30, and 40 SNPs in stage one for follow-up in stage two. Specifically, we ranked the 1,000,000 SNPs with respect to p values as well as OR values. If investigators decided to select the top 20 SNPs in stage one using the existing ranked p-value approach, the 20th-ranked p value out of the 1,000,000 p values would be the threshold for selection, and the SNPs in stage one with more significant p values than the threshold would be forwarded to stage two. Using our proposed hybrid approach, two thresholds for selection were considered.
One threshold was the top 10th-ranked p value, and the other was the top 10th-ranked OR value. In order to take into account potential overlapping of the sets of top 10 p-value-based SNPs and top 10 OR-based SNPs, we would first pick SNPs based on p value threshold and then pick additional SNPs based on the OR threshold. The SNPs we selected in stage one either had more significant p values than the p value threshold or had larger OR values than the OR threshold. The reason for selecting 10 OR-based SNPs and 10 p-value-based SNPs was to have the same number (20) of total top SNPs carried over to stage two.
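The threshold-based selection described above can be sketched as follows. This is an illustrative reading of the procedure rather than the authors' implementation: the function names are hypothetical, the ORs are assumed to be already expressed on a common risk scale, and the null statistics are assumed to come from SNPs simulated under no association.

```python
def kth_ranked(values, k, largest):
    """Return the k-th ranked value: k-th largest if `largest`, else k-th smallest."""
    return sorted(values, reverse=largest)[k - 1]

def hybrid_select(snps, null_pvalues, null_ors, n_select):
    """Select SNPs whose p value or OR beats a null-derived threshold.

    snps: list of (name, p_value, odds_ratio) tuples from stage one.
    null_pvalues, null_ors: statistics of SNPs simulated under the null.
    n_select: intended number of follow-up SNPs, split equally between criteria.
    """
    half = n_select // 2
    p_threshold = kth_ranked(null_pvalues, half, largest=False)
    or_threshold = kth_ranked(null_ors, half, largest=True)
    # First pass: SNPs more significant than the p-value threshold.
    selected = [name for name, p, _ in snps if p < p_threshold]
    chosen = set(selected)
    # Second pass: add SNPs that exceed the OR threshold and are not yet chosen.
    for name, _, odds_ratio in snps:
        if odds_ratio > or_threshold and name not in chosen:
            selected.append(name)
            chosen.add(name)
    return selected
```

Picking by p value first and then adding OR-based SNPs means that a SNP satisfying both criteria is counted once, which is how the overlap between the two sets is handled in this sketch.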
Further, we investigated what percentage of the SNPs could reach a genome-wide threshold for declaring significance. For each replicate selected in stage one, we independently simulated another 1,000 cases and 1,000 controls using the same simulation approach and the same parameters. We then employed a joint analysis using fixed indicators for stages one and two, as joint analysis of two stages is efficient and always results in increased power to detect genetic variants. For each pair of corresponding replicates from stages one and two, data from both stages were pooled into one data set. We assumed that the data from different stages possibly came from different sites. Therefore, to control for possible confounding effects of site, we included a stage indicator coded as 1 for stage one and 2 for stage two. Multivariable logistic regression analysis was applied to the SNP and the stage indicator, and significance was assessed using Wald's test. All results were based on 1,000 replicates.
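A joint analysis of this kind can be sketched with a small self-contained logistic regression. The code below is an illustration under stated assumptions, not the authors' analysis code: it fits the model by Newton-Raphson using only the standard library, codes the genotype additively (0, 1, 2), uses the stage indicator values 1 and 2 as in the text, and computes a two-sided Wald p value from the normal distribution. All function names are hypothetical.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_logistic(rows, y, iterations=25):
    """Fit logistic regression by Newton-Raphson.

    Returns the coefficients and the information matrix from the last iteration.
    """
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for x, yi in zip(rows, y):
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (yi - p) * x[i]
                for j in range(k):
                    info[i][j] += w * x[i] * x[j]
        beta = [b + s for b, s in zip(beta, solve(info, grad))]
    return beta, info

def wald_test(rows, y, index):
    """Return (odds ratio, two-sided Wald p value) for the coefficient at `index`."""
    beta, info = fit_logistic(rows, y)
    unit = [1.0 if i == index else 0.0 for i in range(len(beta))]
    variance = solve(info, unit)[index]  # diagonal entry of the inverse information
    z = beta[index] / math.sqrt(variance)
    return math.exp(beta[index]), math.erfc(abs(z) / math.sqrt(2.0))

def joint_analysis(stage1, stage2):
    """Pool (genotype, status) pairs from both stages and test the SNP effect.

    The design matrix has an intercept, the genotype, and the stage indicator (1 or 2).
    """
    rows, y = [], []
    for stage, data in ((1, stage1), (2, stage2)):
        for genotype, status in data:
            rows.append([1.0, float(genotype), float(stage)])
            y.append(status)
    return wald_test(rows, y, index=1)
```

In practice a statistics package would normally be used for the fit; the hand-written solver here only keeps the example free of external dependencies. The returned p value would then be compared against the genome-wide significance threshold.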
We performed additional simulations to examine the type I error probabilities of the proposed hybrid approach under the null hypothesis of no association between SNPs and disease. We assumed that there were two marker loci, M1 and M2, that were not associated with the disease of interest, with MAFs similar to those used in the power studies. We applied the same software, genomeSIMLA, to generate the unassociated SNPs. We initialized the population with small ranges of MAFs. We simulated 10,000,000 replicates, each with 1,000 cases and 1,000 controls. The case-control status was assigned at random, independent of the markers, so that the markers were unassociated with disease status. As in the power comparison studies, the same sets of p values and ORs of 1,000,000 replicates of a single SNP under the null hypothesis were employed to determine the thresholds for selecting SNPs in stage one for follow-up. Similarly, for each replicate selected in stage one, we independently simulated another replicate using the same simulation approach and the same parameters and performed a joint analysis using data from both stages.
Table: Medians of Significance and Odds Ratios in Stage One Based on 1,000 Replicates (significance reported as -log10 of p value)
Table: Power to Select Top Single-Nucleotide Polymorphisms in Stage One for Replication Based on 1,000 Replicates (columns: Number of Selections, p-Value-Based Ranking Approach, Hybrid Ranking Approach)
Table: Median Odds Ratios and Powers in Joint Analysis (Stage One and Stage Two Combined) (columns: Number of Selections, p-Value-Based Ranking Approach, Hybrid Ranking Approach; power reported at 1.7 × 10⁻⁷)
As expected, all of the p values from the joint analysis were much more significant than those from stage one. We used a genome-wide threshold p value of 1.7 × 10⁻⁷ for declaring the significance of the SNPs. Our proposed hybrid approach for GWA study gained considerable power for marker M1 over that of the existing ranked p-value approach. For instance, when the number of selections in stage one was 20, the observed powers were 56.3% and 90.9% for selecting markers M1 and M2, respectively, using the existing ranked p-value approach; when our proposed hybrid approach was used, however, the observed powers were 80.0% and 87.0% for selecting M1 and M2, respectively. Even when 40 SNPs were selected for follow-up, we still observed a 20.6% increase in power for selecting marker M1 with loss of about 2% power for selecting marker M2.
Table: Power to Select Both SNPs (M1 and M2) in Stage One and in Joint Analysis (Stage One and Stage Two Combined) (columns: Number of Selections; Stage One and Joint Analysis (power at 1.7 × 10⁻⁷), each for the p-Value-Based Ranking Approach and the Hybrid Ranking Approach)
Type I error estimate
Based on 10,000,000 replicates for each marker locus, we first estimated the percentages of replicates of unassociated SNPs selected for follow-up in stage two. If 10 SNPs were selected in stage one, using the hybrid approach, marker M1 would be selected for follow-up in 4,365 of 10,000,000 replicates, and marker M2 would be selected in 58 of 10,000,000 replicates. When the standard ranked p-value approach was applied, marker M1 would be selected for follow-up in 143 of 10,000,000 replicates and marker M2 would be selected in 111 of 10,000,000 replicates. If 20, 30, and 40 SNPs were selected in stage one, using the hybrid approach, marker M1 would be selected for follow-up in 5,823, 6,736, and 7,803 replicates, respectively, and marker M2 would be selected in 111, 227, and 250 replicates, respectively. Using the standard ranked p-value-based approach, marker M1 would be selected for follow-up in 292, 401, and 511 replicates, respectively, and marker M2 would be selected in 250, 362, and 468 replicates, respectively. Importantly, however, after performing the joint analysis, we found that none of the replicates that had been moved forward to the second stage satisfied the genome-wide threshold p value of 1.7 × 10⁻⁷ for declaring significance, using either our approach or the standard approach. If we relaxed the threshold p value for declaring significance to 10⁻⁵, one replicate of marker M1 would be significant on the basis of both the hybrid and ranked p-value approaches. Therefore, we can conclude that the type I error rates of the proposed hybrid approach were well controlled at the end of the experiment (stage two).
In this paper, we proposed a hybrid approach to rank and select SNPs in stage one of a GWA study, in which ORs are considered in addition to p values. The results from the simulation study show that the hybrid approach has increased power for identifying the less-common genetic variants. Meanwhile, we concluded through simulation studies that the type I error rates of the hybrid approach are controlled at the end of the experiment.
In our study, we selected half of the candidate SNPs on the basis of ranked p values and the other half on the basis of ranked ORs. In practice, one could select the same or different numbers of candidate SNPs based on ranked p values and ranked ORs. For example, one could select the ranked-p-value-based SNPs using the common GWA study threshold (e.g., p value < 10⁻⁵) but select the ranked-OR-based SNPs using a less significant threshold (e.g., p value < 10⁻³) for follow-up in stage two. It should be noted that when ranking SNPs based on ORs, both risk and protective effect SNPs should be considered. In our study, we did not have knowledge about whether an extremely large or an extremely small effect size is more important; therefore, we reversed ORs corresponding to protective effect SNPs before ranking all the ORs (for ranking purposes only). In our simulation studies, we considered a range of MAFs of 10% to 40%. A GWA study with 1,000 cases and 1,000 controls has adequate power to detect SNPs with an MAF of 10% or higher. To detect variants with smaller MAFs, such as less than 5%, a larger sample size will be needed. The impact of the true OR, average LD, and MAF on the power of a GWA study to detect susceptibility SNP markers is discussed in Park et al. These authors also provide sample sizes required for GWA studies to identify associations.
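The treatment of protective-effect SNPs when ranking by OR can be illustrated with a short sketch. It assumes that "reversing" an OR means taking its reciprocal, so that an OR of 0.4 is ranked as 2.5; this interpretation and the function names are assumptions made for illustration.

```python
def fold_odds_ratio(odds_ratio):
    """Map a protective OR (< 1) to its reciprocal so all effects share one scale."""
    return 1.0 / odds_ratio if odds_ratio < 1.0 else odds_ratio

def rank_by_effect(snps):
    """Return SNP names ordered from largest to smallest folded OR.

    snps: list of (name, odds_ratio) pairs with odds_ratio > 0.
    """
    ordered = sorted(snps, key=lambda s: fold_odds_ratio(s[1]), reverse=True)
    return [name for name, _ in ordered]
```

The folded value is used for ranking only; the original OR would still be reported for each selected SNP.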
One may argue that using the ranked p values within the rare variants and ranked p values within the common variants to select candidate SNPs would lead to similar results. The obvious deficiency of this approach is the difficulty in defining a boundary for separating rare and common variants. Furthermore, even if one could define a threshold for rare and common variants (say 10%), among the set of rare (or common) variants, there will still be relatively rare variants and relatively common variants. Therefore, when selecting top SNPs based only on the ranked p values within rare variants, one may still be more likely to select the SNPs that are relatively common in the rare variants set but miss the SNPs that are relatively rare in the rare variants set. For example, if SNPs with MAFs < 10% are defined as rare variants and p values are ranked within the rare variants set, one might not capture a SNP with MAF = 3% and OR = 2.5 in stage one (but might instead select a SNP with MAF = 9% and OR = 1.5, for example), even though the OR of 2.5 represents a very strong association between this SNP and the disease of interest. We applied such an approach to our simulated data, where we defined SNPs with MAF = 10% as rare variants and SNPs with MAF = 40% as common variants. The powers for selecting both rare and common variants did not increase, however, and were almost identical to those obtained using the standard ranked-p-value approach (data not shown).
Currently, ranked p value is used as a criterion for selecting common variants, but from Figures 1, 2, and 3, we can conclude that ranked OR should be used as an additional criterion for selecting less-common variants. These two approaches are complementary, and therefore, a hybrid approach using both ranked p value and ranked OR should be more powerful for selecting rare variants. It should be noted that the type I error is controlled in the joint analysis because the SNPs selected in stage one with the use of our hybrid ranking approach are not final, and they have to meet the GWA significance threshold (1.7 × 10⁻⁷) in the joint analysis. The results from the simulation studies confirmed this statement.
We proposed a hybrid approach for selecting candidate SNPs from the discovery stage of GWA study, based on both p values and ORs, and conducted a simulation study to demonstrate the performance of our approach. The power comparison results show that our hybrid ranking approach is more powerful than the existing ranked p value approach for identifying relatively less-common SNPs. Therefore, GWA studies should consider including SNPs based not only on ranked p values but also on ranked ORs. Furthermore, with the rapid development of sequencing techniques, much denser SNP chips with more low-MAF SNPs may be available in the near future. With these improved technologies, our hybrid ranking approach for selecting top SNPs offers a promising direction for future research in GWA studies.
where Γ(·) is the gamma function. To draw the graphs of the expected test statistics and the corresponding p values, we used σ = 4.6 and n = 2000 and assumed an additive genetic model. Therefore, in the above equation, the degrees of freedom are υ = n - 2 = 1998, and the non-centrality parameter is defined in terms of ρ, the MAF, and OR, the odds ratio.
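The description above (gamma function, degrees of freedom, non-centrality parameter) is consistent with the standard expression for the mean of a noncentral t variable, E[T] = δ · sqrt(υ/2) · Γ((υ - 1)/2) / Γ(υ/2) for υ > 1. Assuming that this is the expectation referred to, the sketch below evaluates it numerically. The non-centrality parameter δ is passed in as an input because its formula in terms of the MAF and OR is not reproduced in this text; the function name is hypothetical.

```python
import math

def expected_noncentral_t(delta, df):
    """Mean of a noncentral t variable with noncentrality `delta` and `df` > 1.

    Uses log-gamma to avoid overflow when df is large (e.g. df = 1998).
    """
    log_ratio = math.lgamma((df - 1) / 2.0) - math.lgamma(df / 2.0)
    return delta * math.sqrt(df / 2.0) * math.exp(log_ratio)
```

With υ = 1998 the multiplying factor is very close to 1, so the expected test statistic is only slightly larger than the non-centrality parameter itself.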
We thank the two anonymous reviewers for their insightful and constructive comments. This work was supported by the National Institutes of Health [grant number 1R01CA131324].
- Marchini J, Donnelly P, Cardon LR: Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet. 2005, 37: 413-417. 10.1038/ng1537.
- Manolio TA, Rodriguez LL, Brooks L, Abecasis G, Ballinger D, Daly M, Donnelly P, Faraone SV, Frazer K, Gabriel S: New models of collaboration in genome-wide association studies: the Genetic Association Information Network. Nat Genet. 2007, 39: 1045-1051. 10.1038/ng2127.
- Pearson TA, Manolio TA: How to interpret a genome-wide association study. JAMA. 2008, 299: 1335-1344. 10.1001/jama.299.11.1335.
- Manolio TA, Brooks LD, Collins FS: A HapMap harvest of insights into the genetics of common disease. J Clin Invest. 2008, 118: 1590-1605. 10.1172/JCI34772.
- Amos CI, Wu X, Broderick P, Gorlov IP, Gu J, Eisen T, Dong Q, Zhang Q, Gu X, Vijayakrishnan J: Genome-wide association scan of tag SNPs identifies a susceptibility locus for lung cancer at 15q25.1. Nat Genet. 2008, 40: 616-622. 10.1038/ng.109.
- Hung RJ, McKay JD, Gaborieau V, Boffetta P, Hashibe M, Zaridze D, Mukeria A, Szeszenia-Dabrowska N, Lissowska J, Rudnai P: A susceptibility locus for lung cancer maps to nicotinic acetylcholine receptor subunit genes on 15q25. Nature. 2008, 452: 633-637. 10.1038/nature06885.
- Thorgeirsson TE, Geller F, Sulem P, Rafnar T, Wiste A, Magnusson KP, Manolescu A, Thorleifsson G, Stefansson H, Ingason A: A variant associated with nicotine dependence, lung cancer and peripheral arterial disease. Nature. 2008, 452: 638-642. 10.1038/nature06846.
- Eeles RA, Kote-Jarai Z, Giles GG, Olama AA, Guy M, Jugurnauth SK, Mulholland S, Leongamornlert DA, Edwards SM, Morrison J: Multiple newly identified loci associated with prostate cancer susceptibility. Nat Genet. 2008, 40: 316-321. 10.1038/ng.90.
- Gudmundsson J, Sulem P, Manolescu A, Amundadottir LT, Gudbjartsson D, Helgason A, Rafnar T, Bergthorsson JT, Agnarsson BA, Baker A: Genome-wide association study identifies a second prostate cancer susceptibility variant at 8q24. Nat Genet. 2007, 39: 631-637. 10.1038/ng1999.
- Thomas G, Jacobs KB, Yeager M, Kraft P, Wacholder S, Orr N, Yu K, Chatterjee N, Welch R, Hutchinson A: Multiple loci identified in a genome-wide association study of prostate cancer. Nat Genet. 2008, 40: 310-315. 10.1038/ng.91.
- Yeager M, Orr N, Hayes RB, Jacobs KB, Kraft P, Wacholder S, Minichiello MJ, Fearnhead P, Yu K, Chatterjee N: Genome-wide association study of prostate cancer identifies a second risk locus at 8q24. Nat Genet. 2007, 39: 645-649. 10.1038/ng2022.
- Stacey SN, Manolescu A, Sulem P, Thorlacius S, Gudjonsson SA, Jonsson GF, Jakobsdottir M, Bergthorsson JT, Gudmundsson J, Aben KK: Common variants on chromosome 5p12 confer susceptibility to estrogen receptor-positive breast cancer. Nat Genet. 2008, 40: 703-706. 10.1038/ng.131.
- Hunter DJ, Kraft P, Jacobs KB, Cox DG, Yeager M, Hankinson SE, Wacholder S, Wang Z, Welch R, Hutchinson A: A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat Genet. 2007, 39: 870-874. 10.1038/ng2075.
- Easton DF, Pooley KA, Dunning AM, Pharoah PD, Thompson D, Ballinger DG, Struewing JP, Morrison J, Field H, Luben R: Genome-wide association study identifies novel breast cancer susceptibility loci. Nature. 2007, 447: 1087-1093. 10.1038/nature05887.
- Gold B, Kirchhoff T, Stefanov S, Lautenberger J, Viale A, Garber J, Friedman E, Narod S, Olshen AB, Gregersen P: Genome-wide association study provides evidence for a breast cancer risk locus at 6q22.33. Proc Natl Acad Sci USA. 2008, 105: 4340-4345. 10.1073/pnas.0800441105.
- Shete S, Hosking FJ, Robertson LB, Dobbins SE, Sanson M, Malmer B, Simon M, Marie Y, Boisselier B, Delattre JY: Genome-wide association study identifies five susceptibility loci for glioma. Nat Genet. 2009, 41: 899-904. 10.1038/ng.407.
- Wrensch M, Jenkins RB, Chang JS, Yeh RF, Xiao Y, Decker PA, Ballman KV, Berger M, Buckner JC, Chang S: Variants in the CDKN2B and RTEL1 regions are associated with high-grade glioma susceptibility. Nat Genet. 2009, 41: 905-908. 10.1038/ng.408.
- Zeggini E, Weedon MN, Lindgren CM, Frayling TM, Elliott KS, Lango H, Timpson NJ, Perry JR, Rayner NW, Freathy RM: Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. Science. 2007, 316: 1336-1341. 10.1126/science.1142364.
- Scott LJ, Mohlke KL, Bonnycastle LL, Willer CJ, Li Y, Duren WL, Erdos MR, Stringham HM, Chines PS, Jackson AU: A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science. 2007, 316: 1341-1345. 10.1126/science.1142382.
- Saxena R, Voight BF, Lyssenko V, Burtt NP, de Bakker PI, Chen H, Roix JJ, Kathiresan S, Hirschhorn JN, Daly MJ: Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science. 2007, 316: 1331-1336. 10.1126/science.1142358.
- Sladek R, Rocheleau G, Rung J, Dina C, Shen L, Serre D, Boutin P, Vincent D, Belisle A, Hadjadj S: A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature. 2007, 445: 881-885. 10.1038/nature05616.
- Yasuda K, Miyake K, Horikawa Y, Hara K, Osawa H, Furuta H, Hirota Y, Mori H, Jonsson A, Sato Y: Variants in KCNQ1 are associated with susceptibility to type 2 diabetes mellitus. Nat Genet. 2008, 40: 1092-1097. 10.1038/ng.207.
- Unoki H, Takahashi A, Kawaguchi T, Hara K, Horikoshi M, Andersen G, Ng DP, Holmkvist J, Borch-Johnsen K, Jorgensen T: SNPs in KCNQ1 are associated with susceptibility to type 2 diabetes in East Asian and European populations. Nat Genet. 2008, 40: 1098-1102. 10.1038/ng.208.
- Steinthorsdottir V, Thorleifsson G, Reynisdottir I, Benediktsson R, Jonsdottir T, Walters GB, Styrkarsdottir U, Gretarsdottir S, Emilsson V, Ghosh S: A variant in CDKAL1 influences insulin response and risk of type 2 diabetes. Nat Genet. 2007, 39: 770-775. 10.1038/ng2043.
- Dudek SM, Motsinger AA, Velez DR, Williams SM, Ritchie MD: Data simulation software for whole-genome association and other studies in human genetics. Pac Symp Biocomput. 2006, 499-510.
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007, 81: 559-575. 10.1086/519795.
- Skol AD, Scott LJ, Abecasis GR, Boehnke M: Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat Genet. 2006, 38: 209-213. 10.1038/ng1706.
- Park JH, Wacholder S, Gail MH, Peters U, Jacobs KB, Chanock SJ, Chatterjee N: Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nat Genet. 2010, 42: 570-575. 10.1038/ng.610.
- Bickel PJ, Doksum KA: Mathematical Statistics: Basic Ideas and Selected Topics. 1977, Oakland, California: Holden-Day, Incorporated
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.