Volume 6 Supplement 1
Genetic Analysis Workshop 14: Microsatellite and single-nucleotide polymorphism
Comparison of type I error for multiple test corrections in large single-nucleotide polymorphism studies using principal components versus haplotype blocking algorithms
 Kristin K Nicodemus^{1},
 Wenlei Liu^{2},
 Gary A Chase^{2},
 Ya-Yu Tsai^{3} and
 M Daniele Fallin^{1}
DOI: 10.1186/1471-2156-6-S1-S78
© Nicodemus et al; licensee BioMed Central Ltd 2005
Published: 30 December 2005
Abstract
Although permutation testing has been the gold standard for assessing significance levels in studies using multiple markers, it is time-consuming. A Bonferroni correction to the nominal p-value that uses the underlying pairwise linkage disequilibrium (LD) structure among the markers to determine the number of effectively independent tests has recently been proposed. We propose using the number of independent LD blocks plus the number of independent single-nucleotide polymorphisms for correction. Using the Collaborative Study on the Genetics of Alcoholism LD data for chromosome 21, we simulated 1,000 replicates of parent-child trio data under the null hypothesis with two levels of LD: moderate and high. Assuming haplotype blocks were independent, we calculated the number of independent statistical tests using 3 haplotype blocking algorithms. We then compared the type I error rates using a principal components (PC)-based method, the 3 blocking methods, a traditional Bonferroni correction, and the unadjusted p-values obtained from FBAT. Under high LD conditions, the PC method and one of the blocking methods were slightly conservative, whereas the 2 other blocking methods exceeded the target type I error rate. Under conditions of moderate LD, we show that the blocking algorithm corrections are closest to the desired type I error, although still slightly conservative, with the PC-based method being almost as conservative as the traditional Bonferroni correction.
Background
A major controversy exists in determining significance levels for candidate gene or genome-wide association scans using single-nucleotide polymorphism (SNP) data. Regardless of whether each SNP is analyzed one at a time or as part of a haplotype, the number of individual tests can become very large and can lead to an inflated type I error rate. Bonferroni correction is not an appropriate solution, given the correlation between tests in most SNP settings. Instead, permutation testing has been the gold standard for determining the significance level for SNP genome scans and candidate gene studies; however, it is computationally intensive and time-consuming. Recently, a simpler method to determine the significance level for SNP association studies has been proposed that relies on the linkage disequilibrium (LD) structure of the genome to determine the number of independent tests [1]. This method applies principal components (PC) analysis to pairwise LD measures to determine the number of independent tests and uses this number as the denominator in a Bonferroni correction to the unadjusted p-values. However, using the pairwise LD measures between SNPs does not explicitly take into account the haplotype block structure of the human genome. We propose to use the number of LD blocks defined for a set of SNPs plus the number of singleton (not block-related) SNPs as the appropriate number for multiple test correction. However, the choice of block definition is also an issue of considerable controversy. A recent paper compared 3 measures of haplotype blocks, including an LD-based method described by Gabriel et al. [2], a recombination-based method developed by Hudson and Kaplan [3], and a diversity-based method proposed by Patil et al. [4], and found low levels of agreement between them: the number of haplotype blocks and the haplotype block boundaries differed greatly across methods [5].
Therefore, the number of independent tests determined across these algorithms may differ widely from one another and from the number determined by PC analysis.
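The PC-based count of independent tests can be illustrated with a short sketch. It assumes the commonly cited form of the estimate in [1], M_eff = 1 + (M - 1)(1 - Var(lambda)/M), where lambda are the eigenvalues of the M x M matrix of pairwise LD correlations among the SNPs. The function name and the use of a trace identity in place of an explicit eigendecomposition are illustrative choices, not a description of the original software.

```python
def nyholt_meff(corr):
    """Effective number of independent tests from a correlation matrix.

    Assumes `corr` is a symmetric M x M matrix (list of lists) with a unit
    diagonal, e.g. pairwise LD correlations between SNPs.
    """
    m = len(corr)
    if m < 2:
        return float(m)
    # For a symmetric matrix, the sum of squared eigenvalues equals the sum
    # of squared entries; with a unit diagonal the eigenvalues average 1.
    sum_sq = sum(x * x for row in corr for x in row)
    # Sample variance of the eigenvalues (denominator m - 1, mean 1).
    var_lambda = (sum_sq - m) / (m - 1)
    return 1.0 + (m - 1) * (1.0 - var_lambda / m)


# Two limiting cases: fully independent SNPs and fully redundant SNPs.
independent = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
redundant = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
print(nyholt_meff(independent))  # 3.0
print(nyholt_meff(redundant))    # 1.0
```

The two limiting cases show the intended behavior: uncorrelated SNPs count as M tests, and perfectly correlated SNPs collapse to a single test.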
We proposed to obtain the number of independent tests using 3 haplotype blocking algorithms: the method described by Gabriel et al. [2] (Gabriel), the 4-gamete test [3] (4GT), and the solid spine of LD measure (SSLD), as implemented in the program HAPLOVIEW [6], and using the number of components derived from PC analysis [1] to use for a Bonferroni-type correction. We also considered the traditional Bonferroni correction (assuming all tests are independent) and the unadjusted p-value type I error rate.
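The block-based count amounts to simple bookkeeping, sketched below. The representation of blocks as inclusive index ranges, the function names, and the example block boundaries are all hypothetical; they are not HAPLOVIEW output.

```python
def effective_tests_from_blocks(n_snps, blocks):
    """Number of LD blocks plus the number of SNPs outside every block.

    `blocks` is a list of (first, last) SNP index pairs, inclusive and
    non-overlapping, as a blocking algorithm might report them.
    """
    in_block = set()
    for first, last in blocks:
        in_block.update(range(first, last + 1))
    singletons = n_snps - len(in_block)
    return len(blocks) + singletons


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold for a Bonferroni-type correction."""
    return alpha / n_tests


# Hypothetical example: 20 SNPs with blocks covering SNPs 0-4, 7-9 and 12-18.
m_eff = effective_tests_from_blocks(20, [(0, 4), (7, 9), (12, 18)])
print(m_eff)                              # 3 blocks + 5 singletons = 8
print(bonferroni_threshold(0.05, m_eff))  # 0.00625
```

With 20 SNPs, a traditional Bonferroni correction would use 0.05/20 = 0.0025 per test, so the block-based threshold in this example is less stringent.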
The blocking method of Gabriel et al. [2] defines an LD block as a contiguous set of SNPs in which 95% of pairwise D' confidence interval (CI) values are considered to be in strong LD (minimum upper CI bound = 0.98; minimum lower CI bound = 0.70). The 4-gamete rule of Hudson and Kaplan [3] relies on historical recombination events to determine haplotype blocks. For each contiguous pair of SNPs, the frequency of the observed 2-SNP haplotypes is assessed; if at least 1 haplotype is observed with a frequency of less than 1%, then that SNP is added to the block. A block is terminated when a recombination event is assumed to have taken place; that is, when all 4 possible 2-SNP haplotypes are observed with a frequency of greater than 1%. The SSLD method creates blocks of SNPs that have contiguous pairwise D' values of greater than 0.8.
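The 4-gamete rule can be sketched on phased haplotypes as follows. This is a simplified greedy scan that compares each new SNP with every SNP already in the current block; it is an assumption made for illustration, and the HAPLOVIEW implementation may differ in its details. Alleles are coded 0/1 and all names are hypothetical.

```python
from collections import Counter


def four_gametes_observed(haplotypes, i, j, min_freq=0.01):
    """True if all 4 two-SNP haplotypes at SNPs i and j exceed `min_freq`."""
    counts = Counter((h[i], h[j]) for h in haplotypes)
    n = len(haplotypes)
    gametes = [(0, 0), (0, 1), (1, 0), (1, 1)]
    return all(counts[g] / n > min_freq for g in gametes)


def four_gamete_blocks(haplotypes, min_freq=0.01):
    """Partition SNPs into inclusive (first, last) ranges.

    A range is closed as soon as the next SNP shows all 4 gametes with any
    SNP in the current range. Single-SNP ranges correspond to singletons.
    """
    n_snps = len(haplotypes[0])
    blocks, start = [], 0
    for j in range(1, n_snps):
        if any(four_gametes_observed(haplotypes, i, j, min_freq)
               for i in range(start, j)):
            blocks.append((start, j - 1))
            start = j
    blocks.append((start, n_snps - 1))
    return blocks


# Toy data: SNPs 0 and 1 are perfectly correlated; SNP 2 is independent.
haps = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
print(four_gamete_blocks(haps))  # [(0, 1), (2, 2)]
```

In this toy example the scan yields one 2-SNP block and one singleton, so the number of effectively independent tests would be 2.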
Methods
We used the observed LD structure from the Collaborative Study on the Genetics of Alcoholism (COGA) chromosome 21 data generated by Affymetrix and Illumina. We chose regions of high and moderate LD, then simulated case-parent trio genotypes at random from these data to reflect a null hypothesis of no association for assessment of type I error rates. We did not consider a low LD condition because the SNPs would be independent of one another and a Bonferroni correction would be appropriate. For the moderate LD condition (average contiguous pairwise r^{2} = 0.30), we used 20 Affymetrix SNPs between tsc1273568 and tsc0064946. The high LD condition (average contiguous pairwise r^{2} = 0.46) was obtained by merging the chromosome 21 data from the Affymetrix and Illumina sets, using the UCSC Human Genome Map, Build 34. We selected 20 SNPs between tsc0897379 and tsc1650146 for the high LD condition.
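The summary statistic used to label the two conditions, the average r^{2} between contiguous SNP pairs, can be computed from phased haplotypes as in the sketch below. It uses the standard definition r^{2} = D^{2} / [p_A(1 - p_A) p_B(1 - p_B)] with D = p_AB - p_A p_B; the 0/1 allele coding, the function names, and the toy data are assumptions for illustration only.

```python
def r_squared(haplotypes, i, j):
    """Pairwise r^2 between biallelic SNPs i and j from phased 0/1 haplotypes."""
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n
    p_j = sum(h[j] for h in haplotypes) / n
    p_ij = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return 0.0  # monomorphic SNP: r^2 is undefined, reported as 0 here
    d = p_ij - p_i * p_j
    return d * d / denom


def mean_contiguous_r_squared(haplotypes):
    """Average r^2 over adjacent SNP pairs (0,1), (1,2), ..."""
    n_snps = len(haplotypes[0])
    values = [r_squared(haplotypes, k, k + 1) for k in range(n_snps - 1)]
    return sum(values) / len(values)


# Toy data (not COGA data): SNPs 0 and 1 identical, SNP 2 independent of SNP 1.
toy = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
print(r_squared(toy, 0, 1))            # 1.0
print(r_squared(toy, 1, 2))            # 0.0
print(mean_contiguous_r_squared(toy))  # 0.5
```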
Using PHASE version 2.1 [7, 8], we assigned chromosomes to founders from the original COGA datasets and then calculated the relative frequencies of the inferred haplotypes. These frequencies were used as weights in the sampling of founder chromosomes for each of the 1,000 replicates under each condition. For each founder, a pair of random numbers was generated, each corresponding to a particular haplotype. These haplotype pairs were then randomly paired to form parents. A single child was generated, giving equal weight to each of the 4 possible mating types, to create a total of 143 parent-child trios. All children were considered affected.
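The sampling scheme described above can be sketched as follows. This is a minimal illustration, not the simulation code used in the study: the haplotype strings, their weights, and the function name are hypothetical, and the child is assumed to inherit whole parental haplotypes without recombination inside the region.

```python
import random


def simulate_trios(haplotypes, weights, n_trios, seed=None):
    """Simulate parent-child trios under the null from haplotype frequencies.

    Each parent receives two haplotypes drawn with probability proportional
    to `weights`; the child inherits one haplotype from each parent, chosen
    with equal probability, so the 4 possible transmissions are equally likely.
    """
    rng = random.Random(seed)
    trios = []
    for _ in range(n_trios):
        father = tuple(rng.choices(haplotypes, weights=weights, k=2))
        mother = tuple(rng.choices(haplotypes, weights=weights, k=2))
        child = (rng.choice(father), rng.choice(mother))
        trios.append({"father": father, "mother": mother, "child": child})
    return trios


# Hypothetical 3-SNP haplotypes and relative frequencies (not COGA estimates).
example_haps = ["000", "011", "110"]
example_weights = [0.5, 0.3, 0.2]
trios = simulate_trios(example_haps, example_weights, n_trios=143, seed=1)
print(len(trios))  # 143
```

Because transmission is independent of affection status, every child can be labeled affected and the data still reflect the null hypothesis of no association.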
Association analysis of single-SNP data was performed using FBAT [9], a family-based transmission disequilibrium test. The number of independent tests per method was determined by assuming all SNPs were independent (traditional Bonferroni correction), by the method proposed by Nyholt [1], and by calculating the number of singleton SNPs plus the number of haplotype blocks defined by the 3 blocking algorithms. These values were used in Bonferroni corrections to the unadjusted p-values obtained using FBAT. We calculated the type I error rate as the proportion of simulated datasets in which at least one SNP appeared to be significant after adjustment. We calculated type I error rates for the unadjusted p-values, the Nyholt correction, a traditional Bonferroni correction, and the corrections based on the 3 blocking methods across the 2 levels of LD.
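The type I error calculation defined above can be sketched in a few lines. The function name and the p-values are hypothetical and serve only to show the computation; the number of effective tests is passed per replicate because block-based counts can differ between replicates.

```python
def type_i_error_rate(pvalues_by_replicate, n_tests_by_replicate, alpha=0.05):
    """Proportion of replicates with at least one SNP significant after a
    Bonferroni-type correction using each replicate's number of tests."""
    hits = 0
    for pvalues, n_tests in zip(pvalues_by_replicate, n_tests_by_replicate):
        threshold = alpha / n_tests
        if min(pvalues) < threshold:
            hits += 1
    return hits / len(pvalues_by_replicate)


# Hypothetical p-values for 4 replicates of 3 SNPs each (not study data).
pvals = [
    [0.20, 0.04, 0.60],
    [0.001, 0.30, 0.90],
    [0.50, 0.70, 0.30],
    [0.40, 0.02, 0.80],
]
print(type_i_error_rate(pvals, [1, 1, 1, 1]))  # unadjusted: 0.75
print(type_i_error_rate(pvals, [3, 3, 3, 3]))  # traditional Bonferroni: 0.25
print(type_i_error_rate(pvals, [2, 2, 2, 2]))  # 2 effective tests: 0.5
```

Setting the number of tests to 1 reproduces the unadjusted rate, and setting it to the number of SNPs reproduces the traditional Bonferroni correction.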
Results
Percent agreement on number of effective SNPs per replicate across haplotype blocking methods

                 Moderate LD                  High LD
                 4GT Method    SSLD Method    4GT Method    SSLD Method
Gabriel Method   1.6           23.6           0.0           0.20
4GT Method       -             7.0            -             31.1

Type I error rates across adjustment methods, for moderate and high LD conditions

Adjustment method   Moderate LD condition   High LD condition
Unadjusted          0.547                   0.530
Bonferroni          0.029                   0.018
Nyholt Method       0.029                   0.033
Gabriel Method      0.035                   0.033
4GT Method          0.036                   0.084
SSLD Method         0.035                   0.089

Of 1,000 high LD replicates, 33 retained at least one significant SNP after correction by the Gabriel method, for an experiment-wise type I error rate of 3.3%. The Nyholt PC correction resulted in an identical experiment-wise type I error rate of 3.3%. The 4GT and the SSLD methods were liberal, giving type I error rates of 8.4% and 8.9%, respectively. As expected, Bonferroni correction provided a very conservative 1.8% type I error rate. Considering the unadjusted p-values, 530 replicates had at least one SNP appearing to be associated with disease at an alpha level of 0.05, for a 53% type I error rate.
Conclusion
Clearly, correcting for type I error is important in candidate gene and genome-wide SNP studies. In contrast to the proposed use of principal components based on pairwise LD to correct for the number of effectively independent tests [1], we suggest using an LD block-based correction, based on the LD block structure empirically detected in the data. We showed the expected inflation of type I error rates using only nominal p-values, and the extremely conservative overcorrection induced by the traditional Bonferroni method. In general, our results show that the LD block-based corrections prevent type I error inflation without being overly conservative, presenting a compromise between the other approaches. Specifically, the Gabriel blocking algorithm consistently gave a ~3.4% type I error rate across moderate and high LD conditions, which is close to the desired 5% level. Although under moderate LD conditions both the 4GT and SSLD blocking methods gave slightly conservative type I error rates, under high LD conditions these methods were slightly liberal. The Nyholt method, as employed in this paper, was equivalent to a traditional Bonferroni correction under moderate LD conditions, although under high LD conditions it gave a type I error rate similar to that of the Gabriel method.
Like Schwartz et al. [5], we also found vast differences in definitions of haplotype blocks between blocking methods, with low levels of agreement about the number of independent SNPs between the 3 haplotype blocking methods. However, the range of these differences did not have a large effect on the type I error rates.
We believe the advantage to using the blocking algorithms instead of the Nyholt method is that the blocking methods are biologically meaningful and achieve type I error rates closer to the desired value over a range of LD levels.
In light of these results, several questions remain. First, recent variations on the Nyholt PC method have been proposed that may improve its performance for correction, and this improvement should be evaluated in comparison to the blocking algorithms. These extensions to the PC method allow for a lower LD threshold in determining the number of independent tests. Second, the thresholds for each of the blocking algorithms were set to default values. Variation of these may result in a type I error rate closer to the desired value. Third, all methods examined here relied on D' as the LD metric of interest. The use of r^{2} instead may improve all blocking methods, and this should be explored further. Finally, higher-order LD structure was still not considered in the choice for number of effectively independent tests. A correction that allows for both within-block and across-block correlation should further improve the proposed correction.
Abbreviations
 4GT: Four Gamete Test
 CI: Confidence interval
 COGA: Collaborative Study on the Genetics of Alcoholism
 FBAT: Family-based association test
 LD: Linkage disequilibrium
 PC: Principal components
 SNP: Single-nucleotide polymorphism
 SSLD: Solid spine of LD measure
Declarations
Acknowledgements
The authors are grateful to Priya Duggal for offering her merged dataset.
References
 1. Nyholt DR: A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other. Am J Hum Genet. 2004, 74: 765-769. 10.1086/383251.
 2. Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, Liu-Cordero SN, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander ES, Daly MJ, Altshuler D: The structure of haplotype blocks in the human genome. Science. 2002, 296: 2225-2229. 10.1126/science.1069424.
 3. Hudson R, Kaplan N: Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics. 1985, 111: 147-164.
 4. Patil N, Berno AJ, Hinds DA, Barrett WA, Doshi JM, Hacker CR, Kautzer CR, Lee DH, Marjoribanks C, McDonough DP, Nguyen BT, Norris MC, Sheehan JB, Shen N, Stern D, Stokowski RP, Thomas DJ, Trulson MO, Vyas KR, Frazer KA, Fodor SP, Cox DR: Blocks of limited haplotype diversity revealed by high resolution scanning of human chromosome 21. Science. 2001, 294: 1719-1723. 10.1126/science.1065573.
 5. Schwartz R, Halldorsson BV, Bafna V, Clark AG, Istrail S: Robustness of inference of haplotype block structure. J Comput Biol. 2003, 10: 13-19. 10.1089/106652703763255642.
 6. Barrett JC, Fry B, Maller J, Daly MJ: Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2004, 21: 263-265. 10.1093/bioinformatics/bth457.
 7. Stephens M, Smith NJ, Donnelly P: A new statistical method for haplotype reconstruction from population data. Am J Hum Genet. 2001, 68: 978-989. 10.1086/319501.
 8. Stephens M, Donnelly P: A comparison of Bayesian methods for haplotype reconstruction. Am J Hum Genet. 2003, 73: 1162-1169. 10.1086/379378.
 9. Horvath S, Xu X, Laird NM: The family based association test method: strategies for studying general genotype-phenotype associations. Eur J Hum Genet. 2001, 9: 301-306. 10.1038/sj.ejhg.5200625.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.