Volume 6 Supplement 1
Identification of tag single-nucleotide polymorphisms in regions with varying linkage disequilibrium
© Duggal et al; licensee BioMed Central Ltd 2005
Published: 30 December 2005
We compared seven different tagging single-nucleotide polymorphism (SNP) programs in 10 regions with varied amounts of linkage disequilibrium (LD) and physical distance. We used the Collaborative Studies on the Genetics of Alcoholism dataset, part of the Genetic Analysis Workshop 14. We show that in regions with moderate to strong LD these programs are relatively consistent, despite different parameters and methods. In addition, we compared the selected SNPs in a multipoint linkage analysis for one region with strong LD. As the number of selected SNPs increased, the LOD score, mean information content, and type I error also increased.
A variety of methods to identify haplotype tagging single-nucleotide polymorphisms (htSNPs) and tagging SNPs are currently available. These programs employ different algorithms to select SNPs, which may involve the identification of haplotypes, haplotype blocks, and regions of linkage disequilibrium (LD). However, a comprehensive comparison of these programs is lacking. We examined several tagging SNP selection programs using the Collaborative Studies on the Genetics of Alcoholism (COGA) dataset while varying the amount of LD (moderate to complete LD), the physical distance of the regions considered (68 kb to 435 kb), and the haplotype or minor allele frequency. For each program, we present a comparison of the number of tagging SNPs selected in 10 regions on 9 chromosomes and the percentage of agreement among the programs. Additionally, we examined the effects of the selected tagging SNPs on multipoint linkage analysis. Dense SNP panels are likely to result in increased inter-marker LD, which violates the assumption of linkage equilibrium between markers in multipoint linkage analysis. We compared the results of multipoint linkage analysis using all the SNPs in a region with LD against the results using only the tagging SNPs selected by the different programs.
Population and haplotype reconstruction
COGA is a 6-center collaborative study designed to identify loci for alcoholism and related disorders. These data were available as part of the Genetic Analysis Workshop 14 (GAW14). We restricted our analysis to one ethnicity to limit bias in allele frequencies, LD measurements, and haplotype reconstruction. All individuals classified as White, non-Hispanic (n = 1,074) were included. There were 102 pedigrees with a mean size of 10.5 (SD ± 5.1) and 332 founders. From the total founders we randomly ascertained one founder per pedigree (n = 102). From these, we randomly ascertained 30 founders, who were used for all subsequent analyses. Since some tag SNP programs require haplotypes, we used an expectation maximization (EM) algorithm as implemented in the program SNPHAP (v1.1) to reconstruct phase-unknown haplotypes from the 30 founder individuals. For each individual we used the haplotypes with the highest probability.
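The two-stage founder sampling described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `sample_founders`, the data layout (a dictionary from pedigree ID to founder IDs), the seed, and the toy identifiers are all hypothetical and are not taken from the COGA data or from the authors' scripts.

```python
import random


def sample_founders(founders_by_pedigree, n_final, seed=0):
    """Pick one founder per pedigree at random, then subsample n_final of them.

    founders_by_pedigree: dict mapping a pedigree ID to a list of founder IDs
    (hypothetical layout, chosen for illustration only).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Stage 1: one randomly chosen founder from each pedigree.
    one_per_pedigree = [
        rng.choice(sorted(members))
        for _, members in sorted(founders_by_pedigree.items())
    ]
    # Stage 2: a random subset of those founders, without replacement.
    return rng.sample(one_per_pedigree, n_final)


# Toy data: 5 pedigrees, subsample 3 founders (the study used 102 and 30).
toy = {
    "P1": ["P1-a", "P1-b"],
    "P2": ["P2-a"],
    "P3": ["P3-a", "P3-b", "P3-c"],
    "P4": ["P4-a"],
    "P5": ["P5-a", "P5-b"],
}
chosen = sample_founders(toy, n_final=3)
```

Because at most one founder is drawn per pedigree, the final sample contains unrelated individuals by construction.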
Physical distance map
Because the physical positions of SNPs from Illumina and Affymetrix were based on different assemblies of the human genome, we obtained updated physical locations for each SNP from dbSNP on NCBI Build 34 to generate an integrated, high-density map. For SNPs with multiple physical locations, we chose the position closest to the previous build. SNPs without physical positions were excluded (n = 322). The Illumina map (4,720 SNPs) and the Affymetrix map (10,798 SNPs) were then merged. This merged SNP map was used so that we would have definite regions of LD due to increased SNP density. There were 94 SNPs common to both maps. Genotyping data from Illumina for these 94 SNPs were used due to the lower overall missing rate (Illumina = 0.05%, Affymetrix = 5.25%). In total we had 15,424 SNPs across the whole genome.
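As a rough illustration of the merge step, the sketch below combines two SNP maps and keeps the Illumina record whenever a SNP appears on both platforms. The function name, the dictionary layout, and the toy rs numbers are assumptions made for the example; only the SNP counts in the final check come from the text.

```python
def merge_snp_maps(illumina, affymetrix):
    """Merge two {snp_id: record} maps, preferring Illumina for shared SNPs."""
    merged = dict(affymetrix)
    merged.update(illumina)  # Illumina records overwrite shared Affymetrix ones
    shared = set(illumina) & set(affymetrix)
    return merged, shared


# Toy maps keyed by hypothetical SNP IDs, with positions as the records.
illumina = {"rs1": 100, "rs2": 250}
affymetrix = {"rs2": 251, "rs3": 400}
merged, shared = merge_snp_maps(illumina, affymetrix)

# Consistency check on the counts reported above:
# Illumina + Affymetrix - SNPs common to both maps.
total_snps = 4720 + 10798 - 94
```

The count check reproduces the reported genome-wide total of 15,424 SNPs.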
SNP selection programs
We used 7 different SNP selection programs and then compared the overall percentage of agreement between the programs for the selected tag SNPs in 10 regions. These methods are very complex and each method cannot be fully explained here, but we encourage the reader to consult the referenced papers. We provide details of how we ran each program since there are many options in each program.
SNPTagger uses previously inferred haplotypes, which are sorted in descending order according to their frequencies (frequencies ≥1% are reported). Then all markers are ranked according to their diversity values in the included haplotypes, calculated by counting the number of major and minor allele appearances in each column/marker separately, and choosing whichever is smaller.
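The diversity value described above can be illustrated with a short sketch. It assumes biallelic markers, haplotypes encoded as equal-length strings, and an unweighted count over the listed haplotypes; these choices, and the function name `diversity_values`, are assumptions for the example and may differ from what SNPTagger does internally.

```python
from collections import Counter


def diversity_values(haplotypes):
    """Return, for each marker column, the smaller of the two allele counts.

    haplotypes: list of equal-length strings, one character per marker.
    A monomorphic column gets a diversity value of 0.
    """
    n_markers = len(haplotypes[0])
    values = []
    for j in range(n_markers):
        counts = Counter(h[j] for h in haplotypes)
        # With two alleles present, the smaller count is the minor-allele count.
        values.append(min(counts.values()) if len(counts) == 2 else 0)
    return values


haplos = ["1112", "1212", "2112", "2211"]
values = diversity_values(haplos)
# Rank marker indices from most to least diverse (ties keep column order).
ranking = sorted(range(len(values)), key=lambda j: -values[j])
```

Markers with higher values split the haplotypes more evenly and are ranked first.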
Tag SNP uses a multi-step EM algorithm that begins with the calculation of the haplotype dosage, δh(H): the number of copies (0, 1, or 2) of a specific haplotype h contained in the true pair of haplotypes for each individual, conditional on the individual's genotype data and taken over all ordered haplotype pairs. For each candidate subset of SNPs, the squared correlation between the true and predicted haplotype dosage (R2h) is calculated. The lowest haplotype frequency was set to 0.1%, and we selected the set of SNPs beyond which adding further SNPs did not improve R2h.
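To make the notion of haplotype dosage concrete, the sketch below computes the expected number of copies of a haplotype for one individual, given posterior probabilities of the haplotype pairs compatible with that individual's genotypes. The function name, the input layout, and the toy probabilities are assumptions for illustration; the sketch covers only the dosage, not the EM estimation or the R2h calculation.

```python
def expected_dosage(target, pair_probs):
    """Expected copies (0 to 2) of haplotype `target` for one individual.

    pair_probs: dict mapping a haplotype pair (h1, h2) to its probability
    given the individual's genotypes (probabilities should sum to 1).
    """
    return sum(
        prob * ((h1 == target) + (h2 == target))
        for (h1, h2), prob in pair_probs.items()
    )


# A double heterozygote with two possible phasings (toy probabilities).
ambiguous = {("AB", "ab"): 0.75, ("Ab", "aB"): 0.25}
dosage_AB = expected_dosage("AB", ambiguous)
dosage_Ab = expected_dosage("Ab", ambiguous)

# A homozygote carries two copies with certainty.
dosage_hom = expected_dosage("AB", {("AB", "AB"): 1.0})
```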
Chapman/HTSNP is a set of programs [6, 7] that can be run within the statistical software STATA (v8) to identify a minimal set of tag SNPs using different criteria, including percent diversity explained (PDE) and R2. PDE is an index that measures the proportion of total haplotype diversity retained if only the htSNPs are used. R2 is a variant of the coefficient of determination, that is, the percentage of variance explained by regression. We used the exhaustive search algorithm htsearch to find the minimal set of tag SNPs that satisfied both the percent diversity criterion (PDE > 0.98) and the R2 criterion (> 0.98). All analyses used a minor allele frequency (MAF) > 0.1%.
Nyholt's method uses spectral decomposition (SpD) of the matrix of pair-wise SNP-SNP correlations. The eigenvalues (λ) measure the variance captured by each component, and the higher the correlation among a set of SNPs, the greater the leading λ values. The program examines the factor loadings for each eigenvalue, after an orthogonal rotation, to determine which SNP best captures the information for each set of SNPs. The Meff option identifies the minimum subset of SNPs that maximizes the information of the SNP group.
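As background for the Meff option, Nyholt's paper defines the effective number of independent SNPs as Meff = 1 + (M − 1)(1 − Var(λ)/M), where M is the number of SNPs and Var(λ) is the variance of the eigenvalues of the correlation matrix. The sketch below evaluates that formula without an eigensolver, using the fact that for a symmetric correlation matrix the eigenvalues sum to M and their squares sum to the sum of squared matrix entries. The function name is hypothetical, and this is not the SNPSpD program itself; it does not perform the factor-loading step that picks individual SNPs.

```python
def meff_nyholt(corr):
    """Effective number of independent SNPs from a correlation matrix.

    corr: square, symmetric list-of-lists with ones on the diagonal.
    """
    m = len(corr)
    if m == 1:
        return 1.0
    # Sum of squared entries equals the sum of squared eigenvalues.
    sum_sq = sum(r * r for row in corr for r in row)
    # Sample variance of the eigenvalues; their mean is exactly 1.
    var_lambda = (sum_sq - m) / (m - 1)
    return 1 + (m - 1) * (1 - var_lambda / m)


independent = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
complete_ld = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
meff_independent = meff_nyholt(independent)
meff_complete = meff_nyholt(complete_ld)
```

The two toy matrices show the limiting cases: uncorrelated SNPs give Meff equal to M, and SNPs in complete LD give Meff equal to 1.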
The Meng method applies SpD to a matrix of pair-wise LD (R2), calculating its eigenvalues and then applying a varimax rotation to the original set of eigenvectors. The rotation is an orthogonal transformation that makes it possible to assess the influence of each SNP on each eigenvector. To determine the number of most influential SNPs contributing to the region, the proportion of variance explained was set to 90%. This high proportion was selected because the number of founders we used is lower than that suggested by the authors. We implemented this method in the statistical program R.
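The 90% rule can be illustrated with a small sketch that finds how many leading eigenvalues are needed to reach a given proportion of the total variance. The function name and the toy eigenvalues are assumptions for the example; the sketch omits the varimax rotation and the choice of which SNP represents each component, and it is written in Python rather than the R used in the study.

```python
def components_needed(eigenvalues, threshold=0.90):
    """Smallest number of leading eigenvalues whose share reaches threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, lam in enumerate(ordered, start=1):
        running += lam
        if running / total >= threshold:
            return k
    return len(ordered)


# Toy eigenvalues: cumulative shares are 0.64, 0.84, 0.94, 1.00.
k_90 = components_needed([3.2, 1.0, 0.5, 0.3])
k_80 = components_needed([3.2, 1.0, 0.5, 0.3], threshold=0.80)
```

Raising the threshold, as was done here because of the small founder sample, can only keep the same number of components or increase it.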
We used HAPLOVIEW v2.05 to create blocks using the Gabriel et al. algorithm. This algorithm uses the 95% confidence intervals (CIs) of pair-wise D' values to designate 2 SNPs as being in strong LD. The CI minima for 2 SNPs in strong LD are 0.98 (upper bound) and 0.70 (lower bound). A block is defined as a region over which 95% of informative comparisons are in strong LD. All markers with MAF < 5% were excluded and the minimum haplotype frequency was set to 0.1%. An accelerated EM algorithm, similar to that of Qin et al., estimates haplotypes. All within-block SNPs are then ranked in order of genotyping success rate, and those SNPs that capture all haplotypes within a block are chosen as htSNPs.
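The block rule above can be sketched in a few lines. The thresholds come from the text; the function names, the use of inclusive comparisons, and the assumption that the caller has already restricted the input to informative SNP pairs are illustrative choices, not a description of the Haploview source code.

```python
def is_strong_ld(ci_lower, ci_upper):
    """True if the 95% CI of D' meets the strong-LD minima (0.70 and 0.98)."""
    return ci_upper >= 0.98 and ci_lower >= 0.70


def is_block(informative_pairs, min_fraction=0.95):
    """True if enough informative pairs in a region are in strong LD.

    informative_pairs: list of (ci_lower, ci_upper) tuples, already limited
    to informative comparisons (assumed to be decided upstream).
    """
    if not informative_pairs:
        return False
    strong = sum(1 for lo, hi in informative_pairs if is_strong_ld(lo, hi))
    return strong / len(informative_pairs) >= min_fraction


all_strong = [(0.80, 0.99)] * 20
two_weak = [(0.80, 0.99)] * 18 + [(0.40, 0.85)] * 2
block_a = is_block(all_strong)
block_b = is_block(two_weak)
```

In the second toy region only 90% of the comparisons are in strong LD, so it does not qualify as a block.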
We used HaploBlock Finder (v0.7), which uses the haplotype block definition proposed by Patil et al. and the dynamic programming algorithm of Zhang et al. to find the optimal block partition and tag SNPs. For a chosen threshold α, a region is defined as a block if at least α percent of haplotypes are represented more than once [13, 14]. For this analysis we report the results from 95% chromosomal coverage, MAF > 0.1%, and the default of 0.90 for htSNP coverage.
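The α criterion for a single candidate region can be sketched as below. The function name, the string encoding of haplotypes, and the toy data are assumptions for illustration; the sketch checks one region only and does not implement the dynamic programming search over block partitions.

```python
from collections import Counter


def is_coverage_block(haplotypes, alpha):
    """True if at least a fraction alpha of the sampled haplotypes belong to
    a haplotype pattern that is observed more than once in the region."""
    counts = Counter(haplotypes)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(haplotypes) >= alpha


# Ten sampled haplotypes: two common patterns and one singleton.
region = ["AAG"] * 5 + ["ACG"] * 4 + ["TCG"]
ok_at_80 = is_coverage_block(region, alpha=0.80)
ok_at_95 = is_coverage_block(region, alpha=0.95)
```

Here 9 of the 10 haplotypes fall in repeated patterns, so the region passes at α = 0.80 but not at α = 0.95.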
Table: The percent agreement between selected SNPs across tag-SNP programs. (Table body not preserved; the only surviving column header is "HB Finder".)
Table: Multipoint linkage analysis ordered by number of tagSNPs for the chromosome 21 region. Columns: number of SNPs; LOD score (p-value), simulated trait (chromosome 21), rs2835626; LOD score (p-value), simulated trait (chromosome 20), tsc0041859; mean information content. (Table body not preserved.)
We performed a comprehensive comparison of different tagging SNP programs to determine whether the amount of LD or the size of the region influenced the selection of tagging SNPs. Overall, HaploBlock Finder and the SpD methods tended to choose more SNPs, whereas the methods based on diversity or R2 measurements tended to choose fewer. However, there was consistency among the programs, which suggests that in regions with moderate to complete LD these programs perform similarly despite different parameters and/or algorithms. Additionally, all of the tag SNP sets performed well in our multipoint linkage analysis of a single region on chromosome 21 despite vastly different numbers of total SNPs used. This region had very strong LD, and although each program reduced the number of SNPs in the region by picking a subset as tag SNPs, there was still residual LD among the selected SNPs, which would still violate the assumption of linkage equilibrium between markers in multipoint linkage analysis. Therefore, we do not suggest relying on these methods alone to remove LD prior to linkage analysis. Although it is difficult to reach a conclusion from one replicate, our study suggests that increased SNP density may improve the power to detect linkage but may also increase the associated type I error.
COGA: Collaborative Studies on the Genetics of Alcoholism
GAW14: Genetic Analysis Workshop 14
MAF: Minor allele frequency
PDE: Percent diversity explained
- Reich T, Edenberg HJ, Goate A, Williams JT, Rice JP, Van Eerdewegh P, Foroud T, Hesselbrock V, Schuckit MA, Bucholz K, Porjesz B, Li TK, Conneally PM, Nurnberger JI, Tischfield JA, Crowe RR, Cloninger CR, Wu W, Shears S, Carr K, Crose C, Willig C, Begleiter H: Genome-wide search for genes affecting the risk for alcohol dependence. Am J Med Genet. 1998, 81: 207-215. 10.1002/(SICI)1096-8628(19980508)81:3<207::AID-AJMG1>3.0.CO;2-T.
- SNPHAP. [http://www-gene.cimr.cam.ac.uk/clayton/software/snphap.txt]
- SNP Tagger. [http://www.well.ox.ac.uk/~xiayi/haplotype/index.html]
- Ke X, Cardon LR: Efficient selective screening of haplotype tag SNPs. Bioinformatics. 2003, 19: 287-288. 10.1093/bioinformatics/19.2.287.
- Stram DO, Haiman CA, Hirschhorn JN, Altshuler D, Kolonel LN, Henderson BE, Pike MC: Choosing haplotype-tagging SNPS based on unphased genotype data using a preliminary sample of unrelated subjects with an example from the Multiethnic Cohort Study. Hum Hered. 2003, 55: 27-36. 10.1159/000071807.
- HTSNP, STATA program. [http://www-gene.cimr.cam.ac.uk/clayton/software/stata]
- Chapman JM, Cooper JD, Todd JA, Clayton DG: Detecting disease associations due to linkage disequilibrium using haplotype tags: a class of tests and the determinants of statistical power. Hum Hered. 2003, 56: 18-31. 10.1159/000073729.
- Nyholt DR: A simple correction for multiple testing for single-nucleotide polymorphism in linkage disequilibrium with each other. Am J Hum Genet. 2004, 74: 765-769. 10.1086/383251.
- Meng Z, Zaykin DV, Xu CF, Wagner M, Ehm MG: Selection of genetic markers for association analyses, using linkage disequilibrium and haplotypes. Am J Hum Genet. 2003, 73: 115-130. 10.1086/376561.
- Haploview. [http://www.broad.mit.edu/personal/jcbarret/haploview/index.php]
- Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, Liu-Cordero SN, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander ES, Daly MJ, Altshuler D: The structure of haplotype blocks in the human genome. Science. 2002, 296: 2225-2229. 10.1126/science.1069424.
- Qin ZS, Niu T, Liu JS: Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms. Am J Hum Genet. 2002, 71: 1242-1247. 10.1086/344207.
- Patil N, Berno AJ, Hinds DA, Barrett WA, Doshi JM, Hacker CR, Kautzer CR, Lee DH, Marjoribanks C, McDonough DP, Nguyen BT, Norris MC, Sheehan JB, Shen N, Stern D, Stokowski RP, Thomas DJ, Trulson MO, Vyas KR, Frazer KA, Fodor SP, Cox DR: Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science. 2001, 294: 1719-1723. 10.1126/science.1065573.
- Zhang K, Deng M, Chen T, Waterman MS, Sun F: A dynamic programming algorithm for haplotype block partitioning. Proc Natl Acad Sci. 2002, 99: 7335-7339. 10.1073/pnas.102186799.
- Abecasis GR, Cherny SS, Cookson WO, Cardon LR: Merlin – rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2002, 30: 97-101. 10.1038/ng786.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.